Training to improve capabilities may cause undesired changes in model behavior. For example, training models on oversight protocols or safety research could be useful, but such data carries misgeneralization risks: training on reward-hacking documents may induce reward hacking, and Claude 4's model card noted that training on AI safety data degraded alignment.
The Emergent Misalignment (EM) work showed that fine-tuning a model only on insecure code can push it into producing wildly misaligned outputs. We worked to find methods to prevent this, inspired by gradient routing.
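As a rough illustration of the gradient-routing idea we drew on, here is a minimal parameter-level sketch in PyTorch: batches flagged as risky (e.g. insecure code) update only a designated subset of parameters, so any misgeneralization they induce stays localized. The toy model, the choice of routed parameters, and the `is_risky` flag are illustrative assumptions, not the exact setup from the gradient routing paper or from our experiments.

```python
# Minimal, parameter-level sketch of the gradient-routing idea (illustrative
# assumptions throughout): "risky" batches update only a designated parameter
# subset, so any misgeneralization they cause is confined to that subset.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # plain SGD: zero grad => zero update
loss_fn = nn.CrossEntropyLoss()

# Hypothetical routing choice: risky-data gradients flow only into the last layer.
routed_param_ids = {id(p) for p in model[-1].parameters()}

def train_step(x, y, is_risky: bool) -> float:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    if is_risky:
        # Route the update: zero gradients everywhere except the designated subset.
        for p in model.parameters():
            if id(p) not in routed_param_ids and p.grad is not None:
                p.grad.zero_()
    optimizer.step()
    return loss.item()

# Toy usage; in practice the is_risky flag would come from a dataset label.
x, y = torch.randn(8, 64), torch.randint(0, 10, (8,))
train_step(x, y, is_risky=True)
```

The published gradient-routing method, as we understand it, applies data-dependent gradient masks to chosen network regions during backpropagation rather than zeroing whole parameter gradients after the fact, but the localization principle is the same.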