Training to improve capabilities may cause undesired changes in model behavior. For example, training models on oversight protocols or safety research could be useful, but such data carries misgeneralization risks: training on reward-hacking documents may induce reward hacking, and Claude 4's model card noted that training on AI safety data degraded alignment.
The Emergent Misalignment (EM) work showed that fine-tuning a model only on insecure code can push it into producing wildly misaligned outputs. We worked to find methods to prevent this, inspired by gradient routing.
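As a rough illustration of the gradient-routing idea we drew on, here is a minimal parameter-level sketch in PyTorch: batches flagged as risky (e.g. insecure code) update only a designated subset of parameters, so any misgeneralization they induce stays localized. The toy model, the choice of routed parameters, and the `is_risky` flag are illustrative assumptions, not the exact setup from the gradient routing paper or from our experiments.

```python
# Minimal, parameter-level sketch of the gradient-routing idea (illustrative
# assumptions throughout): "risky" batches update only a designated parameter
# subset, so any misgeneralization they cause is confined to that subset.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # plain SGD: zero grad => zero update
loss_fn = nn.CrossEntropyLoss()

# Hypothetical routing choice: risky-data gradients flow only into the last layer.
routed_param_ids = {id(p) for p in model[-1].parameters()}

def train_step(x, y, is_risky: bool) -> float:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    if is_risky:
        # Route the update: zero gradients everywhere except the designated subset.
        for p in model.parameters():
            if id(p) not in routed_param_ids and p.grad is not None:
                p.grad.zero_()
    optimizer.step()
    return loss.item()

# Toy usage; in practice the is_risky flag would come from a dataset label.
x, y = torch.randn(8, 64), torch.randint(0, 10, (8,))
train_step(x, y, is_risky=True)
```

The published gradient-routing method, as we understand it, applies data-dependent gradient masks to chosen network regions during backpropagation rather than zeroing whole parameter gradients after the fact, but the localization principle is the same.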