Training to improve capabilities can cause undesired changes in model behavior. For example, training models on oversight protocols or safety research could be useful, but such data carries misgeneralization risks: training on reward-hacking documents may induce reward hacking, and Claude 4's model card noted that training on AI safety data degraded alignment. The Emergent Misalignment (EM) results showed that fine-tuning on nothing but insecure code can push models into producing wildly misaligned outputs. We demonstrate a consistent tradeoff between capabilities and alignment, highlighting the need for better methods of mitigating it. Merely mixing alignment data into the training set is insufficient to prevent misalignment, yet a simple KL-divergence penalty on alignment data outperforms more sophisticated methods.
Image credit: Azarbal, Clarke, Cocolla & Factor et al., 2025
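
As a rough illustration of what a KL-divergence penalty on alignment data can look like (a minimal sketch, not the exact setup used here): on alignment batches, penalize the divergence between the fine-tuned model's token distributions and those of a frozen copy of the original model, while training normally on capability data. The model name, `kl_weight`, and batch structure below are placeholder assumptions.

```python
# Hypothetical sketch: capability fine-tuning plus a KL penalty on alignment data.
# Assumes Hugging Face transformers + PyTorch; names and hyperparameters are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the post does not specify one
model = AutoModelForCausalLM.from_pretrained(model_name)      # trainable copy
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference
ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad_(False)

tokenizer = AutoTokenizer.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
kl_weight = 0.1  # assumed penalty strength

def training_step(capability_batch, alignment_batch):
    """One update: next-token loss on capability data + KL penalty on alignment data."""
    # Standard language-modeling loss on the capability-training batch.
    cap_out = model(**capability_batch, labels=capability_batch["input_ids"])
    cap_loss = cap_out.loss

    # KL(reference || current) over per-token distributions on alignment data,
    # pulling the fine-tuned model back toward the original policy there.
    cur_logits = model(**alignment_batch).logits
    with torch.no_grad():
        ref_logits = ref_model(**alignment_batch).logits
    kl = F.kl_div(
        F.log_softmax(cur_logits, dim=-1),   # input: log-probs of current model
        F.log_softmax(ref_logits, dim=-1),   # target: log-probs of frozen reference
        log_target=True,
        reduction="batchmean",
    )

    loss = cap_loss + kl_weight * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return cap_loss.item(), kl.item()
```

The design intuition is that the penalty is applied only where alignment-relevant behavior lives, so the model remains free to move on capability data while being anchored to its original distribution on alignment data; sweeping `kl_weight` trades off the two objectives.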