Emergent Misalignment

Matthew A. Clarke

Mar 28, 2024

Can we control where learning happens in neural networks? Gradient routing addresses this by applying masks to limit the flow of gradients during backpropagation. By supplying different masks for different data points, the user can induce specialized subcomponents within a model. As part of SPAR, I worked on applying this method to AI safety research, focussing on addressing the problem of emergent misalignment during continual learning.

Image credit: Adapted from Cloud et al., 2024. CC-BY 4.0