AI Safety

Containing Spillover with Minimal Supervision

Winner of 3rd place in the SPAR Symposium 2024 poster competition.

May 21, 2025

Ariana Azarbal, Matthew A. Clarke, Jorio Cocola, Cailley Factor, Alex Cloud

SAE Latent Co-occurrence

Sparse autoencoder (SAE) latents show promise as a method for extracting interpretable features from language models (LMs), but their overall utility as the foundation of a mechanistic understanding of LMs remains unclear. Ideal features would be linear and independent, but we show that some SAE latents in GPT2-Small display non-independent behaviour, especially in small SAEs. We investigate: (1) What fraction of SAE latents might best be understood in groups rather than individually? (2) Do low-dimensional subspaces spanned by co-occurring groups of latents ever provide a better unit of analysis (or possibly intervention) than individual SAE latents?

Matthew A. Clarke, Hardik Bhatnagar, Joseph Bloom
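
To make "co-occurrence" concrete, here is a minimal sketch of one way to score how often pairs of latents fire on the same tokens. The Jaccard measure, the function name `jaccard_cooccurrence`, the firing threshold, and the toy activation matrix are all assumptions made for illustration, not necessarily the measure or data used in this project.

```python
import numpy as np

def jaccard_cooccurrence(acts, threshold=0.0):
    """Pairwise Jaccard similarity between latent firing patterns.

    acts: array of shape (n_tokens, n_latents) holding SAE latent activations.
    A latent counts as "firing" on a token when its activation exceeds
    `threshold` (a hypothetical choice for this sketch).
    """
    active = (acts > threshold).astype(np.int64)   # binary firing matrix
    both = active.T @ active                       # tokens where both latents fire
    counts = np.diag(both)                         # tokens where each latent fires
    either = counts[:, None] + counts[None, :] - both
    with np.errstate(divide="ignore", invalid="ignore"):
        # Pairs that never fire get similarity 0 instead of NaN.
        return np.where(either > 0, both / either, 0.0)

# Toy data: latents 0 and 1 always fire together, latent 2 fires alone.
toy_acts = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 0.5, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 0.0, 1.0],
])
jaccard = jaccard_cooccurrence(toy_acts)
```

In this toy case, latents 0 and 1 score as perfectly co-occurring while latent 2 is unrelated to both, which is the kind of grouping the questions above are about.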


Gradient Routing

Can we control where learning happens in neural networks? Gradient routing addresses this by applying masks that limit the flow of gradients during backpropagation. By supplying different masks for different data points, the user can induce specialized subcomponents within a model. I worked on applying this method to AI safety research, focusing on the problem of emergent misalignment during continual learning.

Matthew A. Clarke
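
The masking idea in the gradient routing description can be sketched on a toy model. This is a minimal NumPy illustration with hand-written gradients, not the implementation used in the project: the function `routed_sgd_step`, the two-unit linear model, and the masks `route_a` and `route_b` are hypothetical names chosen for the example.

```python
import numpy as np

def routed_sgd_step(W, x, target, mask, lr=0.1):
    """One SGD step on loss = (sum(W @ x) - target)**2 with a gradient mask.

    mask has one entry per hidden unit (row of W): 0 blocks the gradient
    flowing back through that unit, 1 lets it pass. The forward pass is
    left unchanged; only the backward pass is masked.
    """
    h = W @ x                        # hidden activations, shape (n_hidden,)
    y = h.sum()                      # scalar output
    dL_dy = 2.0 * (y - target)       # derivative of the squared error
    dL_dh = dL_dy * np.ones_like(h)  # gradient reaching each hidden unit
    dL_dh = mask * dL_dh             # routing: zero out the blocked units
    dL_dW = np.outer(dL_dh, x)       # chain rule through h = W @ x
    return W - lr * dL_dW

# Two hypothetical data sources, each routed to its own hidden unit.
route_a = np.array([1.0, 0.0])       # examples of type A update unit 0 only
route_b = np.array([0.0, 1.0])       # examples of type B update unit 1 only

W0 = np.zeros((2, 3))
x = np.array([1.0, 2.0, 3.0])
W1 = routed_sgd_step(W0, x, target=1.0, mask=route_a)
W2 = routed_sgd_step(W1, x, target=-1.0, mask=route_b)
```

After the first step only row 0 of the weights has moved, and the second step changes only row 1, so each data source ends up shaping its own subcomponent. In an autograd framework the same effect is usually obtained by stopping gradients on the masked-out part of an activation instead of editing gradients by hand.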