Containing Spillover with Minimal Supervision

Abstract

Narrow fine-tuning can cause language models to become broadly misaligned. Is this preventable? A satisfactory solution should not require comprehensive alignment data, because comprehensive data may be difficult or impossible to obtain. We formulate this problem in terms of spillover: the effect that training on one context has on a model's predictions across a wide range of other contexts. We demonstrate spillover and show that it cannot be prevented simply by fine-tuning on limited alignment data. To address this challenge, we explore a variety of ways to control spillover using limited training data, including regularization, continual learning methods, and previously proposed methods for controlling misgeneralization. We find multiple strategies that improve generalization from limited alignment training data, including a novel method called Steering Weights. Our work is a step towards fine-grained control over how models generalize from their supervision, which may enable safer development of superintelligent AI systems.

Date
May 21, 2025
Matthew A. Clarke

Research Scientist at the AI Security Institute and Associate Staff Visitor at UCL.
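
As a rough illustration of the spillover framing above, one way to quantify it is to measure how much a model's predictions on held-out contexts shift after narrow fine-tuning. The sketch below is a minimal, generic example, not the paper's protocol: it assumes PyTorch and Hugging Face `transformers`, and the model name, prompts, and KL-based metric are placeholders.

```python
# Minimal sketch: measure "spillover" as the mean KL divergence between a
# base model's and a fine-tuned model's next-token distributions on held-out
# contexts. All names and prompts below are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def spillover_kl(base_model, tuned_model, tokenizer, contexts, device="cpu"):
    """Mean KL(base || tuned) over next-token distributions on held-out contexts."""
    base_model.eval()
    tuned_model.eval()
    kls = []
    with torch.no_grad():
        for text in contexts:
            ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
            # Next-token log-probabilities from each model at the final position.
            log_p = F.log_softmax(base_model(ids).logits[:, -1], dim=-1)
            log_q = F.log_softmax(tuned_model(ids).logits[:, -1], dim=-1)
            # KL(p || q) = sum_v p(v) * (log p(v) - log q(v))
            kls.append(torch.sum(log_p.exp() * (log_p - log_q)).item())
    return sum(kls) / len(kls)

# Usage: compare a checkpoint before and after narrow fine-tuning.
tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")
tuned = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the fine-tuned checkpoint
held_out = ["The capital of France is", "To stay safe online, you should"]
print(spillover_kl(base, tuned, tok, held_out))
```

A low score on contexts unrelated to the fine-tuning data would indicate contained spillover; a high score would indicate that narrow training has shifted the model's behavior broadly.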