Containing Spillover with Minimal Supervision

Abstract

Narrow fine-tuning can cause language models to become broadly misaligned. Is this preventable? A satisfactory solution should not require comprehensive alignment data, because comprehensive data may be difficult or impossible to obtain. We formulate this problem in terms of spillover: the effect that training on one context has on a model's predictions across a wide range of other contexts. We demonstrate spillover and show that it cannot be prevented simply by fine-tuning on limited alignment data. To address this challenge, we explore a variety of ways to control spillover using limited training data, including regularization, continual learning methods, and previously proposed methods for controlling misgeneralization. We find multiple strategies that improve generalization from limited alignment training data, including a novel method called Steering Weights. Our work is a step towards fine-grained control over how models generalize from their supervision, which may enable safer development of superintelligent AI systems.

Date
May 21, 2025
Matthew A. Clarke

Research Scientist at the AI Security Institute and Associate Staff Visitor at UCL.
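
As a rough illustration of the spillover framing above, one way to quantify it is to measure how much a model's predictions on held-out contexts shift after narrow fine-tuning. The sketch below is a minimal, generic example, not the paper's protocol: it assumes PyTorch and Hugging Face `transformers`, and the model name, prompts, and KL-based metric are placeholders.

```python
# Minimal sketch: measure "spillover" as the mean KL divergence between a
# base model's and a fine-tuned model's next-token distributions on held-out
# contexts. All names and prompts below are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def spillover_kl(base_model, tuned_model, tokenizer, contexts, device="cpu"):
    """Mean KL(base || tuned) over next-token distributions on held-out contexts."""
    base_model.eval()
    tuned_model.eval()
    kls = []
    with torch.no_grad():
        for text in contexts:
            ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
            # Next-token log-probabilities from each model at the final position.
            log_p = F.log_softmax(base_model(ids).logits[:, -1], dim=-1)
            log_q = F.log_softmax(tuned_model(ids).logits[:, -1], dim=-1)
            # KL(p || q) = sum_v p(v) * (log p(v) - log q(v))
            kls.append(torch.sum(log_p.exp() * (log_p - log_q)).item())
    return sum(kls) / len(kls)

# Usage: compare a checkpoint before and after narrow fine-tuning.
tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")
tuned = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the fine-tuned checkpoint
held_out = ["The capital of France is", "To stay safe online, you should"]
print(spillover_kl(base, tuned, tok, held_out))
```

A low score on contexts unrelated to the fine-tuning data would indicate contained spillover; a high score would indicate that narrow training has shifted the model's behavior broadly.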