Features that Fire Together Wire Together: Examining Co-occurrence of SAE Features

Abstract

Sparse autoencoders (SAEs) aim to decompose activation patterns in neural networks into interpretable features. This promises to let us see what a model is ‘thinking’, and so facilitates understanding, detecting, and possibly correcting dangerous behaviours. SAE features are easiest to interpret if they are independent. However, we find that even in large SAEs, features co-occur more often than chance would predict. We investigate whether understanding these co-occurrences is necessary for understanding model function, or whether features remain independently interpretable. We find that co-occurrence is rarer in larger SAEs, and becomes less relevant, likely due to feature splitting. Where co-occurrence does occur, the features involved jointly map interpretable subspaces, e.g. days of the week, or position within a URL. However, only a subset of these can be interpreted by examining features independently, suggesting that features that fire together may need to be understood as a group to best interpret network behaviour.
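One way to make "co-occur more often than chance" concrete is to compare observed pairwise co-firing rates against the rates expected if features fired independently. The sketch below is illustrative only, not the paper's method: the activation matrix `acts`, the firing threshold, and all sizes are hypothetical stand-ins for real SAE feature activations.

```python
# Hypothetical sketch: how often do pairs of binary SAE features fire
# together, versus what independence would predict?
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_features = 10_000, 8
# Stand-in for SAE feature activations: sparse binary firing per token.
acts = rng.random((n_tokens, n_features)) < 0.05

fired = acts.astype(float)
p = fired.mean(axis=0)                  # marginal firing rate of each feature
joint = (fired.T @ fired) / n_tokens    # observed pairwise co-firing rates
expected = np.outer(p, p)               # co-firing rates under independence

# Ratio > 1 means a pair fires together more often than chance predicts.
ratio = joint / np.maximum(expected, 1e-12)
np.fill_diagonal(ratio, 1.0)            # self-co-occurrence is uninformative
print(ratio.round(2))
```

On this synthetic independent data the ratios hover near 1; the paper's finding is that real SAE features produce pairs with ratios well above 1.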

Date
Oct 11, 2024
Matthew A. Clarke
Research Fellow

Research Fellow at PIBBSS (Principles of Intelligent Behavior in Biological and Social Systems) working on mechanistic interpretability of large language models with sparse autoencoders.