Matthew A. Clarke
SAE
Features that Fire Together Wire Together: Examining Co-occurrence of SAE Features
Sparse autoencoders (SAEs) aim to decompose activation patterns in neural networks into interpretable features. This promises to let us …
Oct 11, 2024
Matthew A. Clarke, Joseph Bloom
Project · Video
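For readers unfamiliar with the setup, here is a minimal sketch of the decomposition an SAE performs, illustrative only and not the code behind this work: the latent width, initialisation scale, and L1 coefficient below are assumptions, not values from the project.

```python
# Minimal sparse-autoencoder sketch (illustration, not the authors' code).
# It decomposes model activations x into sparse latent features f, then
# reconstructs x_hat, trained with an L1 sparsity penalty on f.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_latent) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_latent))
        self.W_dec = nn.Parameter(torch.randn(d_latent, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        f = torch.relu(x @ self.W_enc + self.b_enc)  # sparse latent activations
        x_hat = f @ self.W_dec + self.b_dec          # reconstruction of x
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus L1 penalty: few latents should fire per input.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
```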
SAE Latent Co-occurrence
Sparse autoencoder (SAE) latents show promise as a method for extracting interpretable features from large language models (LMs), but their overall utility as the foundation of a mechanistic understanding of LMs remains unclear. Ideal features would be linear and independent, but we show that some SAE latents in GPT-2 Small display non-independent behaviour, especially in small SAEs. We are investigating:
1. What fraction of SAE latents might best be understood in groups rather than individually?
2. Do low-dimensional subspaces mapped by co-occurring groups of features ever provide a better unit of analysis (or possibly intervention) than individual SAE latents?
Matthew A. Clarke, Hardik Bhatnagar, Joseph Bloom
Code · Video · Cite
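One way to make the co-occurrence question concrete is to ask how often pairs of latents fire on the same token. The sketch below is illustrative only; the activation threshold and the choice of Jaccard similarity are assumptions, not the project's exact method.

```python
# Sketch of measuring SAE latent co-occurrence (illustrative assumptions:
# "active" means activation > threshold; similarity is Jaccard).
import numpy as np

def cooccurrence_jaccard(latent_acts: np.ndarray, threshold: float = 0.0):
    """latent_acts: (n_tokens, n_latents) SAE latent activations.
    Returns an (n_latents, n_latents) Jaccard matrix: how often two
    latents fire on the same token, relative to how often either fires."""
    active = (latent_acts > threshold).astype(np.float64)  # binarise firing
    both = active.T @ active          # tokens where both latents fire
    counts = active.sum(axis=0)       # firing count per latent
    either = counts[:, None] + counts[None, :] - both
    return np.divide(both, either, out=np.zeros_like(both), where=either > 0)
```

Groups of latents with high pairwise similarity under a measure like this are the kind of co-occurring clusters, and associated low-dimensional subspaces, that the project studies.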