Transformer Circuits (Anthropic) - Review

This post reviews Anthropic’s paper “A Mathematical Framework for Transformer Circuits.” The authors give a step-by-step way to think about attention heads in simplified transformer models (attention-only, with zero, one, or two layers), especially how their internal mechanics can be described mathematically. It’s a dense paper, but I picked up a few high-level ideas I hadn’t seen before.

Key concepts

1. Attention heads as independent operations

Attention heads are usually presented as concatenating their outputs and then multiplying by a large projection matrix (W_O). The authors give a more intuitive, though mathematically equivalent, interpretation: attention heads independently compute results and add them directly to the residual stream. This view makes it easier to reason about a single head, even though the implementation still uses concatenation for efficiency.
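To convince myself the two views really are the same computation, I wrote a minimal NumPy sketch. The shapes, the random inputs, and names like head_results and W_O_per_head are my own assumptions for illustration, not anything taken from the paper’s code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, d_model, n_tokens = 4, 8, 32, 5

# Per-head results (attention pattern already applied): (n_heads, n_tokens, d_head).
head_results = rng.normal(size=(n_heads, n_tokens, d_head))
# The usual output projection, mapping concatenated heads back to d_model.
W_O = rng.normal(size=(n_heads * d_head, d_model))

# Standard view: concatenate along the feature axis, then project once.
concatenated = np.concatenate(list(head_results), axis=-1)
out_concat = concatenated @ W_O

# Circuits view: slice W_O into one block per head; each head writes its own
# term, and the terms are simply added together.
W_O_per_head = W_O.reshape(n_heads, d_head, d_model)
out_sum = sum(head_results[h] @ W_O_per_head[h] for h in range(n_heads))

print(np.allclose(out_concat, out_sum))
```

The check passes because multiplying a concatenated vector by a stacked matrix is the same as summing the block-wise products.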

2. Residual stream as a communication channel

The residual stream is a shared space where layers communicate intermediate results. Each attention or MLP layer “reads” from it through a linear projection and “writes” back by adding its output. As a result, the stream can carry whatever earlier layers put there, such as token identity, syntactic role, semantic relationships, and intermediate predictions. The shared-whiteboard analogy helped me.

3. Virtual weights and composing layers

Each transformer layer reads from and writes to the residual stream through linear maps, even though the layer itself is not linear. When a later layer reads what an earlier layer wrote, the earlier layer’s output weights and the later layer’s input weights can be multiplied out into “virtual weights,” which directly connect non-adjacent layers. Instead of treating each layer separately, virtual weights let you think about the combined read/write path between two layers, like composing several functions into one step.
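A small sketch of what “multiplying out” means in practice. I use a row-vector convention here (activations on the left, weights on the right), and the names W_out_a, W_in_b, and the sizes are hypothetical; the point is only that the later component’s read of the earlier component’s write is a single matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_a, d_b, n_tokens = 16, 4, 6, 3

W_out_a = rng.normal(size=(d_a, d_model))  # earlier component writes to the stream
W_in_b = rng.normal(size=(d_model, d_b))   # later component reads from the stream

# Virtual weights: a direct map from component A's activations to B's input.
virtual = W_out_a @ W_in_b                 # shape (d_a, d_b)

act_a = rng.normal(size=(n_tokens, d_a))
# Everything else already in the stream (embeddings, other layers).
other = rng.normal(size=(n_tokens, d_model))

residual = other + act_a @ W_out_a         # A adds its output to the stream
read_b = residual @ W_in_b                 # B reads the whole stream

# By linearity, B's read splits into "everything else" plus A's part,
# and A's part goes straight through the virtual weights.
print(np.allclose(read_b, other @ W_in_b + act_a @ virtual))
```

Note that d_model never appears in the shape of virtual: the residual stream is just the channel the two components talk through.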

4. QK and OV circuits: “which” vs. “what”

Attention heads split their tasks clearly into two components:

QK circuit (Query-Key) decides which tokens to pay attention to.
OV circuit (Output-Value) decides what information from the selected tokens should be communicated.

Separating these operations makes attention easier to reason about. It’s like a mail delivery system where route-planning (QK) is separate from package-handling (OV).
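Here is a rough single-head sketch of that split, under my own assumptions: row-vector convention, random weights, a causal mask, and the usual 1/sqrt(d_head) scaling. W_QK and W_OV are the merged matrices; the code checks that using them gives the same output as the standard query/key/value computation.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax where each position only sees itself and earlier ones."""
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 6, 16, 4
x = rng.normal(size=(n_tokens, d_model))          # residual stream, one row per token
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))

# Standard view: queries, keys, values, then the output projection.
q, k, v = x @ W_Q, x @ W_K, x @ W_V
pattern_std = causal_softmax(q @ k.T / np.sqrt(d_head))
out_std = pattern_std @ v @ W_O

# Circuits view: two merged d_model x d_model matrices.
W_QK = W_Q @ W_K.T   # "which": scores between pairs of residual vectors
W_OV = W_V @ W_O     # "what": how an attended token changes the stream
pattern = causal_softmax(x @ W_QK @ x.T / np.sqrt(d_head))
out = pattern @ x @ W_OV

print(np.allclose(out, out_std))
```

The attention pattern depends only on W_QK and the output content only on W_OV, which is why the two questions can be asked separately.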

5. Skip-trigrams in one-layer transformers

A one-layer attention-only transformer can model relationships called skip-trigrams, patterns of the form “A… B C.” From the current token (B), the QK circuit decides which earlier token (A) to attend to, and the OV circuit determines how attending to A changes the likelihood of the next token (C). Because A can sit any distance before B, this shows that even a tiny transformer can capture useful context beyond the adjacent token.
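A toy way to see how the two circuits combine into skip-trigrams. The vocabulary and all scores below are hand-written numbers I made up, not values from the paper, and skip_trigrams is a hypothetical helper: it just pairs every strong QK entry with every strong OV entry for the same source token.

```python
import numpy as np

vocab = ["keep", "in", "at", "mind", "bay"]
idx = {tok: i for i, tok in enumerate(vocab)}
n_vocab = len(vocab)

# qk[dst, src]: how strongly the current token dst attends to earlier token src.
qk = np.zeros((n_vocab, n_vocab))
qk[idx["in"], idx["keep"]] = 3.0
qk[idx["at"], idx["keep"]] = 3.0

# ov[src, out]: logit boost for next token out when src is attended to.
ov = np.zeros((n_vocab, n_vocab))
ov[idx["keep"], idx["mind"]] = 2.0
ov[idx["keep"], idx["bay"]] = 2.0

def skip_trigrams(qk, ov, vocab, threshold=1.0):
    """List (A, B, C) triples where B attends to A and A boosts C."""
    found = []
    for dst in range(len(vocab)):
        for src in range(len(vocab)):
            if qk[dst, src] < threshold:
                continue
            for out in range(len(vocab)):
                if ov[src, out] >= threshold:
                    found.append((vocab[src], vocab[dst], vocab[out]))
    return found

for a, b, c in skip_trigrams(qk, ov, vocab):
    print(f"{a} ... {b} -> {c}")
```

In this toy, the head cannot boost “keep … in mind” and “keep … at bay” without also boosting “keep … in bay” and “keep … at mind,” because the OV effect depends only on the attended token and not on the current one. As far as I understand it, the paper points out the same kind of side effect.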

6. Copying and primitive in-context learning

One-layer attention heads often learn simple copying behavior: predicting a token identical to or closely related to an earlier one. Although basic, this copying behavior is a primitive form of in-context learning, because the model uses context to influence predictions. It’s limited, but it makes the path to more advanced contextual adaptation easier to see.

7. Induction heads and advanced in-context learning

Two-layer transformers introduce induction heads, a more advanced mechanism for in-context learning:

They look back in the sequence for previous occurrences of the current token.
They then predict the token that historically followed that occurrence.

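Unlike simple copying, induction heads work even on sequences of completely random tokens, where learned token statistics cannot help. The mechanism relies heavily on K-composition: a first-layer “previous token” head writes information about each token’s predecessor into the residual stream, and the second-layer head builds its keys from that information. This effectively shifts the keys by one position, so the query for the current token matches the position right after its earlier occurrence, which makes more complex pattern matching possible.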
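The two steps above can be written as a hard-coded rule. This is only a sketch of the behavior, with the two layers mimicked by plain Python lists; a real induction head does this softly through attention weights, and induction_prediction is a name I made up.

```python
def induction_prediction(tokens):
    """Predict the token that followed the most recent earlier occurrence
    of the last token, or None if the last token has not appeared before."""
    if not tokens:
        return None

    # "Layer 1" (previous-token head): every position records its predecessor.
    prev_token = [None] + list(tokens[:-1])

    # "Layer 2" (induction head): the query is the current token, and the
    # keys are the shifted predecessor records, so a match lands on the
    # position just after an earlier occurrence of the current token.
    query = tokens[-1]
    for pos in range(len(tokens) - 1, 0, -1):
        if prev_token[pos] == query:
            return tokens[pos]  # copy the token found at the matched position
    return None

# Works on arbitrary tokens, since nothing depends on token statistics.
print(induction_prediction(["x7", "q2", "m9", "k4", "x7"]))  # q2
```

The shifted prev_token list plays the role of K-composition: without it, the query would match the earlier occurrence itself rather than the token after it.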

What I learned

A clearer mental model of attention heads

Previously, I viewed attention heads as opaque, intertwined mechanisms. Separating QK (attention pattern) from OV (content selection) circuits gives me a cleaner mental model. It makes interpretation simpler because I can separately ask “where is the model looking?” and “what information is it moving?”

Virtual weights as a useful conceptual tool

The concept of virtual weights, where interactions between distant layers are multiplied out into direct connections, was a real “aha” moment. Instead of considering each layer in isolation, looking at their combined effect makes interpretation easier.

Understanding primitive in-context learning (copying)

At first, I was skeptical about calling simple copying “in-context learning.” After thinking about it, I realized copying does count as a basic form of adapting predictions based on context, even if the adaptation is trivial. That helped me see how more complex behavior can build from simple mechanisms.

Induction heads: how transformers learn repetition

The mechanism of induction heads, especially the “shifting the key” idea via K-composition, stood out clearly. Rather than relying only on statistical likelihoods, as simpler copying does, induction heads form sequence-based “rules”: if you saw token A before, predict what followed A previously. Seeing this happen even in random sequences made it clear that induction heads capture sequence structure, not just statistics.

Analyzing transformers through eigenvectors

I found the use of eigenvectors and eigenvalues to analyze OV circuits interesting. Eigenvectors show how the transformer clusters tokens that mutually reinforce each other’s probabilities, like groups of related words, while positive eigenvalues highlight copying or self-amplifying behavior. The analysis isn’t perfect, but it gives another way to find structure inside the transformer’s enormous matrices.
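One way to turn this into a single number is the ratio of the sum of eigenvalues to the sum of their magnitudes. I believe the paper uses a summary statistic along these lines, but treat the exact form here as my assumption; positive_eigenvalue_score is a hypothetical helper, and the matrices are toy stand-ins for a real OV circuit.

```python
import numpy as np

def positive_eigenvalue_score(matrix):
    """Return sum(eigenvalues) / sum(|eigenvalues|), a value in [-1, 1].

    Near +1 suggests copying-like behavior (directions that reinforce
    themselves); near -1 suggests the opposite. Undefined for a matrix
    whose eigenvalues are all zero.
    """
    eigvals = np.linalg.eigvals(matrix)
    # For a real matrix, complex eigenvalues come in conjugate pairs,
    # so their imaginary parts cancel and only the real parts contribute.
    return float(eigvals.real.sum() / np.abs(eigvals).sum())

copy_like = np.eye(4)                        # every direction maps to itself
anti_copy = -np.eye(4)                       # every direction is suppressed
rotation = np.array([[0.0, -1.0],
                     [1.0, 0.0]])            # purely imaginary eigenvalues

print(positive_eigenvalue_score(copy_like))
print(positive_eigenvalue_score(anti_copy))
print(positive_eigenvalue_score(rotation))
```

The identity matrix scores 1, its negation scores -1, and the rotation scores 0, which matches the intuition that only real positive eigenvalues correspond to self-amplifying directions.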

Limits of current interpretability approaches

This paper made it clear to me that some parts of transformers are now interpretable, especially attention heads and induction heads in small attention-only models, while MLP layers and more complex interaction patterns remain hard. A full understanding still depends on progress in interpreting neuron-level behavior inside MLP layers.

Summary

The Transformer Circuits paper gives a clear mathematical framework for understanding simplified transformers. Separating the roles of attention heads, introducing virtual weights, and explaining induction heads made transformer internals feel less opaque to me.

This approach clarifies the mechanics behind simpler transformers, but larger realistic models remain hard to interpret, especially MLP layers. Still, the framework gives useful building blocks for thinking about interpretability.