Circuit Tracing – Review

This time, I’m looking at Anthropic’s paper “Circuit Tracing: Revealing Computational Graphs in Language Models”. The paper is about understanding what happens inside large language models. Instead of treating them as black boxes, the authors propose a method for mapping a model’s internal computations for specific tasks into interpretable graphs. It’s like trying to reverse engineer the model’s “thinking process.”

This paper focuses on the methods. There’s a companion paper, “On the Biology of a Large Language Model”, which uses these techniques to explore model internals. I decided to read the methods paper first so I could understand the tools before diving into the “biology” results.

The core idea is to use a special type of dictionary learning model called a “cross-layer transcoder” (CLT) to replace parts of the original network, specifically the MLP layers. This replacement lets them build “attribution graphs” that show how information flows through interpretable “features” when the model processes a prompt.

Key concepts

Cross-layer transcoder (CLT) and replacement model

The paper introduces CLTs as the core component. Unlike standard autoencoders that just reconstruct their input, a CLT is trained to emulate the output of an LLM’s MLP layers using sparse, interpretable features. An encoder reads from the residual stream at one layer and activates a sparse set of features. The decoders associated with those features then contribute to approximating the MLP outputs at that layer and at all subsequent layers (hence “cross-layer”). By substituting the trained CLT for the original MLPs, the authors create an interpretable “replacement model.”
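To make the “cross-layer” part concrete for myself, here is a minimal sketch in plain Python. It is a toy under my own assumptions, not the paper’s code: a plain ReLU stands in for whatever activation the paper uses, the residual-stream vectors are simply given as inputs, the weights are made up, and the names clt_mlp_outputs, encoders, and decoders are mine.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def clt_mlp_outputs(residuals, encoders, decoders):
    """Toy cross-layer transcoder forward pass (hypothetical helper).

    residuals[l]     : residual-stream vector read at layer l
    encoders[l]      : matrix of shape (n_features_l, d_model)
    decoders[(l, m)] : matrix of shape (d_model, n_features_l), used by
                       features read at layer l to write to layer m >= l
    """
    n_layers = len(residuals)
    d_model = len(residuals[0])
    # Each layer's features are computed from that layer's residual stream.
    acts = [relu(matvec(encoders[l], residuals[l])) for l in range(n_layers)]
    outputs = []
    for m in range(n_layers):
        y = [0.0] * d_model
        # The approximation of MLP m sums contributions from layers 0..m.
        for l in range(m + 1):
            y = add(y, matvec(decoders[(l, m)], acts[l]))
        outputs.append(y)
    return acts, outputs

# Made-up toy weights: 2 layers, d_model = 2.
residuals = [[1.0, -1.0], [0.5, 2.0]]
encoders = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
decoders = {
    (0, 0): [[2.0, 0.0], [0.0, 2.0]],
    (0, 1): [[0.0, 1.0], [1.0, 0.0]],  # layer-0 features also write to layer 1
    (1, 1): [[1.0], [-1.0]],
}
acts, outputs = clt_mlp_outputs(residuals, encoders, decoders)
print(acts)     # [[1.0, 0.0], [2.5]]
print(outputs)  # [[2.0, 0.0], [2.5, -1.5]]
```

The thing to notice is the (0, 1) decoder: a feature read at layer 0 contributes directly to the approximation of the layer 1 MLP output, which is what an ordinary per-layer transcoder would not do.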

Attribution graphs

For a specific prompt, the authors construct an attribution graph. This graph visualizes the step-by-step computation within a localized version of the replacement model. Nodes in the graph represent active CLT features, input token embeddings, reconstruction errors, and output logits. Edges represent the direct, linear influence of one node on another (calculated after freezing attention patterns and normalization constants for that prompt). These graphs aim to show the flow of information and computation leading to the model’s output.
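A small sketch helped me see why edges can be read as additive contributions. It assumes the simplest possible path, where a source feature writes into the residual stream and the target feature reads it directly, so the frozen attention and normalization terms drop out. The feature names, vectors, and the helper edge_weight are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def edge_weight(source_activation, source_decoder, target_encoder):
    # Direct linear influence of one source node on the target's pre-activation
    # (toy case: no attention or normalization on the path).
    return source_activation * dot(source_decoder, target_encoder)

# Hypothetical source features: (activation, decoder vector).
sources = {
    "feature_a": (2.0, [1.0, 0.0]),
    "feature_b": (0.5, [0.0, 4.0]),
}
target_encoder = [3.0, -1.0]

edges = {
    name: edge_weight(act, dec, target_encoder)
    for name, (act, dec) in sources.items()
}

# What the target actually reads: the sum of everything written to the stream.
residual = [sum(act * dec[i] for act, dec in sources.values()) for i in range(2)]
target_preactivation = dot(target_encoder, residual)

print(edges)                 # {'feature_a': 6.0, 'feature_b': -2.0}
print(target_preactivation)  # 4.0
```

Because everything on the path is linear, the incoming edge weights sum exactly to the target’s pre-activation. As I understand it, freezing attention patterns and normalization constants is what preserves this property in the real setting.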

Local replacement model

To ensure the attribution graph accurately reflects the model’s output for a specific prompt, they use a “local” replacement model. This model uses the CLT features but also incorporates the exact attention patterns and normalization factors from the original model’s run on that prompt. It also adds back any difference (error) between the CLT’s output and the true MLP output at each step. This makes the local model’s final output identical to the original model’s for that prompt, providing a precise basis for the attribution graph.
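The error term is simple enough to write down. Below is a toy sketch with made-up numbers, where local_layer_output is a hypothetical helper of mine: the error is whatever the CLT misses at a layer, and adding it back recovers the true MLP output.

```python
def local_layer_output(clt_out, true_mlp_out):
    """Return (output, error) for one layer of a toy local replacement model."""
    # Error node: the part of the true MLP output the CLT failed to reconstruct.
    error = [t - c for t, c in zip(true_mlp_out, clt_out)]
    # Adding the error back makes the layer output match the original model.
    output = [c + e for c, e in zip(clt_out, error)]
    return output, error

true_mlp_out = [1.5, -0.25]  # made-up "real" MLP output for one token
clt_out = [1.0, 0.5]         # made-up CLT approximation

output, error = local_layer_output(clt_out, true_mlp_out)
print(error)   # [0.5, -0.75]
print(output)  # [1.5, -0.25]
```

The catch is that the error term is itself uninterpreted, so the more a graph leans on error nodes, the less of the computation it actually explains.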

Features as interpretable units

The sparse activations learned by the CLTs are treated as “features”: ideally, interpretable building blocks of the model’s computation, such as a concept like “digital”, a state like “in an acronym”, or an action like “say DAG”. The goal is to understand circuits as interactions between these meaningful features.

Validation via interventions

The paper stresses validating the mechanisms found in attribution graphs. Since the replacement model might differ from the original, they use perturbation experiments (like activating or inhibiting specific features) in the original model to check if the downstream effects match what the attribution graph predicts.
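Here is a toy illustration of why a graph’s prediction and an actual intervention can disagree. It is my own construction, not the paper’s procedure: one downstream unit with a ReLU, two hypothetical upstream features, and made-up weights. The linear edge predicts how much the downstream unit should drop when a feature is ablated, while the re-run shows what actually happens.

```python
def relu(x):
    return max(0.0, x)

def downstream_activation(upstream, weights, bias):
    pre = sum(weights[name] * act for name, act in upstream.items()) + bias
    return relu(pre)

# Hypothetical upstream feature activations and made-up weights.
upstream = {"digital": 2.0, "analytics": 1.0}
weights = {"digital": 1.5, "analytics": 0.5}
bias = -1.0

baseline = downstream_activation(upstream, weights, bias)

# Linear, graph-style prediction: removing the feature removes its edge weight.
predicted_change = -weights["digital"] * upstream["digital"]

# Actual intervention: set the feature to zero and recompute downstream.
ablated = dict(upstream, digital=0.0)
actual_change = downstream_activation(ablated, weights, bias) - baseline

print(baseline)          # 2.5
print(predicted_change)  # -3.0
print(actual_change)     # -2.5
```

The prediction gets the direction right but overshoots, because the ReLU clips the downstream unit at zero. Gaps like this are the reason to check a graph against real perturbations rather than trusting it outright.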

What I learned

Transcoders vs. SAEs

A key distinction clicked for me: Sparse Autoencoders (SAEs) aim to reconstruct their input activations sparsely. Transcoders, however, are trained to emulate the computation of a component like an MLP layer, taking the MLP’s input and predicting its output using sparse features. This emulation aspect is why they work well for circuit analysis. They directly model the transformation step, allowing feature interactions to bridge over the original non-linear MLP. The name “transcoder” feels apt: it’s encoding and decoding, but transforming the signal to match a different target, the MLP output.
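The difference comes down to the training target, which a few lines of Python make explicit. This is a sketch under my own assumptions: the sparsity penalty is left out, toy_mlp is a made-up stand-in for a real MLP layer, and the function names are mine.

```python
def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def sae_loss(x, model):
    # SAE: the target is the input itself.
    return mse(model(x), x)

def transcoder_loss(x, model, mlp):
    # Transcoder: the target is what the MLP computes from that input.
    return mse(model(x), mlp(x))

def toy_mlp(x):
    # Made-up nonlinear "MLP": ReLU followed by doubling.
    return [2.0 * max(0.0, v) for v in x]

def identity(x):
    return list(x)

x = [1.0, -2.0]
print(sae_loss(x, identity))                  # 0.0
print(transcoder_loss(x, identity, toy_mlp))  # 2.5
```

A model that merely copies its input is a perfect SAE by this measure and a poor transcoder, because the transcoder has to account for the transformation the MLP performs.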

Visualizing the “thinking process” with graphs

The attribution graphs themselves are useful. Seeing the model’s process laid out for a specific prompt, like generating an acronym, felt like getting a glimpse into its internal logic. The way features activate based on input tokens like “Digital”, “Analytics”, and “Group”, then interact to promote the final output “DAG”, makes the computation more tangible than just looking at activations.

Interactive exploration

The authors developed an interactive interface for exploring these graphs. This is useful because a graph is more than a static picture: it is something to examine node by node, following edges to trace the circuitry.

Layer progression

Looking at the features and their roles in the graphs, a pattern seemed apparent, matching general intuitions about deep networks. Features in earlier layers often seem more semantic, tied closely to specific input tokens or concepts, such as the word “digital”. Features in later layers appear more abstract or functional, involved in manipulating information or preparing the final output, such as “say DAG” or “sum ~92”.

Addition circuitry and number representation

The case study on addition, such as 36+59=95, was interesting. They identified different types of features: ones detecting properties of the input numbers (“add function features”), ones acting like lookup tables, such as _6 + _9, and ones representing properties of the sum (“sum features”). This connects to related work on helix-like number representations, which suggests numbers might be represented in a structured way in embeddings, with arithmetic operations corresponding to geometric transformations. Seeing the transcoder find features that seem to implement parts of this arithmetic process in Claude 3.5 Haiku was convincing.
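To check my own understanding of how such features could combine, I wrote a toy version in Python. It is only one plausible reading: the functions below are ordinary code, not learned features, and the window of ±5 around a “sum ~92” style estimate is an assumption I picked for illustration.

```python
def last_digit_lookup(a, b, a_digit, b_digit):
    # Toy "_6 + _9" style lookup: fires on a specific pair of last digits.
    return a % 10 == a_digit and b % 10 == b_digit

def candidates(center, width, last_digit):
    # Sums near a coarse estimate that also end in the required digit.
    return [
        s for s in range(center - width, center + width + 1)
        if s % 10 == last_digit
    ]

a, b = 36, 59
fires = last_digit_lookup(a, b, a_digit=6, b_digit=9)
implied_last_digit = (6 + 9) % 10       # what the lookup entry "stores"
answer = candidates(center=92, width=5, last_digit=implied_last_digit)

print(fires)               # True
print(implied_last_digit)  # 5
print(answer)              # [95]
```

Neither signal is enough alone: the coarse estimate allows eleven values and the last digit allows infinitely many, but together they leave only 95.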

The caps lock token

A small detail I noticed in the acronym example was the tokenizer using a special “Caps Lock” token (⇪). I’m not sure of its exact function, but it was interesting to see it explicitly represented. It makes you wonder how specific tokenization choices influence learning.

Summary

The “Circuit Tracing” paper presents a method for mapping the internal workings of LLMs using cross-layer transcoders and attribution graphs. By replacing opaque MLP computations with interpretable feature-based emulations, it allows detailed tracing of information flow on specific prompts.

The method is most convincing in its case studies: the acronym and addition examples show concrete mechanisms that it can uncover.

Thinking about the companion paper’s title, “On the Biology of a Large Language Model,” the term “biology” feels appropriate here. We’re not analyzing traditional, rule-based systems like operating systems or computer networks, which are human-designed and follow explicit logic. Instead, these LLMs are more like systems that have been grown from vast amounts of data. Understanding them means peering inside, mapping their structures, and figuring out how different parts contribute to behavior, much like studying the anatomy and function of a biological organism. We’re dissecting a synthetic brain to understand how it works. This paper provides some sharp tools for that dissection.