Following up on my review of Anthropic’s “Circuit Tracing: Revealing Computational Graphs in Language Models”, I’m now looking at its companion paper: “On the Biology of a Large Language Model”. This paper takes the methods detailed previously—using Cross-Layer Transcoders (CLTs) and attribution graphs—and applies them to investigate the internal mechanisms of Claude 3.5 Haiku across a variety of tasks.

Anthropic has clearly invested heavily in interpretability, aiming to move beyond treating LLMs as pure black boxes. This paper showcases that effort by attempting to map out the “circuits” or computational pathways the model uses. The “biology” framing feels quite apt; rather than analyzing a system with human-designed logic, it’s more like exploring an organism that has “grown” through training, trying to understand its internal structures and functions. This paper presents the findings from that exploration.


Key Concepts

This paper primarily focuses on the findings derived from applying the Circuit Tracing methodology. The core concepts underpinning these findings are:

Circuit Tracing Recap (via CLTs)
The fundamental approach relies on the Cross-Layer Transcoder (CLT) methodology detailed in the companion paper. CLTs are used to build an interpretable “replacement model” that emulates the original model’s MLP layers, using sparse, learned “features” (ideally representing meaningful concepts) instead of dense neuron activations. Because each feature reads from one layer but can write into the MLP outputs of later layers as well, information flow can be traced across the model’s depth.
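
To keep the picture straight in my head, here's a tiny numpy sketch of that read/write pattern (entirely my own toy with random weights; the real CLT gets its sparsity and meaningful features from training, which I skip):

```python
# Toy cross-layer transcoder "replacement model": each feature reads the residual
# stream at one layer, but its decoder writes into the reconstructed MLP output of
# that layer and every later layer.
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, n_features = 4, 16, 64

# Encoder: one read-in matrix per layer (features "live" at the layer they read from).
W_enc = [rng.normal(size=(n_features, d_model)) * 0.1 for _ in range(n_layers)]
# Decoder: features at layer l get a write-out matrix for every layer lp >= l.
W_dec = {(l, lp): rng.normal(size=(d_model, n_features)) * 0.1
         for l in range(n_layers) for lp in range(l, n_layers)}

def clt_reconstruct(residual_streams):
    """Replace each layer's MLP output with a sum of sparse-feature write-outs."""
    feats = [np.maximum(W_enc[l] @ residual_streams[l], 0.0)  # ReLU keeps only positive features
             for l in range(n_layers)]
    mlp_out = []
    for lp in range(n_layers):
        # The MLP output at layer lp is reconstructed from features at all layers <= lp.
        mlp_out.append(sum(W_dec[(l, lp)] @ feats[l] for l in range(lp + 1)))
    return feats, mlp_out

residuals = [rng.normal(size=d_model) for _ in range(n_layers)]
feats, mlp_out = clt_reconstruct(residuals)
print([int((f > 0).sum()) for f in feats])  # number of active features per layer
```

The structural point is that the decoder is indexed by a pair of layers: a feature that fires early can contribute directly to an MLP output much later, which (as I understand it) is part of what keeps the resulting graphs from exploding into long layer-by-layer chains.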

Attribution Graphs as Explanations
For specific prompts, the researchers generate attribution graphs. These graphs visualize how active features, input tokens, and error terms interact and influence each other, ultimately leading to the model’s output token prediction. They serve as the primary tool for hypothesizing about the model’s internal mechanisms.
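
Under the hood these are weighted directed graphs over token, feature, and error nodes plus the output logits. A toy rendering in Python (the labels preview the Dallas example discussed below; the weights are numbers I made up to stand in for attribution scores):

```python
# Toy attribution graph as a weighted DAG; edge weights stand in for direct-effect attributions.
edges = {
    ("tok:Dallas",         "feat:Dallas"):        1.0,
    ("tok:capital",        "feat:say-a-capital"): 0.9,
    ("feat:Dallas",        "feat:Texas"):         0.8,
    ("feat:Texas",         "logit:Austin"):       0.7,
    ("feat:say-a-capital", "logit:Austin"):       0.6,
    ("error:mlp",          "logit:Austin"):       0.1,  # unexplained (error-node) influence
}

def paths(node, target):
    """Depth-first enumeration of weighted paths from `node` to `target`."""
    if node == target:
        return [([node], 1.0)]
    found = []
    for (src, dst), w in edges.items():
        if src == node:
            for tail, tail_weight in paths(dst, target):
                found.append(([node] + tail, w * tail_weight))
    return found

for path, weight in sorted(paths("tok:Dallas", "logit:Austin"), key=lambda p: -p[1]):
    print(f"{weight:.2f}  " + " -> ".join(path))
```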

Supernodes for Simplification
Given the complexity of raw attribution graphs, the paper often groups related features that play similar roles into “supernodes” (e.g., grouping various “Texas-related” features). This manual abstraction helps in presenting a clearer, higher-level picture of the computational flow.
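
In code terms a supernode is little more than a labelled grouping plus an aggregate activation, something like this (the feature names and numbers are invented):

```python
# Toy supernode: hand-group related features and pool their activations.
feature_activations = {
    "feat_1023_texas_cities":  0.6,   # hypothetical feature labels
    "feat_2201_texas_flag":    0.3,
    "feat_0917_lone_star":     0.4,
    "feat_3310_say_a_capital": 0.9,
}
supernodes = {
    "Texas": ["feat_1023_texas_cities", "feat_2201_texas_flag", "feat_0917_lone_star"],
    "say a capital": ["feat_3310_say_a_capital"],
}
supernode_activation = {name: sum(feature_activations[f] for f in members)
                        for name, members in supernodes.items()}
print(supernode_activation)  # {'Texas': 1.3..., 'say a capital': 0.9}
```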

Intervention-Based Validation
A key part of the methodology is validating the hypotheses derived from attribution graphs. This involves performing interventions (activating, inhibiting, or swapping features/supernodes) directly within the original model and observing whether the effects on downstream activations and the final output match the predictions from the graph. The success of these interventions lends confidence that the traced circuits reflect genuine mechanisms.
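
The paper runs its interventions on CLT features inside Claude 3.5 Haiku with Anthropic's own tooling; the general pattern, though, is easy to sketch on a toy PyTorch model with a forward hook:

```python
# Sketch of intervention-style validation on a toy model (not Anthropic's stack):
# patch a chosen hidden unit mid-forward-pass, then check whether the output
# shifts the way the attribution graph predicts.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(1, 8)

FEATURE_IDX = 3   # pretend this unit is the "feature" the graph highlights
SCALE = 0.0       # 0.0 inhibits it; values > 1.0 would amplify it instead

def intervene(module, inputs, output):
    patched = output.clone()
    patched[:, FEATURE_IDX] *= SCALE   # clamp the unit's activation
    return patched                     # returning a tensor replaces the module's output

baseline = model(x)
handle = model[1].register_forward_hook(intervene)  # hook the post-ReLU activations
ablated = model(x)
handle.remove()

print("max output shift from ablating one unit:",
      (ablated - baseline).abs().max().item())
```

In the paper the knob is the activation of a CLT feature or a whole supernode rather than a single hidden unit, but the validation logic is the same: predict the downstream effect from the graph, then intervene and check.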

Focus on Diverse Case Studies
The paper applies this methodology to a wide range of behaviors exhibited by Claude 3.5 Haiku, including multi-step reasoning, poetry generation, multilingual processing, arithmetic, medical diagnosis, hallucination handling, safety refusals, jailbreaks, chain-of-thought faithfulness, and even analyzing a model with a hidden goal. Each case study aims to reveal the specific circuits involved.


Key Takeaways (What I Learned)

Reading through the various case studies provided some fascinating glimpses into the model’s inner workings:

Multi-step Reasoning (Dallas/Texas/Austin)
This was a compelling example. Seeing the model activate features for ‘Dallas’, then ‘Texas’, then combine ‘Texas’ with ‘capital’ features to output ‘Austin’ felt like watching it reason. The feature swapping experiment—replacing ‘Texas’ features with ‘California’ features and getting ‘Sacramento’—was particularly convincing. It showed that these learned features aren’t just correlations; they represent concepts the model uses causally. It felt like directly manipulating the model’s internal knowledge representation. This wouldn’t be possible with opaque neuron activations.
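
A cartoon of that swap in numpy, with made-up directions (nothing here comes from the actual model; it just shows the "subtract one concept's contribution, add another's" arithmetic):

```python
# Toy Texas -> California swap: remove the "Texas" component from a hidden state,
# inject a "California" component at the same strength, and see which readout wins.
import numpy as np

rng = np.random.default_rng(1)
d = 32
texas      = rng.normal(size=d); texas      /= np.linalg.norm(texas)
california = rng.normal(size=d); california /= np.linalg.norm(california)
austin_readout     = texas      + 0.1 * rng.normal(size=d)  # toy output directions
sacramento_readout = california + 0.1 * rng.normal(size=d)

hidden = 2.0 * texas + 0.2 * rng.normal(size=d)  # a state with "Texas" strongly active

def swap(h, old_dir, new_dir):
    strength = h @ old_dir                       # how strongly the old concept is present
    return h - strength * old_dir + strength * new_dir

patched = swap(hidden, texas, california)
for name, readout in [("Austin", austin_readout), ("Sacramento", sacramento_readout)]:
    print(f"{name:10s} before={hidden @ readout:+.2f}  after={patched @ readout:+.2f}")
# The "Austin" readout dominates before the swap, "Sacramento" after it.
```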

Planning in Poems (Carrot/Rabbit/Habit)
This was quite surprising. I initially assumed the model would improvise rhymes word-by-word. Instead, the analysis showed it plans potential rhyming words (‘rabbit’, ‘habit’) on the newline token before starting the line. These “planned word” features then guide the generation of the entire line. What struck me, looking at the interactive graph and thinking about the prompt (“He saw a carrot…”), was the likely influence of ‘carrot’ biasing the model towards ‘rabbit’ over ‘habit’. Carrots and rabbits are so strongly linked! While the paper focused on the rhyming circuit, this semantic priming seems like a crucial, parallel influence. The use of the newline token as a “planning site” was also a neat finding.
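
Here is how I picture the mechanism, as a deliberately tiny toy (my own construction, not the paper's code):

```python
# Toy "plan at the newline" picture: a planned-word vector written at the line break
# is read out later to pick the line-final word; suppressing one candidate in the
# plan flips the ending, mirroring the paper's intervention.
import numpy as np

candidates = {"rabbit": np.eye(3)[0], "habit": np.eye(3)[1], "grab it": np.eye(3)[2]}

# Plan written at the newline token: mostly 'rabbit' (plausibly nudged by 'carrot'
# in the prompt), with 'habit' as a weaker alternative.
plan = 1.0 * candidates["rabbit"] + 0.6 * candidates["habit"]

def line_ending(plan_vec):
    return max(candidates, key=lambda w: plan_vec @ candidates[w])

print(line_ending(plan))          # -> rabbit

# Suppress the 'rabbit' component of the plan: the line now lands on the other
# prepared rhyme instead.
rabbit = candidates["rabbit"]
suppressed = plan - (plan @ rabbit) * rabbit
print(line_ending(suppressed))    # -> habit
```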

Multilingual Circuits
The paper confirmed the existence of both language-specific features (often near the input/output) and more abstract, language-agnostic features (often in the middle layers). It was interesting that Haiku showed more language-agnostic representations than smaller models, suggesting this abstraction ability correlates with capability. The interventions swapping the operation (antonym for synonym), the operand (“small” for “hot”), or the output language itself worked remarkably well, demonstrating modularity. The fact that intervention thresholds (like needing roughly 4x the natural activation for the synonym swap) were consistent across languages strongly supports the idea that genuinely multilingual features were being manipulated. It makes you wonder if the model develops a kind of internal “interlingua” or just learns very robust cross-lingual mappings.
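
One way to picture the "language-agnostic" claim is as overlap between the feature sets active on translations of the same prompt; a toy version with invented feature IDs:

```python
# Toy overlap measurement: which features fire on all translations of one prompt?
en = {101, 202, 303, 404, 550}   # features active on the English prompt (made-up IDs)
fr = {101, 202, 303, 404, 731}   # ...on the French translation
zh = {101, 202, 303, 888, 942}   # ...on the Chinese translation

shared = en & fr & zh
print("shared across all three:", sorted(shared))
print(f"overlap fraction: {len(shared) / len(en | fr | zh):.2f}")
# The paper's finding, loosely: this kind of overlap peaks in the middle layers and
# is larger for Claude 3.5 Haiku than for a smaller model.
```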

Addition (Lookup Tables & Generalization)
The way the model performs addition wasn’t through a standard algorithm but via learned heuristics and “lookup table” features (e.g., a feature activating for inputs ending in 6 and 9, promoting outputs ending in 5). This really reminded me of memorizing multiplication tables in elementary school – it seems the LLM found a similar strategy! The operand plots visualizing feature activations were incredibly clear. Even more impressive was the generalization: seeing a “_6 + _9 -> _5” feature activate correctly not just in “calc: 36+59=” but also in contexts like calculating citation years or filling spreadsheet values showed remarkable reuse of an abstract mechanism.
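
A toy "operand plot" for that kind of feature makes the lookup-table flavor obvious (this is my reconstruction of the idea, not the paper's actual feature):

```python
# A lookup-table-style feature: fires when the first operand ends in 6 and the
# second ends in 9, which is exactly when the sum's last digit is 5.
def lookup_feature(a, b):
    return a % 10 == 6 and b % 10 == 9

print("rows: a % 10, cols: b % 10")
for a in range(10):
    print("".join("#" if lookup_feature(a, b) else "." for b in range(10)))

# Sanity check over two-digit operands: whenever the feature fires, a 5 is the
# right output digit to promote.
assert all((a + b) % 10 == 5
           for a in range(100) for b in range(100) if lookup_feature(a, b))
```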

Entity Recognition and Hallucinations
The idea of a “default refusal” circuit that assumes unfamiliarity, which then gets inhibited by “known entity” features (like for ‘Michael Jordan’), provides a plausible mechanism for how models decide whether to answer or decline. It also explains some hallucinations: if a name (like ‘Andrej Karpathy’) is familiar enough to trigger the “known” features, the model might suppress its refusal even if it lacks the specific requested information (a paper he wrote), leading it to guess.
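
The proposed logic is simple enough to caricature in a few lines (my own framing, with made-up numbers):

```python
# Caricature of the "default refusal" circuit: an always-on "unfamiliar name ->
# decline" drive that known-entity features inhibit. Hallucination risk appears
# when the name is familiar but the requested fact isn't actually known.
def respond(entity_familiarity, knows_requested_fact, threshold=0.5):
    refusal_drive = 1.0 - entity_familiarity   # known-entity features inhibit the refusal
    if refusal_drive > threshold:
        return "decline (can't answer)"
    return "answer from memory" if knows_requested_fact else "guess (hallucination risk)"

print(respond(entity_familiarity=0.9, knows_requested_fact=True))    # Michael Jordan + sport
print(respond(entity_familiarity=0.1, knows_requested_fact=False))   # unknown name -> decline
print(respond(entity_familiarity=0.8, knows_requested_fact=False))   # Karpathy + obscure paper
```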

Jailbreaks (BOMB Example)
This was fascinating. The model didn’t initially refuse because the obfuscated input prevented it from “understanding” the request was for “BOMB” until it actually generated the word. It literally had to see itself write “BOMB” via one circuit before another circuit could flag it as problematic. Even then, the drive for grammatical coherence and completing its sentence delayed the refusal. It highlights how different internal processes can compete and how surface-level constraints (like grammar or following instructions) can sometimes override safety mechanisms, at least temporarily.


Summary & Final Thoughts

The “On the Biology of a Large Language Model” paper provides a rich set of case studies demonstrating how circuit tracing can illuminate the complex, often non-intuitive mechanisms inside LLMs. It moves interpretability from abstract concepts towards concrete analysis of specific computations.

The “biology” metaphor holds up well. We’re not reverse-engineering clean, human-designed code; we’re exploring a complex system that learned its strategies organically. The process feels very much like neuroscience – probing and mapping to understand function. The interventions, especially feature swapping, are akin to stimulating or lesioning specific brain regions to see the effect. It really feels like we’re poking and prodding at a digital mind in its early stages.

One of the most exciting implications for me is the potential for practical data curation and model improvement. If we can use these circuit-tracing tools to understand how a model represents concepts or performs reasoning steps, we can potentially identify which data points led to faulty or undesirable circuits. Imagine pinpointing data that causes a specific bias or a logical error reflected in the model’s internal structure. This insight could allow engineers to “massage the data” much more effectively – pruning harmful examples or strategically adding data to reinforce beneficial circuits. Machine learning is heavily reliant on data quality, and this approach offers a path to making data curation less of a guessing game and more of a targeted intervention based on internal model understanding.

While the methods have limitations (unexplained variance, complexity, potential unfaithfulness), this work represents a significant step forward. It provides not just findings, but also a methodology and a set of tools that allow us to ask detailed questions about how these powerful models arrive at their answers, paving the way for deeper understanding and potentially more reliable and controllable AI.