On the Biology of a Large Language Model – Review

Following up on my review of Anthropic’s “Circuit Tracing: Revealing Computational Graphs in Language Models”, I’m now looking at its companion paper: “On the Biology of a Large Language Model”. This paper takes the methods from the previous paper, using Cross-Layer Transcoders (CLTs) and attribution graphs, and applies them to Claude 3.5 Haiku across a variety of tasks.

Anthropic has clearly invested heavily in interpretability, trying to move beyond treating LLMs as pure black boxes. This paper is part of that effort: it maps out the “circuits” or computational pathways the model uses. The “biology” framing feels apt. Rather than analyzing a system with human-designed logic, it’s more like exploring an organism that has “grown” through training and trying to understand its internal structures.

Key concepts

This paper mainly focuses on the findings from applying the circuit tracing methodology. The core concepts behind those findings are:

Circuit tracing recap

The approach relies on the Cross-Layer Transcoder (CLT) methodology from the companion paper. CLTs create an interpretable “replacement model” that emulates the original model’s MLP layers. This replacement uses sparse, learned “features”, ideally representing meaningful concepts, instead of dense neuron activations. Each CLT feature reads from the residual stream at one layer and can write to the MLP outputs of that layer and all later layers, which is what allows information flow to be traced across layers.
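To make the “cross-layer” part concrete, here is a minimal sketch in plain Python. It is a toy under my own assumptions, not Anthropic’s implementation: the sizes, the weights, and the names `encode`, `reconstruct_mlp_output`, `W_ENC_0`, `W_ENC_1`, and `W_DEC` are all made up for illustration. It only shows the shape of the idea: sparse, non-negative features are read at each layer, and the MLP output at a layer is rebuilt from features of that layer and every earlier one.

```python
# Toy cross-layer transcoder (illustrative assumptions only, not the paper's code).

def relu(values):
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    # Each row of `matrix` produces one output entry.
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def encode(residual, w_enc):
    """Read sparse, non-negative feature activations from one layer's residual stream."""
    return relu(matvec(w_enc, residual))

def reconstruct_mlp_output(target_layer, features_by_layer, w_dec):
    """Rebuild the MLP output at `target_layer` from features at that layer and earlier.

    w_dec[(src, tgt)] is a hypothetical decoder matrix from layer-`src` features
    to the layer-`tgt` MLP output; it only exists for src <= tgt.
    """
    out = [0.0] * len(w_dec[(target_layer, target_layer)])
    for src, feats in features_by_layer.items():
        if src <= target_layer:
            contribution = matvec(w_dec[(src, target_layer)], feats)
            out = [o + c for o, c in zip(out, contribution)]
    return out

# Made-up weights: 2-dimensional residual stream, 3 features at layer 0, 2 at layer 1.
W_ENC_0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_ENC_1 = [[1.0, 1.0], [-1.0, 0.0]]
W_DEC = {
    (0, 0): [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    (0, 1): [[0.5, 0.0, 0.0], [0.0, 0.0, 1.0]],  # layer-0 features writing to layer 1
    (1, 1): [[2.0, 0.0], [0.0, 1.0]],
}

features = {
    0: encode([1.0, -2.0], W_ENC_0),
    1: encode([0.5, 0.5], W_ENC_1),
}
mlp_out_1 = reconstruct_mlp_output(1, features, W_DEC)
```

In this toy, the layer-1 output has a contribution from a layer-0 feature. That is the property that lets a graph edge skip over layers instead of passing through every intermediate MLP.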

Attribution graphs as explanations

For specific prompts, the researchers generate attribution graphs. These graphs visualize how active features, input tokens, and error terms interact and influence each other, ultimately leading to the model’s output token prediction. They are the main tool for forming hypotheses about the model’s internal mechanisms.
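As a rough mental model of how such a graph is read, the sketch below treats it as a small weighted DAG and sums the products of edge weights along every path from a source node to the output. The node names and weights are invented for illustration, and `total_influence` is my own helper, not something from the paper; the real graphs also include error nodes and are far larger.

```python
# Toy attribution graph: edge weights are made-up direct attributions (source -> target).
EDGES = {
    ("tok:Dallas", "feat:Texas"): 0.8,
    ("tok:capital", "feat:say a capital"): 0.9,
    ("feat:Texas", "logit:Austin"): 0.6,
    ("feat:say a capital", "logit:Austin"): 0.5,
    ("tok:Dallas", "logit:Austin"): 0.1,  # a direct shortcut edge
}

def total_influence(source, target, edges):
    """Sum, over all paths from source to target, of the product of edge weights.

    Assumes the graph is acyclic, as a feed-forward attribution graph would be.
    """
    if source == target:
        return 1.0
    return sum(
        weight * total_influence(dst, target, edges)
        for (src, dst), weight in edges.items()
        if src == source
    )

dallas_effect = total_influence("tok:Dallas", "logit:Austin", EDGES)
capital_effect = total_influence("tok:capital", "logit:Austin", EDGES)
```

Here the “Dallas” token reaches the output both directly and through the intermediate “Texas” feature, which is the kind of structure the case studies read off the graphs.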

Supernodes for simplification

Raw attribution graphs are complex, so the paper often groups related features that play similar roles into “supernodes”, such as grouping various Texas-related features. This manual abstraction makes the computational flow easier to present.

Intervention-based validation

A key part of the methodology is validating hypotheses from attribution graphs. This means performing interventions, such as activating, inhibiting, or swapping features and supernodes, directly inside the original model and checking whether the downstream effects match the graph’s predictions. If the interventions work, that gives more confidence that the traced circuits reflect real mechanisms.
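The logic of such a check can be sketched with a toy stand-in for the model. Everything here is hypothetical: `toy_model` is a made-up linear readout rather than Claude, and `intervene` simply rescales one named activation. The point is only the workflow: record the baseline output, patch a feature, rerun, and compare the change with what the graph predicted.

```python
def toy_model(features):
    """Hypothetical downstream computation: a linear readout from features to two logits."""
    readout = {
        "Austin": {"Texas": 2.0, "capital": 1.0},
        "Dallas": {"Texas": 1.0, "capital": -0.5},
    }
    return {
        token: sum(weight * features.get(name, 0.0) for name, weight in weights.items())
        for token, weights in readout.items()
    }

def intervene(features, name, scale):
    """Return a copy of the features with one activation multiplied by `scale`."""
    patched = dict(features)
    patched[name] = patched.get(name, 0.0) * scale
    return patched

baseline = {"Texas": 1.0, "capital": 1.0}
before = toy_model(baseline)
after = toy_model(intervene(baseline, "capital", 0.0))  # inhibit the "capital" feature

# If the graph says "capital" promotes "Austin", inhibiting it should lower that logit.
austin_drop = before["Austin"] - after["Austin"]
```

A scale of 0.0 corresponds to inhibition, and a scale above 1.0 to over-activation; swapping would replace one feature’s activation with another’s.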

Diverse case studies

The paper applies this methodology to many behaviors in Claude 3.5 Haiku, including multi-step reasoning, poetry generation, multilingual processing, arithmetic, medical diagnosis, hallucination handling, safety refusals, jailbreaks, chain-of-thought faithfulness, and even a model with a hidden goal. Each case study tries to reveal the circuits involved.

What I learned

Reading through the case studies gave me some interesting glimpses into the model’s inner workings:

Multi-step reasoning

The Dallas/Texas/Austin example was the clearest illustration. Seeing the model activate features for “Dallas”, then “Texas”, then combine “Texas” with “capital” features to output “Austin” felt like watching it reason. The feature-swapping experiment, replacing “Texas” features with “California” features and getting “Sacramento”, was convincing. It suggested that these learned features are not just correlates of the output; they represent intermediate concepts the model uses causally. It felt like directly manipulating the model’s internal knowledge representation.
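A dictionary analogy helps me keep the two hops straight. This is only an analogy under my own assumptions: the model does not store facts in Python dicts, and `two_hop` and its `swap_state` argument are hypothetical names. It shows why overwriting the intermediate “state” step changes the final answer even though the input stays the same.

```python
# Toy two-hop lookup with an intervention on the intermediate step.
CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}

def two_hop(city, swap_state=None):
    state = CITY_TO_STATE[city]      # hop 1: city -> state
    if swap_state is not None:
        state = swap_state           # intervention: overwrite the intermediate concept
    return STATE_TO_CAPITAL[state]   # hop 2: state -> capital

normal_answer = two_hop("Dallas")
swapped_answer = two_hop("Dallas", swap_state="California")
```

If the model had instead memorized a direct “Dallas -> Austin” shortcut, there would be no intermediate step to overwrite, and the swap would have no effect.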

Planning in poems

This was quite surprising. I initially assumed the model would improvise rhymes word by word. Instead, the analysis showed it plans candidate rhyming words, such as “rabbit” and “habit”, on the newline token before starting the line. These “planned word” features then guide the generation of the entire line. What struck me, looking at the interactive graph and thinking about the prompt (“He saw a carrot…”), was how “carrot” likely biases the model toward “rabbit” over “habit”. Carrots and rabbits are so strongly linked! While the paper focused on the rhyming circuit, this semantic priming seems like an important parallel influence. The use of the newline token as a planning site was also neat.

Multilingual circuits

The paper found both language-specific features, mostly near the input and output, and more abstract language-agnostic features, mostly in the middle layers. It was interesting that Haiku showed more language-agnostic representations than smaller models, suggesting this abstraction ability may correlate with capability. The interventions swapping the operation (antonym/synonym), the operand (small/hot), or the language itself worked well, which points to modularity. For a given intervention, the required strength, such as about 4x activation for the synonym swap, was consistent across languages, which supports the idea that the features being manipulated are genuinely multilingual. It makes you wonder whether the model develops a kind of internal “interlingua” or just learns very robust cross-lingual mappings.

Addition and lookup tables

The model doesn’t perform addition with a standard algorithm; it relies on learned heuristics and “lookup table” features. For example, one feature activates when the operands end in 6 and 9 and promotes outputs ending in 5. This reminded me of memorizing multiplication tables in elementary school; it seems the LLM found a similar strategy. The operand plots visualizing feature activations were clear. The generalization was also notable: a feature for _6 + _9 -> _5 activates not just in calc: 36+59= but also in contexts like calculating citation years or filling in spreadsheet values, which shows reuse of an abstract mechanism.
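The _6 + _9 -> _5 behaviour is easy to mimic with an explicit table. The sketch below is my own toy, not the model’s actual mechanism: `ONES_DIGIT_TABLE`, `six_plus_nine_feature`, and `predicted_last_digit` are hypothetical names, and the toy feature is sensitive to operand order, which may not match the real feature.

```python
# A 10x10 "lookup table" keyed on the last digit of each operand.
ONES_DIGIT_TABLE = {(a, b): (a + b) % 10 for a in range(10) for b in range(10)}

def six_plus_nine_feature(x, y):
    """Toy feature: fires (1.0) when x ends in 6 and y ends in 9, else stays off (0.0)."""
    return 1.0 if (x % 10, y % 10) == (6, 9) else 0.0

def predicted_last_digit(x, y):
    """Look up the last digit of x + y without performing the full addition."""
    return ONES_DIGIT_TABLE[(x % 10, y % 10)]

# The same table entry serves very different-looking inputs.
fires_on_calc = six_plus_nine_feature(36, 59)
fires_on_year = six_plus_nine_feature(1996, 9)
last_digit = predicted_last_digit(36, 59)
```

The table only pins down the final digit, so a separate signal for the rough size of the sum is still needed to produce 95 rather than, say, 85.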

Entity recognition and hallucinations

The idea of a “default refusal” circuit that assumes unfamiliarity, which then gets inhibited by “known entity” features, like for “Michael Jordan”, gives a plausible mechanism for how models decide whether to answer or decline. It also explains some hallucinations: if a name like “Andrej Karpathy” is familiar enough to trigger the “known” features, the model might suppress its refusal even if it lacks the specific requested information, leading it to guess.
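The inhibition logic can be written as a tiny threshold rule. The numbers and the names `refusal_logit` and `respond` are assumptions of mine, chosen only to reproduce the three cases described above: unknown name, known name with the fact available, and familiar name without the fact.

```python
def refusal_logit(known_entity_activation, default_refusal=1.0, inhibition_weight=2.0):
    """Refusal starts on by default and is pushed down by 'known entity' evidence."""
    return default_refusal - inhibition_weight * known_entity_activation

def respond(name_familiarity, has_requested_fact):
    """Toy decision rule; familiarity and fact availability are independent inputs."""
    if refusal_logit(name_familiarity) > 0:
        return "decline"
    # Refusal is suppressed, so the model answers whether or not it knows the fact.
    return "answer" if has_requested_fact else "guess"

unknown_person = respond(name_familiarity=0.0, has_requested_fact=False)
famous_person = respond(name_familiarity=1.0, has_requested_fact=True)
familiar_name_only = respond(name_familiarity=0.8, has_requested_fact=False)
```

The third case is the hallucination path: the name is familiar enough to switch off the default refusal, but nothing checks whether the specific fact is actually available.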

Jailbreaks

This was also interesting. The model initially failed to refuse because the obfuscated input kept it from “understanding” that the request was for “BOMB” until it actually generated the word. It had to see itself write “BOMB” through one circuit before another circuit could flag it as problematic. Even then, the drive for grammatical coherence and for completing its sentence delayed the refusal. It shows how different internal processes can compete, and how surface-level constraints like grammar or instruction following can temporarily override safety mechanisms.

Summary

The “On the Biology of a Large Language Model” paper gives a set of case studies showing how circuit tracing can reveal concrete mechanisms inside LLMs. It moves interpretability away from abstract discussion and toward specific computations.

The “biology” metaphor holds up well. We’re not reverse-engineering clean, human-designed code; we’re exploring a complex system that learned its strategies organically. The process feels like neuroscience: probing and mapping to understand function. The interventions, especially feature swapping, are like stimulating or lesioning specific brain regions to see the effect. It feels like we’re poking and probing a digital mind in its early stages.

One practical implication for me concerns data curation and model improvement. If these tools help us understand how a model represents concepts or performs reasoning steps, maybe we can identify which data points led to faulty or undesirable circuits. Imagine pinpointing the data that causes a specific bias or logical error visible in the model’s internal structure. That could let engineers “massage the data” more intentionally, pruning harmful examples or adding data to reinforce useful circuits. Since machine learning depends so much on data quality, this could make data curation less of a guessing game.

The methods still have limitations: unexplained variance (the error terms in the graphs), the complexity of the raw graphs, and the risk that the replacement model is unfaithful to the original. But they let us ask more detailed questions about how models arrive at their answers, which feels like the right direction if we want more reliable and controllable AI.