I had a peculiar experience today that I wanted to share. I was at a cafe with my girlfriend, planning our vacation. Nearby sat a woman, maybe in her mid-40s, wearing a face mask. She was constantly murmuring to herself – not loudly, I couldn’t make out the words, but it was non-stop. She looked really agitated: getting up, sitting back down in a slightly different position, pacing out of the room and then returning to the same seat. This went on and on. As we wrapped up our trip planning, I looked up, and she was just gone. That quick.

Our vacation plans came together well, but the image of that woman lingered in my head. Watching her had reminded me, of all things, of large language models. I have to admit my brain is stuffed with AI these days – for better or worse.

Hear me out.

From Self-Talk to Chain-of-Thought

The woman’s constant self-talk, that murmuring, felt exactly like what chain-of-thought reasoning models are doing right now. It’s a simple, almost too-easy analogy, but if you start to put weight on it, it feels surprisingly profound. I don’t know why this works, but I find that anthropomorphizing large language models sometimes helps me see what capabilities they might need, or what data we should give them to make them more capable. Analogies like this make those things easier to see.

There’s a sort of stack here, a progression:

  1. Traditional LLMs: These models don’t really “think” in a step-by-step way. They generate an answer straight away, sometimes close to verbatim recall, without much pause – a kind of knee-jerk reaction. This is System 1 thinking.
  2. Reasoning Models (Chain-of-Thought): When these came along, they blew the older models out of the water and introduced a new scaling paradigm: test-time compute. Spending tokens on introspection, thinking step by step before answering, beats a knee-jerk response on many tasks (a minimal sketch of the difference follows this list). This is System 2 thinking, and it’s remarkably effective at improving capabilities. Noam Brown’s work pioneered this area.
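
To make the contrast concrete, here’s a minimal sketch in Python. The `generate` function is a hypothetical stand-in for whatever completion API you use, and the prompts are only illustrative; the point is that the second call spends extra tokens at test time on intermediate reasoning before answering.

```python
# Minimal sketch of System 1 vs. System 2 prompting.
# `generate` is a hypothetical stand-in for a real LLM client call.

def generate(prompt: str) -> str:
    # Replace this with your provider's completion call; the canned
    # return value just keeps the sketch runnable as-is.
    return "(model output would appear here)"

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

# System 1 style: a knee-jerk, one-shot answer.
knee_jerk = generate(f"{question}\nAnswer with just a number.")

# System 2 style: spend extra tokens at test time reasoning step by step.
deliberate = generate(
    f"{question}\n"
    "Think step by step and show your reasoning, "
    "then give the final answer on its own line."
)

print(knee_jerk)
print(deliberate)
```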

The Limits of Introspection and the Need for Tools

Current models are now moving toward becoming agents. And here’s where the analogy with the woman (to be clear, I don’t know her or what she was going through – she just looked like she was having a tough time, and this is only an observation for the sake of analogy) becomes even more interesting.

Constant introspection, just talking to oneself, only gets you so far. That’s exactly the limit I see with first-generation reasoning models, like the early DeepSeek reasoning models or OpenAI’s o1. They can think, they can “talk to themselves” on and on, but they can’t reliably verify their own thoughts.

Compare this to how people generally operate. When “normal” people think, they can verify their thinking against the outside world. They might talk something through with someone else, or rely on external aids: their iPhone, a book, a quick search. It’s such a simple observation, right? And that’s exactly what models like Anthropic’s Claude 3.7 Sonnet and OpenAI’s o3 are doing now. They excel at interacting with the real world via an external pipeline, a bridge we call “tools.”
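
Here’s a rough sketch of what that bridge looks like. Everything in it is hypothetical: the `model_step` function, the message format, and the lone `calculator` tool stand in for whatever the real APIs behind models like Claude 3.7 Sonnet or o3 actually do. The shape of the loop is the point: the model stops talking to itself, asks the outside world a question, and folds the answer back into its thinking.

```python
# Minimal sketch of an agent loop: introspection plus external tools.
# `model_step`, the message format, and the tool set are hypothetical;
# real agent frameworks and provider APIs differ in the details.

def calculator(expression: str) -> str:
    # An external, verifiable aid: arithmetic gets checked by actually
    # computing it instead of trusting the model's self-talk.
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

def model_step(transcript: list[dict]) -> dict:
    # Hypothetical LLM call. A real model would return either a tool
    # request, e.g. {"type": "tool_call", "name": "calculator",
    # "arguments": {"expression": "17 * 243"}}, or a final answer.
    return {"type": "final", "text": "(model output would appear here)"}

def run_agent(user_message: str, max_steps: int = 5) -> str:
    transcript = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        step = model_step(transcript)
        if step["type"] == "final":
            return step["text"]
        # The model asked to check something against the outside world;
        # run the tool and fold the result back into the conversation.
        result = TOOLS[step["name"]](**step["arguments"])
        transcript.append({"role": "tool", "name": step["name"], "content": result})
    return "(stopped after max_steps without a final answer)"

print(run_agent("What is 17 * 243, and is it more than 4000?"))
```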

The Fine Line of Anthropomorphism

When you anthropomorphize a large language model this way, the need for tools and external interaction becomes obvious. But there’s a key caveat: the internal workings of an LLM are very different from human cognition. I’m anthropomorphizing only to see what LLMs might benefit from, what might make them more capable. It’s a fine line to walk.

Aren’t We All Just Next-Word Predictors?

As this kind of anthropomorphization continues – and it’s very easy to do, because language models are so persuasive and lifelike – it reminds me of something Scott Aaronson said a year ago. When LLMs first emerged and naysayers argued “it’s just next-word prediction, just statistical modeling,” he’d retort (and I’m paraphrasing), “But what about you? Aren’t you just a next-word predictor? What about your mom?”

It cracked me up at the time. If I’d said that to some of my close friends who looked down on LLMs, they would have fumed; they’d have been outraged! But when ChatGPT came out, I intuitively and wholeheartedly agreed with Aaronson’s point. My mind hasn’t changed on that.

I think as models get more capable, Aaronson’s quip – “aren’t we all just next-word predictors?” – will basically become true in a functional sense. LLMs recently passed versions of the Turing test, and society moved on like nothing ever happened. Sooner or later, for every verifiable task, model capabilities will likely exceed human capabilities. And when that happens, they will still be, at their core, next-word predictors. Superhuman next-word predictors, better than us at any given task.

Then what would we become?