The Murmuring Woman | Hun Tae Kim

I had a strange experience today that I wanted to write down. I was at a cafe with my girlfriend, planning our vacation. Nearby, there was a woman, maybe in her mid-40s, wearing a face mask. She was constantly murmuring to herself. Not loudly, and I couldn’t make out the words, but it was non-stop. She looked agitated: getting up, sitting back down in a slightly different position, pacing out of the room, then returning to the same seat. This went on and on. As we wrapped up our trip planning, I looked up, and she was just gone. That quick.

Our vacation plans came together well, but the image of that woman lingered in my head. When I saw her, she reminded me of LLMs. I have to admit that my brain is stuffed with AI these days, for better or worse.

Hear me out.

From self-talk to chain-of-thought

The woman’s constant self-talk, that murmuring, felt like what chain-of-thought reasoning models are currently doing. It’s a simple analogy, maybe too easy, but it stuck with me. I don’t know why, but anthropomorphizing LLMs sometimes helps me see what capabilities they might need, or what data we should give them to make them more capable.

There’s a kind of progression here:

Traditional LLMs: These models don’t really “think” in a step-by-step way. They just generate, often verbatim, without much pause – a kind of knee-jerk reaction. This is like System 1 thinking.
Reasoning Models (Chain-of-Thought): When these came along, they blew the older models out of the water. They introduced a new scaling paradigm: test-time compute, where the model spends more computation while answering rather than only during training. Introspection, or thinking step-by-step, is much better than a knee-jerk response for many tasks. This is System 2, and it’s really good for improving capabilities. Noam Brown’s work helped pioneer this area.

The limits of introspection

Current models are moving toward becoming agents. And here’s where the analogy with the woman becomes more interesting. To be clear, I don’t know her or what she was going through. She looked like she was having a tough time, and this is only an observation for the sake of analogy.

Constant introspection, just talking to oneself, only gets you so far. And that’s exactly the limit I see with first-generation reasoning models, like some of the DeepSeek models or OpenAI’s o1. They can think, they can “talk to themselves” on and on, but they can’t reliably verify their own thoughts.

Compare this to how people generally operate. When people think, they can self-verify using external tools or interactions. They might talk something through with someone else, or rely on external aids like their iPhone, a book, or a quick search. That’s what models like Anthropic’s Claude 3.7 Sonnet and OpenAI’s o3 are doing now. They interact with the real world through an external pipeline, a bridge we call “tools.”
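
To make that bridge a little more concrete, here is a minimal Python sketch of the idea. Everything in it is an assumption for illustration: toy_model_claim is a hypothetical stand-in for a reasoning model (it returns a hard-coded, deliberately wrong claim), and calculator_tool is a tiny arithmetic evaluator playing the role of the external tool. It is not how any real model or vendor API works, just the shape of "think, then check against something outside your own head."

```python
import ast
import operator

# Hypothetical stand-in for a reasoning model: it "murmurs" an arithmetic
# claim to itself. The claim is hard-coded and wrong on purpose (17 * 24 is 408).
def toy_model_claim(question: str) -> dict:
    return {"expression": "17 * 24", "claimed": 418}

# Arithmetic operators the toy tool is willing to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def calculator_tool(expression: str):
    """The external 'tool': evaluate a small arithmetic expression safely."""
    def evaluate(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)

def answer_with_verification(question: str) -> dict:
    """Take the model's claim, then check it against the tool instead of trusting it."""
    claim = toy_model_claim(question)
    checked = calculator_tool(claim["expression"])
    return {"answer": checked, "model_was_right": checked == claim["claimed"]}

if __name__ == "__main__":
    print(answer_with_verification("What is 17 times 24?"))
```

In this toy setup the self-talk alone would have confidently returned the wrong number; the tool call is what catches it. A real agent would of course decide for itself when to call a tool and what to ask it.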

The fine line of anthropomorphism

When you anthropomorphize an LLM this way, the need for tools and external interaction becomes obvious. But there’s a caveat: the internal modeling of an LLM is very different from human cognition. I’m anthropomorphizing only to see what LLMs might benefit from, what might make them more capable. It’s a fine line to walk.

Aren’t we all just next-word predictors?

This kind of anthropomorphizing is easy to do, because language models can seem persuasive and lifelike. It reminds me of something Scott Aaronson said a year ago. When LLMs first emerged and people argued “it’s just next-word prediction, just statistical modeling,” he’d retort, paraphrasing: “But what about you? Aren’t you just a next-word predictor? What about your mom?”

It cracked me up at the time. If I’d said that to some of my close friends who looked down on LLMs, they would have fumed. They’d be outraged! But when ChatGPT came out, I intuitively agreed with Aaronson’s point. My mind hasn’t changed on that.

I think as models get more capable, Aaronson’s quip, “aren’t we all just next-word predictors?”, will become true in a functional sense. Recently, LLMs were reported to have passed a Turing test, but society moved on like nothing happened. Sooner or later, for every verifiable task, model capabilities will likely exceed human capabilities. And still, when that happens, they will be, at their core, next-word predictors. Superhuman next-word predictors, better than us at any given task.

Then what would we become?