I’ve been thinking more about OpenAI’s o3 and o4-mini models since my last post. I looked into the system prompt for o3/o4-mini and noticed something interesting: it strongly encourages the model to search the web whenever a query is even slightly vague or uncertain. Web search is almost the default behavior.

Initially, I was puzzled by this choice. Why prompt a reasoning and agentic model to search the web so readily? These aren’t primarily web search models like Perplexity AI. Why not rely more on their internal knowledge and reasoning first?

But as I thought more about it, the choice started to make sense.

Beyond hallucination mitigation

Of course, one part of the answer is hallucination. Getting LLMs to be consistently reliable and avoid making things up is still a hard problem. Improvements in data and algorithms have reduced blatant hallucination, but high reliability requires grounding. Accessing relevant, up-to-date information helps with that.

However, I don’t think that’s the full story. I think OpenAI is also leaning into the agentic capabilities of these models.

Letting the model choose its context

Think about traditional approaches like Retrieval-Augmented Generation (RAG). In those systems, a separate retrieval mechanism analyzes the user’s query, finds potentially relevant source documents, and then stuffs that information into the language model’s context window. The retrieval system decides what the main LLM sees.
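
To make the contrast concrete, here is a minimal sketch of that push-style flow. The embed, vector_store, and llm_generate names are hypothetical stand-ins, not any particular library’s API:

```python
# Push-style RAG: an external retriever chooses the context.
# `embed`, `vector_store`, and `llm_generate` are hypothetical stand-ins.

def rag_answer(query: str, embed, vector_store, llm_generate, k: int = 5) -> str:
    # 1. A separate system, not the LLM, decides what is relevant.
    docs = vector_store.search(embed(query), top_k=k)

    # 2. That material is pushed into the prompt, whether or not the
    #    model would have chosen these sources itself.
    context = "\n\n".join(doc.text for doc in docs)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    # 3. The model answers from whatever it was handed.
    return llm_generate(prompt)
```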

What OpenAI seems to be doing with o3/o4-mini is different. They are offloading the task of finding relevant information to the main model itself. Instead of an external system pushing context, the agentic model is encouraged to pull the context it decides it needs by searching the web.
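
Here is the pull-style loop as a sketch, again with hypothetical llm_step and web_search stand-ins; the point is who decides when to retrieve, not the exact interface:

```python
# Pull-style retrieval: the model itself decides whether and what to search.
# `llm_step` and `web_search` are hypothetical stand-ins.

def agentic_answer(query: str, llm_step, web_search, max_steps: int = 5) -> str:
    gathered = []  # evidence the model has chosen to pull in so far
    for _ in range(max_steps):
        action = llm_step(query, gathered)       # the model picks its next move
        if action["type"] == "answer":
            return action["text"]                # it decided it knows enough
        gathered += web_search(action["query"])  # it searches, iteratively
    # Out of budget: answer with whatever has been gathered.
    return llm_step(query, gathered, must_answer=True)["text"]
```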

Let me break it down with an analogy. Imagine you need to solve a complex problem.

  • Option A (RAG-like): Someone else (a separate system) looks at your problem, finds some books or articles they think are relevant, and hands them to you. You then try to solve the problem using only those materials.
  • Option B (o3/o4-mini-like): You look at the problem, and you decide to proactively search your bookshelf, scour the internet, gather information, and based on what you find, iteratively search for more information until you feel you have what you need.

Option B gives you autonomy. You actively choose what information you consume. That feels like a better setup for complex or nuanced problems.

The key difference is agency. The RAG system (Option A) isn’t necessarily as smart or capable as the main LLM it’s feeding context to. Why let a potentially less sophisticated system pre-filter the information? Why not let the more capable main model decide what information is most relevant or needed for its own reasoning process?

This principle of giving the model agency to select its own context seems broader than text retrieval via web search. It applies across modalities. Look at OpenAI’s recent post on “Thinking with Images”. They show how o3/o4-mini can use tools to manipulate images during its chain of thought. If text in an image is upside down or hard to read, the model can zoom in or rotate that part of the image. If an image is complex, it can focus on the relevant section. This is basically visual information retrieval, driven by the model’s ability to choose which visual information to inspect.
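
As a concrete illustration of that visual case, here is a sketch of the kind of operations such a model might invoke mid-reasoning. The crop/rotate/resize calls are Pillow’s real API; the tool functions wrapping them, and the example file name, are made up for illustration:

```python
from PIL import Image  # Pillow

def zoom(img: Image.Image, box: tuple, factor: int = 2) -> Image.Image:
    """Crop to a region of interest and enlarge it for closer inspection."""
    region = img.crop(box)  # box = (left, upper, right, lower)
    w, h = region.size
    return region.resize((w * factor, h * factor))

def rotate(img: Image.Image, degrees: float) -> Image.Image:
    """Rotate the image, e.g. to fix upside-down text before reading it."""
    return img.rotate(degrees, expand=True)

# Say the model notices flipped text in the top-left corner of a scan
# ("scan.png" is a placeholder file name):
page = Image.open("scan.png")
readable = rotate(zoom(page, (0, 0, 400, 200)), 180)
```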

Traditional RAG can sometimes feel like it dumps context into the window regardless of whether the model finds it useful or sufficient. Giving the model agency to search, whether across the web for text or within an image’s pixels, means it can gather what it needs, when it needs it. The model decides how to inform itself.

Consolidation and the Bitter Lesson again?

This feels like another step in the consolidation trend we’ve seen in AI. Previously, NLP was fragmented into many specific tasks, such as sentiment analysis and named entity recognition. With the rise of powerful transformers, many of these specialized tasks were absorbed into large general models.

Now, agency might enable further consolidation. Tasks like information retrieval (textual or visual) and hallucination mitigation, previously handled by separate scaffolding or techniques like RAG, might increasingly move into the model’s agentic reasoning loop. As models become more general and capable agents, they can take on more of these subtasks themselves.

In a way, it feels like the Bitter Lesson playing out again. Instead of relying heavily on human-designed scaffolding and fixed retrieval strategies, maybe it is better to give the scaled-up model tools like web search or image manipulation and let it decide how to gather and use information. Don’t constrain the model too much with rigid external structures; let its capabilities grow.

It’s a simple shift: prompt the model to search when unsure, or let it manipulate input images. But the underlying principle matters. The model is being given more control over its own information needs.