Thoughts on GPT-4o Image Generation

Today, OpenAI revealed GPT-4o’s image generation capabilities. This was previewed in the initial 4o announcement about a year ago, but the actual results still surprised me. Playing with it made the idea of direct image generation feel much more concrete.

The multimodal approach

The concept isn’t new. When OpenAI introduced GPT-4o, they described it as modeling input as text, pixels, and sounds, combining all modalities with one big autoregressive transformer. The output would likewise be text, images, and audio.

They presented it as straightforward, though there are obviously complicated architectural decisions behind the scenes. The important part is the result: the model is much better at image generation than what came before.

In previous systems, images were usually generated through a two-step process. The language model would write a prompt and feed that into a separate image model. GPT-4o removes that handoff by directly generating images.

Because the language model handles images directly, editing feels more coherent. We can feed in an image, ask for changes, and the model does not need to translate everything into a long text prompt for a separate diffusion model. That handoff used to be a bottleneck, because anything the prompt failed to capture was lost. With 4o, the model can modify the image more directly, and the consistency between edits is much better.
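To make the handoff problem concrete, here is a deliberately tiny Python sketch. It says nothing about how GPT-4o or any real image model works internally. Everything in it is a hypothetical stand-in: the "image" is a grid of characters, describe() plays the role of a language model writing a prompt, and render() plays the role of a separate image model that only ever sees that prompt.

```python
from collections import Counter

# Toy "image": rows of single-character pixels. Purely illustrative.
Image = list[str]


def describe(image: Image) -> str:
    """Hypothetical stand-in for a language model writing a text prompt.

    The prompt is lossy on purpose: it records how many pixels of each
    kind exist, but not where they are.
    """
    counts = Counter("".join(image))
    return ",".join(f"{ch}:{n}" for ch, n in sorted(counts.items()))


def render(prompt: str, width: int) -> Image:
    """Hypothetical stand-in for a separate image model.

    It sees only the prompt, so it has to invent a layout.
    """
    pairs = (part.split(":") for part in prompt.split(","))
    pixels = "".join(ch * int(n) for ch, n in pairs)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def two_step_edit(image: Image, old: str, new: str) -> Image:
    """Edit by rewriting the prompt, then regenerating from scratch."""
    prompt = describe(image).replace(old, new)
    return render(prompt, width=len(image[0]))


def direct_edit(image: Image, old: str, new: str) -> Image:
    """Edit the pixels in place; everything else stays where it was."""
    return [row.replace(old, new) for row in image]


image = ["....", ".##.", ".##.", "...."]
print(direct_edit(image, "#", "@"))    # the square stays in the middle
print(two_step_edit(image, "#", "@"))  # right pixel counts, layout lost
```

Both paths end up with the same number of recolored pixels, but only the direct edit keeps the square where it was. That is the kind of consistency loss the handoff tends to cause, shown here in exaggerated form.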

The text inside images is the part that really stood out. Previous diffusion models struggled with clean, readable text; it was often distorted, nonsensical, or limited to a few words. GPT-4o can generate clean text, and lots of it. The text is consistently readable and contextually appropriate.

Spark of Software 2.0

My most visceral moment came from an example where they showed a cat image being iteratively transformed into a game interface. Over several rounds of edits, the picture became a game screen with a clean UI, the on-screen text was accurate, and everything stayed consistent from one iteration to the next.

That’s when it hit me: if the model can generate UI and text this accurately, maybe future computer interfaces could be AI-generated in real time with the user’s context available. Large language models could generate the computer interface frame by frame based on input and feedback.
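As a rough sketch of what "frame by frame based on input and feedback" could mean as a control loop, here is a minimal Python example. The names stub_model and run_interface are made up for illustration, and the stub returns a string where a real system would presumably return pixels. The only point is the shape of the loop: each frame is generated from the new input plus everything that came before.

```python
from typing import Callable

# A frame model takes (history of earlier events, current event) and
# returns the next frame. This signature is an assumption for the sketch.
FrameModel = Callable[[list[str], str], str]


def stub_model(history: list[str], event: str) -> str:
    """Hypothetical placeholder for a generative model producing a frame."""
    return (
        f"frame {len(history) + 1}: handling '{event}' "
        f"with {len(history)} earlier event(s) in context"
    )


def run_interface(model: FrameModel, events: list[str]) -> list[str]:
    """Generate one frame per user event, feeding the history back in."""
    history: list[str] = []
    frames: list[str] = []
    for event in events:
        frames.append(model(history, event))
        history.append(event)  # the context grows as the user interacts
    return frames


for frame in run_interface(stub_model, ["open editor", "type hello", "save"]):
    print(frame)
```

Swapping stub_model for an actual image-generating model is where all the difficulty lives. The loop itself is trivial, which is part of why the idea feels so close.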

Imagine a user interface that changes based on what the user needs. Current operating systems like macOS, Linux, and Windows are mostly rule-based with fixed definitions. But what if a large language model generated a new UI that helped the user get things done? It could be adaptive and different for every user, changing style and functionality based on preferences or context.

What would a word processor look like in that interface? Currently, an OS is built on a lower-level kernel, with a graphics stack of renderers and shaders on top that produces the pixels. But this would be an end-to-end network, a large language model OS directly generating pixels from our input.

The concept of world simulators isn’t new. NVIDIA is already using its Isaac Sim platform to generate synthetic data for training robot models, Google DeepMind has developed Genie 2 for interactive 3D environments, and Microsoft Research recently unveiled Muse, a generative model for gameplay ideation. But what if instead of simulating physical worlds, we used these capabilities to simulate an operating system? That’s where the idea clicked for me.

One reason this feels plausible is context. Your previous conversations, actions, and preferences already inform how language models respond. Current operating systems gather tons of contextual information about how we use them, but that information is not used very well. What if an OS could learn from your previous interactions, inputs, preferences, and habits while you used it? The computer could become personalized less through explicit settings and more through the system understanding you over time.

This is basically the Software 2.0 and LLM OS ideas that Andrej Karpathy has described. I was aware of these ideas before, but I hadn’t felt that they could become feasible so soon. With enough effort, this kind of OS might already be technically possible with large-scale servers and APIs, though it would be wildly impractical. Still, I think some version of it is inevitable.

Of course, it wouldn’t be a fully end-to-end neural network. There would be some rule-based systems guiding the LLMs. But my point still stands: we might be seeing the first glimpse of a new OS approach.

Karpathy mainly discussed Software 2.0 replacing rule-based software stacks, which is already happening. But I think Software 2.0 may eventually reach the OS stack itself.