Thoughts on GPT-4o Image Generation

Today, OpenAI revealed GPT-4o’s image generation capabilities. While this feature was previewed in the initial 4o announcement about a year ago, the actual results are still surprising. Playing with it triggered some realizations about the future implications of direct image generation that I want to share here.

The Multimodal Approach

The concept isn’t entirely new. When OpenAI introduced GPT-4o, they described it as modeling input as text, pixels, and sounds - combining all modalities with one big autoregressive transformer. The output would likewise be text, images, and audio.

They presented it as straightforward, though obviously there are complicated architectural decisions behind the scenes. But the impact is clear: the model is much more capable at image generation.

In previous systems (diffusion models or other chatbot interfaces), images were generated through a two-step process. The language model would generate a sophisticated prompt and feed that into a separate image model. GPT-4o eliminates this bottleneck by directly generating images.

Since the large language model directly generates images, it’s much more intelligent in a way. We can also feed in images, and the results speak for themselves. The images show much more consistency. When previous chatbots wanted to modify an image, they had to create a detailed description of that image and feed it into another diffusion model. This was a bottlenecked process, but with 4o, it can handle images directly, modifying them with surprising consistency.

What’s particularly impressive is how refined and enhanced the model’s text generation capabilities are within images. Previous diffusion models really struggled with generating clean, readable text in images - it was often distorted, nonsensical, or limited to just a few words. But GPT-4o can generate very clean text, and lots of it. The text is consistently readable and contextually appropriate.

Spark of Software 2.0

My most visceral moment came from an example where they showed a cat image being iteratively transformed into a game interface. Through multiple iterations, they turned it into an image of a game with a UI, and all the text was remarkably accurate. The model produced an image with a very nice interface, and everything was consistent.

That’s when it hit me: if the model can generate UI and text so accurately, in the future our computer interfaces could be entirely AI-generated in real-time with all the context available from the user. Large language models could generate your computer interface frame by frame based on your input and feedback.

Imagine a user interface that actually warps and changes based on user needs. Our current operating systems (macOS, Linux, Windows) are all rule-based with fixed definitions. But what if a large language model generated a new UI that helped the user get things done? It would be totally adaptive and different for every user - changing styles and functionality based on preferences or context.

What would a word processor look like in that interface? The possibilities seem endless. Currently, an OS has the lower-level kernel with renderers and shaders that produce pixels. But this would be an end-to-end network - a large language model OS directly generating pixels from our input.

The concept of world simulators isn’t entirely new - NVIDIA is already using their Isaac Sim platform to generate synthetic data for training robot models, Google DeepMind has developed Genie 2 that can generate interactive 3D environments, and Microsoft Research recently unveiled Muse, their first generative AI model designed for gameplay ideation. But what if instead of simulating physical worlds, we used these capabilities to simulate an operating system? That’s where my realization really hit me.

One of the most compelling aspects of this approach would be how it leverages context. Context is incredibly important for large language models - your previous conversations, actions, and preferences inform their responses. Current operating systems gather tons of contextual information about how we use them, but this data isn’t being utilized well. What if an OS could learn and adapt from your previous interactions, inputs, preferences, and habits while you used it? Every aspect of your computing experience could be customized not by explicit settings, but by the system understanding you over time.

This is basically Software 2.0 and LLM OS that Andrej Karpathy described. I was aware of this idea before, but never I have felt that it would be feasible in such a short time. I think with some effort, this kind of OS might be technically possible right now with large-scale servers and APIs, but it would be largely impractical. Yet the implications are huge, and I believe this kind of OS is inevitable.

Of course, it wouldn’t be a fully end-to-end neural network. There would certainly be some rule-based systems guiding the LLMs. But my point still stands - we might be seeing the first glimpse of a new OS approach.

Karpathy mainly discussed Software 2.0 replacing rule-based software stacks, which is already happening. But I think Software 2.0 will also eventually replace the OS stack itself.