I recently came across a humbling interactive blog post titled “Are you smarter than an LLM?” The concept is simple, but it hit harder than I expected.

The post directly pits the user against a large language model on questions from the Massive Multitask Language Understanding (MMLU) benchmark. It forced me to confront some assumptions I had about LLM capabilities and what benchmark scores actually mean.


The MMLU challenge

MMLU is a well-known LLM benchmark: multiple-choice questions spanning 57 subjects, from elementary mathematics to law and medicine, designed to measure how well a model understands and reasons across domains. It has become saturated at the top, with leading models achieving very high scores. I was aware of this saturation, but I hadn’t truly internalized what it meant until I tried the questions myself.

The interactive blog post presents MMLU questions, and you answer them alongside the LLM. As you progress, you see both your answers and the LLM’s, along with the correct answers. The direct comparison is the painful part.

The MMLU questions are non-trivial. They require knowledge that is both broad and deep. The fact that large language models often ace them should not be taken for granted, and before this experience I definitely did. Getting a question wrong while the language model got it right was demoralizing.

A humbling check

The blog post’s title is provocative for a reason. In the context of MMLU, I had to admit that, in many cases, I wasn’t smarter than the LLM. My score was lower. This is not a statement about overall intelligence, of course, but it is a sign of how strong LLMs are becoming in specific domains.

I think many people, even those who follow AI closely, still hold on to a sense of human exceptionalism. We see LLMs achieving high benchmark scores, yet we subconsciously keep believing that we’re “still special” in some way. This interactive test challenges that belief directly. In certain areas, LLMs are already better than many humans.

MMLU may be saturated, but there are much harder evaluations. For example, the FrontierMath benchmark has problems so difficult that they require several hours or days of work by expert mathematicians. Even Terence Tao participated in its creation. These are exceptionally hard questions, and I doubt I could answer even one correctly.

Yet, even on these very demanding benchmarks, LLMs are starting to achieve non-zero scores, though at low percentages. And I see no reason why they won’t continue to improve.


Where do we go from here?

This experience made me reconsider the trajectory of LLM development. If models like GPT-4.5, or whatever the leading model is by the time you read this, can outperform the average person, and even me, on a benchmark like MMLU, what will this look like in five years?

The pace of progress is astonishing, and this test made it feel less abstract. LLMs are already surpassing many humans in specific, measurable ways, and that is something we need to take seriously rather than hand-wave away.