Are You Smarter Than an LLM?
I recently came across a fascinating and, frankly, humbling interactive blog post titled “Are you smarter than an LLM?”. It’s a simple concept, but the execution and the implications are profound.
The post directly pits the user against a large language model in answering questions from the Massive Multitask Language Understanding (MMLU) benchmark. This experience forced me to confront some preconceived notions I had about LLM capabilities and the meaning of benchmark scores.
The MMLU Challenge
The MMLU benchmark is a well-known evaluation tool for LLMs, designed to measure a model’s ability to understand and reason across a wide range of subjects. It’s become somewhat saturated at the top, with leading models achieving very high scores. I was aware of this saturation, but I hadn’t truly internalized what it meant until I tried the questions myself.
The interactive blog post presents MMLU questions, and you answer them alongside the LLM. As you progress, you see both your answers and the LLM’s, along with the correct answers. This direct comparison is where the real learning (and humbling) begins.
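To make that side-by-side comparison concrete, here is a minimal sketch of how such a tally could be computed. This is not the blog post's actual code. The `Attempt` record, the `score` function, and the four sample answers are all hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One multiple-choice question (hypothetical record, not the post's data model)."""
    correct: str  # answer key, e.g. "A".."D"
    human: str    # the letter the human picked
    model: str    # the letter the LLM picked


def score(attempts):
    """Return human accuracy, model accuracy, and the count of
    questions the model got right while the human got them wrong."""
    if not attempts:
        raise ValueError("need at least one attempt")
    n = len(attempts)
    human_hits = sum(a.human == a.correct for a in attempts)
    model_hits = sum(a.model == a.correct for a in attempts)
    model_only = sum(
        a.model == a.correct and a.human != a.correct for a in attempts
    )
    return {
        "human_accuracy": human_hits / n,
        "model_accuracy": model_hits / n,
        "model_right_human_wrong": model_only,
    }


# Made-up answers, for illustration only.
sample = [
    Attempt(correct="B", human="B", model="B"),  # both right
    Attempt(correct="A", human="C", model="A"),  # only the model right
    Attempt(correct="D", human="D", model="D"),  # both right
    Attempt(correct="C", human="A", model="B"),  # both wrong
]

result = score(sample)
print(result)
```

The `model_right_human_wrong` count is the number that stings: it isolates exactly the questions where the model knew something I didn't.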
The MMLU questions are non-trivial. They require significant breadth and depth of knowledge. Acing these tests, as large language models often do, shouldn’t be taken for granted. I, for one, certainly took it for granted before this experience. Getting a question wrong while the model answering alongside me got it right was truly demoralizing.
Shattering Self-Esteem (in a Good Way)
The blog post’s title is provocative for a reason. In the context of the MMLU, I had to admit that, in many cases, I wasn’t smarter than the LLM. My score was lower. This isn’t a statement about overall intelligence, of course, but it is a powerful indicator of the superhuman capabilities LLMs are developing in specific domains.
I think many people, even those who follow AI advancements, still hold onto a sense of human exceptionalism. We see LLMs achieving high benchmark scores, but we might subconsciously maintain a belief that we’re “still special” in some way. This interactive test directly challenges that notion. It forces you to confront the reality that, in certain areas, LLMs are already surpassing human performance.
While MMLU is now a saturated benchmark, it’s important to remember that there are other, far more challenging evaluations out there. For example, the FrontierMath benchmark features problems so difficult that they can require several hours or even days of work by expert mathematicians. Even Terence Tao participated in its creation. These are exceptionally hard questions, and I am highly skeptical I could answer even one correctly.
Yet even on these incredibly demanding benchmarks, LLMs are starting to achieve non-zero scores, albeit low ones. And I see no reason why they won’t continue to improve.
Where Do We Go From Here?
This experience has made me reconsider the trajectory of LLM development. If current models like GPT-4.5 (or whatever the current leading model is) can outperform the average person (and even me!) on a challenging benchmark like MMLU, what will the landscape look like in five years?
The pace of progress is astonishing, and this interactive test provides a tangible glimpse into that future. It’s a future where LLMs will likely possess knowledge and reasoning abilities that surpass those of many humans in specific, measurable ways. It’s a future we need to understand and prepare for, and in some respects it is already here.