It’s getting harder to measure just how good AI is getting


Toward the end of 2024, I offered a take on all the debate about whether AI’s “scaling laws” were hitting an actual technical wall. I argued that the question matters less than many people think: there are already AI systems powerful enough to profoundly change our world, and the next few years will be defined by advances in AI whether or not the scaling laws hold.

It’s always a risky business to make predictions about AI, because you can be proven wrong very quickly. It’s embarrassing enough as a writer when your predictions for the coming year don’t come true. When your predictions for the next week are proven false? That’s pretty bad.

But less than a week after I wrote that article, OpenAI’s end-of-year series of releases included its latest large language model (LLM), o3. o3 doesn’t exactly debunk claims that the scaling laws that once defined AI progress won’t work as well going forward, but it definitively puts the lie to the claim that AI progress is hitting a wall.

o3 is really, really impressive. In fact, to appreciate just how impressive it is, we’re going to have to dig a little into the science of how we measure AI systems.

Standardized testing for robots

If you want to compare two language models, you want to measure the performance of each on a set of problems they have never encountered before. This is harder than it sounds: since these models are fed enormous amounts of text as part of their training, they have often already seen most existing tests.
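To make that concrete, here’s a minimal sketch of what benchmark scoring looks like in practice. Everything in it is illustrative: `ask_model` is a hypothetical stand-in for whatever API a given lab exposes, and the real difficulty isn’t the loop itself but guaranteeing the questions were truly held out of training.

```python
# Minimal sketch of benchmark evaluation (illustrative only).
# `ask_model` is a hypothetical stand-in for a real model API call.

def ask_model(model_name: str, question: str) -> str:
    """Placeholder: send `question` to the model and return its answer."""
    raise NotImplementedError("Wire this up to a real model API.")

def evaluate(model_name: str, benchmark: list[dict]) -> float:
    """Score a model on a list of {'question': ..., 'answer': ...} items.

    The hard part isn't this loop -- it's guaranteeing the model never
    saw these questions during training ("contamination").
    """
    correct = 0
    for item in benchmark:
        prediction = ask_model(model_name, item["question"])
        if prediction.strip() == item["answer"].strip():
            correct += 1
    return correct / len(benchmark)
```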

So machine learning researchers build benchmarks: tests for AI systems that let us compare them directly to one another and to human performance across a range of tasks: math, programming, reading and interpreting texts, and so on. For a while, we tested AIs on the USA Mathematical Olympiad qualifier, a math championship, as well as on physics, biology, and chemistry problems.

The problem is that AIs have improved so quickly that they keep rendering benchmarks worthless. Once AIs perform well enough on a benchmark, we say the benchmark is “saturated,” meaning it no longer usefully distinguishes AI abilities, because they all get near-perfect scores.
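To picture what saturation means, here’s a toy sketch with invented numbers: once every frontier model clears the ceiling, a point or two of difference is likely noise rather than a real capability gap, and the benchmark stops telling models apart.

```python
# Toy illustration of benchmark saturation (all scores are invented).
scores = {"model_a": 0.97, "model_b": 0.98, "model_c": 0.99}

SATURATION_THRESHOLD = 0.95  # arbitrary cutoff for "near-perfect"

def is_saturated(scores: dict[str, float], threshold: float) -> bool:
    """A benchmark stops discriminating once every model clears the ceiling."""
    return all(s >= threshold for s in scores.values())

if is_saturated(scores, SATURATION_THRESHOLD):
    # A 1-point gap up here is likely noise, not a real capability difference.
    print("Benchmark saturated: scores no longer distinguish these models.")
```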

2024 was the year when AI capability benchmarks became as saturated as the Pacific Ocean. We used to test AIs against a physics, biology, and chemistry benchmark called GPQA, which was so difficult that even PhD students in the corresponding fields usually scored less than 70 percent. But AIs now outperform humans with relevant PhDs, so it’s no longer a good way to measure further progress.

On the Math Olympiad qualifier, too, the models now perform among the best humans. A benchmark called MMLU was meant to measure language comprehension with questions across many different domains. The best models have saturated that one, too. A benchmark called ARC-AGI was supposed to be really, really difficult and to measure general humanlike intelligence – but o3 (once tuned for the task) achieves a stunning 88 percent on it.

We can always create more benchmarks. (We do – ARC-AGI-2 will be announced soon, and is supposed to be much harder.) But at the rate AI is advancing, each new benchmark lasts only a few years at best. And perhaps most importantly for those of us who are not machine learning researchers, benchmarks increasingly have to measure AI performance on tasks that humans couldn’t accomplish themselves in order to describe what the systems are and are not capable of.

Yes, AIs still make dumb and annoying mistakes. But if you haven’t been paying attention for six months, or if you’ve mostly just played with the free versions of language models available online, which are well behind the frontier, then you are overestimating how many dumb mistakes they make and underestimating their ability to accomplish difficult, intellectually demanding tasks.

This week in Time, Garrison Lovely argued that AI progress hasn’t so much “hit a wall” as become invisible, mostly improving by leaps and bounds in ways that people don’t pay attention to. (I’ve never tried to get an AI to solve elite programming, biology, math, or physics problems, and I wouldn’t be able to tell if its answers were right anyway.)

Anyone can tell the difference between a 5-year-old learning arithmetic and a high schooler learning calculus, so progress between those points looks tangible. Most of us can’t really tell the difference between a freshman math major and the world’s greatest mathematicians, so AI progress between those points hasn’t seemed like much.

But this progress is actually very important. AI will truly change our world by automating a huge amount of intellectual work that was once done by humans, and three things will determine its ability to do so.

One is getting cheaper. o3 gets amazing results, but it can cost over $1,000 for the model to think through a difficult question and come up with an answer. However, the end-of-year release of a model from the Chinese company DeepSeek indicated that it might be possible to get high-quality performance very cheaply.

The second is improving the way we interact with it. Everyone I talk to about AI products is convinced there are tons of innovations to be had in how we interact with AIs, how they check their work, and how we set which AI to use for which task. You could imagine a system where, normally, a mid-tier chatbot does the work but can internally call in a more expensive model when your question requires it. This is product work as opposed to purely technical work, and it’s what I warned in December would transform our world even if all progress in AI stopped.
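That routing idea is simple enough to sketch. Below is a minimal, hypothetical version in Python: a cheap model answers by default, and the system escalates to a pricier one when the question looks hard. The model names, the difficulty heuristic, and `call_model` are all assumptions for illustration, not any lab’s actual product.

```python
# Hypothetical model-routing sketch: a cheap model answers by default,
# and the system escalates to an expensive one for hard questions.
# The model names, heuristic, and call_model are stand-ins, not a real API.

CHEAP_MODEL = "mid-tier-chatbot"       # fast and inexpensive
EXPENSIVE_MODEL = "frontier-reasoner"  # slow, costly, more capable

def call_model(model: str, question: str) -> str:
    """Placeholder for a real model API call; returns a canned answer."""
    return f"[{model}] answer to: {question!r}"

def looks_hard(question: str) -> bool:
    """Crude difficulty heuristic -- a real router would be much smarter."""
    hard_markers = ("prove", "derive", "debug", "step by step")
    return len(question) > 500 or any(m in question.lower() for m in hard_markers)

def answer(question: str) -> str:
    """Route each question to the cheapest model likely to handle it."""
    model = EXPENSIVE_MODEL if looks_hard(question) else CHEAP_MODEL
    return call_model(model, question)

print(answer("What's the capital of France?"))                 # stays on the cheap model
print(answer("Prove that there are infinitely many primes."))  # escalated
```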

And the third is that AI systems are getting smarter – and for all the talk of hitting walls, it looks like they’ll keep doing so. Newer systems are better at reasoning, better at problem-solving, and generally closer to being experts in a wide range of fields. To some extent, we don’t even know how smart they are, because we’re still scrambling to figure out how to measure it once we can no longer really measure them against human expertise.

I think these are the three defining forces of the next few years – that’s how important AI is. Like it or not (and I don’t really like it myself; I don’t think this world-changing transition is being handled responsibly), none of the three is hitting a wall, and any one of them would be enough to permanently change the world we live in.

A version of this story originally appeared in the Future Perfect newsletter. Sign up here!
