Artificial intelligence has traditionally measured progress with automated accuracy tests on tasks meant to approximate human knowledge.
Carefully constructed benchmarks such as the General Language Understanding Evaluation (GLUE), the Massive Multitask Language Understanding dataset (MMLU), and “Humanity’s Last Exam” have used large batteries of questions to score how much a large language model knows across a wide range of subjects.
However, these tests are increasingly unsatisfactory as a measure of the value of generative AI programs. Something else is needed, and it may well be a more human evaluation of AI output.
Also: AI isn't hitting a wall, it's just getting too smart for benchmarks, says Anthropic
That point of view has been circulating in the industry for some time. “We have saturated the benchmarks,” said Michael Gerstenhaber, head of API technologies at Anthropic, maker of the Claude family of LLMs, at a Bloomberg conference on AI in November.
The need for humans to be “in the loop” when evaluating AI models is also showing up in the research literature.
In a paper published this week in the New England Journal of Medicine by researchers from several institutions, including Beth Israel Deaconess Medical Center in Boston, lead author Adam Rodman and collaborators argue that “when it comes to benchmarks, humans are the only way.”
Traditional benchmarks in medical AI, such as MedQA, created at MIT, “have become saturated,” they write, meaning that models ace such exams with ease but the results are disconnected from what actually matters in clinical practice. “Our own work shows how even difficult benchmarks fall to reasoning systems like OpenAI’s o1,” they write.
Rodman and team argue for adapting the classic methods by which human doctors are trained, such as role-playing exercises with humans. “Human–computer interaction studies are far slower than even human-augmented benchmark evaluations, but as systems become more powerful, they will only become more essential,” they write.
Also: The ‘Humanity’s Last Exam’ benchmark is stumping top AI models – can you do better?
Human oversight of AI development has long been a staple of progress in generative AI. The development of ChatGPT in 2022 relied heavily on “reinforcement learning from human feedback,” or RLHF. That approach runs many cycles in which humans rate the output of AI models to shape that output toward a desired objective.
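As a rough illustration of how that rating step is commonly organized (a minimal sketch, not OpenAI's actual pipeline; the generation and rating functions here are hypothetical placeholders), human raters compare two candidate responses and the preferred one is logged as training signal for a reward model:

```python
# Minimal sketch of collecting human preference data for RLHF.
# The generate() and ask_human() callables are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # completion the human rater preferred
    rejected: str  # completion the human rater passed over

def collect_preferences(prompts, generate, ask_human):
    """For each prompt, sample two completions and ask a human which is better."""
    pairs = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)
        if ask_human(prompt, a, b) == "a":
            chosen, rejected = a, b
        else:
            chosen, rejected = b, a
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs

# These pairs would then train a reward model, which in turn steers the
# language model during the reinforcement-learning phase.
```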
Now, however, ChatGPT's creator, OpenAI, and other developers of so-called frontier models are involving humans in the scoring and ranking of their models' output.
In unveiling its open-source Gemma 3 this month, Google emphasized not automated benchmark scores but the ratings of human evaluators to make the case for the model's superiority.
Google even ranked Gemma 3 in the same terms used for top athletes, using what are called Elo scores of overall capability.
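Elo ratings originated in competitive chess: each head-to-head comparison nudges the winner's rating up and the loser's down by an amount that depends on how surprising the result was. A minimal sketch of the standard update rule, applied to hypothetical model names and votes rather than Google's actual leaderboard code:

```python
# Minimal Elo rating sketch for pairwise human preference votes.
# Model names and vote outcomes are hypothetical; K=32 is a common default.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one human preference vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - exp_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - exp_a))
    return rating_a, rating_b

# Example: start both models at 1200 and apply three hypothetical votes.
ratings = {"model_x": 1200.0, "model_y": 1200.0}
for a_won in [True, True, False]:
    ratings["model_x"], ratings["model_y"] = update_elo(
        ratings["model_x"], ratings["model_y"], a_won
    )
print(ratings)
```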
Also: Google claims Gemma 3 reaches 98% of DeepSeek's accuracy – using a single GPU
Likewise, when OpenAI unveiled its latest top-of-the-line model, GPT-4.5, in February, it emphasized not only results on automated benchmarks such as SimpleQA but also how human reviewers regarded the model's output.
“Human preference measures,” OpenAI explains, are a way to assess “the percentage of queries where testers preferred GPT-4.5 over GPT-4o.” The company says GPT-4.5 has a higher “emotional quotient” as a result, though it did not specify how.
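That preference measure is simply a win rate over pairwise votes. A minimal sketch with entirely hypothetical vote data:

```python
# Minimal sketch of a human preference win rate between two models.
# The vote list is hypothetical; "new" means the tester preferred the newer model.

votes = ["new", "old", "new", "new", "old", "new"]  # one vote per test query

win_rate = votes.count("new") / len(votes)
print(f"Testers preferred the newer model on {win_rate:.0%} of queries")
```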
Even as new benchmarks are built to replace the ones that have become saturated, benchmark designers appear to be making human participation a central element.
In December, OpenAI's o3 “mini” model became the first large language model ever to beat a human score on an abstract reasoning test called the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI).
This week, François Chollet, inventor of ARC-AGI and a scientist in Google's AI unit, unveiled a harder new version, ARC-AGI-2. Whereas the original version was calibrated against human ability by testing Amazon Mechanical Turk workers, this time Chollet enlisted more direct, live human participation.
Also: Google releases its “most intelligent” experimental Gemini model
“To ensure calibration of human-facing difficulty, we conducted a live study in San Diego in early 2025 involving over 400 members of the general public,” Chollet writes in his blog post. “Participants were tested on ARC-AGI-2 candidate tasks, allowing us to identify which problems could be consistently solved by at least two individuals in two attempts or fewer. This first-party data provides a strong baseline for human performance and will be published alongside the ARC-AGI-2 paper.”
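The selection rule Chollet describes, keeping a candidate task only if at least two participants solved it within two attempts, is straightforward to express. A minimal sketch over hypothetical attempt records, not the ARC-AGI-2 team's actual tooling:

```python
# Minimal sketch of the calibration filter described above:
# keep a candidate task if at least two participants solved it in <= 2 attempts.
# The attempt records below are hypothetical.

from collections import defaultdict

# Each record: (task_id, participant_id, attempts_needed or None if unsolved)
attempts = [
    ("task_01", "p1", 1), ("task_01", "p2", 2), ("task_01", "p3", None),
    ("task_02", "p1", None), ("task_02", "p2", 3),
]

solvers_per_task = defaultdict(set)
for task_id, participant_id, attempts_needed in attempts:
    if attempts_needed is not None and attempts_needed <= 2:
        solvers_per_task[task_id].add(participant_id)

calibrated_tasks = [t for t, solvers in solvers_per_task.items() if len(solvers) >= 2]
print(calibrated_tasks)  # -> ['task_01']
```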
It is a bit like a mash-up of automated benchmarking with the playful flash mobs of performance art from a few years back.
That fusion of AI model development with human participation suggests there is plenty of room to expand the training, development, engineering, and testing of AI models with ever greater human involvement in the loop.
Even Chollet cannot say at this stage whether any of it will lead to artificial general intelligence.