If you’re looking for a new reason to be nervous about artificial intelligence, try this: Some of the smartest humans in the world are struggling to create tests that AI systems can’t pass.
For years, AI systems have been measured by subjecting new models to a variety of standardized benchmark tests. Many of these tests consisted of difficult SAT-caliber problems in areas such as math, science and logic. Comparing model scores over time served as a rough measure of AI progress.
But AI systems eventually became too good at these tests, so new, more difficult tests were created — often with the kinds of questions that graduate students might encounter on their exams.
But those tests aren’t holding up well either. New models from OpenAI, Google and Anthropic have scored high on many of these Ph.D.-level challenges, limiting the tests’ usefulness and raising an unnerving question: Are AI systems getting too smart for us to measure?
This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: a new assessment, called “Humanity’s Last Exam,” which they say is the toughest test ever given to AI systems.
Humanity’s Last Exam is the brainchild of Dan Hendrycks, a well-known AI safety researcher and director of the Center for AI Safety. (The original name of the test, “Humanity’s Last Stand”, was abandoned as too dramatic.)