When AI Flunks Humanity’s Hardest Test
What “Humanity’s Last Exam” really tells us about machine intelligence
How smart is today’s artificial intelligence, really? Not in marketing terms, not in sci-fi language, but in the sober light of difficult questions such as: how many tendons attach to a tiny bone in a hummingbird’s tail? Which syllables in a Biblical Hebrew verse are “closed” according to the latest specialist scholarship? These are not trivia questions; they are examples from “Humanity’s Last Exam,” a new benchmark that is reshaping how we think about AI progress.
The benchmark comes from a Nature paper, “A benchmark of expert-level academic questions to assess AI capabilities,” and is unpacked in a plain-language article, “AI is failing ‘Humanity’s Last Exam’—so what does that mean for machine intelligence?” Together, they tell a story that is less about AI “getting close to human” and more about how we measure, misunderstand, and sometimes overhype what these systems can actually do.
To see why this matters, it helps to understand what a benchmark is. In simple terms, an AI benchmark is a standardized test for models: you collect a set of questions or tasks, define what counts as a correct answer, and see how different systems score. Benchmarks give us a common yardstick for judging whether Model A is better than Model B at math, biology, coding, or reading comprehension. Over the past few years, models have raced up the scoreboards on popular benchmarks, often scoring above 90 percent on tests like MMLU, a huge exam covering many school and university subjects.
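To make that measuring process concrete, here is a minimal sketch of what benchmark scoring amounts to. It is a toy under stated assumptions, not any real evaluation harness: the ask_model() function is a hypothetical stand-in for whatever API a given model actually exposes, and the exact-match grading rule is a simplification.

```python
# A minimal sketch of benchmark scoring, not a real harness.
# ask_model() is a hypothetical placeholder for a call to some model's API.

def ask_model(model_name: str, question: str) -> str:
    """Placeholder: send a question to a model and return its answer text."""
    raise NotImplementedError

def score_benchmark(model_name: str, items: list[dict]) -> float:
    """Fraction of items answered exactly right.
    Each item looks like {"question": ..., "answer": ...}."""
    correct = 0
    for item in items:
        prediction = ask_model(model_name, item["question"])
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct += 1
    return correct / len(items)

# Comparing models then just means running the same fixed question set:
# score_benchmark("model_a", questions) versus score_benchmark("model_b", questions)
```

The key point is the fixed question set: as long as everyone takes the same test, the scores are comparable, which is exactly why the test itself becomes such a tempting target.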
The problem is that once a test becomes familiar, AI developers start training and tuning their systems to do well on that specific test, much like teaching to the exam in schools. Scores go up, but you are no longer sure whether you are measuring deep ability or just clever test prep. Humanity’s Last Exam, or HLE, was created precisely to escape that trap and to probe what is still beyond the reach of current systems.
HLE is a collection of 2,500 questions across more than a hundred subjects, from advanced mathematics and physics to classics, linguistics, ecology, and computer science. Nearly 1,000 experts from over 500 institutions worldwide contributed questions, most of them professors, researchers, or people with graduate degrees. The questions are not scraped from textbooks. They are original, precise, and often sit at the frontier of current human knowledge. Many require graduate-level expertise or very specialized domain knowledge.
Here is where HLE really differs from earlier benchmarks. Before a question was accepted, it was tested against leading AI models. If the models could already answer it correctly, the question was thrown out. Only questions that made frontier models fail were allowed in. That is the opposite of how we usually think about test design for humans, where we want a spread of easy, medium, and hard questions. For HLE, the point was to build a test that sits just beyond the current AI frontier, so that any improvement in scores would really mean new capability, not just familiarity.
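In code terms, that selection process is an adversarial filter. The sketch below captures the idea under simplifying assumptions; the panel of model names, the ask_model() call, and the grading rule are illustrative stand-ins, not the authors’ actual pipeline.

```python
# A rough sketch of the adversarial filtering idea: a candidate question
# survives only if every model in a frontier panel gets it wrong.
# Names and grading here are hypothetical, not the authors' real pipeline.

FRONTIER_PANEL = ["frontier_model_a", "frontier_model_b", "frontier_model_c"]

def ask_model(model_name: str, question: str) -> str:
    """Placeholder for a call to a hosted model."""
    raise NotImplementedError

def is_correct(prediction: str, reference: str) -> bool:
    """Simplified grading; HLE relies on precise, verifiable answers."""
    return prediction.strip().lower() == reference.strip().lower()

def survives_filter(question: str, reference: str) -> bool:
    """Reject the question if any panel model already answers it correctly."""
    for model in FRONTIER_PANEL:
        if is_correct(ask_model(model, question), reference):
            return False  # already solvable: not hard enough for HLE
    return True  # stumps the whole panel: forward to expert review
```

That inversion, keeping only what the best current systems cannot do, is what makes any later score increase meaningful as evidence of new capability.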
The results were brutal, and that was by design. When HLE was first released in early 2025, GPT-4o scored about 2.7 percent, Claude 3.5 Sonnet about 4.1 percent, and OpenAI’s reasoning-focused model o1 about 8 percent. Even as newer models arrived, early scores remained in the single digits. The benchmark had done its job. It showed that beneath the impressive chat interfaces and polished demos, there was still a huge gap between AI performance and expert human knowledge on tightly defined, verifiable academic questions.
At this point, some commentators started to talk about HLE as a stepping stone toward artificial general intelligence, or AGI, the idea of systems that can perform any task at human or superhuman levels. The logic is tempting. If a model can eventually ace a test built from the hardest questions experts can think of, does that not mean it has become “like us”? The authors of the TechXplore article argue that this is a mistake, and the HLE paper itself is careful on this point. High scores on HLE would show expert-level performance on closed, exam-style questions, but not autonomous research ability or general intelligence.
The key distinction is between performance and understanding. When a human passes the bar exam, we infer that they have learned the law in a way that transfers to real practice. They can reason with clients, navigate messy situations, and exercise judgment. When an AI model passes the same exam, all we know is that it can produce answers that match the marking scheme. It does not have a body, experiences, or goals. It does not care about justice or consequences. It has learned patterns in text, not lived reality. Treating its test score as evidence of “being like a lawyer” confuses output with inner competence.
This is why benchmarks that only measure performance can mislead us about intelligence. Human intelligence is grounded in a lifetime of interaction with the world and with other people. Language is a tool we use to express that deeper intelligence. For large language models, language is all there is. Their “intelligence” is the ability to predict plausible next words based on vast training data. There is nothing underneath in the human sense. So when we use human exams as AI benchmarks, we are borrowing a tool that was designed to measure something very different.
The HLE researchers did several things to make their benchmark as rigorous as possible. They recruited domain experts to write questions, enforced strict rules about clarity and non-searchability, required detailed solutions, and ran a multi-stage human review process to refine and approve each item. They also evaluated multiple state-of-the-art models, measured not just accuracy but also how well models calibrated their confidence, and analyzed how performance changed as models generated longer chains of reasoning.
Their findings are enlightening. First, accuracy is low across the board, even for the strongest models. Second, models are badly calibrated. They often answer with high confidence even when they are wrong, which is especially worrying in expert domains where overconfident mistakes can be costly. Third, more “thinking” in the form of longer reasoning traces helps up to a point, then starts to hurt, suggesting that simply giving models more compute is not a magic path to better answers.
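To make the calibration point concrete, here is one standard way such mismatches can be quantified, expected calibration error, shown as a minimal sketch. The bin count and field names are illustrative assumptions, not the paper’s exact protocol.

```python
# A minimal sketch of expected calibration error (ECE). It assumes each record
# pairs a model's stated confidence with whether the answer was actually
# correct; the binning scheme here is illustrative, not the paper's protocol.

def expected_calibration_error(records: list[dict], n_bins: int = 10) -> float:
    """records: [{"confidence": float in [0, 1], "correct": bool}, ...]"""
    bins = [[] for _ in range(n_bins)]
    for r in records:
        idx = min(int(r["confidence"] * n_bins), n_bins - 1)
        bins[idx].append(r)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_confidence = sum(r["confidence"] for r in b) / len(b)
        accuracy = sum(r["correct"] for r in b) / len(b)
        ece += (len(b) / len(records)) * abs(avg_confidence - accuracy)
    return ece

# A well-calibrated model that reports 90 percent confidence should be right
# about 90 percent of the time; on HLE-style results, stated confidence
# runs far ahead of actual accuracy.
```

The gap between stated confidence and actual accuracy is exactly what makes overconfident wrong answers risky in expert settings.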
Since HLE was published online, scores have climbed. Newer systems such as Gemini 3 Pro Preview and GPT-5 now score in the twenties and thirties of percent. That sounds impressive until you remember how the benchmark was built. Once a test is public, developers can optimize for it. The TechXplore authors describe this as AI “cramming” for the exam. The models are getting better at the kinds of questions HLE contains, but that does not mean they are converging on human-like intelligence. It means the benchmark has become another target in the optimization game.
So what should we take from all this? First, HLE is a valuable reality check. It punctures the illusion that current AI systems are already “almost there” in terms of general intelligence. Second, it highlights the need for benchmarks that are closer to real work. OpenAI’s GDPval, for example, tries to measure how useful models are on tasks drawn from actual documents, analyses, and deliverables in professional settings, rather than exam-style questions. That is a step toward evaluating what matters in practice, not just what looks impressive on a leaderboard.
For organizations and individuals using AI, the practical message is not to be dazzled by benchmark scores, even on something as ambitious as Humanity’s Last Exam. A model that shines on HLE might still struggle with your specific mix of writing, coordination, customer interaction, or domain-specific judgment. The most useful “benchmark” you can run is your own, built from the tasks you actually care about, with success criteria that reflect your real constraints and risks, as in the sketch below.
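A do-it-yourself evaluation can be surprisingly small. The sketch below shows one possible shape, assuming a simple ask_model() wrapper; the example prompt and its pass/fail check are placeholders for whatever reflects your real work and risks.

```python
# A small sketch of a do-it-yourself evaluation: your own prompts, your own
# pass/fail checks. The example task, the check, and ask_model() are all
# placeholders, not a prescription.

from typing import Callable

def ask_model(model_name: str, prompt: str) -> str:
    """Placeholder for a call to whichever model you are evaluating."""
    raise NotImplementedError

def has_required_sections(output: str) -> bool:
    """Example criterion: a drafted report must cover a few key headings."""
    return all(h in output.lower() for h in ("summary", "risks", "next steps"))

MY_TASKS: list[tuple[str, Callable[[str], bool]]] = [
    ("Draft a one-page status report for project X from these notes: ...",
     has_required_sections),
    # add prompts and checks drawn from the tasks you actually care about
]

def run_my_eval(model_name: str) -> float:
    """Share of your own tasks the model handles acceptably."""
    passed = sum(check(ask_model(model_name, prompt)) for prompt, check in MY_TASKS)
    return passed / len(MY_TASKS)
```

Even a handful of such tasks, graded against criteria you trust, tells you more about fitness for your work than any public leaderboard position.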
Looking ahead, HLE points toward a more mature conversation about AI. It reminds us that intelligence is not a single ladder that machines are climbing rung by rung until they reach us at the top. It is a landscape of different abilities, some of which current systems handle remarkably well, and others where they still fail basic expert tests. The real work now is to design evaluations that keep us honest about those gaps, to focus on usefulness rather than myth, and to ensure that as AI systems become more capable, they do so in ways that genuinely serve human needs rather than just scoring higher on the next big exam.



