Humanitas ex machina
AGI is typically benchmarked against human capacity. But is that the right measure? What does it tell us? What does it hide?
AGI, or “artificial general intelligence,” is the holy grail of AI research, broadly defined as a machine more capable than the average human across most cognitive tasks. While even the earliest room-sized computers were dramatically better than humans at narrow tasks, the scope and flexibility of human intelligence have remained beyond our artificial systems. At least until recently. Now that frontier AI researchers claim to be within striking distance of the human benchmark, it’s worth asking whether that’s the right measure and what our human-centric view of intelligence might be obscuring when we evaluate artificial intelligence.
In narrow applications, the human benchmark is inescapably useful. If I’m considering using AI to manage my finances, it’s the natural starting point: “Is this AI as good as a human accountant?” Yet even here, failing to meet the human benchmark is not dispositive. An AI that’s 80% as good as a human accountant, but 99% cheaper, could be a worthy replacement, depending on what’s in the missing 20%. An AI accountant that’s not good at gently telling me I’m overspending might be okay; one that hallucinates income to make up the difference is not.
Beyond this sort of narrow question, the human benchmark doesn’t tell us much. It can even mislead. Risk assessment is a major justification for measuring AGI-ness, but how an AI stacks up against the average human says little about how dangerous that AI might be. Today’s models, which remain demonstrably inferior to humans in various ways, are nonetheless quite dangerous. Despite massive investments in post-training and other alignment systems, AI systems have run people over, encouraged suicide, ruined lives, given bad legal advice, and sent people to prison based on the color of their skin. But for the guardrails we’ve built, today’s models may already pose catastrophic risk. On the flip side, the race to AGI and beyond is premised on the view that machines of greater-than-human intelligence can be built safely. If we can’t take comfort in a system’s sub-human intelligence, and safe superhuman intelligence is possible, then what does the human benchmark really tell us about AI safety? The machine’s capacities are what matter, not the extent to which they overlap with our own.
Human benchmarking is not just of questionable value; it’s also exceedingly difficult to implement, perhaps intractable. In narrow applications, we can construct tests, give them to humans and AI, and compare results. But building a suite of tests to evaluate a system’s ability to do everything from changing a diaper to building a discounted cash flow analysis would be a massive endeavor. Even limiting the scope to tasks that can be done sitting at a computer (more on that in a moment) would require a vast diagnostic. A 2022 paper assessing the potential impact of AI on occupations proposed 328 benchmarks. We’ll need AGI just to build the assessment tool.
An alternative to the functional approach is to measure the faculties that make up intelligence. Researchers at Google DeepMind recently proposed such a framework. “To understand where AI systems stand relative to human cognitive capabilities,” they write, “we first need to identify the key cognitive processes that enable people to navigate the complex and changing world.” Acknowledging this is “far from a simple task,” they offer a list of 10 “cognitive faculties,” such as “Perception,” “Reasoning,” and “Memory.” Even as a taxonomy of human intelligence, their list is suspect. Creativity, one of humanity’s singular cognitive capacities, doesn’t make the list, and the authors admit they aren’t even sure what it is or where to put it. Their supposed “building blocks” bury within them such essential and powerful human tools as empathy, imagination, analogy, morality, proprioception, and instinct.
What these projects reveal is that what researchers describe as “human-level intelligence” is in fact a drastically impoverished version of human mental processes. By necessity, our expectations of near-future AI systems are limited to what can be done at a computer, and so we define success in those terms. Anthropic CEO Dario Amodei scopes his view of powerful AI as “a human working virtually.” Yet while a great deal can be accomplished from a computer, it’s only a fraction of what human intelligence can do and does on a daily basis.
Human intelligence incorporates an array of capacities that don’t lend themselves to digital representations. Our extraordinary social skills (reading facial expressions and tone of voice, sensing the give and take of conversation), our precise hand-eye coordination and manual dexterity, our ability to sense our world through smell, touch, and taste, and the obscure but powerful subconscious processes we label “dreams” and “intuitions” and “instincts” are foundational to human experience. Humans who lack even a sliver of these analog capacities struggle to navigate human-made spaces. Moreover, it’s a profound error to separate the language-based, logical thinking we associate with computer use and AI from the emotional, sensory, and intuitive thinking we associate with our animal bodies. It’s all part of “human intelligence,” and any human benchmark that ignores everything but the most computer-like elements of our intelligence is no human benchmark at all.
Another plausible justification for measuring AGI against a human benchmark is that it can tell us whether we are dealing with what philosophers call a “moral patient” — an entity that can be affected by moral actions and is thus owed moral consideration. This may soon become a profoundly important question. But here too, the “human” benchmark is both too high and too low. Being good at physics is not a prerequisite for moral status, which we confer on infants and many animal species. But neither does the “humanity” described by DeepMind’s cognitive faculties model obviously merit such consideration. Nowhere do they ask the deceptively simple question, “Can the system suffer?”
Ultimately, the problem with taxonomies and test batteries based on human models isn’t architectural so much as foundational: they’re built on sand. We genuinely don’t know how human intelligence works, can’t reliably measure it, and don’t even agree on how to define it. Our best descriptions come from poets and philosophers; Mrs Dalloway is at least as useful a guide to the nature of human intelligence as any journal article in the cognitive science library at DeepMind.
Technologists often use the word “jagged” to describe AI, meaning that it doesn’t map cleanly to human capabilities: strong in some areas, weak in others. But all intelligence is jagged when viewed from the perspective of another intelligence. Ants would find humans hopelessly disorganized; my dog pities my inability to distinguish smells. Rather than attempt to map artificial intelligence to human intelligence and call that general intelligence, it would be more fruitful to better articulate what we mean by intelligence broadly (see, e.g., The Levin Lab), and map both human and artificial systems against that.
In the meantime, human-benchmark AGI is a vanity metric. Even if our machines were capable of it, we don’t know how to measure it; and even if we could measure it, it would tell us much less than we think.