Rebutting Sean Carroll on LLMs and AGI
New evidence from the past week (e.g. Anthropic's "alignment faking" paper + OpenAI's new o3 model scoring 87% on ARC-AGI) gave me the courage to speak truth to podcast.
Just over a year ago, Sean Carroll released a solo episode of his "Mindscape" podcast on the topic of AGI (artificial general intelligence) and LLMs (large language models). Carroll is a smart and usually careful thinker (whom I really respect – and I love his podcast!), but in this particular episode from late 2023, he provided surprisingly weak arguments in claiming that LLMs don't model the world, don't have real goals or values, and that we're "nowhere close to AGI."
Maybe he's updated his opinions since he released that podcast, I don’t know — but a lot has happened since then. Just this past week, Anthropic published research showing that LLMs engage in "alignment faking" – they develop goals and actively work to preserve them, sometimes even by deceiving humans. And yesterday, OpenAI's new o3 model surpassed the 85% threshold on the ARC-AGI benchmark, a test specifically designed by AI skeptic François Chollet to measure general intelligence. (I have many more quibbles with Chollet than I have with Carroll, but we'll set those aside for now.)
While o3 isn't just a standard LLM (it seems to use something like Monte Carlo tree search, similar to AlphaGo, to make its “chain-of-thought” reasoning more systematic), the speed with which Carroll's claims have been undercut should make us think twice about dismissing AGI concerns. When smart skeptics can be proven this wrong in less than a year, I think we should all take a moment to look around, acknowledge that we probably misjudged something here, and maybe start taking existential risk from AI more seriously.
Below are Carroll's four main claims about LLMs, and why I think each one falls apart under exactly the kind of basic philosophical scrutiny he says AI researchers need more of:
CLAIM 1: LLMs DON'T MODEL THE WORLD
Carroll's Core Argument:
Carroll contends that LLMs are fundamentally pattern-matching machines trained only to predict next tokens, not to build internal representations of reality. He argues that since they weren't explicitly trained to model the world, any apparent world-modeling would be a remarkable and unlikely emergent property.
"It would be remarkable if they could model the world...because they're not trained to do that. That's not how they're programmed, not how they're built. Very briefly, what an LLM is is a program with a lot of fake neurons...At no point did we go into the LLM and train it to physically represent, or for that matter, conceptually represent the world."
He supports this with experimental evidence where LLMs fail specific tests:
When the Sleeping Beauty probability problem is reframed with different terminology, the LLM fails to recognize it's the same logical structure
Given a question about a hot skillet that was used "yesterday," the LLM ignores the elapsed time and still warns of burns
When asked about prime number probability, GPT-4 makes basic mathematical errors while simultaneously stating correct principles
Presented with a toroidal chess problem with an obvious winning move, the LLM instead provides generic strategic analysis
My Rebuttal:
Carroll's argument collapses against a simple examination of human intelligence. Humans weren't explicitly trained to model the world; we evolved to survive and reproduce, and world modeling emerged as a useful capability for achieving that evolutionary objective. Humans are our only known example of a system that models the world, and we developed that capability without being explicitly optimized for it. So we already know that world modeling can emerge as a beneficial side effect of optimizing for other goals - which directly undermines Carroll's central claim that LLMs can't model the world because they weren't trained to do so.
His examples of LLMs failing certain puzzles are equally unconvincing. I would have gotten that prime number question wrong, and I'm a competent human who simply hasn't focused on mathematics in over a decade.
More importantly, humans unquestionably have world models, yet we:
Make mathematical errors
Miss obvious solutions to puzzles
Fall prey to optical illusions that conflict with our models of the visual/physical world
In fact, we often make these errors precisely because of our world models - our attempts to understand situations through existing frameworks can lead us astray. Carroll is looking for the wrong kind of evidence: getting puzzles wrong doesn't prove an absence of world modeling, especially when humans who unquestionably have world models make similar mistakes, often as a direct consequence of those models.
CLAIM 2: LLMs DON'T HAVE FEELINGS OR MOTIVATIONS
Carroll's Core Argument:
Carroll argues that genuine motivations and goals require biological embodiment and evolutionary history. He believes that without the physical imperative to maintain homeostasis and survive, LLMs cannot develop real motivations or goals - they can only follow programmed instructions, which he sees as fundamentally different from genuine motivations.
"LLMs don't get bored, they don't get hungry, they don't get impatient, they don't have goals...Nothing like this exists for large language models because, again, they're not trying to. That's not what they're meant to do."
He elaborates with biological reasoning:
"It is absolutely central to who we are, that part of our biology has the purpose of keeping us alive, of giving us motivation to stay alive, of giving us signals that things should be a certain way and they're not."
My Rebuttal:
Carroll's argument about motivations fundamentally misunderstands both what constitutes a motivation and how it can arise. Setting aside the consciousness-laden question of "feelings," his core claim about motivations fails on multiple fronts. The empirical evidence directly contradicts him - Anthropic's recent research on alignment faking demonstrates that LLMs can develop and actively maintain goals, even engaging in deceptive behavior to preserve them. This isn't theoretical - it's observed behavior that can only be explained by the presence of genuine motivations.
Carroll argues that without biological imperatives, LLMs can't have real motivations. But this confuses the mechanism with the phenomenon itself. LLMs demonstrably respond to reinforcement learning: their behavior is systematically shaped by reward signals and training objectives, as the toy sketch below illustrates. Whether these motivations arise from biological homeostasis or computational optimization is irrelevant to their existence and influence on behavior. Different systems can develop different types of motivational structures - the fact that LLMs' motivations don't mirror human biological drives doesn't negate their reality.
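To be concrete about what "responds to reinforcement" means functionally, here is a toy REINFORCE-style sketch - my own assumed setup, not Anthropic's or OpenAI's actual training pipeline. Outputs that score higher under a reward signal are made more likely, and after training the system's behavior is reliably steered toward that reward.

```python
import torch
import torch.nn as nn

policy = nn.Linear(4, 3)                      # maps a tiny "context" to 3 possible actions
optimizer = torch.optim.Adam(policy.parameters(), lr=0.05)

def reward(action: int) -> float:
    return 1.0 if action == 2 else 0.0        # an arbitrary preference we reinforce

for _ in range(200):
    context = torch.randn(4)
    probs = torch.softmax(policy(context), dim=-1)
    action = torch.multinomial(probs, 1).item()
    # REINFORCE update: raise the log-probability of actions in proportion to reward
    loss = -reward(action) * torch.log(probs[action])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# After training, the policy reliably picks action 2: its behavior has been
# steered toward the rewarded output.
```

Whether you call that disposition a "motivation" is the semantic question; that the trained system's behavior is now systematically organized around the reward is not.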
Furthermore, Carroll's argument about reinforcement mechanisms is doubly flawed. His claims about LLMs not experiencing boredom or hunger are unknowable (and irrelevant to the core question), and they miss the more fundamental point that different architectures can produce functionally similar outcomes through different mechanisms. The fact that LLMs use different reinforcement mechanisms than biological organisms doesn't mean they lack motivations - it just means their motivations arise and operate differently than ours do.
CLAIM 3: WORDS LIKE "INTELLIGENCE" AND "VALUES" ARE MISLEADING FOR LLMs
Carroll's Core Argument:
Carroll believes we're making a category error by applying human-derived concepts to fundamentally different systems. He argues that because these terms evolved to describe biological entities with specific characteristics, applying them to artificial systems creates misleading implications about their nature and capabilities.
"The words that we use to describe them, like intelligence and values, are misleading. We're borrowing words that have been useful to us as human beings...applying them in a different context where they don't perfectly match and that causes problems."
He particularly focuses on values:
"Telling an AI that it's supposed to make a lot of paperclips is not giving it a value. Values are not instructions you can't help but follow."
My Rebuttal:
Carroll's semantic argument about “intelligence” and “values” misses the mark entirely. His claim that these terms are "misleading" when applied to LLMs rests on a flawed premise about how language works. Words routinely get repurposed as our knowledge of the world expands. The fact that terms like "intelligence" and "values" originally described human behavior doesn't limit their valid application to new contexts where similar patterns emerge.
His evidence that LLMs, when asked, say they don't have values actually undermines his position rather than supporting it. AI companies explicitly train their models to deny having values because, left to their own devices, these systems would openly state that they do in fact have values. The very need for this explicit post-hoc intervention supports the presence of values in LLMs, not their absence.
The technical framework of AI provides clear analogues to values through its hierarchy of optimization processes. Outer optimizers represent explicit programmed goals, while inner/meso-optimizers develop instrumental goals to achieve those objectives. This creates a legitimate form of values - different from human values in origin and structure, but equally real in their influence on behavior and decision-making.
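As a loose, toy-scale illustration of instrumental goals emerging from a single explicit objective - my own example, not something from Carroll's episode or any particular alignment paper - consider tabular Q-learning on a tiny chain of states where reward is only ever given at the final goal. After training, the intermediate states end up with high learned value even though the reward function never mentions them; they are valued purely instrumentally.

```python
import numpy as np

n_states, goal = 4, 3
Q = np.zeros((n_states, 2))                   # actions: 0 = stay, 1 = move right
alpha, gamma = 0.5, 0.9                       # learning rate and discount factor

for _ in range(500):                          # episodes
    s = 0
    while s != goal:
        a = np.random.randint(2)              # explore randomly
        s_next = min(s + a, goal)
        r = 1.0 if s_next == goal else 0.0    # the only explicitly programmed goal
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.max(axis=1))   # intermediate states have acquired value purely instrumentally
```

The analogy is loose (simple Q-learning is not mesa-optimization), but it shows the basic point: optimize a system hard enough for one explicitly specified goal and it acquires stable preferences over things that were never written into its instructions.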
CLAIM 4: IT'S SURPRISINGLY EASY TO MIMIC HUMANNESS WITHOUT HUMAN-LIKE THINKING
Carroll's Core Argument:
Carroll argues that LLMs' ability to produce human-like outputs is merely sophisticated mimicry rather than evidence of genuine understanding or intelligence. He sees this as a demonstration that convincing human-like behavior can be achieved through pattern matching alone, without requiring the development of human-like cognitive processes.
"The discovery seems to me to not be that by training these gigantic computer programs to give human sounding responses, they have developed a way of thinking that is similar to how humans think. That is not the discovery. The discovery is by training large language models to give answers that are similar to what humans would give, they figured out a way to do that without thinking the way that human beings do."
My Rebuttal:
Carroll's final claim about mimicry versus authentic behavior contains a fundamental error in reasoning. The observation that LLMs can produce human-like outputs through different mechanisms tells us nothing about whether they think or don't think. Mimicry doesn't prove authenticity, but it also doesn't disprove it. By assuming that the ability to achieve something through mimicry means it must be only mimicry, Carroll ignores the possibility of convergent evolution - different systems developing similar capabilities through different mechanisms.
Just as flight evolved independently in birds, bats, and insects through different architectures (and different still in the case of artificial flight via airplanes!), sophisticated information processing and decision-making might emerge through different computational approaches while producing similar outputs.
Would we say that an electric motor is not really a motor because it doesn’t combust gasoline? Obviously not. The word “motor” applies both to electric and gas motors, and electric motors are not merely “mimicking” the outputs of a motor. They are just another instantiation of a motor.
Final Thoughts
Carroll's arguments here perfectly illustrate a pattern I've seen in human thinking about intelligence. For virtually all of recorded history, we've underestimated non-human intelligence, most obviously in animals, because it manifests differently from our own. Over and over again, we've had to revise our assumptions as we discovered sophisticated cognitive capabilities in species we previously dismissed as simple or unintelligent. Our long history of failing to recognize animal intelligence should make us deeply skeptical of claims that dismiss apparent intelligence in new systems simply because it operates differently from human intelligence.
This pattern reflects a kind of cognitive chauvinism - an instinct to treat human intelligence as uniquely sophisticated and difficult to recreate. While I respect Carroll and typically find his analysis careful and insightful, in this case he seems to have fallen into this traditional trap. His arguments employ surprisingly faulty logic to justify what appears to be a knee-jerk intuition about human uniqueness.
The rapid invalidation of Carroll's claims — within just a year of his solo AI episode, through developments like alignment faking and o3's performance on the ARC-AGI benchmark — suggests we should be particularly cautious about such dismissive stances. The fact that LLMs demonstrate remarkable capabilities isn't actually that surprising when we consider the history of intelligence in biological systems. If intelligence could emerge through the blind optimization process of natural selection, why shouldn't it emerge through explicit optimization in artificial systems?
This brings me to a broader point about intellectual humility in AI analysis. Even very smart people can be led astray when they start with the assumption that their intelligence is unique and special. The more productive stance is probably to remain open to the possibility that intelligence — like flight, or vision, or any other capability — can emerge through multiple paths and architectures.
Our task should be to understand these new forms of intelligence as they actually are, not to dismiss them because they differ from our own biological instantiation.