The acceleration of progress on artificial intelligence well and truly entered the public consciousness when the first version of ChatGPT was released for general use in November 2022.
In the months that followed, it achieved the fastest adoption of a new web-based technology ever recorded. Just about everyone who tried it out was stunned by its ability to semantically interpret a text prompt or query and to almost immediately generate lucid, comprehensive and generally accurate responses.
Here was Jordan Peterson’s reaction:
He was certainly impressed. “It’s smarter than you, and it’s going to be a hell of a lot smarter in two years.” I am not being critical of Peterson’s initial reaction. I felt pretty much the same way.
So, what has happened in the two years that have elapsed since the ChatGPT release?
Successively better versions of ChatGPT have appeared, as well as rival products that utilize the same underlying architecture based on large language models (LLMs). In March 2023 OpenAI, the company led by Sam Altman, released GPT-4, an LLM that, according to a group of Microsoft researchers, had made significant progress toward the Holy Grail of AI research: artificial general intelligence (AGI).
That is, human-like intelligence: the capacity to master a broad range of tasks at or beyond general, or even expert-level, human capability. Indeed, the report the Microsoft researchers produced after six months of pre-release evaluations was titled Sparks of Artificial General Intelligence: Early experiments with GPT-4.
From this paper’s abstract:
We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction.
That is certainly impressive. But there were some significant wrinkles, which these researchers acknowledged. For example, GPT-4 was incapable of producing reliably accurate results when asked to multiply two numbers together if the numbers had more than three digits. Something any old dumb calculator, and even some human savants, can do in a trice.
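For concreteness, the kind of calculation that tripped GPT-4 up is the sort of thing a single line of code settles instantly (the numbers below are my own example, not drawn from the researchers’ tests):

```python
print(1234 * 5678)  # 7006652 -- exact, immediate, every time
```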
Notice in the abstract excerpt above that the authors mention “the possible need for pursuing a new paradigm that moves beyond next-word prediction.”
This raises a fundamental issue. How is it that a large language model, a kind of neural net that, after being “trained” on a huge amount of data drawn from the web and other sources, constructs text sequences by statistically predicting the next word (or more precisely, the next token, which can be part of a word), can seemingly exhibit considerable intelligence? Even mathematical intelligence, though with some significant, and embarrassing, exceptions?
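To make the mechanism concrete, here is a toy sketch of next-token prediction, with a hand-written probability table standing in for a trained neural network (real models compute such a distribution over roughly a hundred thousand possible tokens at every step):

```python
import random

# Toy next-token predictor: a lookup table of probabilities stands in for the
# trained neural network. The principle -- predict, sample, append, repeat --
# is the same as in a real LLM.
NEXT_TOKEN_PROBS = {
    "the cat": {"sat": 0.6, "ran": 0.3, "meowed": 0.1},
    "cat sat": {"on": 0.9, "quietly": 0.1},
    "sat on": {"the": 0.95, "a": 0.05},
    "on the": {"mat": 0.5, "sofa": 0.3, "roof": 0.2},
}

def generate(prompt, n_tokens=4):
    tokens = prompt.split()
    for _ in range(n_tokens):
        context = " ".join(tokens[-2:])          # last two tokens as context
        probs = NEXT_TOKEN_PROBS.get(context)
        if probs is None:
            break
        choices, weights = zip(*probs.items())
        tokens.append(random.choices(choices, weights=weights)[0])  # sample the next token
    return " ".join(tokens)

print(generate("the cat"))  # e.g. "the cat sat on the mat"
```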
Is an LLM actually engaged in logical thinking, or is it just identifying and extending patterns in sequences of words and symbols? And can this process lead to reliably accurate results, free of the problem of “hallucination” (seemingly just making stuff up) that has arisen to varying degrees in all LLMs to date?
There is another problem, illustrated by the following example. How many times does the letter “r” appear in the word “strawberry”? Three, right? So obvious.
But OpenAI’s GPT-4o, released in May 2024, answered two. Someone posting on OpenAI’s developer forum managed to get around the problem by adding one sentence to the query: “How many r’s in strawberry? Verify with code.” And presto, it gave the correct answer after generating and running a bit of computer code.
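The forum post doesn’t show the code the model wrote, but the verification step amounts to something like this:

```python
word = "strawberry"
print(word.count("r"))  # 3 -- counts characters directly instead of "reasoning" over tokens
```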
I tried the same experiment with Claude 3.5 (Sonnet) by Anthropic, one of OpenAI’s closest rivals (you may recall my earlier post describing my argument with Claude about how he/she/it could be confident that 1 + 1 = 2).
The results were quite amusing—you can see the exchange below. It reproduced the same error as OpenAI, asserting there are two r’s in strawberry, using an obviously incorrect counting method. When I pointed this out, Claude apologized profusely and thanked me for bringing this error to its attention.
The source of the initial error is related to a process called tokenization, whereby LLMs break a string of text into distinct elements called tokens. These generally correspond to whole words, but in some cases words are broken into commonly recurring sub-strings, such as “berry” (cranberry, blueberry, etc.). The way this happened in ChatGPT is described in this article.
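The upshot is that the model never “sees” the individual letters of a word, only the tokens. Here is a toy illustration of the idea, using a made-up vocabulary and a greedy longest-match rule (real tokenizers such as OpenAI’s are learned from data and have vocabularies of tens of thousands of tokens, so the actual split of “strawberry” will differ):

```python
# Toy sub-word tokenizer (hypothetical vocabulary, not any real model's):
# greedily match the longest known sub-string, working from the left.
VOCAB = {"straw", "berry", "s", "t", "r", "a", "w", "b", "e", "y"}

def tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest candidate first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

print(tokenize("strawberry"))  # ['straw', 'berry'] -- the model never sees individual letters
```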
This specific problem has been remedied in the very latest models from OpenAI and Anthropic released in the last few weeks, called respectively OpenAI o1 and Claude 3.5 Sonnet (new). One thing these brilliant technologists don’t seem to be so good at is thinking up imaginative names for their products.
But notice the strange responses when I asked Claude to reflect on its error. This seems to highlight some of the problems, indeed absurdities, that can arise with a probabilistic text-generation engine that relies on patterns in an existing body of data. No doubt, there is plenty of material on the web from people apologizing and making excuses for their failures, but I was not asking for that.
This is no doubt related to the Claude developers’ aim of making it the most “conversational” of the LLMs, and the most fun to converse with. After reading these responses, it is hard not to feel sorry for the poor blighter. Don’t take it to heart, Claude! We all make mistakes—even super-intelligent AIs!
Aside from specific issues like this, there is a more general ongoing debate amongst AI experts about whether AIs based on the LLM architecture actually “think” in any meaningful sense. One leading sceptic is Meta’s chief AI scientist, Yann LeCun.
LeCun was a joint winner of the 2018 Turing Award, regarded as the computing world’s equivalent of the Nobel Prize, for his early work on neural networks. The two other winners he shared the award with were Geoffrey Hinton, sometimes referred to as the “godfather of artificial intelligence”, and Yoshua Bengio, who of all the pioneers has given the most systematic consideration to the potential catastrophic risks of powerful AI.
LeCun agrees that AI with superhuman abilities is only a matter of time, but he believes that LLMs alone will not be sufficient to get there:
For a number of reasons. The first is that there is a number of characteristics of intelligent behaviour. For example, the capacity to understand the world, understand the physical world, the ability to remember and retrieve things, persistent memory, the ability to reason and the ability to plan. Those are four essential characteristics of intelligent systems or entities, humans, animals.
LLMs can do none of those, or they can only do them in a very primitive way. And they don't really understand the physical world, they don't really have persistent memory, they can't really reason and they certainly can't plan. And so if you expect the system to become intelligent just without having the possibility of doing those things, you're making a mistake.
That is not to say that autoregressive LLMs are not useful, they're certainly useful, [or] that they're not interesting [or] that we can't build a whole ecosystem of applications around them. Of course we can. But as a path towards human level intelligence, they're missing essential components.
This view is disputed by Ilya Sutskever, a central figure in the development of ChatGPT and until recently the chief scientist at OpenAI, which he left after being involved in a failed bid to oust CEO Sam Altman. He contends that in order to do what LLMs do, they must infer a model of how the real world works.
It may look on the surface that we are just learning statistical correlations in text, but it turns out that to “just learn” statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world.
Some recent research, described in this survey article titled The Anti-LLM Revolution Begins, has led to further doubts about the reasoning and planning ability of LLMs. In one study, conducted by Apple researchers, it was shown that the ability of LLMs to produce accurate answers to mathematical problems is undermined if superficial features of a problem that should be irrelevant, such as the names of variables or the specific numerical values, are changed. Accuracy appears to correlate with the likelihood that the model will have encountered the specific values in its training data.
Moreover, the Apple study found:
Adding seemingly relevant but ultimately inconsequential information to the logical reasoning of the problem led to substantial performance drops of up to 65% across all state-of-the-art models. Importantly, we demonstrate that LLMs struggle even when provided with multiple examples of the same question or examples containing similar irrelevant information. This suggests deeper issues in their reasoning processes that cannot be easily mitigated through few-shot learning or fine-tuning. Ultimately, our work underscores significant limitations in the ability of LLMs to perform genuine mathematical reasoning.
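The perturbations are easy to picture: the same problem template with names and numbers varied and an irrelevant clause added, none of which changes the underlying logic. A sketch in the spirit of the Apple study (an illustrative template of my own, not the researchers’ actual benchmark):

```python
import random

# Generate superficially different variants of one word problem. The correct
# answer never depends on the name, the particular numbers drawn, or the
# irrelevant extra clause -- yet such changes degrade LLM accuracy.
TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "{clause}How many apples does {name} have in total?")

def make_variant():
    a, b = random.randint(10, 99), random.randint(10, 99)
    question = TEMPLATE.format(
        name=random.choice(["Liam", "Sofia", "Mei", "Omar"]),
        a=a, b=b,
        clause=random.choice(["", "Five of the apples are slightly smaller than average. "]),
    )
    return question, a + b

question, answer = make_variant()
print(question, "->", answer)
```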
What is to be concluded from this? Yann LeCun thinks that LLM-based AIs are “dumber than cats”: just statistical pattern-matching machines incapable of abstract thought.
Yet despite that, they have demonstrated some remarkable abilities—check out the table below, posted by Anthropic, which highlights the capabilities of its latest model (Gemini refers to Google’s models).
These results seem to belie LeCun’s “dumber than cats” statement. Moreover, they are getting better all the time. In September 2024 OpenAI released a new model, OpenAI o1, which specifically aims to address the reasoning deficit LeCun describes.
The approach is to combine LLM pattern matching with reinforcement learning, a technique pioneered by the UK-based company DeepMind (now part of Google) to develop AlphaGo, the neural-net model that stunned the East Asian world in 2016 by defeating a world-champion Go player. That achievement is far more difficult than winning at chess: the brute-force search used by chess programs is computationally intractable for Go because of the beyond-astronomical number of possible board configurations and move sequences.
The original version of AlphaGo was seeded with knowledge of the rules of Go, as well as some encoded expert experience from human players. It achieved world-champion level ability by playing millions of games against separate instantiations of itself, absorbing additional game-play experience with each game, leading to progressively better versions. A later development, AlphaZero, achieved even better play without any encoded human expertise, starting only with the rules of the game.
The most remarkable aspect of this is that AlphaGo and AlphaZero came up with completely novel moves and strategies, unknown and completely implausible to the most experienced human players. This could not be dismissed as just pattern-matching.
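To give a concrete, if deliberately trivial, picture of what “learning by playing itself” means, here is a toy sketch: a lookup-table policy for a simple take-away game, with the winner’s moves reinforced after each self-play game (AlphaGo and AlphaZero use deep neural networks plus Monte Carlo tree search, so this illustrates only the loop, not their method):

```python
import random

# Toy self-play reinforcement learning. Game: 10 stones on the table; players
# alternately take 1-3 stones; whoever takes the last stone wins. The "policy"
# is a table of action weights for each number of stones remaining.
N = 10
policy = {s: {a: 1.0 for a in (1, 2, 3) if a <= s} for s in range(1, N + 1)}

def choose(stones):
    actions, weights = zip(*policy[stones].items())
    return random.choices(actions, weights=weights)[0]

for _ in range(20000):
    stones, player, history = N, 0, {0: [], 1: []}
    while stones > 0:
        action = choose(stones)
        history[player].append((stones, action))
        stones -= action
        if stones == 0:
            winner = player            # the player who took the last stone wins
        player = 1 - player
    for s, a in history[winner]:       # reinforce the moves on the winning side
        policy[s][a] += 0.1

best = {s: max(policy[s], key=policy[s].get) for s in range(1, N + 1)}
print(best)  # weights drift towards the optimal strategy of leaving a multiple of 4
```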
OpenAI o1 combines reinforcement learning with an approach called Chain of Thought reasoning, which the company describes as follows:
Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason.
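OpenAI has not published the details of how o1’s chain of thought is trained, and the chain itself is hidden from users. But a crude version of the same idea can be elicited from any chat model simply by asking it to show its working. A sketch using the official openai Python package (the model choice and prompts are my own illustrative examples, not a description of o1’s internals):

```python
from openai import OpenAI  # assumes the official openai package, v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
            "than the ball. How much does the ball cost?")

# Compare a plain prompt with an explicit "show your working" prompt.
for prompt in (question,
               question + "\nThink step by step, then state the final answer."):
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of model
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content, "\n---")
```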
These are the results OpenAI is claiming for its new model on various benchmark tests (AP stands for Advanced Placement exams):
A different approach is to combine LLMs with the older approach to artificial intelligence termed symbolic AI, in which logical rules and reasoning processes are specifically embodied in code rather than being learned by neural nets, or inferred from patterns in large aggregations of data. In symbolic AI, unlike neural net architectures that are often described as black boxes, the logic and reasoning processes are transparent and explicit.
Each of the two approaches has its own distinctive advantages. What if the strengths of the two approaches could be combined? This is termed the neuro-symbolic approach, and a recent exemplar is DeepMind’s AlphaGeometry system:
AlphaGeometry is a neuro-symbolic system made up of a neural language model and a symbolic deduction engine, which work together to find proofs for complex geometry theorems. Akin to the idea of “thinking, fast and slow”, one system provides fast, “intuitive” ideas, and the other, more deliberate, rational decision-making.
Because language models excel at identifying general patterns and relationships in data, they can quickly predict potentially useful constructs, but often lack the ability to reason rigorously or explain their decisions. Symbolic deduction engines, on the other hand, are based on formal logic and use clear rules to arrive at conclusions. They are rational and explainable, but they can be “slow” and inflexible - especially when dealing with large, complex problems on their own.
AlphaGeometry’s language model guides its symbolic deduction engine towards likely solutions to geometry problems. Olympiad geometry problems are based on diagrams that need new geometric constructs to be added before they can be solved, such as points, lines or circles. AlphaGeometry’s language model predicts which new constructs would be most useful to add, from an infinite number of possibilities. These clues help fill in the gaps and allow the symbolic engine to make further deductions about the diagram and close in on the solution.
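The division of labour is easy to sketch as a loop: the neural model proposes an auxiliary construction, the symbolic engine exhaustively deduces what now follows, and the process repeats until the goal is reached. The following is purely conceptual, with placeholder functions and a toy “geometry”, not DeepMind’s implementation:

```python
def solve(problem, goal, propose_construct, deduce, max_steps=10):
    """propose_construct: the neural model's 'intuitive' suggestion of a new construct.
    deduce: the symbolic engine, deriving every fact entailed by the current diagram."""
    diagram = set(problem)
    for _ in range(max_steps):
        facts = deduce(diagram)                        # slow, rigorous, explainable step
        if goal in facts:
            return facts                               # proof found
        diagram.add(propose_construct(diagram, goal))  # fast, "intuitive" step
    return None                                        # no proof within the step budget

# Toy demo: facts are plain strings and the "engine" knows a single rule.
def deduce(diagram):
    facts = set(diagram)
    if {"AB = AC", "angle bisector from A"} <= facts:
        facts.add("base angles of triangle ABC are equal")
    return facts

def propose_construct(diagram, goal):
    return "angle bisector from A"   # what the language model would be trained to predict

print(solve({"AB = AC"}, "base angles of triangle ABC are equal",
            propose_construct, deduce))
```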
So, what to conclude from all this? One striking aspect of the AI scene today is the sheer rapidity of change, with significant developments, including new models, appearing pretty much on a weekly basis. Every time a defect is found in current approaches, you can be sure multiple researchers will be tackling the problem from a variety of perspectives.
Hence it is premature to think that the recently identified weaknesses in the LLM approach mean the prospect of AGI has receded back into the distant future. Even so stringent an LLM critic as Yann LeCun accepts the inevitability of this prospect:
There’s no question that we’ll have machines assisting us that are smarter than us. And the question is: is that scary or is that exciting? I think it’s exciting because those machines will be doing our bidding. They will be under our control.
The debate among most specialists in the field is over the best technological pathway to AGI. Both LeCun and fellow Turing laureate Yoshua Bengio believe the key is to ground AI with knowledge and understanding of how the real world works and to embed in AI realistic world models, the latter being “abstract representations that the human brain creates from the world to help humans interact and, basically, survive in their environment.”
The implications are huge, more than we can envisage, but it occurs to me that we will see profound change in schools and within the teaching profession. Teachers and universities are under fire right now for debasing education in favour of polemics, and I wonder at the worth of spending on the bricks and mortar of school rooms.