It seems that every week this year brought another meaningful development in artificial intelligence (AI) technology. In late November, Meta released an AI model, Cicero, that mastered the game Diplomacy and was so human-like in its calculations that it was even able to lie to other players in order to achieve its objectives. A week later, OpenAI debuted its newest chatbot, ChatGPT, a language model capable of writing university-level essays. ChatGPT can effectively identify bugs in code and compose succinct macro-economic analyses.
These developments have caused many to update their timelines for the expected arrival of artificial general intelligence (AGI) — that is to say, AI systems capable of general, flexible and high-level reasoning across a wide range of tasks. Public prediction markets have moved up the expected date of AGI's emergence from 2053 to 2027.
Given that the development of AGI has been a guiding goal in the AI community since the technology’s infancy, and has been predicted to have massively positive economic consequences, the possibility of its imminent emergence is understandably exciting. However, I would argue these hopes are overly optimistic if the research trends behind recent AI models, such as ChatGPT, continue to hold.
Over the last decade, and in the past five years in particular, increasingly capable large language models have been built on three key pillars: neural network algorithms, big data, and powerful computers. For example, GPT-3 (“Generative Pre-trained Transformer 3”), LaMDA (“Language Model for Dialogue Applications”) and PaLM (“Pathways Language Model”) all used powerful computers to train neural network algorithms on enormous data sets. This powerful triple combination has enabled models to learn how language works and even, at times, to acquire emergent behaviours — behaviours their creators had not explicitly programmed them to learn.
In addition, it was recently discovered that the larger the data sets and the more powerful the computers used, the better these models seem to perform at various linguistic tasks. As such, companies pioneering these models have sought to make them increasingly bigger.
In some ways, therefore, high expectations have been justified. There is a notable improvement in the quality of text generated by ChatGPT, released in November 2022, over that of GPT-2 (GPT-3’s predecessor), released in 2019.
Yet, for all their apparent intellect, these models remain in some ways very unintelligent, especially in the domain of common-sense reasoning. The OpenAI chatbot, for instance, failed at basic arithmetic, implied that it would be possible to bike from San Francisco to Maui, and struggled with simple logical problems of the kind a second-grader could solve.
These failures — and there are many more — illustrate that the new language models are increasingly good at certain kinds of intellectual tasks but still lacking in others. More specifically, they seem to excel at pattern recognition yet struggle in novel situational contexts, when they are prompted in ways that defy the patterns they’ve previously encountered.
As such, these models have been labelled by their critics as glorified cut-and-paste artists, or stochastic parrots, entities that can statistically predict patterns in language without really understanding it. While scaling has made new AI models smarter in certain contexts, it has not resolved fundamental issues of logical understanding. Moreover, the fact that these models are right most of the time but then sometimes produce highly questionable responses can be dangerous. Users might be lulled into blindly trusting the models’ outputs, even when there might be compelling reasons to regard these outputs with a degree of skepticism.
The inspiring yet simultaneously confusing progress of such systems points to a more fundamental question: What does it mean to be truly generally intelligent? Some have defined a weakly general AI agent as one that should be able to pass a Turing test (a test of a machine’s ability to exhibit intelligent behaviour indistinguishable from that of a human), score above 90 percent on the Winograd Schema Challenge (a tricky reading comprehension test for AI), score in the seventy-fifth percentile on the SAT admission test for American colleges, and learn the Atari game Montezuma’s Revenge. That would indeed be impressive, and it seems as if we are on our way to developing AI systems that have such capabilities. However, should an AI system that scores above 1,500 on the SAT but still makes fundamental errors that defy common sense be considered truly intelligent?
What if the definition of intelligence is broadened beyond language ability, learning and problem solving to include consciousness, novel reasoning, emotional awareness and self-awareness? AI systems capable of such self-awareness would truly be human-like in their capabilities, and would be simultaneously inspiring and terrifying to witness. However, given the failures of current systems and underlying research trends in the AI community, it seems that we are still far away from that kind of AGI.
Developing truly intelligent, human-like AI may require not only more data and computational power but also a fundamentally new underlying operational architecture. Neural networks might not be enough.
Consider that the world’s most advanced large language models have ingested more data than any single human ever could in a lifetime. Yet almost any human child could glance at a map and tell you that it is impossible to bike from San Francisco to Maui.
This article first appeared in The Line.