Hello Everyone,
There are some core themes I keep returning to lately, even as I approach esteemed guest contributors.
As such, this publication is also dedicated to exploring themes and topics like the Future of Work, AGI, and OpenAI’s development. This is a guest contribution by Harry Law, his second piece in a series; to see his first, go here.
AI meets composability
As the leading intelligent composable content platform, Contentful enables developers and marketers alike to easily deliver compliant on-brand experiences at speed and scale—all within one unified content system.
Please Note
Clicking on a Native Sponsor ☝ and signing up with them helps me afford to offer you better content. I choose Sponsors aligned with my audience to the best of my ability. Hosting Native Sponsors allows me to keep more content free for everyone.
Harry started Learning From Examples about six months ago and I recommend it:
Harry Law is a Researcher at Google DeepMind | Postgraduate Fellow at the Leverhulme Centre for the Future of Intelligence | PhD Candidate at the University of Cambridge.
I can also recommend GDM’s own AI Policy Perspectives.
Examples of recent articles by the Guest Author
The many kinds of Responsible AI
Harry Law is thus an academic expert writing about AI history, ethics, and governance. This post was originally written in December 2023. I really value 🎓 academically minded guest contributors, and I hope you enjoy this essay.
By Harry Law
One AGI or many?
Image: Visualising AI by Google DeepMind
Editor’s note: Harry Law writes in a personal capacity. Below are his views, not those of Google DeepMind or the University of Cambridge.
IN a 1988 edition of Daedalus, Seymour Papert asked a question about what he saw as the divergent ways of thinking in the field of AI: “Is there one AI or are there many?”
Papert, who is best known for writing the influential 1969 Perceptrons: An Introduction to Computational Geometry with longtime friend and collaborator Marvin Minsky, was asking whether AI refers to one specific style of computation, or to multiple paradigms that each exist under the umbrella of ‘AI’. Specifically, in the context of resurgent interest in connectionism in the 1980s, he wondered whether symbolic AI would have its day again or whether ‘AI’ would become synonymous with the connectionist portion of the field.
Here, connectionism refers to the ‘branch’ of artificial intelligence that proposes that systems ought to mirror the interaction between neurons of the brain to independently learn from data. Symbolic reasoning, meanwhile, was championed by American AI researchers Allen Newell and Herbert Simon and assumes that aspects of ‘intelligence’ can be replicated through the manipulation of symbols, which in this context refers to representations of logic that a human operator can access and understand.
In Perceptrons: An Introduction to Computational Geometry, Papert and Minsky wrote a searing critique of connectionism, arguing that it had fundamental limitations that placed a ceiling on how good connectionist approaches could get. But they were wrong. As Papert, who was clearly in the middle of a Brothers Grimm anthology, said almost two decades later in 1988:
Victory seemed assured for the artificial sister. And indeed, for the next decade all the rewards of the kingdom came to her progeny, of which the family of expert systems did best in fame and fortune. But Snow White was not dead. What Minsky and Papert had shown the world as proof was not the heart of the princess; it was the heart of a pig. To be more literal: their book was read as proving that the neural net approach to building models of mind was dead. But a closer look reveals that they really demonstrated something much less than this. The book did indeed point out very serious limitations of a certain class of nets (nowadays known as one-layer perceptrons) but was misleading in its suggestion that this class of nets was the heart of connectionism.
By asking whether there was ‘one AI or many’, Papert encourages us to consider which paradigm is the real AI. Today, there is no question about which has won out: the large models we know today are the successors to the deep learning models of the 2010s, which built on the parallel distributed processing approach of the 1980s and 1990s. These models were the next iteration of Frank Rosenblatt’s perceptron, about which Minsky and Papert wrote their famous critique. At its core, a few clever algorithmic tricks aside, modern AI belongs almost entirely to the connectionist tradition.
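To make that one-layer limitation concrete, here is a minimal sketch, in Python with NumPy and written for this post rather than drawn from any historical code, of the classic perceptron learning rule failing on XOR, a problem that is not linearly separable:

```python
# Toy illustration: a single-layer perceptron cannot learn XOR,
# because no single linear boundary separates the two classes.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR labels

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(1000):
    for xi, yi in zip(X, y):
        pred = int(w @ xi + b > 0)   # step activation
        w += lr * (yi - pred) * xi   # perceptron update rule
        b += lr * (yi - pred)

print([int(w @ xi + b > 0) for xi in X])  # never settles on [0, 1, 1, 0]
```

Add a hidden layer and the problem disappears, which is exactly the gap Papert describes between what Perceptrons proved about one-layer nets and how the book was read.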
That takes us to today, where frontier models are the name of the game. Some say that frontier models have already reached the threshold required to be described as AGI, whereas others are more cautious (more on that below). Much like Papert’s many AIs of the 1980s, there are multiple competing visions, versions, and definitions of what exactly we mean by AGI. Where this conversation was once confined to quieter corners of the internet, 2023 has catapulted the term into the mainstream.
It’s hard to measure just how good models are getting, not least because of challenges associated with evaluations, including the implementation of benchmarks, the subjectivity of human-led evaluations, and issues with relying too heavily on model-generated approaches. But while skyrocketing benchmark results should be approached with care, few doubt that today’s models are getting very capable very quickly.
One of the most important types of capability increase is that of multimodality. You now have a companion in your pocket that can listen to your voice, read your writing, or look at your photos and respond in turn using any modality. Raw capability increases and the emergence of powerful multimodality have got a lot of folks asking how far exactly we are from AGI (and what we mean by what has historically been a contested term). We know 2023 as the biggest year for AI since AlexNet, but it was also a year in which labs reflected on what AGI is, how far away we are, and even what happens when we get there.
Contesting AGI
WE begin with Google Research’s Blaise Agüera y Arcas and Stanford’s Peter Norvig (also a former researcher at Google), who argued in October that, while today’s most advanced AI models have many flaws, they will be regarded as early manifestations of artificial general intelligence in the future.
As they explain: “decades from now, they will be recognized as the first true examples of AGI, just as the 1945 ENIAC is now recognized as the first true general-purpose electronic computer.” The overriding point here is that, though there are clear limits, frontier models demonstrate the core property of generality through a command of a large number of topics, tasks, languages, and modalities. They also exhibit ‘instructability,’ which refers to the idea that frontier models are capable of learning from real-world interaction after they have already been trained.
This approach, like others that we will get to shortly, is based on the idea that we ought to think about intelligence in terms of a multidimensional scorecard, rather than a single yes or no proposition. It seems unlikely to me that there will be a single moment in time in which we all agree that AGI is here (in fact, we can already discount that as a possibility, given that people are already saying that it’s arrived). What is more likely is that systems will become increasingly capable and little by little more people will align on the idea that we’re dealing with AGI, which reminds me of the iPhone release analogy comparing versions of large models to Apple’s latest and greatest.
According to the pair, there are four reasons why people generally do not describe frontier models as AGI: a healthy scepticism about metrics for AGI, an ideological commitment to alternative AI theories or techniques, a devotion to human (or biological) exceptionalism, and a concern about the economic implications of AGI. In addition to these, we might also consider two other dynamics. First, that—should the authors be proved right—we are at the very early stages of the emergence of a powerful new technology. Some people will recognise the significance of a particular technology faster than others, so we shouldn’t expect universal agreement over the very short term. That’s just how innovation goes. Second, researchers are still human. Part of the reason experts like to be critical is because they, like all of us, suffer from the thing that pollsters call social desirability bias. Why call AGI now when you can wait and see?
Then there was the infamous Sparks of AGI paper from Microsoft. The research, which was released in April, detailed results of a number of tests (some more freewheeling than others) on GPT-4 to argue that the model represented an early form of AGI, defined as “systems that demonstrate broad capabilities of intelligence, including reasoning, planning, and the ability to learn from experience, and with these capabilities at or above human-level.”
To test this claim, Microsoft researchers probed the model’s abilities in areas like multimodal and interdisciplinary composition, coding, mathematical reasoning, interaction with tools and environments, and various theory of mind tests. They assessed GPT-4 on tasks requiring integrative skills across disciplines like art, literature, medicine, law, physics, and programming, and evaluated its coding abilities with challenges from LeetCode and the HumanEval benchmark, as well as real-world tasks like data visualisation, game development, and optimisation algorithms.
The group probed its maths skills with high school and college level questions and modelling tasks, and tested the model’s ability to use tools by having it complete tasks requiring shell commands, calendar management, and web searches. Other experiments included an evaluation of its understanding of humans through false belief experiments and social situation analysis, and an assessment of its discriminative capabilities by having it identify personally identifiable information and evaluate the truthfulness of responses to challenging questions. Based on the results, they argued that “given the breadth and depth of GPT-4’s capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.”
But sparks are not flames, and it is worth saying that the authors also highlighted limitations stemming from the autoregressive generate-and-test architecture, including a lack of planning, backtracking, and memory when problems required an iterative approach. Whatever the case, the main criticism comes down to how well benchmarks actually represent the qualities that we want to test for. GPT-4 can pass the bar, but right now I would still prefer a human lawyer.
The other issue at play is contamination, wherein the model may have actually seen some of these questions before in its training data. Increasingly, my own view is that if capabilities continue to improve then contamination isn’t massively relevant. Provided the systems are safe, can approximate reasoning, and can continue to evolve in the future, I doubt it will matter that memorisation played an outsized role in delivering early successes.
As for defining AGI, an interesting effort came from Google DeepMind at the end of the year, where researchers proposed a framework for classifying the capabilities and behaviours of AGI systems and their precursors. The work is based on six principles that any categorisation of AGI ought to follow, including: focusing on capabilities over mechanisms; separately evaluating generality and performance; and defining stages along the path to AGI rather than just the endpoint.
The researchers introduced a levelled taxonomy with two key dimensions: performance (the depth of capabilities compared to humans) and generality (the breadth of capabilities of a given model). Within this framework, the group introduced five levels of performance, categorised as emerging, competent, expert, virtuoso, and superhuman.
For each of these levels, the paper outlines metrics against which systems can be assessed vis-à-vis humans, beginning with ‘equal to or somewhat better than an unskilled human’ for emerging performance, before ‘reaching at least 50th percentile of skilled adults’ for competent performance. As for the others, they connect expert performance with ‘at least 90th percentile of skilled adults’, virtuoso performance with ‘at least 99th percentile of skilled adults’, and superhuman performance with the ability to ‘outperform 100% of humans’.
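For readers who prefer a compact view, here is the performance axis written out as plain data. This is an illustrative Python sketch of the levels described above, not code from the paper; the names and structure are my own:

```python
# Illustrative summary of the Levels of AGI performance axis.
# The paper defines a taxonomy, not an implementation; this dict simply
# restates the thresholds described in the paragraph above.
PERFORMANCE_LEVELS = {
    "emerging":   "equal to or somewhat better than an unskilled human",
    "competent":  "at least 50th percentile of skilled adults",
    "expert":     "at least 90th percentile of skilled adults",
    "virtuoso":   "at least 99th percentile of skilled adults",
    "superhuman": "outperforms 100% of humans",
}
```

Generality is assessed separately along the second axis, rather than being folded into these levels.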
❤️ Support the author by subscribing to his recently launched Newsletter. 💬
Topics: AI history, ethics, and governance.
What’s next?
THE question ‘how far are we from AGI?’ is as popular as it is difficult to answer, though not just because the growth of capabilities is tricky to predict.
Another fundamental problem, which this year has made clear, is that the answer depends on what precisely we mean by AGI. Where once that was a far-off question that could be kicked into the long grass, now there are people who think AGI is already here.
One route forward that may get us closer to consensus is the move from tool to agent: after all, agents can by definition complete a wider range of tasks, which necessarily makes them more general. BabyAGI and AutoGPT proved that simple GPT-powered agents were possible using non-specialised architectures, and the GPT Store shows us what personalised models could look like at scale. Next year, I expect to see assistants get much better, with OpenAI already thinking about how these types of systems ought to be governed.
There is generally broad agreement that a couple of technical advances are needed before we get to the maximalist versions of AGI. Things like better reasoning, memory, and planning are some of the most common characteristics at the top of the wishlist, though so too is work to make large models more reliable and robust. Separately, there’s a question about why current models have been unable to discover new scientific knowledge, though that seems to have been at least in part answered by the FunSearch algorithm. The broader point, though, is that things like reasoning, planning, and memory will be at the core of any AGI system and of any frontier agent, albeit to differing degrees.
For that reason, the next generation of models will give us a better understanding of how far along on the path to AGI we actually are. Whether that proves to be behind or ahead of expectations, it is worth remembering just how far we have come in the last two years. And by progress, I don’t just mean increases in underlying capabilities; I mean that people all over the world are now using frontier models every single day.
In some sense, real world usage will be the best barometer for where we are on the road to AGI. There will always be those who cast the net wide and those who prefer a more precise approach, but on the other side of technical definitions we will have to grapple with the effect of AI systems on art, culture, the economy, and politics. Put simply, at some point in the future we can expect a tipping point when the societal impact of AI becomes so great that perhaps it is worth adding that extra letter.
By Harry Law
Read more about Harry Law.
Bio: Harry Law works on ethics and policy issues at Google DeepMind. He’s also a PhD candidate at the Department of History and Philosophy of Science at the University of Cambridge and a postgraduate fellow at the Leverhulme Centre for the Future of Intelligence. You can find him on X/Twitter at @lawhsw and on Substack writing about AI history, ethics, and governance at www.learningfromexamples.com.
On AGI
Why You Should Be Skeptical of AGI.
Consider supporting the publication.
Read More in AI Supremacy