Hello Everyone,
Welcome to our next article in our AGI series. In it we explore what AGI means and how we might know when we get there.
It's not always clear what Sam Altman has had in his Kool-Aid of the day. ChatGPT maker OpenAI is working on a novel approach to its artificial intelligence models in a project code-named "Strawberry," according to a person familiar with the matter and internal documentation reviewed by Reuters.
The company shared a new five-level classification system with employees on Tuesday (July 9) and plans to release it to investors and others outside the company in the future.
OpenAI has built a classification system to gauge how capable its AI systems are, ranging from Level 1 to Level 5.
Hierarchy of AGI
OpenAI has previously defined AGI as "a highly autonomous system surpassing humans in most economically valuable tasks."
OpenAI intends to use Strawberry to perform research and quickly obtain a solid grasp of Level 2. Some sources inside the company have suggested that OpenAI sees its product as being in between Level 1 and Level 2 of these "Stages of Artificial Intelligence," the path it typically refers to as AGI. Let's map this out:
Today's chatbots, like ChatGPT, are at Level 1.
OpenAI claims it is nearing Level 2, defined as a system that can solve basic problems at the level of a person with a PhD.
Level 3 refers to AI agents capable of taking actions on a user's behalf.
Level 4 involves AI that can create new innovations.
Level 5, the final step to achieving AGI, is AI that can perform the work of entire organizations of people.
In partnership with Prolific
Create datasets to fine-tune AI, with Prolific
Prolific's database of 200k+ active participants and domain specialists provides reliable data for your AI projects.
Learn how to use Prolific to create your own high-quality datasets for AI training and fine-tuning. Includes a free download of the dataset created.
Trusted by over 3,000 world-renowned organizations.
The five levels are:
Level 1: Chatbots, natural language
Level 2: Reasoners, can apply logic and solve problems at a human level
Level 3: Agents, can perform additional actions
Level 4: Innovators, can make new inventions
Level 5: Can do the work of an entire organization
While Microsoft CTO Kevin Scott desperately tries to convince us this is the future, and Strawberry seems to be Q*, OpenAI's central narrative is suddenly far less impressive than it felt in 2023. How many years out is this stuff exactly, guys? Guys?
According to Microsoft's AI CEO Mustafa Suleyman, it will take until GPT-6, two AI generations from now, before we have reliably acting AI agents, corresponding to Level 3. Mustafa was himself cherry-picked by Microsoft in its dismantling of AI startup Inflection.
Mustafa Suleyman is now titled executive vice president and CEO of Microsoft AI, a new-ish group that includes Copilot, which appears in Windows, Bing, and other products. Meanwhile, he generally promotes his book more than he does future AI products.
But let's get real: what is the Turing test equivalent of the 2020s to determine whether AGI has been reached? It could be ARC.
Our guest today is Jurgen Gravestein, who just wrote a great piece on whether generative AI makes us more creative. ARC really is a fascinating candidate for a novel look at benchmarks for AI reasoning.
What is ARC?
The Abstraction and Reasoning Corpus (ARC) is a unique benchmark designed to measure AI skill acquisition and track progress toward achieving human-level AI.
The ARC prize was itself announced on June 11th, 2024. See more.
Introduced by François Chollet in his 2019 paper On the Measure of Intelligence, the Abstraction and Reasoning Corpus (ARC) is a dataset designed to measure the gap between machine and human learning. It consists of 1,000 image-based reasoning tasks.
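For the curious, each task in the public ARC-AGI repository is a small JSON file with a "train" list of demonstration input/output grids and a "test" list of held-out pairs, where a grid is a list of rows of integer color codes. Here is a minimal Python sketch of parsing that shape; the tiny "mirror" task below is invented for illustration and is not from the dataset:

```python
import json

def parse_task(task):
    """Split an ARC task dict into (train_pairs, test_pairs).

    Each pair is (input_grid, output_grid), where a grid is a
    list of rows of integer color codes (0-9).
    """
    train = [(p["input"], p["output"]) for p in task["train"]]
    test = [(p["input"], p["output"]) for p in task["test"]]
    return train, test

# A tiny hand-made task in the same JSON shape the repo uses:
# the hidden rule is "mirror the grid left-to-right".
example = json.loads("""
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
    {"input": [[0, 2], [0, 0]], "output": [[2, 0], [0, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 0]], "output": [[0, 3], [0, 0]]}
  ]
}
""")

train_pairs, test_pairs = parse_task(example)
print(len(train_pairs), len(test_pairs))  # 2 1
```

A solver sees only the demonstration pairs and must infer the rule well enough to produce the test outputs.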
Learn more about ARC:
https://github.com/fchollet/ARC-AGI
Website: Abstract & Reasoning Corpus
ARC Challenge
With Lex Fridman, clip (3 years ago).
Solving Chollet’s ARC-AGI with GPT4o (6 days ago), on Machine Learning Street Talk.
Chollet’s ARC Challenge + Current Winners (3 weeks ago), on Machine Learning Street Talk.
"If an LLM solves this then we'll probably have AGI" with Francois Chollet (4 weeks ago).
Listen to the entire podcast for full context on ARC and François Chollet's ideas.
Jurgen Gravestein of Teaching Computers how to Talk is a very clear thinker, and he expressed an interest in sharing his take on this topic.
Podcast version, 7:04. [Anthropic reference, Editor's note]
Cracking the AGI Code: The ARC Benchmark’s $1M Prize
By Jurgen Gravestein, July 2024.
Challenging machines to reason like humans
By Teaching Computers how to talk.
Artificial general intelligence (AGI) progress has stalled. New ideas are needed. That's the premise of ARC-AGI, an AI benchmark that has garnered worldwide attention after Mike Knoop, François Chollet, and Lab42 announced a $1,000,000 prize pool.
ARC-AGI stands for "Abstraction and Reasoning Corpus for Artificial General Intelligence" and aims to measure the efficiency of AI skill acquisition on unknown tasks. François Chollet, the creator of ARC-AGI, is a deep learning veteran. He's the creator of Keras, an open-source deep learning library adopted by over 2.5M developers worldwide, and works as an AI researcher at Google.
The ARC-AGI benchmark isn't new. It has actually been around for a while, five years to be exact. And here comes the crazy part: since its introduction in 2019, no AI has been able to solve it.
What makes ARC so hard for AI to solve?
Now I know what you're thinking: if AI can't pass the test, this ARC thing must be pretty hard. Turns out, it isn't. Most of its puzzles can be solved by a 5-year-old.
The benchmark was explicitly designed to compare artificial intelligence with human intelligence. It doesn't rely on acquired or cultural knowledge. Instead, the puzzles (for lack of a better word) require something that Chollet refers to as "core knowledge". These are things that we as humans naturally understand about the world from a very young age.
Here are a few examples:
Objectness
Objects persist and cannot appear or disappear without reason. Objects can interact or not depending on the circumstances.
Goal-directedness
Objects can be animate or inanimate. Some objects are "agents": they have intentions and they pursue goals.
Numbers & counting
Objects can be counted or sorted by their shape, appearance, or movement using basic mathematics like addition, subtraction, and comparison.
Basic geometry & topology
Objects can be shapes like rectangles, triangles, and circles which can be mirrored, rotated, translated, deformed, combined, repeated, etc. Differences in distances can be detected.
As children, we learn experimentally. We learn by interacting with the world, often through play, and that which we come to understand intuitively, we apply to novel situations.
But wait, didn't ChatGPT pass the bar exam?
Now, you might be under the impression that AI is pretty smart already. With every test it passes, whether a medical, law, or business school exam, it strengthens the idea that these systems are intellectually outclassing us.
If you believe the benchmarks, AI is well on its way to outperforming humans on a wide range of tasks. Surely it can solve this ARC-test, no?
To answer that question, we should take a closer look at how AI manages to pass these tests.
Large language models (LLMs) have the ability to store a lot of information in their parameters, so they tend to perform well when they can rely on stored knowledge rather than reasoning. They are so good at storing knowledge that sometimes they even regurgitate training data verbatim, as evidenced by the court case brought against OpenAI by the New York Times.
So when it was reported that GPT-4 passed the bar exam and the US medical licensing exam, the question we should ask ourselves is: could it have simply memorized the answers? We can't check if that is the case, because we don't know what is in the training data, since very few AI companies disclose this kind of information.
This is commonly referred to as the contamination problem. And it is for this reason that Arvind Narayanan and Sayash Kapoor have called evaluating LLMs a minefield.
ARC does things differently. The test itself doesn't rely on knowledge stored in the model. Instead, the benchmark consists exclusively of visual reasoning puzzles that are pretty obvious to solve (for humans, at least).
To tackle the problem of contamination, ARC uses a private evaluation set. This is done to ensure that the test itself doesn't become part of the data that the AI is trained on. You also need to open source the solution and publish a paper outlining what you've done to solve it in order to be eligible for the prize money.
This rule does two things:
It forces transparency, making it harder to cheat.
It promotes open research. Strong market incentives have pushed companies to go closed source, but it didn't use to be like that. ARC was created in the spirit of the days when AI research was still done in the open.
Are we getting closer to AGI?
ARC's prize money is awarded to the team, or teams, that score at least 85% on the private evaluation during an annual competition period. This year's competition runs until November 10, 2024, and if no one claims the grand prize, it will continue during the next annual competition. Thus far no AI has been up to the task.
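The scoring itself is simple in principle: a task counts as solved only when the predicted output grid matches the expected grid exactly, cell for cell, and the score is the fraction of private-set tasks solved. A simplified sketch (the real harness has additional rules, such as a small number of attempts per task, which are omitted here):

```python
def score(predictions, solutions, threshold=0.85):
    """Exact-match accuracy over tasks, plus whether it clears the bar.

    predictions/solutions: lists of output grids (lists of rows),
    aligned by task index. A single wrong cell means the task fails.
    """
    solved = sum(p == s for p, s in zip(predictions, solutions))
    accuracy = solved / len(solutions)
    return accuracy, accuracy >= threshold

# Four toy tasks: three exact matches, one miss.
preds = [[[1, 0]], [[0, 1]], [[2, 2]], [[3]]]
sols  = [[[1, 0]], [[0, 1]], [[2, 2]], [[9]]]
print(score(preds, sols))  # (0.75, False)
```

The all-or-nothing grading is part of what keeps the benchmark honest: partial credit for "almost right" grids would reward pattern-matching rather than actually grasping the rule.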
According to Chollet, progress toward AGI has stalled. While LLMs are trained on unimaginably vast amounts of data, they remain brittle reasoners and are unable to adapt to simple problems they haven't been trained on. Despite that, research attention and capital keep pouring in, in the hope these capabilities will somehow emerge from scaling our current approach. Chollet, and others with him, have argued this is unlikely.
To promote the launch of the ARC-AGI Prize 2024, François Chollet and Mike Knoop were interviewed by Dwarkesh Patel. I recommend you watch it in full here.
During that interview, Chollet said: "Intelligence is what you use when you don't know what to do." The quote belongs to Jean Piaget, a famous Swiss psychologist who wrote extensively about cognitive development in children.
The simple nature of the ARC puzzles is what makes it so powerful. Most AI benchmarks measure skill. But skill is not intelligence. General intelligence is the ability to efficiently acquire new skills. And the fact that ARC remains unbeaten speaks to its resilience. New ideas are needed.
Oh, and to those who think that solving ARC equals solving AGI…
Looking to test your own intelligence on the ARC benchmark? You can play here.
Jurgen Gravestein works for the professional services branch of Conversation Design Institute and has trained more than 100 conversational AI teams globally. He has been teaching computers how to talk since 2018. His eponymous newsletter is read by folks in tech, academia, and journalism. Subscribe for free here.
Jurgen made our "Who to follow in AI" list of 2024 (see the full list here).
Explore more of his thought-provoking work:
The Intelligence Paradox
Why Your AI Assistant Is Probably Sweet Talking You
What AI Engineers Can Learn From Wittgenstein
Addendum
Follow the YouTube of the ARC prize to keep up with it here.
Full disclosure: Chollet has worked at Google for the past nine years. Follow him on X/Twitter.
Why the Abstraction and Reasoning Corpus is interesting and important for AI, by Melanie Mitchell.
How do we teach computers human reasoning?
Read more in AI Supremacy.