Hello Everyone,

Welcome to the next article in our AGI series, in which we explore what AGI means and how we might know when we get there.

It’s not always clear what Sam Altman has had in his Kool-Aid of the day. ChatGPT maker OpenAI is working on a novel approach to its artificial intelligence models in a project code-named ā€œStrawberry,ā€ according to a person familiar with the matter and internal documentation reviewed by Reuters.

The company shared a new five-level classification system with employees on Tuesday (July 9) and plans to release it to investors and others outside the company in the future.

OpenAI has devised a classification system to gauge how capable its AI systems are, ranging from Level 1 to Level 5.

Hierarchy of AGI

⚫ OpenAI has previously defined AGI as ā€œa highly autonomous system surpassing humans in most economically valuable tasks.ā€

OpenAI intends to use Strawberry to perform research and quickly get a solid grasp on Level 2 below. Some sources inside the company have suggested that OpenAI sees its current products as sitting between Level 1 and Level 2 of these ā€œStages of Artificial Intelligence,ā€ the ladder it uses to chart progress toward AGI. Let’s try to picture this:

Today’s chatbots, like ChatGPT, are at Level 1.

OpenAI claims it is nearing Level 2, defined as a system that can solve basic problems at the level of a person with a PhD.

Level 3 refers to AI agents capable of taking actions on a user’s behalf.

Level 4 involves AI that can create new innovations.

Level 5, the final step to achieving AGI, is AI that can perform the work of entire organizations of people.

🌟 In partnership with ProlificšŸ“ˆ

Create datasets to fine-tune AI, with Prolific

Prolific’s database of 200k+ active participants and domain specialists provides reliable data for your AI projects.

Learn how to use Prolific to create your own high-quality datasets for AI training and fine-tuning. Includes a free download of the dataset created.

Watch Case Study Now

Trusted by over 3000 world-renowned organizations.

The five levels are:

Level 1: Chatbots, natural language

Level 2: Reasoners, can apply logic and solve problems at a human level

Level 3: Agents, can take actions on a user’s behalf

Level 4: Innovators, can make new inventions

Level 5: Organizations, can do the work of an entire organization

While Microsoft CTO Kevin Scott desperately tries to convince us this is the future, and Strawberry seems to be Q*, OpenAI’s central narrative suddenly feels far less impressive than it did in 2023. How many years out is this stuff exactly, guys? Guys?

According to Microsoft AI CEO Mustafa Suleyman, it will take until GPT-6, two model generations from now, before we have reliably functioning AI agents, which would correspond to Level 3. Mustafa was himself šŸ’ cherry-picked by Microsoft in its dismantling of AI startup Inflection.

Mustafa Suleyman now holds the title of executive vice president and CEO of Microsoft AI, a new-ish group that includes Copilot, which appears in Windows, Bing, and other products. Meanwhile, he generally promotes his book more than he does Microsoft’s future AI products.

But let’s get real: what is the 2020s equivalent of the Turing test for determining whether AGI has been reached? It could be ARC.

Our guest today is Jurgen Gravestein, who just wrote a great piece on whether Generative AI makes us more creative. ARC really is a fascinating candidate for a novel look at benchmarks for AI reasoning.

What is ARC?

The Abstraction and Reasoning Corpus (ARC) is a unique benchmark designed to measure AI skill acquisition and track progress towards achieving human-level AI.

The ARC prize was itself announced on June 11th, 2024. See more.

Introduced by FranƧois Chollet in his 2019 paper On the Measure of Intelligence, the Abstraction and Reasoning Corpus (ARC) is a dataset designed to measure the gap between machine and human learning. It consists of roughly 1,000 grid-based visual reasoning tasks.

On the Measure of Intelligence (2019)

Learn more about ARC:

https://github.com/fchollet/ARC-AGI

Website: Abstraction & Reasoning Corpus
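For the curious: each ARC task in the GitHub repository above is a small JSON file containing a handful of demonstration pairs plus one or more test inputs, where every grid is a list of rows and every cell is an integer from 0 to 9 (a colour). Here’s a minimal sketch of loading and inspecting one task; the file name is just an illustrative example, but the train/test structure follows the published format.

```python
import json

# Each ARC task file has two keys: "train" and "test". Every entry pairs an
# "input" grid with an "output" grid; a grid is a list of rows, and each cell
# is an integer 0-9 standing for a colour.
# The file name below is illustrative -- any task under data/training/ works.
with open("data/training/0a938d79.json") as f:
    task = json.load(f)

for i, pair in enumerate(task["train"]):
    print(f"demo {i}: input {len(pair['input'])}x{len(pair['input'][0])} "
          f"-> output {len(pair['output'])}x{len(pair['output'][0])}")

# The challenge: infer the transformation from the few demos above, then
# produce the correct output grid for each test input.
print("test inputs:", len(task["test"]))
```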

ARC Challenge

With Lex Fridman, clip (3 years ago).

Solving Chollet’s ARC-AGI with GPT4o (6 days ago), on Machine Learning Street Talk.

Chollet’s ARC Challenge + Current Winners (3 weeks ago), on Machine Learning Street Talk.

If an LLM solves this then we’ll probably have AGI – FranƧois Chollet (4 weeks ago).

Listen to the entire podcast for full context on ARC and FranƧois Chollet’s ideas.

Jurgen Gravestein of Teaching Computers How to Talk is a very clear thinker, and he expressed an interest in sharing his take on this topic.

šŸŽ§ Podcast Version 7:04. [Anthropic reference, Editor’s note]

Cracking the AGI Code: The ARC Benchmark’s $1M Prize

By Jurgen Gravestein, July 2024.

Challenging machines to reason like humans

By Teaching Computers how to talk.

Artificial general intelligence (AGI) progress has stalled. New ideas are needed. That’s the premise of ARC-AGI, an AI benchmark that has garnered worldwide attention after Mike Knoop, FranƧois Chollet, and Lab42 announced a $1,000,000 prize pool.

ARC-AGI stands for ā€œAbstraction and Reasoning Corpus for Artificial General Intelligenceā€ and aims to measure the efficiency of AI skill acquisition on unknown tasks. FranƧois Chollet, the creator of ARC-AGI, is a deep learning veteran. He’s the creator of Keras, an open-source deep learning library adopted by over 2.5M developers worldwide, and he works as an AI researcher at Google.

The ARC-AGI benchmark isn’t new. It has actually been around for a while, five years to be exact. And here comes the crazy part: since its introduction in 2019, no AI has been able to solve it.

What makes ARC so hard for AI to solve?

Now I know what you’re thinking: if AI can’t pass the test, this ARC thing must be pretty hard. Turns out, it isn’t. Most of its puzzles can be solved by a 5-year-old.

The benchmark was explicitly designed to compare artificial intelligence with human intelligence. It doesn’t rely on acquired or cultural knowledge. Instead, the puzzles (for lack of a better word) require something that Chollet refers to as ā€˜core knowledge’. These are things that we as humans naturally understand about the world from a very young age.

Here are a few examples:

Objectness
Objects persist and cannot appear or disappear without reason. Objects can interact or not depending on the circumstances.

Goal-directedness
Objects can be animate or inanimate. Some objects are ā€œagentsā€ — they have intentions and they pursue goals.

Numbers & counting
Objects can be counted or sorted by their shape, appearance, or movement using basic mathematics like addition, subtraction, and comparison.

Basic geometry & topology
Objects can be shapes like rectangles, triangles, and circles which can be mirrored, rotated, translated, deformed, combined, repeated, etc. Differences in distances can be detected.

As children, we learn experimentally. We learn by interacting with the world, often through play, and that which we come to understand intuitively, we apply to novel situations.
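To make that concrete, here’s a toy sketch (emphatically not a real solver) of what ARC asks of a system: hypothesize a rule, check that it explains every demonstration pair, and only then apply it to the unseen test input. The candidate rules below (identity and two mirror flips) are placeholders chosen purely for illustration; actual ARC tasks demand far richer hypotheses.

```python
# Toy illustration of the ARC protocol: find a rule consistent with every
# demonstration pair, then apply it to the test input. These candidate
# rules are deliberately trivial -- real tasks need much richer ones.
CANDIDATE_RULES = {
    "identity":  lambda g: g,
    "mirror_lr": lambda g: [row[::-1] for row in g],  # flip left-right
    "mirror_ud": lambda g: g[::-1],                   # flip top-bottom
}

def solve(train_pairs, test_input):
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(p["input"]) == p["output"] for p in train_pairs):
            return name, rule(test_input)
    return None, None  # none of our naive hypotheses fit this task

demos = [
    {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
    {"input": [[3, 3], [0, 4]], "output": [[3, 3], [4, 0]]},
]
print(solve(demos, [[5, 0], [0, 6]]))  # ('mirror_lr', [[0, 5], [6, 0]])
```

The hard part, of course, is that humans generate and test hypotheses like these on the fly using core knowledge, while current AI systems struggle to do so for transformations they haven’t encountered in training.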

But wait, didn’t ChatGPT pass the bar exam?

Now, you might be under the impression that AI is pretty smart already. With every test it passes — whether it is a medical, law, or business school exam — it strengthens the idea that these systems are intellectually outclassing us.

If you believe the benchmarks, AI is well on its way to outperforming humans on a wide range of tasks. Surely it can solve this ARC-test, no?

To answer that question, we should take a closer look at how AI manages to pass these tests.

Large language models (LLMs) have the ability to store a lot of information in their parameters, so they tend to perform well when they can rely on stored knowledge rather than reasoning. They are so good at storing knowledge that sometimes they even regurgitate training data verbatim, as evidenced by the court case brought against OpenAI by the New York Times.

So when it was reported that GPT-4 passed the bar exam and the US medical licensing exam, the question we should ask ourselves is: could it have simply memorized the answers? We can’t check if that is the case, because we don’t know what is in the training data, since very few AI companies disclose this kind of information.

This is commonly referred to as the contamination problem. And it is for this reason that Arvind Narayanan and Sayash Kapoor have called evaluating LLMs a minefield.

ARC does things differently. The test itself doesn’t rely on knowledge stored in the model. Instead, the benchmark consists exclusively of visual reasoning puzzles that are pretty obvious to solve (for humans, at least).

To tackle the problem of contamination, ARC uses a private evaluation set. This ensures that the test itself doesn’t become part of the data the AI is trained on. To be eligible for the prize money, you also need to open-source your solution and publish a paper outlining how you solved it.

This rule does two things:

It forces transparency, making it harder to cheat.

It promotes open research. Strong market incentives have pushed companies to go closed source, but it didn’t used to be like that. ARC was created in the spirit of the days when AI research was still done in the open.

Are we getting closer to AGI?

ARC’s prize money is awarded to the team, or teams, that score at least 85% on the private evaluation during an annual competition period. This year’s competition runs until November 10, 2024, and if no one claims the grand prize, it will continue during the next annual competition. Thus far no AI has been up to the task.
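As a rough sketch of how such a score is tallied (the official harness has its own rules about submission formats and the number of allowed attempts, so treat this purely as an approximation): a task only counts as solved if the predicted output grid matches the expected grid exactly, and the overall score is the fraction of tasks solved.

```python
def arc_score(predictions, solutions):
    """Approximate ARC scoring: no partial credit -- a single wrong cell
    means the task is not solved. Both arguments map task IDs to grids."""
    solved = sum(1 for task_id, target in solutions.items()
                 if predictions.get(task_id) == target)
    return solved / len(solutions)

# e.g. solving 85 tasks of a 100-task private set gives 0.85 -- the prize threshold.
```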

According to Chollet, progress toward AGI has stalled. While LLMs are trained on unimaginably vast amounts of data, they remain brittle reasoners and are unable to adapt to simple problems they haven’t been trained on. Despite that, research attention and capital keep pouring in, in the hope these capabilities will somehow emerge from scaling our current approach. Chollet, and others with him, have argued this is unlikely.

To promote the launch of the ARC-AGI Prize 2024, François Chollet and Mike Knoop were interviewed by Dwarkesh Patel. I recommend you watch it in full here. 

During that interview, Chollet said: ā€œIntelligence is what you use when you don’t know what to do.ā€ It’s a quote attributed to Jean Piaget, the famous Swiss psychologist who wrote extensively about cognitive development in children.

The simple nature of the ARC puzzles is what makes it so powerful. Most AI benchmarks measure skill. But skill is not intelligence. General intelligence is the ability to efficiently acquire new skills. And the fact that ARC remains unbeaten speaks to its resilience. New ideas are needed.

Oh, and to those who think that solving ARC equals solving AGI…

Looking to test your own intelligence on the ARC benchmark? You can play here.

Jurgen Gravestein works for the professional services branch of Conversation Design Institute and has trained more than 100 conversational AI teams globally. He has been teaching computers how to talk since 2018. His eponymous newsletter is read by folks in tech, academia, and journalism. Subscribe for free here.

Jurgen made our ā€œWho to follow in AIā€ list for 2024 (see the full list here).

Explore more of his thought-provoking work

The Intelligence Paradox

Teaching computers how to talk
The Intelligence Paradox
Key insights of today’s newsletter: There’s been a lot of confusion about what AI is and isn’t — and much of that confusion can be attributed to the language used to describe AI. Not only does this confusion add to an over-attribution in AI capabilities, it also contributes to a narrowing understanding of the human mind…
Read more


Why Your AI Assistant Is Probably Sweet Talking You

Teaching computers how to talk
Why Your AI Assistant Is Probably Sweet Talking You
Key insights of today’s newsletter: New research by Anthropic suggests that sycophantic behavior, or sweet talking, can be observed in all state-of-the-art models. Sycophancy in AI models is likely caused by the way we steer model behavior, a process known as reinforcement learning with human feedback (RLHF…
Read more

What AI Engineers Can Learn From Wittgenstein

Teaching computers how to talk
What AI Engineers Can Learn From Wittgenstein
Key insights of today’s newsletter: Ludwig Wittgenstein, the Austrian philosopher, wrote extensively about language and its intricacies. His ideas are especially relevant for anyone working in AI, because his views on language help us better understand the limitations of large language models…
Read more

Addendum

Follow the ARC Prize YouTube channel to keep up with it here.

Full disclosure: Chollet has worked at Google for the past nine years. Follow him on X/Twitter.

Why the Abstraction and Reasoning Corpus is interesting and important for AI, by Melanie Mitchell.

AI: A Guide for Thinking Humans
Why the Abstraction and Reasoning Corpus is interesting and important for AI
AI’s Most Important Open Problem: Forming Concepts and Abstractions In their proposal for the 1956 Dartmouth summer workshop, John McCarthy et al. summarized their plan for a ā€œ2 month, 10 man study of artificial intelligenceā€: ā€œAn attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now r…
Read more

How do we teach computers human reasoning?

See Tweet | Read Paper.


Read More in AI Supremacy