Welcome to Today’s Newsletter,

Today we are going to talk about foundational work for the future of robotics. I hope you can appreciate how much it matters to how innovation in robotics will scale, accelerate, and improve in the coming months and years. October 2023 will be remembered in the history of robotics, and in this post you will learn why.

While Generative A.I. has gotten most of the attention since late 2022, and Google DeepMind is about to announce Gemini, progress in robotics in the early 2020s has really caught my attention as well.

If you like my coverage and want to get my deep dives you can support the channel. Start a free trial and see if it’s for you.

Subscribe now

🏆Sponsored⭐

Talk to an LLM – live video and AI demo

Choose your own adventure – AI style. This is a demo showing how to build voice-driven and speech-to-speech AI apps. It is built on top of daily-python, Deepgram real-time transcription, Azure OpenAI GPT-4, Azure OpenAI DALL-E, and Azure AI Speech.

Check it Out

I consider it very important to follow what Google DeepMind is doing in terms of research. Even with the exodus of talent from Google Brain and DeepMind in recent years, a recently unified Google DeepMind remains one of the highest concentrations of A.I. talent in the world.

Read the Blog

Towards General Purpose Robots

I will be writing more about the robotics space on this Newsletter in the weeks to come.

What would “general purpose” learning for robots look like, especially as humanoid general purpose robots attract more and more funding in the early 2020s? And what will companies like Alphabet and Amazon become in robotics over the long term?

Together with partners from 33 academic labs, Google DeepMind has pooled data from 22 different robot types to create the Open X-Embodiment dataset and the RT-X model.

Google DeepMind’s robotics team last week announced work done with 33 research institutes to create a massive, shared database called Open X-Embodiment. The researchers behind the project liken it to ImageNet, a database of more than 14 million images that dates back to 2009.

Project page: https://robotics-transformer-x.github.io/

Robots are great specialists, but poor generalists

By the 2030s, robots are going to be able to learn a lot faster on their own.

Typically, you have to train a model for each task, robot, and environment. Changing a single variable often requires starting from scratch. But what if we could combine the knowledge across robotics and create a way to train a general-purpose robot?

Open X-Embodiment

The project is a shared robot database equivalent to ImageNet for computer vision. This ambitious project aims to train a generalist model capable of controlling various robots, performing complex tasks, and learning from diverse instructions. By creating a vast dataset and making it available to the research community, DeepMind hopes to accelerate advancements in robot learning and foster collaboration among researchers. If you are a believer that robots can do good in the world, a world that might have a fertility and youth crisis soon, then you should be paying attention.

Read the Paper

Generative A.I. open-source researchers might be able to use Open X-Embodiment with some surprising results in the years ahead. Think AutoGPT meets robotics. Learning, agency, and LLMs augmenting how robots interact with the real world, at a much faster pace of self-development, now looks feasible.

A Unified Database for Robotics Learning

“Just as ImageNet propelled computer vision research, we believe Open X-Embodiment can do the same to advance robotics,” note DeepMind researchers Quan Vuong and Pannag Sanketi.

Functionally, Google DeepMind researchers tested their RT-1-X model in five different research labs, demonstrating a 50% higher success rate on average across five commonly used robots compared to methods developed independently and specifically for each robot. The team also showed that training its vision-language-action model, RT-2, on data from multiple embodiments tripled its performance on real-world robotic skills.

Google DeepMind takes the next step toward general-purpose robots


DeepMind, in collaboration with 33 academic laboratories, heralds the arrival of RT-1-X, a novel robotics transformer (RT) model that evolves from RT-1. RT-1-X is trained on the new Open X-Embodiment dataset constructed by the researchers and showcases a remarkable 50% improvement in success rates compared to task-specific models.

So what happens if OpenAI’s GPT-4V vision capabilities and Google’s RT-X robotic learning converge, perhaps in synergy with other advances in general purpose robots (GPRs)? It is becoming fairly likely that robotics has among the most to gain from these advances in multi-modal LLMs.

Google seems relatively well placed. The Open X-Embodiment project was born out of the intuition that combining data from diverse robots and tasks could create a generalized model superior to specialized models, applicable to all kinds of robots.

This concept was partly inspired by large language models (LLMs), which, when trained on large, general datasets, can match or even outperform smaller models trained on narrow, task-specific datasets. Surprisingly, the researchers found that the same principle applies to robotics.

We knew back in July 2023 that Google DeepMind’s new RT-2 system enables robots to perform novel tasks, but there’s more potential that could be unlocked in the years to come. The late 2020s could be the start of a golden age in robotics. While A.I. and Generative A.I. might take decades to mature into actionable AGI, robotics could move faster (as counter-intuitive as that sounds today in 2023).

The Future of Innovation is Open-Source and Global

Google DeepMind adds that such a task is far too large to entrust to a single lab. The database features more than 500 skills and 150,000 tasks pulled from 22 different robot types. As the “Open” bit of the name implies, its creators are making the data available to the research community.

“We hope that open sourcing the data and providing safe but limited models will reduce barriers and accelerate research,” the team adds. “The future of robotics relies on enabling robots to learn from each other, and most importantly, allowing researchers to learn from one another.”

So when will the “OpenAI of robotics” emerge? Very soon, actually.

The primary objectives of this undertaking can be summarized as follows:

Demonstrating Positive Transfer: The first aim of this research is to showcase that policies crafted from a diverse array of robotic data and environments enjoy the benefits of positive transfer. These policies exhibit superior performance when compared to those trained exclusively on data from specific evaluation setups.

Facilitating Future Research: The second goal is to contribute datasets, data formats, and models to the robotics community, thereby empowering and encouraging future research endeavors focusing on X-embodiment models.

WATCH THE ANIMATION VIDEO

Datasets, and the models trained on them, have played a critical role in advancing AI. They are likely now to do the same for general purpose robotics.

What is RT-1?

RT-1-X is built on top of Robotics Transformer 1 (RT-1), a multi-task model for real-world robotics at scale. RT-2-X is built on RT-1’s successor, RT-2, a vision-language-action (VLA) model that has learned from both robotics and web data and can respond to natural language commands.

Centralizing A Global Robot Database

To develop the Open X-Embodiment dataset, Google DeepMind partnered with academic research labs across more than 20 institutions to gather data from 22 robot embodiments, demonstrating more than 500 skills and 150,000 tasks across more than 1 million episodes. The dataset is believed to be the most comprehensive robotics dataset of its kind.
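To make that structure concrete, here is a minimal sketch of how one might browse a slice of the data, assuming the RLDS episode format that tensorflow_datasets uses for robotics releases. The bucket path and field names below are illustrative assumptions, not a guaranteed API:

```python
# A minimal sketch of browsing Open X-Embodiment data, assuming the
# RLDS/tensorflow_datasets distribution. The bucket path and field
# names below are illustrative assumptions and may differ.
import tensorflow_datasets as tfds

# Each constituent robot dataset is assumed to be its own TFDS builder.
builder = tfds.builder_from_directory(
    builder_dir="gs://gresearch/robotics/fractal20220817_data/0.1.0"
)
episodes = builder.as_dataset(split="train[:5]")  # sample a few episodes

for episode in episodes:
    # RLDS convention: an episode is a nested dataset of time steps.
    for step in episode["steps"]:
        observation = step["observation"]  # camera images, proprioception, etc.
        action = step["action"]            # robot-specific action vector
        # Many sub-datasets also carry a per-step language instruction,
        # e.g. observation["natural_language_instruction"] (assumed name).
```

The point of pooling: each robot contributes episodes in its own action space, but they all share this common episode layout, which is what lets one model train across all of them.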

Open-source acceleration at the intersection of LLMs and robotics means a lot of real innovation will be occurring in the space in the 2020s, including in military robotics and smarter drones for warfare, not just useful humanoid general purpose robots and robots of all kinds.

Unfortunately it’s not just about cute robot arms and robots for cleaning.

A Robot Academy for Building General Purpose Robots

The Open X-Embodiment dataset combines data across embodiments, datasets and skills.

Emergence of RT-X

Google DeepMind focused, in particular, on training the RT-1 and RT-2 models on nine distinct robotic manipulators. The resulting models, collectively referred to as RT-X, surpass the capabilities of policies trained solely on data derived from the evaluation domain. These models exhibit superior generalization and innovative capabilities.

One of the researchers wrote an important X Thread about the work here and (related) here.


Google Just Opened a Pandora’s Box of General Purpose Robots Arriving Sooner than Later

Empirical studies underscore the transformative potential of Transformer-based policies trained on the constructed dataset.

The dataset represents diverse behaviors, robot embodiments and environments, and enables learning generalized robotic policies.

The DeepMind researchers express their hope that by open sourcing the data and providing limited yet secure models, they can reduce barriers and accelerate research in the field.

Read about RT-2 (July 2023)

Meanwhile, the more expansive vision-language-model-based iteration, RT-2-X, showcases approximately threefold improvements in generalization over models exclusively trained on evaluation embodiment data.

Sergey Levine said on X:

“This model only knows which robot it’s “driving” from looking through the camera, and it takes language command. We also trained a smaller “RT-1-X” that we could open source. Some labs ran RT-1-X (if we had to send code), but a few tested RT-2-X (if we could run it there)”

How are the Organizations Involved?

RT-X: A general-purpose robotics model

RT-X builds on two of DeepMind’s robotics transformer models. The researchers trained RT-1-X using RT-1, their model for real-world robotic control at scale, and they trained RT-2-X on RT-2, their vision-language-action (VLA) model that learns from both web and robotics data.

Large, high-capacity models trained on diverse datasets have shown remarkable success in efficiently tackling downstream applications.

In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics?

Can we instead train a “generalist” X-robot policy that can be adapted efficiently to new robots, tasks, and environments?

Adapting VLMs for robotic control

RT-2 builds upon VLMs that take one or more images as input and produce a sequence of tokens that, conventionally, represent natural language text.

Such VLMs have been successfully trained on web-scale data to perform tasks like visual question answering, image captioning, or object recognition. In this work, the researchers adapted the Pathways Language and Image model (PaLI-X) and the Pathways Language model Embodied (PaLM-E) to act as the backbones of RT-2.
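The key move is that robot actions are expressed in the same token space as text: each continuous action dimension is discretized into a small number of bins and emitted as integer tokens the language model already knows how to produce. Here is a minimal sketch of that idea; the bin count matches the 256-bin discretization described for RT-2, but the value ranges and the 8-dimension layout are illustrative assumptions:

```python
# A minimal sketch of VLA action tokenization: continuous robot actions
# are discretized into integer bins so a language model can emit them
# as ordinary tokens. Ranges and dimension layout are illustrative.
import numpy as np

NUM_BINS = 256  # 256 bins per action dimension, as described for RT-2

def tokenize_action(action: np.ndarray, low: float = -1.0, high: float = 1.0) -> list[int]:
    """Map each continuous action dimension to an integer token in [0, NUM_BINS)."""
    clipped = np.clip(action, low, high)
    bins = np.round((clipped - low) / (high - low) * (NUM_BINS - 1))
    return bins.astype(int).tolist()

def detokenize_action(tokens: list[int], low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Invert tokenize_action back to (approximate) continuous values."""
    bins = np.asarray(tokens, dtype=np.float64)
    return low + bins / (NUM_BINS - 1) * (high - low)

# Example: an 8-dim action (terminate flag, xyz delta, roll/pitch/yaw, gripper).
action = np.array([0.0, 0.02, -0.10, 0.05, 0.0, 0.0, 0.3, 1.0])
print(tokenize_action(action))  # eight integer tokens in [0, 255]
```

At inference time the model outputs an action as a short string of such integers, and the controller de-tokenizes them back into motor commands.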

Google Gemini might tap into the future software of robots

For any inquiries, please email open-x-embodiment@googlegroups.com

I believe this work is foundational for the future of robotics and is more important than the media and A.I. scientists are giving it credit for.


A Breakthrough in Robotics is on the Horizon

Multimodal LLMs will improve Robots

Emergent abilities will arrive

General purpose learning will improve

Generalisation and emergent skills

The researchers performed a series of qualitative and quantitative experiments on their RT-2 models, on over 6,000 robotic trials.

Exploring RT-2’s emergent capabilities, they first searched for tasks that would require combining knowledge from web-scale data and the robot’s experience, and then defined three categories of skills: symbol understanding, reasoning, and human recognition.

Each task required understanding visual-semantic concepts and the ability to perform robotic control to operate on these concepts. Commands such as “pick up the bag about to fall off the table” or “move banana to the sum of two plus one” – where the robot is asked to perform a manipulation task on objects or scenarios never seen in the robotic data – required knowledge translated from web-based data to operate. 

Robots will Soon be Capable of Novel Learning

Their results suggest that co-training with data from other platforms imbues RT-2-X with additional skills that were not present in the original dataset, enabling it to perform novel tasks.

RT-2-X was three times more successful than RT-2 on emergent skills, novel tasks that were not included in the training dataset. In particular, RT-2-X showed better performance on tasks that require spatial understanding, such as telling the difference between moving an apple near a cloth as opposed to placing it on the cloth.


Robots Equipped with General Purpose Emergent Skills are Coming

RT-X exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms.

What happens when robots can learn from each other and transfer experience to each other better? What if all robots in the system automatically know a novel skill one of them has learned? How fast do things accelerate then?

“Our results suggest that co-training with data from other platforms imbues RT-2-X with additional skills that were not present in the original dataset, enabling it to perform novel tasks,” the researchers write in a blog post that announces Open X and RT-X.

Microsoft can claim sparks of AGI based on the musings of a guy who got early access to GPT-4 all it wants, but I’m interested in general intelligence with emergent properties in a way that can scale to everybody and everything. How robots “come alive” is much more relevant than how LLMs multiply and slowly become more efficient at multimodal tasks at this point. But the magic happens when the two streams really intersect.

Looking ahead, the scientists are considering research directions that could combine these advances with insights from RoboCat, a self-improving model developed by DeepMind. Robotics becomes actionable in the world around the same time as LLMs become vastly more able to generalize their learning. I would say the 2035 to 2055 period is the critical window. It always takes longer than people assume, especially as it relates to anything general purpose or to emergent capabilities.

Advancing robotic control

RT-2 shows that vision-language models (VLMs) can be transformed into powerful vision-language-action (VLA) models, which can directly control a robot by combining VLM pre-training with robotic data.

As interfaces for talking to robots come into being, we’ll be able to “teach robots” much more easily. They will then be able to learn from each other through a central database such as Open X-Embodiment.
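To see what “directly control a robot” means in practice, here is a hypothetical sketch of the closed loop a VLA model implies: camera image plus instruction in, action tokens out, repeated until the model signals termination. The policy and robot interfaces named here are placeholders, not a real API:

```python
# A hypothetical closed-loop sketch of VLA-style control. `policy` and
# `robot` are placeholder interfaces, not a real library or API.
import numpy as np

NUM_BINS = 256  # matches the tokenizer sketch above

def detokenize(tokens, low=-1.0, high=1.0):
    """Map integer action tokens back to continuous motor values."""
    return low + np.asarray(tokens, dtype=float) / (NUM_BINS - 1) * (high - low)

def run_episode(policy, robot, instruction: str, max_steps: int = 100) -> None:
    """Repeatedly observe, ask the model for action tokens, then act."""
    for _ in range(max_steps):
        image = robot.get_camera_image()                  # current observation
        tokens = policy.predict_tokens(image, instruction)
        if tokens[0] == 1:                                # assumed terminate flag
            break
        robot.apply_action(detokenize(tokens[1:]))        # send motor command
```

The loop is deliberately simple: all the intelligence lives in the model's forward pass, which is exactly why better pooled training data translates directly into better robot behavior.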

Also check out these threads from some of the folks who played key leadership roles in the project: thread by Quan Vuong, thread by Karl Pertsch, thread by Karol Hausman, thread by Sergey Levine.

Curiously, the team has open-sourced the Open X-Embodiment dataset and a small version of the RT-1-X model, but not the RT-2-X model.

Google should actively pursue humanoid general purpose robot development, given its R&D here and talent base at Google DeepMind and affiliated partner network. We can’t rely on Tesla, a few robotics startups and China to do this alone.

Subscribe now

Further Viewing:

Thread #1

Thread #2

Thread #3

Thread #4


Watch Closely:

Google announced this important news in robotics history on October 3rd, 2023.

Read More in  AI Supremacy