
Google has unveiled its first vision-language-action (VLA) model. The model lets a robot teach itself actions from text and images on the internet, so less time is spent training the robot, because the underlying model learns in much the same way humans do.

Google’s new VLA model, called RT-2, removes much of the complexity of training foundation models for robots. RT-2 trains itself on text and images from the internet, which allows a robot to perform actions it was never explicitly trained on. “In other words, RT-2 can speak robot,” writes Vincent Vanhoucke, head of robotics at Google DeepMind.
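To give a rough idea of what “speaking robot” means in practice: a VLA model treats low-level robot commands as just another string of tokens it can output, alongside ordinary language. The sketch below is purely illustrative and is not Google’s code; the eight-token layout loosely follows the action representation described in the RT-2 paper, while the `RobotAction` class and `decode_action` helper are assumptions made for this example.

```python
# Illustrative sketch only: RT-2 is reported to emit robot actions as text
# tokens, so a hypothetical decoder might turn the model's output string
# into a structured command. The exact token layout here is an assumption.
from dataclasses import dataclass


@dataclass
class RobotAction:
    terminate: bool            # True = end the episode
    delta_xyz: tuple           # discretized end-effector translation
    delta_rpy: tuple           # discretized end-effector rotation
    gripper: int               # discretized gripper opening


def decode_action(token_string: str) -> RobotAction:
    """Parse a space-separated string of 8 integer tokens into an action."""
    tokens = [int(t) for t in token_string.split()]
    if len(tokens) != 8:
        raise ValueError(f"expected 8 action tokens, got {len(tokens)}")
    return RobotAction(
        terminate=bool(tokens[0]),
        delta_xyz=tuple(tokens[1:4]),
        delta_rpy=tuple(tokens[4:7]),
        gripper=tokens[7],
    )


if __name__ == "__main__":
    # A vision-language model prompted with a camera image and the
    # instruction "pick up the apple" might answer with a token string
    # like this instead of a sentence.
    print(decode_action("0 128 91 241 5 101 127 217"))
```

Because the actions come out as text, the same model can answer a question in words one moment and drive a robot arm the next, which is what lets web-scale knowledge carry over into physical behaviour.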

More complex than a language model

According to Vanhoucke, training a language model is much simpler. “Their training is not just about, let’s say, learning everything there is to know about an apple: how it grows, its physical properties, or even that one supposedly landed on Sir Isaac Newton’s head.” A robot must also turn that information into actions and associations: “A robot must be able to recognize an apple in context, distinguish it from a red ball, understand what it looks like and, above all, know how to pick it up.”

RT-2 is said to be capable of exactly that. It does not succeed in every situation, but it handled 62 percent of the “new” scenarios it was tested on, performing actions it was never taught and doing roughly twice as well as its predecessor, RT-1. The technical paper on RT-2 adds a small caveat, however: the researchers note that the robot cannot perform entirely new motions, only new variations and combinations of the actions it has already learned.

“Although there is still an enormous amount of work to be done to enable helpful robots in human-centered environments, RT-2 shows us that an exciting future for robotics is within reach,” Vanhoucke concludes.

Tip: Google wants to outsource application development to autonomous robots