Late last week, Google research scientist Fei Xia sat in the center of a bright, open-plan kitchen and typed a command into a laptop connected to a one-armed, wheeled robot resembling a large floor lamp. “I’m hungry,” he wrote. The robot promptly zoomed over to a nearby countertop, gingerly picked up a bag of multigrain chips with a large plastic pincer, and wheeled over to Xia to offer up a snack.
The most impressive thing about that demonstration, held in Google’s robotics lab in Mountain View, California, was that no human coder had programmed the robot to understand what to do in response to Xia’s command. Its control software had learned how to translate a natural-language phrase into a sequence of physical actions using millions of pages of text scraped from the web.
That means a person doesn’t have to use specific preapproved wording to issue commands, as can be necessary with virtual assistants such as Alexa or Siri. Tell the robot “I’m parched,” and it should try to find you something to drink; tell it “Whoops, I just spilled my drink,” and it ought to come back with a sponge.
“In order to deal with the diversity of the real world, robots need to be able to adapt and learn from their experiences,” Karol Hausman, a senior research scientist at Google, said during the demo, which also included the robot bringing a sponge over to clean up a spill. To interact with humans, machines must learn to grasp how words can be put together in a multitude of ways to generate different meanings. “It’s up to the robot to understand all the little subtleties and intricacies of language,” Hausman said.
Google’s demo was a step toward the longstanding goal of creating robots capable of interacting with humans in complex environments. In the past few years, researchers have found that feeding huge amounts of text taken from books or the web into large machine learning models can yield programs with impressive language skills, including OpenAI’s text generator GPT-3. By digesting the many forms of writing online, software can pick up the ability to summarize or answer questions about text, generate coherent articles on a given subject, or even hold cogent conversations.
Google and other Big Tech firms are making wide use of these large language models for search and advertising. A number of companies offer the technology via cloud APIs, and new services have sprung up applying AI language capabilities to tasks like generating code or writing advertising copy. Google engineer Blake Lemoine was recently fired after publicly warning that a chatbot powered by the technology, called LaMDA, might be sentient. A Google vice president who remains employed at the company wrote in The Economist that chatting with the bot felt like “talking to something intelligent.”
Despite those strides, AI programs are still prone to becoming confused or regurgitating gibberish. Language models trained with web text also lack a grasp of truth and often reproduce biases or hateful language found in their training data, suggesting careful engineering may be required to reliably guide a robot without it running amok.
The robot demonstrated by Hausman was powered by the most powerful language model Google has announced so far, known as PaLM. It is capable of many tricks, including explaining, in natural language, how it comes to a particular conclusion when answering a question. The same approach is used to generate a sequence of steps that the robot will execute to perform a given task.
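The mechanism described above — a language model proposing steps that the robot then carries out — can be illustrated with a toy sketch. Everything here is invented for illustration: the skill names, the scoring functions, and the greedy planning loop are hypothetical stand-ins, not Google’s actual PaLM interface or robot control stack. The idea is that at each step, a candidate skill is ranked by how useful the language model thinks it is, weighted by whether the robot can physically do it yet.

```python
# Hypothetical sketch: combine a language model's "usefulness" score for each
# candidate skill with a feasibility (affordance) score from the robot, and
# greedily build a sequence of steps. All names and scores are illustrative.

SKILLS = ["find chips", "pick up chips", "bring chips to user", "done"]

def plan(command, llm_score, affordance, max_steps=5):
    """Greedily pick skills: at each step, choose the skill whose
    (language-model usefulness) x (physical feasibility) is highest."""
    steps = []
    for _ in range(max_steps):
        best = max(
            SKILLS,
            key=lambda s: llm_score(command, steps, s) * affordance(s, steps),
        )
        if best == "done":  # the model judges the task complete
            break
        steps.append(best)
    return steps

# Stub scoring functions standing in for a real language model and for the
# robot's learned estimates of what it can currently do.
def toy_llm_score(command, done, skill):
    # Pretend the model strongly prefers the next step in a sensible order.
    order = {"find chips": 0, "pick up chips": 1, "bring chips to user": 2, "done": 3}
    return 1.0 if order[skill] == len(done) else 0.1

def toy_affordance(skill, done):
    # Picking up the chips is only feasible once they have been located.
    if skill == "pick up chips" and "find chips" not in done:
        return 0.0
    return 1.0

print(plan("I'm hungry", toy_llm_score, toy_affordance))
# -> ['find chips', 'pick up chips', 'bring chips to user']
```

In a real system the two stub functions would be replaced by, respectively, likelihoods from the language model and value estimates learned from the robot’s own experience; the toy version only shows how the two signals might be combined.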