This new system can teach a robot a simple household task within 20 minutes

While other types of AI, such as large language models, are trained on huge repositories of data scraped from the internet, the same can’t be done with robots, because the data needs to be physically collected. This makes it a lot harder to build and scale training databases.

Similarly, while it’s relatively easy to train robots to execute tasks inside a laboratory, these conditions don’t necessarily translate to the messy unpredictability of a real home.

To combat these problems, the team came up with a simple, easily replicable way to collect the data needed to train Dobb-E—using an iPhone attached to a reacher-grabber stick, the kind typically used to pick up trash. Then they set the iPhone to record videos of what was happening.

Volunteers in 22 homes in New York completed certain tasks using the stick, including opening and closing doors and drawers, turning lights on and off, and placing tissues in the trash. The iPhones’ lidar systems, motion sensors, and gyroscopes were used to record data on movement, depth, and rotation—important information when it comes to training a robot to replicate the actions on its own.

After they’d collected just 13 hours’ worth of recordings in total, the team used the data to train an AI model to instruct a robot in how to carry out the actions. The model used self-supervised learning techniques, which teach neural networks to spot patterns in data sets by themselves, without being guided by labeled examples.

The next step involved testing how reliably a commercially available robot called Stretch, which consists of a wheeled unit, a tall pole, and a retractable arm, was able to use the AI system to execute the tasks. An iPhone held in a 3D-printed mount was attached to Stretch’s arm to replicate the setup on the stick.

The researchers tested the robot in 10 homes in New York over 30 days, and it completed 109 household tasks with an overall success rate of 81%. Each task typically took Dobb-E around 20 minutes to learn: five minutes of demonstration from a human using the stick and attached iPhone, followed by 15 minutes of fine-tuning, when the system compared its previous training with the new demonstration.