Learning to Learn Faster from Human Feedback with Language Model Predictive Control

Jacky Liang, Fei Xia, Wenhao Yu, Andy Zeng
Montserrat Gonzalez Arenas, Maria Attarian, Maria Bauza, Matthew Bennice, Alex Bewley, Adil Dostmohamed, Chuyuan Kelly Fu, Nimrod Gileadi, Marissa Giustina, Keerthana Gopalakrishnan, Leonard Hasenclever, Jan Humplik, Jasmine Hsu, Nikhil Joshi, Ben Jyenis, Chase Kew, Sean Kirmani, Tsang-Wei Edward Lee, Kuang-Huei Lee, Assaf Hurwitz Michaely, Joss Moore, Ken Oslund, Dushyant Rao, Allen Ren, Baruch Tabanpour, Quan Vuong, Ayzaan Wahid, Ted Xiao, Ying Xu, Vincent Zhuang, Peng Xu†, Erik Frey†, Ken Caluwaerts†, Tingnan Zhang†, Brian Ichter†, Jonathan Tompson†, Leila Takayama†, Vincent Vanhoucke†, Izhak Shafran†, Maja Mataric†, Dorsa Sadigh†, Nicolas Heess†, Kanishka Rao†, Nik Stewart†, Jie Tan†, Carolina Parada†
Google DeepMind
corresponding authors in alphabetical order, †advising leads
all other authors in alphabetical order

Abstract

Large language models (LLMs) have been shown to exhibit a wide range of capabilities, such as writing robot code from language commands - enabling non-experts to direct robot behaviors, modify them based on feedback, or compose them to perform new tasks. However, these capabilities (driven by in-context learning) are limited to short-term interactions, where users' feedback remains relevant for only as long as it fits within the context size of the LLM, and can be forgotten over longer interactions. In this work, we investigate fine-tuning the robot code-writing LLMs to remember their in-context interactions and improve their teachability, i.e., how efficiently they adapt to human inputs (measured by the average number of corrections before the user considers the task successful). Our key observation is that when human-robot interactions are formulated as a partially observable Markov decision process (in which human language inputs are observations, and robot code outputs are actions), then training an LLM to complete previous interactions can be viewed as training a transition dynamics model - which can be combined with classic robotics techniques such as model predictive control (MPC) to discover shorter paths to success. This gives rise to Language Model Predictive Control (LMPC), a framework that fine-tunes PaLM 2 to improve its teachability on 78 tasks across 5 robot embodiments - improving non-expert teaching success rates on unseen tasks by 26.9% while reducing the average number of human corrections from 2.4 to 1.9. Experiments show that LMPC also produces strong meta-learners, improving the success rate of in-context learning new tasks on unseen robot embodiments and APIs by 31.5%.

Supplemental Video:

Language Model Predictive Control

Given a dataset of users teaching robots new tasks with language (represented as text inputs and code outputs from online in-context learning – left), LMPC-Rollouts is trained to predict subsequent inputs and outputs conditioned on the current chat history (middle), and uses MPC (receding horizon control) for inference-time search to return the next best action (the one with the fewest expected corrections before success). LMPC-Skip is an alternate variant that is trained to directly predict the last action (right). Both LMPC variants accelerate fast robot adaptation via in-context learning.
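To make the receding-horizon search concrete, below is a minimal sketch of one LMPC-Rollouts inference step. The sample_continuations callable stands in for the fine-tuned LLM (sampling alternating user-feedback/robot-code continuations of the current chat); the Turn structure, the "success" marker, and the parameter values are illustrative assumptions, not the paper's actual implementation.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Turn:
        user_feedback: str   # predicted human input (observation)
        robot_code: str      # predicted robot code (action)

    def lmpc_rollouts_step(
        sample_continuations: Callable[[str, int], List[List[Turn]]],
        chat_history: str,
        num_rollouts: int = 16,
        horizon: int = 4,
    ) -> str:
        """Return the robot-code action with the fewest predicted corrections."""
        best_action, best_cost = "", float("inf")
        for rollout in sample_continuations(chat_history, num_rollouts):
            turns = rollout[:horizon]
            if not turns:
                continue
            # Cost = number of predicted turns before the user signals success.
            cost = next(
                (i for i, t in enumerate(turns) if "success" in t.user_feedback.lower()),
                horizon,
            )
            if cost < best_cost:
                best_action, best_cost = turns[0].robot_code, cost
        return best_action

In this sketch, the chosen code would be executed, the real user feedback appended to the chat history, and the search repeated - the receding-horizon loop that gives LMPC its name.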



Experiments

Our experiments evaluate how much the various proposed fine-tuning strategies (slow adaptation) improve online in-context learning (fast adaptation) when humans interactively teach robots via natural language feedback. Evaluations are performed on 78 robot tasks across 5 robot embodiments in simulation and 2 on real hardware. We specifically explore the following questions:

  • How much does fine-tuning improve teachability, especially on test tasks?
  • How do LMPC-Rollouts and LMPC-Skip compare?
  • What are the benefits of Top-User Conditioning?
  • Does fine-tuning enable cross-embodiment generalization?
  • Can iterative fine-tuning further improve teachability?


Our LLMs fine-tuned with LMPC-Rollouts and LMPC-Skip improve teachability over the base model (PaLM 2-S) and outperform a RAG baseline across all embodiments. LMPC-Skip overfits to train tasks (left), while LMPC-Rollouts generalizes better (i.e., is more teachable and responsive to feedback) on unseen test tasks (right) for multi-turn sessions (those with more than one chat turn).



Comparing the base and fine-tuned models across all embodiments. Metrics:

  • Success: overall success rate on all tasks.
  • Num Chat Turns: mean number of chat turns for successful chat sessions.
  • Good Rating: proportion of chat turns that received a positive user rating.
  • Successful Tasks: proportion of tasks with at least one successful chat session.
  • 1-turn Success: proportion of chat sessions that were successful with just one chat turn.
  • 2+ turn Success: proportion of chat sessions that were successful with two or more chat turns.

For both train and test tasks, LMPC-Skip achieves the lowest Num Chat Turns for successful chat sessions, as well as the highest 1-turn Success rate. This reflects how LMPC-Skip is trained to predict the final code directly. However, LMPC-Rollouts has the highest 2+ turn Success rate, suggesting it is the most amenable to corrective feedback after an incorrect first response. To maximize performance in practice, these results suggest using LMPC-Skip to respond to the initial user instruction, then LMPC-Rollouts to respond to subsequent user feedback. While RAG improves upon the base model in overall success rate, it achieves a lower Successful Tasks rate than the base model on test tasks, suggesting that while RAG may be proficient at increasing the success rate of tasks similar to the retrieved examples, it struggles to perform well on novel tasks.
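As a hedged illustration of that practical recommendation, the sketch below dispatches between two hypothetical callables, skip_model and rollouts_model (standing in for the two fine-tuned variants), depending on whether the chat contains only the initial instruction or also corrective feedback; the function names and turn representation are assumptions for illustration only.

    from typing import Callable, List

    def respond(
        chat_history: List[str],
        skip_model: Callable[[List[str]], str],
        rollouts_model: Callable[[List[str]], str],
    ) -> str:
        """Hybrid policy sketch: LMPC-Skip first, LMPC-Rollouts afterwards."""
        if len(chat_history) <= 1:
            # Only the initial instruction so far: LMPC-Skip has the highest
            # 1-turn success rate, so answer with it.
            return skip_model(chat_history)
        # Corrective feedback is present: LMPC-Rollouts has the highest
        # 2+ turn success rate, so hand subsequent turns to it.
        return rollouts_model(chat_history)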



We evaluate our approach on a subset of tasks for the Mobile Manipulator and the Robot Dog in the real world. For each task, we ask users to perform four teaching sessions directly on the real robot. The table compares PaLM 2-S and LMPC-Rollouts. While the Num Chat Turns for successful sessions is about the same for both models on these tasks, LMPC-Rollouts achieves much higher success rates across all tasks.


Teaching Demos

Below we show some complex robot behaviors, across many robot embodiments, that can be taught using our system.



Below we show that our method can be applied to real robots, with a significant difference in robot behavior before and after teaching.


Below we show a chat session rolled out in the real world, applying reward-conditioned distillation to change robot motions.



BibTeX

@article{liang2024learning,
      title={Learning to Learn Faster from Human Feedback with Language Model Predictive Control}, 
      author={Jacky Liang and Fei Xia and Wenhao Yu and Andy Zeng and Montserrat Gonzalez Arenas and Maria Attarian and Maria Bauza and Matthew Bennice and Alex Bewley and Adil Dostmohamed and Chuyuan Kelly Fu and Nimrod Gileadi and Marissa Giustina and Keerthana Gopalakrishnan and Leonard Hasenclever and Jan Humplik and Jasmine Hsu and Nikhil Joshi and Ben Jyenis and Chase Kew and Sean Kirmani and Tsang-Wei Edward Lee and Kuang-Huei Lee and Assaf Hurwitz Michaely and Joss Moore and Ken Oslund and Dushyant Rao and Allen Ren and Baruch Tabanpour and Quan Vuong and Ayzaan Wahid and Ted Xiao and Ying Xu and Vincent Zhuang and Peng Xu and Erik Frey and Ken Caluwaerts and Tingnan Zhang and Brian Ichter and Jonathan Tompson and Leila Takayama and Vincent Vanhoucke and Izhak Shafran and Maja Mataric and Dorsa Sadigh and Nicolas Heess and Kanishka Rao and Nik Stewart and Jie Tan and Carolina Parada},
      year={2024},
      eprint={2402.11450},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}

Acknowledgments

We thank John Guilyard for his expert animations, and Giles Ruscoe for beautiful renderings. We thank Steven Bohez, Yuval Tassa, Tom Erez, Murilo Martins, Rugile Pevceviciute, David Rendleman, and Connor Schenck for their dedication to ensuring we had strong simulated environments. We thank Travis Armstrong, Noah Brown, Spencer Goodrich, Craig Hickman, Atil Iscen, Jerad Kirkland, Jason Powell, Stefano Saliceti, Ron Sloat, Sergey Yaroshenko, Eddie Yu, Grace Vesom, and Jake Varley for additional robot platform support and robot lab operations. Special thanks to Michael Ahn, Kendra Byrne, Aleksandra Faust, René Wagner, Yuheng Kuang, Yao Lu, Yansong Pang, and Zhuo Xu for supporting this project. We thank all the teachers who volunteered to collect the robot teaching data. We also thank the Google DeepMind Visualization and Human Interaction teams for their help with the development and support of the chat interface. We also want to thank the entire Google DeepMind Robotics team whose tireless efforts can be traced to additional support on this paper. This includes Administrative, Product, Programs, and Strategy teams whose contributions impact all of the team’s successes. We also want to thank our friends in Google DeepMind and Google Research for their guidance, inspirational research, and even direct contributions.