Foundation Models for Robotics: RT-1, RT-2, and Beyond

4 minute read

Published:

TL;DR: Foundation models — large models pre-trained on broad data — are entering robotics. RT-1 demonstrated that a Transformer trained on 130k diverse robot demonstrations generalises to novel tasks. RT-2 takes this further by co-fine-tuning a billion-parameter vision-language model on robot data, enabling emergent generalisation to instructions never seen in robot training. Open-source models like Octo and OpenVLA are making these capabilities accessible.
RT-2 vision-language-action model
RT-2: Vision-Language-Action model for generalised robot control (Brohan et al., 2023)

The Case for Scale in Robot Learning

Individually trained robot policies are brittle: a model trained to pick apples often fails on oranges. The success of large language models taught us that scale — more data, more parameters, more compute — enables emergent generalisation. The central question for robotics is: does the same principle apply when the “language” is actions?

The answer emerging from RT-1, RT-2, and their successors is: yes, but it requires large, diverse robot datasets and architectures that can absorb and transfer that diversity.

RT-1: Transformer for Robot Learning

RT-1 (Brohan et al. 2022, arXiv:2212.06817) trained an 35M-parameter EfficientNet + Transformer architecture on a dataset of 130,000 demonstrations collected over 17 months by 13 robots in Google’s office kitchens. Tasks spanned picking, placing, opening drawers, and knocking over objects — 700+ distinct tasks with natural language instructions.

The architecture:

  • An EfficientNet-B3 image encoder processes each camera frame.
  • A TokenLearner module compresses visual tokens from 81 to 8.
  • A Transformer decoder attends over language tokens and 6 image frames to predict robot actions.
  • Output: a tokenised 7-DOF robot action (discrete bins for each joint + gripper).
a_t = argmax_a p_theta(a | o_{t-5:t}, l)

where \(o_{t-5:t}\) is a stack of recent observations and \(l\) is the language instruction.

RT-1 achieved 97% success on seen tasks and, crucially, ~25% success on novel tasks not in the training set — demonstrating that broad training improves generalisation beyond specialised single-task models.

Key Insight: The single most important finding from RT-1 is not the architecture — it is that scale and diversity of robot demonstrations matter. A policy trained across hundreds of tasks learns representations that transfer to new tasks, while a policy trained on a single task does not.

RT-2: VLMs as Robot Policies

RT-2 (Brohan et al. 2023, arXiv:2307.15818) makes a bolder move: directly fine-tune a large Vision-Language Model (PaLI-X, 55B parameters; or PaLM-E) on robot demonstration data, treating robot actions as additional tokens in the language model’s vocabulary.

The key insight: VLMs already encode rich semantic knowledge about objects, actions, and the physical world from internet-scale pre-training. By co-fine-tuning the VLM on robot data (web data and robot demonstrations simultaneously), RT-2 retains this general knowledge while acquiring robot-specific action generation.

Results showed remarkable emergent capabilities: RT-2 could follow instructions like “move the banana to the correct country flag” (requiring reasoning about geography) without any robot demonstrations of this task — it transferred knowledge from the language pre-training.

Open-Source: Octo and OpenVLA

Open-Source Ecosystem: RT-1 and RT-2 are proprietary. Octo (Ghosh et al. 2023) and OpenVLA (Kim et al. 2024) provide open-source generalist robot policies trained on the Open X-Embodiment dataset — a community effort aggregating robot demonstrations from 22 different research labs and embodiments.

Octo is a 93M-parameter Transformer trained on 800k demonstrations across diverse robots. It supports language and goal-image conditioning, can be fine-tuned to new robots in minutes, and achieves competitive performance with proprietary models on standard benchmarks.

OpenVLA (7B parameters) is based on Prismatic-7B, a VLM fine-tuned on 970k Open X-Embodiment demonstrations. It matches RT-2-55B on several benchmarks while being 7x smaller and open-source.

Generalisation to Novel Tasks

The core promise of foundation model approaches is systematic generalisation. Evaluations show that generalist robot policies can:

  • Follow novel language instructions through compositional reasoning
  • Adapt to new object instances not seen during training
  • Transfer across embodiments with brief fine-tuning
  • Exhibit emergent behaviours from language pre-training

Remaining challenges include long-horizon tasks, precise manipulation, and the fundamental data bottleneck — even 130k demonstrations is tiny compared to the billions of tokens used to train language models.

References

  • Brohan, A., et al. (2022). RT-1: Robotics Transformer for real-world control at scale. arXiv:2212.06817.
  • Brohan, A., et al. (2023). RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv:2307.15818.
  • Ghosh, D., et al. (2023). Octo: An open-source generalist robot policy. arXiv:2405.12213.
  • Kim, M. J., et al. (2024). OpenVLA: An open-source vision-language-action model. arXiv:2406.09246.
  • Padalkar, A., et al. (2023). Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv:2310.08864.
  • Bommasani, R., et al. (2021). On the opportunities and risks of foundation models. arXiv:2108.07258.