Language-Conditioned Robot Policies


Published: September 17, 2025

TL;DR: Language-conditioned robots can receive instructions in natural language and translate them into physical actions. SayCan grounds LLM task plans in robot affordances; CLIPort uses CLIP features to ground language in spatial manipulation; and modern VLMs serve as end-to-end instruction-following policies, enabling robots to understand complex, context-dependent commands.

Figure: SayCan, language-conditioned robot skill selection (Ahn et al., 2022)

Grounding Language to Actions

Large language models (LLMs) show strong commonsense reasoning, broad world knowledge, and instruction-following ability. The challenge for robotics is grounding: translating abstract language representations into concrete physical robot actions. A robot that understands “bring me something to drink” must parse the instruction, identify relevant objects, plan a sequence of manipulation primitives, and execute them, all while respecting the physical constraints of its embodiment.

The grounding problem is hard because language operates at a semantic level (“the mug on the left”) while robot control requires precise geometric specifications (joint angles, end-effector positions). Bridging these levels of abstraction is the central challenge of language-conditioned robotics.

SayCan: Affordance-Weighted LLM Planning

SayCan (Ahn et al. 2022, arXiv:2204.01691) elegantly decomposes the grounding problem into two components:

Language probability: a large language model (PaLM) scores how plausible each candidate skill string is given the user’s instruction and task context.
Affordance probability: for each candidate skill, a learned value function \(V(s, \text{skill})\) estimates the probability that this skill can be successfully executed from the current robot state.

The robot selects the skill maximising the product:

\[ \text{skill}^* = \arg\max_i \; p_{\text{LLM}}(\text{skill}_i \mid \text{instruction}, \text{context}) \cdot V(s, \text{skill}_i) \]

This ensures that selected skills are both semantically appropriate (per the LLM) and physically executable (per the affordance model). SayCan was evaluated on 101 natural-language instructions in a real kitchen environment, drawing on a skill library that covers picking, placing, and opening tasks.
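The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the SayCan implementation: the skill names, LLM log-probabilities, and affordance values are made-up numbers, and `select_skill` is a hypothetical helper.

```python
import math


def select_skill(llm_log_probs, affordances):
    """Return the skill maximising p_LLM(skill | instruction) * V(s, skill).

    llm_log_probs: dict mapping skill string -> log-probability from the LLM.
    affordances:   dict mapping skill string -> value-function estimate in [0, 1].
    """
    scores = {}
    for skill, log_p in llm_log_probs.items():
        p_llm = math.exp(log_p)
        # A skill with no affordance estimate is treated as infeasible.
        scores[skill] = p_llm * affordances.get(skill, 0.0)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical scores: the LLM prefers the sponge, but none is reachable.
llm_log_probs = {
    "pick up the sponge": math.log(0.50),
    "pick up the apple": math.log(0.30),
    "go to the table": math.log(0.20),
}
affordances = {
    "pick up the sponge": 0.10,
    "pick up the apple": 0.90,
    "go to the table": 0.80,
}

best_skill, scores = select_skill(llm_log_probs, affordances)
print(best_skill)
```

With these numbers the LLM's top choice loses to a skill that is less likely under the language model but far more likely to succeed from the current state.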

Key Insight: SayCan recognises that LLMs alone cannot plan for robots, because they lack knowledge of physical feasibility. By multiplying LLM scores with affordance scores, the system balances what makes semantic sense with what the robot can actually do. Neither alone is sufficient.

CLIP for Robot Manipulation

CLIP (Contrastive Language-Image Pre-training, Radford et al. 2021) jointly trains image and text encoders so that semantically related image-text pairs have similar embeddings. This gives CLIP a zero-shot visual grounding ability: given the text “red mug on the left”, it can rank candidate images or image crops by how well they match the description, without task-specific training.
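As a toy illustration of matching text to images through a shared embedding space, the sketch below ranks candidate image crops by cosine similarity to a text embedding and turns the similarities into probabilities with a temperature-scaled softmax. The 3-dimensional vectors, the crop names, and the temperature value are all assumptions chosen for readability; real CLIP embeddings come from the trained encoders and have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_probabilities(text_emb, image_embs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities."""
    logits = [cosine(text_emb, e) / temperature for e in image_embs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings standing in for encoder outputs.
text_emb = [0.9, 0.1, 0.0]  # "red mug"
crop_embs = {
    "red mug crop": [1.0, 0.0, 0.1],
    "blue bowl crop": [0.0, 1.0, 0.2],
    "table crop": [0.1, 0.1, 1.0],
}

names = list(crop_embs)
probs = match_probabilities(text_emb, [crop_embs[n] for n in names])
best = names[probs.index(max(probs))]
print(best)
```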

CLIPort (Shridhar et al. 2022, arXiv:2109.12098) integrates CLIP features into a manipulation policy for tabletop pick-and-place. The architecture combines:

Semantic stream: CLIP features encode the language instruction and global visual context.
Spatial stream: a standard convolutional network encodes fine-grained spatial information for precise grasp localisation.

Language features modulate the semantic stream via element-wise multiplication, and the two streams are fused through lateral connections to produce a pixel-level picking map (where to pick) and a placing map (where to place). Trained on as few as 10–100 language-labelled demonstrations per task, CLIPort generalises to novel instructions and object appearances in several of its benchmark tasks.
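Whatever the fusion details, the output is a dense map over pixels, and the action is read off as the highest-scoring location. The sketch below shows only that last step on a hand-written 3×4 grid of logits; the grid values and the helper names `softmax2d` and `best_pixel` are assumptions for illustration, not part of CLIPort's code.

```python
import math


def softmax2d(logits):
    """Normalise a 2D grid of logits into a probability map."""
    m = max(x for row in logits for x in row)
    exps = [[math.exp(x - m) for x in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def best_pixel(heatmap):
    """Return the (row, col) index of the largest value in the map."""
    best = (0, 0)
    for r, row in enumerate(heatmap):
        for c, val in enumerate(row):
            if val > heatmap[best[0]][best[1]]:
                best = (r, c)
    return best


# Hypothetical fused logits for "pick the red mug"; the peak marks the mug.
pick_logits = [
    [0.1, 0.2, 0.0, 0.1],
    [0.0, 3.5, 0.4, 0.2],
    [0.1, 0.3, 0.2, 0.0],
]

pick_map = softmax2d(pick_logits)
pick_rc = best_pixel(pick_map)
print(pick_rc)
```

The placing map is read out the same way, conditioned on the chosen pick location.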

LLMs as Task Planners

Beyond grounding to atomic skills, LLMs can plan entire multi-step task sequences. Given an instruction and a description of available skills, an LLM generates a plan:

Instruction: "Set the table for dinner"
Plan:
  1. pick up plate, place at table position A
  2. pick up fork, place left of plate
  3. pick up knife, place right of plate
  4. pick up glass, place above plate

This approach (used in systems like Inner Monologue, Huang et al. 2022) works when skills are reliably executable and the LLM’s world model is accurate. Failures arise when the LLM generates physically impossible plans (e.g., stacking objects that cannot balance) or when skill execution fails, causing the plan to go off-track.

Inner Monologue addresses this by providing the LLM with natural language feedback from the environment (object detection, success/failure signals) to enable re-planning in a closed loop.
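A closed loop of this kind can be sketched as follows. Everything here is a stand-in: `toy_planner` replaces the LLM call with a fixed plan lookup, `make_flaky_executor` simulates a skill that fails once, and the transcript format is an assumption rather than the prompt format used by Inner Monologue.

```python
PLAN = [
    "pick up plate",
    "place plate at position A",
    "pick up fork",
    "place fork left of plate",
]


def toy_planner(instruction, transcript):
    """Hypothetical stand-in for an LLM: pick the next skill from text feedback."""
    done = {
        line.split(": ", 1)[1]
        for line in transcript
        if line.startswith("Success: ")
    }
    for step in PLAN:
        if step not in done:
            return step  # a failed step is not in `done`, so it is retried
    return None  # every step confirmed: the task is finished


def make_flaky_executor(fail_once):
    """Simulated skill executor: each skill in `fail_once` fails the first time."""
    already_failed = set()

    def execute(skill):
        if skill in fail_once and skill not in already_failed:
            already_failed.add(skill)
            return False
        return True

    return execute


def run_closed_loop(instruction, planner, execute, max_steps=10):
    """Alternate planning and execution, logging feedback as plain text."""
    transcript = []
    for _ in range(max_steps):
        skill = planner(instruction, transcript)
        if skill is None:
            break
        ok = execute(skill)
        transcript.append(("Success: " if ok else "Failure: ") + skill)
    return transcript


transcript = run_closed_loop(
    "Set the table for dinner",
    toy_planner,
    make_flaky_executor({"pick up fork"}),
)
for line in transcript:
    print(line)
```

The failed grasp shows up in the transcript as text, and the planner's next call sees it and issues the same skill again instead of continuing blindly.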

VLMs as End-to-End Instruction Followers

The most direct approach conditions the full robot policy end-to-end on vision and language. Models like RT-2, OpenVLA, and Octo accept an image observation and a language instruction and directly output motor commands, bypassing explicit task planning and skill selection.
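One way a token-generating model can emit motor commands is to discretise each continuous action dimension into a fixed number of bins, so that an action becomes a short sequence of integer tokens. RT-2 reports using 256 bins per dimension; the action range of [-1, 1], the four example values, and the helper names below are assumptions for illustration.

```python
def tokenize(values, low, high, n_bins=256):
    """Map continuous action values to integer bin indices in [0, n_bins - 1]."""
    tokens = []
    for v in values:
        clipped = min(max(v, low), high)
        idx = int((clipped - low) / (high - low) * n_bins)
        tokens.append(min(idx, n_bins - 1))  # the upper bound falls in the last bin
    return tokens


def detokenize(tokens, low, high, n_bins=256):
    """Map bin indices back to continuous values at the bin centres."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


# Hypothetical normalised action, e.g. end-effector deltas and a gripper command.
action = [0.0, -1.0, 1.0, 0.25]
tokens = tokenize(action, low=-1.0, high=1.0)
decoded = detokenize(tokens, low=-1.0, high=1.0)
print(tokens)
```

The round trip loses at most half a bin width per dimension, which is the price of treating control as next-token prediction.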

RT-2 and OpenVLA build on pre-trained VLMs, whose rich semantic representations help them generalise to novel instruction phrasings, novel objects, and, to a more limited extent, task types that were not present in the robot training data. These abilities are attributed to the breadth of internet-scale pre-training. Octo, by contrast, is a generalist policy trained on robot data without a VLM backbone.

References

Ahn, M., et al. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv:2204.01691.
Shridhar, M., et al. (2022). CLIPort: What and where pathways for robotic manipulation. CoRL 2021. arXiv:2109.12098.
Radford, A., et al. (2021). Learning transferable visual models from natural language supervision. ICML 2021 (CLIP).
Huang, W., et al. (2022). Inner monologue: Embodied reasoning through planning with language models. CoRL 2022.
Zeng, A., et al. (2022). Socratic models: Composing zero-shot multimodal reasoning with language. arXiv:2204.00598.
Brohan, A., et al. (2023). RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv:2307.15818.

Alessio Borgi
