Vision-Language-Action (VLA)

End-to-end physical-AI policies: images + language → actions.

hard · Learning

Why it matters in robotics

Vision-Language-Action models are the headline of "physical AI" and the fastest-moving area in robot learning, so they come up constantly in research and applied-robotics interviews right now. Expect to be asked how a VLA differs from classic modular perception-planning-control stacks, why fine-tuning an internet-pretrained VLM gives better generalization than training a policy from scratch, and the tradeoffs of action representation (discrete token binning vs. FAST tokens vs. diffusion/flow-matching decoding). Interviewers also probe the canonical systems lineage (RT-1, RT-2, Open X-Embodiment, Octo, OpenVLA, pi-0) and the practical pain points: data scale, cross-embodiment normalization, control frequency/latency from large backbones, evaluation, and safety. Strong candidates can reason about both the modeling choices and the systems constraints of deploying a large model in a real-time control loop.

At a glance

[Diagram: Camera image + language instruction → (encode pixels + text) → Pretrained vision-language backbone (internet-pretrained) → (fused embeddings) → Action decoder: token binning / FAST tokens or diffusion / flow-matching → (decode action chunk) → Robot actions (end-effector deltas, gripper) → (execute, observe next state) → back to camera image]

A VLA policy maps a camera image plus a language instruction through a pretrained vision-language backbone to robot actions, decoded either as discrete action tokens or via a continuous diffusion/flow-matching head.
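
The loop in the diagram (observe, decode an action chunk, execute, observe again) can be sketched as follows. This is a minimal illustration, not any particular system's implementation: `stub_policy`, `ToyEnv`, `CHUNK_LEN`, and `ACTION_DIM` are hypothetical names, and the policy returns random actions in place of a real backbone and decoder.

```python
import numpy as np

CHUNK_LEN = 8    # actions per predicted chunk (assumed value)
ACTION_DIM = 7   # e.g. 6 end-effector deltas + 1 gripper command (assumed)


def stub_policy(image, instruction, rng):
    """Hypothetical stand-in for the VLM backbone plus action decoder.

    Returns one action chunk of shape (CHUNK_LEN, ACTION_DIM) in [-1, 1].
    """
    return rng.uniform(-1.0, 1.0, size=(CHUNK_LEN, ACTION_DIM))


class ToyEnv:
    """Hypothetical environment whose state simply integrates the actions."""

    def __init__(self):
        self.state = np.zeros(ACTION_DIM)

    def observe(self):
        # A real robot would return a camera frame; a dummy image stands in.
        return np.zeros((4, 4, 3), dtype=np.uint8)

    def step(self, action):
        self.state = self.state + 0.01 * action


def run_episode(instruction, num_chunks=3, seed=0):
    """Run the observe -> decode chunk -> execute loop for a few chunks."""
    rng = np.random.default_rng(seed)
    env = ToyEnv()
    policy_calls = 0
    steps = 0
    for _ in range(num_chunks):
        image = env.observe()                         # observe current state
        chunk = stub_policy(image, instruction, rng)  # one forward pass per chunk
        policy_calls += 1
        for action in chunk:                          # execute the chunk open-loop
            env.step(action)
            steps += 1
    return policy_calls, steps, env.state
```

With these assumed settings, each policy call yields `CHUNK_LEN` control steps, so the expensive forward pass runs once per chunk rather than once per step. The cost is that the actions inside a chunk are executed without new observations.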

What to study

  • Action representation: discrete action-token binning (RT-2/OpenVLA) vs. frequency-space tokenization (FAST/DCT) vs. continuous diffusion/flow-matching action heads (Octo, pi-0), and the precision/frequency tradeoffs of each.
  • Why internet pretraining helps: how a finetuned VLM backbone transfers semantic and visual knowledge so the policy generalizes to unseen objects, attributes, and language not present in the robot demos.
  • Key systems and their lineage: RT-1 (transformer policy), RT-2 (VLM-as-policy), Open X-Embodiment/RT-X (cross-embodiment data), Octo (open generalist diffusion policy), OpenVLA (7B open VLA), pi-0 and pi0-FAST (flow-matching vs. autoregressive).
  • Deployment realities and open problems: control frequency/latency vs. large-backbone inference (action chunking, async inference), data scale and cross-embodiment normalization, evaluation/reproducibility, and safety of language-conditioned action.
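
To make the action-representation tradeoff in the first bullet concrete, here is a rough sketch of two ideas: uniform per-dimension binning, and keeping only the low-frequency DCT coefficients of an action chunk. It assumes actions are already normalized to [-1, 1] and uses 256 bins as an illustrative default. The function names are hypothetical, and the DCT part only illustrates the frequency-space idea; it is not the actual FAST tokenizer.

```python
import numpy as np


def bin_actions(actions, num_bins=256, low=-1.0, high=1.0):
    """Map continuous actions in [low, high] to integer ids in [0, num_bins - 1]."""
    clipped = np.clip(actions, low, high)
    scaled = (clipped - low) / (high - low)  # now in [0, 1]
    return np.minimum((scaled * num_bins).astype(np.int64), num_bins - 1)


def unbin_actions(tokens, num_bins=256, low=-1.0, high=1.0):
    """Map bin ids back to the centre of each bin."""
    return low + (tokens + 0.5) * (high - low) / num_bins


def dct_matrix(n):
    """Orthonormal DCT-II matrix of shape (n, n); rows are basis vectors."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    mat[0, :] /= np.sqrt(2.0)
    return mat


def dct_compress(chunk, keep):
    """Keep the `keep` lowest-frequency DCT coefficients along the time axis.

    `chunk` has shape (T, D): T time steps, D action dimensions.
    """
    mat = dct_matrix(chunk.shape[0])
    coeffs = mat @ chunk      # transform each action dimension over time
    coeffs[keep:] = 0.0       # drop high-frequency content
    return mat.T @ coeffs     # orthonormal, so the inverse is the transpose
```

With uniform bins, the round-trip error per dimension is at most half a bin width, `(high - low) / (2 * num_bins)`, and one token is spent per dimension per time step. Truncating DCT coefficients instead represents a smooth chunk with fewer numbers, at the cost of losing high-frequency detail.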

Study by time budget

Pick the path that fits the time you have before your interview.

  1. pi-0: Our First Generalist Policy (Physical Intelligence blog) · Article · Physical Intelligence · ~25 min
  2. Robot Foundation Models (talk) · Video · Sergey Levine, UC Berkeley / Physical Intelligence · ~1 hr

Practice questions (2)