Vision-Language-Action (VLA)

Why it matters in robotics

Vision-Language-Action (VLA) models are the headline of "physical AI" and the fastest-moving area in robot learning, so they come up constantly in research and applied-robotics interviews. Expect to be asked how a VLA differs from a classic modular perception-planning-control stack, why fine-tuning an internet-pretrained VLM generalizes better than training a policy from scratch, and what the tradeoffs are between action representations (discrete token binning vs. FAST tokens vs. diffusion/flow-matching decoding). Interviewers also probe the canonical systems lineage (RT-1, RT-2, Open X-Embodiment, Octo, OpenVLA, pi0) and the practical pain points: data scale, cross-embodiment normalization, control frequency and latency with large backbones, evaluation, and safety. Strong candidates can reason about both the modeling choices and the systems constraints of deploying a large model in a real-time control loop.

At a glance

A VLA policy maps a camera image plus a language instruction through a pretrained vision-language backbone to robot actions, decoded either as discrete action tokens or via a continuous diffusion/flow-matching head.
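
The discrete-token decoding path mentioned above can be illustrated with a minimal sketch. It assumes each action dimension is clipped to a known range and split into 256 uniform bins, with the bin index serving as the action token; the range of [-1, 1], the 7-dimensional action, and the function names are illustrative assumptions, not the code of any particular system.

```python
import numpy as np

N_BINS = 256  # assumed number of uniform bins per action dimension


def actions_to_tokens(actions, low, high, n_bins=N_BINS):
    """Map continuous actions to integer bin indices in [0, n_bins - 1]."""
    actions = np.clip(np.asarray(actions, dtype=np.float64), low, high)
    scaled = (actions - low) / (high - low)  # normalized to [0, 1]
    # The upper edge (scaled == 1) would land in bin n_bins, so cap it.
    return np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)


def tokens_to_actions(tokens, low, high, n_bins=N_BINS):
    """Decode bin indices back to continuous values at the bin centers."""
    tokens = np.asarray(tokens, dtype=np.float64)
    return low + (tokens + 0.5) / n_bins * (high - low)


# Hypothetical 7-DoF action (e.g. 6 end-effector deltas + gripper), range [-1, 1].
low = np.full(7, -1.0)
high = np.full(7, 1.0)
action = np.array([0.10, -0.25, 0.0, 0.5, -1.0, 1.0, 0.33])

tokens = actions_to_tokens(action, low, high)
decoded = tokens_to_actions(tokens, low, high)
max_error = float(np.max(np.abs(decoded - action)))

print("tokens:", tokens)
print("max round-trip error:", max_error)
```

The round-trip error is bounded by half a bin width, which is the precision cost of discretization: a finer grid needs more tokens in the vocabulary, and each action dimension still costs one autoregressive decoding step.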

What to study

✓ Action representation: discrete action-token binning (RT-2/OpenVLA) vs. frequency-space tokenization (FAST/DCT) vs. continuous diffusion/flow-matching action heads (Octo, pi0), and the precision/frequency tradeoffs of each.
✓ Why internet pretraining helps: how a fine-tuned VLM backbone transfers semantic and visual knowledge so the policy generalizes to unseen objects, attributes, and language not present in the robot demos.
✓ Key systems and their lineage: RT-1 (transformer policy), RT-2 (VLM-as-policy), Open X-Embodiment/RT-X (cross-embodiment data), Octo (open generalist diffusion policy), OpenVLA (7B open VLA), pi0 and pi0-FAST (flow-matching vs. autoregressive).
✓ Deployment realities and open problems: control frequency/latency vs. large-backbone inference (action chunking, async inference), data scale and cross-embodiment normalization, evaluation/reproducibility, and safety of language-conditioned action.
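
To make the frequency-space idea from the first item concrete, here is a rough sketch of compressing an action chunk with a discrete cosine transform (DCT) and coarse quantization. It is not the FAST implementation: it covers only the DCT-and-round step and omits any later token compression. The chunk length, the 50 Hz control rate, the quantization `scale`, and the synthetic trajectory are all assumptions chosen for illustration.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix D: coeffs = D @ signal, signal = D.T @ coeffs."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    d[0, :] /= np.sqrt(2.0)    # rescale the DC row so that D @ D.T == I
    return d


def compress_chunk(chunk, scale=10.0):
    """chunk: (T, action_dim) array -> integer DCT coefficients, same shape."""
    d = dct_matrix(chunk.shape[0])
    return np.round(scale * (d @ chunk)).astype(np.int64)


def decompress_chunk(quantized, scale=10.0):
    """Invert compress_chunk, up to the quantization error."""
    d = dct_matrix(quantized.shape[0])
    return d.T @ (quantized / scale)


# Hypothetical 1-second chunk at an assumed 50 Hz control rate, 2 action dims.
T, HZ = 50, 50
t = np.arange(T) / HZ
chunk = np.stack([0.5 * np.sin(2 * np.pi * t), 0.3 * t], axis=1)

quantized = compress_chunk(chunk)
recon = decompress_chunk(quantized)

nonzero = int(np.count_nonzero(quantized))
max_err = float(np.max(np.abs(recon - chunk)))
print(f"nonzero coefficients: {nonzero} of {quantized.size}")
print(f"max reconstruction error: {max_err:.4f}")
```

Smooth trajectories concentrate their energy in a few low-frequency coefficients, so most quantized values round to zero. That is why frequency-space tokens can represent a high-rate chunk far more compactly than one binned token per timestep and dimension, at the cost of a reconstruction error controlled by `scale`.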

Where to practice coding

Run/fine-tune OpenVLA (official codebase)

Prerequisites

Foundation Models, Imitation Learning