Foundation Models

Large pretrained VLMs/LLMs as the robot's semantic brain.

hard · Learning

Why it matters in robotics

Foundation models are reshaping robotics: vision-language-action (VLA) models like RT-2, OpenVLA, and pi-0 turn internet-pretrained transformers into generalist robot policies, so interviewers increasingly probe whether candidates understand this shift. Expect questions on transformers, pretraining, scaling laws, and the emergent abilities that appear at scale, and on how contrastive vision-language pretraining (CLIP) gives robots open-vocabulary grounding. A common theme is adaptation: when to fully fine-tune, when to prompt, when to use LoRA/adapters, and why robotics teams co-train on web plus robot data rather than robot data alone. Candidates are also expected to reason crisply about limitations: hallucination, weak physical grounding, inference latency vs. control rate, and the chronic scarcity of robot data. Strong answers connect the ML mechanism to a concrete consequence for robot deployment.
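To make the LoRA-vs-full-fine-tuning trade-off concrete, here is a minimal sketch in plain Python. It assumes the usual LoRA formulation, in which a frozen weight W gets a trainable low-rank update scaled by alpha / rank, with B initialized to zero. The function names, layer sizes, and values are illustrative and not taken from any particular library or VLA codebase.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters: full fine-tuning vs. a rank-`rank` LoRA update."""
    full = d_in * d_out               # every entry of W is trainable
    lora = rank * (d_in + d_out)      # A is (rank x d_in), B is (d_out x rank)
    return full, lora


def matvec(matrix, vector):
    """Plain-Python matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen, A and B train."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


# Illustrative layer size (an assumption, not a measurement of any model).
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(f"full: {full}, lora: {lora}, fraction: {lora / full:.4%}")

# With B initialized to zero, the adapted layer starts identical to the base.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]        # rank 1
B_zero = [[0.0], [0.0]]
print(lora_forward(W, A, B_zero, [1.0, 0.0], alpha=2.0, rank=1))
```

The parameter count is the part worth saying out loud in an interview: for a square 4096 x 4096 layer at rank 8, the adapter trains under half a percent of the layer's weights, which is why LoRA is attractive when robot data and GPU memory are both limited.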

At a glance

  1. Web-scale image-text + VLM pretraining (CLIP / VLM) -> transfer semantic and visual knowledge
  2. Co-train on web + robot demos (Open X-Embodiment) -> map observations plus instruction to actions
  3. Action head: discrete tokens or flow/diffusion expert -> execute at control rate
  4. Closed-loop robot control -> collect more robot data (scarce)

Adapting an internet-pretrained vision-language model into a robot vision-language-action (VLA) policy.
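The "discrete tokens" option for the action head can be sketched as simple per-dimension binning. The example below assumes 256 uniform bins per action dimension and hand-picked bounds of [-1, 1]; real systems choose their own bounds and map bin indices onto vocabulary tokens, and those details are not reproduced here. The function names are hypothetical.

```python
def tokenize_action(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)                 # clip to the assumed bounds
        frac = (a - lo) / (hi - lo)             # normalize to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def detokenize_action(tokens, low, high, n_bins=256):
    """Map bin indices back to continuous values at the bin centers."""
    return [
        lo + (t + 0.5) * (hi - lo) / n_bins
        for t, lo, hi in zip(tokens, low, high)
    ]


# Illustrative 4-dimensional action with assumed bounds of [-1, 1] per dimension.
low, high = [-1.0] * 4, [1.0] * 4
action = [0.0, -1.0, 1.0, 0.3]
tokens = tokenize_action(action, low, high)
decoded = detokenize_action(tokens, low, high)
print(tokens)
print(decoded)
```

The round trip loses at most half a bin width per dimension, which is the precision cost of treating actions as tokens. Flow or diffusion action experts avoid that quantization by predicting continuous actions directly.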

What to study

  • Transformer pretraining and scaling: self-attention, next-token objective, scaling laws (loss as a power law in params/data/compute), and emergent abilities that appear only at scale.
  • Vision-language grounding: contrastive image-text pretraining (CLIP), VLM architectures, and how open-vocabulary visual features enable semantic generalization in robots.
  • Adapting pretrained models to robots: building VLAs by co-training on web + robot data, discrete action tokenization (RT-2, OpenVLA) vs. flow/diffusion action experts (pi-0), and parameter-efficient fine-tuning with LoRA/adapters vs. full fine-tuning vs. prompting.
  • Deployment limitations: inference latency vs. control frequency, hallucination and weak physical grounding, robot data scarcity, and safety/uncertainty considerations for closed-loop control.
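The latency-vs-control-frequency item in the last bullet comes down to simple arithmetic. The sketch below assumes one common mitigation, predicting a chunk of several future actions per model call so the controller always has something to execute while the next inference runs. The latency and rate values are illustrative, not measurements of any model, and the function names are hypothetical.

```python
def min_chunk_length(inference_latency_ms, control_hz):
    """Smallest number of actions per inference needed to cover one model call.

    Computes ceil(latency * rate) using integer arithmetic to avoid
    floating-point rounding surprises.
    """
    steps = -(-inference_latency_ms * control_hz // 1000)  # ceiling division
    return max(1, steps)


def max_replan_hz(inference_latency_ms):
    """Upper bound on how often the policy can react to new observations."""
    return 1000 / inference_latency_ms


# Illustrative numbers: a 200 ms forward pass driving a 50 Hz controller.
latency_ms, control_hz = 200, 50
print(min_chunk_length(latency_ms, control_hz))  # actions needed per call
print(max_replan_hz(latency_ms))                 # replans per second at best
```

Chunking keeps the controller fed, but the robot still reacts to new observations only as fast as the model can replan. That gap between control rate and reaction rate is the deployment consequence to name when asked about latency.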

Study by time budget

Pick the path that fits the time you have before your interview.

  1. pi-0: Our First Generalist Policy · Article · Physical Intelligence · ~25 min
  2. RT-2: New model translates vision and language into action · Article · Google DeepMind · ~20 min
