Why co-train a VLA on web data instead of fine-tuning only on robot data

medium, subjective, system design

General

A team builds a vision-language-action (VLA) policy by initializing from an internet-pretrained vision-language model. Engineer A proposes fully fine-tuning only on the team's robot teleoperation dataset (a few hundred thousand trajectories). Engineer B argues for *co-training*: mixing the robot data with the original web image-text data during fine-tuning. Explain the trade-off. Why does co-training (or keeping web data in the mix) tend to preserve generalization, and what failure mode does fine-tuning purely on robot data risk? Reference how this connects to the value of internet-scale pretraining for robotics.
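
To make the two proposals concrete, here is a minimal sketch of what "mixing" could mean at the data-loading level. It assumes each training example is drawn from the robot set with probability `robot_fraction` and from the web set otherwise. The function name `mixed_batches`, its parameters, and the string placeholders standing in for trajectories and image-text pairs are all hypothetical and are not taken from any particular VLA codebase.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def mixed_batches(
    robot_data: Sequence[str],
    web_data: Sequence[str],
    robot_fraction: float,
    batch_size: int,
    num_batches: int,
    seed: int = 0,
) -> Iterator[List[Tuple[str, str]]]:
    """Yield batches of (source, example) pairs drawn from a two-way mixture.

    Hypothetical helper: each slot in a batch is filled from robot_data with
    probability robot_fraction, and from web_data otherwise.
    """
    if not 0.0 <= robot_fraction <= 1.0:
        raise ValueError("robot_fraction must be in [0, 1]")
    rng = random.Random(seed)  # local RNG so the sampling order is reproducible
    for _ in range(num_batches):
        batch: List[Tuple[str, str]] = []
        for _ in range(batch_size):
            if rng.random() < robot_fraction:
                batch.append(("robot", rng.choice(robot_data)))
            else:
                batch.append(("web", rng.choice(web_data)))
        yield batch


# Placeholder examples; real data would be trajectories and image-text pairs.
robot = ["traj_0", "traj_1", "traj_2"]
web = ["caption_0", "vqa_0", "caption_1", "vqa_1"]

# Engineer B's proposal: some assumed mixing ratio strictly between 0 and 1.
for batch in mixed_batches(robot, web, robot_fraction=0.5, batch_size=4, num_batches=2):
    print(batch)
```

Under this sketch, Engineer A's proposal corresponds to `robot_fraction=1.0`, so the model never sees a web example during fine-tuning. The value `0.5` above is only an illustrative choice, not a recommended ratio.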
