Why internet pretraining helps VLA generalization

medium · mcq

General

RT-2 and OpenVLA are built by fine-tuning a vision-language model that was pretrained on internet-scale image-text data, rather than training a policy from scratch on robot demonstrations alone. Which statement best explains the primary reason this internet pretraining improves the resulting policy's generalization?