Token binning vs. continuous action decoding in VLAs

hard · subjective · system design

General

A team is building a Vision-Language-Action policy by fine-tuning a pretrained VLM on robot demonstrations. They debate two action-decoding designs: (a) discretize each action dimension per timestep into bins and predict them autoregressively as extra vocabulary tokens (the RT-2 / OpenVLA style), versus (b) attach a continuous action head trained with diffusion or flow matching (the Octo / pi-0 style). Discuss the tradeoffs between these representations. In your answer address: precision and multimodality of the action distribution, suitability for high-frequency dexterous control, inference latency/throughput, and how an approach like frequency-space tokenization (FAST) changes the picture for autoregressive decoding.
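To make design (a) concrete, here is a minimal sketch of per-dimension uniform binning and its round trip back to continuous values. The bin count (256), the normalized action range of [-1, 1], the chunk shape (50 timesteps by 7 dimensions), and the helper names `actions_to_tokens` and `tokens_to_actions` are all assumptions chosen for illustration; they are not taken from any particular codebase.

```python
import numpy as np

N_BINS = 256            # assumed number of bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized action range


def actions_to_tokens(actions: np.ndarray) -> np.ndarray:
    """Map continuous actions to integer bin indices in [0, N_BINS - 1]."""
    clipped = np.clip(actions, LOW, HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)          # now in [0, 1]
    # Floor into a bin; the value HIGH itself is folded into the last bin.
    return np.minimum((scaled * N_BINS).astype(np.int64), N_BINS - 1)


def tokens_to_actions(tokens: np.ndarray) -> np.ndarray:
    """Decode bin indices back to the center of each bin."""
    width = (HIGH - LOW) / N_BINS
    return LOW + (tokens + 0.5) * width


# Hypothetical action chunk: 50 timesteps, 7 action dimensions.
rng = np.random.default_rng(0)
chunk = rng.uniform(LOW, HIGH, size=(50, 7))

tokens = actions_to_tokens(chunk)
recovered = tokens_to_actions(tokens)

max_err = np.abs(recovered - chunk).max()   # bounded by half a bin width
tokens_per_chunk = tokens.size              # one token per dimension per step

print("max round-trip error:", max_err)
print("tokens per chunk:", tokens_per_chunk)
```

Under these assumptions the round-trip error is at most half a bin width, and the number of tokens to decode grows as horizon times action dimensions, which is the quantity the latency/throughput part of the question turns on.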
