Action tokenization vs. flow-matching action heads in VLAs

medium

mcq

General

Several vision-language-action (VLA) models reuse a pretrained language-model backbone but differ in how they produce continuous robot actions. RT-2 and OpenVLA discretize each action dimension into bins and emit the bin indices as *text tokens* via the standard LM head, whereas pi-0 attaches a separate flow-matching ('action expert') head that denoises a continuous action chunk. Which statement best captures a key advantage of the flow-matching (continuous) approach over discrete action tokenization for high-frequency dexterous control?