Choosing a policy class for multimodal demonstrations

medium mcq

General

Suppose human demonstrations for an obstacle-avoidance task are multimodal: sometimes the expert steers left of the obstacle and sometimes right, both equally valid. A policy is trained to regress a single action by minimizing mean-squared error to the demonstrated action. What is the most likely failure, and which policy formulation best fixes it?
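The setup in the question can be probed numerically. The sketch below is only an illustration under stated assumptions: the action is reduced to a single hypothetical 1-D steering value, the two expert modes sit at -1.0 (left) and +1.0 (right), and the "policy" is a single constant action with no state input. The helper names (`make_demonstrations`, `mse`, `mse_optimal_action`) are invented for this example. It reports where the MSE-minimizing action lands relative to the two demonstrated modes.

```python
import random
import statistics


def make_demonstrations(n=1000, offset=1.0, noise=0.05, seed=0):
    """Hypothetical 1-D steering actions: half near -offset (left), half near +offset (right)."""
    rng = random.Random(seed)
    demos = []
    for i in range(n):
        # Alternate modes so left and right are equally represented.
        mode = -offset if i % 2 == 0 else offset
        demos.append(mode + rng.gauss(0.0, noise))
    return demos


def mse(action, demos):
    """Mean-squared error of one fixed action against all demonstrated actions."""
    return sum((action - d) ** 2 for d in demos) / len(demos)


def mse_optimal_action(demos):
    """The constant action that minimizes MSE is the sample mean of the targets."""
    return statistics.fmean(demos)


demos = make_demonstrations()
best = mse_optimal_action(demos)
# Distance from the MSE-optimal action to the closer of the two assumed modes.
nearest_mode_gap = min(abs(best - (-1.0)), abs(best - 1.0))

print(f"MSE-optimal single action: {best:.3f}")
print(f"distance to nearest demonstrated mode: {nearest_mode_gap:.3f}")
print(f"MSE at optimum: {mse(best, demos):.3f}, MSE at right mode: {mse(1.0, demos):.3f}")
```

Changing `offset` or the left/right ratio in `make_demonstrations` is a quick way to see how the minimizer moves as the demonstration distribution changes.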