GAE: Effect of the lambda Parameter on the Bias-Variance Tradeoff

hardmcq

General

In policy-gradient methods such as PPO, the advantage is commonly estimated with Generalized Advantage Estimation (GAE):

$$\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

where $V$ is the learned value (critic) baseline. Consider the role of the parameter $\lambda \in [0,1]$ (with $\gamma$ fixed). Which of the following statements is correct?
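
For reference, the estimator above can be evaluated on a finite trajectory segment with a backward recursion, $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$, which is the same sum truncated at the end of the segment. The following is a minimal sketch, not a definitive implementation: the function name `gae_advantages` and its argument layout are hypothetical, and it assumes no episode termination inside the segment, with `values` holding one extra entry $V(s_T)$ used for bootstrapping.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float,
    lam: float,
) -> List[float]:
    """Truncated GAE over one segment (hypothetical helper for illustration).

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T); the last entry is the bootstrap value.
    Assumes no terminal state occurs inside the segment.
    """
    horizon = len(rewards)
    if len(values) != horizon + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    advantages = [0.0] * horizon
    running = 0.0  # holds A_{t+1} while processing step t
    for t in reversed(range(horizon)):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Calling `gae_advantages(rewards, values, gamma, lam)` with different `lam` values changes only the weight $(\gamma\lambda)^l$ placed on the later residuals $\delta_{t+l}$; the residuals themselves are unaffected.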