Deadly Triad and Off-Policy Value Divergence

hardsubjective

General

You are training a value-based agent on a robotic manipulation task. To improve sample efficiency you switch from on-policy TD(0) with a tabular value table to off-policy Q-learning with a deep neural network function approximator and experience replay. During training the loss occasionally spikes and the learned $Q$ -values diverge to very large magnitudes, even though the reward signal is bounded.

Explain the deadly triad and why this specific combination of design choices can cause value estimates to diverge. Then describe concrete algorithmic mechanisms (at least two) used in modern deep RL to mitigate this instability, and explain *why* each one helps.

Your answer