Rainbow DQN: Combining Six Improvements

Component 1: Double DQN
Standard DQN uses the same network (the target network) both to select and to evaluate the greedy action in its bootstrap target. This causes maximisation bias: Q-values are systematically overestimated because the max is taken over noisy estimates.
Double DQN decouples selection (online network \(\theta\)) from evaluation (target network \(\theta^-\)):
\[y = r + \gamma \, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-\big)\]
The online network picks the best action; the target network evaluates it. This substantially reduces the upward bias, leading to more accurate Q-values and, in many games, better final performance.
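The select-then-evaluate split can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: the function name, the array arguments, and the `dones` terminal mask are assumptions introduced here, and the Q-values are made-up numbers.

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_next_online, q_next_target, gamma=0.99):
    """Double DQN targets for a batch (hypothetical helper).

    q_next_online / q_next_target have shape (batch, n_actions) and hold
    Q(s', .) under the online and target networks respectively.
    """
    # Selection: the online network chooses the greedy next action.
    best_actions = np.argmax(q_next_online, axis=1)
    # Evaluation: the target network scores that chosen action.
    evaluated = q_next_target[np.arange(len(best_actions)), best_actions]
    # Assumption: terminal transitions (dones == 1) do not bootstrap.
    return rewards + gamma * (1.0 - dones) * evaluated

rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
q_online = np.array([[0.2, 0.9], [0.5, 0.1]])
q_target = np.array([[0.4, 0.3], [0.7, 0.6]])
targets = double_dqn_targets(rewards, dones, q_online, q_target, gamma=0.9)
```

In the first transition the online network prefers action 1, so the target bootstraps from the target network's value 0.3 instead of its own maximum 0.4, which is where the reduction in overestimation comes from.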
Component 2: Dueling Architecture
The dueling network decomposes Q into two streams: a state-value \(V(s)\) and an advantage \(A(s, a) = Q(s, a) - V(s)\). The streams are recombined with the mean advantage subtracted:
\[Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')\]
The mean-subtraction ensures identifiability (otherwise a constant could be moved between V and A without changing Q). The benefit is that the value stream \(V(s)\) is updated by every transition regardless of the action taken, which is useful in states where the action choice matters little.
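A small sketch of the aggregation step, assuming the two streams have already produced their outputs (the function name and the example numbers are illustrative, not taken from the paper):

```python
import numpy as np

def dueling_q(value, advantage):
    """Combine a value stream (batch, 1) and an advantage stream (batch, n_actions)."""
    # Subtracting the per-state mean advantage pins down the split between V and A.
    return value + advantage - advantage.mean(axis=1, keepdims=True)

value = np.array([[2.0]])
advantage = np.array([[1.0, 3.0, 5.0]])

q = dueling_q(value, advantage)
# Shifting every advantage by a constant leaves Q unchanged.
q_shifted = dueling_q(value, advantage + 10.0)
```

Because of the mean-subtraction, the average of the Q-values in each state equals the value stream's output, so the two streams cannot drift apart by an arbitrary constant.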
Component 3: Prioritised Experience Replay
Standard replay samples transitions uniformly. Prioritised Experience Replay (PER) samples each transition with a probability that grows with its TD error magnitude:
\[P(i) = \frac{p_i^\alpha}{\sum_j p_j^\alpha}, \quad p_i = |\delta_i| + \epsilon\]
Transitions with large TD error (surprising experiences) are replayed more often. Importance sampling weights \(w_i = (N \cdot P(i))^{-\beta}\), where \(N\) is the number of stored transitions, correct for the sampling bias. In Rainbow's ablations, PER is one of the two components (together with multi-step returns) whose removal hurts median performance the most.
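The two formulas can be checked numerically with a short sketch. The helper names and the values of `alpha`, `beta`, and `eps` are illustrative assumptions. Dividing the weights by their maximum follows the PER paper; it is not part of the formula above. A practical buffer would also use a sum-tree rather than recomputing the full distribution on every sample.

```python
import numpy as np

def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities P(i) from TD errors (proportional variant)."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

def importance_weights(probs, beta=0.4):
    """Importance sampling weights w_i = (N * P(i))^(-beta), scaled so max(w) = 1."""
    n = len(probs)
    weights = (n * probs) ** (-beta)
    return weights / weights.max()

rng = np.random.default_rng(0)
td_errors = np.array([0.1, 0.5, 2.0, 0.01])

probs = per_probabilities(td_errors)
weights = importance_weights(probs)
# Draw a minibatch of indices according to the priorities.
batch = rng.choice(len(td_errors), size=2, p=probs)
```

The transition with the largest TD error gets the highest sampling probability and the smallest weight, so its more frequent replay is offset by a smaller gradient step. Setting `alpha=0` recovers uniform sampling.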
Component 4: Multi-Step Returns
Instead of 1-step TD targets, Rainbow uses n-step returns (n=3 in the paper):
\[G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a'} Q(s_{t+n}, a')\]
Multi-step returns propagate reward information faster along trajectories, reducing the effective horizon that bootstrapping must cover.
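As a worked example of the n-step target, the sketch below folds the rewards backwards from the bootstrap value. The function name and the numbers are illustrative; `bootstrap_value` stands in for \(\max_{a'} Q(s_{t+n}, a')\).

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """n-step return for rewards r_t .. r_{t+n-1} plus a bootstrapped tail."""
    g = bootstrap_value
    # Work backwards: each step applies g <- r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# n = 3 with gamma = 0.5:
# 1 + 0.5 * 0 + 0.25 * 2 + 0.125 * 10 = 2.75
g3 = n_step_return([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5)
```

With an empty reward list the function returns the bootstrap value itself, and with a single reward it reduces to the ordinary 1-step TD target.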
Component 5: Distributional RL (C51)
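Instead of learning \(\mathbb{E}[G_t]\), C51 (Bellemare et al., 2017) learns the full distribution of returns, represented as a categorical distribution over \(N=51\) fixed atoms \(z_1, \ldots, z_{51}\) spanning \([V_{\min}, V_{\max}]\):
\[z_i = V_{\min} + (i - 1)\,\Delta z, \quad \Delta z = \frac{V_{\max} - V_{\min}}{N - 1}, \quad i = 1, \ldots, N\]
The network outputs a softmax over the 51 atoms for each action, giving the probability \(p_i(s, a)\) that the return equals \(z_i\). The Bellman update projects the shifted distribution \(r + \gamma Z(s', a^*)\), where \(a^*\) is the greedy next action, back onto the support atoms. The loss is the cross-entropy between the projected target and the predicted distribution, which has the same gradient as their KL divergence.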
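The projection step is the least obvious part, so here is a minimal single-transition sketch. The function name, the five-atom support, and the `done` flag are assumptions made for illustration, and atoms are indexed from 0 in the code. A batched implementation would vectorise the same logic.

```python
import numpy as np

def project_distribution(next_probs, reward, done, support, gamma=0.99):
    """Project the distribution of r + gamma * Z(s', a*) onto the fixed support."""
    v_min, v_max = support[0], support[-1]
    n_atoms = len(support)
    delta_z = (v_max - v_min) / (n_atoms - 1)

    # Shift and shrink every atom, then clip to the representable range.
    tz = np.clip(reward + (1.0 - done) * gamma * support, v_min, v_max)
    # Fractional index of each shifted atom on the original grid.
    b = np.clip((tz - v_min) / delta_z, 0, n_atoms - 1)
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    projected = np.zeros(n_atoms)
    # Split each atom's mass between its two neighbours by proximity.
    np.add.at(projected, lower, next_probs * (upper - b))
    np.add.at(projected, upper, next_probs * (b - lower))
    # If b lands exactly on an atom, both weights above are zero: assign all mass there.
    exact = lower == upper
    np.add.at(projected, lower[exact], next_probs[exact])
    return projected

support = np.linspace(-1.0, 1.0, 5)              # atoms: -1, -0.5, 0, 0.5, 1
next_probs = np.array([0.0, 0.0, 1.0, 0.0, 0.0]) # all mass on the atom at 0
target = project_distribution(next_probs, reward=0.25, done=0.0,
                              support=support, gamma=1.0)

# Cross-entropy against a (here uniform) predicted distribution.
predicted = np.full(5, 0.2)
loss = -np.sum(target * np.log(predicted))
```

The shifted atom lands at 0.25, halfway between the atoms at 0 and 0.5, so the projection splits its mass equally between them.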
Component 6: Noisy Networks
Noisy networks replace ε-greedy exploration with parameter noise. Each linear layer in the noisy part of the network has factorised Gaussian noise:
\[y = (\mu^w + \sigma^w \odot \varepsilon^w)x + (\mu^b + \sigma^b \odot \varepsilon^b)\]
where \(\varepsilon\) is resampled frequently during training (in the NoisyNet-DQN setup, for every optimisation step rather than once per episode). The noise parameters \(\sigma\) are learned, allowing the network to adaptively control its exploration level per state, unlike ε-greedy, which applies the same random-action probability in every state.
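A forward-pass-only sketch of a factorised noisy layer follows. The class and method names are hypothetical. The initialisation (\(\mu\) uniform in \(\pm 1/\sqrt{p}\), \(\sigma = \sigma_0/\sqrt{p}\) with \(\sigma_0 = 0.5\)) and the scaling function \(f(x) = \mathrm{sgn}(x)\sqrt{|x|}\) follow the NoisyNet paper as best recalled here and should be treated as assumptions. Training \(\mu\) and \(\sigma\) by gradient descent is omitted.

```python
import numpy as np

def scaled_noise(rng, size):
    """f(x) = sign(x) * sqrt(|x|) applied to standard Gaussian samples."""
    x = rng.standard_normal(size)
    return np.sign(x) * np.sqrt(np.abs(x))

class NoisyLinear:
    def __init__(self, in_features, out_features, sigma0=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        bound = 1.0 / np.sqrt(in_features)
        self.mu_w = self.rng.uniform(-bound, bound, (out_features, in_features))
        self.mu_b = self.rng.uniform(-bound, bound, out_features)
        self.sigma_w = np.full((out_features, in_features), sigma0 * bound)
        self.sigma_b = np.full(out_features, sigma0 * bound)
        self.resample_noise()

    def resample_noise(self):
        # Factorised noise: one vector per input and per output unit,
        # combined by an outer product instead of sampling a full matrix.
        eps_in = scaled_noise(self.rng, self.mu_w.shape[1])
        eps_out = scaled_noise(self.rng, self.mu_w.shape[0])
        self.eps_w = np.outer(eps_out, eps_in)
        self.eps_b = eps_out

    def forward(self, x, noisy=True):
        if not noisy:
            # Noise-free pass using only the mean parameters.
            return x @ self.mu_w.T + self.mu_b
        w = self.mu_w + self.sigma_w * self.eps_w
        b = self.mu_b + self.sigma_b * self.eps_b
        return x @ w.T + b

layer = NoisyLinear(in_features=4, out_features=3)
x = np.ones((2, 4))
y_first = layer.forward(x)
layer.resample_noise()
y_second = layer.forward(x)   # same input, different noise draw
```

Factorisation means the layer draws only `in_features + out_features` Gaussian samples per resample rather than one per weight, which keeps the cost of frequent resampling low.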
Rainbow: All Six Combined
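Rainbow combines all six components in a single agent. The ablation study in the paper (Hessel et al., 2018) removes one component at a time and reports the following qualitative effects on the median human-normalised score:
| Removed component | Effect on median score |
|---|---|
| Multi-step returns | Large drop; one of the two most important components |
| Prioritised replay | Large drop; the other most important component |
| Distributional RL | Significant drop, mainly later in training |
| Noisy nets | Noticeable drop |
| Dueling | Little change in aggregate |
| Double DQN | Little change in aggregate |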
Across 57 Atari games, Rainbow matches DQN's final median human-normalised score after about 7M frames, whereas DQN is trained for 200M frames. It surpasses the best final performance of each individual baseline within about 44M frames.
References
- Hessel, M., et al. (2018). Rainbow: Combining improvements in deep reinforcement learning. AAAI 2018. arXiv:1710.02298.
- Bellemare, M. G., Dabney, W., & Munos, R. (2017). A distributional perspective on reinforcement learning. ICML 2017. arXiv:1707.06887.
- Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized experience replay. ICLR 2016. arXiv:1511.05952.
