Rainbow DQN: Combining Six Improvements

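TL;DR: Rainbow (Hessel et al., 2018) combines six complementary improvements to DQN into a single agent. The combination outperforms each individual improvement and vanilla DQN by a large margin on the Atari benchmark. The paper's ablation study shows that the components do not contribute equally: removing prioritised replay or multi-step returns hurts most, while removing double Q-learning or the dueling architecture makes little difference to median performance.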
Rainbow DQN builds on the original DQN architecture (Mnih et al., 2015)
Key Paper: Hessel, M., et al. (2018). Rainbow: Combining improvements in deep reinforcement learning. AAAI 2018. arXiv:1710.02298. The paper is notable for showing that independently developed improvements are complementary and collectively achieve much better sample efficiency than any individual method.

Component 1: Double DQN

Standard DQN uses the same network to select and evaluate the greedy action, causing maximisation bias — Q-values are systematically overestimated because max is taken over noisy estimates.

Double DQN decouples selection (online network \(\theta\)) from evaluation (target network \(\theta^-\)):

$$y^{DDQN} = r + \gamma Q\!\left(s',\, \arg\max_{a'} Q(s', a'; \theta);\, \theta^-\right)$$

The online network picks the best action; the target network evaluates it. Because the two networks' estimation errors are less correlated, this reduces the upward bias (without removing it entirely), giving more accurate Q-values and often better final performance.
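The target above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: the function name `double_dqn_targets`, the array layout, and the `dones` mask for terminal transitions are assumptions added here.

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_next_online, q_next_target, gamma=0.99):
    """Double DQN targets for a batch of transitions.

    q_next_online and q_next_target have shape (batch, num_actions) and hold
    Q(s', .; theta) and Q(s', .; theta^-) respectively.
    """
    # Select the greedy next action with the online network ...
    best_actions = np.argmax(q_next_online, axis=1)
    # ... but evaluate that action with the target network.
    batch_idx = np.arange(q_next_target.shape[0])
    next_values = q_next_target[batch_idx, best_actions]
    # Assumption: terminal transitions (dones == 1) do not bootstrap.
    return rewards + gamma * (1.0 - dones) * next_values

rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
q_online = np.array([[1.0, 2.0], [0.5, 0.1]])
q_target = np.array([[3.0, 0.5], [4.0, 4.0]])
targets = double_dqn_targets(rewards, dones, q_online, q_target, gamma=0.9)
```

In the first transition the online network prefers action 1, so the target uses the target network's value for action 1 (0.5) rather than its own maximum (3.0), which is how the overestimate is damped.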

Component 2: Dueling Architecture

The dueling network decomposes Q into two streams: a state-value \(V(s)\) and an advantage \(A(s, a) = Q(s, a) - V(s)\):

$$Q(s, a; \theta) = V(s; \theta_V) + \left[A(s, a; \theta_A) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta_A)\right]$$

The mean-subtraction ensures identifiability (otherwise a constant could be added to V and subtracted from A without changing Q). The benefit of this design is that the value stream \(V(s)\) is updated by every transition regardless of the action taken, which is useful in states where the action choice matters little.
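A small NumPy sketch of the aggregation step makes the identifiability point concrete. The function name `dueling_q` and the array shapes are illustrative assumptions; in a real network `value` and `advantage` would be the outputs of the two streams.

```python
import numpy as np

def dueling_q(value, advantage):
    """Combine the two streams into Q-values.

    value has shape (batch, 1); advantage has shape (batch, num_actions).
    """
    # Subtract the per-state mean advantage so V and A cannot drift apart.
    return value + advantage - advantage.mean(axis=1, keepdims=True)

value = np.array([[2.0]])
advantage = np.array([[1.0, 3.0, 5.0]])
q = dueling_q(value, advantage)

# Adding a constant to every advantage leaves Q unchanged.
q_shifted = dueling_q(value, advantage + 10.0)
```

With this aggregation the mean of the Q-values over actions equals the value stream's output, so here `q` averages to 2.0.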

Component 3: Prioritised Experience Replay

Standard replay samples transitions uniformly. Prioritised Experience Replay (PER) samples with probability proportional to the TD error magnitude:

\[P(i) = \frac{p_i^\alpha}{\sum_j p_j^\alpha}, \quad p_i = |\delta_i| + \epsilon\]

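Transitions with large TD error (surprising experiences) are replayed more often. Importance sampling weights \(w_i = (N \cdot P(i))^{-\beta}\) correct for the sampling bias. Because Rainbow trains a distributional loss, it uses the KL loss of each transition as its priority in place of \(|\delta_i|\). In the paper's ablations, prioritised replay and multi-step returns were the two most important components.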
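The sampling probabilities and importance weights can be sketched as follows. The function names and default values for `alpha`, `beta` and `eps` are illustrative assumptions, and a practical buffer would use a sum-tree rather than recomputing the full distribution on every draw. Dividing the weights by their maximum follows the normalisation described by Schaul et al. (2016).

```python
import numpy as np

def per_probabilities(td_errors, alpha=0.5, eps=1e-6):
    """P(i) = p_i^alpha / sum_j p_j^alpha with p_i = |delta_i| + eps."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

def importance_weights(probs, beta=0.4):
    """w_i = (N * P(i))^(-beta), scaled so that the largest weight is 1."""
    n = len(probs)
    weights = (n * probs) ** (-beta)
    return weights / weights.max()

rng = np.random.default_rng(0)
td_errors = np.array([0.1, 0.5, 2.0, 0.0])
probs = per_probabilities(td_errors)
weights = importance_weights(probs)

# Draw a minibatch of indices according to the priorities.
batch = rng.choice(len(td_errors), size=2, p=probs)
```

The transition with the largest TD error is sampled most often and receives the smallest weight, while the `eps` term keeps the zero-error transition from never being replayed.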

Component 4: Multi-Step Returns

Instead of 1-step TD targets, Rainbow uses n-step returns (n=3 in the paper):

\[G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a'} Q(s_{t+n}, a')\]

Multi-step returns propagate reward information faster along trajectories, reducing the effective horizon that bootstrapping must cover.
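As a worked example, the n-step target can be computed by folding the rewards from the last step backwards. The helper name `n_step_target` and the numbers below are illustrative assumptions; `bootstrap_value` stands for \(\max_{a'} Q(s_{t+n}, a')\).

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Return r_t + gamma r_{t+1} + ... + gamma^n * bootstrap_value.

    rewards is the list [r_t, ..., r_{t+n-1}].
    """
    target = bootstrap_value
    # Work backwards so each reward is discounted one step less than the next.
    for r in reversed(rewards):
        target = r + gamma * target
    return target

# n = 3 steps with gamma = 0.5:
# 1 + 0.5 * 0 + 0.25 * 2 + 0.125 * 10 = 2.75
target = n_step_target([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5)
```

With n = 3 the bootstrap term is weighted by \(\gamma^3\) instead of \(\gamma\), so errors in the bootstrapped estimate have less influence on the target.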

Component 5: Distributional RL (C51)

Instead of learning \(\mathbb{E}[G_t]\), C51 (Bellemare et al., 2017) learns the full distribution of returns, represented as a categorical distribution over \(N=51\) fixed atoms \(z_1, \ldots, z_{51}\) spanning \([V_{\min}, V_{\max}]\):

$$Z(s, a) = \sum_{i=1}^{51} p_i(s, a)\, \delta_{z_i}$$

The network outputs a softmax over 51 atoms for each action. The Bellman update projects the shifted distribution \(r + \gamma Z(s', a^*)\) back onto the support atoms, and the loss is the cross-entropy between the projected and predicted distributions.
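The projection step is the least obvious part, so here is a minimal NumPy sketch for a single transition. The function name `project_distribution`, the `done` flag, and the support of \([-10, 10]\) are assumptions made for illustration. Each shifted atom's probability mass is split between its two nearest support atoms in proportion to proximity.

```python
import numpy as np

def project_distribution(next_probs, reward, done, support, gamma=0.99):
    """Project the distribution of r + gamma * z back onto the fixed support."""
    n_atoms = len(support)
    v_min, v_max = support[0], support[-1]
    delta_z = (v_max - v_min) / (n_atoms - 1)

    # Shift and shrink every atom, then clip to the support range.
    tz = np.clip(reward + gamma * (1.0 - done) * support, v_min, v_max)
    # Fractional index of each shifted atom on the original grid.
    b = np.clip((tz - v_min) / delta_z, 0, n_atoms - 1)
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    projected = np.zeros(n_atoms)
    # Split each atom's mass between its two neighbours.
    np.add.at(projected, lower, next_probs * (upper - b))
    np.add.at(projected, upper, next_probs * (b - lower))
    # If b is an exact integer, both weights above are zero: assign all mass.
    exact = lower == upper
    np.add.at(projected, lower[exact], next_probs[exact])
    return projected

support = np.linspace(-10.0, 10.0, 51)
next_probs = np.zeros(51)
next_probs[25] = 1.0  # all mass on the atom z = 0
projected = project_distribution(next_probs, reward=1.0, done=0.0,
                                 support=support, gamma=0.5)

# Cross-entropy loss against a (here uniform) predicted distribution.
predicted = np.full(51, 1.0 / 51)
loss = -np.sum(projected * np.log(predicted))
```

Here the shifted atom lands at 1.0, halfway between the support atoms 0.8 and 1.2, so each receives half of the mass and the mean of the projected distribution stays at 1.0.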

Key Insight: Distributional RL can outperform expected-value RL even when the policy is still greedy with respect to the mean return. A common explanation is that learning the full distribution provides a richer learning signal than a single scalar target, so the distribution matters for learning and not just for risk-sensitive behaviour.

Component 6: Noisy Networks

Noisy networks replace ε-greedy exploration with parameter noise. Each linear layer has factorised Gaussian noise:

\[y = (\mu^w + \sigma^w \odot \varepsilon^w)x + (\mu^b + \sigma^b \odot \varepsilon^b)\]

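where \(\odot\) is the element-wise product and \(\varepsilon\) is random noise that is resampled throughout training (NoisyNet-DQN draws fresh noise at every step, not once per episode). The noise scales \(\sigma\) are learned by gradient descent alongside \(\mu\), so the network can reduce its exploration at different rates in different parts of the state space, unlike ε-greedy, which applies a single exploration rate everywhere.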
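A forward pass through one noisy layer might look like the sketch below. It assumes the factorised scheme in which the weight noise is the outer product of two noise vectors, each passed through \(f(x) = \mathrm{sgn}(x)\sqrt{|x|}\). The function names, shapes and the choice to draw fresh noise on every call are illustrative assumptions.

```python
import numpy as np

def scaled_noise(rng, size):
    """Sample Gaussian noise and apply f(x) = sign(x) * sqrt(|x|)."""
    x = rng.standard_normal(size)
    return np.sign(x) * np.sqrt(np.abs(x))

def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, rng):
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)."""
    out_features, in_features = mu_w.shape
    # Factorised noise: one vector per input unit, one per output unit.
    eps_in = scaled_noise(rng, in_features)
    eps_out = scaled_noise(rng, out_features)
    eps_w = np.outer(eps_out, eps_in)  # shape (out_features, in_features)
    eps_b = eps_out
    weight = mu_w + sigma_w * eps_w
    bias = mu_b + sigma_b * eps_b
    return weight @ x + bias

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0])
mu_w, mu_b = np.ones((2, 3)), np.zeros(2)

# With sigma = 0 the layer reduces to an ordinary linear layer.
y_det = noisy_linear(x, mu_w, np.zeros((2, 3)), mu_b, np.zeros(2), rng)
# With sigma > 0 each call perturbs the weights differently.
y1 = noisy_linear(x, mu_w, np.full((2, 3), 0.5), mu_b, np.full(2, 0.5), rng)
y2 = noisy_linear(x, mu_w, np.full((2, 3), 0.5), mu_b, np.full(2, 0.5), rng)
```

Factorising the noise means only `in_features + out_features` random numbers are drawn per layer instead of one per weight, which keeps the sampling cost low.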

Rainbow: All Six Combined

Rainbow combines all six components in a single agent. The ablation study from the paper reveals:

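| Removed component | Effect on median score |
|---|---|
| Prioritised replay | Large drop |
| Multi-step returns | Large drop |
| Distributional RL | Clear drop, mainly later in training |
| Noisy nets | Drop overall, although some games improve |
| Dueling | No significant change |
| Double DQN | No significant change |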

Rainbow matches DQN's final median human-normalised score across 57 Atari games after roughly 7M frames, compared with the 200M frames DQN is trained for, and its final performance is well above that of every single-component baseline reported in the paper.

References

  1. Hessel, M., et al. (2018). Rainbow: Combining improvements in deep reinforcement learning. AAAI 2018. arXiv:1710.02298.
  2. Bellemare, M. G., Dabney, W., & Munos, R. (2017). A distributional perspective on reinforcement learning. ICML 2017. arXiv:1707.06887.
  3. Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized experience replay. ICLR 2016. arXiv:1511.05952.