Deep Q-Networks (DQN): Playing Atari from Pixels


TL;DR: DQN extends Q-learning to high-dimensional pixel inputs using a convolutional neural network. Two key innovations stabilise training: (1) an experience replay buffer to break temporal correlations, and (2) a target network updated periodically to provide stable bootstrap targets. Evaluated on 49 Atari 2600 games from raw pixels, DQN reached a level comparable to a professional human tester on 29 of them.
Figure: Deep Q-Network (DQN) architecture for Atari games (Mnih et al., 2015)
Key Paper: Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533. DQN demonstrated that a single algorithm, with the same architecture and hyperparameters, could master diverse games from pixels alone — a landmark result in AI.

The Deadly Triad: Why Naive Deep Q-Learning Fails

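Tabular Q-learning provably converges to the optimal action values under standard conditions (every state-action pair is visited infinitely often and the step sizes decay appropriately), but naively replacing the table with a neural network \(Q_\theta(s, a)\) can diverge or oscillate. The culprit is the deadly triad (Sutton & Barto):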

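  1. Function approximation: parameters are shared across states, so an update made for one state also changes the estimates for many others.
  2. Bootstrapping: the learning target is built from the network's own current Q-estimate, so the target moves as the network learns.
  3. Off-policy learning: the transitions used for updates come from a behaviour policy (ε-greedy, plus older policies whose data sits in the replay buffer) that differs from the greedy policy being learned, so the update distribution does not match the target policy's.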

The combination of all three — which occurs in deep Q-learning — can cause training instability. DQN addresses issues 1 and 3 with experience replay, and issues 1 and 2 with a target network.
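The instability can be seen in a setting far smaller than Atari. The sketch below follows the well-known two-state example from Sutton & Barto's discussion of off-policy divergence: the value estimates are \(w\) and \(2w\), the only transition ever updated goes from the first state to the second with reward 0, and the true values are therefore 0. The function name and the step-size and discount values are illustrative choices, not taken from the DQN paper.

```python
def semi_gradient_td_updates(w, gamma, alpha, num_updates):
    """Repeatedly apply semi-gradient TD(0) to one transition s1 -> s2.

    Linear function approximation with a single weight: v(s1) = w, v(s2) = 2w.
    The reward is 0, so the correct value of both states is 0 (w = 0).
    """
    history = [w]
    for _ in range(num_updates):
        # Bootstrapped TD error: r + gamma * v(s2) - v(s1)
        td_error = 0.0 + gamma * (2.0 * w) - w
        # Gradient of v(s1) with respect to w is 1
        w += alpha * td_error * 1.0
        history.append(w)
    return history


# Each update multiplies w by 1 + alpha * (2 * gamma - 1).
diverging = semi_gradient_td_updates(w=1.0, gamma=0.9, alpha=0.1, num_updates=100)
shrinking = semi_gradient_td_updates(w=1.0, gamma=0.4, alpha=0.1, num_updates=100)

print(f"gamma=0.9: w after 100 updates = {diverging[-1]:.1f}")
print(f"gamma=0.4: w after 100 updates = {shrinking[-1]:.4f}")
```

With \(\gamma > 0.5\) the weight grows without bound even though the correct answer is \(w = 0\). All three ingredients are present: shared parameters, a bootstrapped target, and an update distribution that never visits the second state's outgoing transitions.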

Experience Replay Buffer

Instead of updating on the most recent transition, DQN stores transitions \((s, a, r, s')\) in a replay buffer \(\mathcal{D}\) that holds the most recent \(N\) of them (1 million in the Nature paper) and discards the oldest once full. At each update step, a mini-batch of \(B\) transitions (32 in the paper) is sampled uniformly at random.

Benefits:

  • Breaks temporal correlations: consecutive frames in Atari are highly correlated; random sampling makes the mini-batch approximately i.i.d.
  • Data efficiency: each transition can be replayed multiple times.
  • Stabilises gradient estimates: reduces variance from correlated samples.
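A minimal sketch of such a buffer, using only the Python standard library, is shown here. The class name `ReplayBuffer`, its method names, and the extra `done` flag stored with each transition are assumptions made for illustration; the paper does not prescribe an interface.

```python
import random


class ReplayBuffer:
    """Fixed-capacity ring buffer with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self._storage = []
        self._next_index = 0          # slot that the next write will use
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        transition = (state, action, reward, next_state, done)
        if len(self._storage) < self.capacity:
            self._storage.append(transition)
        else:
            # Buffer is full: overwrite the oldest transition
            self._storage[self._next_index] = transition
        self._next_index = (self._next_index + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform sampling without replacement within one mini-batch
        indices = self._rng.sample(range(len(self._storage)), batch_size)
        return [self._storage[i] for i in indices]

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(state=step, action=0, reward=0.0, next_state=step + 1, done=False)

batch = buffer.sample(batch_size=2)
print(len(buffer), [transition[0] for transition in batch])
```

After five insertions into a buffer of capacity 3, only the transitions starting in states 2, 3, and 4 remain. A list with a write pointer is used rather than a `deque` so that indexing into the middle of the buffer stays constant-time.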

Target Network

The Q-learning loss with a neural network is:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\left(r + \gamma \max_{a'} Q(s', a';\, \theta^-) - Q(s, a;\, \theta)\right)^2\right]$$

where \(\theta^-\) are the target network parameters — a copy of \(\theta\) that is held frozen for \(C\) steps (e.g., \(C = 10{,}000\)) and then updated by copying \(\theta^- \leftarrow \theta\). This decouples the target from the online network, preventing the feedback loop where the target changes at every gradient step.

Key Insight: The target network is analogous to a fixed regression target in supervised learning. Without it, the network is chasing a moving target — every gradient step shifts both the prediction and the label, causing oscillations. Freezing the target for C steps converts the problem into a sequence of stable supervised regression problems.
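As a worked instance of the loss, the sketch below computes targets and the mean squared error for a mini-batch of three transitions using plain Python lists. The function names and the numbers are illustrative. One detail is added that the formula above leaves implicit: for a transition that ends an episode, the target is just \(r\), with no bootstrap term.

```python
def td_targets(rewards, dones, next_q_target, gamma):
    """Compute y = r + gamma * max_a' Q(s', a'; theta^-) per transition.

    next_q_target[i] holds the target network's Q-values for every action in
    the i-th next state. Terminal transitions (done=True) use y = r.
    """
    targets = []
    for reward, done, q_next in zip(rewards, dones, next_q_target):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(reward + bootstrap)
    return targets


def mse_loss(targets, q_taken):
    """Mean squared TD error between targets and Q(s, a; theta)."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_taken)) / len(targets)


rewards = [1.0, 0.0, -1.0]
dones = [False, False, True]
next_q_target = [[0.5, 2.0], [1.0, 0.0], [3.0, 4.0]]  # from the frozen network
q_taken = [1.0, 0.5, 0.0]                             # from the online network

targets = td_targets(rewards, dones, next_q_target, gamma=0.5)
loss = mse_loss(targets, q_taken)
print(targets, round(loss, 4))
```

The targets come out as 2.0, 0.5, and -1.0. The third ignores its next-state values because the episode has ended. Since `next_q_target` comes from the frozen parameters, the targets stay fixed while the online network's `q_taken` values are being adjusted.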

CNN Architecture for Atari

The DQN network processes raw Atari frames:

  1. Preprocessing: convert to grayscale (luminance), rescale to 84×84 pixels, and stack the 4 most recent frames so the network can infer motion.
  2. Conv1: 32 filters, 8×8, stride 4 → ReLU
  3. Conv2: 64 filters, 4×4, stride 2 → ReLU
  4. Conv3: 64 filters, 3×3, stride 1 → ReLU
  5. FC: 512 units → ReLU
  6. Output: linear layer producing \(|\mathcal{A}|\) Q-values (one per action)

A single forward pass computes Q-values for all actions at once, which makes the max over actions in the Q-learning update cheap.
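The spatial sizes implied by the layer list can be checked with a few lines of arithmetic. The sketch below assumes unpadded ("valid") convolutions, which the list does not state explicitly; the helper name `conv_output_size` is illustrative.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Output width/height of an unpadded convolution along one dimension."""
    return (input_size - kernel_size) // stride + 1


# (name, kernel size, stride, output channels) as listed in the architecture
conv_layers = [
    ("conv1", 8, 4, 32),
    ("conv2", 4, 2, 64),
    ("conv3", 3, 1, 64),
]

size = 84  # input is 84x84 with 4 stacked frames as channels
shapes = []
for name, kernel_size, stride, channels in conv_layers:
    size = conv_output_size(size, kernel_size, stride)
    shapes.append((name, channels, size, size))
    print(f"{name}: {channels} x {size} x {size}")

# Number of features fed into the 512-unit fully connected layer
flattened_features = shapes[-1][1] * size * size
print("flattened:", flattened_features)
```

Under that assumption the feature maps shrink from 84×84 to 20×20, 9×9, and 7×7, so the fully connected layer receives 64 × 7 × 7 = 3136 inputs.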

Training Loop Pseudocode

Initialise Q_θ with random weights
Set θ⁻ ← θ  (target network)
Initialise replay buffer D with capacity N

for t = 1, 2, ..., T:
    # Interact with environment
    a_t ← ε-greedy action from Q_θ(s_t)
    s_{t+1}, r_t, done_t ← env.step(a_t)
    store (s_t, a_t, r_t, s_{t+1}, done_t) in D
    if done_t: s_{t+1} ← env.reset()

    # Sample and update
    Sample mini-batch {(s, a, r, s', done)} of size B from D
    y = r                                 if done      # no bootstrap at episode end
    y = r + γ max_{a'} Q_{θ⁻}(s', a')     otherwise    # target from frozen network
    L = mean((y - Q_θ(s,a))²)
    Update θ with one gradient step on L, treating y as a constant (the paper uses RMSProp)

    # Periodic target update
    if t mod C == 0: θ⁻ ← θ

Rewards are clipped to \([-1, +1]\) to bound the magnitude of error gradients across games with different score scales.
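To make the loop concrete without a deep learning library, the sketch below runs the same steps on a toy five-state chain, with a lookup table standing in for the network. This is a deliberate simplification and not the paper's setup: the environment, all constants, and the per-sample table update used in place of a gradient step are assumptions chosen so that the example runs in well under a second.

```python
import random
from collections import deque

N_STATES, GOAL = 5, 4            # states 0..4; reaching state 4 ends the episode
GAMMA, LEARNING_RATE, EPSILON = 0.9, 0.5, 0.2
BUFFER_CAPACITY, BATCH_SIZE, SYNC_EVERY, TOTAL_STEPS = 1000, 16, 50, 5000


def env_step(state, action):
    """Toy chain: action 1 moves right, action 0 moves left (clamped at 0)."""
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = next_state == GOAL
    reward = 10.0 if done else 0.0   # raw reward lies outside [-1, 1] on purpose
    return next_state, reward, done


def clip_reward(reward):
    return max(-1.0, min(1.0, reward))


def epsilon_greedy(q_row, rng):
    if rng.random() < EPSILON:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties at random so the untrained agent still explores
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def train(seed=0):
    rng = random.Random(seed)
    q_online = [[0.0, 0.0] for _ in range(N_STATES)]   # stand-in for Q_theta
    q_target = [row[:] for row in q_online]            # stand-in for Q_theta^-
    replay = deque(maxlen=BUFFER_CAPACITY)
    state = 0
    for t in range(1, TOTAL_STEPS + 1):
        # Interact with the environment and store the clipped transition
        action = epsilon_greedy(q_online[state], rng)
        next_state, reward, done = env_step(state, action)
        replay.append((state, action, clip_reward(reward), next_state, done))
        state = 0 if done else next_state

        # Sample a mini-batch and move Q towards the frozen-target estimate
        if len(replay) >= BATCH_SIZE:
            for s, a, r, s2, d in rng.sample(list(replay), BATCH_SIZE):
                y = r if d else r + GAMMA * max(q_target[s2])
                q_online[s][a] += LEARNING_RATE * (y - q_online[s][a])

        # Periodic target update
        if t % SYNC_EVERY == 0:
            q_target = [row[:] for row in q_online]
    return q_online


q_values = train()
greedy_policy = [row.index(max(row)) for row in q_values[:GOAL]]
print(greedy_policy)
```

The greedy policy read from the learned table should choose "right" (action 1) in every non-terminal state. Because of clipping, the goal reward enters the targets as 1 rather than 10, which is the effect the clipping rule has on games with large raw scores.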

Human-Level Performance on Atari

DQN was evaluated on 49 Atari 2600 games from the Arcade Learning Environment, trained from raw pixels with the same hyperparameters for each game. Key results:

  • Reaches at least 75% of a professional human tester's score, measured after subtracting the random-play baseline, on 29 of 49 games.
  • Strong results on: Pong, Breakout, Space Invaders, Q*bert.
  • Weak results on: games requiring long-horizon planning (Montezuma’s Revenge — the classic hard exploration problem).

The 2013 arXiv preprint showed the initial Atari results on 7 games; the 2015 Nature paper scaled to 49 games with the improved architecture and target network.

References

  1. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533.
  2. Mnih, V., et al. (2013). Playing Atari with deep reinforcement learning. arXiv:1312.5602.
  3. van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. AAAI 2016.