Exploration vs Exploitation: The Core Dilemma of RL

4 minute read

Published:

TL;DR: Every RL agent must balance exploiting known good actions against exploring unknown ones. The multi-armed bandit isolates this dilemma in its purest form. Solutions range from simple ε-greedy heuristics to principled Bayesian methods (Thompson sampling) and UCB algorithms with provable regret bounds. In full RL, exploration must be state-dependent and can leverage intrinsic motivation signals.
Exploration-exploitation in Atari
ε-greedy exploration in Atari DQN (Mnih et al., 2015)

The Multi-Armed Bandit

The multi-armed bandit (MAB) is the simplest problem in which the exploration-exploitation trade-off arises. A learner faces \(K\) slot machines (arms); at each round \(t\) it selects arm \(a_t \in \{1,\ldots,K\}\) and receives a reward \(r_t \sim \mathcal{D}_{a_t}\) drawn from arm \(a_t\)’s unknown distribution with mean \(\mu_{a_t}\).

The goal is to minimise cumulative regret over \(T\) rounds:

$$\mathcal{R}_T = T \mu^* - \mathbb{E}\!\left[\sum_{t=1}^T r_t\right], \quad \mu^* = \max_a \mu_a$$

Any algorithm that always exploits the empirically best arm will get stuck in a suboptimal arm if its initial samples were lucky. Any algorithm that always explores wastes reward on bad arms. The goal is to achieve sublinear regret \(\mathcal{R}_T = o(T)\), meaning the per-step regret vanishes.

ε-Greedy: Simple but Effective

The ε-greedy strategy selects the arm with the highest empirical mean with probability \(1 - \varepsilon\), and a uniformly random arm with probability \(\varepsilon\):

for t = 1, 2, ..., T:
    with prob ε:  a_t ← uniform random arm
    with prob 1-ε: a_t ← argmax_a Q(a)   # exploit
    observe r_t, update Q(a_t) ← Q(a_t) + α(r_t - Q(a_t))

With fixed \(\varepsilon\), regret is \(O(\varepsilon T)\) — linear, not sublinear. Using a decaying \(\varepsilon_t = c/t\) recovers logarithmic regret. Despite its simplicity, ε-greedy is widely used in deep RL (e.g., the original DQN paper uses ε annealed from 1.0 to 0.1).

UCB1: Optimism in the Face of Uncertainty

The Upper Confidence Bound (UCB1) algorithm selects the arm that maximises an upper confidence bound on its true mean:

$$a_t = \arg\max_a \left[\hat{\mu}_a + \sqrt{\frac{2 \ln t}{N_a(t)}}\right]$$

where \(\hat{\mu}_a\) is the empirical mean of arm \(a\) and \(N_a(t)\) is the number of times it has been selected. The second term is an exploration bonus: it is large when an arm has been sampled few times (high uncertainty) and shrinks as the arm is explored. UCB1 achieves \(O(\sqrt{KT \ln T})\) regret — provably near-optimal — without any hyperparameter tuning.

Key Insight: UCB algorithms embody "optimism in the face of uncertainty" — treating under-explored arms as if they might be optimal. This is a principled frequentist approach to exploration and has clean theoretical guarantees. The same idea appears in model-based RL as RMAX and in tree search (UCT in MCTS).

Thompson Sampling: Bayesian Exploration

Thompson sampling maintains a Bayesian posterior over each arm’s mean and samples an action proportional to the probability of each arm being optimal.

For Bernoulli rewards with Beta prior:

Initialise: for each arm a, α_a = β_a = 1 (uniform Beta prior)
for t = 1, 2, ..., T:
    for each arm a: sample θ_a ~ Beta(α_a, β_a)
    a_t ← argmax_a θ_a
    observe r_t ∈ {0, 1}
    update: α_{a_t} += r_t,  β_{a_t} += (1 - r_t)

The posterior update is the standard Bayesian update: each success increments \(\alpha\), each failure increments \(\beta\). Thompson sampling achieves logarithmic regret matching the Lai-Robbins lower bound, and empirically outperforms UCB1 in many settings.

Exploration in Full RL

In the full RL setting (beyond bandits), exploration must be state-dependent: the agent needs to visit new states, not just try new actions. This is harder because:

  • The same action in different states has different exploration value.
  • Rewards may be entirely absent (sparse reward) until the agent reaches a distant goal state.

Strategies for exploration in RL:

Count-based exploration: augment the reward with a bonus \(r^+ = \beta / \sqrt{N(s,a)}\) where \(N(s,a)\) counts visits. Requires density models for large state spaces.

Intrinsic motivation / curiosity: reward the agent for visiting states where its internal model has high prediction error. Pathak et al.’s Intrinsic Curiosity Module (ICM) trains a forward dynamics model and rewards surprise.

Random Network Distillation (RND): maintain a random target network \(f\) and a trained predictor \(\hat{f}\). The exploration bonus is the prediction error \(\|f(s) - \hat{f}(s)\|^2\), which is large for novel states.

Noisy networks: add learnable noise to network weights (used in Rainbow DQN), providing state-dependent exploration without ε-greedy.

References

  1. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3), 235–256.
  2. Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson sampling. NeurIPS 2011.
  3. Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2019). Exploration by random network distillation. ICLR 2019. arXiv:1810.12894.