Exploration vs Exploitation: The Core Dilemma of RL

4 minute read

Published: August 04, 2025

TL;DR: Every RL agent must balance exploiting known good actions against exploring unknown ones. The multi-armed bandit isolates this dilemma in its purest form. Solutions range from simple ε-greedy heuristics to principled Bayesian methods (Thompson sampling) and UCB algorithms with provable regret bounds. In full RL, exploration must be state-dependent and can leverage intrinsic motivation signals.

Exploration-exploitation in Atari — ε-greedy exploration in Atari DQN (Mnih et al., 2015)

The Multi-Armed Bandit

The multi-armed bandit (MAB) is the simplest problem in which the exploration-exploitation trade-off arises. A learner faces $K$ slot machines (arms); at each round $t$ it selects arm $a_t \in \{1,\ldots,K\}$ and receives a reward $r_t \sim \mathcal{D}_{a_t}$ drawn from arm $a_t$’s unknown distribution with mean $\mu_{a_t}$.

The goal is to minimise cumulative regret over $T$ rounds:

$$\mathcal{R}_T = T \mu^* - \mathbb{E}\!\left[\sum_{t=1}^T r_t\right], \quad \mu^* = \max_a \mu_a$$

Any algorithm that always exploits the empirically best arm will get stuck in a suboptimal arm if its initial samples were lucky. Any algorithm that always explores wastes reward on bad arms. The goal is to achieve sublinear regret $\mathcal{R}_T = o(T)$, meaning the per-step regret vanishes.

ε-Greedy: Simple but Effective

The ε-greedy strategy selects the arm with the highest empirical mean with probability $1 - \varepsilon$, and a uniformly random arm with probability $\varepsilon$:

for t = 1, 2, ..., T:
    with prob ε:  a_t ← uniform random arm
    with prob 1-ε: a_t ← argmax_a Q(a)   # exploit
    observe r_t, update Q(a_t) ← Q(a_t) + α(r_t - Q(a_t))

With fixed $\varepsilon$, regret is $O(\varepsilon T)$ — linear, not sublinear. Using a decaying $\varepsilon_t = c/t$ recovers logarithmic regret. Despite its simplicity, ε-greedy is widely used in deep RL (e.g., the original DQN paper uses ε annealed from 1.0 to 0.1).

UCB1: Optimism in the Face of Uncertainty

The Upper Confidence Bound (UCB1) algorithm selects the arm that maximises an upper confidence bound on its true mean:

$$a_t = \arg\max_a \left[\hat{\mu}_a + \sqrt{\frac{2 \ln t}{N_a(t)}}\right]$$

where $\hat{\mu}_a$ is the empirical mean of arm $a$ and $N_a(t)$ is the number of times it has been selected. The second term is an exploration bonus: it is large when an arm has been sampled few times (high uncertainty) and shrinks as the arm is explored. UCB1 achieves $O(\sqrt{KT \ln T})$ regret — provably near-optimal — without any hyperparameter tuning.

Key Insight: UCB algorithms embody "optimism in the face of uncertainty" — treating under-explored arms as if they might be optimal. This is a principled frequentist approach to exploration and has clean theoretical guarantees. The same idea appears in model-based RL as RMAX and in tree search (UCT in MCTS).

Thompson Sampling: Bayesian Exploration

Thompson sampling maintains a Bayesian posterior over each arm’s mean and samples an action proportional to the probability of each arm being optimal.

For Bernoulli rewards with Beta prior:

Initialise: for each arm a, α_a = β_a = 1 (uniform Beta prior)
for t = 1, 2, ..., T:
    for each arm a: sample θ_a ~ Beta(α_a, β_a)
    a_t ← argmax_a θ_a
    observe r_t ∈ {0, 1}
    update: α_{a_t} += r_t,  β_{a_t} += (1 - r_t)

The posterior update is the standard Bayesian update: each success increments $\alpha$, each failure increments $\beta$. Thompson sampling achieves logarithmic regret matching the Lai-Robbins lower bound, and empirically outperforms UCB1 in many settings.

Exploration in Full RL

In the full RL setting (beyond bandits), exploration must be state-dependent: the agent needs to visit new states, not just try new actions. This is harder because:

The same action in different states has different exploration value.
Rewards may be entirely absent (sparse reward) until the agent reaches a distant goal state.

Strategies for exploration in RL:

Count-based exploration: augment the reward with a bonus $r^+ = \beta / \sqrt{N(s,a)}$ where $N(s,a)$ counts visits. Requires density models for large state spaces.

Intrinsic motivation / curiosity: reward the agent for visiting states where its internal model has high prediction error. Pathak et al.’s Intrinsic Curiosity Module (ICM) trains a forward dynamics model and rewards surprise.

Random Network Distillation (RND): maintain a random target network $f$ and a trained predictor $\hat{f}$. The exploration bonus is the prediction error $\|f(s) - \hat{f}(s)\|^2$, which is large for novel states.

Noisy networks: add learnable noise to network weights (used in Rainbow DQN), providing state-dependent exploration without ε-greedy.

References

Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3), 235–256.
Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson sampling. NeurIPS 2011.
Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2019). Exploration by random network distillation. ICLR 2019. arXiv:1810.12894.

Share on

Bluesky Facebook LinkedIn X (formerly Twitter)

Alessio Borgi

Exploration vs Exploitation: The Core Dilemma of RL

The Multi-Armed Bandit

ε-Greedy: Simple but Effective

UCB1: Optimism in the Face of Uncertainty

Thompson Sampling: Bayesian Exploration

Exploration in Full RL

References

Share on

You May Also Enjoy

GAPE: Remember to Forget — Gated Adaptive Positional Encoding

PolyNSD: Polynomial Neural Sheaf Diffusion

TDA in Materials Science: Topology of Structure and Phase

TDA in Drug Discovery: Molecular Topology