Exploration vs Exploitation: The Core Dilemma of RL
Published:

The Multi-Armed Bandit
The multi-armed bandit (MAB) is the simplest problem in which the exploration-exploitation trade-off arises. A learner faces \(K\) slot machines (arms); at each round \(t\) it selects arm \(a_t \in \{1,\ldots,K\}\) and receives a reward \(r_t \sim \mathcal{D}_{a_t}\) drawn from arm \(a_t\)’s unknown distribution with mean \(\mu_{a_t}\).
The goal is to minimise cumulative regret over \(T\) rounds:
Any algorithm that always exploits the empirically best arm will get stuck in a suboptimal arm if its initial samples were lucky. Any algorithm that always explores wastes reward on bad arms. The goal is to achieve sublinear regret \(\mathcal{R}_T = o(T)\), meaning the per-step regret vanishes.
ε-Greedy: Simple but Effective
The ε-greedy strategy selects the arm with the highest empirical mean with probability \(1 - \varepsilon\), and a uniformly random arm with probability \(\varepsilon\):
for t = 1, 2, ..., T:
with prob ε: a_t ← uniform random arm
with prob 1-ε: a_t ← argmax_a Q(a) # exploit
observe r_t, update Q(a_t) ← Q(a_t) + α(r_t - Q(a_t))
With fixed \(\varepsilon\), regret is \(O(\varepsilon T)\) — linear, not sublinear. Using a decaying \(\varepsilon_t = c/t\) recovers logarithmic regret. Despite its simplicity, ε-greedy is widely used in deep RL (e.g., the original DQN paper uses ε annealed from 1.0 to 0.1).
UCB1: Optimism in the Face of Uncertainty
The Upper Confidence Bound (UCB1) algorithm selects the arm that maximises an upper confidence bound on its true mean:
where \(\hat{\mu}_a\) is the empirical mean of arm \(a\) and \(N_a(t)\) is the number of times it has been selected. The second term is an exploration bonus: it is large when an arm has been sampled few times (high uncertainty) and shrinks as the arm is explored. UCB1 achieves \(O(\sqrt{KT \ln T})\) regret — provably near-optimal — without any hyperparameter tuning.
Thompson Sampling: Bayesian Exploration
Thompson sampling maintains a Bayesian posterior over each arm’s mean and samples an action proportional to the probability of each arm being optimal.
For Bernoulli rewards with Beta prior:
Initialise: for each arm a, α_a = β_a = 1 (uniform Beta prior)
for t = 1, 2, ..., T:
for each arm a: sample θ_a ~ Beta(α_a, β_a)
a_t ← argmax_a θ_a
observe r_t ∈ {0, 1}
update: α_{a_t} += r_t, β_{a_t} += (1 - r_t)
The posterior update is the standard Bayesian update: each success increments \(\alpha\), each failure increments \(\beta\). Thompson sampling achieves logarithmic regret matching the Lai-Robbins lower bound, and empirically outperforms UCB1 in many settings.
Exploration in Full RL
In the full RL setting (beyond bandits), exploration must be state-dependent: the agent needs to visit new states, not just try new actions. This is harder because:
- The same action in different states has different exploration value.
- Rewards may be entirely absent (sparse reward) until the agent reaches a distant goal state.
Strategies for exploration in RL:
Count-based exploration: augment the reward with a bonus \(r^+ = \beta / \sqrt{N(s,a)}\) where \(N(s,a)\) counts visits. Requires density models for large state spaces.
Intrinsic motivation / curiosity: reward the agent for visiting states where its internal model has high prediction error. Pathak et al.’s Intrinsic Curiosity Module (ICM) trains a forward dynamics model and rewards surprise.
Random Network Distillation (RND): maintain a random target network \(f\) and a trained predictor \(\hat{f}\). The exploration bonus is the prediction error \(\|f(s) - \hat{f}(s)\|^2\), which is large for novel states.
Noisy networks: add learnable noise to network weights (used in Rainbow DQN), providing state-dependent exploration without ε-greedy.
References
- Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3), 235–256.
- Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson sampling. NeurIPS 2011.
- Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2019). Exploration by random network distillation. ICLR 2019. arXiv:1810.12894.
