MuZero: Planning Without a Ground-Truth Model

TL;DR: MuZero (Schrittwieser et al. 2020) generalises AlphaZero to environments whose rules are not given to the agent. It jointly learns a representation function, a dynamics function, and a prediction function, which lets it plan with Monte Carlo Tree Search without access to the true environment model. MuZero matches AlphaZero on board games and surpasses prior methods on Atari, where it learns directly from pixels.
Figure: MuZero planning with a learned model. Model-based RL with learned representations (Mnih et al., 2016)

From AlphaGo to MuZero

AlphaGo (Silver et al. 2016) combined deep neural networks with Monte Carlo Tree Search (MCTS), exploiting the known rules of Go to simulate future states. AlphaZero (Silver et al. 2018) dropped the reliance on human game data and learned from self-play alone, but its tree search still required perfect knowledge of the game dynamics.

MuZero removes this last constraint. It works in environments where the rules are completely unknown, learning everything it needs purely from interaction.

Three Learned Networks

MuZero uses three network functions:

Representation function \(h_\theta\): maps an observation history to a latent state.

\[ s_0 = h_\theta(o_1, o_2, \dots, o_t) \]

Dynamics function \(g_\theta\): predicts the next latent state and immediate reward given the current state and action.

\[ (r_t, s_{t+1}) = g_\theta(s_t, a_t) \]

Prediction function \(f_\theta\): maps a latent state to a policy distribution and value estimate.

\[ (p_t, v_t) = f_\theta(s_t) \]

The latent state \(s_t\) is not required to reconstruct observations; it only needs to support accurate policy, value, and reward predictions. This task-focused representation can be more compact than one trained to reconstruct full observations.

Key Insight: MuZero does not try to learn a pixel-accurate world model. The dynamics function operates entirely in a learned abstract latent space, optimised only for planning relevance. This abstraction is crucial: irrelevant details (background textures, exact positions) are discarded, and the model learns to represent only what matters for predicting returns.
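
To make the three interfaces concrete, here is a minimal NumPy sketch. The layer sizes, the random weights (standing in for trained parameters \(\theta\)), the single flat observation vector, and the names `representation`, `dynamics`, and `prediction` are all assumptions for illustration; the networks in the paper are far larger.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, NUM_ACTIONS = 8, 4, 3  # illustrative sizes, not from the paper

# Randomly initialised weights stand in for trained parameters theta.
W_h = rng.normal(size=(LATENT_DIM, OBS_DIM))
W_g = rng.normal(size=(LATENT_DIM, LATENT_DIM + NUM_ACTIONS))
w_r = rng.normal(size=LATENT_DIM + NUM_ACTIONS)
W_p = rng.normal(size=(NUM_ACTIONS, LATENT_DIM))
w_v = rng.normal(size=LATENT_DIM)

def representation(obs):
    """h: observation (here one flat vector) -> latent state."""
    return np.tanh(W_h @ obs)

def dynamics(state, action):
    """g: (latent state, action index) -> (predicted reward, next latent state)."""
    x = np.concatenate([state, np.eye(NUM_ACTIONS)[action]])  # one-hot action
    return float(w_r @ x), np.tanh(W_g @ x)

def prediction(state):
    """f: latent state -> (policy over actions, scalar value)."""
    logits = W_p @ state
    policy = np.exp(logits - logits.max())  # numerically stable softmax
    return policy / policy.sum(), float(w_v @ state)

# Encode one observation, then imagine three actions without touching the environment.
s = representation(rng.normal(size=OBS_DIM))
for a in [0, 2, 1]:
    r, s = dynamics(s, a)
p, v = prediction(s)
```

Note that after the first call to `representation`, the loop never sees another observation: every later state comes from `dynamics` alone, which is what allows planning without the real environment.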

MCTS with a Learned Model

At each real step, MuZero runs MCTS in the learned latent space:

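  1. Selection: starting from the root (the latent state produced by the representation function), traverse the tree using the PUCT formula, balancing exploration (prior \(p\)) and exploitation (value \(Q\)).
  2. Expansion: at a leaf, call the dynamics network once to obtain the predicted reward and the new latent state, then call the prediction network to get \((p, v)\) and initialise the children.
  3. Backup: propagate \(v\) and the predicted rewards back along the path, updating the visit counts \(N(s,a)\) and action values \(Q(s,a)\). There is no separate rollout phase: the value \(v\) from the prediction network stands in for a simulated playout.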

\[ \mathrm{PUCT}(s,a) = Q(s,a) + c \cdot p(a \mid s) \cdot \frac{\sqrt{N(s)}}{1 + N(s,a)} \]
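
As a rough illustration of the selection rule, the sketch below scores each action at a single node and picks the highest. The function names, the constant `c=1.25`, and the choice to take \(N(s)\) as the sum of the children's visit counts are assumptions made for this example.

```python
import math

def puct_scores(q_values, priors, visit_counts, c=1.25):
    """Score every action at one node with the PUCT rule."""
    parent_visits = sum(visit_counts)  # N(s), assumed to be the sum of child visits
    return [
        q + c * p * math.sqrt(parent_visits) / (1 + n)
        for q, p, n in zip(q_values, priors, visit_counts)
    ]

def select_action(q_values, priors, visit_counts, c=1.25):
    """Return the index of the action with the highest PUCT score."""
    scores = puct_scores(q_values, priors, visit_counts, c)
    return max(range(len(scores)), key=scores.__getitem__)

# Action 0 has the best Q but many visits; action 2 is unvisited.
best = select_action(
    q_values=[0.5, 0.1, 0.0],
    priors=[0.2, 0.5, 0.3],
    visit_counts=[10, 2, 0],
)
```

Here the unvisited action wins despite its low value estimate, because the exploration term shrinks as \(N(s,a)\) grows and is largest for actions that have not been tried.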

The normalised visit counts at the root after MCTS form the improved policy target \(\pi_t\), and the root's search value provides a better value target than the raw network estimate.
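
A compact sketch of the whole search loop may help tie these pieces together. Each simulation descends by PUCT, expands the leaf with one call each to the dynamics and prediction functions, and backs the value up the path; the root visit counts are then normalised into a policy target. The `Node` class, the constants, the unnormalised \(Q\) estimate, and the toy model functions are hypothetical stand-ins, not the paper's implementation.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # p(a|s) assigned by the parent's prediction
        self.state = None         # latent state, filled in on expansion
        self.reward = 0.0         # predicted reward on the edge into this node
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}        # action -> Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def score(parent, child, gamma, c):
    # Q(s,a) is taken as predicted reward plus discounted child value (0 if unvisited).
    q = child.reward + gamma * child.value() if child.visit_count else 0.0
    return q + c * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)

def expand(node, state, reward, policy):
    node.state, node.reward = state, reward
    node.children = {a: Node(p) for a, p in enumerate(policy)}

def run_mcts(root_state, dynamics, prediction, num_simulations=50, gamma=0.99, c=1.25):
    root = Node(prior=1.0)
    policy, _ = prediction(root_state)
    expand(root, root_state, 0.0, policy)
    for _ in range(num_simulations):
        node, path = root, [root]
        while node.children:  # selection: descend until an unexpanded node
            parent = node
            action, node = max(parent.children.items(),
                               key=lambda item: score(parent, item[1], gamma, c))
            path.append(node)
        reward, state = dynamics(parent.state, action)  # expansion: one model step
        policy, value = prediction(state)
        expand(node, state, reward, policy)
        for visited in reversed(path):  # backup along the selected path
            visited.value_sum += value
            visited.visit_count += 1
            value = visited.reward + gamma * value
    counts = [root.children[a].visit_count for a in sorted(root.children)]
    return [n / sum(counts) for n in counts]  # normalised visit counts

# Toy stand-ins for the learned model (hypothetical): action 1 always pays reward 1.
def toy_dynamics(state, action):
    return float(action == 1), state + 1

def toy_prediction(state):
    return [0.5, 0.5], 0.0

pi = run_mcts(root_state=0, dynamics=toy_dynamics, prediction=toy_prediction)
```

The search never calls a real environment: `dynamics` and `prediction` are the only sources of rewards, states, priors, and values, so swapping the toy functions for learned networks leaves the loop unchanged.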

Training

MuZero is trained from self-play data stored in a replay buffer. For each stored trajectory, it unrolls the dynamics function \(K\) steps and minimises three losses simultaneously:

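\[ L = \sum_{k=0}^{K} \Big( L_r(r_t^k, u_{t+k}) + L_v(v_t^k, z_{t+k}) + L_p(p_t^k, \pi_{t+k}) \Big) \]

where \(r_t^k\), \(v_t^k\), and \(p_t^k\) are the reward, value, and policy predictions after unrolling \(k\) steps from time \(t\), \(u_{t+k}\) are the observed rewards, \(z_{t+k}\) are bootstrapped value targets, and \(\pi_{t+k}\) are the MCTS policy targets.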
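
The sketch below shows one way the targets and the summed loss could be computed. The helper names, the dictionary layout, and the specific loss forms (squared error for reward and value, cross-entropy for policy) are assumptions chosen for brevity, and the indexing convention for rewards is stated in the docstring rather than taken from the paper.

```python
import numpy as np

def n_step_value_target(rewards, search_values, t, n, gamma):
    """Bootstrapped target: n discounted observed rewards plus a discounted search value.

    rewards[i] is the reward observed after acting at step i; search_values[i]
    is the MCTS root value at step i. Past the end of the episode both count as zero.
    """
    horizon = min(t + n, len(rewards))
    z = sum(gamma ** (i - t) * rewards[i] for i in range(t, horizon))
    if t + n < len(search_values):
        z += gamma ** n * search_values[t + n]
    return z

def unrolled_loss(pred, target):
    """Sum reward, value, and policy losses over the unrolled steps k = 0..K.

    pred and target are dicts with keys 'reward' and 'value' (sequences of
    length K+1) and 'policy' (K+1 probability vectors).
    """
    total = 0.0
    for k in range(len(pred["value"])):
        total += (pred["reward"][k] - target["reward"][k]) ** 2   # L_r
        total += (pred["value"][k] - target["value"][k]) ** 2     # L_v
        p = np.clip(np.asarray(pred["policy"][k], dtype=float), 1e-12, 1.0)
        total += -float(np.sum(np.asarray(target["policy"][k]) * np.log(p)))  # L_p
    return total

# Two real rewards, then bootstrap from the search value two steps ahead.
z = n_step_value_target(
    rewards=[1.0, 0.0, 2.0, 0.0],
    search_values=[0.5, 0.4, 0.3, 0.2],
    t=0, n=2, gamma=0.5,
)
```

In a real training step the predictions would come from unrolling the dynamics function from the stored observation, and gradients would flow through all three networks at once.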

Results

MuZero matches or exceeds AlphaZero on Go, Chess, and Shogi without being given the rules, and outperforms prior methods on Atari, setting a new state of the art on the 57-game benchmark. A variant, MuZero Reanalyse, further improves sample efficiency by re-running MCTS on stored positions with the latest network parameters to produce fresh targets.

Legacy

MuZero demonstrates that planning and model learning can be unified end-to-end without privileged access to environment rules. Successors have extended it to work with far less data: EfficientZero (Ye et al. 2021), for example, reaches human-level performance on the Atari 100k benchmark, which limits the agent to roughly two hours of real-time game experience. These results point toward general-purpose model-based RL agents.

References

  • Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., & Silver, D. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature, 588, 604–609. arXiv:1911.08265.
  • Silver, D., Hubert, T., Schrittwieser, J., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play (AlphaZero). Science, 362(6419), 1140–1144.
  • Silver, D., Huang, A., Maddison, C.J., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484–489.
  • Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021). Mastering Atari Games with Limited Data (EfficientZero). NeurIPS. arXiv:2111.00210.