MuZero: Planning Without a Ground-Truth Model

From AlphaGo to MuZero
AlphaGo (Silver et al. 2016) combined deep neural networks with Monte Carlo Tree Search (MCTS), exploiting the known rules of Go to simulate future states. AlphaZero (Silver et al. 2018) removed the reliance on human game data and learned from self-play alone, but it still required perfect knowledge of the game dynamics for tree search.
MuZero removes this last constraint. It works in environments where the rules are completely unknown, learning everything it needs purely from interaction.
Three Learned Networks
MuZero uses three network functions:
Representation function \(h_\theta\): maps an observation history to a latent state.
Dynamics function \(g_\theta\): predicts the next latent state and immediate reward given the current state and action.
Prediction function \(f_\theta\): maps a latent state to a policy distribution and value estimate.
The latent state \(s_t\) is not required to reconstruct observations; it only needs to support accurate policy, value, and reward predictions. Because it is trained only for these task-relevant quantities, the representation can be more compact than one trained to reconstruct full observations.
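To make the division of labour concrete, here is a minimal sketch of the three functions and how they compose. The arithmetic inside each function is a hypothetical stand-in for a trained network, and the names `representation`, `dynamics`, `prediction`, and `unroll` are illustrative assumptions rather than anything from the paper's code. The point is the data flow: observations are encoded once, and every later step stays in latent space.

```python
import math
from typing import List, Sequence, Tuple

LatentState = Tuple[float, ...]


def representation(observations: Sequence[Sequence[float]]) -> LatentState:
    """h: observation history -> latent state (toy stand-in: per-feature mean)."""
    n = len(observations)
    return tuple(sum(column) / n for column in zip(*observations))


def dynamics(state: LatentState, action: int) -> Tuple[LatentState, float]:
    """g: (latent state, action) -> (next latent state, predicted reward)."""
    next_state = tuple(math.tanh(x + 0.1 * (action + 1)) for x in state)
    reward = sum(next_state) / len(next_state)
    return next_state, reward


def prediction(state: LatentState, num_actions: int = 2) -> Tuple[List[float], float]:
    """f: latent state -> (policy over actions, value estimate)."""
    logits = [sum(state) * (a + 1) for a in range(num_actions)]
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]  # stable softmax
    total = sum(exps)
    policy = [e / total for e in exps]
    value = math.tanh(sum(state))
    return policy, value


def unroll(observations: Sequence[Sequence[float]], actions: Sequence[int]):
    """Encode once with h, then step through latent space using only g and f."""
    state = representation(observations)
    outputs = []
    for action in actions:
        policy, value = prediction(state)
        state, reward = dynamics(state, action)
        outputs.append((policy, value, reward))
    return outputs
```

Note that `unroll` never touches an observation after the first call to `representation`, which is why the latent state has no need to be decodable back into one.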
MCTS with a Learned Model
At each real step, MuZero runs MCTS in the learned latent space:
- Selection: traverse the tree using the PUCT formula, balancing exploration (prior \(p\)) and exploitation (value \(Q\)).
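- Expansion: at a leaf, apply the dynamics network to the parent's latent state and the selected action to obtain the new latent state and predicted reward, then call the prediction network to get \((p, v)\) and initialise the child.
- Evaluation: no rollout is performed; the value \(v\) from the prediction network serves as the leaf estimate, replacing the Monte Carlo simulation used in classical MCTS.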
- Backpropagation: update action value statistics \(Q(s,a)\) along the path.
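The selection and backpropagation steps above can be sketched as follows. This is a simplified single-player version under stated assumptions: the `Node` class and function names are hypothetical, the exploration constants 1.25 and 19652 are the values reported by Schrittwieser et al. (2020), and the min-max normalisation of \(Q\) that the paper applies in Atari is omitted for brevity.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    prior: float                 # p from the prediction network
    reward: float = 0.0          # predicted reward on the edge into this node
    visit_count: int = 0
    value_sum: float = 0.0
    children: Dict[int, "Node"] = field(default_factory=dict)


def q_value(child: Node, discount: float = 0.997) -> float:
    """Q(s, a): edge reward plus discounted mean value of the child (0 if unvisited)."""
    if child.visit_count == 0:
        return 0.0
    return child.reward + discount * child.value_sum / child.visit_count


def puct_score(parent: Node, child: Node, discount: float = 0.997,
               c1: float = 1.25, c2: float = 19652.0) -> float:
    """Exploitation term Q plus a prior-weighted exploration bonus."""
    bonus = child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    bonus *= c1 + math.log((parent.visit_count + c2 + 1) / c2)
    return q_value(child, discount) + bonus


def select_action(parent: Node, discount: float = 0.997) -> int:
    """Pick the child action with the highest PUCT score."""
    return max(parent.children,
               key=lambda a: puct_score(parent, parent.children[a], discount))


def backup(path: List[Node], leaf_value: float, discount: float = 0.997) -> None:
    """Propagate the leaf value from leaf to root, folding in predicted rewards."""
    g = leaf_value
    for node in reversed(path):
        node.value_sum += g
        node.visit_count += 1
        g = node.reward + discount * g
```

With no visits yet, the scores are driven entirely by the priors, so the first simulations follow the policy network; as visit counts grow, the bonus shrinks and \(Q\) dominates.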
The visit counts after MCTS form the improved policy target \(\pi_t\), and the search value provides a better value target than the raw network estimate.
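As a small illustration of how visit counts become a policy target, the sketch below normalises the root visit counts, with a temperature parameter that sharpens or flattens the distribution. The function name `policy_target` is a hypothetical choice for this example.

```python
from typing import Dict


def policy_target(visit_counts: Dict[int, int],
                  temperature: float = 1.0) -> Dict[int, float]:
    """Turn root visit counts N(a) into a distribution proportional to N(a)^(1/T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = {a: float(n) ** (1.0 / temperature) for a, n in visit_counts.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one action must have been visited")
    return {a: w / total for a, w in weights.items()}
```

For example, visit counts of 30, 10, and 10 at temperature 1 give probabilities 0.6, 0.2, and 0.2; lowering the temperature concentrates more mass on the most-visited action.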
Training
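MuZero is trained on self-play trajectories stored in a replay buffer. For each sampled position \(t\), it unrolls the dynamics function for \(K\) steps and jointly minimises a reward loss, a value loss, and a policy loss, summed over the unroll steps:
\[
\mathcal{L}_t(\theta) = \sum_{k=0}^{K} \left[ \ell^{r}\left(u_{t+k}, r_t^{k}\right) + \ell^{v}\left(\hat{z}_{t+k}, v_t^{k}\right) + \ell^{p}\left(\pi_{t+k}, p_t^{k}\right) \right]
\]
where \(u\) are the observed rewards, \(\hat{z}\) are bootstrapped value targets, \(\pi_t\) are the MCTS policy targets, and \(r_t^{k}\), \(v_t^{k}\), \(p_t^{k}\) are the reward, value, and policy predicted after \(k\) unroll steps from position \(t\). Schrittwieser et al. (2020) additionally include an L2 weight-regularisation term.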
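A minimal sketch of how the bootstrapped value target and the summed loss might be computed is given below. It assumes an n-step return bootstrapped from the stored search value, squared error for the reward and value terms, and cross-entropy for the policy term; the paper uses a categorical representation for rewards and values in Atari, so treat the squared-error choice and all function names here as simplifying assumptions.

```python
import math
from typing import Dict, List, Sequence


def n_step_value_target(rewards: Sequence[float], search_values: Sequence[float],
                        t: int, n: int, discount: float = 0.997) -> float:
    """n-step return from step t, bootstrapped from the search value at t + n.

    rewards[i] is the reward observed after acting at step i;
    search_values[i] is the MCTS root value stored at step i.
    Terms that fall past the end of the episode contribute zero.
    """
    target = 0.0
    for i in range(n):
        if t + i >= len(rewards):
            break
        target += (discount ** i) * rewards[t + i]
    if t + n < len(search_values):
        target += (discount ** n) * search_values[t + n]
    return target


def cross_entropy(target: Sequence[float], predicted: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def unrolled_loss(steps: List[Dict]) -> float:
    """Sum reward, value, and policy losses over the unroll steps k = 0..K."""
    total = 0.0
    for s in steps:
        total += (s["reward_pred"] - s["reward_target"]) ** 2
        total += (s["value_pred"] - s["value_target"]) ** 2
        total += cross_entropy(s["policy_target"], s["policy_pred"])
    return total
```

Each entry in `steps` corresponds to one unroll step, so a single gradient update trains the representation, dynamics, and prediction functions together through the whole unrolled chain.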
Results
MuZero matches AlphaZero on Chess and Shogi and slightly exceeds it on Go, despite not being given the rules for planning, and it sets a new state of the art on the 57-game Atari benchmark, where no simulator of the rules is available to the planner. A variant, MuZero Reanalyse, further improves sample efficiency by re-running MCTS on stored positions with the latest network parameters to generate fresh training targets.
Legacy
MuZero demonstrates that planning and model learning can be unified end-to-end without privileged access to environment rules. Successors such as EfficientZero (Ye et al. 2021) extend MuZero to reach above-human mean and median scores on the Atari 100k benchmark, which allows roughly two hours of real-time play per game, pointing toward general-purpose model-based RL agents.
References
- Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., & Silver, D. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature, 588, 604–609. arXiv:1911.08265.
- Silver, D., Hubert, T., Schrittwieser, J., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play (AlphaZero). Science, 362(6419), 1140–1144.
- Silver, D., Huang, A., Maddison, C.J., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484–489.
- Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021). Mastering Atari Games with Limited Data (EfficientZero). NeurIPS. arXiv:2111.00210.
