Dyna: Integrating Planning and Learning

The Core Trade-off: Real vs Simulated Experience
In reinforcement learning, the fundamental bottleneck is often the number of interactions with the real environment, since each real step may be expensive, dangerous, or time-consuming. Model-free methods are sample-inefficient: they require vast amounts of real experience. Model-based methods can generate cheap simulated experience from a learned model, but model errors can mislead the agent.
Dyna bridges both worlds: use real experience to learn a model, then use the model to generate additional (simulated) experience for planning.
The Dyna-Q Algorithm
Dyna-Q is the simplest instantiation of the Dyna framework:
- Act: take action \(a_t\) in the real environment, observe \((r_t, s_{t+1})\).
- Direct RL update: apply one step of Q-learning with the real transition.
- Model update: update the model \(\hat{M}\) with the transition \((s_t, a_t, r_t, s_{t+1})\).
- Planning: repeat \(n\) times: sample a previously visited \((s, a)\), query the model for \(\hat{r}, \hat{s}'\), apply a Q-learning update.
The planning step uses simulated transitions from the model, but the Q-learning update formula is identical. With \(n = 0\), Dyna-Q reduces to standard Q-learning.
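The four steps above can be sketched in tabular form. This is a minimal illustration rather than Sutton's original code: the function name `dyna_q`, the `env_step` callback signature, the `corridor_step` toy environment, and the default hyperparameters are all assumptions made for this example. The environment is assumed deterministic, so the model simply stores the last observed outcome for each \((s, a)\).

```python
import random
from collections import defaultdict


def corridor_step(state, action, goal=5):
    """Hypothetical toy environment: a corridor with states 0..goal.

    Actions are -1 (left) and +1 (right); reaching `goal` pays 1 and ends
    the episode. Returns (reward, next_state, done).
    """
    next_state = min(max(state + action, 0), goal)
    done = next_state == goal
    return (1.0 if done else 0.0), next_state, done


def dyna_q(env_step, start_state, actions, episodes, n_planning,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q sketch for a deterministic, episodic environment."""
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(state, action)] -> value estimate
    model = {}               # model[(state, action)] -> (reward, next_state, done)

    def q_update(s, a, r, s_next, done):
        # One-step Q-learning; used unchanged for real and simulated transitions.
        bootstrap = 0.0 if done else gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + bootstrap - Q[(s, a)])

    def choose_action(s):
        # Epsilon-greedy with random tie-breaking.
        if rng.random() < epsilon:
            return rng.choice(actions)
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            a = choose_action(s)                       # 1. act
            r, s_next, done = env_step(s, a)
            q_update(s, a, r, s_next, done)            # 2. direct RL update
            model[(s, a)] = (r, s_next, done)          # 3. model update
            for _ in range(n_planning):                # 4. planning
                ps, pa = rng.choice(list(model))       # previously visited pair
                pr, ps_next, pdone = model[(ps, pa)]
                q_update(ps, pa, pr, ps_next, pdone)
            s = s_next
    return Q, model


if __name__ == "__main__":
    Q, _ = dyna_q(corridor_step, 0, [-1, 1], episodes=20, n_planning=10)
    # Greedy action per non-terminal state after training.
    print({s: max([-1, 1], key=lambda a: Q[(s, a)]) for s in range(5)})
```

Setting `n_planning=0` skips the inner loop entirely, which is the reduction to standard Q-learning mentioned above. Sampling planning pairs only from keys already in `model` ensures the model is never queried for a transition it has not seen.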
Sample Efficiency Gains
Empirically, Dyna-Q with \(n = 5\) planning steps can match the policy quality that Q-learning reaches only after several times as many real interactions, though the gain is not necessarily a full factor of \(n\). In tabular settings, Sutton showed that on a simple gridworld, Dyna-Q with \(n = 50\) planning steps per real step converges to the optimal policy in roughly one-tenth the number of episodes that Q-learning needs.
The ratio of simulated to real experience is a critical hyperparameter:
- Too low: wastes the model; reverts to model-free behaviour.
- Too high: model errors compound; the agent over-fits to the model's inaccuracies.
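To make the ratio concrete, here is a simple count, assuming every real step triggers exactly one direct update and \(n\) planning updates, as in Dyna-Q. The share of all value updates that depend on the model rather than the environment is then
\[ \frac{\text{simulated updates}}{\text{all updates}} = \frac{n}{n + 1}, \qquad n = 5 \Rightarrow \tfrac{5}{6} \approx 0.83, \qquad n = 50 \Rightarrow \tfrac{50}{51} \approx 0.98. \]
Even modest values of \(n\) therefore leave the value estimates dominated by model-generated transitions, which is why model accuracy matters so much at the high end of the range.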
Handling Changing Environments: Dyna-Q+
Real environments are rarely stationary. Dyna-Q+ (Sutton 1990) adds an exploration bonus to transitions that have not been tried recently. During planning, the modeled reward \(\hat{r}\) is replaced by:
\[ \hat{r} + \kappa \sqrt{\tau(s,a)} \]
where \(\tau(s,a)\) is the number of steps since \((s,a)\) was last tried and \(\kappa\) is a small constant. This encourages the agent to re-explore parts of the state space where the model may have become stale, so that it recovers quickly when the environment changes.
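A minimal sketch of the bookkeeping this requires, assuming the bonus takes the form \(\kappa \sqrt{\tau(s,a)}\) as presented in Sutton and Barto's Chapter 8. The names `RecencyTracker` and `bonus_reward`, the default value of `kappa`, and the choice to treat never-tried pairs as last tried at time 0 are all hypothetical choices for this example.

```python
import math


class RecencyTracker:
    """Tracks tau(s, a): real time steps since (s, a) was last tried."""

    def __init__(self):
        self.now = 0            # count of real environment steps so far
        self.last_tried = {}    # (state, action) -> step at which it was last tried

    def record(self, state, action):
        # Call once per real environment step, not during planning.
        self.now += 1
        self.last_tried[(state, action)] = self.now

    def tau(self, state, action):
        # Assumption: pairs never tried count as last tried at time 0.
        return self.now - self.last_tried.get((state, action), 0)


def bonus_reward(model_reward, tau, kappa=1e-3):
    """Reward used in planning updates: r + kappa * sqrt(tau)."""
    return model_reward + kappa * math.sqrt(tau)
```

In a planning update, the sampled pair's modeled reward would be passed through `bonus_reward(r_hat, tracker.tau(s, a))` before the Q-learning update. Because the bonus grows with the square root of elapsed time, long-untried transitions gradually become attractive enough to be retested.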
Modern Successors
The Dyna framework anticipates several modern model-based RL methods:
- MBPO (Janner et al. 2019): uses short model rollouts with a neural network model inside an SAC training loop, achieving high sample efficiency on MuJoCo.
- Dreamer (Hafner et al. 2020): trains the policy entirely on imagined trajectories using a recurrent world model.
- TD-MPC (Hansen et al. 2022): combines temporal difference learning with model predictive control, planning in latent space.
All of these inherit Dyna's core insight: real experience builds the model; the model multiplies real experience.
References
- Sutton, R.S. (1991). Dyna, an Integrated Architecture for Learning, Planning, and Reacting. ACM SIGART Bulletin, 2(4), 160–163.
- Sutton, R.S. (1990). Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. ICML.
- Sutton, R.S., & Barto, A.G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Chapter 8.
- Janner, M., Fu, J., Zhang, M., & Levine, S. (2019). When to Trust Your Model: Model-Based Policy Optimization. NeurIPS. arXiv:1906.08253.
- Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR. arXiv:1912.01603.
- Hansen, N., Wang, X., & Su, H. (2022). Temporal Difference Learning for Model Predictive Control. ICML. arXiv:2203.04955.
