Dyna: Integrating Planning and Learning

TL;DR: Dyna (Sutton 1991) is a framework that unifies model-free and model-based RL. After each real environment step, the agent updates its world model and then performs multiple planning steps using simulated experience drawn from that model. This dramatically reduces the number of real interactions needed to learn a good policy, at the cost of computational effort during planning.
Figure: Dyna architecture. Dyna integrates real and simulated experience (Mnih et al., 2016).

The Core Trade-off: Real vs Simulated Experience

In reinforcement learning, the fundamental bottleneck is often the number of interactions with the real environment: each real step may be expensive, dangerous, or time-consuming. Model-free methods are typically sample-inefficient and can require vast amounts of real experience. Model-based methods can generate cheap simulated experience from a learned model, but model errors can mislead the agent.

Dyna bridges both worlds: use real experience to learn a model, then use the model to generate additional (simulated) experience for planning.

The Dyna-Q Algorithm

Dyna-Q is the simplest instantiation of the Dyna framework:

  1. Act: take action \(a_t\) in the real environment, observe \((r_t, s_{t+1})\).
  2. Direct RL update: apply one step of Q-learning with the real transition.
  3. Model update: update the model \(\hat{M}\) with the transition \((s_t, a_t, r_t, s_{t+1})\).
  4. Planning: repeat \(n\) times: sample a previously visited \((s, a)\), query the model for \(\hat{r}, \hat{s}'\), apply a Q-learning update.

\[ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \]

The planning step uses simulated transitions from the model, but the Q-learning update formula is identical. With \(n = 0\), Dyna-Q reduces to standard Q-learning.
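The four steps above can be sketched in tabular form as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the `corridor_step` environment, the function names, the deterministic "last outcome seen" model, and the default hyperparameters are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def corridor_step(s, a, length=10):
    """Hypothetical toy environment: states 0..length, reward 1 at the right end."""
    s2 = min(max(s + a, 0), length)
    done = s2 == length
    return (1.0 if done else 0.0), s2, done


def dyna_q(step, start, actions, n_planning, episodes,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q sketch. `step(s, a)` returns (reward, next_state, done).
    Returns the Q-table and the number of real steps taken in each episode."""
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(s, a)], unseen entries default to 0
    model = {}               # assumed deterministic: (s, a) -> (r, s', done)
    steps_per_episode = []

    def greedy(s):
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    def q_update(s, a, r, s2, done):
        # Same Q-learning rule for real and simulated transitions.
        target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _episode in range(episodes):
        s, done, steps = start, False, 0
        while not done:
            # 1. Act (epsilon-greedy) in the real environment.
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = step(s, a)
            # 2. Direct RL update from the real transition.
            q_update(s, a, r, s2, done)
            # 3. Model update: store the most recent outcome for (s, a).
            model[(s, a)] = (r, s2, done)
            # 4. Planning: n simulated updates on previously visited pairs.
            seen = list(model)
            for _k in range(n_planning):
                ps, pa = rng.choice(seen)
                pr, ps2, pdone = model[(ps, pa)]
                q_update(ps, pa, pr, ps2, pdone)
            s, steps = s2, steps + 1
        steps_per_episode.append(steps)
    return Q, steps_per_episode
```

Calling `dyna_q(corridor_step, start=0, actions=[-1, 1], n_planning=5, episodes=30)` runs the sketch on the toy corridor; passing `n_planning=0` skips the planning loop entirely, which is the plain Q-learning case mentioned above.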

Key Insight: The elegance of Dyna is that simulated and real experience are treated identically for the purpose of value function updates. There is no architectural distinction between model-free and model-based learning, only a difference in the source of the transition tuple. This modularity makes Dyna easy to implement and reason about.

Sample Efficiency Gains

Empirically, Dyna-Q with a handful of planning steps (for example \(n = 5\)) can reach a policy quality that plain Q-learning needs several times more real interactions to match, although the saving depends on model accuracy and is not exactly \(n\)-fold. In tabular settings, Sutton showed that Dyna-Q with \(n = 50\) planning steps per real step converges to the optimal policy on a simple gridworld in roughly one-tenth the number of episodes that plain Q-learning (\(n = 0\)) needs.

The ratio of simulated to real experience is a critical hyperparameter:

  • Too low: wastes the model; reverts to model-free behaviour.
  • Too high: model errors compound; the agent over-fits to the model's inaccuracies.
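
To put a number on that ratio, here is a short worked calculation. It assumes the Dyna-Q loop described earlier, with one direct update and \(n\) planning updates per real step; the symbol \(f_{\text{sim}}\) is introduced here only for illustration.

\[ f_{\text{sim}}(n) = \frac{n}{n + 1}, \qquad f_{\text{sim}}(5) = \frac{5}{6} \approx 0.83, \qquad f_{\text{sim}}(50) = \frac{50}{51} \approx 0.98 \]

So even at \(n = 5\) most value updates already come from the model, which is why the quality of the model matters as soon as planning is switched on.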

Handling Changing Environments: Dyna-Q+

Real environments are rarely stationary. Dyna-Q+ (Sutton 1990) adds an exploration bonus to the reward used during planning for transitions that have not been tried recently:

\[ \tilde{r}(s, a) = r(s, a) + \kappa \sqrt{\tau(s, a)} \]

where \(\tau(s,a)\) is the number of steps since \((s,a)\) was last tried and \(\kappa\) is a small constant. This encourages the agent to re-explore parts of the state space where the model may have become stale, recovering quickly when the environment changes.
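
The bookkeeping behind the bonus can be sketched as below. The class and method names are hypothetical, and treating a never-tried pair as "last tried at time 0" is an assumption of this sketch rather than something the text above specifies.

```python
import math


class BonusTracker:
    """Tracks when each (s, a) was last tried in the real environment."""

    def __init__(self, kappa=0.001):
        self.kappa = kappa
        self.t = 0              # count of real time steps so far
        self.last_tried = {}    # (s, a) -> real time step of the last try

    def record_real_step(self, s, a):
        """Call once per real environment step."""
        self.t += 1
        self.last_tried[(s, a)] = self.t

    def tau(self, s, a):
        # Assumption: pairs never tried are treated as last tried at time 0.
        return self.t - self.last_tried.get((s, a), 0)

    def planning_reward(self, s, a, r):
        """Reward used in planning updates: r + kappa * sqrt(tau(s, a))."""
        return r + self.kappa * math.sqrt(self.tau(s, a))
```

During planning, the modelled reward \(\hat{r}\) would be passed through `planning_reward` before the Q-learning update, so pairs that have gone untried for a long time look temporarily more attractive.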

Modern Successors

The Dyna framework anticipates several modern model-based RL methods:

  • MBPO (Janner et al. 2019): uses short model rollouts with a neural network model inside an SAC training loop, achieving high sample efficiency on MuJoCo.
  • Dreamer (Hafner et al. 2020): trains the policy entirely on imagined trajectories using a recurrent world model.
  • TD-MPC (Hansen et al. 2022): combines temporal difference learning with model predictive control, planning in latent space.

All of these inherit Dyna's core insight: real experience builds the model; the model multiplies real experience.
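
As a rough illustration of the "short model rollouts" idea in the MBPO bullet, the sketch below branches a few simulated steps from states that were actually visited. It is a generic toy, not MBPO's implementation: `short_rollouts`, `policy`, and `model` are hypothetical callables, and a real system would use a learned neural model and feed the transitions to an off-policy learner.

```python
import random


def short_rollouts(real_states, policy, model, horizon, n_starts, rng=None):
    """Generate simulated transitions by rolling the model forward for at most
    `horizon` steps from randomly chosen real states.

    policy(s) -> a and model(s, a) -> (r, s', done) are assumed interfaces.
    """
    rng = rng or random.Random(0)
    simulated = []
    for _start in range(n_starts):
        s = rng.choice(real_states)      # branch from real experience
        for _step in range(horizon):
            a = policy(s)
            r, s2, done = model(s, a)
            simulated.append((s, a, r, s2, done))
            if done:
                break
            s = s2
    return simulated
```

Keeping `horizon` small limits how far model errors can compound before the rollout is discarded, which mirrors the trade-off on the simulated-to-real ratio discussed earlier.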

References

  • Sutton, R.S. (1991). Dyna, an Integrated Architecture for Learning, Planning, and Reacting. ACM SIGART Bulletin, 2(4), 160–163.
  • Sutton, R.S. (1990). Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. ICML.
  • Sutton, R.S., & Barto, A.G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Chapter 8.
  • Janner, M., Fu, J., Zhang, M., & Levine, S. (2019). When to Trust Your Model: Model-Based Policy Optimization. NeurIPS. arXiv:1906.08253.
  • Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR. arXiv:1912.01603.