Dyna: Integrating Planning and Learning

Published: August 16, 2025

TL;DR: Dyna (Sutton 1991) is a framework that unifies model-free and model-based RL. After each real environment step, the agent updates its world model and then performs multiple planning steps using simulated experience drawn from that model. This dramatically reduces the number of real interactions needed to learn a good policy, at the cost of computational effort during planning.

Dyna architecture — Dyna integrates real and simulated experience (Mnih et al., 2016)

The Core Trade-off: Real vs Simulated Experience

In reinforcement learning, the fundamental bottleneck is often the number of interactions with the real environment: each real step may be expensive, dangerous, or time-consuming. Model-free methods tend to be sample-inefficient because every value update consumes a fresh real transition. Model-based methods can generate cheap simulated experience from a learned model, but model errors can mislead the agent.

Dyna bridges both worlds: use real experience to learn a model, then use the model to generate additional (simulated) experience for planning.

The Dyna-Q Algorithm

Dyna-Q is the simplest instantiation of the Dyna framework:

Act: take action \(a_t\) in the real environment, observe \((r_t, s_{t+1})\).
Direct RL update: apply one step of Q-learning with the real transition.
Model update: update the model \(\hat{M}\) with the transition \((s_t, a_t, r_t, s_{t+1})\).
Planning: repeat \(n\) times: sample a previously visited \((s, a)\), query the model for \(\hat{r}, \hat{s}'\), apply a Q-learning update.

\[ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \]

The planning step uses simulated transitions from the model, but the Q-learning update formula is identical. With \(n = 0\), Dyna-Q reduces to standard Q-learning.
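The four steps above can be sketched in a few lines of tabular Python. This is a minimal illustration, not a reference implementation: the `env_step(s, a)` interface, the `corridor_step` toy environment, the deterministic "last seen transition" model, and all hyperparameter values are assumptions made for the example.

```python
import random
from collections import defaultdict


def dyna_q(env_step, start_state, actions, episodes=30, n_planning=5,
           alpha=0.1, gamma=0.95, epsilon=0.1, max_steps=10_000, seed=0):
    """Tabular Dyna-Q sketch.

    env_step(s, a) -> (reward, next_state, done) is a hypothetical interface.
    The model simply stores the last observed outcome of each (s, a),
    which is only adequate for deterministic environments.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(state, action)], initialised to 0
    model = {}               # model[(s, a)] = (r, s_next, done)
    steps_per_episode = []

    def greedy(s):
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    def q_update(s, a, r, s_next, done):
        # Same Q-learning update for real and simulated transitions.
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done, steps = start_state, False, 0
        while not done and steps < max_steps:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s_next, done = env_step(s, a)     # 1. act in the real environment
            q_update(s, a, r, s_next, done)      # 2. direct RL update
            model[(s, a)] = (r, s_next, done)    # 3. model update
            for _ in range(n_planning):          # 4. planning with simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps_next, pdone = model[(ps, pa)]
                q_update(ps, pa, pr, ps_next, pdone)
            s = s_next
            steps += 1
        steps_per_episode.append(steps)
    return Q, steps_per_episode


def corridor_step(s, a):
    """Hypothetical toy environment: states 0..5, goal at 5, actions -1 / +1."""
    s_next = min(max(s + a, 0), 5)
    done = s_next == 5
    return (1.0 if done else 0.0), s_next, done


Q, steps = dyna_q(corridor_step, start_state=0, actions=[-1, 1])
```

Calling the same function with `n_planning=0` skips the planning loop entirely, so the sketch then behaves as plain Q-learning.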

Key Insight: The elegance of Dyna is that simulated and real experience are treated identically for the purpose of value function updates. The update rule does not distinguish between model-free and model-based learning; the only difference is the source of the transition tuple. This modularity makes Dyna easy to implement and reason about.

Sample Efficiency Gains

Empirically, Dyna-Q with a modest number of planning steps (for example \(n = 5\)) can match the policy quality of plain Q-learning while using several times fewer real interactions. In tabular settings, Sutton showed that Dyna-Q with \(n = 50\) planning steps per real step converges to a near-optimal policy on a simple gridworld in roughly one-tenth the number of episodes needed by Q-learning without planning (\(n = 0\)).

The ratio of simulated to real experience is a critical hyperparameter:

Too low: wastes the model; reverts to model-free behaviour.
Too high: model errors compound; the agent over-fits to the model’s inaccuracies.

Handling Changing Environments: Dyna-Q+

Real environments are rarely stationary. Dyna-Q+ (Sutton 1990) adds an exploration bonus to transitions that have not been tried recently:

\[ \tilde{r}(s,a) = r(s,a) + \kappa \sqrt{\tau(s,a)} \]

where \(\tau(s,a)\) is the number of steps since \((s,a)\) was last tried and \(\kappa\) is a small constant. This encourages the agent to re-explore parts of the state space where the model may have become stale, recovering quickly when the environment changes.
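As a small illustration of the bonus formula, the snippet below computes the bonus-augmented reward for two state-action pairs. The function name `planning_reward`, the `last_tried` bookkeeping dictionary, the state and action labels, and the value of `kappa` are all assumptions chosen for the example.

```python
import math


def planning_reward(r_hat, tau, kappa=1e-3):
    """Model reward plus the exploration bonus kappa * sqrt(tau)."""
    return r_hat + kappa * math.sqrt(tau)


# Hypothetical bookkeeping: the time step at which each (s, a) was last tried.
last_tried = {("s0", "left"): 100, ("s0", "right"): 480}
t = 500  # current time step

# tau = t - t_last is the number of steps since the pair was last tried.
bonuses = {sa: planning_reward(0.0, t - t_last) for sa, t_last in last_tried.items()}
```

Here `("s0", "left")` has gone 400 steps without being tried and receives a bonus of about 0.02, while `("s0", "right")` was tried 20 steps ago and receives a much smaller one, so the stale pair looks more attractive.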

Modern Successors

The Dyna framework anticipates several modern model-based RL methods:

MBPO (Janner et al. 2019): uses short model rollouts with a neural network model inside an SAC training loop, achieving high sample efficiency on MuJoCo.
Dreamer (Hafner et al. 2020): trains the policy entirely on imagined trajectories using a recurrent world model.
TD-MPC (Hansen et al. 2022): combines temporal difference learning with model predictive control, planning in latent space.

All of these inherit Dyna’s core insight: real experience builds the model; the model multiplies real experience.
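The idea of short model rollouts branched from real states can be sketched generically as follows. This is not the implementation of MBPO, Dreamer, or TD-MPC: the `model` and `policy` callables, the toy dynamics, and the parameter names are hypothetical stand-ins for learned components.

```python
import random


def short_rollouts(model, policy, start_states, horizon=3, n_rollouts=4, seed=0):
    """Generate synthetic transitions by rolling a model forward a few steps.

    model(s, a) -> (reward, next_state) and policy(s) -> action are
    hypothetical callables standing in for learned components.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_rollouts):
        s = rng.choice(start_states)      # branch from a state seen in real data
        for _ in range(horizon):          # keep rollouts short to limit compounding error
            a = policy(s)
            r, s_next = model(s, a)
            synthetic.append((s, a, r, s_next))
            s = s_next
    return synthetic


def toy_model(s, a):
    """Toy stand-in for a learned model: integer line with a reward at state 5."""
    s_next = s + a
    return (1.0 if s_next == 5 else 0.0), s_next


def toy_policy(s):
    """Toy stand-in for a policy: always step right."""
    return 1


real_states = [0, 1, 2, 3]
batch = short_rollouts(toy_model, toy_policy, real_states)
```

The resulting `batch` plays the role that simulated transitions play in Dyna-Q planning: it is extra training data for the value or policy update, produced without touching the real environment.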

References

Sutton, R.S. (1991). Dyna, an Integrated Architecture for Learning, Planning, and Reacting. ACM SIGART Bulletin, 2(4), 160–163.
Sutton, R.S. (1990). Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. ICML.
Sutton, R.S., & Barto, A.G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Chapter 8.
Janner, M., Fu, J., Zhang, M., & Levine, S. (2019). When to Trust Your Model: Model-Based Policy Optimization. NeurIPS. arXiv:1906.08253.
Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR. arXiv:1912.01603.
Hansen, N., Wang, X., & Su, H. (2022). Temporal Difference Learning for Model Predictive Control. ICML. arXiv:2203.04955.

Alessio Borgi
