Looped Transformers: Thinking More with the Same Weights
The Standard Transformer’s Bottleneck
A standard 32-layer Transformer has 32 different blocks, each with its own weights. More depth therefore means more parameters, so making such a model “think harder” means making it bigger.
But consider this: after seeing a hard question, do you immediately answer, or do you think for a while and refine your answer? For most of us, more thinking time (even with the same brain) leads to better answers.
Looped Transformers bring this intuition to neural networks.
Weight Tying Across Layers
In a standard Transformer: each layer l has its own W_Q^l, W_K^l, W_V^l, W_O^l, and FFN weights.
In a Looped Transformer: all layers share the same weights. You apply one Transformer block T times:
h_0 = embed(x)
h_1 = Block(h_0)      # iteration 1
h_2 = Block(h_1)      # iteration 2, same Block
h_3 = Block(h_2)      # iteration 3, same Block
...
h_T = Block(h_{T-1})  # iteration T
output = head(h_T)
The block learns to act as a general “refinement step” that improves the representation on every pass, much like a recurrent network that steps through depth instead of through time.
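To make the weight tying concrete, here is a minimal NumPy sketch of the looped forward pass next to a standard stack. It is an illustration under stated assumptions, not a reference implementation: the block is single-head, has no causal mask, uses a pre-norm layer norm without learned gain, and a ReLU feed-forward network. The sizes and the helper names (`init_block`, `block`, `looped_forward`, `n_params`) are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 5  # illustrative sizes, not from the article


def layer_norm(x, eps=1e-5):
    # Normalise each position's features to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def init_block():
    # One set of block weights: W_Q, W_K, W_V, W_O and a two-layer FFN.
    shapes = {
        "W_Q": (d_model, d_model), "W_K": (d_model, d_model),
        "W_V": (d_model, d_model), "W_O": (d_model, d_model),
        "W_1": (d_model, d_ff), "W_2": (d_ff, d_model),
    }
    return {name: rng.normal(0.0, 0.1, shape) for name, shape in shapes.items()}


def block(h, p):
    # Single-head self-attention with a residual connection.
    x = layer_norm(h)
    q, k, v = x @ p["W_Q"], x @ p["W_K"], x @ p["W_V"]
    attn = softmax(q @ k.T / np.sqrt(d_model))
    h = h + (attn @ v) @ p["W_O"]
    # Position-wise feed-forward network, also residual.
    x = layer_norm(h)
    return h + np.maximum(x @ p["W_1"], 0.0) @ p["W_2"]


def looped_forward(h0, p, T):
    # Looped Transformer: the SAME weights p are applied T times.
    h = h0
    for _ in range(T):
        h = block(h, p)
    return h


def standard_forward(h0, layers):
    # Standard Transformer: every layer has its own weights.
    h = h0
    for p in layers:
        h = block(h, p)
    return h


def n_params(p):
    return sum(w.size for w in p.values())


h0 = rng.normal(size=(seq_len, d_model))  # stands in for embed(x)

shared = init_block()
h_looped = looped_forward(h0, shared, T=8)

layers = [init_block() for _ in range(8)]
h_standard = standard_forward(h0, layers)

looped_params = n_params(shared)
standard_params = sum(n_params(p) for p in layers)
print(looped_params, standard_params)
```

Both models run eight block applications, so the compute per forward pass is the same, but the looped version stores one block's weights where the standard stack stores eight.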
Adaptive Computation: Think Harder When Needed
A key advantage: the number of loops T can be varied at inference time.
- Easy question → 2 loops → fast answer.
- Hard math problem → 20 loops → slow but accurate.
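This connects to a broader idea called inference-time scaling, or test-time compute: letting a model spend more compute on harder examples. Reasoning models such as OpenAI’s o1/o3 and DeepSeek R1 apply the same principle, although they spend the extra compute on generating a long chain of thought and not on looping a block.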
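The article does not say how the loop count is chosen. One simple heuristic, shown here purely as a sketch, is to keep looping until the hidden state stops changing or a budget is exhausted. The `block` below is a hypothetical stand-in for a trained weight-tied block (its weights are rescaled so that repeated application settles down), and `tol` and `max_loops` are made-up parameters. Learned halting, as in the Universal Transformer, is a different mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Hypothetical stand-in for a trained weight-tied block. W is rescaled to
# spectral norm 0.5 so that repeated application converges to a fixed point.
W = rng.normal(size=(d, d))
W *= 0.5 / np.linalg.norm(W, 2)
b = rng.normal(size=d)


def block(h):
    return np.tanh(h @ W + b)


def adaptive_loop(h0, tol=1e-3, max_loops=20):
    """Apply `block` until the state changes by less than `tol`
    or `max_loops` (assumed >= 1) applications have been used."""
    h = h0
    for t in range(1, max_loops + 1):
        h_next = block(h)
        delta = np.abs(h_next - h).max()  # largest change in any entry
        h = h_next
        if delta < tol:
            break
    return h, t


h0 = rng.normal(size=(4, d))
_, loose = adaptive_loop(h0, tol=1e-2)  # stops early
_, tight = adaptive_loop(h0, tol=1e-6)  # needs more loops
print(loose, tight)
```

A tighter tolerance can never use fewer loops than a looser one on the same input, which is the "think harder when needed" behaviour in miniature.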
Related Ideas
| Model / Idea | How they loop |
|---|---|
| Universal Transformer (Dehghani et al., 2018) | Weight-tied layers with adaptive halting per position |
| ALBERT | Weight-tied encoder layers, used for parameter efficiency and not for variable computation |
| Mixture of Experts (MoE) | No looping; a router sends each token to a subset of experts, so compute is conditional per token (related, but not weight tying) |
| Diffusion LMs | Iteratively refine the output sequence |
| o1 / o3 / R1 | Generate a long chain-of-thought “scratchpad” before answering |
Limitations
- Training can be harder: gradients must flow through many applications of the same block, so they can vanish or explode much as in a recurrent network.
- Harder to specialise different layers for different types of representations (early layers vs. late layers in standard Transformers learn qualitatively different things).
- Varying T at inference complicates deployment: latency is no longer fixed, and requests that need different loop counts are harder to batch together.
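The gradient issue in the first bullet can be seen in a toy linear case. Assume, purely for illustration, that the block is the linear map h_{t+1} = W h_t with W = scale · Q and Q orthogonal, so every singular value of W equals `scale`. Backpropagating through T applications then multiplies the gradient norm by exactly scale^T. Real blocks have residual connections and normalisation that change this picture, so treat the sketch as intuition only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# Random orthogonal matrix, so scale * Q has all singular values equal to scale.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))


def grad_norm_after_loops(scale, T):
    """Norm of dL/dh_0 for the tied linear map h_{t+1} = (scale * Q) h_t."""
    W = scale * Q
    g = np.ones(d) / np.sqrt(d)  # unit-norm stand-in for dL/dh_T
    for _ in range(T):
        g = W.T @ g  # backpropagate through one application of the block
    return np.linalg.norm(g)


for scale in (0.9, 1.0, 1.1):
    norms = [grad_norm_after_loops(scale, T) for T in (1, 10, 50)]
    print(scale, norms)
```

With a scale slightly below 1 the gradient shrinks geometrically with the loop count, and with a scale slightly above 1 it grows geometrically, because the same matrix is multiplied in at every step.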
✅ Key Takeaways
- Looped Transformers apply one weight-tied block T times, gaining effective depth without a proportional increase in parameters.
- T can be varied at inference: more loops mean more compute, which can give better answers on hard problems.
- Connects to inference-time scaling, the principle behind reasoning models like o1 and R1 (which spend the extra compute on a chain of thought instead of on loops).
- Trade-off: harder to train, may lack the specialisation benefits of independent layer weights.
