Looped Transformers: Thinking More with the Same Weights

Published: January 15, 2024

TL;DR: Looped Transformers re-use the same transformer block T times in sequence (weight tying). Each "loop" refines the token representations, mimicking deeper networks with far fewer parameters. Varying loop count at inference lets you trade speed for quality.

The Standard Transformer’s Bottleneck

A standard 32-layer Transformer has 32 distinct blocks, each with its own weights, so adding depth adds parameters. Making the model “think harder” means making it bigger.

But consider this: after seeing a hard question, do you immediately answer, or do you think for a while and refine your answer? For most of us, more thinking time (even with the same brain) leads to better answers.

Looped Transformers bring this intuition to neural networks.

Weight Tying Across Layers

In a standard Transformer: each layer l has its own W_Q^l, W_K^l, W_V^l, W_O^l, and FFN weights.

In a Looped Transformer: all layers share the same weights. You apply one Transformer block T times:

h₀ = embed(x)
h₁ = Block(h₀)    # iteration 1
h₂ = Block(h₁)    # iteration 2, same Block
h₃ = Block(h₂)    # iteration 3, same Block
...
h_T = Block(h_{T-1})  # iteration T, same Block
output = head(h_T)

The block learns to be a general “refinement step” that improves the representations on every pass: in effect, a recurrent network whose step function is a Transformer block.
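The recurrence above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the block here is a toy single-head self-attention layer with a residual connection (no FFN, LayerNorm, or masking), the weights are random rather than trained, and the names `Block`, `looped_forward`, and `num_loops` are hypothetical. The only point is that one set of weights is applied T times.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


class Block:
    """Toy block: single-head self-attention plus a residual connection."""

    def __init__(self, d_model, rng):
        self.d_model = d_model
        # One set of weights; every loop iteration reuses exactly these.
        self.w_q, self.w_k, self.w_v, self.w_o = (
            [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_model)]
            for _ in range(4)
        )

    def num_parameters(self):
        return 4 * self.d_model * self.d_model

    def __call__(self, h):
        q, k, v = matmul(h, self.w_q), matmul(h, self.w_k), matmul(h, self.w_v)
        scale = math.sqrt(self.d_model)
        scores = matmul(q, transpose(k))
        attn = [softmax([s / scale for s in row]) for row in scores]
        update = matmul(matmul(attn, v), self.w_o)
        # Residual connection: refine h rather than replace it.
        return [[x + u for x, u in zip(h_row, u_row)] for h_row, u_row in zip(h, update)]


def looped_forward(block, h0, num_loops):
    """Apply the same block num_loops times: h_t = Block(h_{t-1})."""
    h = h0
    for _ in range(num_loops):
        h = block(h)
    return h


rng = random.Random(0)
block = Block(d_model=4, rng=rng)
h0 = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]  # 3 tokens, width 4
h3 = looped_forward(block, h0, num_loops=3)
```

Changing `num_loops` changes how much computation is done, while `block.num_parameters()` stays the same.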

Figure 1: Standard Transformers have N independent blocks; Looped Transformers run one shared block T times. With T = N the compute depth is the same, at roughly 1/N of the block parameters.
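To put rough numbers on the parameter saving, here is a back-of-the-envelope count. The width `d_model = 4096` and the 4x FFN expansion are assumptions chosen for illustration, not figures from any particular model, and the count ignores biases, LayerNorm, embeddings, and the output head.

```python
def block_params(d_model, ffn_mult=4):
    """Approximate weights in one Transformer block (biases and LayerNorm ignored)."""
    attention = 4 * d_model * d_model        # W_Q, W_K, W_V, W_O
    ffn = 2 * ffn_mult * d_model * d_model   # up- and down-projection
    return attention + ffn


d_model = 4096   # assumed width, not taken from the article
depth = 32       # N layers in the standard model, T loops in the looped one

standard = depth * block_params(d_model)   # N independent blocks
looped = block_params(d_model)             # one shared block, reused T times

print(f"standard: {standard:,} block parameters")
print(f"looped:   {looped:,} block parameters")
print(f"ratio:    {standard // looped}x")
```

Under these assumptions the standard stack holds about 6.4 billion block parameters and the looped model about 0.2 billion, a 32x difference. The compute per forward pass is the same in both cases, because the shared block still runs 32 times.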

Adaptive Computation: Think Harder When Needed

A key advantage: the number of loops T can be varied at inference time.

Easy question → 2 loops → fast answer.
Hard math problem → 20 loops → slow but accurate.
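One way to picture "easy inputs get fewer loops" is an early-exit rule: keep looping until the representation stops changing much, or until a budget runs out. The sketch below is a hypothetical heuristic, not the learned halting mechanism of any published model; `adaptive_loop`, `tol`, and the stand-in `toy_block` (which just moves values toward a fixed point) are all assumptions made for illustration.

```python
def adaptive_loop(block, h, max_loops, tol):
    """Apply block repeatedly; stop early once an update changes h by less than tol."""
    loops_used = 0
    while loops_used < max_loops:
        new_h = block(h)
        loops_used += 1
        change = max(abs(a - b) for a, b in zip(new_h, h))
        h = new_h
        if change < tol:
            break
    return h, loops_used


def toy_block(h):
    """Stand-in for a trained block: moves every value halfway toward 1.0."""
    return [0.5 * v + 0.5 for v in h]


# "Easy" input starts near the fixed point, "hard" input starts far from it.
_, easy_loops = adaptive_loop(toy_block, [0.9, 1.05], max_loops=20, tol=0.01)
_, hard_loops = adaptive_loop(toy_block, [-7.0, 9.0], max_loops=20, tol=0.01)
print(easy_loops, hard_loops)
```

The input that starts close to the fixed point exits after fewer loops than the one that starts far away, and `max_loops` caps the cost of any single request.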

This connects to a broader idea called inference-time scaling or test-time compute: models that can spend more compute on harder examples. This is the idea behind OpenAI’s o1/o3 and DeepSeek R1.

Key insight: "Chain of thought" reasoning (generating intermediate steps before answering) spends extra compute on hard problems in the form of extra tokens. Looped Transformers provide a complementary mechanism: extra processing passes over the hidden representations before any token is generated.

Related Ideas

Model / Idea	How they loop
Universal Transformer (Dehghani et al., 2018)	Weight-tied layers with adaptive halting per position
ALBERT	Weight-tied encoder layers, used for parameter efficiency rather than adaptive computation
Mixture of Experts (MoE)	No looping: a shared router sends each token to different experts, a related form of conditional computation
Diffusion LMs	Iteratively refine the output sequence
o1 / o3 / R1	Loop at the token level: generate a long chain-of-thought “scratchpad” before answering

Limitations

Training deeper “looped” networks can be harder — gradients must flow through many applications of the same block.
Harder to specialise different layers for different types of representations (early layers vs. late layers in standard Transformers learn qualitatively different things).
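Varying T at inference adds deployment complexity: latency and cost per request are no longer fixed, and something has to decide how many loops each input gets.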
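The first limitation can be made concrete by writing out the gradient of a looped model. The notation here (θ for the shared block weights, ℓ for the loss function) is assumed for illustration; the derivation is the standard chain rule for a weight-tied recurrence.

```latex
h_t = \mathrm{Block}_\theta(h_{t-1}), \quad t = 1, \dots, T, \qquad
\mathcal{L} = \ell\big(\mathrm{head}(h_T)\big)

\frac{\partial \mathcal{L}}{\partial \theta}
  = \sum_{t=1}^{T}
    \frac{\partial \mathcal{L}}{\partial h_T}
    \left( \prod_{s=t+1}^{T} \frac{\partial h_s}{\partial h_{s-1}} \right)
    \frac{\partial h_t}{\partial \theta}\bigg|_{h_{t-1}\ \text{fixed}}
```

The product runs from s = T down to s = t + 1 and multiplies Jacobians of the same block. If those Jacobians consistently shrink or stretch vectors, the contribution from early iterations vanishes or explodes as T grows, which is the same difficulty recurrent networks face.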

✅ Key Takeaways

Looped Transformers apply one weight-tied block T times, mimicking depth without proportional parameters.
T can be varied at inference: more loops mean more compute and, ideally, better answers on hard problems.
Connects to inference-time scaling — the key idea behind reasoning models like o1 and R1.
Trade-off: harder to train, may lack the specialisation benefits of independent layer weights.

Alessio Borgi