Z-SASLM: Zero-Shot Style Blending via Spherical Interpolation

10 minute read

Published: May 26, 2026

TL;DR: Linear style blending in diffusion latents distorts the geometry of the latent space. Z-SASLM replaces it with SLERP, producing cleaner multi-style blends that better preserve the structure of the original style representations.

Paper: "Z-SASLM: Zero-Shot Style-Aligned SLI Blending Latent Manipulation" · arXiv:2503.23234
Authors: A. Borgi, L. Maiano, I. Amerini
Venue: CVPR 2025 Workshop on Computer Vision for Extended Universe (CVEU)

Paper preview: first page of "Z-SASLM: Zero-Shot Style-Aligned SLI Blending Latent Manipulation" (Borgi et al., 2025).

The Problem: Linear Blending in a Non-Linear Space

Intuition First: Suppose you want the point halfway between London and Tokyo. Averaging their latitude and longitude on a flat map gives a point that looks plausible on the projection but lies well south of the true midpoint on the globe, because the shortest route between the two cities is a great-circle arc that bends toward the pole. Diffusion latent spaces behave similarly, at least approximately: the meaningful paths between style representations are arcs, not straight lines. LERP takes the shortcut through the interior; SLERP follows the surface.

Latent diffusion models (like Stable Diffusion) encode styles as vectors in a high-dimensional latent space. When you want a generated image that combines two or more reference styles, the intuitive approach is to take a weighted average: blend = α·style₁ + β·style₂ + …

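This is linear interpolation (LERP), and it has a fundamental flaw: it assumes the latent space is Euclidean, so that style representations live in a flat space where midpoints are simply averages. But the latents of diffusion models do not fill that space evenly. A standard Gaussian sample in d dimensions has norm close to √d, so the vectors the model actually sees concentrate on (or near) a hypersphere, and the style representations live on (or near) that hypersphere too.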

Linear blending of unit vectors produces a result that is shorter than the originals — it falls inside the sphere, into a low-density region of the latent space. The result: blended styles lose structure, introduce artifacts, and fail to faithfully combine the reference styles.
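The shrinkage is easy to check numerically. The sketch below is a toy illustration, not code from the paper: it builds two unit vectors in the plane at a chosen angle (the helper names `unit_pair` and `lerp` are made up for this example) and prints the norm of their equal-weight linear blend, which for unit vectors equals cos(Ω/2).

```python
import numpy as np

def unit_pair(angle_deg):
    """Two unit vectors in the plane separated by the given angle (toy stand-ins for styles)."""
    omega = np.deg2rad(angle_deg)
    u = np.array([1.0, 0.0])
    v = np.array([np.cos(omega), np.sin(omega)])
    return u, v

def lerp(u, v, t):
    """Plain linear interpolation: the weighted average described in the text."""
    return (1.0 - t) * u + t * v

for angle in (30, 60, 90, 120):
    u, v = unit_pair(angle)
    mid = lerp(u, v, 0.5)
    # For unit vectors, the midpoint norm equals cos(angle / 2).
    print(angle, round(float(np.linalg.norm(mid)), 3))
```

The wider the angle between the two vectors, the further the blend drops below norm 1, so the effect is strongest when the styles are most different.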

Animated LERP vs SLERP on the hypersphere. LERP (left, red) draws a straight chord between the two style vectors — the midpoint falls inside the sphere into a low-density region with weak semantic support. SLERP (right, blue) follows the great-circle arc — the midpoint stays on the sphere's surface, where the diffusion model's learned distribution is concentrated.

Why This Is a Real Failure Mode

When style blending fails, the output usually does not fail in an obvious binary way. Instead, one reference style dominates, another becomes washed out, or the image acquires unstable artifacts in the regions where the styles should interact. That is why the geometry matters: even if the prompt and the base diffusion model are unchanged, the interpolation rule alone can move generation into a part of latent space where the model has much weaker semantic support.

Z-SASLM: Geodesic Blending

Z-SASLM replaces LERP with Spherical Linear Interpolation (SLERP), which interpolates along the great circle (geodesic) of the hypersphere. For two unit vectors u and v:

\[\text{SLERP}(\mathbf{u}, \mathbf{v}; t) = \frac{\sin((1-t)\Omega)}{\sin\Omega}\,\mathbf{u} + \frac{\sin(t\Omega)}{\sin\Omega}\,\mathbf{v}\]

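where Ω is the angle between u and v (so cos Ω = u·v) and t ∈ [0, 1] is the interpolation parameter, with t = 0 returning u and t = 1 returning v. The result stays on the sphere, preserving the norm and intrinsic geometry of the latent space.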
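A minimal NumPy sketch of the formula follows. It is an illustration under stated assumptions rather than the authors' implementation: the function name `slerp`, the `eps` threshold, and the handling of the two degenerate cases (nearly parallel and antipodal inputs) are choices made here.

```python
import numpy as np

def slerp(u, v, t, eps=1e-6):
    """Spherical linear interpolation; inputs are normalised, so the output is a unit vector."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    # Clip guards against dot products that land slightly outside [-1, 1].
    omega = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))
    if np.pi - omega < eps:
        # Assumption: refuse antipodal inputs, where the great circle is not unique.
        raise ValueError("antipodal vectors: the geodesic is not unique")
    if omega < eps:
        # Nearly parallel: sin(omega) is close to 0, so use a renormalised linear blend.
        w = (1.0 - t) * u + t * v
        return w / np.linalg.norm(w)
    return (np.sin((1.0 - t) * omega) * u + np.sin(t * omega) * v) / np.sin(omega)

rng = np.random.default_rng(0)
u, v = rng.normal(size=(2, 16))
mid = slerp(u, v, 0.5)
print(np.linalg.norm(mid))  # stays at 1 up to floating-point error
```

Normalising inside the function is a convenience for the sketch; a pipeline that needs to keep the original scale of the latents would have to store and reapply it separately.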

Key Insight: why sines instead of weights. In LERP, the interpolation weights (1−t) and t add to 1. In SLERP, the weights are sin((1−t)Ω)/sinΩ and sin(tΩ)/sinΩ, which in general add to more than 1; that extra scale is exactly what keeps the result on the sphere instead of letting it sink into the interior. At t=0.5 both weights equal 1/(2cos(Ω/2)), so the larger Ω is (the more different the styles), the further they rise above LERP's 0.5 and 0.5. For small angles (similar styles), the SLERP weights approach the LERP weights, so the two methods agree when the styles are close and differ most when the styles are geometrically distant.
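A short derivation of the equal-weight case makes the size of the SLERP weights explicit. It uses only the double-angle identity sin Ω = 2 sin(Ω/2) cos(Ω/2) and is standard trigonometry, not something specific to the paper:

\[\left.\frac{\sin(t\Omega)}{\sin\Omega}\right|_{t=1/2} = \frac{\sin(\Omega/2)}{2\sin(\Omega/2)\cos(\Omega/2)} = \frac{1}{2\cos(\Omega/2)}, \qquad \text{sum of the two weights} = \frac{1}{\cos(\Omega/2)} \geq 1\]

As Ω → 0, sin(tΩ)/sinΩ → t and sin((1−t)Ω)/sinΩ → 1−t, which recovers the LERP weights.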

Step-by-step: SLERP for two style vectors at t = 0.5

Suppose style₁ and style₂ are unit vectors with angle Ω = 60° between them.

Quantity	Value
Ω	60° = π/3 rad
sin(Ω)	sin(60°) = 0.866
sin((1−t)Ω) = sin(0.5 × 60°)	sin(30°) = 0.500
sin(tΩ) = sin(0.5 × 60°)	sin(30°) = 0.500
Weight for u	0.500 / 0.866 = 0.577
Weight for v	0.500 / 0.866 = 0.577
LERP weights at t=0.5 (for comparison)	0.500 for u, 0.500 for v

Both SLERP weights are 0.577, and the result vector has norm ≈ 1 (stays on the sphere). LERP would give weights 0.5/0.5 but the resulting vector has norm cos(30°) ≈ 0.866 — 13% shorter than it should be, pushed inside the sphere.
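The numbers in the table can be reproduced in a few lines. This is only a check of the arithmetic above; the helper `blend_norm` is a name introduced here and relies on the identity |a·u + b·v|² = a² + b² + 2ab·cos Ω for unit vectors.

```python
import math

omega = math.radians(60.0)
t = 0.5

# SLERP weights from the formula above.
w_u = math.sin((1.0 - t) * omega) / math.sin(omega)
w_v = math.sin(t * omega) / math.sin(omega)

def blend_norm(a, b, omega):
    """Norm of a*u + b*v for unit vectors u, v separated by angle omega."""
    return math.sqrt(a * a + b * b + 2.0 * a * b * math.cos(omega))

slerp_norm = blend_norm(w_u, w_v, omega)
lerp_norm = blend_norm(1.0 - t, t, omega)

print(round(w_u, 3), round(w_v, 3))  # 0.577 0.577
print(round(slerp_norm, 3))          # 1.0
print(round(lerp_norm, 3))           # 0.866
```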

For multiple styles, Z-SASLM extends SLERP iteratively: blend style₁ and style₂ to get an intermediate representation, then blend that with style₃, and so on. Weights are applied at each step to control the contribution of each style.

Iterative SLERP chaining for 3 styles. Style₁ and Style₂ are first blended geodesically (with weights 0.4 and 0.35). The intermediate result blend₁₂ is then SLERP'd with Style₃ (weight 0.25) to produce the final fused style vector. At each step the result stays on the hypersphere — no norm shrinkage accumulates across the chain.
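The chaining can be sketched as a left-to-right fold. The post does not say how blend weights are converted into the interpolation parameter t at each step, so the rule used below (t equals the new style's share of the weight accumulated so far) is an assumption made for illustration, as are the function names and the random stand-in vectors.

```python
import numpy as np

def slerp(u, v, t):
    """SLERP between unit vectors (assumes they are neither parallel nor antipodal)."""
    omega = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))
    return (np.sin((1.0 - t) * omega) * u + np.sin(t * omega) * v) / np.sin(omega)

def chain_slerp(styles, weights):
    """Fold a list of unit style vectors into a single blend, one SLERP per step."""
    blend, total = styles[0], weights[0]
    for style, w in zip(styles[1:], weights[1:]):
        total += w
        # Hypothetical weight-to-t rule: the new style's share of the weight seen so far.
        blend = slerp(blend, style, w / total)
    return blend

rng = np.random.default_rng(0)
# Random unit vectors standing in for three style representations.
styles = [s / np.linalg.norm(s) for s in rng.normal(size=(3, 64))]
fused = chain_slerp(styles, [0.4, 0.35, 0.25])
print(np.linalg.norm(fused))  # stays at 1 up to floating-point error
```

With the weights 0.4, 0.35, and 0.25 this rule gives t ≈ 0.467 for the first step and t = 0.25 for the second. Note that chained SLERP is order-dependent in general, so blending the same styles in a different sequence can give a different result.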

Figure 1: SLERP vs. linear interpolation in latent space. Linear blending (dashed) falls inside the hypersphere, into a low-density region. SLERP (arc) stays on the sphere's surface, preserving the intrinsic latent structure throughout the blend.

Full Pipeline

The pipeline leverages StyleAligned attention sharing for style injection: at generation time, the blended style vector influences the self-attention maps of the UNet decoder, imprinting the fused style onto the generated image without retraining.

Key Insight — how StyleAligned attention injection works: In a standard diffusion UNet, each image in a batch attends only to its own self-attention keys and values. StyleAligned modifies this: the style reference image and the target image are denoised together, and the target's self-attention queries attend to the style reference's keys and values — sharing appearance statistics across the attention layers. Z-SASLM computes the blended SLERP style vector once before denoising begins, then uses it as the single shared style reference throughout the entire diffusion trajectory. This means the geometry fix (SLERP vs LERP) happens upstream of the attention mechanism — it changes what the model is shown, not how attention is computed.
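The attention-sharing idea can be illustrated with a single-head toy example. This is a simplified sketch, not the StyleAligned or Z-SASLM code: it assumes the target attends to the concatenation of its own keys and values and the reference's, and it leaves out multi-head projections, any feature normalisation, and the UNet itself. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def shared_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref):
    """Target queries attend to the target's own tokens plus the style reference's tokens."""
    k = np.concatenate([k_tgt, k_ref], axis=0)
    v = np.concatenate([v_tgt, v_ref], axis=0)
    return attention(q_tgt, k, v)

rng = np.random.default_rng(0)
d = 8
q, k_t, v_t = (rng.normal(size=(4, d)) for _ in range(3))  # 4 target tokens
k_r, v_r = (rng.normal(size=(6, d)) for _ in range(2))     # 6 reference tokens
out = shared_attention(q, k_t, v_t, k_r, v_r)
print(out.shape)  # (4, 8)
```

The point of the sketch is the one made in the text: the attention computation itself is unchanged, and only the set of keys and values offered to the target differs.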

What Actually Makes Z-SASLM Practical

The method is not just “use SLERP instead of LERP.” The practical contribution is the combination of:

a zero-shot pipeline, so no style-specific fine-tuning is needed;
multi-reference blending, not only two-style interpolation;
context-aware weighting, so different reference modalities can contribute differently;
an evaluation protocol that checks whether all styles remain visible in the result.

That combination makes the method usable as an actual generation workflow rather than a one-off interpolation demo.

Results

2-Style Blending

Figure 3: Two-style blend (Medieval + Cubism) with Z-SASLM 2-style SLI blending. The generated image faithfully captures both the ornate structure of medieval art and the geometric fragmentation of Cubism, without artifacts or style dominance.

SLERP vs. Linear: 3-Style Comparison

Figure 4: Three-style blend comparison, linear vs. SLERP. Linear blending (left) produces artifacts and style collapse in the blended region; Z-SASLM's SLERP blending (right) maintains coherent style fusion across all three references.

New Evaluation Metric: WMS-DINO

Standard style-transfer metrics (CLIP score, DINO similarity) evaluate similarity to a single reference style. For multi-style blending, you need to measure consistency with all styles simultaneously.

Z-SASLM introduces Weighted Multi-Style DINO ViT-B/8 (WMS-DINO): a weighted average of the DINO similarities between the generated image and each style reference, using the same weights as the blend. This metric quantitatively captures whether all input styles are faithfully represented in the output.
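Written as a formula, the description above amounts to the following. The notation is introduced here for illustration: x is the generated image, s_i the i-th style reference, w_i its blend weight, N the number of styles, and sim the DINO similarity (assumed to be a similarity between DINO ViT-B/8 embeddings, with the weights normalised to sum to 1).

\[\text{WMS-DINO}(x) = \sum_{i=1}^{N} w_i \,\text{sim}(x, s_i), \qquad \sum_{i=1}^{N} w_i = 1\]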

Key Insight: why existing metrics fail for multi-style blending. CLIP similarity and standard DINO similarity both measure how close a generated image is to one reference. If you have three styles and compute three separate scores, you can declare success if any one of them is high, but that masks style collapse, where the output locks onto the dominant style and ignores the others. WMS-DINO addresses this by aggregating all style scores with the same weights used in the blend, so a suppressed style pulls the score down in proportion to the weight it was supposed to have. A high WMS-DINO score therefore indicates that the styles are represented roughly in their intended proportions, not just the winner.

Worked example — WMS-DINO calculation for 3 styles:

Blend weights: w₁=0.4, w₂=0.35, w₃=0.25. DINO similarities of the generated image to each reference:

Method	DINO sim to Style₁	DINO sim to Style₂	DINO sim to Style₃	WMS-DINO
LERP result	0.72	0.45	0.31	0.4×0.72 + 0.35×0.45 + 0.25×0.31 = 0.523
SLERP result	0.68	0.63	0.58	0.4×0.68 + 0.35×0.63 + 0.25×0.58 = 0.6375 ≈ 0.638

The LERP result scores higher on Style₁ alone (0.72 vs 0.68) — it dominated. But the balanced WMS-DINO score is lower because Styles 2 and 3 were suppressed. The SLERP result trades a fraction of Style₁ fidelity for substantially better balance across all three.
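The same calculation as a small function, for readers who want to plug in their own scores. The function name `wms_dino` is made up for this sketch, and it takes precomputed similarity values as input; extracting DINO features from images is outside its scope. The similarity values are the ones from the worked example.

```python
def wms_dino(similarities, weights):
    """Weighted average of per-style similarities, normalised by the total weight."""
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, similarities)) / total

weights = [0.4, 0.35, 0.25]
lerp_sims = [0.72, 0.45, 0.31]
slerp_sims = [0.68, 0.63, 0.58]

print(f"{wms_dino(lerp_sims, weights):.4f}")   # 0.5230
print(f"{wms_dino(slerp_sims, weights):.4f}")  # 0.6375
```

Dividing by the total weight is a small safeguard so the function still returns a weighted average if the weights passed in do not sum exactly to 1.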

The Core Takeaway

Z-SASLM is a paper about respecting representation geometry. If the latent space behaves like a curved manifold, then interpolation should follow that geometry. Once that is enforced, the rest of the style-alignment pipeline becomes noticeably more stable.

✅ Key Takeaways

Linear blending of latent style vectors is geometrically incorrect — the latent space is curved, not flat.
Z-SASLM replaces LERP with iterative SLERP along the geodesic of the hypersphere, preserving latent manifold structure.
Zero-shot and fine-tuning-free: works with any pre-trained latent diffusion model via StyleAligned attention injection.
Introduces WMS-DINO, a new evaluation metric for multi-style consistency.
Published at CVPR 2025 Workshop on Computer Vision for Extended Universe (CVEU).


Alessio Borgi
