Z-SASLM: Zero-Shot Style Blending via Spherical Interpolation

TL;DR: Linear blending of style representations in latent diffusion models treats the latent space as flat, which it is not. Z-SASLM replaces linear interpolation with Spherical Linear Interpolation (SLERP), which follows the geodesic of the hypersphere and so preserves the latent manifold structure when fusing multiple styles. The method is zero-shot, needs no fine-tuning, and comes with a new evaluation metric for multi-style blends.
Paper: "Z-SASLM: Zero-Shot Style-Aligned SLI Blending Latent Manipulation"  ·  arXiv:2503.23234
Authors: A. Borgi, L. Maiano, I. Amerini
Venue: CVPR 2025 Workshop on Computer Vision for Extended Universe (CVEU)

The Problem: Linear Blending in a Non-Linear Space

Latent diffusion models (like Stable Diffusion) encode styles as vectors in a high-dimensional latent space. When you want a generated image that combines two or more reference styles, the intuitive approach is to take a weighted average: blend = α·style₁ + β·style₂ + …

This is linear interpolation (LERP), and it has a fundamental flaw: it assumes the latent space is Euclidean — that style representations live on a flat plane and midpoints are simply averages. But latent spaces of diffusion models are curved; style representations live on (or near) a hypersphere.

A weighted average of two distinct unit vectors is always shorter than the originals: it falls inside the sphere, into a low-density region of the latent space. As a result, the blend loses structure, shows artifacts, and fails to combine the reference styles faithfully.
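
A short calculation makes the shrinkage concrete. As an illustration, assume a two-style blend with weights (1 − t) and t, and write Ω for the angle between the unit vectors u and v (so u · v = cos Ω). The squared norm of the linear blend is then:

\[\|(1-t)\,\mathbf{u} + t\,\mathbf{v}\|^2 = (1-t)^2 + t^2 + 2t(1-t)\cos\Omega = 1 - 2t(1-t)(1-\cos\Omega)\]

This is below 1 whenever 0 < t < 1 and Ω > 0. At the midpoint t = 1/2 the norm equals cos(Ω/2), which is about 0.71 for orthogonal styles (Ω = 90°) and approaches 0 as the two styles become antipodal.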

Z-SASLM: Geodesic Blending

Z-SASLM replaces LERP with Spherical Linear Interpolation (SLERP), which interpolates along the great circle (geodesic) of the hypersphere. For two unit vectors u and v:

\[\text{SLERP}(\mathbf{u}, \mathbf{v}; t) = \frac{\sin((1-t)\Omega)}{\sin\Omega}\,\mathbf{u} + \frac{\sin(t\Omega)}{\sin\Omega}\,\mathbf{v}\]

where Ω = arccos(u · v) is the angle between u and v, and t ∈ [0, 1] is the interpolation weight (t = 0 returns u, t = 1 returns v). The result stays on the sphere, preserving the norm and intrinsic geometry of the latent space.
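
The norm claim can be checked directly from the formula. Set a = (1 − t)Ω and b = tΩ, so that a + b = Ω, and use u · u = v · v = 1 and u · v = cos Ω (assuming sin Ω ≠ 0, i.e. the vectors are neither identical nor antipodal):

\[\|\text{SLERP}(\mathbf{u}, \mathbf{v}; t)\|^2 = \frac{\sin^2 a + \sin^2 b + 2\sin a\,\sin b\,\cos(a+b)}{\sin^2\Omega} = \frac{\sin^2(a+b)}{\sin^2\Omega} = 1\]

The middle step is the identity sin²a + sin²b + 2 sin a sin b cos(a + b) = sin²(a + b), which follows from expanding both sides with the angle-addition formulas.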

For multiple styles, Z-SASLM extends SLERP iteratively: blend style₁ and style₂ to get an intermediate representation, then blend that with style₃, and so on. Weights are applied at each step to control the contribution of each style.
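
The iterative scheme can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names `slerp` and `blend_styles`, the `eps` threshold, and the flattening of each style representation into a 1-D vector are all assumptions. In particular, the source only says that weights are applied at each step; the running-weight rule used here (step k interpolates with t = w_k / (w_1 + … + w_k)) is one plausible choice, picked because it reduces to the ordinary weighted mean when the vectors nearly coincide.

```python
import numpy as np


def slerp(u, v, t, eps=1e-7):
    """Spherical linear interpolation between the directions of u and v."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    cos_omega = np.clip(np.dot(u, v), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    sin_omega = np.sin(omega)
    if sin_omega < eps:
        if cos_omega < 0:
            # Antipodal inputs: the great circle between them is not unique.
            raise ValueError("slerp is undefined for antipodal vectors")
        # Nearly identical directions: normalized LERP is numerically safer.
        out = (1.0 - t) * u + t * v
        return out / np.linalg.norm(out)
    return (np.sin((1.0 - t) * omega) * u + np.sin(t * omega) * v) / sin_omega


def blend_styles(styles, weights):
    """Fold a list of style vectors into one unit vector by repeated SLERP.

    Assumption (not from the paper): step k uses t = w_k / (w_1 + ... + w_k).
    """
    if len(styles) == 0 or len(styles) != len(weights):
        raise ValueError("need one weight per style and at least one style")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    blended = styles[0] / np.linalg.norm(styles[0])
    total = weights[0]
    for style, w in zip(styles[1:], weights[1:]):
        total += w
        blended = slerp(blended, style, w / total)
    return blended


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    styles = [rng.standard_normal(16) for _ in range(3)]  # stand-in vectors
    fused = blend_styles(styles, [0.5, 0.3, 0.2])
    print("norm of blended vector:", np.linalg.norm(fused))
```

Note that this sketch normalizes every input to unit length, so it blends directions only. If the original magnitudes matter for a given model, they would need to be handled separately.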

Figure 1 — SLERP vs. linear interpolation. Linear blending (dashed) falls inside the hypersphere, into a low-density region. SLERP (arc) stays on the sphere's surface, preserving the intrinsic latent structure throughout the blend.

Full Pipeline

Figure 2 — The Z-SASLM pipeline. Style reference images are encoded into the diffusion latent space; their representations are combined via iterative SLERP blending with user-specified weights; the blended style vector is then used to guide generation via StyleAligned attention injection. No fine-tuning required.

The pipeline leverages StyleAligned attention sharing for style injection: at generation time, the blended style vector influences the self-attention maps of the UNet decoder, imprinting the fused style onto the generated image without retraining.

Results

2-Style Blending

Figure 3 — Two-style blend (Medieval + Cubism) with Z-SASLM. The generated image faithfully captures both the ornate structure of medieval art and the geometric fragmentation of Cubism, without artifacts or style dominance.

SLERP vs. Linear: 3-Style Comparison

Figure 4 — Three-style blend comparison. Linear blending (left) produces artifacts and style collapse in the blended region; Z-SASLM's SLERP blending (right) maintains coherent style fusion across all three references.

New Evaluation Metric: WMS-DINO

Standard style-transfer metrics (CLIP score, DINO similarity) evaluate similarity to a single reference style. For multi-style blending, you need to measure consistency with all styles simultaneously.

Z-SASLM introduces Weighted Multi-Style DINO ViT-B/8 (WMS-DINO): a weighted average of the DINO similarities between the generated image and each style reference, using the same weights as the blend. The metric quantifies how faithfully all input styles, in their intended proportions, are represented in the output.
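
The weighted average described above can be sketched as follows. This is an illustrative reading of the description, not the paper's reference code: it assumes the DINO ViT-B/8 embeddings have already been extracted as 1-D vectors, uses cosine similarity as the similarity measure, and normalizes by the sum of the weights. The names `cosine_similarity` and `wms_dino` are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def wms_dino(generated_emb, style_embs, weights):
    """Weighted mean of similarities between one output and each style.

    Assumptions: embeddings are precomputed 1-D arrays, similarity is
    cosine similarity, and weights are normalized by their sum.
    """
    weights = np.asarray(weights, dtype=float)
    if len(style_embs) != len(weights):
        raise ValueError("need one weight per style embedding")
    if weights.sum() <= 0:
        raise ValueError("weights must sum to a positive value")
    sims = np.array([cosine_similarity(generated_emb, s) for s in style_embs])
    return float(np.dot(weights, sims) / weights.sum())


if __name__ == "__main__":
    # Toy 2-D stand-ins for real embeddings.
    generated = np.array([1.0, 0.0])
    references = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
    print(wms_dino(generated, references, [0.75, 0.25]))
```

In the toy example the output matches the first reference exactly and is orthogonal to the second, so the score equals the first style's share of the total weight.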

✅ Key Takeaways

  • Linear blending of latent style vectors is geometrically incorrect — the latent space is curved, not flat.
  • Z-SASLM replaces LERP with iterative SLERP along the geodesic of the hypersphere, preserving latent manifold structure.
  • Zero-shot and fine-tuning-free: works with any pre-trained latent diffusion model via StyleAligned attention injection.
  • Introduces WMS-DINO, a new evaluation metric for multi-style consistency.
  • Published at CVPR 2025 Workshop on Computer Vision for Extended Universe (CVEU).