Z-SASLM: Zero-Shot Style Blending via Spherical Interpolation
The Problem: Linear Blending in a Non-Linear Space
Latent diffusion models (like Stable Diffusion) encode styles as vectors in a high-dimensional latent space. When you want a generated image that combines two or more reference styles, the intuitive approach is to take a weighted average: blend = α·style₁ + β·style₂ + …
This is linear interpolation (LERP), and it has a fundamental flaw: it assumes the latent space is Euclidean, i.e. that style representations live in a flat space where the midpoint of two styles is simply their average. But latent spaces of diffusion models are curved; style representations live on (or near) a hypersphere.
Linearly blending two distinct unit vectors yields a vector whose norm is below 1: the result falls inside the sphere, into a low-density region of the latent space. In practice, blended styles lose structure, show artifacts, and fail to combine the reference styles faithfully.
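To make the norm shrinkage concrete, here is a minimal sketch using two toy 2-D unit vectors as stand-ins for style embeddings (the vectors and the helper names `lerp` and `norm` are illustrative assumptions, not part of Z-SASLM). For unit vectors separated by an angle Ω, the midpoint of a linear blend has norm cos(Ω/2), which is below 1 whenever Ω > 0.

```python
import math

def lerp(u, v, t):
    """Linear interpolation between two equal-length vectors."""
    return [(1 - t) * a + t * b for a, b in zip(u, v)]

def norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in x))

# Toy data (assumption): two orthogonal unit vectors, so the angle is pi/2.
u = [1.0, 0.0]
v = [0.0, 1.0]

mid = lerp(u, v, 0.5)   # [0.5, 0.5]
mid_norm = norm(mid)    # cos(pi/4) = sqrt(0.5), i.e. inside the unit circle
```

The wider the angle between the two styles, the further the linear midpoint sinks below the sphere.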
Z-SASLM: Geodesic Blending
Z-SASLM replaces LERP with Spherical Linear Interpolation (SLERP), which interpolates along the great circle (geodesic) of the hypersphere. For two unit vectors u and v:
\[\text{SLERP}(\mathbf{u}, \mathbf{v}; t) = \frac{\sin((1-t)\Omega)}{\sin\Omega}\,\mathbf{u} + \frac{\sin(t\Omega)}{\sin\Omega}\,\mathbf{v}\]
where Ω = arccos(u · v) is the angle between u and v, and t ∈ [0, 1] is the interpolation parameter (t = 0 returns u, t = 1 returns v). The result stays on the sphere, preserving the norm and intrinsic geometry of the latent space.
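The formula translates almost directly into code. The following is a minimal sketch in plain Python, assuming both inputs are already unit vectors; the `eps` threshold and the fallback to linear interpolation when sin Ω is close to zero are assumptions added for numerical safety, not details taken from Z-SASLM.

```python
import math

def slerp(u, v, t, eps=1e-8):
    """Spherical linear interpolation between two unit vectors u and v."""
    # Clamp the dot product so rounding error cannot push it outside [-1, 1].
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)          # angle between u and v
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel or antiparallel: the formula is ill-conditioned,
        # so fall back to plain linear interpolation (an assumption).
        return [(1 - t) * a + t * b for a, b in zip(u, v)]
    w_u = math.sin((1 - t) * omega) / sin_omega
    w_v = math.sin(t * omega) / sin_omega
    return [w_u * a + w_v * b for a, b in zip(u, v)]

# Toy usage: the midpoint of two orthogonal unit vectors stays on the circle.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
midpoint_norm = math.sqrt(sum(x * x for x in midpoint))
```

Unlike a linear blend of the same two vectors, the midpoint here keeps unit norm.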
For multiple styles, Z-SASLM extends SLERP iteratively: blend style₁ and style₂ to get an intermediate representation, then blend that with style₃, and so on. Weights are applied at each step to control the contribution of each style.
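The iterative extension can be sketched as a left-to-right fold over the style vectors. The text above does not spell out how the per-step weights are derived, so this sketch assumes one plausible scheme: at each step the new style is blended in with t = w_k / (w_1 + … + w_k), its share of the weight accumulated so far. The function names `normalize` and `blend_styles` are hypothetical.

```python
import math

def normalize(x):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(a * a for a in x))
    return [a / n for a in x]

def slerp(u, v, t, eps=1e-8):
    """Spherical linear interpolation between two unit vectors."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)
    sin_omega = math.sin(omega)
    if sin_omega < eps:  # ill-conditioned case: fall back to LERP (assumption)
        return [(1 - t) * a + t * b for a, b in zip(u, v)]
    w_u = math.sin((1 - t) * omega) / sin_omega
    w_v = math.sin(t * omega) / sin_omega
    return [w_u * a + w_v * b for a, b in zip(u, v)]

def blend_styles(styles, weights):
    """Iteratively SLERP a list of style vectors using running weights.

    Assumed scheme (not confirmed by the source): step k uses
    t = w_k / (w_1 + ... + w_k).
    """
    if not styles or len(styles) != len(weights):
        raise ValueError("styles and weights must be non-empty and equal length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    blended = normalize(styles[0])
    running = weights[0]
    for style, w in zip(styles[1:], weights[1:]):
        total = running + w
        t = w / total if total > 0 else 0.0
        blended = slerp(blended, normalize(style), t)
        running = total
    return blended

# Toy usage: three orthogonal "styles" with equal weights.
toy_styles = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
blend = blend_styles(toy_styles, [1.0, 1.0, 1.0])
blend_norm = math.sqrt(sum(x * x for x in blend))
```

Note that iterated SLERP is order-dependent in general: blending the same styles in a different order can give a slightly different point on the sphere, even with equal weights.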

Full Pipeline

The pipeline leverages StyleAligned attention sharing for style injection: at generation time, the blended style vector influences the self-attention maps of the UNet decoder, imprinting the fused style onto the generated image without retraining.
Results
2-Style Blending

SLERP vs. Linear: 3-Style Comparison

New Evaluation Metric: WMS-DINO
Standard style-transfer metrics (CLIP score, DINO similarity) evaluate similarity to a single reference style. For multi-style blending, you need to measure consistency with all styles simultaneously.
Z-SASLM introduces Weighted Multi-Style DINO ViT-B/8 (WMS-DINO): a weighted average of pairwise DINO similarities between the generated image and each style reference, using the same weights as the blend. This metric quantifies whether all input styles are faithfully represented in the output.
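As a rough illustration of the idea, the sketch below computes a weighted average of similarities from precomputed embeddings. It assumes that "DINO similarity" means cosine similarity between DINO ViT-B/8 image embeddings and that the weights are normalized by their sum; feature extraction is not shown, and the function name `wms_dino` is hypothetical rather than the authors' code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def wms_dino(generated_emb, style_embs, weights):
    """Weighted average of similarities between one generated image
    embedding and each style reference embedding.

    Embeddings are assumed to come from a DINO ViT-B/8 encoder that is
    run elsewhere; here they are plain lists of floats.
    """
    if len(style_embs) != len(weights):
        raise ValueError("need exactly one weight per style embedding")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    sims = [cosine_similarity(generated_emb, s) for s in style_embs]
    return sum(w * s for w, s in zip(weights, sims)) / total

# Toy usage: the generated embedding matches style 1 and is orthogonal to style 2.
score = wms_dino([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0.75, 0.25])
```

With the toy inputs, the score is driven entirely by the first style's similarity, scaled by its 0.75 share of the weight.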
✅ Key Takeaways
- Linear blending of latent style vectors is geometrically incorrect — the latent space is curved, not flat.
- Z-SASLM replaces LERP with iterative SLERP along the geodesic of the hypersphere, preserving latent manifold structure.
- Zero-shot and fine-tuning-free: works with any pre-trained latent diffusion model via StyleAligned attention injection.
- Introduces WMS-DINO, a new evaluation metric for multi-style consistency.
- Published at the CVPR 2025 Workshop on AI for Creative Visual Content Generation, Editing and Understanding (CVEU).
