Z-SASLM: Zero-Shot Multi-Style Image Synthesis via Spherical Linear Interpolation

Presented at a CVPR 2025 workshop.

Z-SASLM (Zero-Shot Style-Aligned Spherical Linear Morphing) is a framework for generating images that coherently blend multiple artistic styles, without any fine-tuning or additional training. By operating entirely at inference time, it is applicable to any pre-trained text-to-image diffusion model.

The Problem

Existing style transfer methods either require fine-tuning on target styles (expensive, inflexible) or produce abrupt style transitions when mixing multiple references. Z-SASLM achieves smooth, semantically coherent blending across an arbitrary number of style references in a single forward pass.

Method

  • Spherical Linear Interpolation (SLI): interpolates between style latent codes along geodesics on the unit hypersphere, producing perceptually uniform blends that avoid the "grey average" failure of linear interpolation.
  • Style-Aligned attention sharing: cross-image shared attention keys/values propagate style information across the batch during denoising.
  • DINOv2 style encoding: robust visual style descriptors extracted without task-specific training.
  • Zero-shot: no fine-tuning required; works out of the box on any diffusion checkpoint.
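
The interpolation step can be illustrated with a small, dependency-free sketch. This is not the project's code: the function names (normalize, slerp, multi_slerp) are hypothetical, plain Python lists stand in for style latent codes, and the sequential weighted fold used to combine more than two styles is an assumption made here for illustration, not necessarily the scheme used in Z-SASLM.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def slerp(a, b, t, eps=1e-7):
    """Interpolate from a (t=0) to b (t=1) along the great-circle arc."""
    a, b = normalize(a), normalize(b)
    # Clamp the dot product so rounding error cannot push acos out of range.
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between the two unit vectors
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel vectors: normalized linear interpolation is a safe
        # fallback. Antipodal vectors have no unique geodesic, so results
        # there are unreliable and normalize() may raise.
        return normalize([(1 - t) * x + t * y for x, y in zip(a, b)])
    wa = math.sin((1 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]


def multi_slerp(vectors, weights):
    """Blend several style vectors by folding them in one at a time.

    Assumption for illustration: each new vector is mixed in with fraction
    w_i / (w_1 + ... + w_i), mirroring an incremental weighted mean.
    """
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    result = normalize(vectors[0])
    total = weights[0]
    for v, w in zip(vectors[1:], weights[1:]):
        total += w
        result = slerp(result, v, w / total)
    return result


def vector_norm(v):
    return math.sqrt(sum(x * x for x in v))


# Two orthogonal toy "style codes".
style_a = [1.0, 0.0]
style_b = [0.0, 1.0]

spherical_mid = slerp(style_a, style_b, 0.5)
linear_mid = [(x + y) / 2 for x, y in zip(style_a, style_b)]
```

In this toy case the spherical midpoint keeps unit norm, while the linear midpoint shrinks to a norm of about 0.707. That loss of magnitude is one way to picture the "grey average" effect described above.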

Results

Z-SASLM produces high-fidelity multi-style composites that outperform linear blending and single-reference style transfer baselines on both qualitative and CLIP-based quantitative metrics.

Technology

Python, Jupyter Notebooks, Hugging Face Diffusers, DINOv2, SDXL / Stable Diffusion backbones.