Z-SASLM: Zero-Shot Multi-Style Image Synthesis via Spherical Linear Interpolation
Presented at CVPR 2025 Workshop.
Z-SASLM (Zero-Shot Style-Aligned Spherical Linear Morphing) is a framework for generating images that coherently blend multiple artistic styles, without any fine-tuning or additional training. By operating entirely at inference time, it is applicable to any pre-trained text-to-image diffusion model.
The Problem
Existing style transfer methods either require fine-tuning on target styles (expensive, inflexible) or produce abrupt style transitions when mixing multiple references. Z-SASLM achieves smooth, semantically coherent blending across an arbitrary number of style references in a single forward pass.
Method
- Spherical Linear Interpolation (SLI): interpolates between style latent codes along geodesics on the unit hypersphere, producing perceptually uniform blends that avoid the "grey average" failure of linear interpolation.
- Style-Aligned attention sharing: cross-image shared attention keys/values propagate style information across the batch during denoising.
- DINOv2 style encoding: robust visual style descriptors extracted without task-specific training.
- Zero-shot: no fine-tuning required; works out of the box on any diffusion checkpoint.
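To make the interpolation step concrete, here is a minimal NumPy sketch of spherical linear interpolation between two style vectors, with plain linear interpolation alongside for comparison. It is an illustration under assumptions, not the project's actual code: the function names `slerp` and `lerp`, the `eps` tolerance, and the toy 2-D vectors are all hypothetical, and real style latents would be high-dimensional encoder outputs.

```python
import numpy as np

def slerp(v0, v1, t, eps=1e-7):
    """Interpolate along the great-circle arc between v0 and v1 (hypothetical helper).

    Both inputs are normalized first, so the result lies on the unit hypersphere.
    """
    v0 = np.asarray(v0, dtype=np.float64)
    v1 = np.asarray(v1, dtype=np.float64)
    u0 = v0 / np.linalg.norm(v0)
    u1 = v1 / np.linalg.norm(v1)

    # Angle between the two unit vectors; clip guards against rounding drift.
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))

    if omega < eps:
        # Nearly identical directions: the arc is degenerate, return either end.
        return u0
    if np.pi - omega < eps:
        # Opposite directions: the geodesic is not unique, so refuse to guess.
        raise ValueError("slerp is undefined for antiparallel vectors")

    sin_omega = np.sin(omega)
    w0 = np.sin((1.0 - t) * omega) / sin_omega
    w1 = np.sin(t * omega) / sin_omega
    return w0 * u0 + w1 * u1

def lerp(v0, v1, t):
    """Plain linear interpolation, shown only for comparison."""
    v0 = np.asarray(v0, dtype=np.float64)
    v1 = np.asarray(v1, dtype=np.float64)
    return (1.0 - t) * v0 + t * v1

# Toy example: two orthogonal unit "style" vectors.
style_a = np.array([1.0, 0.0])
style_b = np.array([0.0, 1.0])

mid_slerp = slerp(style_a, style_b, 0.5)
mid_lerp = lerp(style_a, style_b, 0.5)

# The slerp midpoint stays on the unit sphere; the lerp midpoint shrinks
# toward the origin, which is one way to picture the "grey average" effect.
print(np.linalg.norm(mid_slerp), np.linalg.norm(mid_lerp))
```

The sketch covers only two references. Extending it to more, for example by chaining pairwise interpolations with per-style weights, is a further design choice not shown here.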
Results
Z-SASLM produces high-fidelity multi-style composites that outperform linear blending and single-reference style transfer baselines on both qualitative and CLIP-based quantitative metrics.
Technology
Python, Jupyter Notebooks, Hugging Face Diffusers, DINOv2, SDXL / Stable Diffusion backbones.
