Z-SASLM: Zero-Shot Style-Aligned SLI Blending Latent Manipulation

CVPR 2025, Workshop on AI for Creative Visual Content Generation, Editing, and Understanding

Published in the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Official CVPR 2025 Workshop Proceedings

Nashville, Tennessee (TN), USA


Abstract

We introduce Z-SASLM, a Zero-Shot Style-Aligned SLI (Spherical Linear Interpolation) Blending Latent Manipulation pipeline that overcomes the limitations of current blending methods. Conventional approaches rely on linear blending, assuming a flat latent space, which leads to suboptimal results when integrating multiple reference styles. In contrast, our framework leverages the non-linear geometry of the latent space by using SLI Blending to combine weighted style representations. By interpolating along the geodesic on the hypersphere, Z-SASLM preserves the intrinsic structure of the latent space, ensuring high-fidelity and coherent blending of diverse styles, all without the need for fine-tuning. We further propose a new metric, Weighted Multi-Style DINO ViT-B/8, designed to quantitatively evaluate the consistency of the blended styles. While our primary focus is on the theoretical and practical advantages of SLI Blending for style manipulation, we also demonstrate its effectiveness in a multi-modal content fusion setting through comprehensive experimental studies. Experimental results show that Z-SASLM achieves enhanced and robust style alignment.

Features

  • Zero-Shot Versatility: Unlock infinite style possibilities without any fine-tuning.
  • SLI Blending for Multi-Reference Style Conditioning: Introduces a novel architecture that leverages spherical linear interpolation to seamlessly blend multiple reference styles without any fine-tuning.
  • Latent Space Mastery: Capitalizes on the intrinsic non-linearity of the latent manifold for optimal style integration.
  • Innovative Evaluation Metric: Proposes the Weighted Multi-Style DINO ViT-B/8 metric to rigorously quantify style consistency across generated images.
  • Multi-Modal Content Fusion: Demonstrates the framework’s robustness by integrating diverse modalities—such as image, audio, and weather data—into a unified content fusion approach.

Architecture

Our framework is built as a modular pipeline that efficiently combines diverse style references and multi-modal cues without fine-tuning. The architecture comprises four main components:

  1. Reference Image Encoding & Blending:
    • A Variational Autoencoder (VAE) extracts latent representations from each reference style image.
    • Our novel Spherical Linear Interpolation (SLI) Blending module then fuses these latent codes along the geodesic of the hypersphere, ensuring smooth and coherent style transitions (see the sketch after this list).
  2. Text Encoding:
    • Textual prompts are encoded using a CLIP-based module, capturing semantic cues and aligning them with visual features.
    • This stage supports both simple captions and richer prompts derived from multiple modalities.
  3. Style-Aligned Image Generation:
    • The blended style representation is combined with the text embeddings to condition a diffusion-based generation process.
    • A style-aligned attention mechanism reinforces consistent style propagation throughout the image generation.
  4. Optional Multi-Modal Content Fusion:
    • Additional inputs such as audio, music, or weather data are first transformed into text.
    • These are fused into a single “Multi-Content Textual Prompt” via a T5-based rephrasing module, further enriching the conditioning signal for improved creative synthesis.
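To make the geodesic blending concrete, here is a minimal sketch in PyTorch. The pairwise slerp is standard; the `sli_blend` fold that extends it to several weighted references is our assumption for illustration, not necessarily the repository's exact implementation.

```python
import torch

def slerp(z0: torch.Tensor, z1: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical linear interpolation between two latent codes."""
    a, b = z0.flatten(), z1.flatten()
    # Angle between the two latents on the hypersphere.
    cos_theta = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + eps), -1.0, 1.0)
    theta = torch.acos(cos_theta)
    if theta < eps:  # nearly parallel: slerp degenerates to lerp
        return (1 - t) * z0 + t * z1
    s = torch.sin(theta)
    # Interpolate along the arc (geodesic) rather than the chord, as lerp would.
    return (torch.sin((1 - t) * theta) / s) * z0 + (torch.sin(t * theta) / s) * z1

def sli_blend(latents: list[torch.Tensor], weights: list[float]) -> torch.Tensor:
    """Fold several weighted style latents into one blended latent.

    Hypothetical multi-reference scheme: repeated pairwise slerps, each
    giving the next style its share of the accumulated weight.
    """
    blended, acc = latents[0], weights[0]
    for z, w in zip(latents[1:], weights[1:]):
        acc += w
        blended = slerp(blended, z, w / acc)
    return blended
```

The blended latent then takes the place of a single-style reference in the conditioning, which is what keeps the pipeline zero-shot.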

Image Results

Our experimental evaluation confirms the effectiveness of Z-SASLM across various style blending scenarios:

  • Style Consistency: Quantitative comparisons using our Weighted Multi-Style DINO ViT-B/8 metric show that SLI Blending significantly outperforms conventional linear interpolation, producing images with robust and coherent style alignment.
  • Visual Quality: Z-SASLM preserves fine stylistic details and avoids abrupt transitions common in linear blending, delivering high-fidelity visuals even under challenging multi-reference conditions.
  • Multi-Modal Fusion: Ablation studies reveal that incorporating diverse content (e.g., audio and weather data) enhances the richness and contextuality of generated images.
[Figure: qualitative blending results for MedCub, PurpleMacro, VanGoghEgyptian, EgyPurpleMacro, and EgyVanGoghMacro]

Results

We compare our SLI Blending method to traditional Linear Blending adapted from StyleGAN2-ADA. We conduct experiments across multiple blending weights using two reference styles (Medieval and Cubism), evaluating with both our Weighted Multi-Style DINO ViT-B/8 metric and CLIP Score. Our results show that Z-SASLM's SLI Blending provides improved style consistency and image-text alignment, especially under equal or near-equal style weights.

Style Weights     Linear (StyleGAN2-ADA)            Z-SASLM (Ours)
{w_med, w_cub}    WMS-DINO ViT-B/8   CLIP Score     WMS-DINO ViT-B/8   CLIP Score
{0, 1}*           0.47552            0.30280        0.47552            0.30280
{0.15, 0.85}      0.41151            0.31534        0.44900            0.31049
{0.25, 0.75}      0.40575            0.31420        0.42347            0.31657
{0.5, 0.5}        0.36393            0.29232        0.39153            0.31434
{0.75, 0.25}      0.36430            0.31752        0.34760            0.31911
{0.85, 0.15}      0.36315            0.32381        0.36779            0.31499
{1, 0}*           0.29891            0.30570        0.29891            0.30570

* No blending: single-style reference (as in StyleAligned).

SLI improves Weighted Multi-Style DINO scores over linear interpolation at most blending weights. For balanced blending ({0.5, 0.5}), Z-SASLM achieves a notable boost in both metrics. This confirms that respecting the geometry of the latent space via SLI leads to better multi-style integration and high-fidelity outputs.
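As a rough guide to reproducing such numbers, the sketch below computes a weighted multi-style DINO score. The aggregation rule (blend-weighted cosine similarity between DINO ViT-B/8 [CLS] features of the output and each reference) is our assumption and may differ from the paper's exact formulation; the torch.hub entry point is the public DINO release.

```python
import torch
import torch.nn.functional as F

# Public DINO ViT-B/8 checkpoint from the official facebookresearch/dino repo.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vitb8").eval()

@torch.no_grad()
def wms_dino(generated: torch.Tensor, refs: list[torch.Tensor], weights: list[float]) -> float:
    """Hypothetical Weighted Multi-Style DINO ViT-B/8 score.

    Inputs are (1, 3, H, W) ImageNet-normalized images with H, W divisible
    by 8. Assumed score: sum_i w_i * cos(DINO(generated), DINO(ref_i)).
    """
    g = F.normalize(dino(generated), dim=-1)  # [CLS] feature of the output
    score = 0.0
    for ref, w in zip(refs, weights):
        r = F.normalize(dino(ref), dim=-1)
        score += w * (g * r).sum().item()    # cosine similarity per reference
    return score
```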

Multi-Modality Ablation

  • Enhanced Context: Fusing multi-modal data (e.g., image, audio, weather) enriches the textual prompt, leading to more contextually informed and creative outputs (a fusion sketch follows this list).
  • Improved Style Consistency: The integration of diverse modalities boosts the robustness of style alignment across generated images.
  • Comparative Advantage: Multi-modal fusion outperforms single-modal baselines in both quantitative metrics and visual quality.
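A minimal sketch of the fusion step described in the Architecture section: each modality is first verbalized, then a T5 model rephrases the concatenation into a single "Multi-Content Textual Prompt". The checkpoint, instruction format, and example inputs below are placeholders, not the paper's exact configuration.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Per-modality descriptions (hypothetical upstream captioners produce these).
modal_texts = {
    "image": "a medieval castle on a rocky cliff",
    "audio": "slow, melancholic orchestral strings",
    "weather": "dense fog and light drizzle at dusk",
}

# Any instruction-tuned T5 works for this sketch; the checkpoint is an assumption.
tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Fuse the modalities into one coherent prompt via T5 rephrasing.
notes = "; ".join(f"{k}: {v}" for k, v in modal_texts.items())
inputs = tok(
    f"Rewrite the following notes as one coherent image description: {notes}",
    return_tensors="pt",
)
ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tok.decode(ids[0], skip_special_tokens=True))
```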

Guidance Ablation

  • Low Guidance: Yields images closely aligned to the textual prompt but with less pronounced stylistic details.
  • High Guidance: Emphasizes style characteristics more aggressively, sometimes at the expense of prompt adherence.
  • Optimal Range: A balanced range (15–20) achieves the best trade-off between style fidelity and semantic alignment (illustrated in the snippet below).
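For orientation, this is how that guidance range maps onto a standard diffusers call. The SDXL checkpoint and sampler settings here are illustrative assumptions; the repository's own pipeline wiring may differ.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# A guidance scale in the 15-20 range gave the best trade-off in the ablation;
# lower values track the prompt more literally, higher values push style harder.
image = pipe(
    "a medieval castle painted in a cubist style",
    guidance_scale=17.5,
    num_inference_steps=50,
).images[0]
image.save("castle_cubist.png")
```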

Scaling Ablation

  • Problem of Style Dominance: Styles with higher latent activation norms (e.g., “famous” styles like Cubism) tend to dominate blending results by skewing attention scores in their favor.
  • Rescaling Strategy: We identify dominant styles by checking the norm of their key vectors and apply a normalization that dampens their attention contribution while slightly boosting “normal” styles (sketched below).
  • Improved Balance: Experiments show that our attention rescaling technique significantly reduces the style imbalance, enabling more faithful and balanced multi-style blending.
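The sketch below illustrates one way to read the rescaling described above: pull each style's attention-key norms toward the mean before computing attention. The interpolation rule and the gamma parameter are our assumptions for illustration, not the paper's exact scheme.

```python
import torch

def rescale_style_keys(keys: torch.Tensor, gamma: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Dampen dominant styles by equalizing attention-key norms.

    keys: (num_styles, num_tokens, dim) key vectors, one block per reference
    style. Hypothetical rule: interpolate each key's norm toward the mean
    norm; gamma=0 keeps the original attention, gamma=1 fully equalizes.
    """
    norms = keys.norm(dim=-1, keepdim=True)            # per-token key norms
    mean_norm = norms.mean(dim=(0, 1), keepdim=True)   # average across styles
    target = (1.0 - gamma) * norms + gamma * mean_norm
    # Large-norm ("famous") styles shrink toward the mean; weak styles grow.
    return keys * (target / (norms + eps))

# Example: Cubism-like keys with ~2x the norm of Medieval-like keys get damped.
keys = torch.cat([torch.randn(1, 77, 64), 2.0 * torch.randn(1, 77, 64)])
balanced = rescale_style_keys(keys, gamma=0.7)
```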
[Figure: attention rescaling effect, scaling vs. non-scaling]

BibTeX


@misc{borgi2025zsaslmzeroshotstylealignedsli,
  title={Z-SASLM: Zero-Shot Style-Aligned SLI Blending Latent Manipulation},
  author={Alessio Borgi and Luca Maiano and Irene Amerini},
  year={2025},
  eprint={2503.23234},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.23234},
}