Z-SAMB: Zero-Shot Multi-Reference Multi-Modal Style Alignment
Z-SAMB (Zero-Shot Style-Aligned Multi-Reference Multi-Modal Blending) extends the StyleAligned paradigm to handle multiple reference styles simultaneously while incorporating multi-modal context from text, images, and audio — all without any fine-tuning.
Motivation
Single-reference style transfer is limited: real creative tasks often blend styles (impressionist palette + cubist geometry) or draw from context richer than a single image. Z-SAMB addresses both limitations by:
- Multi-reference blending: interpolate across multiple style references in the attention space, with configurable weights per reference.
- Multi-modal context awareness: leverage BLIP (vision-language), CLAP (audio-language), and Whisper (audio transcription) to condition style selection on multi-modal prompts, for example generating an image in the style suggested by a piece of music or a textual mood description.
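To make the multi-reference blending idea concrete, here is a minimal sketch in plain Python. It is one possible reading of "interpolate in the attention space", not necessarily how Z-SAMB implements it: the target query attends to each reference's keys and values separately, and the per-reference outputs are mixed with normalised weights. The names `normalize_weights`, `attend`, and `blend_references`, and the dict layout with `keys` and `values`, are hypothetical and chosen for illustration only. A real pipeline would run this on batched tensors inside the diffusion model's attention layers.

```python
import math


def normalize_weights(weights):
    """Scale non-negative per-reference weights so they sum to 1."""
    if not weights or any(w < 0 for w in weights):
        raise ValueError("weights must be a non-empty list of non-negative numbers")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    denom = sum(exps)
    dim = len(values[0])
    return [sum(e / denom * v[d] for e, v in zip(exps, values)) for d in range(dim)]


def blend_references(query, references, weights):
    """Attend to each reference separately, then mix the outputs by weight.

    Assumption: blending happens on attention outputs; other designs
    (e.g. concatenating all reference keys/values) are equally plausible.
    """
    w = normalize_weights(weights)
    outputs = [attend(query, ref["keys"], ref["values"]) for ref in references]
    dim = len(outputs[0])
    return [sum(wi * out[d] for wi, out in zip(w, outputs)) for d in range(dim)]


# Toy usage: two references with one token each, weighted 3:1.
impressionist = {"keys": [[1.0, 0.0]], "values": [[1.0, 0.0]]}
cubist = {"keys": [[0.0, 1.0]], "values": [[0.0, 1.0]]}
blended = blend_references([1.0, 0.0], [impressionist, cubist], [3.0, 1.0])
```

Normalising the weights keeps the blended output on the same scale as a single-reference output, so changing the weights shifts the style mix without changing the overall magnitude.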
Key Components
- AdaIN (Adaptive Instance Normalisation): fast style statistics transfer for coarse-grained style blending.
- Shared attention (StyleAligned): fine-grained texture and pattern propagation from reference to target.
- BLIP captions: automatically describe style references for text-guided selection.
- CLAP + Whisper: audio input; derive style keywords from music or speech to guide generation.
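The AdaIN component can be illustrated with a small sketch. AdaIN normalises a content feature channel to zero mean and unit standard deviation, then rescales it with the style's standard deviation and shifts it by the style's mean. The multi-reference part below, which takes a weighted average of the per-reference means and standard deviations, is an assumption about how coarse-grained blending could work, not a confirmed detail of Z-SAMB. The function names are hypothetical, and each "channel" is a flat list of floats rather than a real feature map.

```python
from statistics import fmean, pstdev


def channel_stats(channel):
    """Return the (mean, population standard deviation) of one feature channel."""
    return fmean(channel), pstdev(channel)


def blended_style_stats(style_channels, weights):
    """Weighted average of per-reference channel statistics.

    Assumption: references are blended by averaging their means and
    standard deviations; weights must have a positive sum.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    stats = [channel_stats(c) for c in style_channels]
    mean = sum(w * m for w, (m, _) in zip(weights, stats)) / total
    std = sum(w * s for w, (_, s) in zip(weights, stats)) / total
    return mean, std


def adain_channel(content, style_mean, style_std, eps=1e-5):
    """AdaIN: style_std * (x - mean(x)) / std(x) + style_mean.

    eps is added to the content standard deviation to avoid division by zero.
    """
    c_mean, c_std = channel_stats(content)
    return [style_std * (x - c_mean) / (c_std + eps) + style_mean for x in content]


# Toy usage: one content channel, two style references weighted equally.
content = [0.0, 2.0, 4.0]
style_a = [10.0, 10.0, 10.0]  # mean 10, std 0
style_b = [0.0, 4.0]          # mean 2, std 2
mean, std = blended_style_stats([style_a, style_b], [1.0, 1.0])
stylised = adain_channel(content, mean, std)
```

After the transfer, the stylised channel carries the blended mean and (up to the small `eps` term) the blended standard deviation, while the relative ordering of the content values is preserved.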
Technology
Python, Jupyter Notebooks, Hugging Face Diffusers, BLIP, CLAP, Whisper, Stable Diffusion XL.
