StyleAligned: Zero-Shot Style Alignment in Text-to-Image Generation
This project implements and extends the StyleAligned framework for zero-shot style transfer in diffusion-based text-to-image generation. By sharing a small subset of attention keys and values between a reference image and the target generation, the model transfers artistic style without fine-tuning, LoRA, or any other additional training.
Core Idea
Standard text-to-image models generate each image independently. StyleAligned conditions the denoising process on a reference image by sharing self-attention keys and values across the batch during inference. This minimal coupling is enough to transfer colour palette, brush style, and artistic texture, while leaving semantic content free to follow the text prompt.
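The coupling described above can be illustrated with a small, dependency-free sketch. It is not the project's implementation: plain Python lists stand in for tensors, `shared_attention` is a hypothetical helper name, the numbers are toy values, and any normalisation of queries and keys is left out. The only point is that the target's queries see the reference's keys and values in addition to their own.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; every argument is a list of token vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim_out = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim_out)]
        )
    return outputs


def shared_attention(target_q, target_k, target_v, ref_k, ref_v):
    """Hypothetical helper: target queries attend to reference tokens too."""
    return attention(target_q, ref_k + target_k, ref_v + target_v)


# Toy values (not real model activations): one target token, one reference token.
target_q = [[10.0, 0.0]]
target_k = [[0.0, 10.0]]
target_v = [[0.0]]
ref_k = [[10.0, 0.0]]
ref_v = [[1.0]]

independent = attention(target_q, target_k, target_v)
coupled = shared_attention(target_q, target_k, target_v, ref_k, ref_v)
```

In this toy case the target query matches the reference key far better than its own, so `coupled` is pulled towards the reference value while `independent` is not. In a real pipeline the same concatenation would happen inside each self-attention layer at every denoising step.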
Extensions
Beyond the baseline StyleAligned paper, this project explores:
- ControlNet integration: use depth or Canny edge maps as structural guidance while preserving reference style.
- CLIP-guided style selection: automatically select the most stylistically consistent reference from a candidate pool using CLIP embeddings.
- Multi-reference blending: average attention features across multiple references for mixed-style outputs.
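The last two extensions can be sketched without any model code, assuming the CLIP image embeddings and the per-reference attention features have already been computed. Everything here is illustrative: `select_reference` and `blend_features` are hypothetical names, the vectors are toy values, and "most stylistically consistent" is interpreted as "highest mean cosine similarity to the rest of the pool", which is one plausible reading rather than the project's confirmed criterion.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_reference(embeddings):
    """Index of the candidate most similar, on average, to the other candidates.

    Assumes at least two candidates, each a precomputed CLIP image embedding.
    """
    def mean_similarity(i):
        others = [e for j, e in enumerate(embeddings) if j != i]
        return sum(cosine(embeddings[i], e) for e in others) / len(others)

    return max(range(len(embeddings)), key=mean_similarity)


def blend_features(features, weights=None):
    """Weighted average of same-length feature vectors, one per reference."""
    if weights is None:
        weights = [1.0 / len(features)] * len(features)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Toy embeddings standing in for CLIP outputs (real ones have hundreds of dimensions).
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
best = select_reference(pool)

# Toy attention features from two references, averaged with equal weight.
blended = blend_features([[1.0, 0.0], [0.0, 1.0]])
```

Passing non-uniform `weights` would bias the blend towards one reference; whether blending keys, values, or both gives the best mixed style is a design choice this sketch does not settle.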
Results
Qualitative evaluations show consistent style transfer across diverse prompts (portraits, landscapes, abstract scenes) with the same reference image. CLIP-style-distance metrics confirm closer alignment to reference style than naïve prompt engineering.
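The exact definition of the CLIP-style-distance metric is not given here. One common formulation, used below purely as an assumption, is one minus the cosine similarity between the CLIP image embedding of a generated image and that of the reference, averaged over all outputs. The function names and vectors are hypothetical and the numbers are placeholders, not measured results.

```python
import math


def style_distance(generated, reference):
    """Assumed metric: 1 - cosine similarity between two embedding vectors."""
    dot = sum(g * r for g, r in zip(generated, reference))
    norm_g = math.sqrt(sum(g * g for g in generated))
    norm_r = math.sqrt(sum(r * r for r in reference))
    return 1.0 - dot / (norm_g * norm_r)


def mean_style_distance(generated_batch, reference):
    """Average distance of a batch of generated-image embeddings to the reference."""
    distances = [style_distance(g, reference) for g in generated_batch]
    return sum(distances) / len(distances)


# Placeholder embeddings, not real CLIP outputs.
reference_embedding = [1.0, 0.0]
batch_a = [[1.0, 0.0], [0.0, 1.0]]  # one aligned, one orthogonal
score_a = mean_style_distance(batch_a, reference_embedding)
```

A lower mean distance indicates outputs whose embeddings sit closer to the reference; comparing the score of style-aligned outputs with that of a prompt-only baseline on the same prompts gives the kind of comparison described above.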
Technology
Python, Jupyter Notebooks, Hugging Face Diffusers, CLIP, ControlNet, Stable Diffusion.
