Learning Flow Fields in Attention for Controllable Person Image Generation

Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to corresponding regions in the reference image. To address this, we propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.
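The abstract does not spell out the exact form of the attention-map regularization, so the following is only a minimal sketch of one plausible instantiation: the attention map from target queries to reference keys is collapsed into a flow field over reference positions and regularized toward a correspondence target. The function name `leffa_style_attention_loss` and the inputs `ref_coords` and `gt_flow` are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def leffa_style_attention_loss(attn, ref_coords, gt_flow):
    """Hypothetical sketch of a flow-field regularization on an attention map.

    attn:       (B, N_tgt, N_ref) softmax attention from target queries
                to reference keys.
    ref_coords: (N_ref, 2) normalized (x, y) positions of reference tokens.
    gt_flow:    (B, N_tgt, 2) assumed correspondence target giving, for each
                target position, the matching location in the reference image.
    """
    # Expected reference location attended to by each target query,
    # i.e. a dense "flow field" induced by the attention map.
    pred_flow = attn @ ref_coords            # (B, N_tgt, 2)
    # Penalize attention that points away from the correct reference region.
    return F.l1_loss(pred_flow, gt_flow)
```

Because the loss only reads the attention map, a term like this can be added on top of an existing diffusion training objective without changing the model architecture, which is consistent with the abstract's claim that the loss is model-agnostic.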