HyperAIHyperAI

Command Palette

Search for a command to run...

CARE-Edit: 조건 인식형 전문가 라우팅을 활용한 문맥 기반 이미지 편집

Yucheng Wang Zedong Wang Yuetong Wu Yue Ma Dan Xu

초록

통합 확산 편집기들은 종종 다양한 작업을 위해 고정된 공유 백본에 의존하며, 이로 인해 작업 간 간섭이 발생하고 이질적인 요구사항(예: 지역적 대 전역적, 의미적 대 광도적)에 대한 적응력이 저하되는 문제를 겪습니다. 특히, 널리 사용되는 ControlNet 및 OmniControl 변형들은 텍스트, 마스크, 참조 이미지와 같은 다중 조건 신호를 정적 연결 또는 가산 어댑터를 통해 결합하는데, 이는 상충되는 모달리티를 동적으로 우선순위화하거나 억제하지 못합니다. 그 결과 마스크 경계에서의 색상 번짐, 정체성 또는 스타일의 변이, 그리고 다중 조건 입력 하에서의 예측 불가능한 동작과 같은 아티팩트가 발생합니다. 이를 해결하기 위해 우리는 모델 연산을 특정 편집 역량과 정렬시키는 '조건 인식 라우팅 전문가(Condition-Aware Routing of Experts, CARE-Edit)'를 제안합니다. CARE-Edit의 핵심은 경량 잠재-어텐션 라우터가 다중 모달 조건과 확산 타임스텝을 기반으로 인코딩된 확산 토큰을 텍스트, 마스크, 참조, 베이스라는 네 가지 전문화된 전문가에게 할당하는 것입니다. 구체적으로, (i) 마스크 리페인팅 모듈이 먼저 사용자 정의된 대략적인 마스크를 정제하여 정밀한 공간적 지침을 제공하고, (ii) 라우터는 희소성 기반 Top-K 선택을 적용하여 가장 관련성 높은 전문가에게 연산 자원을 동적으로 할당하며, (iii) 잠재 혼합 모듈이 이후 전문가들의 출력을 융합하여 의미적, 공간적, 스타일적 정보를 베이스 이미지에 일관되게 통합합니다. 실험을 통해 CARE-Edit이 지우기, 교체, 텍스트 기반 편집, 스타일 전이 등 문맥적 편집 작업에서 탁월한 성능을 보임을 입증하였습니다. 또한 실증 분석을 통해 각 전문화된 전문가가 작업 고유의 동작 특성을 나타냄을 확인하였으며, 이는 다중 조건 간 충돌을 완화하기 위해 동적이고 조건 인식적인 처리가 필수적임을 시사합니다.

One-sentence Summary

Researchers from The Hong Kong University of Science and Technology propose CARE-Edit, a unified diffusion editor that replaces static adapters with a dynamic latent-attention router. This innovation allocates tokens to specialized experts based on multi-modal conditions, effectively eliminating artifacts like color bleeding in complex tasks such as subject replacement and style transfer.

Key Contributions

  • Existing unified diffusion editors suffer from task interference and artifacts like color bleeding because they rely on static fusion of multi-modal signals that cannot dynamically prioritize conflicting conditions.
  • We propose CARE-Edit, a framework that employs a lightweight latent-attention router to dynamically dispatch diffusion tokens to four specialized experts for text, mask, reference, and base processing based on input conditions and timesteps.
  • Experiments on diverse tasks including object erasure, replacement, and style transfer demonstrate that CARE-Edit achieves superior edit faithfulness and boundary cleanliness compared to static-fusion baselines by effectively resolving multi-condition conflicts.

Introduction

Diffusion-based models have revolutionized image editing by enabling tasks like object replacement and style transfer, yet current unified editors struggle when handling multiple, conflicting input signals such as text prompts, masks, and reference images. Existing approaches typically rely on static fusion mechanisms that force all conditions through a shared backbone, which often leads to artifacts like color bleeding, identity drift, or inconsistent behavior because the model cannot dynamically allocate its capacity to prioritize specific signals at different stages of the generation process. To address these limitations, the authors introduce CARE-Edit, a framework that employs condition-aware routing to dynamically dispatch diffusion tokens to specialized heterogeneous experts tailored for text, spatial masks, reference features, and global coherence. This approach utilizes a lightweight router to adaptively select the most relevant experts based on the input conditions and diffusion timestep, while complementary modules like Mask Repaint and Latent Mixture further refine spatial precision and resolve conflicts between competing signals.

Dataset

  • Dataset Composition and Sources The authors construct a training corpus of approximately 120K triplets by aggregating data from four primary sources: MagicBrush and OmniEdit for instruction-based edits, UNO for object removal and replacement tasks, and AnyEdit for style transfer. To address spatial ambiguity in purely instruction-based data, they curate a specialized 20K subset from Subjects200K that focuses on diverse objects and humans with precise identity preservation.

  • Key Details for Each Subset

    • Instruction-based: Sourced from MagicBrush and OmniEdit, these samples pair real-world images with rich natural language instructions.
    • Removal and Replacement: Derived from a subset of UNO to target specific object manipulation tasks.
    • Style Transfer: Enriched using AnyEdit to include fine-grained appearance and style-level instructions.
    • Subject-Centric (20K): Built from Subjects200K, this subset provides high-quality foreground masks and reference images on clean white backgrounds to facilitate region-specific editing.
  • Model Usage and Training Strategy The model utilizes a curated mixture of these datasets to teach diverse editing capabilities while maintaining data efficiency. The training pipeline emphasizes a mask-aware, subject-centric curriculum that allows the model to outperform larger baselines despite using significantly fewer training samples. The data is structured to support condition-aware expert routing, where coarse bounding boxes guide the routing mechanism and fine masks provide pixel-accurate supervision.

  • Processing and Metadata Construction

    • Mask-Aware Generation: The authors employ a GPT-Image-1 and VLM-based pipeline to synthesize image pairs with consistent backgrounds but varying foregrounds. This process starts with a reference subject and generates diverse scene descriptions to create high-quality pairs annotated with precise segmentation masks.
    • Coarse and Fine Masks: For each sample, a high-resolution fine mask is extracted using an off-the-shelf segmentation model and manually filtered. A coarse axis-aligned bounding box is then derived from the fine mask to serve as a spatial prior for expert routing.
    • Prompt Taxonomy: The generation process is organized along two axes: category and operation type (e.g., replacement, addition, style change) and scene-level templates. This ensures that background layout, lighting, and camera viewpoints remain similar across paired images while foreground regions change, providing clean supervision for localized editing.

Method

The authors propose CARE-Edit, a diffusion-based editor designed to mitigate task interference in unified image editing. Unlike static fusion approaches that process all conditions with a shared backbone, CARE-Edit performs fine-grained condition-aware routing over a set of heterogeneous experts. This specialize-then-fuse manner allows the model to dynamically allocate computation, prioritize relevant modalities, and mitigate conflicts between competing edit instructions.

The framework unifies diverse editing paradigms, including instruction-based editing (text and base image) and subject-based editing (base and reference image), into a single system capable of handling text, base image, reference image, and mask inputs simultaneously.

Each input modality is first mapped to a latent token sequence by a specialized, frozen encoder. The text prompt is processed by a Text Encoder Etex()\mathcal{E}_{\text{tex}}(\cdot)Etex(), producing contextual embeddings Cp\mathbf{C}_pCp. The Image Encoder Eimage()\mathcal{E}_{\text{image}}(\cdot)Eimage() extracts latent representations Zb\mathbf{Z}_bZb for the base image and Zr\mathbf{Z}_rZr for the reference image. A Mask Encoder Emask()\mathcal{E}_{\text{mask}}(\cdot)Emask() converts the spatial mask into aligned latent tokens Zm\mathbf{Z}_mZm. These latents are projected to a shared embedding space and concatenated to form a unified token sequence h0\mathbf{h}_0h0. This sequence is propagated through a frozen diffusion transformer (DiT) backbone, where LoRA-style fine-tuning is applied to the self-attention and projection layers to adapt the model to multi-modal conditioning without fully retraining the backbone.

The core innovation lies in the Projection Layer with Experts, which introduces four heterogeneous experts corresponding to the text, mask, reference, and base modalities.

The text expert performs semantic reasoning and object synthesis through cross-attention with text tokens. The mask expert focuses on spatial precision and boundary refinement guided by the edit mask. The reference expert learns identity- and style-consistent transformations from reference features, while the base expert enforces global coherence and background consistency. To determine which modality experts should process each token, CARE-Edit performs token-wise Top-K routing. For every token, a router computes a probability distribution over the four experts by combining local content features and global task context. In practice, KKK is set to 3, achieving a favorable balance between representational diversity and computational efficiency. This scheme allows each token to adaptively attend to the most relevant experts under spatial-semantic-task joint guidance.

To address the issue where user-defined masks may misalign with object boundaries, the authors incorporate a Mask Repaint module. This module refines the coarse user-provided masks at each diffusion step by exploiting geometric correspondence between the current latent and reference features. It predicts a soft, boundary-aware mask that adapts to object contours, promoting smooth transitions between edited and preserved regions. The refined mask is fed back into the routing process of the next diffusion block, modulating the mask and base experts to ensure progressively sharper boundary control.

Finally, the output of the specialized experts must be coherently aggregated. The authors employ a Latent Mixture module that performs per-token and per-timestep processing based on routing confidence and contextual cues. The fused latent is obtained through a convex combination of expert outputs, where each channel integrates text, semantic, and mask cues according to the router's attention pattern. To maintain global coherence, this fused latent is blended with the base expert's output via a learned, timestep-dependent gate. The model is trained end-to-end by combining the standard diffusion reconstruction loss with auxiliary regularizers for load balancing, mask boundary consistency, and latent mixture smoothness.

Experiment

  • Instruction-based editing experiments on EMU-Edit and MagicBrush validate that CARE-Edit produces cleaner, more instruction-faithful results with sharper boundaries and fewer artifacts compared to both task-specific and unified baselines.
  • Subject-driven contextual editing tests on DreamBench++ confirm the model's ability to preserve subject identity and structure while effectively integrating objects into complex multi-object scenes.
  • Ablation studies demonstrate that dynamic expert routing is essential for handling diverse editing behaviors, while specific components like Latent Mixture and Mask Repaint are critical for aggregating outputs and achieving precise edits.
  • Empirical analysis reveals that the model successfully learns to disentangle editing tasks, with a Base Expert maintaining global coherence, a Mask Expert focusing on geometric restructuring, and a Reference Expert handling semantic and stylistic injection.
  • Extended qualitative comparisons show robust performance across object removal, addition, replacement, and style transfer, highlighting the model's capacity to maintain structural integrity while applying complex semantic changes.

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
CARE-Edit: 조건 인식형 전문가 라우팅을 활용한 문맥 기반 이미지 편집 | 문서 | HyperAI초신경