When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Zhengyang Sun Yu Chen Xin Zhou Xiaofan Li Xiwu Chen Dingkang Liang Xiang Bai

Abstract

Text-to-video diffusion models have enabled open-ended video synthesis, but they often struggle to generate the exact number of objects specified in a prompt. This paper introduces NUMINA, a training-free identify-then-guide framework for improving numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then conservatively refines this layout and modulates cross-attention to guide regeneration. On the newly introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on the Wan2.1-1.3B model, and by 4.9% and 5.5% on the 5B and 14B models, respectively. It also improves CLIP alignment while maintaining temporal consistency. These results show that structural guidance can complement seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.

One-sentence Summary

To improve numerical alignment in text-to-video diffusion models, the authors propose NUMINA, a training-free identify-then-guide framework that derives countable latent layouts from discriminative attention heads and modulates cross-attention for guided regeneration, increasing counting accuracy on CountBench by up to 7.4% for Wan2.1-1.3B and by 4.9% and 5.5% for the 5B and 14B models, respectively, while improving CLIP alignment and maintaining temporal consistency.

Key Contributions

  • The paper introduces NUMINA, a training-free identify-then-guide framework designed to improve numerical alignment in text-to-video diffusion models.
  • The method selects discriminative self- and cross-attention heads to derive a countable latent layout and identify prompt-layout inconsistencies; the layout is then refined and used to modulate cross-attention during regeneration.
  • Experiments on the new CountBench benchmark show that the framework improves counting accuracy by up to 7.4% on the Wan2.1-1.3B model, and by 4.9% and 5.5% on the larger 5B and 14B models, respectively, while enhancing CLIP alignment and maintaining temporal consistency.

Introduction

Text-to-video (T2V) diffusion models are essential for high-quality video synthesis in entertainment and education, but they frequently fail to generate the exact number of objects specified in a text prompt. Current models struggle with this numerical alignment because numeral tokens are weakly grounded semantically and instances are hard to separate within compressed spatiotemporal latent spaces. Retraining could potentially address these issues, but the computational cost and the need for massive, precisely annotated datasets make it impractical. The authors instead propose NUMINA, a training-free framework that follows an identify-then-guide paradigm to correct these inconsistencies during the denoising process. By selecting discriminative attention heads to derive a countable latent layout and using that layout to guide regeneration, NUMINA improves counting accuracy across various model scales while maintaining temporal coherence and visual fidelity.

Method

The authors present NUMINA, a training-free framework for numerically aligned video generation that operates through a two-phase pipeline, following an identify-then-guide paradigm. As shown in the figure below, the overall framework begins with a text prompt containing numerals and a sampled noise vector, which are used to generate an initial video. The first phase, numerical misalignment identification, analyzes the attention mechanisms of the DiT model to extract an explicit layout signal that reflects the countable structure of the scene. This layout is then used in the second phase, numerically aligned video generation, to guide the re-synthesis process and correct count discrepancies.

The NUMINA framework consists of two phases: numerical misalignment identification and numerically aligned video generation. It uses attention maps to identify instances and then refines the layout to match the prompt's numerals before guiding the generation process.

In the first phase, the method identifies count discrepancies by analyzing the DiT's attention mechanisms. This involves selecting the most instance-discriminative self-attention head and the most text-concentrated cross-attention head, and then fusing their maps to obtain an instance-level layout that is explicitly countable. The self-attention maps are processed to measure instance separability using three complementary scores: foreground-background separation, structural richness, and edge clarity. These scores are combined into a discriminability score, and the head with the highest score is selected to provide a layout with the highest instance separability. For each target noun token in the prompt, the cross-attention map is analyzed to identify the head with the highest peak activation, which indicates the model's alignment with a specific visual region. These selected self- and cross-attention maps are then fused to construct a countable foreground layout for each target noun.
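
The head-selection step described above can be illustrated with a minimal sketch. The text does not define the three scores, so the proxies used here (mean gap around a threshold, variance, and mean neighbour difference), the equal weighting, and all function names are assumptions made for illustration only. The sketch also assumes each head has already been reduced to a small 2D spatial map.

```python
from statistics import mean, pvariance

Map = list[list[float]]  # one head's attention, reduced to a 2D spatial map (assumption)


def fg_bg_separation(m: Map) -> float:
    """Proxy score: gap between mean above-average and mean below-average activation."""
    flat = [v for row in m for v in row]
    thr = mean(flat)
    fg = [v for v in flat if v > thr]
    bg = [v for v in flat if v <= thr]
    return mean(fg) - mean(bg) if fg and bg else 0.0


def structural_richness(m: Map) -> float:
    """Proxy score: variance of the activations."""
    return pvariance([v for row in m for v in row])


def edge_clarity(m: Map) -> float:
    """Proxy score: mean absolute difference between adjacent cells."""
    diffs = [abs(row[j + 1] - row[j]) for row in m for j in range(len(row) - 1)]
    diffs += [abs(m[i + 1][j] - m[i][j])
              for i in range(len(m) - 1) for j in range(len(m[0]))]
    return mean(diffs) if diffs else 0.0


def discriminability(m: Map, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three proxy scores (equal weights are an assumption)."""
    scores = (fg_bg_separation(m), structural_richness(m), edge_clarity(m))
    return sum(w * s for w, s in zip(weights, scores))


def select_self_head(self_maps: list[Map]) -> int:
    """Index of the self-attention head with the highest discriminability score."""
    return max(range(len(self_maps)), key=lambda h: discriminability(self_maps[h]))


def select_cross_head(cross_maps: list[Map]) -> int:
    """Index of the cross-attention head with the highest peak activation."""
    return max(range(len(cross_maps)),
               key=lambda h: max(max(row) for row in cross_maps[h]))
```

With these proxies, a uniform map scores zero on all three terms, so any head showing a distinct blob is preferred over it. `select_cross_head` would be called once per target noun token, on that token's per-head maps.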

The countable layout is constructed by generating spatial proposals from the self-attention map through clustering, and processing the cross-attention map by suppressing low activations and applying density-based clustering to form a focus mask. The proposals are filtered based on their semantic overlap with the focus mask, and regions with sufficient overlap are retained as valid instances. The final layout is a 2D semantic map where each pixel belonging to a valid region is assigned the corresponding class label, resulting in a map containing disjoint foreground regions that ideally correspond to individual object instances.
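
As a rough illustration of this layout construction, the sketch below thresholds the self-attention map and takes 4-connected components as spatial proposals, then keeps the proposals that sufficiently overlap a thresholded cross-attention focus mask. Connected components stand in for the clustering step, and the density-based clustering of the focus mask is reduced to a plain threshold. Both simplifications, along with the threshold values and function names, are assumptions and not the authors' settings.

```python
from collections import deque

Grid = list[list[float]]
Cell = tuple[int, int]


def connected_regions(mask: list[list[bool]]) -> list[set[Cell]]:
    """Return the 4-connected components of a boolean mask."""
    h, w = len(mask), len(mask[0])
    seen: set[Cell] = set()
    regions: list[set[Cell]] = []
    for i in range(h):
        for j in range(w):
            if not mask[i][j] or (i, j) in seen:
                continue
            region: set[Cell] = set()
            queue = deque([(i, j)])
            seen.add((i, j))
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                region.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def build_layout(self_map: Grid, cross_map: Grid, label: int,
                 self_thr: float = 0.5, cross_thr: float = 0.5,
                 min_overlap: float = 0.5) -> tuple[list[list[int]], int]:
    """Return (semantic map, instance count) for one target noun.

    All thresholds are hypothetical defaults chosen for the example.
    """
    h, w = len(self_map), len(self_map[0])
    proposals = connected_regions([[v > self_thr for v in row] for row in self_map])
    # Focus mask: cells whose cross-attention survives low-activation suppression.
    focus = {(i, j) for i in range(h) for j in range(w) if cross_map[i][j] > cross_thr}
    layout = [[0] * w for _ in range(h)]  # 0 marks background
    kept = 0
    for region in proposals:
        if len(region & focus) / len(region) >= min_overlap:
            kept += 1
            for y, x in region:
                layout[y][x] = label
    return layout, kept
```

The returned count is the number of disjoint regions that survive filtering, which is the quantity that would be compared against the numeral in the prompt.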

In the second phase, the identified layout is used to correct count errors during generation. This is achieved through a conservative two-step approach: layout refinement and layout-guided generation. The layout refinement process adjusts the per-frame layout map to match the target count parsed from the prompt. For object removal, the smallest region of the target category is erased to minimize visual impact. For object addition, a new instance is inserted using a layout template. If existing instances are present, the smallest existing region is copied as the template; otherwise, a circle is used. The template is placed at an optimal location by minimizing a heuristic cost that balances overlap with the existing layout, proximity to the spatial center, and temporal stability across frames. The resulting refined layout preserves the original spatial organization while correcting count errors.

Finally, the refined layout guides the regeneration process through a training-free modulation of the cross-attention. This is achieved by modifying either the pre-softmax attention scores or the bias term, scaled by a monotonically decreasing intensity function that applies stronger guidance early in the denoising process. For object removal, attention suppression is performed by setting the bias term to a large negative constant in regions corresponding to the category token, effectively suppressing unwanted instance generation. For object addition, attention is boosted in the new area. If the instance is templated from a manual circle, the bias term is set to a scalar coefficient. If templated from an existing reference region, the pre-softmax scores are overwritten with the mean score from the reference region, transferring the pre-trained attention properties to the new location. This process ensures stable control superposition and preserves overall visual fidelity.
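
The modulation described above can be sketched on a toy score matrix with one row per spatial query and one column per text token. The linear decay schedule, the constants `NEG_BIAS` and `boost`, and the reading of "scaled by the intensity function" as a multiplier on the bias (and as an interpolation weight for the overwrite case) are assumptions made for illustration; the text only specifies that the intensity decreases monotonically.

```python
import math

NEG_BIAS = -1e4  # hypothetical stand-in for the "large negative constant"


def intensity(step: int, total_steps: int) -> float:
    """Linearly decaying guidance strength: 1 at the first step, 0 at the last."""
    return 1.0 - step / max(total_steps - 1, 1)


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one query's token scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulate(scores: list[list[float]], token: int, step: int, total_steps: int,
             remove=(), add_circle=(), add_copy=(), reference=(),
             boost: float = 5.0) -> list[list[float]]:
    """Return attention weights after layout-guided modulation of one token column.

    remove / add_circle / add_copy / reference are iterables of query indices.
    """
    lam = intensity(step, total_steps)
    out = [list(row) for row in scores]
    for q in remove:      # suppression: push the category token toward zero weight
        out[q][token] += lam * NEG_BIAS
    for q in add_circle:  # circle template: add a scalar bias
        out[q][token] += lam * boost
    if add_copy and reference:
        # Copied template: move scores toward the reference region's mean score.
        ref_mean = sum(scores[q][token] for q in reference) / len(reference)
        for q in add_copy:
            out[q][token] = (1 - lam) * out[q][token] + lam * ref_mean
    return [softmax(row) for row in out]
```

At the first denoising step the guidance applies at full strength, and by the final step the scores pass through unchanged, which matches the idea of steering the coarse layout early and leaving fine detail to the model.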

Experiment

The researchers evaluated NUMINA using CountBench, a new benchmark designed to test numerical fidelity in complex text-to-video scenarios, across various model scales and architectures including Wan and CogVideoX. The experiments demonstrate that NUMINA significantly improves counting accuracy and semantic alignment while maintaining temporal consistency and high visual quality. The results show that the method is highly scalable, effective in high-count scenarios, and provides a more efficient and reliable alternative to traditional trial-and-error strategies like seed search.

The authors evaluate NUMINA against baseline models and existing strategies on text-to-video generation tasks. NUMINA consistently improves counting accuracy across model scales while maintaining or enhancing temporal consistency and semantic alignment, and it outperforms seed search and prompt enhancement, especially in complex scenarios with higher object counts.

  • NUMINA significantly improves counting accuracy compared to baseline models and existing strategies.
  • The method maintains or improves temporal consistency and semantic alignment across all tested models.
  • NUMINA enables smaller models to surpass larger baseline models in counting accuracy.

NUMINA improves counting accuracy

The authors ablate the two layout operations, object addition and object removal, comparing each against the baseline on counting accuracy and temporal consistency. Both operations improve counting accuracy, and combining them achieves the highest performance while also enhancing temporal consistency.

  • Object addition alone significantly improves counting accuracy over the baseline.
  • Combining addition and removal yields the highest counting accuracy and temporal consistency.
  • The method maintains or improves temporal consistency while enhancing numerical alignment.

NUMINA improves counting accuracy

The authors evaluate NUMINA on CogVideoX-5B, showing significant improvements in counting accuracy, temporal consistency, and CLIP score compared to baseline methods. The results demonstrate that NUMINA enhances numerical alignment while maintaining or improving generation quality.

  • NUMINA substantially improves counting accuracy over the baseline and enhancement strategies.
  • The method boosts temporal consistency and CLIP score, indicating better video quality and alignment.
  • NUMINA achieves higher performance with a single generation pass, avoiding the need for seed search or prompt enhancement.

NUMINA improves numerical accuracy

The authors introduce NUMINA, a training-free method that enhances numerical alignment in text-to-video generation. The results show that NUMINA significantly improves counting accuracy across various models while maintaining or improving temporal consistency and semantic quality.

  • NUMINA substantially boosts counting accuracy compared to baseline models and existing strategies.
  • The method improves temporal consistency and semantic alignment without degrading video quality.
  • Combining NUMINA with other enhancement techniques achieves the highest performance.

NUMINA improves counting accuracy

The authors analyze the impact of different hyperparameter values on counting accuracy. The tested settings yield similar performance, with only minor variations in accuracy, indicating that the method is stable across a range of configurations.

  • Different hyperparameter settings produce comparable counting accuracy.
  • Performance remains stable as hyperparameter values change, which suggests the method does not require careful tuning.

Ablation on hyperparameter settings

NUMINA is evaluated against baseline models and existing enhancement strategies to validate its effectiveness in improving numerical alignment during text-to-video generation. The results demonstrate that the method consistently enhances counting accuracy and temporal consistency across various model scales and complex scenarios. Furthermore, the approach proves to be highly robust and stable, maintaining high performance even with variations in hyperparameter configurations.

