
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models


Zhengyang Sun Yu Chen Xin Zhou Xiaofan Li Xiwu Chen Dingkang Liang Xiang Bai

Abstract

Text-to-video diffusion models enable open-ended video synthesis, but they often struggle to generate the exact number of objects specified in a prompt. This work proposes NUMINA, a training-free identify-then-guide framework for improving numerical alignment. NUMINA identifies prompt-layout mismatches by selecting highly discriminative self- and cross-attention heads to derive a countable latent layout. It then conservatively refines this layout and modulates cross-attention to guide regeneration. On the newly introduced CountBench benchmark, NUMINA improves counting accuracy by up to 7.4% with the Wan2.1-1.3B model, 4.9% with the 5B model, and 5.5% with the 14B model, while also improving CLIP alignment and preserving temporal consistency. These results demonstrate that structural guidance complements seed search and prompt expansion, offering a practical path toward count-accurate text-to-video diffusion models. Code is available at: https://github.com/H-EmbodVis/NUMINA

One-sentence Summary

To improve numerical alignment in text-to-video diffusion models, the authors propose NUMINA, a training-free identify-then-guide framework that derives countable latent layouts from discriminative attention heads and modulates cross-attention for guided regeneration, increasing counting accuracy on CountBench by up to 7.4% for Wan2.1-1.3B, 4.9% for the 5B model, and 5.5% for the 14B model while improving CLIP alignment and maintaining temporal consistency.

Key Contributions

  • The paper introduces NUMINA, a training-free identify-then-guide framework designed to improve numerical alignment in text-to-video diffusion models.
  • The method derives a countable latent layout by selecting discriminative self- and cross-attention heads to identify prompt-layout inconsistencies; the layout is then refined and used to modulate cross-attention during regeneration.
  • Experiments on the new CountBench dataset demonstrate that the framework improves counting accuracy by up to 7.4% on the Wan2.1-1.3B model, 4.9% on the 5B model, and 5.5% on the 14B model, while enhancing CLIP alignment and maintaining temporal consistency.

Introduction

Text-to-video (T2V) diffusion models are essential for high-quality video synthesis in entertainment and education, but they frequently fail to generate the exact number of objects specified in a text prompt. Current models struggle with this numerical alignment due to weak semantic grounding of numeral tokens and insufficient instance separability within compressed spatiotemporal latent spaces. While retraining models could potentially address these issues, the computational cost and the need for massive, precisely annotated datasets make it impractical. The authors instead propose a training-free framework called NUMINA that employs an identify-then-guide paradigm to correct these inconsistencies during the denoising process. By selecting discriminative attention heads to derive a countable latent layout and using that layout to guide regeneration, NUMINA improves counting accuracy across various model scales while maintaining temporal coherence and visual fidelity.

Method

The authors present NUMINA, a training-free framework for numerically aligned video generation that operates through a two-phase pipeline, following an identify-then-guide paradigm. As shown in the figure below, the overall framework begins with a text prompt containing numerals and a sampled noise vector, which are used to generate an initial video. The first phase, numerical misalignment identification, analyzes the attention mechanisms of the DiT model to extract an explicit layout signal that reflects the countable structure of the scene. This layout is then used in the second phase, numerically aligned video generation, to guide the re-synthesis process and correct count discrepancies.
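The two-phase control flow described above can be summarized in a short sketch. The function and argument names here are illustrative stand-ins, not the authors' actual API; the point is only the identify-then-guide branching.

```python
# Hypothetical sketch of NUMINA's identify-then-guide control flow.
# Every callable passed in is an illustrative stand-in for a pipeline stage.
def numina_pipeline(prompt, target_count, generate, derive_layout,
                    count_instances, refine, guided_generate):
    video, attn = generate(prompt)             # initial pass; cache attention maps
    layout = derive_layout(attn)               # fuse self-/cross-attention heads
    if count_instances(layout) == target_count:
        return video                           # counts already match the prompt
    layout = refine(layout, target_count)      # add or remove instances
    return guided_generate(prompt, layout)     # re-synthesize under layout guidance
```

When the initial generation already matches the numeral in the prompt, no second pass is needed, which keeps the average overhead low.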

The NUMINA framework consists of two phases: numerical misalignment identification and numerically aligned video generation. It uses attention maps to identify instances and then refines the layout to match the prompt's numerals before guiding the generation process.

In the first phase, the method identifies count discrepancies by analyzing the DiT's attention mechanisms. This involves selecting the most instance-discriminative self-attention head and the most text-concentrated cross-attention head, and then fusing their maps to obtain an instance-level layout that is explicitly countable. The self-attention maps are processed to measure instance separability using three complementary scores: foreground-background separation, structural richness, and edge clarity. These scores are combined into a discriminability score, and the head with the highest score is selected to provide a layout with the highest instance separability. For each target noun token in the prompt, the cross-attention map is analyzed to identify the head with the highest peak activation, which indicates the model's alignment with a specific visual region. These selected self- and cross-attention maps are then fused to construct a countable foreground layout for each target noun.
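The head-selection step can be sketched as follows. The exact formulas for the three cues are not given here, so the scores below are simple illustrative proxies (mean-contrast for separation, standard deviation for richness, gradient magnitude for edge clarity), not the authors' definitions.

```python
import numpy as np

def head_discriminability(attn_map, thresh=0.5):
    """Toy discriminability score for one self-attention map (H x W),
    combining the three cues named in the paper. The concrete formulas
    here are illustrative assumptions, not the authors' definitions."""
    rng = attn_map.max() - attn_map.min()
    m = (attn_map - attn_map.min()) / (rng + 1e-8)    # normalize to [0, 1]
    fg = m >= thresh                                   # crude foreground split
    if fg.sum() == 0 or (~fg).sum() == 0:
        return 0.0                                     # degenerate map
    separation = m[fg].mean() - m[~fg].mean()          # fg-bg separation
    richness = m.std()                                 # structural richness
    gy, gx = np.gradient(m)
    edge = np.hypot(gx, gy).mean()                     # edge clarity
    return float(separation + richness + edge)

def select_head(self_attn_heads):
    """Pick the self-attention head whose map scores highest."""
    return int(np.argmax([head_discriminability(h) for h in self_attn_heads]))
```

A flat attention map scores zero on all three cues, while a map with a distinct foreground blob scores higher, so the head with the cleanest instance structure wins.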

The countable layout is constructed by generating spatial proposals from the self-attention map through clustering, and processing the cross-attention map by suppressing low activations and applying density-based clustering to form a focus mask. The proposals are filtered based on their semantic overlap with the focus mask, and regions with sufficient overlap are retained as valid instances. The final layout is a 2D semantic map where each pixel belonging to a valid region is assigned the corresponding class label, resulting in a map containing disjoint foreground regions that ideally correspond to individual object instances.
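A minimal version of this construction is sketched below, with thresholding plus connected components standing in for the clustering steps (the paper's actual clustering procedure may differ). Thresholds `tau` and `min_overlap` are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def build_layout(self_attn, cross_attn, cls_id=1, tau=0.5, min_overlap=0.3):
    """Illustrative countable-layout construction: proposals from the
    self-attention map, a focus mask from the cross-attention map, and
    overlap filtering. Connected components stand in for clustering."""
    proposals, n = ndimage.label(self_attn >= tau)     # spatial proposals
    focus = cross_attn >= tau                          # suppress low activations
    layout = np.zeros(self_attn.shape, dtype=int)
    for i in range(1, n + 1):
        region = proposals == i
        overlap = (region & focus).sum() / region.sum()
        if overlap >= min_overlap:                     # semantically supported
            layout[region] = cls_id                    # keep as valid instance
    return layout
```

Proposals without cross-attention support are discarded, so the resulting map contains only disjoint foreground regions that the text token actually attends to.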

In the second phase, the identified layout is used to correct count errors during generation. This is achieved through a conservative two-step approach: layout refinement and layout-guided generation. The layout refinement process adjusts the per-frame layout map to match the target count parsed from the prompt. For object removal, the smallest region of the target category is erased to minimize visual impact. For object addition, a new instance is inserted using a layout template. If existing instances are present, the smallest existing region is copied as the template; otherwise, a circle is used. The template is placed at an optimal location by minimizing a heuristic cost that balances overlap with the existing layout, proximity to the spatial center, and temporal stability across frames. The resulting refined layout preserves the original spatial organization while correcting count errors.
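The removal rule and the placement cost can be sketched concretely. The cost weights and the omission of the temporal-stability term are simplifications; names and formulas are illustrative, not the paper's exact heuristic.

```python
import numpy as np
from scipy import ndimage

def remove_smallest(layout, cls_id):
    """Erase the smallest region of cls_id, minimizing visual impact."""
    labels, n = ndimage.label(layout == cls_id)
    if n == 0:
        return layout
    sizes = ndimage.sum(layout == cls_id, labels, range(1, n + 1))
    out = layout.copy()
    out[labels == int(np.argmin(sizes)) + 1] = 0
    return out

def placement_cost(layout, template_mask, w_overlap=1.0, w_center=0.1):
    """Toy per-frame placement cost: overlap with the existing layout
    plus centroid distance from the spatial center (the paper's
    temporal-stability term is omitted in this sketch)."""
    overlap = ((layout > 0) & template_mask).sum()
    ys, xs = np.nonzero(template_mask)
    cy, cx = (layout.shape[0] - 1) / 2, (layout.shape[1] - 1) / 2
    return w_overlap * overlap + w_center * np.hypot(ys.mean() - cy,
                                                     xs.mean() - cx)
```

Scanning candidate positions and keeping the one with the lowest cost yields a placement that avoids existing instances while staying near the frame center.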

Finally, the refined layout guides the regeneration process through a training-free modulation of the cross-attention. This is achieved by modifying either the pre-softmax attention scores or the bias term, scaled by a monotonically decreasing intensity function that applies stronger guidance early in the denoising process. For object removal, attention suppression is performed by setting the bias term to a large negative constant in regions corresponding to the category token, effectively suppressing unwanted instance generation. For object addition, attention is boosted in the new area. If the instance is templated from a manual circle, the bias term is set to a scalar coefficient. If templated from an existing reference region, the pre-softmax scores are overwritten with the mean score from the reference region, transferring the pre-trained attention properties to the new location. This process ensures stable control superposition and preserves overall visual fidelity.
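The bias-based variant of this modulation is easy to sketch. The schedule shape, the boost coefficient, and the suppression constant below are assumptions for illustration; only the qualitative behavior (strong early guidance, large negative bias for removal, positive bias for addition) follows the description above.

```python
import numpy as np

def intensity(t, T, gamma=2.0):
    """Monotonically decreasing guidance strength over denoising steps
    t = 0..T (the power-law shape is an assumption of this sketch)."""
    return (1.0 - t / T) ** gamma

def modulate_bias(bias, mask, mode, t, T, boost=5.0, neg=-1e4):
    """Edit the pre-softmax cross-attention bias for the category token
    inside `mask`, for either instance removal or addition."""
    out = bias.copy()
    if mode == "remove":
        out[mask] = neg                        # suppress unwanted instances
    elif mode == "add":
        out[mask] += intensity(t, T) * boost   # encourage generation here
    return out
```

Because the intensity decays to zero by the final steps, the guidance shapes coarse structure early and then releases control, which is what preserves overall visual fidelity.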

Experiment

The researchers evaluated NUMINA using CountBench, a new benchmark designed to test numerical fidelity in complex text-to-video scenarios, across various model scales and architectures including Wan and CogVideoX. The experiments demonstrate that NUMINA significantly improves counting accuracy and semantic alignment while maintaining temporal consistency and high visual quality. The results show that the method is highly scalable, effective in high-count scenarios, and provides a more efficient and reliable alternative to traditional trial-and-error strategies like seed search.

The authors evaluate NUMINA against baseline models and existing strategies on text-to-video generation tasks. Results show that NUMINA consistently improves counting accuracy across different model scales while maintaining or enhancing temporal consistency and semantic alignment. The method outperforms seed search and prompt enhancement, especially in complex scenarios with higher object counts.

  • NUMINA significantly improves counting accuracy compared to baseline models and existing strategies.
  • The method maintains or improves temporal consistency and semantic alignment across all tested models.
  • NUMINA enables smaller models to surpass larger baseline models in counting accuracy.

NUMINA improves counting accuracy

The authors evaluate NUMINA's impact on counting accuracy and temporal consistency by comparing baseline results against variants that add the object addition and removal operations. Both operations improve counting accuracy, and their combination achieves the highest performance while also enhancing temporal consistency.

  • Object addition alone significantly improves counting accuracy over the baseline.
  • Combining addition and removal yields the highest counting accuracy and temporal consistency.
  • The method maintains or improves temporal consistency while enhancing numerical alignment.

NUMINA improves counting accuracy

The authors evaluate NUMINA on CogVideoX-5B, showing significant improvements in counting accuracy, temporal consistency, and CLIP score compared to baseline methods.

  • NUMINA substantially improves counting accuracy over baseline and enhancement strategies.
  • The method boosts temporal consistency and CLIP score, indicating better video quality and alignment.
  • NUMINA achieves these gains in a single generation pass, avoiding the need for seed search or prompt enhancement.

NUMINA improves numerical accuracy

The authors introduce NUMINA, a training-free method that enhances numerical alignment in text-to-video generation.

  • NUMINA substantially boosts counting accuracy compared to baseline models and existing strategies.
  • The method improves temporal consistency and semantic alignment without degrading video quality.
  • Combining NUMINA with other enhancement techniques achieves the highest performance.

NUMINA improves counting accuracy

The authors analyze the impact of different hyperparameter values on counting accuracy. Various settings yield comparable accuracy with only minor variations, indicating that the method is robust and stable across a range of configurations.

Ablation on hyperparameter settings

NUMINA is evaluated against baseline models and existing enhancement strategies to validate its effectiveness in improving numerical alignment during text-to-video generation. The results demonstrate that the method consistently enhances counting accuracy and temporal consistency across various model scales and complex scenarios. Furthermore, the approach proves to be highly robust and stable, maintaining high performance even with variations in hyperparameter configurations.

