2달 전

Brian Chao Lior Yariv Howard Xiao Gordon Wetzstein

초록

Diffusion 및 flow matching 모델은 대화형 이미지 및 스트리밍 비디오 생성과 같은 창의적인 콘텐츠 제작 분야에서 전례 없는 능력을 보여주었습니다. 그러나 고해상도, 높은 프레임 레이트 및 긴 context length에 대한 수요가 증가함에 따라, 생성되는 token 수가 늘어날수록 연산 복잡도가 이차 함수적으로 증가하기 때문에 효율적인 생성을 달성하는 것이 점점 더 어려워지고 있습니다.본 연구는 eye tracking 등을 통해 사용자의 시선 위치를 알 수 있거나 추정할 수 있는 환경에서 생성 프로세스의 효율성을 최적화하는 것을 목표로 합니다. 이러한 환경에서 우리는 인간 시각의 중심 이심률에 따른 해상도 변화(eccentricity-dependent acuity)를 활용합니다. 즉, 사용자는 시선이 머무는 좁은 영역(foveal region, 중심와 영역)에서는 매우 높은 해상도의 시각 정보를 인지하지만, 시야의 주변부(periphery)로 갈수록 세부 사항을 식별하는 능력은 급격히 저하됩니다.우리의 접근 방식은 중심와 중심의 해상도를 모델링하는 mask를 사용하여 token을 비균일하게 할당하는 것으로 시작하며, 중심와 영역에는 높은 token density를, 주변부 영역에는 낮은 density를 할당합니다. 이미지 또는 비디오는 이러한 혼합 해상도(mixed-resolution) token 설정에서 생성되며, 이를 통해 token 수와 생성 시간을 획기적으로 줄이면서도 전체 해상도(full-resolution) 생성 결과와 시각적으로 구분할 수 없는 결과를 도출합니다.이를 위해 우리는 고해상도 데이터로부터 혼합 해상도 token을 직접 구축하기 위한 원칙적인 메커니즘을 개발하였습니다. 이를 통해 foveated diffusion model이 기존의 base model로부터 해상도 간 콘텐츠 일관성을 유지하면서도 post-training될 수 있도록 합니다. 우리는 광범위한 분석과 정밀하게 설계된 사용자 연구(user study)를 통해 본 접근 방식을 검증하였으며, 효율적인 생성을 위한 실용적이고 확장 가능한 축으로서 foveation의 효용성을 입증하였습니다.

One-sentence Summary

By leveraging human visual acuity through non-uniform token allocation, the proposed Foveated Diffusion method optimizes image and video generation efficiency by assigning higher token density to foveal regions and lower density to the periphery, achieving perceptually indistinguishable results while significantly reducing computational complexity and generation time.

Key Contributions

The paper introduces Foveated Diffusion, a perceptually motivated framework that optimizes visual generation efficiency by allocating tokens non-uniformly based on a foveation mask. This method concentrates computational resources in high-acuity foveal regions while sparsifying peripheral regions to mimic human visual perception.
A principled mechanism is developed to construct mixed-resolution tokens directly from high-resolution data, enabling a foveated diffusion model to be post-trained from an existing base model. This approach maintains content consistency across varying resolutions without the need for complex multi-stage pipelines or brittle re-noising strategies.
Extensive analysis and user studies demonstrate that the framework achieves significant computational savings and reduced generation time while producing results that are perceptually indistinguishable from full-resolution generation. The method also supports saliency-guided image and video generation, where salient objects align with the high-resolution foveal area.

Introduction

As demands for higher resolution and longer context lengths in image and video generation grow, the quadratic computational complexity of Diffusion Transformer (DiT) attention mechanisms becomes a major bottleneck. Prior efficiency methods often treat all spatial regions uniformly or rely on brittle multi-stage pipelines that can cause structural inconsistencies and artifacts. The authors leverage the eccentricity-dependent acuity of human vision to introduce Foveated Diffusion, a framework that allocates higher token density to the foveal region and lower density to the periphery. By developing a principled mixed-resolution tokenization scheme and a post-training strategy, they enable existing models to generate content that is perceptually indistinguishable from full-resolution outputs while achieving significant speedups in both image and video synthesis.

Method

The authors introduce Foveated Diffusion, a framework designed to enable generative models to produce spatially foveated images and videos while significantly reducing token complexity. This approach leverages the principle of foveated rendering, where computational resources are concentrated in high-resolution (HR) foveal regions while peripheral regions are represented at a lower resolution.

The core of the method lies in Foveated Tokenization. Instead of processing a uniform grid of tokens as standard Diffusion Transformers (DiTs) do, the authors utilize a foveation mask $M$ to define a variable-length token sequence. In this setup, high-resolution tokens are retained in the foveal regions, whereas peripheral regions are represented by a reduced set of tokens. Specifically, a single low-resolution token represents a $2 \times 2$ block of high-resolution tokens, resulting in a total sequence length $L = m + (h \cdot w - m)/4$ , where $m$ is the number of effective tokens in the mask.

To facilitate this mixed-resolution structure, the authors implement Mixed-Resolution Rotary Positional Embedding (RoPE). Standard RoPE assumes a uniform grid, which is incompatible with the varying token densities in foveated layouts. The authors adapt the indexing by aligning the key RoPE phases with query RoPE phases based on their respective resolutions. As shown in the figure below, when performing high-resolution query attention, low-resolution key tokens are sub-sampled, and for low-resolution queries, high-resolution key tokens are sub-sampled and their indices are normalized to the low-resolution grid.

The generation process, referred to as Foveated Generation, begins by sampling Gaussian noise $z_{1}^{\text{fov}}$ in the reduced foveated token space. This sequence is iteratively denoised to produce a clean foveated token sequence $z_{0}^{\text{fov}}$ . To reconstruct the final image, the sequence is split into high- and low-resolution components, which are then decoded separately by a VAE decoder. The resulting images are blended using an upsampled foveation mask $M'$ . The complete pipeline for this generation process is illustrated in the framework diagram.

To ensure the model can handle the structural inconsistencies that arise from mixed-resolution denoising, the authors propose a Foveated Training procedure. This post-training method uses Low-Rank Adaptation (LoRA) to adapt pretrained DiTs. During training, a high-resolution token sequence $z_{0}^{\text{high}}$ is obtained from the original image, and a low-resolution sequence $z_{0}^{\text{low}}$ is obtained by encoding a downsampled version of the image. These are merged into a single, coherent foveated target sequence $z_{0}^{\text{fov}}$ based on the mask. The model is then optimized using the standard flow-matching objective, ensuring that the learned velocity field is consistent across both resolution scales.

Experiment

The researchers evaluated the Foveated Diffusion framework through quantitative metrics and a perceptual user study to compare its performance against full-resolution and naïve mixed-resolution baselines in image and video generation. The results demonstrate that the proposed method achieves significant computational speedups while maintaining visual quality and structural consistency comparable to full-resolution generation. Furthermore, the framework shows strong generalization capabilities, successfully adapting to various foveation patterns and enabling saliency-guided or bounding-box-guided generation for applications like immersive gaming and robotics simulation.

The authors compare their foveated video generation method against full high-resolution generation and a naïve mixed-resolution baseline. Results show that the proposed method achieves performance comparable to or better than the full high-resolution baseline across most evaluated metrics. The proposed method outperforms the naïve mixed-resolution baseline in subject consistency, background consistency, motion smoothness, aesthetic quality, and image quality. The framework maintains high levels of consistency and quality that are nearly indistinguishable from full high-resolution generation. The naïve mixed-resolution baseline exhibits higher dynamic degree but lower performance in almost all other key generative metrics.

The authors conducted a perceptual user study using a Two-Alternative Forced Choice paradigm to compare their method against full high-resolution generation and a naïve mixed-resolution baseline. Results indicate that the proposed method is perceptually indistinguishable from full high-resolution generation while being significantly preferred over the naïve baseline. The proposed method achieves perceptual parity with full high-resolution generation, as evidenced by the lack of statistical significance in preference between them. The method is strongly preferred over the naïve mixed-resolution baseline due to the latter's visual artifacts. Statistical tests confirm that the preference for the proposed method over the naïve baseline is highly significant.

The authors compare their Foveated Diffusion method against a naïve mixed-resolution baseline and full high-resolution generation across various token ratios. Results show that the proposed method maintains high perceptual quality and prompt alignment while significantly reducing runtime compared to full-resolution generation. The proposed method achieves performance comparable to full high-resolution generation in terms of human preference and precision. Foveated Diffusion provides substantial speedups over full-resolution generation as the token ratio decreases. The method consistently outperforms the naïve mixed-resolution baseline in human preference and prompt alignment metrics.

The authors evaluate their foveated video generation method against full high-resolution generation and a naïve mixed-resolution baseline through quantitative metrics, perceptual user studies, and varying token ratio analyses. The results demonstrate that the proposed method achieves visual quality and consistency nearly indistinguishable from full high-resolution generation while significantly outperforming the naïve baseline. Furthermore, the framework provides substantial computational speedups and maintains high prompt alignment across different token ratios.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

2달 전

Brian Chao Lior Yariv Howard Xiao Gordon Wetzstein

초록

One-sentence Summary

Key Contributions

The paper introduces Foveated Diffusion, a perceptually motivated framework that optimizes visual generation efficiency by allocating tokens non-uniformly based on a foveation mask. This method concentrates computational resources in high-acuity foveal regions while sparsifying peripheral regions to mimic human visual perception.
A principled mechanism is developed to construct mixed-resolution tokens directly from high-resolution data, enabling a foveated diffusion model to be post-trained from an existing base model. This approach maintains content consistency across varying resolutions without the need for complex multi-stage pipelines or brittle re-noising strategies.
Extensive analysis and user studies demonstrate that the framework achieves significant computational savings and reduced generation time while producing results that are perceptually indistinguishable from full-resolution generation. The method also supports saliency-guided image and video generation, where salient objects align with the high-resolution foveal area.

Introduction

Method

Experiment

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

Command Palette

Foveated Diffusion: 효율적인 공간 적응형 이미지 및 비디오 생성

Brian Chao Lior Yariv Howard Xiao Gordon Wetzstein

초록

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

Foveated Diffusion: 효율적인 공간 적응형 이미지 및 비디오 생성

Brian Chao Lior Yariv Howard Xiao Gordon Wetzstein

초록

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

Foveated Diffusion: 효율적인 공간 적응형 이미지 및 비디오 생성

Brian Chao Lior Yariv Howard Xiao Gordon Wetzstein

초록

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

AI로 AI 구축

HyperAI Newsletters