Command Palette
Search for a command to run...
TC-AE: Deep Compression Autoencoder의 Token Capacity를 극대화하는 방법
TC-AE: Deep Compression Autoencoder의 Token Capacity를 극대화하는 방법
Teng Li Ziyuan Huang Cong Chen Yangfu Li Yuanhuiyi Lyu Dandan Zheng Chunhua Shen Jun Zhang
초록
제시하신 영문 텍스트를 전문적인 기술 및 학술 스타일의 한국어로 번역한 결과입니다. (요청하신 대로 한국어로 답변드립니다.)[번역문]본 논문에서는 딥 컴프레션(deep compression) 오토인코더를 위한 ViT 기반 아키텍처인 TC-AE를 제안한다. 기존 방식들은 높은 압축률에서도 재구성 품질을 유지하기 위해 잠재 표현(latent representation)의 채널 수를 늘리는 방식을 흔히 사용한다. 그러나 이러한 전략은 종종 잠재 표현의 붕괴(latent representation collapse)를 초래하여 생성 성능을 저하시키는 결과를 낳는다. TC-AE는 점점 더 복잡해지는 아키텍처나 다단계 학습(multi-stage training) 방식에 의존하는 대신, 픽셀과 이미지 latent 사이의 핵심 가교 역할을 하는 token space의 관점에서 이 문제를 해결하고자 하며, 이를 위해 두 가지 상호 보완적인 혁신을 도입한다.첫째, 고정된 latent budget 하에서 ViT의 patch size를 조정함으로써 token 수 스케일링(token number scaling)을 연구하였으며, 공격적인 token-to-latent 압축이 효과적인 스케일링을 제한하는 핵심 요인임을 확인하였다. 이 문제를 해결하기 위해 우리는 token-to-latent 압축을 두 단계로 분해하여, 구조적 정보의 손실을 줄이고 생성을 위한 효과적인 token 수 스케일링을 가능하게 했다. 둘째, 잠재 표현의 붕괴를 더욱 완화하기 위해 공동 자기지도 학습(joint self-supervised training)을 통해 이미지 token의 시맨틱 구조를 강화함으로써, 생성에 더욱 유리한(generative-friendly) latents를 도출한다. 이러한 설계를 통해 TC-AE는 딥 컴프레션 환경에서도 재구성 및 생성 성능을 실질적으로 향상시킨다. 본 연구가 시각적 생성을 위한 ViT 기반 tokenizer 발전에 기여하기를 기대한다.
One-sentence Summary
TC-AE is a Vision Transformer-based architecture for deep compression autoencoders that addresses latent representation collapse by decomposing token-to-latent compression into two stages and employing joint self-supervised training to enhance semantic structure, thereby enabling effective token scaling and achieving superior reconstruction and generative performance.
Key Contributions
- The paper introduces TC-AE, a Vision Transformer-based architecture designed for deep compression autoencoders that optimizes the token space to prevent latent representation collapse.
- This work proposes a staged token compression strategy that redistributes the compression process across encoder stages to mitigate information loss and enable effective scaling of token numbers.
- The method incorporates a self-supervised joint training mechanism to enhance the semantic structure of image tokens, which results in improved reconstruction and generative performance on ImageNet.
Introduction
Latent diffusion models rely on tokenizers to compress images into efficient latent representations for generative modeling. While recent research pushes for deeper compression by reducing spatial resolution, existing methods often compensate by increasing channel numbers, which frequently leads to latent representation collapse and degraded generative performance. The authors leverage the token space as a critical bridge between pixels and latents to address these limitations. They introduce TC-AE, a ViT-based architecture that utilizes staged token compression to prevent structural information loss and incorporates a joint self-supervised training objective to enhance semantic structure. This approach enables effective token number scaling, significantly improving both reconstruction and generative quality under high compression ratios.
Method
The authors leverage a Vision Transformer (ViT)-based framework for image autoencoding, where an encoder E compresses an input image X∈RH×W×3 into a latent representation z∈Rh×w×c, which is subsequently reconstructed by a decoder D. The process begins with a patch embedding layer ϕp(⋅) that partitions the image into non-overlapping p×p patches, projects each patch into a d-dimensional vector, and flattens the grid into a sequence of N=HW/p2 tokens, T∈RN×d. These tokens are processed by a stack of Transformer layers TF(⋅) and then compressed by a bottleneck layer B(⋅) to produce the final latent representation. The spatial compression ratio from pixels to latent space decomposes into two stages: pixel-to-token compression fpix→tok=p2 and token-to-latent compression ftok→lat=N/(h⋅w), with the image tokens serving as an information bridge between the input and latent domains.
To enhance the semantic structure of the token space and improve generative performance, the authors introduce a joint self-supervised learning (SSL) objective using iBOT. This framework employs a student–teacher distillation paradigm, where the teacher is an exponential moving average (EMA) of the student. The student processes the input image through two augmentation pipelines: Augstu(⋅) generates global crops with random patch masking and additional local crops, while Augtea(⋅) produces two global crops. For masked global crops, the student is trained to predict the teacher's patch-token outputs, forming a masked image modeling objective LMIM. For local crops, the student's class-token predictions are aligned with the teacher's to enforce semantic consistency, yielding the class-token distillation loss L[CLS]. The combined self-supervised objective is LiBOT=LMIM+L[CLS], which encourages both local and global semantic structure in the token representation.

As shown in the figure below, the proposed TC-AE architecture consists of a ViT encoder, a latent bottleneck, and a structurally symmetric decoder. The encoder design incorporates staged token compression to mitigate structural information loss at the bottleneck. It begins with a patch embedding layer using a small patch size p to generate high-resolution image tokens, reducing information loss at the initial pixel-to-token stage. These fine-grained tokens are processed by the first M ViT blocks to capture rich visual details and semantic structure. An intermediate bottleneck then compresses the token sequence to one-fourth its length, producing a compact and structured intermediate representation. This compressed sequence is further processed by the remaining N ViT blocks, after which a second bottleneck yields the final latent representation for downstream generative modeling.
The training scheme of TC-AE jointly optimizes the tokenizer with the self-supervised objective, enabling the ViT encoder to learn latent representations with stronger semantic regularization without requiring external large-scale pretraining. This lightweight training approach contrasts with methods like VTP, making TC-AE practical under limited computational resources. The overall training objective combines the standard reconstruction loss with the self-supervised objective: LTC-AE=αLrec+LiBOT. The reconstruction loss Lrec is defined as Lpix+λpLp+λqLq, where Lpix is a pixel-level ℓ1 loss, Lp is a perceptual loss for high-level semantic discrepancies, and Lq is an adversarial loss to enhance the realism of reconstructed images. The combination of staged token compression and joint self-supervised training accelerates diffusion model convergence.
Experiment
The experiments evaluate how scaling token numbers and compression strategies affect the reconstruction and generative performance of deep compression autoencoders. While increasing image tokens improves reconstruction quality, it fails to enhance generative performance due to severe semantic information loss at the compression bottleneck. To resolve this, the authors propose staged token compression and self-supervised learning, which effectively preserve semantic structure and enable generative quality to scale with token density. These methods work synergistically to improve training efficiency and achieve superior generative results compared to existing tokenizers at a lower computational cost.
The authors analyze the impact of increasing image token numbers on reconstruction and generation quality under a fixed latent budget. Results show that while reconstruction improves with more tokens, generative performance does not, due to severe semantic information loss during bottleneck compression. Introducing staged token compression mitigates this loss, enabling better generative quality and scaling with token number. Increasing token numbers improves reconstruction but not generation due to semantic loss at the bottleneck. Staged token compression reduces structural information loss and enables generative performance to scale with token count. The proposed method achieves strong generative quality with fewer tokens and lower computational cost compared to existing approaches.

The the the table outlines the training settings for joint training and decoder finetuning in TC-AE. It specifies differences in epochs, batch size, learning rates, optimizers, and other hyperparameters between the two phases. Joint training uses more epochs and a larger batch size compared to decoder finetuning. The base learning rate is higher for joint training than for decoder finetuning. Different optimizers are used for joint training and decoder finetuning, with AdamW used in both cases.

The authors compare the impact of latent channel dimensions on model performance, finding that increasing the channel size from 128 to 256 leads to mixed results. While some reconstruction metrics improve, generative performance degrades, suggesting that a higher channel dimension may exacerbate representation collapse. Increasing the latent channel dimension from 128 to 256 improves reconstruction quality but reduces generative performance. The trade-off between reconstruction fidelity and generatability is evident, with 128 channels providing better generative outcomes. Staged token compression and self-supervision improve performance across both channel sizes, but the benefits are more pronounced at 128 channels.

The authors compare TC-AE and DC-AE under identical settings, showing that TC-AE achieves better generative performance with significantly lower computational cost. Both models have the same latent shape, but TC-AE demonstrates superior results in both reconstruction and generation metrics. TC-AE achieves better generative performance than DC-AE with lower computational cost. Both models use the same latent shape, indicating a fair comparison. TC-AE shows improvements in both reconstruction and generation quality metrics compared to DC-AE.

The authors investigate the impact of increasing image token numbers on reconstruction and generative performance under a fixed latent budget. Results show that while reconstruction quality improves with more tokens, generative performance does not, due to severe semantic information loss during compression. Introducing staged token compression mitigates this loss and enables generative quality to scale with token count. Increasing token numbers improves reconstruction quality but not generative performance under a fixed latent budget. Staged token compression reduces semantic loss during compression, enabling generative performance to scale with token count. Self-supervision and staged token compression together enhance generative quality while maintaining reconstruction fidelity.

The authors evaluate the effects of token scaling, latent channel dimensions, and compression strategies on reconstruction and generative performance. While increasing token counts or channel dimensions can improve reconstruction, they often lead to semantic loss or representation collapse that degrades generative quality. By implementing staged token compression and self-supervision, the proposed TC-AE model effectively mitigates these issues, achieving superior generative performance and higher computational efficiency compared to DC-AE.