MaskBit: Embedding-free Image Generation via Bit Tokens

Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within latent space - these frameworks offer promising avenues for image synthesis. In this study, we present two primary contributions: first, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN; second, a novel embedding-free generation network operating directly on bit tokens - a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed details. The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark, with a compact generator model of only 305M parameters.
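
To make the notion of "bit tokens" concrete, the sketch below shows one way a lookup-free binary quantizer can be written in PyTorch: each of the K channels of a latent token is mapped to {-1, +1}, so a token is its K-bit sign pattern rather than an index into a learned embedding table. This is a minimal illustration under assumed conventions; the function name, the bit width, and the straight-through gradient trick are illustrative choices, not the authors' released code.

```python
import torch

def binary_quantize(latents: torch.Tensor):
    """Hypothetical sketch of a lookup-free binary quantizer.

    Each of the K channels of a token is snapped to {-1, +1}, so the
    token is fully described by K bits and no codebook lookup is needed.
    """
    bits = torch.where(latents > 0,
                       torch.ones_like(latents),
                       -torch.ones_like(latents))
    # Straight-through estimator: forward pass uses the hard bits,
    # backward pass treats quantization as the identity function.
    quantized = latents + (bits - latents).detach()
    # Optional integer id: read the K sign bits as a K-bit binary number.
    weights = 2 ** torch.arange(latents.shape[-1], device=latents.device)
    token_ids = ((bits > 0).long() * weights).sum(dim=-1)
    return quantized, token_ids

# Example: a 16x16 token grid with K=12 bits per token,
# giving an implicit 2^12 = 4096-way "codebook" with no learned embeddings.
latents = torch.randn(1, 16, 16, 12)
quantized, token_ids = binary_quantize(latents)
print(quantized.shape, token_ids.shape, int(token_ids.max()))
```

Because every token is just its bit pattern, a generator can consume the bits directly (hence "embedding-free") instead of looking the token id up in an embedding matrix.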