
MIT Researchers Discover New Method to Edit and Generate Images Using Highly Compressed 1D Tokenizers


AI image generation, which uses neural networks to create new images from inputs such as text prompts, is expected to become a billion-dollar industry by the end of the decade. Current methods involve training models on massive datasets containing millions of images, a process that can take weeks or months and consume substantial computational resources. However, a research paper presented at the International Conference on Machine Learning (ICML 2025) in Vancouver this summer describes a novel and more efficient approach. The paper, authored by Lukas Lao Beyer, a graduate student researcher at MIT's Laboratory for Information and Decision Systems (LIDS); Tianhong Li, a postdoc at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL); Xinlei Chen of Facebook AI Research; Sertac Karaman, an MIT professor of aeronautics and astronautics and director of LIDS; and Kaiming He, an MIT associate professor of electrical engineering and computer science, outlines a method for generating and manipulating images without a traditional generator.

The research originated as a class project in a graduate seminar on deep generative models taught by He last autumn. Lao Beyer and He recognized that the findings had potential well beyond a typical course assignment. The project was inspired by a June 2024 paper from researchers at the Technical University of Munich and ByteDance, which introduced a one-dimensional tokenizer that translates a 256x256-pixel image into just 32 tokens, each a 12-digit binary number (a string of 1s and 0s). Together, these tokens form a highly compressed, abstract "language" that the computer understands.

Lao Beyer's initial goal was to work out what the individual tokens encode by manipulating them. By removing or replacing tokens, he discovered that specific tokens controlled particular aspects of the image, such as resolution, blurriness, brightness, and pose. For example, changing a single token could alter the direction of a bird's head in an image. This finding, unprecedented in the field, suggested that tokens could be used to edit images precisely without the need for a separate generator.

The MIT team refined this approach by combining the 1D tokenizer with a detokenizer (decoder) and leveraging the off-the-shelf CLIP model, which measures how well an image matches a text prompt. They demonstrated that this setup could generate images, convert one type of image into another (such as turning a red panda into a tiger), and perform "inpainting" to fill in missing parts of images. Notably, these tasks were accomplished without training a dedicated generator, significantly reducing computational costs.
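To make the workflow concrete, here is a minimal, hypothetical sketch of what editing an image by manipulating its 32 tokens could look like. The `ToyDetokenizer`, `clip_score`, and `edit_by_token_search` names are illustrative stand-ins invented for this sketch, not the team's released code: a real setup would load the pretrained 1D tokenizer and detokenizer along with an off-the-shelf CLIP model, and the paper's actual optimization procedure may differ from the simple greedy search shown here.

```python
# Conceptual sketch of text-guided editing over a 1D token sequence.
# Everything below is an illustrative stand-in, NOT the released MIT code.
import torch
import torch.nn as nn

SEQ_LEN  = 32       # 32 tokens per image, as described in the article
CODEBOOK = 2 ** 12  # each token is a 12-digit binary code -> 4096 values

class ToyDetokenizer(nn.Module):
    """Stand-in decoder: maps 32 discrete tokens back to a 256x256 RGB image."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(CODEBOOK, 64)
        self.to_image = nn.Linear(SEQ_LEN * 64, 3 * 256 * 256)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens).flatten(1)             # (B, 32*64)
        return self.to_image(x).view(-1, 3, 256, 256) # (B, 3, 256, 256)

def clip_score(image: torch.Tensor, prompt: str) -> torch.Tensor:
    """Placeholder for a real CLIP image-text similarity score."""
    return image.mean() * 0.0 + torch.rand(1)         # dummy scalar

def edit_by_token_search(tokens, detokenizer, prompt, position, n_candidates=64):
    """Greedy, gradient-free edit: try candidate codes at one token position
    and keep whichever decoded image best matches the text prompt."""
    best_tokens, best_score = tokens.clone(), -float("inf")
    for code in torch.randint(0, CODEBOOK, (n_candidates,)):
        candidate = tokens.clone()
        candidate[0, position] = code                 # swap a single token
        with torch.no_grad():
            score = clip_score(detokenizer(candidate), prompt).item()
        if score > best_score:
            best_tokens, best_score = candidate, score
    return best_tokens

tokens = torch.randint(0, CODEBOOK, (1, SEQ_LEN))     # a "tokenized image"
edited = edit_by_token_search(tokens, ToyDetokenizer(), "a tiger", position=5)
```

The point the sketch captures is that, because the entire image lives in just 32 discrete tokens, editing reduces to changing a few token values and re-decoding, with no generator network to train.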
The implications of this work are far-reaching. According to Saining Xie, a computer scientist at New York University, the research "redefines the role of tokenizers" by showing that they can do much more than compress images: a 1D tokenizer can handle tasks like inpainting and text-guided editing, making the process far more efficient. Zhuang Liu of Princeton University adds that the study demonstrates "we can generate and manipulate images in a way that is much easier than previously thought," potentially reducing the costs and computational demands of image generation several times over.

Beyond computer vision, the extreme compression offered by 1D tokenizers could have applications in other fields. Professor Karaman suggests that tokenizing the actions of robots or self-driving cars could broaden the impact of this work, and Lao Beyer envisions using tokens to represent different routes for autonomous vehicles, enabling more efficient decision-making.

Industry insiders laud the potential of this research, seeing it as a game-changer for AI image generation and manipulation that offers more precise control with fewer resources. The simplicity and efficiency of the 1D tokenizer-decoder combination could unlock new use cases and drive innovation across domains, from robotics to self-driving technology. By streamlining image editing and generation, the method may significantly lower barriers to entry for developers and enhance the practicality of AI in these areas.
