
Abstract
Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations that balance accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
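To make the core idea of rendering text into images concrete, here is a minimal sketch, not the authors' pipeline: it lays a long text onto fixed-size page images with PIL so a VLM could consume visual tokens instead of text tokens. The page size, font, margins, and wrapping width are illustrative assumptions; in Glyph, such rendering parameters are the configurations explored by the LLM-driven genetic search.

```python
# Minimal, hypothetical sketch of text-to-image rendering for visual context scaling.
# All rendering parameters below are illustrative assumptions, not Glyph's actual settings.
from textwrap import wrap
from PIL import Image, ImageDraw, ImageFont


def render_text_to_pages(text, page_size=(1024, 1024), font_size=14,
                         margin=16, line_spacing=4, chars_per_line=110):
    """Render `text` onto one or more page images and return them as a list."""
    font = ImageFont.load_default()  # assumption: swap in a real TTF font for production use
    lines = []
    for paragraph in text.split("\n"):
        lines.extend(wrap(paragraph, width=chars_per_line) or [""])

    line_height = font_size + line_spacing
    lines_per_page = max(1, (page_size[1] - 2 * margin) // line_height)

    pages = []
    for start in range(0, len(lines), lines_per_page):
        page = Image.new("RGB", page_size, "white")
        draw = ImageDraw.Draw(page)
        y = margin
        for line in lines[start:start + lines_per_page]:
            draw.text((margin, y), line, fill="black", font=font)
            y += line_height
        pages.append(page)
    return pages


# Usage: pages = render_text_to_pages(open("long_doc.txt").read())
# Each page image would then be passed to a VLM together with the task prompt.
```

The compression comes from the fact that a single page image occupies far fewer visual tokens in the VLM than the raw text it depicts would occupy as text tokens; tighter fonts and denser layouts trade accuracy for higher compression, which is exactly the balance the genetic search optimizes.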