TTT-E2E: End-to-End Test-Time Training Enables LLMs to Learn from Context Like Humans, Scaling Efficiently with Long Sequences
Large language models (LLMs) continue to expand their context windows, enabling them to process vast amounts of information at once. Yet despite this progress, they often fail to learn from past interactions, repeating mistakes and requiring users to restate context. A human colleague would adapt over time, drawing on experience to improve. Why can’t LLMs do the same?

The core issue lies in how LLM memory differs from human memory. Humans integrate experiences into lasting intuition, even when specific details fade. Standard transformers, in contrast, rely on full attention, which keeps every token in a cache; the cost of generating each new token therefore grows linearly with context length, so a token at the end of a 128K sequence takes far longer to process than one near the beginning. To address this, researchers have introduced efficient alternatives such as sliding-window attention, Mamba, and Gated DeltaNet, which keep per-token latency constant but sacrifice accuracy as context grows: they discard predictive signal, degrading performance on long contexts.

This blog post presents TTT-E2E, an end-to-end test-time training approach that reimagines LLM memory by compressing context into model weights during inference. The method uses next-token prediction to update the model’s parameters on the fly from the current context, capturing essential patterns and relationships without storing every detail.

The key innovation is meta-learning at initialization. Rather than undergoing only standard pre-training, TTT-E2E is trained to be ready for test-time adaptation. This yields two layers of optimization: an inner loop that fine-tunes the model on the current context via next-token prediction, and an outer loop that ensures the final prediction remains accurate after that adaptation.

Results show TTT-E2E scales effectively in both loss and latency.
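The two-loop structure can be sketched in miniature. The toy below is a hypothetical simplification, not the paper’s architecture: it adapts a single scalar weight instead of transformer parameters, and approximates the meta-gradient (a gradient of a gradient) with finite differences rather than backpropagating through the inner step. All names and constants are illustrative.

```python
# Toy sketch of TTT-E2E's two-loop optimization (hypothetical simplification).
# Model: y_hat = w * x with squared error as the next-token-prediction surrogate.

def inner_loss(w, context):
    # Inner objective: prediction loss over the context pairs (x, y).
    return sum((w * x - y) ** 2 for x, y in context) / len(context)

def inner_step(w, context, lr=0.1):
    # One test-time-training step: gradient descent on the inner loss.
    grad = sum(2 * (w * x - y) * x for x, y in context) / len(context)
    return w - lr * grad

def outer_loss(w0, context, query):
    # Outer objective: loss on a held-out "next token" AFTER adaptation.
    w_adapted = inner_step(w0, context)
    xq, yq = query
    return (w_adapted * xq - yq) ** 2

def meta_gradient(w0, context, query, eps=1e-5):
    # Gradient of the outer loss w.r.t. the initial weight w0.
    # Differentiating through inner_step is a gradient-of-gradient;
    # approximated here by central finite differences for clarity.
    lo = outer_loss(w0 - eps, context, query)
    hi = outer_loss(w0 + eps, context, query)
    return (hi - lo) / (2 * eps)

# Meta-training: choose w0 so that one inner-loop step on the context
# yields accurate prediction on the query.
context = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # data generated by y = 2x
query = (4.0, 8.0)
w0 = 0.0
for _ in range(300):
    w0 -= 0.2 * meta_gradient(w0, context, query)
```

After meta-training, the adapted weight lands near the true slope of 2, and the outer loss on the held-out query is close to zero. In the real method, the inner loop runs over the test-time context and the outer loop shapes the pre-trained initialization, but the nesting is the same.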
Unlike full attention, which becomes prohibitively slow at long context, TTT-E2E maintains constant inference cost per token, making it 2.7x faster than full attention at 128K context and 35x faster at 2M tokens on an NVIDIA H100. Meanwhile, it matches or exceeds full attention in loss, even at extreme lengths. This dual scalability is the key result: while other efficient methods degrade as context grows, TTT-E2E shows no sign of hitting a performance wall across extensive experiments, suggesting a viable solution to long-context challenges may be within reach by 2026.

Retrieval-augmented generation (RAG) remains important but complementary. RAG is like using a notepad: helpful for precise recall, such as remembering a grocery list. Long-term intelligence, however, comes from internalized understanding, not external storage. TTT-E2E strengthens this internal memory, letting models learn from context at test time and improve over interactions.

One limitation is the computational cost of meta-learning, which requires gradients of gradients. Current implementations are 3.4x slower than standard pre-training due to limitations in FlashAttention’s API. Possible remedies include developing custom kernels or initializing TTT-E2E from standard pre-trained models.

For deeper technical details, the full paper and code are publicly available. The research opens a new path toward truly adaptive models that learn not just from training data, but from every interaction they experience.
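The latency argument can be made concrete with a back-of-the-envelope cost model. The constants below are hypothetical and chosen only for illustration; the 2.7x and 35x figures above come from measured H100 wall-clock time, not from this model.

```python
# Illustrative cost model (hypothetical constants): full attention pays a
# per-token cost proportional to how many previous tokens it attends to,
# while a constant-state method like TTT-E2E pays fixed work per token.

def attention_cost(position, per_pair=1.0):
    # Generating the token at `position` means attending to all previous
    # tokens, so the per-token cost grows linearly with position.
    return per_pair * position

def constant_state_cost(position, per_token=1000.0):
    # Fixed work per token (e.g., one inner-loop weight update),
    # independent of how long the context already is.
    return per_token

# Per-token cost ratio at the end of short vs. long contexts:
for n in (8_000, 128_000, 2_000_000):
    ratio = attention_cost(n) / constant_state_cost(n)
    print(f"context {n:>9,}: attention/constant ratio = {ratio:,.1f}x")
```

Under this model, the advantage of a constant-cost method grows without bound as context lengthens, which is why the measured speedup is larger at 2M tokens than at 128K.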
