12 hours ago

Sangyun Lee Sean McLeish Tom Goldstein Giulia Fanti

Abstract

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During the sleep, the model performs N offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. During inference, this shifts extra computation to the sleep while preserving the latency of wake-time prediction. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which a regular transformer as well as SSM-attention hybrid models fail. We then show that increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning.

One-sentence Summary

To address poor attention scaling in transformer-based large language models, the authors propose a sleep-like consolidation mechanism that converts recent context into persistent fast weights within state-space model blocks through N offline recurrent passes, shifting computation to sleep periods to preserve the latency of wake-time prediction while achieving improved performance on cellular automata, multi-hop graph retrieval, and a realistic math reasoning task where regular transformers and SSM-attention hybrids fail.

Key Contributions

A sleep-like consolidation mechanism is introduced where a model periodically converts recent context into persistent fast weights before clearing its key-value cache. Offline recurrent passes update fast weights within state-space model blocks through a learned local rule, shifting computation to sleep periods without increasing inference latency.
The approach is evaluated on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task. Regular transformers and SSM-attention hybrid models fail on these tasks, whereas the proposed method improves as sleep duration increases.
Increasing sleep duration N improves performance, with the largest gains observed on examples that require deeper reasoning. This indicates that additional sleep-time computation is most beneficial when reasoning depth increases.

Introduction

Large language models typically rely on attention mechanisms that scale poorly with context length, which has led to the adoption of hybrid architectures that combine attention with fixed-size fast-weight memories. Yet these prior models struggle with deep reasoning tasks even when memory capacity is sufficient, because they lack the computation needed to transform evicted context into useful internal states. Taking inspiration from biological sleep, the authors introduce a consolidation phase in which the model performs recurrent forward passes over the accumulated context without external input. This process updates the fast weights to preserve information for later inference, significantly improving reasoning performance on tasks that require deep computation over evicted tokens.

Method

The proposed architecture addresses the memory scaling issues of standard transformers by interleaving attention layers with State Space Model (SSM) blocks. In this hybrid design, attention layers maintain a Key-Value (KV) cache that grows linearly with the sequence length, while SSM layers store information in a fixed-size fast-weight state. The model is constructed by stacking these blocks, where an attention block is denoted as $\mathcal{B}_{\ell}^{\text{attn}}$ and an SSM block as $\mathcal{B}_{\ell}^{\text{ssm}}$. The SSM blocks use a gated Hebbian-like update rule to compress past information into their internal state $\mathbf{S}_t$:

$\mathbf{S}_t = \alpha_t \mathbf{S}_{t-1} + \beta_t \mathbf{v}_t \mathbf{k}_t^\top$

Here, $\alpha_t$ and $\beta_t$ serve as data-dependent forget and input gates, enabling the model to retain relevant history without expanding memory requirements.
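The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the dimensions, and the randomly drawn gate values are all assumptions made for the example (in the model, $\alpha_t$ and $\beta_t$ are computed from the input rather than sampled).

```python
import numpy as np

def fast_weight_step(S, k, v, alpha, beta):
    """One gated update: S_t = alpha_t * S_{t-1} + beta_t * v_t k_t^T."""
    return alpha * S + beta * np.outer(v, k)

def run_fast_weights(keys, values, alphas, betas):
    """Fold a whole sequence into one fixed-size state, starting from S_0 = 0."""
    S = np.zeros((values.shape[1], keys.shape[1]))
    for k, v, a, b in zip(keys, values, alphas, betas):
        S = fast_weight_step(S, k, v, a, b)
    return S

# Hypothetical sizes and placeholder gates, chosen only for illustration.
rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3
keys = rng.normal(size=(T, d_k))
values = rng.normal(size=(T, d_v))
alphas = rng.uniform(0.5, 1.0, size=T)  # forget gates (placeholders)
betas = rng.uniform(0.0, 1.0, size=T)   # input gates (placeholders)

S_final = run_fast_weights(keys, values, alphas, betas)
print(S_final.shape)  # (3, 4)
```

The state keeps the shape `(d_v, d_k)` however many tokens are folded in, which is the sense in which the memory is fixed-size.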

To manage contexts that exceed the attention window, the system implements a consolidation mechanism known as "LLM Sleep." This process involves performing multiple offline recurrent passes over the context before discarding the attention cache.

As illustrated in the framework diagram, the model processes input tokens until it reaches the eviction boundary. At this point, the system executes $N$ recurrent passes over the current context, indicated by the green dashed loop labeled $\times N$. During these passes, the fast weights in the SSM blocks are iteratively refined to encode the accumulated information. Once the passes are complete, the KV cache in the attention blocks, represented by purple squares, is cleared. The refined fast weights, shown as green network icons, persist across the boundary to support subsequent predictions. This approach lets the model spend extra computation on soon-to-be-evicted context during the sleep phase without adding latency to wake-time prediction. Training backpropagates through the entire computational graph, including the recurrent consolidation steps.
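The control flow of this consolidation step (replay the cached context $N$ times, update the fast weights, then drop the cache) can be sketched as a toy. Everything specific here is an assumption: the summary does not describe the learned local rule, so a simple error-correcting delta rule stands in for it, and `sleep`, `recall_error`, the sizes, and the step size `beta` are hypothetical names and values chosen for illustration.

```python
import numpy as np

def recall_error(S, context):
    """Mean squared error when reading each value back from S with its key."""
    return float(np.mean([np.sum((v - S @ k) ** 2) for k, v in context]))

def sleep(S, kv_cache, n_passes, beta=0.5):
    """Replay the cached context n_passes times, then clear the cache."""
    S = S.copy()
    for _ in range(n_passes):
        for k, v in kv_cache:
            # Stand-in for the learned local rule (assumption): a delta-rule
            # update that moves S @ k toward v.
            S += beta * np.outer(v - S @ k, k)
    kv_cache.clear()  # only the fast weights survive the eviction boundary
    return S

rng = np.random.default_rng(0)
d_k, d_v, T = 32, 4, 8
keys = rng.normal(size=(T, d_k))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys
values = rng.normal(size=(T, d_v))
context = list(zip(keys, values))

errors = {}
for n in (0, 1, 8):
    cache = list(context)  # wake-time cache accumulated so far
    S = sleep(np.zeros((d_v, d_k)), cache, n_passes=n)
    assert len(cache) == 0  # cache is gone after sleep
    errors[n] = recall_error(S, context)
print(errors)
```

In this toy, more passes leave less recall error in the fast weights while the wake-time state stays the same size. It mirrors only the shape of the procedure, not the paper's trained behavior or results.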

Experiment

This study evaluates attention-SSM hybrid models under hard context eviction constraints using synthetic reasoning tasks like Rule 110 and Depo, alongside the GSM-Infinite math benchmark. By varying the number of offline sleep loops during memory consolidation, the results demonstrate that additional recurrence significantly improves performance on deep sequential computation and multi-hop retrieval where standard single-pass models fail. These findings confirm that extending sleep-time computation allows models to encode evicted context into fast weights more effectively, a trend that persists across both controlled synthetic environments and realistic pretrained LLMs.
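For readers unfamiliar with Rule 110, the following sketch shows the elementary cellular automaton itself. It illustrates why predicting a state several steps ahead is inherently sequential: each step depends on the full result of the previous one. The periodic boundary condition and the function names are assumptions; the summary does not say how the benchmark encodes states or handles boundaries.

```python
def rule110_step(cells):
    """Apply one step of Rule 110 with periodic boundaries (an assumption)."""
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        # The neighborhood read as a 3-bit number selects one bit of 110.
        out.append((110 >> (left * 4 + center * 2 + right)) & 1)
    return out

def rule110_rollout(cells, steps):
    """Apply the rule `steps` times; there is no shortcut past the loop here."""
    for _ in range(steps):
        cells = rule110_step(cells)
    return cells

print(rule110_step([0, 0, 0, 1]))        # [0, 0, 1, 1]
print(rule110_rollout([0, 0, 0, 1], 3))  # [1, 1, 0, 1]
```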

Language Models Need Sleep
