Video World Models with Long-term Spatial Memory

Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing the long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.
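To make the store/retrieve idea concrete, the sketch below illustrates one plausible form of a geometry-grounded spatial memory: world-space 3D points with features are stored as frames are generated, and the points falling inside a query camera's frustum are retrieved as conditioning for the next frame. This is a minimal illustration under our own assumptions, not the paper's implementation; the class name, method signatures, and pose/intrinsics conventions are all hypothetical.

import numpy as np

class SpatialMemory:
    """Hypothetical geometry-grounded long-term spatial memory."""

    def __init__(self):
        self.points = np.empty((0, 3))    # world-space 3D points
        self.features = np.empty((0, 0))  # per-point feature vectors

    def store(self, points_world, features):
        """Append newly observed 3D points and their features."""
        if self.features.size == 0:
            self.features = features
        else:
            self.features = np.vstack([self.features, features])
        self.points = np.vstack([self.points, points_world])

    def retrieve(self, K, w2c, image_size, near=0.1, far=50.0):
        """Return stored points/features visible from a query camera pose.

        K: 3x3 intrinsics, w2c: 4x4 world-to-camera extrinsics (assumed).
        """
        h, w = image_size
        pts_h = np.hstack([self.points, np.ones((len(self.points), 1))])
        cam = (w2c @ pts_h.T).T[:, :3]                 # world -> camera coords
        in_depth = (cam[:, 2] > near) & (cam[:, 2] < far)
        proj = (K @ cam.T).T                           # project to image plane
        uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)
        in_view = (uv[:, 0] >= 0) & (uv[:, 0] < w) & \
                  (uv[:, 1] >= 0) & (uv[:, 1] < h) & in_depth
        return self.points[in_view], self.features[in_view]

Retrieval by camera frustum is one simple choice for deciding which memories are relevant to a revisited viewpoint; the actual framework may use a different representation or retrieval rule.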