Language Models Are Injective and Invertible: Proving Exact Input Recovery from Hidden Activations

Transformer-based language models are widely used for tasks ranging from text generation to semantic understanding, yet their internal representational properties remain poorly understood. A common assumption is that components such as non-linear activations and normalization layers are inherently non-injective, meaning multiple distinct inputs could produce identical outputs, which would make exact input recovery from hidden representations impossible. This paper challenges that assumption, presenting a rigorous mathematical and empirical case for the injectivity of language models.

The authors prove that, under standard architectural assumptions, transformer language models mapping discrete input sequences to continuous sequence representations are injective at initialization and maintain this property throughout training. Injectivity implies that each input sequence corresponds to a unique representation, enabling lossless encoding and, crucially, exact invertibility. The proof relies on the structure of token embeddings, positional encodings, and the sequential nature of attention and feed-forward layers, showing that no two distinct input sequences can produce identical hidden states across all layers.

To validate this theoretical claim, the researchers conducted billions of collision tests across six state-of-the-art language models, including models from the Llama, Mistral, and GPT families. No collisions were observed, supporting the assertion that these models are effectively injective in practice.

Building on this insight, the authors introduce SipIt (Sequence Inversion via Provable Injectivity and Tracking), the first algorithm that provably and efficiently reconstructs the original input text from hidden activations. SipIt operates in linear time relative to the input length and guarantees exact recovery under the injectivity assumption.
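The collision testing described above can be sketched in a few lines. The snippet below is illustrative, not the authors' code: a toy deterministic encoder (an invented stand-in for a real language model's forward pass) is swept exhaustively over short sequences to check that no two distinct inputs map to the same hidden state.

```python
# Illustrative sketch only (not the paper's code): a collision test checks
# that no two distinct token sequences produce the same hidden state.
import itertools

def toy_hidden_state(tokens):
    """Stand-in for a real model's final hidden state on a token sequence.
    A position-dependent rolling hash, so token order matters, as in a
    transformer with positional encodings."""
    state = 0
    for pos, tok in enumerate(tokens):
        state = (state * 1_000_003 + (tok + 1) * (pos + 7)) % (2**61 - 1)
    return state

def find_collisions(sequences):
    """Return pairs of distinct sequences whose hidden states coincide."""
    seen, collisions = {}, []
    for seq in sequences:
        h = toy_hidden_state(seq)
        if h in seen and seen[h] != seq:
            collisions.append((seen[h], seq))
        else:
            seen[h] = seq
    return collisions

# Exhaustively test every length-3 sequence over a 50-token vocabulary.
all_seqs = [tuple(s) for s in itertools.product(range(50), repeat=3)]
print(len(all_seqs), "sequences,", len(find_collisions(all_seqs)), "collisions")
# -> 125000 sequences, 0 collisions
```

On real models, the same loop would hash a transformer's activations for sampled prompts rather than call a toy function, and sampling replaces exhaustive enumeration since the input space is astronomically large.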
The algorithm leverages the structure of the model’s internal representations to reconstruct the input sequence step by step, demonstrating practical invertibility even in large-scale models.

These findings have significant implications for the transparency, interpretability, and safety of language models. The ability to invert model representations opens new pathways for debugging, auditing, and understanding model behavior. It also enables new forms of model forensics, such as detecting data leakage or identifying sensitive inputs, and may help build more accountable and controllable AI systems.

In sum, this work establishes injectivity as a fundamental and exploitable property of modern language models, challenging the long-held belief that their internal representations are irreversible. By proving and operationalizing invertibility, the research paves the way for a new class of tools and methods that can read the input directly from a model’s internal state.
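The step-by-step reconstruction can be illustrated with a minimal sketch, again using a toy causal encoder in place of a real transformer (the function names and vocabulary size are invented for illustration). Because the hidden state at position t depends only on tokens up to t, the input can be recovered left to right with one vocabulary sweep per position; a real implementation would compare floating-point activations rather than exact integers.

```python
# Illustrative sketch only (not the paper's implementation): SipIt-style
# inversion recovers the input one token at a time, exploiting causality.
def hidden_states(tokens):
    """Toy causal encoder: states[t] depends only on tokens[:t + 1],
    mirroring a causal transformer's per-position activations."""
    states, acc = [], 0
    for pos, tok in enumerate(tokens):
        acc = (acc * 1_000_003 + (tok + 1) * (pos + 7)) % (2**61 - 1)
        states.append(acc)
    return states

def invert(target_states, vocab_size=50):
    """Recover the token sequence whose activations equal target_states.
    One vocabulary sweep per position, so the number of sweeps grows
    linearly with the input length."""
    recovered = []
    for t, target in enumerate(target_states):
        for cand in range(vocab_size):  # try every candidate token
            if hidden_states(recovered + [cand])[t] == target:
                recovered.append(cand)
                break
        else:
            raise ValueError(f"no token matches position {t}")
    return recovered

original = [3, 41, 7, 19]
print(invert(hidden_states(original)))  # -> [3, 41, 7, 19]
```

Injectivity is what makes the inner loop well-defined: exactly one candidate token can match the target activation at each position, so the greedy sweep never needs to backtrack.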
