3 years ago

Rotary Positional Embeddings as Phase Modulation: Theoretical Bounds on the RoPE Base for Long-Context Transformers

Feilong Liu

Rotary Positional Embeddings (RoPE)


Abstract

Rotary positional embeddings (RoPE) are widely used in large language models to encode token positions through multiplicative rotations, but their behavior at long context lengths remains poorly characterized. In this work, we reinterpret RoPE as phase modulation applied to a bank of complex oscillators, which enables analysis through classical signal processing theory. Under this formulation, we derive rigorous lower bounds on the RoPE base parameter that are necessary to preserve positional coherence over a target context length. These include a fundamental aliasing bound, analogous to the Nyquist limit, and a DC-component stability bound that constrains phase drift in low-frequency positional modes. We then extend the analysis to deep transformers, showing that repeated rotary modulation across layers compounds angular misalignment, tightening the constraint on the base parameter as depth increases. Complementing these results, we derive a precision-dependent upper bound on the RoPE base arising from finite floating-point resolution. Beyond this limit, incremental phase updates become numerically indistinguishable, leading to positional erasure even in the absence of aliasing. Together, the lower and upper bounds define a precision- and depth-dependent feasibility region, a "Goldilocks zone," for long-context transformers.
We validate the framework with a comprehensive case study of state-of-the-art models, including LLaMA, Mistral, and DeepSeek variants, showing that observed successes, failures, and community fixes align closely with the predicted bounds. Notably, models that violate the stability bound exhibit attention collapse and long-range degradation, while attempts to scale beyond one million tokens run into a hard precision wall that is independent of architecture or training. Our analysis establishes RoPE base selection as a fundamental and necessary architectural constraint rather than a tunable hyperparameter, and provides practical guidance for designing, scaling, and retrofitting long-context transformers within realistic numerical limits.

One-sentence Summary

By reinterpreting Rotary Positional Embeddings as phase modulation on complex oscillators through classical signal processing theory, this work derives precision- and depth-dependent lower and upper bounds on the RoPE base parameter that define a feasibility region for long-context transformers, with case studies of LLaMA, Mistral, and DeepSeek variants confirming that base selection operates as a fundamental architectural constraint rather than a tunable hyperparameter.

Key Contributions

  • Reinterprets rotary positional embeddings as phase modulation on complex oscillators, enabling a signal processing analysis that derives principled lower bounds on the base parameter to maintain positional coherence across extended contexts and deep transformer layers.
  • Establishes a precision-dependent upper bound on the base parameter imposed by finite floating-point resolution, defining a depth- and context-aware feasibility region that prevents positional signal erasure during long-sequence processing.
  • Validates the theoretical bounds through case studies of LLaMA, Mistral, and DeepSeek variants, demonstrating that the derived limits accurately predict attention collapse, long-range degradation, and numerical barriers encountered when scaling beyond one million tokens.

Introduction

As large language models scale to hundreds of thousands of tokens, rotary positional embeddings have become the standard for preserving long-range dependencies, yet their reliability at extreme context lengths remains unpredictable. Prior research relies on empirical scaling rules and geometric interpretations that overlook how rotational phase errors compound across transformer layers and degrade under finite floating-point precision. The authors leverage a signal-processing framework to reframe rotary positional embeddings as phase modulation across complex oscillators, deriving explicit stability and precision bounds for the base frequency parameter. This analysis establishes a strict feasibility region that explains long-context failure modes and provides principled, architecture-aware guidance for model design without introducing new heuristic modifications.

Dataset

  • Dataset composition and sources: The available text contains no dataset composition or source details.
  • Key details for each subset: No subset sizes, origins, or filtering rules are included in the provided content.
  • Data usage and processing: The authors do not specify training splits, mixture ratios, or any processing pipeline in the given text.
  • Cropping and metadata: No cropping strategies, metadata construction, or additional processing steps are described.


Method

The authors leverage a signal-processing interpretation of Rotary Positional Embeddings (RoPE), reformulating the mechanism as phase modulation applied to a bank of complex oscillators. This perspective enables a rigorous analysis of positional encoding stability in long-context transformers. The framework begins with the standard RoPE construction, where query and key representations are transformed by applying position-dependent rotations to paired feature dimensions. This rotation is equivalently expressed using complex-valued features, where each 2D vector pair is identified with a complex number $z_i = x_{2i-1} + j x_{2i}$, and the RoPE transformation becomes a simple multiplication: $z_i'(p) = z_i \cdot e^{jp\theta_i}$. This formulation reveals that RoPE performs phase modulation with angular frequency $\theta_i = \text{base}^{-2(i-1)/d}$, where the base parameter controls the geometric spacing of the oscillator frequencies.
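The complex-multiplication form above can be sketched in a few lines of plain Python. This is only an illustration of the stated formulas, not the authors' code: the function names `rope_frequencies` and `rope_rotate` are hypothetical, and the sketch uses 0-based list indexing where the text uses 1-based pairs.

```python
import cmath

def rope_frequencies(d, base=10000.0):
    """Angular frequencies theta_i = base^(-2(i-1)/d) for i = 1..d/2."""
    return [base ** (-2.0 * (i - 1) / d) for i in range(1, d // 2 + 1)]

def rope_rotate(x, p, base=10000.0):
    """Rotate a real feature vector x (even length) for token position p.

    Each consecutive pair (x[2k], x[2k+1]) is treated as one complex
    number z and multiplied by exp(j * p * theta), i.e. phase-modulated.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    out = []
    for k, theta in enumerate(rope_frequencies(d, base)):
        z = complex(x[2 * k], x[2 * k + 1])
        z_rot = z * cmath.exp(1j * p * theta)  # pure phase shift, |z| unchanged
        out.extend([z_rot.real, z_rot.imag])
    return out
```

Because each pair is only rotated, the vector norm is preserved, and the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m - n.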

This oscillator-bank view connects RoPE to classical signal processing concepts. A figure in the paper plots the attention score (cosine similarity) between a token at position 0 and tokens at later positions in a sequence, using a base of 10,000. The curve decays smoothly as a cosine but fails catastrophically at the Nyquist limit, marked by a red dashed line at $2\pi \cdot \text{base} \approx 62{,}832$. Beyond this point the model cannot distinguish position 0 from position 62,832, because the phase of the fundamental oscillator completes a full cycle; the result is a "collision horizon" and the collapse of the global positional grid. The plot visualizes the fundamental aliasing limit derived in the analysis.
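The collision horizon quoted above is simple arithmetic and can be checked directly. The sketch below assumes the slowest oscillator has angular frequency 1/base, which matches the phase increment used later in the text and closely approximates the frequency formula for large feature dimension; the function names are hypothetical.

```python
import math

def collision_horizon(base):
    """Position at which the slowest oscillator completes one full cycle."""
    return 2.0 * math.pi * base

def slow_mode_similarity(p, base):
    """Cosine similarity between positions 0 and p for the slowest mode,
    assuming its angular frequency is 1/base."""
    return math.cos(p / base)

horizon = collision_horizon(10000.0)
print(round(horizon))                          # 62832
print(slow_mode_similarity(horizon, 10000.0))  # back to ~1.0: aliases with position 0
```

At half the horizon the same mode reaches a similarity of about -1, so the cosine decay described above spans one half-cycle before the phase begins to wrap back toward position 0.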

The framework further analyzes the stability of the lowest-frequency (quasi-DC) component, which is critical for preserving long-range alignment. The authors derive a stability bound, showing that to maintain a minimum cosine similarity $\epsilon$ between adjacent rotations of the lowest-frequency mode over a context length $L$, the RoPE base must satisfy $\text{base} \geq L / \arccos(\epsilon)$. This condition ensures that the global positional reference frame does not drift excessively. The analysis is extended to deep transformers, where repeated application of RoPE across layers compounds small angular misalignments. This layer compounding effect tightens the stability requirement, leading to a depth-dependent bound: $\text{base} \geq L / \arccos(\epsilon^{1/N})$, where $N$ is the number of layers. This explains why deeper models require larger bases to maintain long-range coherence.
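Both lower bounds reduce to a single expression, since the single-layer bound is the depth-dependent bound with one layer. The following is a minimal sketch of that formula only; the function name `base_lower_bound` and the input check are assumptions made for illustration.

```python
import math

def base_lower_bound(context_len, eps, n_layers=1):
    """Minimum RoPE base: L / arccos(eps^(1/N)).

    context_len: target context length L
    eps:         minimum acceptable cosine similarity, 0 < eps < 1
    n_layers:    number of layers N (N = 1 gives the single-layer bound)
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    per_layer_eps = eps ** (1.0 / n_layers)  # tighter per-layer tolerance
    return context_len / math.acos(per_layer_eps)

# Depth tightens the requirement for the same context length and tolerance.
shallow = base_lower_bound(32768, 0.9, n_layers=1)
deep = base_lower_bound(32768, 0.9, n_layers=32)
print(shallow < deep)  # True
```

Raising `n_layers` pushes the per-layer tolerance toward 1, which shrinks the arccosine and therefore raises the required base.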

Finally, the framework incorporates numerical precision constraints. In finite-precision arithmetic, the phase increment $\Delta\theta = 1/\text{base}$ must exceed the machine epsilon $\epsilon_{\text{mach}}$ to be distinguishable. This establishes an upper bound on the RoPE base: $\text{base} < 1/\epsilon_{\text{mach}}$. This "Precision Wall" limits the maximum achievable context length, as increasing the base beyond this threshold causes incremental phase updates to become numerically indistinguishable, erasing positional information even in the absence of aliasing. The combination of the depth- and length-dependent lower bound and the hardware-dependent upper bound defines a precision- and depth-dependent feasibility region, or "Goldilocks zone," for RoPE base selection in long-context transformers.
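The precision wall and the resulting feasibility region can be sketched as follows. The machine-epsilon values are the standard ones for each floating-point format; which format the paper evaluates is not stated in this summary, so the table and the helper names `base_upper_bound` and `feasible_interval` are assumptions for illustration.

```python
# Standard machine epsilon (spacing between 1.0 and the next float) per format.
MACHINE_EPS = {
    "bfloat16": 2.0 ** -7,
    "float16": 2.0 ** -10,
    "float32": 2.0 ** -23,
    "float64": 2.0 ** -52,
}

def base_upper_bound(dtype):
    """Precision wall: the base must stay below 1 / eps_mach."""
    return 1.0 / MACHINE_EPS[dtype]

def feasible_interval(lower_bound, dtype):
    """Return (lower, upper) if a feasible base exists, else None.

    lower_bound is the stability bound computed separately from the
    context length, tolerance, and depth.
    """
    upper = base_upper_bound(dtype)
    return (lower_bound, upper) if lower_bound < upper else None

print(base_upper_bound("float32"))        # 8388608.0
print(feasible_interval(1e5, "float16"))  # None: lower bound exceeds the wall
```

When the stability bound exceeds the precision wall the interval is empty, which corresponds to the infeasible case: no choice of base satisfies both constraints at that precision.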

Experiment

The evaluation compares the deployed RoPE configurations of widely used state-of-the-art transformer models against established empirical heuristics and newly derived theoretical stability bounds. By classifying each architecture based on whether its RoPE base falls within a theoretically defined safe operating range, the analysis reveals that several long-context models systematically degrade due to configurations that violate these fundamental constraints. Ultimately, the study demonstrates that the proposed theoretical bounds successfully diagnose real-world architectural limitations and explain performance failures that existing empirical rules cannot account for.

The authors compare the RoPE base of several state-of-the-art transformer models against the theoretical stability bounds derived from coherence and numerical precision constraints. Models whose base falls outside the bounds are classified as unstable, while models whose base lies within the derived range are marked as very stable and exhibit stable behavior. A hypothetical target model is deemed infeasible because its RoPE base exceeds practical limits for stability.

The experimental setup evaluates state-of-the-art transformer models by comparing their RoPE base configurations against theoretical stability bounds derived from coherence and numerical precision constraints. This analysis validates whether specific embedding settings maintain numerical stability or risk operational failure. Results qualitatively categorize the models as stable, unstable, or infeasible based on their alignment with these limits, demonstrating that proper RoPE base selection is essential for reliable deployment since configurations exceeding practical thresholds inevitably compromise stability.

