Rotary Positional Embeddings as Phase Modulation: Theoretical Bounds on the RoPE Base for Long-Context Transformers
Feilong Liu
Rotary Positional Embeddings (RoPE)
Abstract
Rotary Positional Embeddings (RoPE) are widely used in large language models to encode token positions through multiplicative rotations, yet their behavior at long context lengths remains poorly characterized. In this work, we reinterpret RoPE as phase modulation applied to a bank of complex oscillators, which enables analysis through classical signal processing theory. Under this formulation, we derive principled lower bounds on the RoPE base parameter that are necessary to preserve positional coherence over a target context length. These include a fundamental aliasing bound, analogous to the Nyquist limit, and a DC-component stability bound that constrains phase drift in low-frequency positional modes. We further extend the analysis to deep transformers, showing that repeated rotary modulation across layers compounds angular misalignment and tightens the base requirement as depth increases. Complementing these results, we derive a precision-dependent upper bound on the RoPE base that arises from finite floating-point resolution. Beyond this limit, incremental phase updates become numerically indistinguishable, which erases positional information even in the absence of aliasing. Together, the lower and upper bounds define a precision- and depth-dependent feasibility region, a "Goldilocks zone", for long-context transformers. We validate the framework through a comprehensive case study of state-of-the-art models, including LLaMA, Mistral, and DeepSeek variants, and show that observed successes, failures, and community retrofits align closely with the predicted bounds. Notably, models that violate the stability bound exhibit attention collapse and long-range degradation, while attempts to scale beyond one million tokens encounter a hard precision wall regardless of architecture or training. Our analysis establishes RoPE base selection as a fundamental architectural constraint rather than a tunable hyperparameter, and provides practical guidance for designing, scaling, and retrofitting long-context transformers within realistic numerical limits.
One-sentence Summary
By reinterpreting Rotary Positional Embeddings as phase modulation on complex oscillators through classical signal processing theory, this work derives precision- and depth-dependent lower and upper bounds on the RoPE base parameter that define a feasibility region for long-context transformers, with case studies of LLaMA, Mistral, and DeepSeek variants confirming that base selection operates as a fundamental architectural constraint rather than a tunable hyperparameter.
Key Contributions
- Reinterprets rotary positional embeddings as phase modulation on complex oscillators, enabling a signal processing analysis that derives principled lower bounds on the base parameter to maintain positional coherence across extended contexts and deep transformer layers.
- Establishes a precision-dependent upper bound on the base parameter imposed by finite floating-point resolution, defining a depth- and context-aware feasibility region that prevents positional signal erasure during long-sequence processing.
- Validates the theoretical bounds through case studies of LLaMA, Mistral, and DeepSeek variants, demonstrating that the derived limits accurately predict attention collapse, long-range degradation, and numerical barriers encountered when scaling beyond one million tokens.
Introduction
As large language models scale to hundreds of thousands of tokens, rotary positional embeddings have become the standard for preserving long-range dependencies, yet their reliability at extreme context lengths remains unpredictable. Prior research relies on empirical scaling rules and geometric interpretations that overlook how rotational phase errors compound across transformer layers and degrade under finite floating-point precision. The authors leverage a signal-processing framework to reframe rotary positional embeddings as phase modulation across complex oscillators, deriving explicit stability and precision bounds for the base frequency parameter. This analysis establishes a strict feasibility region that explains long-context failure modes and provides principled, architecture-aware guidance for model design without introducing new heuristic modifications.
Dataset
- Dataset composition and sources: The authors have only provided the paper title and contact information. The input text contains no dataset composition or source details.
- Key details for each subset: No subset sizes, origins, or filtering rules are included in the provided content.
- Data usage and processing: The authors do not specify training splits, mixture ratios, or any processing pipeline in the given text.
- Cropping and metadata: No cropping strategies, metadata construction, or additional processing steps are described.
Method
The authors leverage a signal-processing interpretation of Rotary Positional Embeddings (RoPE), reformulating the mechanism as phase modulation applied to a bank of complex oscillators. This perspective enables a rigorous analysis of positional encoding stability in long-context transformers. The framework begins with the standard RoPE construction, where query and key representations are transformed by applying position-dependent rotations to paired feature dimensions. This rotation is equivalently expressed using complex-valued features, where each 2D vector pair is identified with a complex number $z_i = x_{2i-1} + j\,x_{2i}$, and the RoPE transformation becomes a simple multiplication: $z_i'(p) = z_i \cdot e^{j p \theta_i}$. This formulation reveals that RoPE performs phase modulation with angular frequency $\theta_i = \text{base}^{-2(i-1)/d}$, where the base parameter controls the geometric spacing of the oscillator frequencies.
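The equivalence between the paired real rotation and the complex multiplication can be checked numerically. The following is a minimal sketch, not the authors' code: the function names (`rope_frequencies`, `rope_real`, `rope_complex`) and the flat-list layout of the feature vector are assumptions made for illustration.

```python
import cmath
import math


def rope_frequencies(d, base=10000.0):
    """Angular frequencies theta_i = base^(-2(i-1)/d) for i = 1..d/2."""
    return [base ** (-2.0 * (i - 1) / d) for i in range(1, d // 2 + 1)]


def rope_real(x, p, base=10000.0):
    """Rotate each pair (x_{2i-1}, x_{2i}) by the angle p * theta_i."""
    out = []
    for i, theta in enumerate(rope_frequencies(len(x), base)):
        a, b = x[2 * i], x[2 * i + 1]  # 0-based storage of the i-th pair
        c, s = math.cos(p * theta), math.sin(p * theta)
        out += [a * c - b * s, a * s + b * c]
    return out


def rope_complex(x, p, base=10000.0):
    """Same transform written as z_i * exp(j * p * theta_i)."""
    out = []
    for i, theta in enumerate(rope_frequencies(len(x), base)):
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * p * theta)
        out += [z.real, z.imag]
    return out


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]  # hypothetical feature vector, d = 4
    print(rope_real(x, p=7))
    print(rope_complex(x, p=7))
```

Both functions should agree to floating-point tolerance, which is the sense in which each feature pair behaves as one complex oscillator whose phase advances by $\theta_i$ per position.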
As shown in the figure below, this oscillator-bank view connects RoPE to classical signal processing concepts. The figure illustrates the attention score (cosine similarity) between a token at position 0 and tokens at various positions in a sequence, using a base of 10,000. The plot shows a smooth cosine decay, but a catastrophic failure occurs at the Nyquist limit, marked by a red dashed line at $2\pi \cdot \text{base} \approx 62{,}832$. Beyond this point, the model fundamentally cannot distinguish position 0 from position 62,832, because the phase of the fundamental oscillator completes a full cycle. The result is a "collision horizon" and the collapse of the global positional grid. This visualizes the fundamental aliasing limit derived in the analysis.
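The collision horizon quoted above can be reproduced with a few lines of arithmetic. This is a sketch under one stated assumption: the slowest oscillator is taken to have angular frequency exactly 1/base, which is what the $2\pi \cdot \text{base}$ figure implies. The helper names are hypothetical.

```python
import math


def collision_horizon(base):
    """Offset at which a 1/base oscillator completes one full cycle."""
    return 2.0 * math.pi * base


def slowest_mode_similarity(offset, base):
    """Cosine similarity of the slowest mode between position 0 and `offset`.

    Assumes the slowest mode rotates by 1/base radians per position.
    """
    return math.cos(offset / base)


if __name__ == "__main__":
    base = 10_000.0
    horizon = collision_horizon(base)
    print(round(horizon))
    # Half a cycle in: the slowest mode is maximally misaligned.
    print(slowest_mode_similarity(horizon / 2, base))
    # A full cycle in: the slowest mode matches position 0 again.
    print(slowest_mode_similarity(horizon, base))
```

For a base of 10,000 the horizon rounds to 62,832, and the similarity at that offset returns to 1, which is the aliasing described in the text.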
The framework further analyzes the stability of the lowest-frequency (quasi-DC) component, which is critical for preserving long-range alignment. The authors derive a stability bound, showing that to maintain a minimum cosine similarity $\epsilon$ between adjacent rotations of the lowest-frequency mode over a context length $L$, the RoPE base must satisfy $\text{base} \geq L / \arccos(\epsilon)$. This condition ensures that the global positional reference frame does not drift excessively. The analysis is extended to deep transformers, where repeated application of RoPE across layers compounds small angular misalignments. This layer compounding effect tightens the stability requirement, leading to a depth-dependent bound: $\text{base} \geq L / \arccos(\epsilon^{1/N})$, where $N$ is the number of layers. This explains why deeper models require larger bases to maintain long-range coherence.
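The two lower bounds are simple enough to evaluate directly. The sketch below only transcribes the formulas stated above. The function names and the example values of L, ε, and N are hypothetical and are not taken from the paper's case study.

```python
import math


def stability_base(context_len, eps):
    """Single-layer bound: base >= L / arccos(eps), for 0 < eps < 1."""
    return context_len / math.acos(eps)


def depth_stability_base(context_len, eps, n_layers):
    """Depth-dependent bound: base >= L / arccos(eps ** (1 / N))."""
    return context_len / math.acos(eps ** (1.0 / n_layers))


if __name__ == "__main__":
    L, eps = 131_072, 0.9  # illustrative values only
    print(f"N= 1: base >= {stability_base(L, eps):,.0f}")
    for n in (8, 32, 80):
        print(f"N={n:2d}: base >= {depth_stability_base(L, eps, n):,.0f}")
```

For 0 < ε < 1, the quantity ε^(1/N) moves toward 1 as N grows, so arccos shrinks and the required base rises. That is the layer-compounding effect expressed numerically.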
Finally, the framework incorporates numerical precision constraints. In finite-precision arithmetic, the phase increment $\Delta\theta = 1/\text{base}$ must exceed the machine epsilon $\epsilon_{\text{mach}}$ to be distinguishable. This establishes an upper bound on the RoPE base: $\text{base} < 1/\epsilon_{\text{mach}}$. This "Precision Wall" limits the maximum achievable context length, as increasing the base beyond this threshold causes incremental phase updates to become numerically indistinguishable, erasing positional information even in the absence of aliasing. The combination of the depth- and length-dependent lower bound and the hardware-dependent upper bound defines a precision- and depth-dependent feasibility region, or "Goldilocks zone," for RoPE base selection in long-context transformers.
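Combining the two sides gives a quick feasibility check. The sketch below makes several assumptions. Machine epsilon is taken as 2^(-m) for a format with m explicit mantissa bits. The lower bound is the depth-dependent formula from the stability analysis. The format that actually governs the phase computation in a given model is not stated in the text, so the `dtype` argument is illustrative. All names are hypothetical.

```python
import math

# Explicit mantissa bits per IEEE-style format; machine epsilon is 2**-bits.
MANTISSA_BITS = {"bfloat16": 7, "float16": 10, "float32": 23, "float64": 52}


def machine_epsilon(dtype):
    """Spacing between 1.0 and the next representable value."""
    return 2.0 ** -MANTISSA_BITS[dtype]


def precision_wall(dtype):
    """Upper bound on the base: base < 1 / eps_mach."""
    return 1.0 / machine_epsilon(dtype)


def feasible_range(context_len, eps, n_layers, dtype):
    """Return (lower, upper) for the base, or None if the zone is empty."""
    lower = context_len / math.acos(eps ** (1.0 / n_layers))
    upper = precision_wall(dtype)
    return (lower, upper) if lower < upper else None


if __name__ == "__main__":
    for dtype in MANTISSA_BITS:
        # L, eps, and N here are illustrative values only.
        print(dtype, feasible_range(4096, 0.9, 32, dtype))
```

A return value of `None` means the lower and upper bounds cross for that combination of length, depth, and precision, so under these assumptions no base satisfies both constraints.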
Experiment
The evaluation compares the deployed RoPE configurations of widely used state-of-the-art transformer models against established empirical heuristics and newly derived theoretical stability bounds. By classifying each architecture based on whether its RoPE base falls within a theoretically defined safe operating range, the analysis reveals that several long-context models systematically degrade due to configurations that violate these fundamental constraints. Ultimately, the study demonstrates that the proposed theoretical bounds successfully diagnose real-world architectural limitations and explain performance failures that existing empirical rules cannot account for.
Each model is assigned a label according to where its RoPE base sits relative to the bounds derived from the coherence and numerical precision constraints. Models whose base lies within the derived range are marked very stable, and models whose base falls outside it are classified as unstable. A hypothetical target model is deemed infeasible because its RoPE base exceeds practical limits for stability.
The categorization is qualitative. It checks whether a given embedding setting maintains numerical stability or risks operational failure, and it indicates that RoPE base selection is essential for reliable deployment, since configurations that exceed the practical thresholds compromise stability.