3 years ago

Feilong Liu

Rotary Positional Embeddings (RoPE)

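Abstract

Rotary Positional Embeddings (RoPE) are widely used in large language models to encode token positions through multiplicative rotations, yet their behavior at long context lengths remains poorly understood. In this work, we reinterpret RoPE as phase modulation applied to a bank of complex oscillators, which enables analysis with classical signal processing theory. Under this formulation, we derive principled lower bounds on the RoPE base parameter that are necessary to preserve positional coherence over a target context length. These include a fundamental aliasing bound, analogous to a Nyquist limit, and a DC-component stability bound that constrains phase drift in low-frequency positional modes. We further extend the analysis to deep transformers, showing that repeated rotary modulation across layers compounds angular misalignment, so the base requirement tightens as depth increases. Complementing these results, we derive a precision-dependent upper bound on the RoPE base arising from finite floating-point resolution. Beyond this limit, incremental phase updates become numerically indistinguishable, erasing positional information even in the absence of aliasing. Together, the lower and upper bounds define a precision- and depth-dependent feasibility region, a "Goldilocks zone," for long-context transformers. We validate the framework through a comprehensive case study of state-of-the-art models, including LLaMA, Mistral, and DeepSeek variants, and show that observed successes, failures, and community retrofits align closely with the predicted bounds. Notably, models that violate the stability bound exhibit attention collapse and degraded long-range dependencies, while attempts to scale beyond one million tokens run into a hard precision wall that is independent of architecture and training. Our analysis establishes RoPE base selection as a fundamental architectural constraint rather than a tunable hyperparameter, and provides practical guidance for designing, scaling, and retrofitting long-context transformers under realistic numerical limits.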

One-sentence Summary

By reinterpreting Rotary Positional Embeddings as phase modulation on complex oscillators through classical signal processing theory, this work derives precision- and depth-dependent lower and upper bounds on the RoPE base parameter that define a feasibility region for long-context transformers, with case studies of LLaMA, Mistral, and DeepSeek variants confirming that base selection operates as a fundamental architectural constraint rather than a tunable hyperparameter.

Key Contributions

Reinterprets rotary positional embeddings as phase modulation on complex oscillators, enabling a signal processing analysis that derives principled lower bounds on the base parameter to maintain positional coherence across extended contexts and deep transformer layers.
Establishes a precision-dependent upper bound on the base parameter imposed by finite floating-point resolution, defining a depth- and context-aware feasibility region that prevents positional signal erasure during long-sequence processing.
Validates the theoretical bounds through case studies of LLaMA, Mistral, and DeepSeek variants, demonstrating that the derived limits accurately predict attention collapse, long-range degradation, and numerical barriers encountered when scaling beyond one million tokens.

Introduction

As large language models scale to hundreds of thousands of tokens, rotary positional embeddings have become the standard for preserving long-range dependencies, yet their reliability at extreme context lengths remains unpredictable. Prior research relies on empirical scaling rules and geometric interpretations that overlook how rotational phase errors compound across transformer layers and degrade under finite floating-point precision. The authors leverage a signal-processing framework to reframe rotary positional embeddings as phase modulation across complex oscillators, deriving explicit stability and precision bounds for the base frequency parameter. This analysis establishes a strict feasibility region that explains long-context failure modes and provides principled, architecture-aware guidance for model design without introducing new heuristic modifications.

Dataset

The provided material contains only the paper title and contact information. It gives no dataset composition or sources, no subset sizes, origins, or filtering rules, no training splits, mixture ratios, or processing pipeline, and no cropping strategies or metadata construction.

Method

The authors leverage a signal-processing interpretation of Rotary Positional Embeddings (RoPE), reformulating the mechanism as phase modulation applied to a bank of complex oscillators. This perspective enables a rigorous analysis of positional encoding stability in long-context transformers. The framework begins with the standard RoPE construction, where query and key representations are transformed by applying position-dependent rotations to paired feature dimensions. This rotation is equivalently expressed using complex-valued features, where each 2D vector pair is identified with a complex number $z_i = x_{2i-1} + j x_{2i}$, and the RoPE transformation becomes a simple multiplication: $z_i'(p) = z_i \cdot e^{jp\theta_i}$. This formulation reveals that RoPE performs phase modulation with angular frequency $\theta_i = \text{base}^{-2(i-1)/d}$, where the base parameter controls the geometric spacing of the oscillator frequencies.
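The equivalence between the pairwise rotation and the complex multiplication can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the toy vector are illustrative assumptions, not code from the paper.

```python
import cmath
import math


def rope_frequencies(d, base=10000.0):
    """Angular frequencies theta_i = base^(-2(i-1)/d) for i = 1 .. d/2."""
    return [base ** (-2.0 * (i - 1) / d) for i in range(1, d // 2 + 1)]


def rope_real(x, p, base=10000.0):
    """Rotate each feature pair (x_{2i-1}, x_{2i}) by the angle p * theta_i."""
    out = []
    for i, theta in enumerate(rope_frequencies(len(x), base)):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(p * theta), math.sin(p * theta)
        out.extend([a * c - b * s, a * s + b * c])
    return out


def rope_complex(x, p, base=10000.0):
    """The same transform written as z_i * exp(j * p * theta_i)."""
    out = []
    for i, theta in enumerate(rope_frequencies(len(x), base)):
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * p * theta)
        out.extend([z.real, z.imag])
    return out


x = [1.0, 2.0, 3.0, 4.0]  # toy feature vector with d = 4 (illustrative values)
rotated_real = rope_real(x, p=7)
rotated_complex = rope_complex(x, p=7)
```

Both functions return the same vector up to rounding, and the rotation leaves the vector norm unchanged, which is why the transform can be read as pure phase modulation.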

The paper illustrates this oscillator-bank view, which connects RoPE to classical signal processing concepts, with a figure (not reproduced here). The figure plots the attention score (cosine similarity) between a token at position 0 and tokens at later positions in a sequence, using a base of 10,000. The curve decays smoothly as a cosine, but fails catastrophically at the Nyquist limit, marked by a red dashed line at $2\pi \cdot \text{base} \approx 62{,}832$. Beyond this point the model cannot distinguish position 0 from position 62,832, because the phase of the fundamental oscillator has completed a full cycle. The result is a "collision horizon" and the collapse of the global positional grid. The figure thus visualizes the fundamental aliasing limit derived in the analysis.
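The collision horizon can be reproduced with a few lines of arithmetic. This sketch assumes the slowest oscillator has angular frequency $1/\text{base}$, which matches the quoted horizon of $2\pi \cdot \text{base}$; the function names are hypothetical.

```python
import math


def collision_horizon(base):
    """Position at which the slowest oscillator completes one full cycle."""
    return 2.0 * math.pi * base


def slow_mode_similarity(p, base):
    """Cosine similarity between positions 0 and p for the slowest oscillator,
    under the assumption theta_min = 1 / base."""
    return math.cos(p / base)


base = 10000.0
horizon = collision_horizon(base)
for p in (0, 10000, 31416, 62832):
    print(p, round(slow_mode_similarity(p, base), 4))
```

With a base of 10,000 the horizon rounds to 62,832, and the similarity at that position returns to the value it has at position 0, so the two positions are indistinguishable to this oscillator.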

The framework further analyzes the stability of the lowest-frequency (quasi-DC) component, which is critical for preserving long-range alignment. The authors derive a stability bound, showing that to maintain a minimum cosine similarity $\epsilon$ between adjacent rotations of the lowest-frequency mode over a context length $L$, the RoPE base must satisfy $\text{base} \geq L / \arccos(\epsilon)$. This condition ensures that the global positional reference frame does not drift excessively. The analysis is extended to deep transformers, where repeated application of RoPE across layers compounds small angular misalignments. This layer compounding effect tightens the stability requirement, leading to a depth-dependent bound: $\text{base} \geq L / \arccos(\epsilon^{1/N})$, where $N$ is the number of layers. This explains why deeper models require larger bases to maintain long-range coherence.
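Both lower bounds are simple to evaluate. The sketch below reads them as $\cos(L/\text{base}) \geq \epsilon$ for a single layer and $\cos(L/\text{base})^N \geq \epsilon$ for $N$ layers, an interpretation that is consistent with the stated formulas but is not spelled out in the text above. The context length, similarity floor, and layer count are hypothetical inputs.

```python
import math


def min_base(context_len, eps, n_layers=1):
    """Lower bound base >= L / arccos(eps^(1/N)); N = 1 gives the single-layer bound."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return context_len / math.acos(eps ** (1.0 / n_layers))


L, eps = 131072, 0.9                  # hypothetical target length and similarity floor
single = min_base(L, eps)             # single-layer bound
deep = min_base(L, eps, n_layers=32)  # hypothetical 32-layer model
```

Because $\epsilon^{1/N}$ approaches 1 as $N$ grows, the arccosine in the denominator shrinks and the required base increases with depth.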

Finally, the framework incorporates numerical precision constraints. In finite-precision arithmetic, the phase increment $\Delta\theta = 1/\text{base}$ must exceed the machine epsilon $\epsilon_{\text{mach}}$ to be distinguishable. This establishes an upper bound on the RoPE base: $\text{base} < 1 / \epsilon_{\text{mach}}$. This "Precision Wall" limits the maximum achievable context length, as increasing the base beyond this threshold causes incremental phase updates to become numerically indistinguishable, erasing positional information even in the absence of aliasing. The combination of the depth- and length-dependent lower bound and the hardware-dependent upper bound defines a precision- and depth-dependent feasibility region, or "Goldilocks zone," for RoPE base selection in long-context transformers.
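Taking the stated bound literally, the precision wall follows directly from the machine epsilon of each floating-point format. The sketch below uses the standard epsilon values of the common formats; which format applies depends on where the rotation angles are actually computed, which the text above does not specify, and the helper names are hypothetical.

```python
import sys

# Machine epsilon (gap between 1.0 and the next representable value) per format.
MACHINE_EPS = {
    "bfloat16": 2.0 ** -7,
    "float16": 2.0 ** -10,
    "float32": 2.0 ** -23,
    "float64": 2.0 ** -52,
}


def max_base(dtype):
    """Precision wall: the bound base < 1 / eps_mach for the given format."""
    return 1.0 / MACHINE_EPS[dtype]


def step_resolvable(base, dtype):
    """True if the phase increment 1 / base exceeds machine epsilon."""
    return 1.0 / base > MACHINE_EPS[dtype]


# Python floats are float64, so the absorption effect can be shown directly:
# an increment of half the machine epsilon is lost when added to 1.0.
absorbed = (1.0 + sys.float_info.epsilon / 2) == 1.0
```

The `absorbed` flag illustrates the mechanism behind the wall: once an increment drops below the resolution of the format, adding it leaves the stored value unchanged.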

Experiment

The evaluation compares the deployed RoPE configurations of widely used state-of-the-art transformer models against established empirical heuristics and newly derived theoretical stability bounds. By classifying each architecture based on whether its RoPE base falls within a theoretically defined safe operating range, the analysis reveals that several long-context models systematically degrade due to configurations that violate these fundamental constraints. Ultimately, the study demonstrates that the proposed theoretical bounds successfully diagnose real-world architectural limitations and explain performance failures that existing empirical rules cannot account for.

The authors compare the RoPE base of several state-of-the-art transformer models against the theoretical stability bounds derived from the coherence and numerical precision constraints. Models whose base falls outside these bounds are classified as unstable, while models whose base lies within the derived range are marked as very stable. A hypothetical target model is deemed infeasible, suggesting that its RoPE base would exceed the practical limits for stability.

The classification is qualitative: each model is labeled stable, unstable, or infeasible according to where its configuration sits relative to the bounds. The results indicate that RoPE base selection is essential for reliable deployment, since configurations that exceed the practical thresholds compromise stability.
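A classification of this kind can be sketched by combining the lower and upper bounds from the Method section. The labeling rule, the default similarity floor of 0.9, and the use of the float32 machine epsilon are assumptions made for illustration; the example configurations in the usage note are hypothetical and do not correspond to any model in the paper.

```python
import math

FLOAT32_EPS = 2.0 ** -23  # machine epsilon of IEEE 754 single precision


def classify(base, context_len, n_layers, eps=0.9, eps_mach=FLOAT32_EPS):
    """Label a configuration against the lower and upper bounds on the base."""
    lower = context_len / math.acos(eps ** (1.0 / n_layers))
    upper = 1.0 / eps_mach
    if lower >= upper:
        return "infeasible"  # no base satisfies both bounds
    if lower <= base < upper:
        return "stable"
    return "unstable"
```

Under these assumptions, `classify(1e4, 4096, 1)` returns `"stable"`, `classify(1e4, 131072, 1)` returns `"unstable"` because the base is below the lower bound, and `classify(1e4, 10_000_000, 1)` returns `"infeasible"` because the lower bound already exceeds the precision wall.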

Source paper: Rotary Positional Embeddings as Phase Modulation: Theoretical Limits on the RoPE Base for Long-Context Transformers
