HyperAIHyperAI

Command Palette

Search for a command to run...

Contrastive Pair Searchによる標的ニューロン修飾

Sam Herring Jake Naviasky Karan Malhotra

概要

言語モデルは有害なリクエストを拒否するように指示微調整(instruction-tuning)が施されているものの、この動作の基盤となるメカニズムについては依然として不明な点が多い。従来の代表的な制御(steering)手法は残差ストリーム(residual stream)上で動作し、介入強度が高いと出力の整合性が損なわれるため、実用的な応用には限界があった。本研究では、対照的ニューロン寄与度(contrastive neuron attribution: CNA)を提案する。CNAは、有害なプロンプトと良性のプロンプトの活性化を最も明確に区別するMLP(Multi-Layer Perceptron)ニューロンのうち0.1%を特定する。この手法は、勾配計算や補助的な学習を必要とせず、順方向パス(forward pass)のみで実行可能である。指示学習済みモデルにおいて、CNAにより発見された回路を遮断(ablation)すると、標準的な Jailbreak ベンチマークにおいて拒否率が50%以上低下する一方、すべての介入強度において流暢さと劣化なし(non-degeneracy)が維持される。LlamaおよびQwenアーキテクチャ(パラメータ数1B〜72B)に基づくマッチングされたベースモデルと指示学習済みモデルにCNAを適用した結果、ベースモデルにも同様の後層における識別構造が存在することが明らかとなった。しかし、これらのニューロンを制御しても、生成内容の変更は生じるものの、行動の変化は起こらない。これらの結果は、ニューロンレベルでの介入が、残差ストリームに基づく手法が抱える品質とのトレードオフ 없이、信頼性の高い行動制御を可能であることを示している。より一般的に、本研究の知見は、アライメント(alignment)ファインチューニングが、既存の識別構造を変換し、スパースで標的可能な拒否ゲートへと再構成することを示唆している。

One-sentence Summary

The authors introduce Contrastive Neuron Attribution to identify the 0.1% of MLP neurons distinguishing harmful from benign prompts across Llama and Qwen architectures ranging from 1B to 72B parameters using only forward passes without gradients or auxiliary training, demonstrating that ablating this circuit in instruct models reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and avoiding the coherence degradation of residual-stream methods, thereby suggesting that alignment fine-tuning transforms pre-existing discrimination structure into a sparse, targetable refusal gate.

Key Contributions

  • The paper introduces contrastive neuron attribution (CNA), a technique that identifies the 0.1% of MLP neurons distinguishing harmful from benign prompts using only forward passes without gradients or auxiliary training. This method operates at the individual-neuron level rather than the residual stream to avoid the quality tradeoffs associated with residual-stream methods.
  • Ablating the discovered circuit reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and non-degeneracy across all steering strengths. These results demonstrate that neuron-level intervention enables reliable behavioral steering without the coherence degradation seen in residual-stream methods.
  • Applying CNA across Llama and Qwen architectures reveals that base models contain similar late-layer discrimination structures that produce content shifts rather than behavioral changes when steered. These findings suggest that alignment fine-tuning transforms pre-existing discrimination structure into a sparse, targetable refusal gate.

Introduction

Modern language models rely on fine-tuning to refuse harmful requests, yet the mechanistic origin of this safety behavior remains unclear. Prior representation engineering methods steer behavior by modifying the entire residual stream, which is too coarse to isolate specific drivers, while sparse autoencoders require expensive training and struggle with noise. To address this, the authors introduce Contrastive Neuron Attribution, a technique that identifies a sparse subset of individual MLP neurons responsible for distinguishing harmful from benign prompts. Their experiments show that ablating just 0.1% of these neurons reduces refusal rates by over 50% across various model sizes without compromising output quality, demonstrating that safety circuits crystallize specifically during alignment fine-tuning.

Method

The authors leverage a method termed Contrastive Neuron Attribution (CNA) to identify specific behavioral circuits within language models. This approach focuses on isolating a sparse subset of neurons responsible for distinguishing between harmful and benign prompts without requiring gradient computations or auxiliary training. The overall framework operates through a process called contrastive discovery followed by a filtering stage to ensure robustness.

In the contrastive discovery phase, the method defines two distinct sets of prompts for a given task. One set consists of positive prompts that exhibit the target property, while the other comprises negative prompts that do not. The model processes all prompts through a forward pass, and the system records the MLP activations at the last token position. Specifically, the down projection of the MLP activations is captured for each task using forward pre-hooks. For a neuron jjj in layer \ell, the activation on prompt xxx is denoted as aj(x)a_{j}^{\ell}(x)aj(x). The core calculation involves determining the mean contrastive difference between the positive and negative sets.

δj=1P+xP+aj(x)  1PxPaj(x)\delta _ { j } ^ { \ell } = \frac { 1 } { | \mathcal { P } ^ { + } | } \sum _ { x \in \mathcal { P } ^ { + } } a _ { j } ^ { \ell } ( x ) \ - \ \frac { 1 } { | \mathcal { P } ^ { - } | } \sum _ { x \in \mathcal { P } ^ { - } } a _ { j } ^ { \ell } ( x )δj=P+1xP+aj(x)  P1xPaj(x)

This metric quantifies how much a specific neuron activates differently depending on the prompt type. The authors then select the circuit Ck\mathcal{C}_kCk by taking the top kkk neurons with the highest absolute difference values across all layers. The value of kkk is set to 0.1% of the total MLP activations, a threshold found to reliably produce steering effects across various model sizes. This selection process interprets contrastive attribution at the neuron level rather than the residual stream level, relying solely on forward pass comparisons.

To refine the discovered circuit, the method incorporates a universal neuron filtering step. Some neurons tend to fire regardless of the specific prompt content, which could introduce noise into the steering mechanism. The system detects these by running diverse prompts and flagging any neuron that appears in the top 0.1% of MLP activations for at least 80% of the prompts. These universal neurons are excluded from all discovered neuron subsets to ensure the identified circuit specifically relates to the target behavior.

Experiment

This study evaluates neuron-level ablation across various Llama and Qwen architectures to verify causal links between specific activations and refusal behaviors. Experiments demonstrate that targeting a sparse subset of MLP activations effectively reduces refusal rates while maintaining generation coherence, whereas residual-stream steering methods degrade output quality at high intervention strengths. Additionally, comparisons between base and instruct models reveal that alignment fine-tuning transforms pre-existing late-layer discrimination structures into functional safety gates without changing the underlying network architecture.

The authors compare neuron-level ablation against residual-stream steering methods to evaluate their impact on refusal behavior and output quality. The data demonstrates that the ablation method consistently lowers refusal rates while maintaining near-baseline generation quality across all model sizes. Conversely, the residual-stream method often causes significant quality degradation, leading to repetitive or incoherent responses at maximum intervention levels. Neuron-level ablation effectively reduces refusal rates while preserving high generation quality. Residual-stream steering methods often degrade output coherence and cause repetitive text at high strengths. The ablation technique remains stable across different model architectures and parameter scales.

The the the table compares the performance of CNA and CAA intervention methods across various Llama and Qwen models, focusing on refusal rates and output quality. Results show that while both methods reduce baseline refusal rates, CNA generally maintains higher output quality compared to CAA. This trend suggests that CNA is more effective at steering model behavior without degrading generation coherence. CNA generally yields higher quality scores than CAA across the majority of tested models. Both intervention methods successfully lower refusal rates compared to baseline performance. The quality advantage of CNA is particularly evident in larger parameter models.

The the the table analyzes the overlap of top neurons between base and instruct model variants for tasks involving refusal, capitalization, and subject-verb agreement. Results indicate that instruction tuning replaces the majority of specific neurons identified in base models, with only a small portion of the circuitry remaining consistent. Overlap between base and instruct neuron circuits is consistently low across all tasks and model architectures. The Qwen model demonstrates a higher average overlap of neurons compared to the Llama model. Tasks related to capitalization show greater neuron retention from the base model than refusal tasks in both architectures.

The experiment analyzes the spatial distribution of discrimination circuits within Llama-1B and Qwen-3B models across refusal, factual capitalization, and subject-verb agreement tasks. Data indicates that these functional circuits are heavily localized in the late layers of the network architecture. This concentration is particularly pronounced within the final quarter of the layers for both model families. Discrimination circuits for behavioral and factual tasks consistently localize in the late layers of the network. The final quarter of layers contains the vast majority of top discrimination neurons across all tested tasks and models. Llama models exhibit a higher density of these circuits in the final three layers compared to Qwen models.

The the the table compares the performance of Base and Instruct variants of Llama-3.2-1B and Qwen2.5-3B models across Refusal, Capitals, and Subject-Verb Agreement (SVA) tasks. Results indicate that fine-tuning (Instruct) leads to higher refusal rates and average performance for Llama-3.2-1B, whereas Qwen2.5-3B shows lower refusal rates and average performance in the Instruct variant. Task-specific capabilities like Capitals and SVA exhibit mixed improvements or declines depending on the model architecture. Llama-3.2-1B Instruct model achieves higher refusal rates and average scores compared to its Base variant. Qwen2.5-3B Instruct model demonstrates lower refusal rates and average scores compared to its Base variant. Performance on general capabilities such as Capitals and SVA varies between Base and Instruct models across the two architectures.

The experiments compare neuron-level ablation against residual-stream steering, finding that ablation consistently lowers refusal rates while maintaining high generation quality across model sizes. Analysis reveals that discrimination circuits concentrate in the late network layers and instruction tuning largely replaces base model neurons, though retention varies by task. Additionally, comparisons between Base and Instruct variants demonstrate architecture-specific variations in refusal behavior and general task performance.


AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助
すぐに使える GPU
最適な料金体系

HyperAI Newsletters

最新情報を購読する
北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします
メール配信サービスは MailChimp によって提供されています