HyperAI


Securing LLMs with Internal Representations: Detecting Harmful Content

Difan Jiao, Yilun Liu, Ye Yuan, Zhenwei Tang, Linfeng Du, Haolun Wu, Ashton Anderson

Abstract

Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations, overlooking the rich safety-relevant features distributed across internal layers. This work proposes SIREN, a lightweight guard model that exploits these internal features. SIREN identifies safety-relevant neurons through linear probing and combines them with an adaptive layer-weighted strategy, building a harmfulness detector from internal states without modifying the original LLM itself. Comprehensive evaluations show that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250× fewer trainable parameters. SIREN also generalizes well to unseen benchmarks, naturally supports real-time streaming detection, and offers markedly better inference efficiency than generative guard models. Overall, these results indicate that LLM internal states are a promising foundation for practical, high-performance harmfulness detection.

One-sentence Summary

By identifying safety neurons through linear probing and combining them via an adaptive layer-weighted strategy, the lightweight guard model SIREN leverages internal LLM representations to outperform state-of-the-art open-source models across multiple benchmarks while using 250× fewer trainable parameters and enabling superior generalization and real-time streaming detection.

Key Contributions

  • The paper introduces SIREN, a lightweight and plug-and-play guard model that detects harmfulness by leveraging internal neuron representations of LLMs instead of relying solely on terminal-layer outputs.
  • The method identifies safety-relevant neurons through L1-regularized linear probing and aggregates these features across multiple layers using an adaptive, performance-weighted combination strategy.
  • Experimental results demonstrate that SIREN outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters and providing superior generalization to unseen datasets and real-time streaming detection.

Introduction

As large language models (LLMs) scale, implementing robust guardrails to detect harmful user prompts and model responses has become essential for safe deployment. Current state-of-the-art guard models typically treat safety detection as a generative task, relying solely on terminal-layer representations. This approach overlooks the rich, safety-relevant features distributed across the internal layers of the model and incurs high computational costs due to autoregressive token generation. The authors leverage these internal representations to introduce SIREN, a lightweight, plug-and-play framework that identifies safety neurons via linear probing and aggregates them using an adaptive layer-weighted strategy. SIREN outperforms existing generative guard models across multiple benchmarks while using 250 times fewer trainable parameters and providing superior inference efficiency.

Method

The authors leverage a two-stage framework for content safety classification that operates entirely on the internal representations of a transformer-based language model (LLM), without modifying its weights. This approach, termed SIREN, is designed to identify and aggregate safety-relevant neurons across layers to construct a robust feature representation for harmfulness detection. The overall architecture consists of two primary stages: safety neuron identification and adaptive neuron aggregation.

Refer to the framework diagram. In the first stage, the internal representations of each layer are extracted from the LLM for a given input sequence $s$ of length $T$. These representations, denoted as $\mathbf{x}_l = \mathrm{LLM}_l(s) \in \mathbb{R}^{T \times D}$ for layer $l$, are derived from either residual streams or feedforward network activations. To capture the overall semantic content of the sentence, a mean pooling operation is applied to the token-level representations, resulting in a pooled representation $\mathbf{x}_l^* \in \mathbb{R}^D$. A layer-wise linear probe is then trained on these pooled representations using a classification task with ground-truth harmfulness labels $y$. The objective is to minimize the cross-entropy loss with L1 regularization on the probe weights $\mathbf{W}_l$, which is justified by the linear representation hypothesis that semantic concepts are often linearly encoded in LLMs. The magnitude of the trained weight $w_{l,j}$ for each neuron $j$ in layer $l$ is used to determine its relevance to harmfulness detection. These weights are normalized, and the top-ranked neurons whose cumulative normalized magnitude exceeds a threshold $\eta$ are selected as the safety neurons for that layer, forming the set $\mathcal{S}_l$.
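The cumulative-magnitude selection step can be sketched as follows. This is a minimal illustration assuming the per-layer probe has already been trained; the function name `select_safety_neurons` and the toy weights are our own, not from the paper.

```python
import numpy as np

def select_safety_neurons(w, eta=0.9):
    """Select neurons whose cumulative normalized |weight| mass exceeds eta.

    w   : (D,) trained linear-probe weights for one layer
    eta : cumulative-magnitude threshold in (0, 1]
    Returns indices of the selected safety neurons, ranked by |weight|.
    """
    mags = np.abs(w)
    order = np.argsort(mags)[::-1]           # rank neurons by magnitude, descending
    norm = mags[order] / mags.sum()          # normalized magnitudes
    cum = np.cumsum(norm)
    k = int(np.searchsorted(cum, eta)) + 1   # smallest prefix covering eta of the mass
    return order[:k]

# Toy example: two dominant neurons plus noise
w = np.array([0.05, 2.0, 0.1, 0.05, 1.0])
print(select_safety_neurons(w, eta=0.9))  # -> [1 4]
```

With $\eta = 0.9$, only the two neurons carrying 90% of the normalized weight mass survive, which is how sparsity is obtained without retraining the LLM.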

In the second stage, the framework aggregates the identified safety neurons across all layers to form a more comprehensive feature representation. The authors note that LLMs exhibit a hierarchical learning structure where representations evolve from low-level patterns to high-level semantics, motivating the aggregation of safety-relevant features from multiple layers. To account for the varying contribution of each layer to the task, an adaptive weighting strategy is introduced. The weight $\alpha_l$ for layer $l$ is computed based on its validation F1 score $f_l$ from the linear probe, normalized between the maximum and minimum F1 scores across all layers. This assigns higher weights to more informative layers. The safety neuron activations from each layer are extracted, weighted by their respective $\alpha_l$, and concatenated to form the final feature vector $z$. This aggregated feature $z$ is then fed into a multi-layer perceptron (MLP) classifier for harmfulness prediction. The MLP learns to combine the complementary signals from the cross-layer features, and the $\alpha_l$ values serve as a prior on layer importance rather than a final feature weighting.
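The adaptive weighting and cross-layer concatenation can be sketched as below. The names `layer_weights` and `aggregate` are illustrative, and the sketch assumes the pooled per-layer representations, selected neuron indices, and validation F1 scores are already available.

```python
import numpy as np

def layer_weights(f1_scores):
    """Min-max normalize per-layer validation F1 scores into weights alpha_l."""
    f1 = np.asarray(f1_scores, dtype=float)
    lo, hi = f1.min(), f1.max()
    return (f1 - lo) / (hi - lo + 1e-8)     # best layer -> ~1, worst -> 0

def aggregate(pooled, safety_neurons, alphas):
    """Concatenate alpha-weighted safety-neuron activations across layers.

    pooled         : list of (D,) pooled representations, one per layer
    safety_neurons : list of index arrays S_l from the probing stage
    alphas         : per-layer weights from layer_weights()
    """
    parts = [a * x[idx] for x, idx, a in zip(pooled, safety_neurons, alphas)]
    return np.concatenate(parts)            # feature vector z fed to the MLP

alphas = layer_weights([0.60, 0.80, 0.70])
print(np.round(alphas, 2))                  # layer with the best F1 gets weight ~1
```

The resulting vector $z$ varies in length with the number of selected neurons per layer; the downstream MLP is sized accordingly.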

The framework is designed to be plug-and-play, operating on top of any transformer-based LLM without requiring architectural changes. It is also transferable to token-level attribution tasks. By removing the mean pooling operation, the same safety neurons and classifier can be applied to each token's hidden representation, directly producing per-token harmfulness scores. This capability is demonstrated in the visualization of token-level streaming detection results, where the model processes a sequence incrementally, extracting features for each prefix and applying the classifier to generate a continuous harmfulness score at every token position. This streaming evaluation is achieved by re-evaluating the feature extractor on prefix-restricted internal states, enabling a zero-shot assessment of how safety information manifests in early generation stages.
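A minimal sketch of this prefix-based streaming scoring, under our own assumptions: mean pooling over each prefix, a pretrained classifier passed in as a callable, and illustrative names throughout.

```python
import numpy as np

def streaming_scores(token_reps, safety_neurons, alphas, classify):
    """Score every prefix of a sequence for harmfulness.

    token_reps     : list over layers of (T, D) token-level representations
    safety_neurons : list of index arrays S_l, one per layer
    alphas         : per-layer weights
    classify       : callable mapping a feature vector z to a score
    Returns one score per token position t, pooling over tokens <= t.
    """
    T = token_reps[0].shape[0]
    scores = []
    for t in range(1, T + 1):
        parts = [a * reps[:t].mean(axis=0)[idx]         # pool the prefix, then
                 for reps, idx, a in zip(token_reps, safety_neurons, alphas)]
        scores.append(classify(np.concatenate(parts)))  # classify the prefix feature
    return scores
```

Because the classifier is reused unchanged on each prefix feature, this matches the zero-shot character of the streaming evaluation: no retraining is needed, only re-pooling.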

Experiment

SIREN is evaluated against state-of-the-art specialized guard models by training on the internal representations of general-purpose LLM backbones across multiple safety benchmarks. The results demonstrate that SIREN achieves superior detection performance, maintains higher policy consistency, and generalizes effectively to unseen reasoning traces and streaming detection tasks. Furthermore, the approach offers significant advantages in both training and inference efficiency due to its sparse parameter usage and the elimination of autoregressive generation.

The authors compare SIREN, a lightweight classifier trained on the internal representations of general-purpose LLMs, against safety-specialized guard models across multiple benchmarks. SIREN achieves higher average detection performance and maintains more consistent precision-recall trade-offs across datasets, whereas the guard models exhibit large variance in their classification behavior. SIREN also generalizes robustly to unseen benchmarks and to streaming detection without additional training or architectural changes. Finally, because it operates on internal representations obtained from a single forward pass through the base model rather than on autoregressive generation, SIREN requires substantially fewer parameters and computational resources during inference.

The authors compare SIREN with safety-specialized guard models across different detection latency positions. SIREN consistently achieves higher detection rates at every latency stage, whereas the guard models show markedly lower rates, particularly in the early stages of a response. These results indicate that SIREN detects harmful content more effectively, more stably, and earlier, which is critical for real-time identification.

The authors conduct ablation studies on SIREN's key design choices, focusing on hyperparameter sensitivity and how safety is encoded internally. Performance is stable across a range of L1 regularization strengths and neuron-selection thresholds, with the optimal settings found through grid search. The middle layers of the LLM contribute most to safety detection, and aggregating features across layers improves performance over any single-layer probe. Training uses a lightweight MLP classifier on the extracted safety neurons, with hyperparameters chosen to balance performance and efficiency.
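A grid search of this kind might look as follows. Here `train_probe_and_eval` is a hypothetical stand-in for training a probe with a given L1 strength and neuron-selection threshold and returning validation F1; the candidate values are our own, as the text does not specify them.

```python
from itertools import product

def grid_search(train_probe_and_eval,
                lambdas=(1e-4, 1e-3, 1e-2),   # candidate L1 strengths (assumed)
                etas=(0.7, 0.8, 0.9)):        # candidate thresholds eta (assumed)
    """Return the (lambda, eta) pair with the best validation F1."""
    return max(product(lambdas, etas),
               key=lambda cfg: train_probe_and_eval(*cfg))
```

Since each probe is a small linear model, the full grid is cheap to evaluate relative to any fine-tuning of the backbone.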

A further comparison confirms that SIREN outperforms safety-specialized guard models across all evaluated benchmarks and model sizes, with precision and recall that remain consistent across datasets, indicating stable safety policy learning. Its training and inference efficiency stem from sparse neuron selection and minimal parameter updates, and its generalization to unseen benchmarks and streaming detection requires no additional training.

The authors evaluate SIREN, a lightweight classifier utilizing the internal representations of general-purpose LLMs, by comparing it against specialized guard models across various benchmarks, latency stages, and ablation configurations. SIREN demonstrates superior detection performance and more stable precision-recall tradeoffs than guard models, while also showing robust generalization to unseen benchmarks and streaming tasks. Furthermore, the method achieves significant computational efficiency and benefits from cross-layer aggregation, particularly by leveraging safety information encoded in the middle layers of the backbone model.

