Detecting Trojaned DNNs via Spectral Regression Analysis

Samuele Pasini Jinhan Kim Paolo Tonella

Abstract

Modern DNNs are repeatedly fine-tuned to integrate new data and functionality. This evolutionary workflow poses a security risk when the update data is not fully trusted, because attackers can implant Trojans during fine-tuning. We present MIST, a Trojan detection approach that analyzes how a model's internal representations change during fine-tuning. Rather than reconstructing trigger conditions, MIST characterizes benign model evolution through pre-activation spectra and flags updates whose spectral deviations are inconsistent with this reference. This formulation treats Trojan detection as a regression problem over model updates. An empirical evaluation across four datasets and eight Trojan attacks shows that spectral distances reliably distinguish Trojaned updates from clean fine-tuning. MIST outperforms the state of the art in detection accuracy after a single update, without requiring knowledge of the poisoned data or the trigger, and remains effective under multi-step benign evolution, with smooth and bounded degradation. These results suggest that spectral evolution provides a stable, low-assumption signal for detecting malicious model updates.

One-sentence Summary

MIST detects Trojans in fine-tuned neural networks by framing detection as a spectral regression problem that characterizes benign pre-activation evolution to flag malicious updates via spectral deviations, operating without knowledge of poisoned data or triggers while achieving state-of-the-art accuracy across four datasets and eight attacks and maintaining robust performance under multi-step benign fine-tuning.

Key Contributions

  • MIST introduces a regression-based framework that treats Trojan detection as a problem of identifying anomalous deviations in spectral model evolution. The method characterizes benign fine-tuning trajectories using pre-activation spectra to isolate malicious updates without requiring trigger reconstruction or poisoned data.
  • The approach validates updated model checkpoints against a clean reference baseline by computing spectral distances that quantify internal representation shifts. This reference-driven mechanism flags updates that deviate from established benign patterns while remaining independent of specific attack implementations.
  • Empirical evaluation across four datasets and eight Trojan attacks demonstrates that spectral distances reliably distinguish poisoned updates from clean fine-tuning. The method outperforms three state-of-the-art detectors in single-step scenarios and maintains robust performance with bounded degradation under multi-step benign evolution.

Introduction

Deep neural networks are routinely fine-tuned in production to adapt to new data, a practice essential for safety-critical systems but vulnerable to backdoor attacks when update datasets are compromised. Prior detection methods typically analyze isolated models and attempt to reconstruct unknown trigger patterns, a strategy that struggles with imperceptible inputs and relies on restrictive assumptions about trigger visibility. The authors reframe this challenge by treating model updates as regression events, leveraging pre-activation spectra to establish a baseline for benign evolution. By measuring spectral deviations against this reference, their approach, MIST, reliably flags malicious fine-tuning without requiring any knowledge of the trigger, demonstrating superior accuracy and robustness across diverse datasets and attack vectors.

Dataset

  • Dataset composition and sources: The authors do not specify dataset composition or sources in this section.
  • Key subset details: Information regarding subset sizes, origins, or filtering criteria is not provided.
  • Model usage and processing: The authors do not outline training splits, mixture ratios, or data processing steps here.
  • Processing and metadata: No cropping strategies, metadata construction, or preprocessing workflows are described.
  • Code and checkpoint availability: The authors release implementations and source code on a public GitHub repository. Due to storage constraints, they do not host model checkpoints publicly and distribute them upon request.

Method

The authors leverage spectral analysis of neural network activations to develop MIST, a method for detecting Trojaned models by monitoring internal changes during fine-tuning. The core approach operates under the model evolution scenario, where a deployed model is periodically updated with new data that may be partially untrusted. The method assumes access to a clean baseline model, a trusted test set, and a small clean subset of the new data for probing internal behavior, without requiring access to poisoned samples or triggers. MIST operates in two distinct phases: Clean Spectra Tracking and Anomaly Detection.

In the Clean Spectra Tracking phase, the framework establishes a statistical baseline for benign model evolution. This is achieved by repeatedly simulating clean training-to-fine-tuning transitions using only trusted data. For each simulation, the clean training set is split into two subsets. A model $G_0$ is trained on the first subset and then fine-tuned on the second to produce $G_1$. The internal change induced by this update is quantified by comparing the activation spectra of $G_0$ and $G_1$. The spectral representation of a model at a specific layer $\ell$ and class $c$ is constructed by first filtering inputs from a test set that the model predicts as class $c$, then extracting the pre-activation values $z^{(\ell)}(x)$ for these inputs, normalizing them, and discretizing them into a histogram over a fixed number of bins. This histogram is normalized to form a probability distribution, which is the activation spectrum. The $L_2$ distance between the per-class spectra of $G_0$ and $G_1$ is computed, and this process is repeated for multiple simulated updates to populate the Clean Spectra Distances Distribution (CSDD). This distribution captures the typical magnitude and variability of spectral changes under benign fine-tuning.
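The spectrum construction and per-class distance described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names, the min-max normalization, the bin count of 50, and the uniform fallback for classes with no predicted inputs are all assumptions, and the pre-activations are assumed to have been extracted from layer $\ell$ beforehand as an array of shape `(n_inputs, n_units)`.

```python
import numpy as np


def activation_spectrum(pre_acts, n_bins=50):
    """Turn pre-activation values into a normalized histogram (a probability vector).

    Assumption: values are min-max normalized to [0, 1] before binning;
    the source only says they are "normalized".
    """
    z = np.asarray(pre_acts, dtype=float).ravel()
    if z.size == 0:
        # Assumption: no input was predicted as this class, so fall back to a
        # uniform spectrum instead of dividing by zero.
        return np.full(n_bins, 1.0 / n_bins)
    lo, hi = z.min(), z.max()
    z = (z - lo) / (hi - lo) if hi > lo else np.zeros_like(z)
    counts, _ = np.histogram(z, bins=n_bins, range=(0.0, 1.0))
    return counts / counts.sum()


def spectral_distances(pre_acts_old, preds_old, pre_acts_new, preds_new,
                       n_classes, n_bins=50):
    """Per-class L2 distance between the spectra of two models.

    pre_acts_*: (n_inputs, n_units) pre-activations at one layer on the test set.
    preds_*:    (n_inputs,) class predicted by the corresponding model.
    Returns a vector with one distance per class.
    """
    distances = np.empty(n_classes)
    for c in range(n_classes):
        # Each model filters the test inputs by its own predictions for class c.
        s_old = activation_spectrum(pre_acts_old[preds_old == c], n_bins)
        s_new = activation_spectrum(pre_acts_new[preds_new == c], n_bins)
        distances[c] = np.linalg.norm(s_old - s_new)
    return distances
```

Under these assumptions, calling `spectral_distances` once per simulated $G_0 \to G_1$ update and stacking the resulting vectors row by row yields a matrix of shape `(n_simulations, n_classes)`, which plays the role of the CSDD.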

In the Anomaly Detection phase, the method evaluates a newly produced model $M^{i+1}$ against its predecessor $M^i$. The spectral distance between the two models is computed on a clean test set, ensuring no assumption is made about the availability of poisoned inputs. This distance is represented as a vector $x$ summarizing the internal change across all classes. The deviation of this observed change from the baseline CSDD is quantified using the squared Mahalanobis distance $D_M^2$, which accounts for the correlations between class-wise spectral changes. The mean $\mu$ and covariance $\Sigma$ of the CSDD are used to define the reference distribution. To ensure numerical stability, the covariance matrix is regularized using the Ledoit-Wolf shrinkage estimator. The squared Mahalanobis distance $D_M^2$ is then compared against a threshold $\tau$, which is determined as the $\alpha$-quantile of a $\chi^2$ distribution with $C$ degrees of freedom, where $C$ is the number of classes. If $D_M^2$ exceeds $\tau$, the update is flagged as anomalous and the model is classified as potentially Trojaned; otherwise, it is deemed consistent with benign evolution.
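The decision rule can be illustrated with a short sketch. It uses the standard definition $D_M^2 = (x - \mu)^\top \Sigma^{-1} (x - \mu)$, with `sklearn.covariance.LedoitWolf` supplying the shrunk covariance estimate and `scipy.stats.chi2.ppf` supplying the quantile. The function names and the default `alpha=0.99` are assumptions made for illustration; the CSDD is assumed to be available as an array of shape `(n_simulations, n_classes)`.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import LedoitWolf


def fit_reference(csdd):
    """Estimate the mean and the shrunk precision matrix of the CSDD.

    csdd: (n_simulations, n_classes) per-class spectral distances observed
    under simulated benign fine-tuning.
    """
    estimator = LedoitWolf().fit(np.asarray(csdd, dtype=float))
    # location_ is the sample mean; precision_ is the inverse of the
    # Ledoit-Wolf regularized covariance matrix.
    return estimator.location_, estimator.precision_


def is_anomalous(x, mu, precision, alpha=0.99):
    """Flag an update whose squared Mahalanobis distance exceeds the chi-square quantile.

    x: (n_classes,) spectral distance vector between consecutive models.
    alpha: quantile level (0.99 is an illustrative assumption).
    Returns (flag, d2, tau).
    """
    diff = np.asarray(x, dtype=float) - mu
    d2 = float(diff @ precision @ diff)
    # Degrees of freedom equal the number of classes C.
    tau = float(chi2.ppf(alpha, df=diff.size))
    return d2 > tau, d2, tau
```

A caller would fit the reference once on the CSDD and then run `is_anomalous` on the distance vector of each incoming update. Raising `alpha` raises $\tau$ and therefore trades fewer false alarms for a higher chance of missing a subtle Trojan.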

Experiment

The evaluation assesses MIST, a Trojan detection method leveraging activation spectra, across diverse image classification datasets and multiple attack types to validate whether malicious updates induce distinguishable spectral deviations from benign fine-tuning. Results confirm that these spectral differences reliably separate compromised models from clean ones, enabling the approach to consistently outperform existing detectors after a single update while maintaining a highly favorable error profile with minimal false positives. Under repeated fine-tuning scenarios, the method demonstrates robust resilience, as performance degrades gracefully through a controlled increase in false alarms rather than missed detections, ultimately establishing spectral tracking as a stable and practical approach for validating evolving neural networks.

The authors evaluate MIST by assessing whether spectral analysis can distinguish benign from malicious model updates when a fine-tuned model is compared against a clean reference checkpoint. The reported findings are consistent across datasets and attack types:

  • Spectral distances separate Trojaned models from clean fine-tuned models across the evaluated datasets and attacks.
  • In single-step fine-tuning scenarios, MIST achieves higher detection accuracy than the state-of-the-art detectors it is compared with, and produces fewer false positives.
  • Under multi-step evolution, MIST remains effective even as models drift from the original reference. Performance degrades gracefully, primarily because benign drift increases false positives rather than because Trojans are missed.

Overall, the experiments indicate that comparing fine-tuned checkpoints against a clean reference through their spectra isolates compromised models in single-step fine-tuning scenarios, with fewer false positives than existing detectors. Repeated benign updates cause a gradual increase in false alarms, but the degradation is driven mainly by these false alarms rather than by missed Trojans. The study therefore presents spectral monitoring as an effective and resilient way to identify backdoored updates in evolving machine learning pipelines.

