Detecting Trojaned Deep Neural Networks via Spectral Regression Analysis

Samuele Pasini Jinhan Kim Paolo Tonella

Abstract

Modern deep neural networks are frequently fine-tuned to incorporate new data and functionality. This evolutionary workflow introduces a security risk when the update data cannot be fully trusted, since adversaries may implant Trojans during fine-tuning. We present MIST, a Trojan detection approach that analyzes how a model's internal representations change during fine-tuning. Rather than attempting to reconstruct trigger conditions, MIST characterizes benign model evolution using pre-activation spectra and flags updates whose spectral deviations are inconsistent with this reference. This framing treats Trojan detection as a regression problem across model updates. Empirical evaluation across four datasets and eight Trojan attacks shows that spectral distances reliably distinguish Trojaned updates from benign fine-tuning. MIST exceeds the detection accuracy of state-of-the-art methods after a single update, without requiring any knowledge of the poisoned data or the trigger, and remains effective under multi-step benign evolution, with gradual and bounded degradation. These results suggest that spectral evolution provides a stable, assumption-light signal for detecting malicious model updates.

One-sentence Summary

MIST detects Trojans in fine-tuned neural networks by framing detection as a spectral regression problem that characterizes benign pre-activation evolution to flag malicious updates via spectral deviations, operating without knowledge of poisoned data or triggers while achieving state-of-the-art accuracy across four datasets and eight attacks and maintaining robust performance under multi-step benign fine-tuning.

Key Contributions

  • MIST introduces a regression-based framework that treats Trojan detection as a problem of identifying anomalous deviations in spectral model evolution. The method characterizes benign fine-tuning trajectories using pre-activation spectra to isolate malicious updates without requiring trigger reconstruction or poisoned data.
  • The approach validates updated model checkpoints against a clean reference baseline by computing spectral distances that quantify internal representation shifts. This reference-driven mechanism flags updates that deviate from established benign patterns while remaining independent of specific attack implementations.
  • Empirical evaluation across four datasets and eight Trojan attacks demonstrates that spectral distances reliably distinguish poisoned updates from clean fine-tuning. The method outperforms three state-of-the-art detectors in single-step scenarios and maintains robust performance with bounded degradation under multi-step benign evolution.

Introduction

Deep neural networks are routinely fine-tuned in production to adapt to new data, a practice essential for safety-critical systems but vulnerable to backdoor attacks when update datasets are compromised. Prior detection methods typically analyze isolated models and attempt to reconstruct unknown trigger patterns, a strategy that struggles with imperceptible inputs and relies on restrictive assumptions about trigger visibility. The authors reframe this challenge by treating model updates as regression events, leveraging pre-activation spectra to establish a baseline for benign evolution. By measuring spectral deviations against this reference, their approach, MIST, reliably flags malicious fine-tuning without requiring any knowledge of the trigger, demonstrating superior accuracy and robustness across diverse datasets and attack vectors.

Dataset

  • Dataset composition and sources: The authors do not specify dataset composition or sources in this section.
  • Key subset details: Information regarding subset sizes, origins, or filtering criteria is not provided.
  • Model usage and processing: The authors do not outline training splits, mixture ratios, or data processing steps here.
  • Processing and metadata: No cropping strategies, metadata construction, or preprocessing workflows are described.
  • Code and checkpoint availability: The authors release implementations and source code on a public GitHub repository. Due to storage constraints, they do not host model checkpoints publicly and distribute them upon request.

Method

The authors leverage spectral analysis of neural network activations to develop MIST, a method for detecting Trojaned models by monitoring internal changes during fine-tuning. The core approach operates under the model evolution scenario, where a deployed model is periodically updated with new data that may be partially untrusted. The method assumes access to a clean baseline model, a trusted test set, and a small clean subset of the new data for probing internal behavior, without requiring access to poisoned samples or triggers. MIST operates in two distinct phases: Clean Spectra Tracking and Anomaly Detection.

In the Clean Spectra Tracking phase, the framework establishes a statistical baseline for benign model evolution. This is achieved by repeatedly simulating clean training-to-fine-tuning transitions using only trusted data. For each simulation, the clean training set is split into two subsets. A model $G_0$ is trained on the first subset and then fine-tuned on the second to produce $G_1$. The internal change induced by this update is quantified by comparing the activation spectra of $G_0$ and $G_1$. The spectral representation of a model at a specific layer $\ell$ and class $c$ is constructed by first filtering the inputs from a test set that the model predicts as class $c$, then extracting the pre-activation values $z^{(\ell)}(x)$ for these inputs, normalizing them, and discretizing them into a histogram over a fixed number of bins. This histogram is normalized to form a probability distribution, which is the activation spectrum. The $L_2$ distance between the per-class spectra of $G_0$ and $G_1$ is computed, and this process is repeated for multiple simulated updates to populate the Clean Spectra Distances Distribution (CSDD). This distribution captures the typical magnitude and variability of spectral changes under benign fine-tuning.
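As a rough illustration of the spectrum construction described above, the following Python sketch builds a per-class activation spectrum and the per-class $L_2$ distances between two models, then stacks several such distance vectors into a CSDD-like array. It is not the authors' implementation: the function names, the z-score normalization, the bin count, the clipping range, and the synthetic pre-activations and predictions are all assumptions made for illustration, since the text does not specify them.

```python
import numpy as np

def activation_spectrum(preacts, n_bins=32, value_range=(-4.0, 4.0)):
    """Histogram-based spectrum of pre-activations for one layer and one class.

    preacts: pre-activation values for the inputs predicted as the class.
    n_bins and value_range are illustrative choices, not values from the paper.
    """
    z = np.asarray(preacts, dtype=float).ravel()
    if z.size == 0:
        return np.zeros(n_bins)  # no inputs were predicted as this class
    # Assumed normalization: z-score, then clip into the fixed histogram range.
    z = (z - z.mean()) / (z.std() + 1e-12)
    z = np.clip(z, value_range[0], value_range[1])
    hist, _ = np.histogram(z, bins=n_bins, range=value_range)
    return hist / hist.sum()  # normalize counts into a probability distribution

def per_class_distances(z_old, pred_old, z_new, pred_new, n_classes):
    """L2 distance between the class-c spectra of two models, for every class c."""
    dists = np.zeros(n_classes)
    for c in range(n_classes):
        # Each model is filtered by its own predictions for class c.
        s_old = activation_spectrum(z_old[pred_old == c])
        s_new = activation_spectrum(z_new[pred_new == c])
        dists[c] = np.linalg.norm(s_old - s_new)
    return dists

# Synthetic stand-in for repeated clean updates (G_0 -> G_1) on trusted data.
rng = np.random.default_rng(0)
n_inputs, n_units, n_classes = 600, 16, 3
rows = []
for _ in range(20):  # each iteration stands in for one simulated clean update
    z0 = rng.normal(size=(n_inputs, n_units))          # pre-activations of G_0
    z1 = z0 + rng.normal(scale=0.1, size=z0.shape)     # small benign drift in G_1
    p0 = rng.integers(0, n_classes, size=n_inputs)     # stand-in predictions
    p1 = p0.copy()
    rows.append(per_class_distances(z0, p0, z1, p1, n_classes))
csdd = np.vstack(rows)  # shape: (number of simulated updates, n_classes)
```

In a real pipeline, the synthetic arrays would be replaced by pre-activations captured from the chosen layer of each model (for example via forward hooks) on the trusted test set.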

In the Anomaly Detection phase, the method evaluates a newly produced model $M^{i+1}$ against its predecessor $M^i$. The spectral distance between the two models is computed on a clean test set, ensuring no assumption is made about the availability of poisoned inputs. This distance is represented as a vector $x$ summarizing the internal change across all classes. The deviation of this observed change from the baseline CSDD is quantified using the squared Mahalanobis distance $D_M^2$, which accounts for the correlations between class-wise spectral changes. The mean $\mu$ and covariance $\Sigma$ of the CSDD are used to define the reference distribution. To ensure numerical stability, the covariance matrix is regularized using the Ledoit-Wolf shrinkage estimator. The squared Mahalanobis distance $D_M^2$ is then compared against a threshold $\tau$, which is determined as the $\alpha$-quantile of a $\chi^2$ distribution with $C$ degrees of freedom, where $C$ is the number of classes. If $D_M^2$ exceeds $\tau$, the update is flagged as anomalous and the model is classified as potentially Trojaned; otherwise, it is deemed consistent with benign evolution.
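The decision rule in this phase can be sketched in a few lines of Python using scikit-learn's `LedoitWolf` estimator and SciPy's `chi2` distribution. This is a minimal sketch under stated assumptions, not the authors' code: the function names, the choice `alpha=0.99`, and the synthetic CSDD are hypothetical, and $\alpha$ is interpreted here as the quantile level of the $\chi^2$ distribution.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import LedoitWolf

def fit_reference(csdd):
    """Estimate the mean and the Ledoit-Wolf-regularized precision of the CSDD.

    csdd: array of shape (n_simulated_updates, n_classes).
    """
    lw = LedoitWolf().fit(np.asarray(csdd, dtype=float))
    return lw.location_, lw.precision_  # mean vector and inverse covariance

def check_update(x, mu, precision, alpha=0.99):
    """Return (flagged, d2, tau) for an observed spectral-distance vector x."""
    diff = np.asarray(x, dtype=float) - mu
    d2 = float(diff @ precision @ diff)       # squared Mahalanobis distance
    tau = float(chi2.ppf(alpha, df=diff.size))  # chi-square quantile, C degrees of freedom
    return d2 > tau, d2, tau

# Synthetic CSDD: 200 simulated clean updates over C = 10 classes (assumed values).
rng = np.random.default_rng(0)
csdd = rng.normal(loc=0.05, scale=0.01, size=(200, 10))
mu, precision = fit_reference(csdd)

# An update whose distances match the clean mean is consistent with benign evolution.
flagged_benign, d2_benign, tau = check_update(mu, mu, precision)

# An update with much larger per-class distances is flagged as anomalous.
flagged_shifted, d2_shifted, _ = check_update(mu + 0.2, mu, precision)
```

Using the precision matrix returned by the shrinkage estimator avoids inverting a possibly ill-conditioned sample covariance, which is the numerical-stability concern the text mentions.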

Experiment

The evaluation assesses MIST, a Trojan detection method leveraging activation spectra, across diverse image classification datasets and multiple attack types to validate whether malicious updates induce distinguishable spectral deviations from benign fine-tuning. Results confirm that these spectral differences reliably separate compromised models from clean ones, enabling the approach to consistently outperform existing detectors after a single update while maintaining a highly favorable error profile with minimal false positives. Under repeated fine-tuning scenarios, the method demonstrates robust resilience, as performance degrades gracefully through a controlled increase in false alarms rather than missed detections, ultimately establishing spectral tracking as a stable and practical approach for validating evolving neural networks.

The authors evaluate MIST by testing whether spectral analysis can distinguish benign from malicious model updates, comparing each fine-tuned model against a clean reference checkpoint. Across multiple datasets and attack types, spectral distances separate Trojaned models from clean fine-tuned ones. In single-step fine-tuning scenarios, MIST achieves high detection accuracy and outperforms state-of-the-art detectors, with fewer false positives. Under multi-step evolution, the method retains its detection capability as models drift from the original reference; performance degrades gracefully, and the degradation stems primarily from additional false positives caused by benign drift rather than from missed Trojans.

The experiments evaluate MIST, a Trojan detection method that uses spectral analysis to differentiate benign from malicious model updates by comparing fine-tuned checkpoints against a clean reference. The results indicate that spectral distances reliably isolate compromised models in single-step fine-tuning scenarios and that MIST outperforms existing detectors with fewer false positives. Repeated benign updates cause a gradual increase in false alarms, but the degradation is graceful and arises mainly from false positives rather than missed Trojans. Overall, the study suggests that spectral monitoring is an effective and resilient way to identify backdoored updates in evolving machine learning pipelines.

