One year ago

vLLM Hook v0: A Plug-in for Programming Model Internals on vLLM

Ching-Yun Ko, Pin-Yu Chen


Abstract

Modern artificial intelligence (AI) models are deployed on inference engines to optimize runtime efficiency and resource allocation, particularly for Transformer-based large language models (LLMs). The vLLM project is a major open-source library for model serving and inference. However, the current vLLM implementation limits the programmability of the internal states of deployed models. This prevents the use of popular test-time model alignment and enhancement methods. For example, it prevents the detection of adversarial prompts based on attention patterns, or the adjustment of model responses through activation steering. To bridge this critical gap, we present vLLM Hook, an open-source plug-in that enables programming of the internal states of vLLM models. Based on a configuration file specifying which internal states to capture, vLLM Hook integrates seamlessly with vLLM and supports two essential features: passive programming and active programming. For passive programming, vLLM Hook probes the selected internal states for subsequent analysis while keeping the model's generation intact. For active programming, vLLM Hook enables efficient intervention in the model's generation by altering the selected internal states.

One-sentence Summary

The authors present vLLM Hook, an open-source plug-in for the vLLM inference engine that enables configurable programming of internal model states through passive probing and active intervention, thereby overcoming existing programmability constraints to support test-time alignment, adversarial prompt detection, and activation steering for large language models.

Key Contributions

  • vLLM Hook is an open-source plugin that enables configuration-driven programming of internal states within the vLLM inference engine, directly addressing the limitation that restricts test-time model alignment and enhancement methods.
  • The system implements two core programming modes: passive programming, which probes internal states without changing the generated output, and active programming, which intervenes in generation by altering selected internal states.
  • Three practical demonstrations (prompt injection detection, enhanced retrieval-augmented generation, and activation steering) show the plugin's utility for runtime model monitoring and adjustment.

Introduction

Modern large language models rely on inference engines like vLLM to optimize deployment efficiency and resource allocation. However, the current vLLM implementation restricts access to and modification of internal model states during inference, which blocks test-time techniques such as attention-based adversarial prompt detection and activation steering. To address this limitation, the authors develop vLLM Hook, an open-source plug-in that enables programming of internal states through a configuration file. The framework supports passive probing, which captures selected states for later analysis without changing the generation, and active intervention, which alters selected states to change the model's output. The demonstrated applications include enhanced retrieval-augmented generation and prompt injection detection.

Dataset

  • The authors do not provide a dataset description in the submitted text.
  • Dataset composition and sources: The content only outlines a GitHub contribution workflow and references a repository URL. No data sources or composition details are included.
  • Key details for each subset: The text contains no information regarding subset sizes, origins, or filtering criteria.
  • How the paper uses the data: No training splits, mixture ratios, or data processing steps are described.
  • Cropping strategy, metadata construction, or other processing details: None are mentioned in the provided material.

Method

The vLLM-Hook framework is designed as a modular plugin system that enables both passive and active programming within the vLLM inference pipeline. At its core, the framework operates through two primary abstractions: the worker and the analyzer, which are orchestrated by a configuration file that defines the behavior of each component. The worker integrates directly into the vLLM runtime and is responsible for either capturing internal model states during inference (passive programming) or modifying the model's behavior in real time (active programming). This integration is achieved by subclassing the standard vLLM GPU worker and overriding the load_model method to install PyTorch forward hooks on selected model modules. These hooks are applied to specific attention layers and heads, as specified in the configuration, allowing for targeted observation or intervention during the forward pass.
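The difference between a passive and an active hook can be sketched without any dependencies. The block below is a minimal illustration, not the plugin's code: `Module` is a hypothetical plain-Python stand-in that mimics the contract of PyTorch's `register_forward_hook` (a hook is called as `hook(module, inputs, output)`, and a non-None return value replaces the output). The names `probe_hook` and `make_steering_hook` are assumptions made for this example.

```python
from typing import Any, Callable, List, Optional

Hook = Callable[["Module", Any, Any], Optional[Any]]


class Module:
    """Hypothetical stand-in for a model layer (NOT torch.nn.Module)."""

    def __init__(self, scale: float) -> None:
        self.scale = scale
        self._hooks: List[Hook] = []

    def register_forward_hook(self, hook: Hook) -> None:
        self._hooks.append(hook)

    def forward(self, x: List[float]) -> List[float]:
        return [self.scale * v for v in x]

    def __call__(self, x: List[float]) -> List[float]:
        out = self.forward(x)
        for hook in self._hooks:
            result = hook(self, x, out)
            if result is not None:  # a non-None return replaces the output
                out = result
        return out


captured: List[List[float]] = []


def probe_hook(module: Module, inputs: Any, output: List[float]) -> None:
    # Passive: store a copy and return None, so the output is unchanged.
    captured.append(list(output))


def make_steering_hook(direction: List[float], alpha: float) -> Hook:
    # Active: shift the output along a fixed direction, scaled by alpha.
    def steer(module: Module, inputs: Any, output: List[float]) -> List[float]:
        return [o + alpha * d for o, d in zip(output, direction)]

    return steer


passive_layer = Module(scale=2.0)
passive_layer.register_forward_hook(probe_hook)
passive_out = passive_layer([1.0, 2.0])

active_layer = Module(scale=2.0)
active_layer.register_forward_hook(make_steering_hook([1.0, -1.0], alpha=0.5))
active_out = active_layer([1.0, 2.0])
```

In this sketch the passive layer returns the same values it would without a hook, while the active layer's output is shifted. With real PyTorch modules the hook receives the positional inputs as a tuple and the output as a tensor, but the return-value convention is the same.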

As the paper's framework figure shows, the framework begins with a native vLLM system that receives an input prompt. The user specifies the components to probe via a configuration file, which is then used to guide the vLLM-Hook system. The system captures internal states during inference, which can be either saved for later analysis or used to enable active programming, such as model steering or customized generation. The configuration file defines the model identity, the important layers and attention heads, and the mode of signal capture, such as whether to collect data for all tokens or only the last token. These configurations are managed through a lightweight registry and a HookLLM wrapper class that initializes the LLM instance and interfaces with the core vLLM engine.
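To make the three configuration elements concrete, here is a minimal sketch of what such a file might contain and how it could be validated. The JSON layout, the field names (`model`, `heads_by_layer`, `capture`), the mode strings, and the model identifier are all assumptions made for illustration; the source does not specify the plugin's actual schema.

```python
import json
from dataclasses import dataclass
from typing import Dict, List

CAPTURE_MODES = ("all_tokens", "last_token")  # hypothetical mode names


@dataclass(frozen=True)
class HookConfig:
    model: str                            # model identity
    heads_by_layer: Dict[int, List[int]]  # layer index -> attention heads
    capture: str                          # which token positions to record


def parse_config(text: str) -> HookConfig:
    raw = json.loads(text)
    if raw["capture"] not in CAPTURE_MODES:
        raise ValueError(f"unknown capture mode: {raw['capture']!r}")
    # JSON object keys are strings, so convert layer indices back to int.
    heads = {int(layer): sorted(h) for layer, h in raw["heads_by_layer"].items()}
    return HookConfig(model=raw["model"], heads_by_layer=heads, capture=raw["capture"])


# Hypothetical configuration: placeholder model name, two probed layers.
EXAMPLE = """
{
  "model": "example-org/example-model",
  "heads_by_layer": {"12": [3, 7], "20": [0]},
  "capture": "last_token"
}
"""

config = parse_config(EXAMPLE)
```

Rejecting unknown capture modes at load time means a typo in the file surfaces before any model is loaded, rather than as missing data after a long inference run.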

The workflow proceeds in three stages: configuration identification, probing, and programming. In the configuration stage, the user identifies the components to probe, potentially using external data. During probing, the worker measures targeted model internals via hooks during inference, capturing relevant activations or attention weights. The final stage involves programming, where the saved states are used either for passive monitoring—such as evaluating prompt injection risks—or for active intervention, such as steering model behavior. This process is illustrated in the framework diagram, where the vLLM-Hook plugin wraps the vLLM system and interacts with the LLMEngine, which manages input processing, scheduling, model execution, and output processing.

The analyzer component operates on the saved states after inference completion. It retrieves the cached data using a unique run identifier and reassembles the desired statistics, such as attention weights, to compute metrics like prompt injection attack scores or document relevance scores. This is achieved through a modular analyzer class that takes the hook directory and layer-to-head mappings as inputs and processes the cached data to compute specific metrics. The analyzer is triggered via the llm.analyze method, which allows users to perform post-inference analysis without modifying the core model or runtime. This modular design enables the framework to support a wide range of applications, including safety monitoring, model steering, and selective retrieval, by combining different workers and analyzers within the same orchestration system.
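The save-then-analyze split can be sketched as follows. This is an illustration under stated assumptions, not the plugin's analyzer: the on-disk format (one JSON file per run identifier), the `"layer.head"` key convention, and the scoring rule (mean attention mass that the selected heads place on a token span) are all hypothetical choices made for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def save_states(hook_dir: Path, run_id: str, attn: Dict[str, List[float]]) -> None:
    """Worker side: cache one attention row per 'layer.head' key."""
    (hook_dir / f"{run_id}.json").write_text(json.dumps(attn))


def analyze(
    hook_dir: Path,
    run_id: str,
    layer_to_heads: Dict[int, List[int]],
    span: range,
) -> float:
    """Analyzer side: mean attention mass of the selected heads on `span`."""
    attn = json.loads((hook_dir / f"{run_id}.json").read_text())
    masses = []
    for layer, heads in layer_to_heads.items():
        for head in heads:
            row = attn[f"{layer}.{head}"]
            masses.append(sum(row[i] for i in span))
    return sum(masses) / len(masses)


with tempfile.TemporaryDirectory() as tmp:
    hook_dir = Path(tmp)
    save_states(
        hook_dir,
        "run-0001",
        {
            "12.3": [0.5, 0.25, 0.25],
            "12.7": [0.25, 0.25, 0.5],
            "20.0": [0.1, 0.1, 0.8],  # cached, but not selected below
        },
    )
    # Score how much the two selected heads attend to tokens 0 and 1.
    score = analyze(hook_dir, "run-0001", {12: [3, 7]}, span=range(0, 2))
```

Because the analyzer reads only the cached file, the same run can be rescored with a different head selection or token span without repeating inference.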

