il y a 5 heures

Keisuke Kamahori Shihang Li Simon Peter Baris Kasikci

Table des matières

Résumé

Pendant des années, nous avons conçu des systèmes d’inférence de grands modèles de langage (LLM) comme toute autre infrastructure critique : une pile logicielle unique à usage général, finement ajustée au fil de nombreuses années d’ingénierie, destinée à supporter tous les modèles et toutes les charges de travail. Dans cet article, nous prenons le pari inverse : une boucle multi-agent qui synthétise automatiquement des systèmes d’inférence sur mesure pour différents scénarios d’utilisation. Nous proposons VibeServe, la première boucle agentic qui génère l’intégralité des piles d’inférence de LLM de bout en bout. VibeServe utilise une boucle externe pour planifier et suivre la recherche dans l’espace des conceptions de systèmes, et une boucle interne pour implémenter les candidats, vérifier leur exactitude et mesurer leurs performances sur le benchmark cible. Dans le scénario de déploiement standard, où les piles existantes sont hautement optimisées, VibeServe reste compétitif face à vLLM, démontrant que la spécialisation au moment de la génération (generation-time specialization) ne doit pas se faire au détriment des performances. De manière plus intéressante, dans des scénarios non standards, VibeServe surpasse les systèmes existants en exploitant des opportunités que les systèmes génériques négligent, dans six scénarios impliquant des architectures de modèles non standards, des connaissances sur la charge de travail et des optimisations spécifiques au matériel. Ensemble, ces résultats suggèrent un autre point dans l’espace de conception des logiciels d’infrastructure : la spécialisation au moment de la génération plutôt que la généralité au moment de l’exécution.

One-sentence Summary

VibeServe is the first agentic loop that generates entire bespoke LLM serving stacks end-to-end by employing an outer loop for design planning and an inner loop for implementation and evaluation, prioritizing generation-time specialization over runtime generality to remain competitive with vLLM in standard deployments while outperforming existing systems in six scenarios involving non-standard model architectures, workload knowledge, and hardware-specific optimizations.

Key Contributions

The paper introduces VibeServe, the first agentic loop that generates entire LLM serving stacks end-to-end for different usage scenarios. It uses an outer loop to plan system designs and an inner loop to implement candidates while checking correctness and measuring performance.
The architecture employs role-based agents that collaborate on a shared workspace using a skills library of serving-systems knowledge and direct profiler access. These agents fold performance analysis into every implementation change to target performant code during the synthesis process.
Experiments demonstrate that the method remains competitive with vLLM in standard deployment settings while outperforming existing systems in six non-standard scenarios. Two of these specific scenarios involving non-standard model architectures and hardware cannot run on any generic stack.

Introduction

Large language model serving infrastructure typically relies on general-purpose stacks that are manually optimized for mainstream workloads. This one-size-fits-all approach incurs a performance tax on emerging model architectures and specialized hardware where standard abstractions fail. Existing agentic systems address optimization at a granular level but struggle to coordinate end-to-end system synthesis due to context window limitations and state drift. The authors propose VibeServe, a multi-agent framework that automatically generates bespoke serving systems tailored to specific deployment targets. Their design employs an outer planning loop and an inner implementation loop to manage long-horizon tasks without losing context. This approach achieves parity with hand-tuned baselines on standard deployments while delivering significant speedups in non-standard scenarios involving complex models or unconventional hardware.

Method

The authors propose VibeServe, a framework that generates a bespoke serving system specialized to a user-specified model, hardware platform, and workload. Rather than relying on general-purpose runtimes to cover every case, the system iteratively produces an end-to-end serving system from a small set of user-provided artifacts.

As illustrated in the framework diagram, this approach contrasts with generic serving today, where a single runtime attempts to cover common cases across various workloads and hardware. VibeServe instead creates one bespoke serving system per target, optimizing for the specific intersection of workload, model, and hardware.

The detailed architecture, shown in the figure below, decomposes the generation process into an outer planning loop and an inner implementation loop. The system accepts user inputs including model weights, reference code, accuracy-checking scripts, and benchmark metrics. These inputs define the per-target contract that parameterizes the framework.

The outer loop manages the search policy and search state. It reads prior state from a git repository, dispatches a single task to the inner loop per round, and receives the resulting commit with performance metrics. This loop supports coordination mechanisms such as reverting to earlier checkpoints if a later round passes correctness but regresses on the headline metric. Policies implemented include evolutionary search and an issue-tracker approach that maintains a backlog of structured optimization tasks.

Within each round, the inner loop employs multiple agents to separate code edit proposals from validation. The Implementer agent produces and revises the candidate serving system in an isolated workspace. This workspace mounts user artifacts as read-only and exposes the target execution environment. Once the Implementer produces a build, the Accuracy Judge agent gates the overall correctness. It verifies end-to-end model accuracy against the reference implementation and checks for reward-hacking patterns, such as schema-only synthesis or bypassing model inference.

If the implementation passes correctness checks, the Performance Evaluator profiles the system. It starts with end-to-end performance on the user-provided benchmark and drills down with platform-specific profilers when finer measurements are needed. The Evaluator generates performance hints for subsequent rounds, which are fed back to the outer loop.

To support these agents, VibeServe provides an extensible skills library. This library contains operational knowledge organized by abstraction layers, such as model architectures, serving algorithms, and hardware platforms. Agents retrieve focused guidance from this library, allowing them to apply techniques like continuous batching or utilize specific libraries like FlashInfer without reimplementing kernels.

Experiment

The evaluation assesses VibeServe across six scenarios spanning diverse workload patterns, model architectures, and hardware configurations to determine if generated systems match the performance of human engineered systems or address niche limitations. In standard serving environments, the system achieves parity with established frameworks while autonomously optimizing throughput and latency. For specialized cases such as hybrid architectures and local constrained decoding, VibeServe delivers substantial performance gains by implementing optimizations that general purpose systems lack, validating that bespoke systems can effectively handle complex use cases where existing infrastructure falls short.

The the the table details the resource consumption of the VibeServe agent across six diverse serving scenarios, highlighting the distribution of work among the Orchestrator, Implementer, Judge, and Performance Evaluator roles. The Implementer role consistently consumes the largest portion of the total duration, while the Orchestrator requires the least time. Total agent execution time varies significantly by scenario, with the standard Llama-3.1 serving task demanding the most resources and JSON constrained decoding being the most efficient. The Implementer role dominates the execution time, typically accounting for the largest share of duration across all scenarios. Scenario A requires the highest number of calls and total duration, reflecting the difficulty of optimizing a mature setting. The Orchestrator role consistently maintains the lowest share of time and call volume across all experiments.

The evaluation assesses the VibeServe agent's resource consumption across six diverse serving scenarios to determine workload distribution among the Orchestrator, Implementer, Judge, and Performance Evaluator roles. Results demonstrate that the Implementer role consistently consumes the largest portion of execution time and call volume, whereas the Orchestrator maintains the lowest share across all scenarios. Overall, the experiments highlight that total agent efficiency varies by task complexity, with standard Llama-3.1 serving requiring the most resources while JSON constrained decoding proves the most efficient.

PDF source

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

il y a 5 heures

Keisuke Kamahori Shihang Li Simon Peter Baris Kasikci

Table des matières

Résumé

One-sentence Summary

Key Contributions

The paper introduces VibeServe, the first agentic loop that generates entire LLM serving stacks end-to-end for different usage scenarios. It uses an outer loop to plan system designs and an inner loop to implement candidates while checking correctness and measuring performance.
The architecture employs role-based agents that collaborate on a shared workspace using a skills library of serving-systems knowledge and direct profiler access. These agents fold performance analysis into every implementation change to target performant code during the synthesis process.
Experiments demonstrate that the method remains competitive with vLLM in standard deployment settings while outperforming existing systems in six non-standard scenarios. Two of these specific scenarios involving non-standard model architectures and hardware cannot run on any generic stack.

Introduction

Method

Experiment

PDF source

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

Command Palette

VibeServe : Les agents d'IA peuvent-ils concevoir des systèmes de services dédiés aux LLM ?

Keisuke Kamahori Shihang Li Simon Peter Baris Kasikci

Résumé

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

VibeServe : Les agents d'IA peuvent-ils concevoir des systèmes de services dédiés aux LLM ?

Keisuke Kamahori Shihang Li Simon Peter Baris Kasikci

Résumé

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

VibeServe : Les agents d'IA peuvent-ils concevoir des systèmes de services dédiés aux LLM ?

Keisuke Kamahori Shihang Li Simon Peter Baris Kasikci

Résumé

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters