Exécuter ce Notebook Discuter sur Discord

il y a un an

Aaditya Shukla Sidney Knowles Meenakshi Madugula Dave Farris Ryan Angilly Santiago Pombo Anbang Xu Lu An Abhinav Balasubramanian Tan Yu

Création d'un agent IA basé sur OpenManus + QwQ-32B

20 heures de calcul sur RTX 5090 pour seulement $1 (valeur $7)

Table des matières

Résumé

Les agents d’intelligence artificielle (IA) d’entreprise doivent s’adapter continuellement afin de maintenir leur précision, de réduire la latence et de rester alignés sur les besoins des utilisateurs. Nous présentons une mise en œuvre pratique d’une roue de données (data flywheel) dans NVInfo IA, l’assistant de connaissances à architecture Mixture-of-Experts (MoE) de NVIDIA, qui sert plus de 30 000 employés. En opérationnalisant une roue de données pilotée par le cadre MAPE, nous avons conçu un système en boucle fermée qui traite systématiquement les défaillances des pipelines de génération augmentée par récupération (RAG) et permet un apprentissage continu. Au cours d’une période de trois mois suivant le déploiement, nous avons surveillé les retours et collecté 495 échantillons négatifs. L’analyse a révélé deux modes de défaillance majeurs : des erreurs de routage (5,25 %) et des erreurs de reformulation de requête (3,2 %). À l’aide des microservices NVIDIA NeMo, nous avons mis en œuvre des améliorations ciblées par le biais d’un affinage (fine-tuning). Pour le routage, nous avons remplacé un modèle Llama 3.1 70B par une variante affinée de 8B, atteignant une précision de 96 %, une réduction de la taille du modèle par un facteur 10 et une amélioration de la latence de 70 %. Pour la reformulation de requête, l’affinage a permis une augmentation de la précision de 3,7 % et une réduction de la latence de 40 %. Notre approche démontre comment les retours humains dans la boucle (HITL), lorsqu’ils sont structurés au sein d’une roue de données, transforment les agents IA d’entreprise en systèmes auto-améliorants. Les enseignements clés incluent des approches pour garantir la robustesse des agents malgré un nombre limité de retours utilisateurs, la gestion des contraintes de confidentialité et l’exécution de déploiements échelonnés en production. Ce travail propose un modèle reproductible pour la conception d’agents IA d’entreprise robustes et adaptatifs, capables d’apprendre à partir de l’utilisation réelle à grande échelle.

One-sentence Summary

By operationalizing a MAPE-driven data flywheel with NVIDIA NeMo microservices, the authors fine-tuned routing and query rephrasal components for the NVInfo AI knowledge assistant, replacing a Llama 3.1 70B model with an 8B variant that achieved 96% accuracy, a 10× size reduction, and a 70% latency improvement while boosting query rephrasal accuracy by 3.7% and cutting its latency by 40% through structured human feedback collected over three months.

Key Contributions

This work introduces a MAPE-driven data flywheel framework that operationalizes a closed-loop system for continuous learning in enterprise AI agents. The architecture systematically routes user feedback into the optimization pipeline to enable incremental system evolution.
An empirical analysis of 495 post-deployment feedback samples identifies routing errors (5.25%) and query rephrasing inaccuracies (3.2%) as the primary failure modes. These findings establish a data-driven baseline for prioritizing targeted component optimizations.
A modular implementation blueprint leveraging NVIDIA NeMo microservices executes parameter-efficient fine-tuning to resolve the identified pipeline failures. Targeted optimizations replace a Llama 3.1 70B model with an 8B variant to achieve 96% routing accuracy with a 10× size reduction and 70% latency decrease, while query rephrasing accuracy improves by 3.7% with a 40% latency reduction.

Introduction

The authors address the critical need for enterprise AI agents to maintain accuracy and efficiency as user intent and domain data evolve post-deployment. Existing production systems typically rely on static architectures that isolate feedback from model improvement, leading to performance degradation and high latency without enabling cost-effective continuous learning. The authors introduce a MAPE-driven data flywheel framework that operationalizes a closed-loop pipeline within NVIDIA's NVInfo AI assistant to systematically identify failure modes and apply targeted parameter-efficient fine-tuning. By integrating human feedback with automated monitoring and execution, this approach allows the system to self-correct routing and query rephrasing errors, delivering a scalable blueprint for building robust, adaptive agents that improve incrementally based on real-world usage.

Dataset

Dataset Composition and Sources: The authors build the training corpus by combining production user feedback, subject matter expert corrections, and internal enterprise documentation. Primary sources include a thumbs-down feedback loop, SharePoint expert system logs, and corporate knowledge bases covering benefits, IT policies, and organizational information.
Subset Details:
- Routing Error Remediation: The final collection contains 685 deduplicated samples derived from 729 original entries and 32 SME-verified corrections. An LLM-as-a-Judge pipeline initially flagged 140 potential issues, which were manually validated down to 32 high-confidence errors.
- Rephrasal Error Remediation: This subset comprises 5,000 synthetic samples generated from 250 manually reviewed feedback instances. The authors distilled 10 problematic queries down to 4 representative few-shot examples, which guided the synthetic expansion process.
- Regression Evaluation Set: A curated collection of approximately 300 queries spanning corporate policies, benefits, holidays, and IT support. Each entry includes ground truth answers and expected citation metadata.
Data Usage and Splits: The routing dataset is allocated using a 60/40 train/test split, while the rephrasal dataset follows an 80/10/10 train/validation/test split. The regression set remains held out for periodic LLM-as-a-Judge evaluation focusing on correctness, helpfulness, and conscientiousness. The authors leverage these subsets to fine-tune routing logic and query rephrasing capabilities within their enterprise agent.
Processing and Metadata Construction: Data cleaning and normalization are handled by NeMo Curator, with strict PII removal and GDPR/CCPA compliance applied to all query-response pairs. The synthetic generation pipeline uses Llama 3.1 405B as a generator, injecting SharePoint document context and a structured prompt template to produce aligned question, answer, and rephrased query pairs. The final output format includes structured metadata fields such as Thought, Process, Action, and Action Input to guide downstream tool-use fine-tuning.

Method

The authors leverage a modular, Mixture of Experts (MoE) architecture as the foundation for the NVInfo AI system, which serves as NVIDIA's internal enterprise chatbot. This architecture is designed to handle diverse enterprise information requests by routing user queries to specialized expert models. The core of the system is a router module that employs a large language model (Llama 3.1 70B) to classify incoming user queries and direct them to one of seven domain-specific experts: Financial Info, IT Help & HR Benefits, SharePoint, Holidays, Cafe Menu, People, or NVIDIA Policies. This modular design enables task-specific alignment and enhances efficiency by offloading complex queries to the most appropriate model. The query processing pipeline, which operates after routing, includes several critical stages: conversation rephrasing to incorporate context from prior turns, generating multiple query variations to improve retrieval coverage, a semantic retriever that searches across document collections, re-ranking and de-duplication to prioritize relevant results, answer generation, citation generation for source verification, and suggested follow-up question generation to enhance user interaction.

The system's continuous improvement is governed by an Adaptive Data Flywheel, which implements the MAPE-K control loop (Monitor, Analyze, Plan, Execute) to create a self-improving feedback cycle. The monitoring phase collects both direct user feedback, such as thumbs up/down ratings, and implicit signals like re-queries and session abandonment to identify system failures. This data is then fed into the analysis phase, where systematic error attribution techniques, combining manual analysis with automated classification, are used to pinpoint the root cause of failures within the pipeline, such as routing errors or query rephrasing mistakes. The planning phase leverages NVIDIA's NeMo microservices to develop targeted data curation and fine-tuning strategies. Specifically, the authors employ Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning of the router and query rephrasal components using curated failure samples, enabling significant model size reduction and latency improvements without sacrificing accuracy. The execution phase involves deploying these fine-tuned models back into the system, completing the flywheel cycle and enabling continuous optimization.

The data collection process is a critical component of this architecture, capturing both response metrics and user feedback through a unified pipeline. Response metrics, including the original query, generated response, expert selection, and system latency, are stored in a DynamoDB database for observability. User feedback, recorded as thumbs up/down with optional contextual reasons, is stored in a SQL database. These two data streams are ingested into a central Data Lake via a data transformation pipeline, which standardizes the schema and enables comprehensive analysis. This collected data is then used to train the LLM-as-a-Judge model, which is employed to classify and validate routing errors, as seen in the provided prompt example. The system's ability to collect, analyze, and act on this data forms the basis of its adaptive capabilities, allowing it to identify and correct failure modes such as incorrect expert routing, query rephrasing errors, and hallucinations, thereby improving the overall reliability and performance of the enterprise AI system.

Experiment

Evaluated on NVIDIA’s NVInfo bot using production user feedback, the router and rephrasal experiments validated that fine-tuning smaller models can match larger baselines in accuracy while drastically improving response times. Qualitative analysis of user interactions further demonstrated that a continuous data flywheel effectively corrects routing and query expansion failures without requiring extensive retraining. The deployment process highlighted that staged rollouts, robust monitoring, and cross-team coordination are critical for maintaining system stability at scale. Ultimately, the findings confirm that adaptive, feedback-driven AI agents can continuously evolve to deliver reliable enterprise solutions while significantly reducing computational overhead.

The authors compare a fine-tuned Llama 3.1 8B model against a baseline Llama 3.1 70B model for query rephrasal, showing improved accuracy and reduced latency. The results demonstrate that a smaller model can achieve better performance than the larger baseline in this specific task. The fine-tuned Llama 3.1 8B model achieved higher accuracy than the Llama 3.1 70B baseline model. The fine-tuned Llama 3.1 8B model showed reduced latency compared to the Llama 3.1 70B baseline model. A smaller model achieved better performance than the larger baseline model in the query rephrasal task.

The authors analyze user feedback to identify system errors and improve model performance. Results show that routing and rephrasal errors constitute a small fraction of total failures, indicating that targeted refinement of these areas can lead to significant improvements in system accuracy and efficiency. Routing and rephrasal errors combined make up a small percentage of total system failures. The majority of errors are categorized as other, suggesting that non-routing and non-rephrasal issues dominate system failures. The analysis supports focused improvements on routing and rephrasal to enhance overall system performance.

{"summary": "The authors evaluate a data flywheel system for an enterprise AI assistant, focusing on model optimization and error correction through user feedback. Results show significant improvements in model efficiency and accuracy while maintaining high performance across different domains.", "highlights": ["Model size was reduced by 10 times while maintaining high routing accuracy and reducing latency significantly.", "Query rephrasal accuracy improved with a notable reduction in latency, enhancing user experience.", "Analysis of user feedback revealed that routing and rephrasal errors combined made up a small fraction of total failures, indicating targeted improvements were effective."]

The authors evaluate the performance of various fine-tuned models compared to a baseline Llama 3.1 70B model, focusing on accuracy and latency improvements. Results show that smaller models achieve comparable or near-comparable accuracy with significantly reduced latency, demonstrating the effectiveness of fine-tuning for efficiency gains. The experiments highlight a trade-off between model size and performance, with fine-tuned smaller models achieving high accuracy and much lower latency than the large baseline. Smaller fine-tuned models achieve comparable accuracy to the large baseline model while significantly reducing latency. Fine-tuning enables substantial performance improvements in accuracy and speed for smaller models compared to no-tuning. The results demonstrate that optimized smaller models can outperform larger models in terms of efficiency without sacrificing accuracy.

The experiments compare fine-tuned smaller language models against a larger baseline for query rephrasal and routing tasks, while also analyzing user feedback to categorize system errors. These evaluations validate that targeted optimization allows smaller architectures to match or exceed baseline accuracy while substantially improving response speed. Additionally, the error analysis confirms that routing and rephrasal failures represent only a minor portion of total system issues, demonstrating that focused refinements effectively enhance overall reliability. Collectively, the findings establish that efficient model scaling through fine-tuning successfully balances performance with computational demands.

PDF source

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

HyperAI

Exécuter ce Notebook Discuter sur Discord

il y a un an

Aaditya Shukla Sidney Knowles Meenakshi Madugula Dave Farris Ryan Angilly Santiago Pombo Anbang Xu Lu An Abhinav Balasubramanian Tan Yu

Création d'un agent IA basé sur OpenManus + QwQ-32B

20 heures de calcul sur RTX 5090 pour seulement $1 (valeur $7)

Aller à Notebook

Table des matières

Résumé

One-sentence Summary

Key Contributions

This work introduces a MAPE-driven data flywheel framework that operationalizes a closed-loop system for continuous learning in enterprise AI agents. The architecture systematically routes user feedback into the optimization pipeline to enable incremental system evolution.
An empirical analysis of 495 post-deployment feedback samples identifies routing errors (5.25%) and query rephrasing inaccuracies (3.2%) as the primary failure modes. These findings establish a data-driven baseline for prioritizing targeted component optimizations.
A modular implementation blueprint leveraging NVIDIA NeMo microservices executes parameter-efficient fine-tuning to resolve the identified pipeline failures. Targeted optimizations replace a Llama 3.1 70B model with an 8B variant to achieve 96% routing accuracy with a 10× size reduction and 70% latency decrease, while query rephrasing accuracy improves by 3.7% with a 40% latency reduction.

Introduction

Dataset

Dataset Composition and Sources: The authors build the training corpus by combining production user feedback, subject matter expert corrections, and internal enterprise documentation. Primary sources include a thumbs-down feedback loop, SharePoint expert system logs, and corporate knowledge bases covering benefits, IT policies, and organizational information.
Subset Details:
- Routing Error Remediation: The final collection contains 685 deduplicated samples derived from 729 original entries and 32 SME-verified corrections. An LLM-as-a-Judge pipeline initially flagged 140 potential issues, which were manually validated down to 32 high-confidence errors.
- Rephrasal Error Remediation: This subset comprises 5,000 synthetic samples generated from 250 manually reviewed feedback instances. The authors distilled 10 problematic queries down to 4 representative few-shot examples, which guided the synthetic expansion process.
- Regression Evaluation Set: A curated collection of approximately 300 queries spanning corporate policies, benefits, holidays, and IT support. Each entry includes ground truth answers and expected citation metadata.
Data Usage and Splits: The routing dataset is allocated using a 60/40 train/test split, while the rephrasal dataset follows an 80/10/10 train/validation/test split. The regression set remains held out for periodic LLM-as-a-Judge evaluation focusing on correctness, helpfulness, and conscientiousness. The authors leverage these subsets to fine-tune routing logic and query rephrasing capabilities within their enterprise agent.
Processing and Metadata Construction: Data cleaning and normalization are handled by NeMo Curator, with strict PII removal and GDPR/CCPA compliance applied to all query-response pairs. The synthetic generation pipeline uses Llama 3.1 405B as a generator, injecting SharePoint document context and a structured prompt template to produce aligned question, answer, and rephrased query pairs. The final output format includes structured metadata fields such as Thought, Process, Action, and Action Input to guide downstream tool-use fine-tuning.

Method

Experiment

PDF source

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

Command Palette

Roue de données adaptative : Application des boucles de contrôle MAPE à l'amélioration des agents IA

Aaditya Shukla Sidney Knowles Meenakshi Madugula Dave Farris Ryan Angilly Santiago Pombo Anbang Xu Lu An Abhinav Balasubramanian Tan Yu2 more

Création d'un agent IA basé sur OpenManus + QwQ-32B

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

Roue de données adaptative : Application des boucles de contrôle MAPE à l'amélioration des agents IA

Aaditya Shukla Sidney Knowles Meenakshi Madugula Dave Farris Ryan Angilly Santiago Pombo Anbang Xu Lu An Abhinav Balasubramanian Tan Yu2 more

Création d'un agent IA basé sur OpenManus + QwQ-32B

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

Roue de données adaptative : Application des boucles de contrôle MAPE à l'amélioration des agents IA

Aaditya Shukla Sidney Knowles Meenakshi Madugula Dave Farris Ryan Angilly Santiago Pombo Anbang Xu Lu An Abhinav Balasubramanian Tan Yu2 more

Création d'un agent IA basé sur OpenManus + QwQ-32B

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Aaditya Shukla Sidney Knowles Meenakshi Madugula Dave Farris Ryan Angilly Santiago Pombo Anbang Xu Lu An Abhinav Balasubramanian Tan Yu

Aaditya Shukla Sidney Knowles Meenakshi Madugula Dave Farris Ryan Angilly Santiago Pombo Anbang Xu Lu An Abhinav Balasubramanian Tan Yu

Aaditya Shukla Sidney Knowles Meenakshi Madugula Dave Farris Ryan Angilly Santiago Pombo Anbang Xu Lu An Abhinav Balasubramanian Tan Yu