
Insight Agents: An LLM-Based Multi-Agent System for Data Analytics

Jincheng Bai Zhenyu Zhang Jennifer Zhang Zhihuai Zhu

Abstract

Today, e-commerce sellers face several major challenges, including difficulty discovering and effectively using the programs and tools available to them, and difficulty understanding and leveraging the rich data those tools produce. We therefore propose Insight Agents (IA), a conversational multi-agent system for data analytics that delivers personalized information and insights to e-commerce sellers through automated data retrieval. We expect IA to act as a performance multiplier for sellers, driving adoption by reducing required effort and accelerating sound business decisions. In this paper, we present a novel, fully integrated system backed by large language models (LLMs) and built on a plan-and-execute paradigm, designed for broad coverage, high accuracy, and low latency. The system rests on a hierarchical multi-agent architecture comprising a manager agent and two worker agents, a data presenter agent and an insight generator agent, enabling efficient information retrieval and problem solving. We designed a simple yet effective machine-learning solution for the manager agent, combining out-of-domain (OOD) detection using a lightweight encoder-decoder model with agent routing via a BERT-based classifier, to optimize both accuracy and latency. Within the two worker agents, strategic planning over an API-backed data model decomposes queries into granular components to produce more precise answers, while domain-specific knowledge is injected dynamically to strengthen the insight generator. Insight Agents has been deployed for Amazon sellers in the US, where it achieved a high accuracy of 90% under human evaluation, with P90 latency below 15 seconds.

One-sentence Summary

Amazon researchers propose Insight Agents, an LLM-backed multi-agent system using a plan-and-execute framework to deliver personalized, accurate business insights for e-commerce sellers, reducing decision latency to under 15s while achieving 90% human-validated accuracy in the US market.

Key Contributions

  • Insight Agents (IA) is a hierarchical multi-agent LLM system built on a plan-and-execute paradigm, designed to help e-commerce sellers overcome tool discovery and data utilization challenges by delivering personalized, automated business insights with high coverage, accuracy, and low latency.
  • The system employs a manager agent with lightweight OOD detection and BERT-based routing, alongside two worker agents—data presenter and insight generator—that decompose queries into granular API calls and dynamically inject domain knowledge to improve response precision and relevance.
  • Deployed for Amazon US sellers, IA achieves 90% accuracy via human evaluation and maintains P90 latency under 15 seconds, demonstrating practical effectiveness in real-world e-commerce decision support.

Introduction

The authors present a hierarchical multi-agent system powered by large language models that helps e-commerce sellers quickly extract personalized, actionable insights from complex data sources. Prior systems often struggled with accuracy, latency, or coverage when handling diverse seller queries, requiring manual effort and limiting scalability. Their main contribution is Insight Agents (IA), which combines a plan-and-execute architecture with OOD detection and a BERT-based classifier to route queries efficiently, while worker agents use granular API planning and dynamic domain-knowledge injection to deliver high-accuracy responses, achieving 90% accuracy and under-15s latency in production for Amazon sellers.

Dataset

  • The authors use a custom dataset of 301 questions: 178 in-domain (120 for data presenter, 59 for insight generator) and 123 out-of-domain, collected to train OOD detection and agent routing models.
  • To balance training data for the lightweight BERT model, they augment raw in-domain subsets using an LLM, upsampling both data presenter and insight generator questions to 300 each by introducing semantic variations.
  • A separate benchmarking set of 100 carefully selected popular questions with ground truth is used for end-to-end evaluation of the IA system.
  • The model uses “bge-small-en-v1.5” (33M parameters) as the base BERT encoder and “anthropic.claude-3-sonnet-20240229-v1:0” via Amazon Bedrock for LLM augmentation.
  • OOD detection uses a hyperparameter λ = 4 and a 64-dimensional hidden layer; performance is measured via precision, recall, and accuracy.
  • End-to-end IA responses are evaluated by human auditors on three metrics: Relevance (coverage of key question terms), Correctness (accuracy of insights), and Completeness (coverage of required data points).
  • Question-level accuracy is defined as the percentage of questions where all three metrics exceed 0.8.
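The question-level accuracy metric above can be sketched as a small function. This is an illustrative reconstruction of the definition in the text, not the authors' evaluation code; the sample ratings are invented.

```python
# Question-level accuracy as defined in the paper summary: a question
# counts as correct only if all three human-rated metrics (relevance,
# correctness, completeness) exceed 0.8.

def question_level_accuracy(scores, threshold=0.8):
    """scores: list of (relevance, correctness, completeness) tuples."""
    if not scores:
        return 0.0
    correct = sum(1 for s in scores if all(m > threshold for m in s))
    return correct / len(scores)

# Hypothetical ratings: 3 of 4 questions pass on all three metrics.
ratings = [
    (0.95, 0.90, 1.00),
    (0.85, 0.99, 0.92),
    (0.70, 0.95, 0.88),  # fails on relevance
    (0.91, 0.86, 0.99),
]
print(question_level_accuracy(ratings))  # 0.75
```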

Method

The authors leverage a hierarchical manager-worker multi-agent architecture to construct the Insight Agent (IA) system, designed to deliver accurate and low-latency business insights to sellers through conversational interaction. The overall framework, illustrated in the first figure, consists of a central Manager Agent that orchestrates two distinct worker agents: the Data Presenter Agent and the Insight Generator Agent. This structure enables the system to decompose incoming queries into appropriate resolution paths based on their nature, ensuring efficient and targeted processing.

Upon receiving a query from a seller, the Manager Agent first performs an Out-of-Domain (OOD) detection check to determine if the request falls within the scope of data insight capabilities. This initial screening uses a specialized auto-encoder (AE) model trained on in-domain questions to compute a reconstruction error, which is then compared against a threshold derived from the mean and standard deviation of the in-domain loss distribution. This high-precision filter ensures that only potentially valid requests proceed, minimizing unnecessary processing. Concurrently, the agent router, a lightweight BERT-based classifier, categorizes the query to route it to the appropriate resolution path. The query augmenter then refines the input by resolving ambiguities, particularly concerning time ranges, by injecting contextual information such as the current date and calendar week definitions into the prompt.
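The OOD screen described above reduces to a reconstruction-error threshold. The sketch below shows that thresholding logic under stated assumptions: `reconstruct` stands in for the trained auto-encoder (the paper uses a 64-dimensional hidden layer and λ = 4), and the toy error values are invented.

```python
import numpy as np

# Sketch of the manager agent's OOD screen: a query is flagged as
# out-of-domain when its auto-encoder reconstruction error exceeds
# mean + lambda * std of the in-domain error distribution.

def fit_threshold(in_domain_errors, lam=4.0):
    """Derive the decision threshold from in-domain reconstruction errors."""
    errs = np.asarray(in_domain_errors, dtype=float)
    return errs.mean() + lam * errs.std()

def reconstruction_error(x, reconstruct):
    """Mean squared error between an embedding and its reconstruction."""
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - reconstruct(x)) ** 2))

def is_out_of_domain(x, reconstruct, threshold):
    return reconstruction_error(x, reconstruct) > threshold

# Toy usage: an identity "auto-encoder" reconstructs perfectly, so the
# query passes the screen.
identity = lambda x: x
tau = fit_threshold([0.01, 0.02, 0.015], lam=4.0)
print(is_out_of_domain(np.ones(8), identity, tau))  # False
```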

The system then branches into two parallel workflows. The Data Presenter Agent, responsible for descriptive analytics, employs a data workflow planner that uses a large language model (LLM) to decompose the query into executable steps. This planner leverages a robust data model based on the company's internal APIs, which provides a structured and precise method for data retrieval compared to unstructured text-based approaches. The planner performs task decomposition, selects the appropriate APIs or functions, and generates the necessary input payloads. The Data Workflow Executor then carries out the retrieval and aggregation via the selected APIs, followed by post-processing tasks like data formatting and column matching. The final response is generated through standard prompting, guided by few-shot examples to ensure the correct output format.
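The plan-and-execute pattern above can be sketched minimally. All API names and payloads here are hypothetical; in the real system the plan is produced by an LLM against the internal API-backed data model rather than by keyword matching.

```python
# Minimal sketch of the data presenter workflow: a planner decomposes the
# query into API steps with payloads, and an executor runs them in order.

def plan_query(query):
    # Stand-in for the LLM planner's task decomposition and API selection.
    if "sales" in query and "last week" in query:
        return [
            {"api": "get_sales", "payload": {"range": "last_week"}},
            {"api": "format_table", "payload": {"columns": ["day", "units"]}},
        ]
    return []

# Hypothetical API registry mapping step names to callables.
API_REGISTRY = {
    "get_sales": lambda payload, data: {"rows": data},
    "format_table": lambda payload, data: {
        "table": [dict(zip(payload["columns"], row)) for row in data["rows"]]
    },
}

def execute_plan(plan, seed_data):
    """Run each planned step, threading results through the pipeline."""
    result = seed_data
    for step in plan:
        result = API_REGISTRY[step["api"]](step["payload"], result)
    return result

plan = plan_query("show my sales last week")
out = execute_plan(plan, [("Mon", 12), ("Tue", 9)])
print(out["table"][0])  # {'day': 'Mon', 'units': 12}
```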

The Insight Generator Agent, designed for diagnostic analysis, follows a similar planning and execution process but with domain-specific enhancements. Its data workflow planner also uses LLM-based task decomposition and planning, but it incorporates domain-aware routing to select the appropriate analytical techniques, such as benchmarking or trend analysis. This routing is achieved through a few-shot learning-based LLM classifier that directs the query to predefined resolution paths. The execution phase involves retrieval via API and function calling, with the addition of analytical tools for data transformation. The generation process for the Insight Generator is more complex, utilizing customized prompting that incorporates domain-specific knowledge, prompt templates, and few-shot examples provided by domain experts to produce comprehensive and contextually relevant insights. Both agents conclude with a response guardrail step to prevent the exposure of sensitive or harmful content.
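The customized prompting step can be sketched as prompt assembly: domain knowledge, a template, and few-shot examples combined before the LLM call. The template text, knowledge snippets, and example below are invented for illustration and are not the authors' actual prompts.

```python
# Sketch of the insight generator's prompt assembly: domain knowledge and
# expert-provided few-shot examples are injected into a fixed template.

INSIGHT_TEMPLATE = """You are an e-commerce analytics assistant.
Domain knowledge:
{knowledge}

Examples:
{examples}

Seller data:
{data}

Question: {question}
Provide a diagnostic insight grounded only in the data above."""

def build_insight_prompt(question, data, knowledge_snippets, few_shots):
    return INSIGHT_TEMPLATE.format(
        knowledge="\n".join(f"- {k}" for k in knowledge_snippets),
        examples="\n".join(few_shots),
        data=data,
        question=question,
    )

prompt = build_insight_prompt(
    question="Why did my conversion rate drop last month?",
    data="conversion_rate: 2.1% -> 1.6%",
    knowledge_snippets=["Conversion dips often follow price increases."],
    few_shots=["Q: Why did traffic fall? A: Traffic fell after the ..."],
)
print("Domain knowledge:" in prompt)  # True
```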

Experiment

  • AE-based OOD detection achieves <0.01s per sample with higher precision than LLM methods; recall can be improved by expanding the in-domain training set.
  • Branch routing attains 83% accuracy on 178 in-domain samples with 0.3s latency per case, outperforming an LLM-based classifier (60% accuracy, >2s latency).
  • Human evaluation on 100 questions shows 89.5% overall question-level accuracy and 13.56s P90 end-to-end latency, with 57 questions deemed in-scope.

The authors use a finetuned BERT model for branch routing, achieving an accuracy of 0.83 with a running time of 0.31 seconds per case, which significantly outperforms the LLM-based few-shot approach in both accuracy and speed. Results show that the finetuned BERT model reduces latency by over 85% while improving classification accuracy by 0.23 compared to the LLM-based method.

The authors use an auto-encoder-based method for out-of-distribution detection, which achieves higher precision and significantly faster running time compared to the LLM-based few-shot approach. Results show the auto-encoder model attains 0.969 precision and 0.721 recall with a running time of 0.009 seconds, while the LLM-based method has lower precision (0.616) and higher latency (1.665 seconds).

The authors use a human evaluation to assess end-to-end IA response quality, with 100 questions evaluated and 57 classified as in-scope. Results show a question-level accuracy of 89.5%, with 51 correct responses and 6 incorrect ones, indicating high overall performance.

On the same evaluation set, the 57 in-scope responses show high average scores across relevance (0.977), correctness (0.958), and completeness (0.993), with low standard deviations indicating consistent performance.

