
Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs

Abstract

Data preparation aims to denoise raw datasets, uncover relationships among multiple datasets, and extract valuable insights from them; it is indispensable across the wide range of data-centric application domains. Recently, driven by (i) growing demand for data that is ready for uses such as analysis, visualization, and decision-making, (ii) advances in increasingly powerful large language model (LLM) technologies, and (iii) the emergence of infrastructure that enables flexible agent construction (e.g., building on Databricks Unity Catalog), LLM-powered data preparation methods have been rapidly evolving into a transformative and dominant paradigm. This paper systematically reviews this evolving field by surveying and analyzing several hundred recent academic papers, focusing in particular on the use of LLM techniques to prepare data for diverse downstream tasks. We first identify a fundamental paradigm shift from rule-based, model-specific pipelines to prompt-driven, context-aware, and agentic preparation workflows. We then present a task-centric taxonomy that organizes the field into three primary tasks: data cleaning (e.g., standardization, error processing, missing-value imputation), data integration (e.g., entity matching, schema matching), and data enrichment (e.g., data annotation, profiling). For each task, we survey representative methods and make their strengths explicit (e.g., improved generalization, deeper semantic understanding) alongside their challenges (e.g., the substantial cost of scaling LLMs, hallucinations that persist even in sophisticated agents, and the gap between advanced methods and weak evaluation metrics). We further analyze commonly used datasets and evaluation metrics (the empirical side). Finally, we discuss open research problems and present a forward-looking roadmap that emphasizes scalable LLM-data systems, principled design of reliable agentic workflows, and the construction of robust evaluation protocols.

One-sentence Summary

Wei Zhou et al. survey LLM-enhanced data preparation, proposing a task-centric taxonomy for cleaning, integration, and enrichment; they chart the shift from rule-based pipelines to agentic, context-aware workflows, highlight innovations such as retrieval-augmented generation and hybrid LLM-ML systems, and address scalability, hallucination, and evaluation gaps across enterprise analytics and ML pipelines.

Key Contributions

  • LLMs enable a paradigm shift from manual, rule-based data preparation to instruction-driven, agentic workflows that reduce human effort by allowing natural language prompts and automated pipeline orchestration for tasks like cleaning and integration.
  • The paper introduces a task-centric taxonomy covering data cleaning, integration, and enrichment, highlighting how LLMs improve semantic reasoning—resolving ambiguities like synonyms or domain terms—where traditional methods fail due to syntactic or statistical limitations.
  • Through analysis of recent literature and empirical benchmarks, the survey identifies key challenges including LLM hallucinations, scalability costs, and weak evaluation protocols, while calling for scalable systems, reliable agent designs, and robust evaluation frameworks.

Introduction

The survey addresses the growing need for application-ready data in analytics, decision-making, and real-time systems by examining how large language models can transform data preparation. Traditional methods suffer from heavy reliance on manual rules, limited semantic understanding, poor generalization across domains, and high labeling costs, all of which hinder scalability and adaptability. In response, the authors organize LLM-enhanced techniques around three core tasks: cleaning, integration, and enrichment. They introduce a task-centric taxonomy, highlight how LLMs enable prompt-driven and agentic workflows, and identify key limitations, including hallucinations, high inference costs, and weak evaluation protocols, while outlining a roadmap toward scalable, reliable, and cross-modal data preparation systems.

Dataset

The authors use a structured dataset taxonomy to evaluate LLM-enhanced data preparation methods, grouping benchmarks by processing granularity: record-level, schema-level, and object-level.

  • Record-Level Datasets: Focus on individual tuples, cells, or tuple pairs. Used for cleaning, error detection, imputation, and entity matching. Examples include:

    • Tuple-level: Adult Income, Hospital, Beers, Flights, Enron Emails
    • Column-level: Paycheck Protection Program, Chicago Food Inspection
    • Cell-level: Buy, Restaurant, Walmart
    • Tuple-pair: Abt-Buy, Amazon–Google, Walmart–Amazon, DBLP–Scholar, DBLP–ACM, WDC Products
  • Schema-Level Datasets: Align columns or attributes across heterogeneous schemas. Used for schema matching in domains like healthcare and biomedicine. Examples:

    • OMOP, Synthea, MIMIC (clinical)
    • GDC-SM, ChEMBL-SM (scientific/biomedical)
  • Object-Level Datasets: Treat entire tables or documents as units. Used for profiling and annotation requiring global context. Examples:

    • Table-level: Public BI, Adventure Works, ChEMBL-DP, Chicago Open Data, NQ-Tables, FetaQA
    • Document-level: AG-News, DBpedia, CoNLL-2003, WNUT-17

The authors do not train models on these datasets directly but use them to benchmark LLM-enhanced methods across cleaning, integration, and enrichment tasks. No dataset-specific preprocessing or metadata construction is described beyond the datasets' inherent structure and evaluation metrics. Processing involves applying LLMs to standardize formats, detect and repair errors, impute missing values, match entities and schemas, annotate semantics, and profile metadata, all aimed at improving downstream analytical and ML tasks.

Method

The authors present a comprehensive framework for leveraging large language models (LLMs) in data preparation, structured around three core domains: data cleaning, data integration, and data enrichment. The overall architecture, as illustrated in the framework diagram, positions these domains as foundational processes that transform heterogeneous and raw data into structured, trustworthy, and enriched forms suitable for downstream applications such as visual analytics, enterprise business intelligence, model training, fraud monitoring, and data marketplaces. The framework highlights that data cleaning addresses issues like inconsistent standards, errors, and incompleteness, while data integration resolves semantic ambiguities and schema conflicts, and data enrichment enhances datasets with metadata and annotations. These processes are interconnected, with outputs from one domain often serving as inputs to another, forming a cohesive pipeline for data preparation.

The data cleaning module, as shown in the framework diagram, encompasses three primary tasks: data standardization, error processing, and imputation. For data standardization, the authors describe a prompt-based end-to-end approach where structured prompts guide the LLM to transform heterogeneous data into a unified format. This includes instruction-guided prompting, which uses manually crafted instructions and in-context examples, and reasoning-enhanced batch processing, which employs step-wise reasoning and batch-wise prompting to improve robustness and efficiency. An alternative paradigm is tool-assisted agent-based standardization, where LLM agents coordinate and execute standardization pipelines by generating and executing API calls or code, as depicted in the example of CleanAgent. For data error processing, the framework outlines a two-stage process of detection and correction. This is achieved through prompt-based end-to-end processing, which uses structured prompts to guide the LLM through iterative workflows, and task-adaptive fine-tuning, where LLMs are fine-tuned on synthetic or context-enriched datasets to learn specific error patterns. A hybrid approach integrates LLMs with machine learning models, using the LLM to generate pseudo-labels for training a lightweight ML classifier or to derive interpretable decision structures that guide ML-based error detection.
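As a minimal sketch of the prompt-based standardization paradigm described above, the snippet below combines a manually crafted instruction, in-context examples, and batch-wise prompting. The `call_llm` stub, the prompt wording, and the date-format task are illustrative assumptions rather than any surveyed system's actual interface.

```python
# Minimal sketch of instruction-guided, batch-wise data standardization.
# `call_llm` is a hypothetical stand-in for any chat-completion client;
# the instruction, examples, and target format are illustrative only.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in an actual LLM API call here

def standardize_dates(values: list[str]) -> list[str]:
    """Normalize heterogeneous date strings to ISO 8601 via one batched prompt."""
    instruction = (
        "Rewrite each date below in ISO 8601 (YYYY-MM-DD) format. "
        "Return exactly one date per line, in the input order."
    )
    # In-context examples steer the model toward the unified format.
    examples = "3/14/2021 -> 2021-03-14\n14 Mar 2021 -> 2021-03-14"
    # Batch-wise prompting: many values per call reduces cost and latency.
    batch = "\n".join(values)
    prompt = f"{instruction}\n\nExamples:\n{examples}\n\nInputs:\n{batch}"
    return call_llm(prompt).strip().splitlines()
```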

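The hybrid LLM-ML error-processing strategy above can also be made concrete: an LLM pseudo-labels a small sample of cells, and a lightweight classifier trained on those labels scales detection to the full dataset. In this sketch, the `call_llm` stub, the yes/no framing, and the TF-IDF featurization are assumptions for illustration, not the surveyed systems' designs.

```python
# Minimal sketch of hybrid LLM-ML error detection: the LLM pseudo-labels a
# small sample; a cheap classifier generalizes those labels to all cells.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in an actual LLM API call here

def pseudo_label(cell: str, column: str) -> int:
    """Ask the LLM whether a cell value is erroneous (1) or clean (0)."""
    prompt = f"In a column named '{column}', is the value '{cell}' erroneous? Answer yes or no."
    return 1 if call_llm(prompt).strip().lower().startswith("yes") else 0

def train_detector(sample_cells: list[str], column: str) -> Pipeline:
    labels = [pseudo_label(c, column) for c in sample_cells]   # costly LLM calls, sample only
    detector = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),  # char n-grams capture format errors
        LogisticRegression(max_iter=1000),
    )
    return detector.fit(sample_cells, labels)                  # cheap model scales to the full column
```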
The data integration module focuses on entity and schema matching. For entity matching, the framework describes a prompt-based end-to-end approach that uses structured prompts to guide LLMs in determining if two records refer to the same real-world entity. This includes guidance-driven in-context matching, which uses expert-defined rules and multi-step pipelines, and batch-clustering matching, which processes multiple entity pairs simultaneously to enhance efficiency. Task-adaptive-tuned matching involves fine-tuning LLMs using reasoning traces distilled from larger models or by improving training data quality. A multi-model collaborative approach coordinates multiple models, such as a lightweight ranker and a stronger LLM, to achieve scalable and consistent matching. For schema matching, the framework presents a prompt-based end-to-end method that uses structured prompts to identify correspondences between column names across different schemas. This is enhanced by retrieval-enriched contextual matching, which augments LLM inputs with context from external retrieval components like knowledge graphs, and model-optimized adaptive matching, which uses modality-aware fine-tuning and specialized architectural components like table encoders. An agent-guided orchestration approach uses LLM agents to manage and coordinate the entire schema matching pipeline, either through role-based task partitioning or tool-planning mechanisms.
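A minimal sketch of the prompt-based entity matching described above follows, again assuming a hypothetical `call_llm` client. Serializing records as attribute-value pairs and forcing a strict yes/no answer is a common pattern, but the exact prompt and the example pair are illustrative choices.

```python
# Minimal sketch of prompt-based entity matching: serialize two records and
# ask the LLM for a strict yes/no verdict on whether they co-refer.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in an actual LLM API call here

def serialize(record: dict[str, str]) -> str:
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def is_match(record_a: dict[str, str], record_b: dict[str, str]) -> bool:
    """Return True if the LLM judges the two records to be the same entity."""
    prompt = (
        "Do the following two records refer to the same real-world entity? "
        "Answer strictly 'yes' or 'no'.\n"
        f"Record A: {serialize(record_a)}\n"
        f"Record B: {serialize(record_b)}"
    )
    return call_llm(prompt).strip().lower().startswith("yes")

# Invented example pair in the style of product-matching benchmarks:
# is_match({"title": "HP 61 Black Ink Cartridge", "price": "14.99"},
#          {"title": "HP 61 Ink, Black", "price": "14.49"})
```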

The data enrichment module addresses data annotation and profiling. For data annotation, the framework describes a prompt-based end-to-end approach that uses structured prompts to guide LLMs in assigning semantic or structural labels to data instances. This includes instruction-guided annotation and reasoning-enhanced iterative annotation, which uses step-by-step reasoning and self-assessment. A retrieval-assisted contextual annotation approach enriches prompts by retrieving relevant context from semantically similar examples or external knowledge graphs. Fine-tuned augmented annotation improves performance in specialized domains by fine-tuning LLMs on task-specific datasets. A hybrid LLM-ML annotation approach combines LLMs with machine learning models, using the LLM to generate candidate annotations that are then distilled and filtered by a smaller model. Tool-assisted agent-based annotation uses LLM agents augmented with specialized tools to handle complex annotation tasks. For data profiling, the framework outlines a prompt-based end-to-end approach that uses carefully designed prompts to guide LLMs in generating dataset descriptions, schema summaries, and hierarchical organization. This includes instruction and constraint-based profiling and example and reasoning-enhanced profiling, which combines few-shot examples with Chain-of-Thought reasoning. A retrieval-assisted contextual profiling approach combines multiple retrieval techniques with LLM reasoning to improve profiling accuracy, especially when metadata is sparse.
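To ground the retrieval-assisted annotation approach, the sketch below retrieves the k most similar labeled exemplars by embedding cosine similarity and injects them into the prompt. The `embed` and `call_llm` stubs and the column-typing task are illustrative assumptions.

```python
# Minimal sketch of retrieval-assisted contextual annotation: nearest labeled
# exemplars (by cosine similarity of embeddings) are injected into the prompt.
import math

def embed(text: str) -> list[float]:
    raise NotImplementedError  # plug in an actual text-embedding model here

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in an actual LLM API call here

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def annotate_column(column_text: str, labeled_pool: list[tuple[str, str]], k: int = 3) -> str:
    """Assign a semantic type label to a column, guided by k retrieved exemplars."""
    query = embed(column_text)
    nearest = sorted(labeled_pool, key=lambda ex: cosine(query, embed(ex[0])), reverse=True)[:k]
    context = "\n".join(f"Column: {text} -> Label: {label}" for text, label in nearest)
    prompt = (
        "Assign a semantic type label to the final column, following the examples.\n"
        f"{context}\nColumn: {column_text} -> Label:"
    )
    return call_llm(prompt).strip()
```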

Experiment

  • Evaluated data preparation methods across four dimensions: correctness, robustness, ranking quality, and semantic consistency.
  • Correctness metrics include Accuracy, Precision, and F1-score for reliability, and Recall and Matching Rate for coverage, validated on tasks such as entity matching and error detection (see the metric sketch after this list).
  • Robustness assessed via ROC and AUC, measuring consistent performance across varying data distributions in error processing tasks.
  • Ranking quality measured by P@k and MRR for retrieval utility, and Recall@GT, 1−α, Hit Rate for enrichment completeness in annotation and matching tasks.
  • Semantic preservation evaluated using ROUGE (lexical overlap) and Cosine Similarity (embedding alignment), ensuring output consistency with reference content in standardization and profiling tasks.
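To make two of the listed metrics concrete, here is a toy sketch of F1 and MRR computation. Real evaluations would rely on established libraries; the signatures and input conventions here are illustrative.

```python
# Toy implementations of F1 (correctness) and MRR (ranking quality), two of
# the metrics listed above; signatures and input conventions are illustrative.

def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def mean_reciprocal_rank(rankings: list[list[str]], gold: list[str]) -> float:
    """Average 1/rank of the first correct answer per query; 0 if absent."""
    total = 0.0
    for ranking, answer in zip(rankings, gold):
        for rank, item in enumerate(ranking, start=1):
            if item == answer:
                total += 1.0 / rank
                break
    return total / len(gold)

# Example: precision_recall_f1(tp=8, fp=2, fn=4) -> (0.8, ~0.667, ~0.727)
```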

Taken together, the evaluation framework spans correctness, robustness, ranking quality, and semantic consistency. Results show that while many methods score highly on individual dimensions such as correctness or ranking quality, no single approach consistently outperforms the others across all of them, highlighting the trade-offs between evaluation criteria.

Across the cleaning, integration, and enrichment benchmarks, metric choice varies markedly by task and data type: Precision, Recall, and F1-score are the most common, while specialized metrics such as ROC-AUC, MRR, and ROUGE-1 are reserved for specific scenarios such as error detection and text-based enrichment.

