Large Language Models for Data Synthesis

Generating synthetic data that faithfully captures the statistical structure of real-world distributions is a fundamental challenge in data modeling. Classical approaches often depend on strong parametric assumptions or manual structural design and struggle in high-dimensional or heterogeneous domains. Recent progress in Large Language Models (LLMs) reveals their potential as flexible, high-dimensional priors over real-world distributions. However, when applied to data synthesis, standard LLM-based sampling is inefficient, constrained by fixed context limits, and fails to ensure statistical alignment. To address these limitations, we introduce LLMSynthor, a general framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. LLMSynthor treats the LLM as a nonparametric copula simulator for modeling high-order dependencies and introduces LLM Proposal Sampling to generate grounded proposal distributions that improve sampling efficiency without requiring rejection. By minimizing discrepancies in the summary-statistics space, the iterative synthesis loop aligns real and synthetic data while gradually uncovering and refining the latent generative structure. We evaluate LLMSynthor in both controlled and real-world settings on heterogeneous datasets from privacy-sensitive domains (e.g., e-commerce, population, and mobility) spanning structured and unstructured formats. The synthetic data produced by LLMSynthor exhibits high statistical fidelity, practical utility, and adaptability across data types, positioning it as a valuable tool for economics, social science, urban studies, and beyond.
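
The iterative loop the abstract describes, in which the LLM proposes synthetic records and distributional feedback in the summary-statistics space steers the next proposal, can be pictured with a short Python sketch. This is a toy illustration under strong assumptions, not the paper's implementation: every name here (synthesize, summary_stats, total_discrepancy, toy_llm) is hypothetical, the LLM call is replaced by a Gaussian stand-in, and the summary statistics are reduced to marginal means, whereas the actual framework would use richer statistics and a real LLM-generated proposal distribution.

    import random
    from typing import Callable, Optional

    Record = dict  # one synthetic record, e.g. {"age": 34.0, "spend": 120.0}

    def summary_stats(data: list, keys: list) -> dict:
        # Toy summary-statistics space: marginal means per field.
        return {k: sum(r[k] for r in data) / len(data) for k in keys}

    def total_discrepancy(real: dict, synth: dict) -> float:
        # L1 distance between real and synthetic summary statistics.
        return sum(abs(real[k] - synth[k]) for k in real)

    def synthesize(llm_propose: Callable[[dict, Optional[dict]], list],
                   real_stats: dict, n_iters: int = 10,
                   tol: float = 0.05) -> list:
        # Iterative synthesis loop: sample from the proposal, compare summary
        # statistics to the real data, and feed the gap back so the next
        # proposal is better aligned. Every proposed record is kept; only the
        # proposal itself is refined, so no rejection step is needed.
        feedback = None
        synthetic: list = []
        for _ in range(n_iters):
            synthetic = llm_propose(real_stats, feedback)  # grounded proposal
            synth_stats = summary_stats(synthetic, list(real_stats))
            if total_discrepancy(real_stats, synth_stats) < tol:
                break
            delta = {k: real_stats[k] - synth_stats[k] for k in real_stats}
            # Accumulate feedback in the summary-statistics space.
            feedback = {k: (feedback[k] if feedback else 0.0) + delta[k]
                        for k in real_stats}
        return synthetic

    # Gaussian stand-in for the LLM: deliberately biased, nudged by feedback.
    def toy_llm(real_stats: dict, feedback: Optional[dict]) -> list:
        shift = feedback or {k: 0.0 for k in real_stats}
        return [{k: random.gauss(0.8 * real_stats[k] + shift[k], 1.0)
                 for k in real_stats} for _ in range(200)]

    data = synthesize(toy_llm, real_stats={"age": 40.0, "spend": 120.0})

In this sketch the feedback is simply the accumulated gap between real and synthetic means, which corrects the stand-in sampler's bias within a few iterations; in the framework described above, the analogous feedback would instead be rendered as conditioning context that grounds the LLM's next proposal distribution.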