Command Palette
Search for a command to run...
T2S-Bench 및 구조적 사고 (Structure-of-Thought): 포괄적인 텍스트-구조 추론에 대한 벤치마킹 및 프롬프트 엔지니어링
T2S-Bench 및 구조적 사고 (Structure-of-Thought): 포괄적인 텍스트-구조 추론에 대한 벤치마킹 및 프롬프트 엔지니어링
초록
인간이 복잡한 읽기 과제를 수행할 때 핵심 포인트를 표시하고, 이들 간의 관계를 추론하며, 정보의 구조를 구성하여 이해와 응답을 이끈다는 점을 고려해 보십시오. 이와 유사하게, 대규모 언어 모델도 텍스트 구조를 활용하여 텍스트 처리 성능을 향상시킬 수 있을까요? 이를 탐구하기 위해 본 연구에서는 먼저 '사고의 구조화 (Structure of Thought, SoT)'라는 프롬프팅 기법을 제안합니다. SoT 는 모델이 명시적으로 중간 텍스트 구조를 구축하도록 안내하여, 8 가지 과제와 3 개 모델 계열 전반에 걸쳐 일관되게 성능을 향상시킵니다. 이러한 통찰을 바탕으로, 우리는 모델의 텍스트 - 구조 변환 (text-to-structure) 능력을 평가하고 개선하기 위해 설계된 최초의 벤치마크인 T2S-Bench 를 소개합니다. T2S-Bench 는 6 개 과학 분야와 32 가지 구조 유형에 걸쳐 1,800 개의 샘플로 구성되었으며, 정확성, 공정성, 품질을 보장하기 위해 엄격하게 구축되었습니다. 45 개의 주류 모델에 대한 평가 결과, 상당한 개선 여지가 드러났습니다. 다단계 추론 (multi-hop reasoning) 과제에서의 평균 정확도는 불과 52.1% 에 그쳤으며, 가장 첨단 모델조차도 엔드 - 투 - 엔드 추출에서 58.1% 의 노드 정확도만을 달성했습니다. 더 나아가 Qwen2.5-7B-Instruct 모델에서 SoT 만 적용해도 8 가지 다양한 텍스트 처리 과제에서 평균 5.7% 의 성능 향상을 보였으며, T2S-Bench 를 통해 미세조정 (fine-tuning) 을 수행할 경우 이 이득은 8.6% 로 추가 증가했습니다. 이러한 결과는 명시적인 텍스트 구조화의 가치와 SoT 및 T2S-Bench 의 상호 보완적 기여를 강조합니다. 데이터셋 및 평가 코드는 https://t2s-bench.github.io/T2S-Bench-Page/ 에서 공개되었습니다.
One-sentence Summary
Researchers from Duke University, UT Austin, and Meta propose Structure of Thought, a prompting technique that guides models to build intermediate text structures, alongside T2S-Bench, the first benchmark for text-to-structure capabilities, which significantly boosts performance across diverse scientific domains and reasoning tasks.
Key Contributions
- Current large language models struggle with complex text processing due to a lack of stable intermediate representations, prompting the need for a universal approach to structure information before generating answers.
- The authors introduce Structure of Thought (SoT), a prompting technique that guides models to construct intermediate text structures, and T2S-Bench, the first benchmark containing 1.8K samples across six scientific domains to evaluate these capabilities.
- Evaluations on 45 mainstream models reveal significant performance gaps, while experiments show that SoT alone improves accuracy by 5.7% and fine-tuning on the new benchmark further increases gains to 8.6% across diverse tasks.
Introduction
Large language models are increasingly deployed in critical workflows like scientific literature review and evidence-based decision making, yet they often struggle with complex, long-context tasks because they rely on unstable end-to-end generation without stable intermediate representations. Prior attempts to improve performance through structured reasoning or task-specific extraction modules have failed to generalize across diverse text types and lack a unified evaluation framework. To address these gaps, the authors introduce Structure of Thought, a prompting technique that guides models to explicitly construct intermediate text structures before answering, and T2S-Bench, the first comprehensive benchmark designed to evaluate and enhance text-to-structure capabilities across multiple scientific domains.
Dataset
T2S-Bench Dataset Overview
The authors introduce T2S-Bench, a comprehensive dataset designed to evaluate and train models on text-to-structure capabilities using high-quality academic sources. The construction process addresses verification challenges by leveraging rigorously validated scientific diagrams and their corresponding texts.
-
Dataset Composition and Sources
- The primary data source consists of academic papers across six major scientific domains: Computer Science, Life Sciences, Social Sciences, Environmental Sciences, Economics & Management Sciences, and Physical Sciences.
- The dataset covers 17 sub-disciplines and 32 distinct structural types, ensuring broad topical diversity.
- All text-structure pairs are derived from real-world diagrams that have been meticulously designed by authors and validated by peer reviewers.
-
Key Details for Each Subset
- T2S-Bench-MR (Multi-hop Reasoning): Contains approximately 1,700 high-quality text-structure-question triples. Each entry includes a text segment, a reference diagram, and a multiple-choice question requiring multi-step reasoning. Questions are categorized into four types: Fault Localization, Functional Mapping, Boundary Testing, and Counterfactual Reasoning.
- T2S-Bench-E2E (End-to-End Extraction): Comprises 87 rigorously vetted Text-KeyStructure pairs. This subset focuses on extracting key nodes and links while filtering noise, with graph complexity controlled to ensure fair evaluation.
- T2S-Train: A training split of 1,200 instruction-answer pairs derived from the collected data, used for fine-tuning structure-aware models.
-
Data Usage and Processing
- Construction Pipeline: The authors employ a four-module automated pipeline involving paper search, PDF download, figure cropping, and structural validity checks using advanced models like GPT-5.2 and Gemini-2.5-Pro.
- Human Verification: Three rounds of human filtering by PhD-level experts ensure quality. The first round removes noisy diagrams, the second validates question solvability and logic, and the third confirms the alignment between text and key structures.
- Split Strategy: The final benchmark uses a stratified 7:3 split by domain. The test set (T2S-Bench-MR) contains 500 samples for multiple-choice evaluation, while the E2E set remains separate for structure extraction tasks.
- Evaluation Metrics: The authors use Exact Match (EM) and F1 scores for multiple-choice tasks. For E2E tasks, they evaluate node extraction via average semantic similarity and link extraction via F1 scores on predicted link pairs.
-
Cropping and Metadata Details
- Figure Cropping: Figures are extracted from PDFs using
pdffigures2and validated by GPT-4o to confirm structural relevance before further processing. - Text Segmentation: The pipeline ensures that at least three text segments correspond clearly to each structural diagram, with start and end sentences explicitly identified.
- Partial Constraint Strategy: To handle the one-to-many mapping problem in E2E tasks, the authors evaluate nodes and links separately. Models are provided with either all node information to predict links or all link information to predict nodes, standardizing outputs for accurate assessment.
- Figure Cropping: Figures are extracted from PDFs using
Method
The authors propose a systematic pipeline for constructing the T2S (Text-to-Structure) dataset, which is designed to evaluate multi-hop reasoning capabilities over structural diagrams. The overall workflow, depicted in the framework diagram, consists of three main phases: Sample Collection, T2S Multi-hop Reasoning Dataset Construction, and T2S-Bench-E2E Construction.
During the Sample Collection phase, the system searches for academic papers across diverse domains including Computer, Life, and Social sciences. It identifies structural figures, downloads the corresponding PDFs, and performs automated validity checks to ensure the diagrams can be represented as connected node-link graphs. A human filter is subsequently applied to verify entity references and diagram quality, resulting in a curated set of high-quality text-structure pairs.
The dataset construction phase relies on a template-based approach to generate reasoning questions. As shown in the figure below:
The authors define a specific set of Counterfactual Reasoning templates, labeled CR-1 through CR-8, which guide the generation of questions involving operations such as removing edges, flipping polarity, or disabling feedback loops. Similar templates are utilized for Fault Localization and Functional Mapping tasks. These templates ensure that the generated questions require multi-step structural reasoning rather than simple information retrieval.
To execute these complex tasks, the system employs an LLM-based architecture with advanced memory management capabilities. Refer to the system architecture diagram for the detailed component layout.
The architecture features a Function Executor that interacts with Archival Storage and a Queue Manager that handles a FIFO Queue for prompt tokens. This design allows the system to maintain context and manage long-running processes such as paper search and schema normalization. The pipeline incorporates strict quality control steps, including Model Checks for correctness and text dependency, followed by a Human Filter to ensure node and link constraints are met.
Finally, the evaluation process utilizes specific prompt contracts for API models. The system enforces a strict output schema for multiple-choice QA and employs two-stage prompts for structure evaluation: one for node labeling and another for link extraction. This ensures robust parsing of the model's responses during benchmarking.
Experiment
- The Structure of Thought (SoT) prompting strategy was evaluated against Direct Answer and Chain of Thought methods, validating that explicitly forcing models to structure text into nodes and links significantly improves performance across diverse text-processing tasks and model families.
- Benchmarking 45 models on T2S-Bench revealed that while proprietary models currently lead, instruction-tuned open-source models are rapidly closing the gap, though all models struggle most with identifying correct nodes compared to linking them.
- Fine-tuning experiments demonstrated that enhancing a model's ability to extract text structures directly translates to improved performance on downstream long-context reasoning tasks, confirming that structural understanding is a fundamental prerequisite for effective multi-hop reasoning.
- Analysis of structural complexity showed that model accuracy declines sharply as the number of nodes in a graph increases, indicating that current systems lack the scalability to handle highly complex structural relationships.
- Correlation studies confirmed a strong positive relationship between a model's ability to perform text-to-structure extraction and its general long-context reasoning capabilities, suggesting that structural thinking is a universal indicator of reasoning proficiency.