SSRB Semi-structured Data Natural Language Query Dataset
License
Apache 2.0
SSRB is a large-scale benchmark dataset for natural language querying of semi-structured data, released in 2025 by Harbin Institute of Technology (Shenzhen) in collaboration with Hong Kong Polytechnic University, Tsinghua University, and other institutions. The related paper, "SSRB: Direct Natural Language Querying to Massive Heterogeneous Semi-Structured Data", was accepted to the NeurIPS 2025 Datasets and Benchmarks track. The benchmark aims to evaluate and advance models' ability to retrieve semi-structured data under complex natural language query conditions.
The dataset contains approximately 14 million semi-structured data objects and 8,485 test queries, covering six domains and 99 different schemas. Each query expresses a retrieval need over semi-structured data. Query conditions typically combine exact field-matching constraints with fuzzy semantic-matching requirements, and may span multiple fields or require implicit inference. Together, these properties allow a systematic evaluation of how well a model retrieves and understands semi-structured data under complex query conditions.
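To make the idea of combining exact and fuzzy conditions concrete, here is a minimal Python sketch. It is not the SSRB data format or the paper's retrieval method: the objects, field names, the `("op", value)` constraint encoding, and the word-overlap score (a toy stand-in for semantic matching) are all assumptions made for illustration.

```python
import re

# Hypothetical semi-structured objects. Field names and values are invented
# for illustration and are not taken from SSRB.
OBJECTS = [
    {"id": 1, "category": "laptop", "brand": "Acme", "price": 899,
     "description": "lightweight notebook with long battery life"},
    {"id": 2, "category": "laptop", "brand": "Acme", "price": 1499,
     "description": "heavy gaming machine with a powerful graphics card"},
    {"id": 3, "category": "phone", "brand": "Acme", "price": 499,
     "description": "lightweight phone with long battery life"},
]


def tokens(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_exact(obj, constraints):
    """Return True only if every exact field constraint holds for obj."""
    for field, (op, value) in constraints.items():
        if field not in obj:
            return False
        if op == "eq":
            if obj[field] != value:
                return False
        elif op == "lt":
            if not obj[field] < value:
                return False
        else:
            raise ValueError(f"unsupported operator: {op}")
    return True


def fuzzy_score(obj, fuzzy_text, field="description"):
    """Toy stand-in for semantic similarity: Jaccard overlap of word sets."""
    a, b = tokens(obj.get(field, "")), tokens(fuzzy_text)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(objects, constraints, fuzzy_text):
    """Filter by the exact constraints, then rank survivors by fuzzy score."""
    candidates = [o for o in objects if matches_exact(o, constraints)]
    return sorted(candidates, key=lambda o: fuzzy_score(o, fuzzy_text),
                  reverse=True)


# Natural language query (hypothetical): "an Acme laptop under 1600 that is
# light and has long battery life", hand-split into exact and fuzzy parts.
results = search(
    OBJECTS,
    {"category": ("eq", "laptop"), "price": ("lt", 1600)},
    "lightweight long battery life",
)
ranked_ids = [o["id"] for o in results]
```

In this sketch the split of the query into exact constraints and fuzzy text is done by hand. A benchmark of this kind targets the harder setting in which a model must handle both kinds of condition directly from the natural language query.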