Text-to-SQL on BIRD (BIg Bench for Large-Scale Database Grounded Text-to-SQLs)

Evaluation Results

Performance results of each model on this benchmark

Model	Execution Accuracy % (Dev)	Execution Accuracy % (Test)	Paper Title	Repository
PURPLE + RED + GPT-4o	68.12	70.21	-	-
MSc-SQL	65.6	-	MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation
Dubo-SQL, v1	59.71	60.71	-	-
SFT CodeS-15B	58.47	60.37	-	-
PURPLE + GPT-4o	62.97	64.51	-	-
DAIL-SQL + GPT-4	54.76	57.41	Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation
ChatGPT (Baseline)	37.22	39.30	Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs
SENSE	55.48	63.39	-	-
ExSL + granite-34b-code	72.43	73.17	-	-
Human Performance	-	-	Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs
SENSE-13B	55.48	63.39	-	-
Claude-2 (Baseline)	42.70	49.02	Can LLMs Effectively Leverage Graph Structural Information through Prompts, and Why?
SCL-SQL	64.73	65.23	-	-
Arcwise + GPT-4o	67.99	66.21	-	-
ByteBrain	65.45	68.87	-	-
MCS-SQL + GPT-4	63.36	65.45	-	-
Codex (Baseline)	34.35	36.47	Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs
CHASE-SQL + Gemini	73.14	74.06	CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL	-
XiYan-SQL	73.34	75.63	A Preview of XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL
OpenSearch-SQL+ v2 + GPT-4o	69.3	72.28	-	-
