Question Answering On Bamboogle
Evaluation Metric
Accuracy
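For illustration, here is a minimal sketch of how accuracy is typically scored on short-answer QA benchmarks like Bamboogle: predictions and gold answers are normalized (lowercased, punctuation and articles stripped) and compared for exact match. The normalization scheme here is an assumption (SQuAD-style), not the exact scoring script used by each paper in the table.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Percentage of predictions that exactly match the gold answer after normalization."""
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return 100.0 * correct / len(golds)

preds = ["The Eiffel Tower.", "1989"]
golds = ["Eiffel Tower", "1990"]
print(accuracy(preds, golds))  # 50.0
```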
Evaluation Results

Performance of each model on this benchmark:
Model | Accuracy | Paper Title | Repository
---|---|---|---
FireAct | 44.0 | FireAct: Toward Language Agent Fine-tuning | -
MCR (code-davinci-002) + Google Search | 66.5 | Answering Questions by Meta-Reasoning over Multiple Chains of Thought | -
Self-ask (GPT-3; davinci-002) | 57.6 | Measuring and Narrowing the Compositionality Gap in Language Models | -
ReST meets ReAct (PaLM 2-L + Google Search) | 76.1 | ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent | -
Self-ask (GPT-3; davinci-002) + Google Search | 60.0 | Measuring and Narrowing the Compositionality Gap in Language Models | -
Google Search | 0 | Measuring and Narrowing the Compositionality Gap in Language Models | -
Chain-of-Thought (GPT-3; davinci-002) | 46.4 | Measuring and Narrowing the Compositionality Gap in Language Models | -
RALM (LLaMA2-13B + Google Search) | 62.7 | Making Retrieval-Augmented Language Models Robust to Irrelevant Context | -
Direct Prompting (GPT-3; davinci-002) | 17.6 | Measuring and Narrowing the Compositionality Gap in Language Models | -