Task 1: Grouping on OCW (Only Connect Wall)
Evaluation Metrics
Wasserstein Distance (WD): distance between the predicted grouping and the gold grouping; lower is better.
# Correct Groups: number of correctly predicted groups; higher is better.
# Solved Walls: number of walls for which all four groups are correct; higher is better.
Adjusted Mutual Information (AMI): higher is better.
Adjusted Rand Index (ARI): higher is better.
Fowlkes-Mallows Score (FMS): higher is better.
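AMI, ARI, and FMS are standard clustering-agreement scores available in scikit-learn (the table appears to report them scaled to 0–100). A minimal sketch of how a predicted wall grouping would be scored against the gold grouping; the label lists below are illustrative, not drawn from the benchmark:

```python
# Illustrative only: a 16-clue wall is four groups of four, so each clue
# gets one of four group labels. WD is dataset-specific and omitted here.
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    fowlkes_mallows_score,
)

# Gold labels: four groups of four clues.
gold = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
# A hypothetical prediction that swaps two clues between groups 0 and 1.
pred = [0, 0, 0, 1, 1, 1, 1, 0, 2, 2, 2, 2, 3, 3, 3, 3]

print(f"AMI: {adjusted_mutual_info_score(gold, pred):.3f}")
print(f"ARI: {adjusted_rand_score(gold, pred):.3f}")
print(f"FMS: {fowlkes_mallows_score(gold, pred):.3f}")
```

All three scores are invariant to how the groups are numbered, which matters here because a model's group ordering carries no meaning.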
Evaluation Results
Performance of each model on this benchmark:
Model Name | Wasserstein Distance (WD) | # Correct Groups | # Solved Walls | Adjusted Mutual Information (AMI) | Adjusted Rand Index (ARI) | Fowlkes-Mallows Score (FMS) | Paper Title | Repository |
---|---|---|---|---|---|---|---|---|
Human Performance | - | 1405 | 285 | - | - | - | Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset | |
GPT-4 (5-shot) | 72.9 | 269 | 7 | 32.8 | 29.1 | 43.4 | GPT-4 Technical Report | |
GPT-4 (1-shot) | 73.4 | 262 | 4 | 33.5 | 29.7 | 43.7 | GPT-4 Technical Report | |
GPT-4 (100-shot) | 73.6 | 249 | 3 | 32.3 | 28.5 | 42.8 | GPT-4 Technical Report | |
GPT-4 (0-shot) | 75.8 | 239 | 6 | 30.7 | 27.2 | 41.5 | GPT-4 Technical Report | |
GPT-3.5-turbo (5-shot) | 80.6 | 149 | 2 | 25.4 | 22.0 | 37.3 | GPT-4 Technical Report | |
GPT-3.5-turbo (3-shot) | 80.9 | 140 | 0 | 24.7 | 21.3 | 36.8 | GPT-4 Technical Report | |
GPT-3.5-turbo (10-shot) | 81.2 | 137 | 2 | 24.0 | 20.4 | 36.1 | GPT-4 Technical Report | |
GPT-3.5-turbo (1-shot) | 82.3 | 123 | 0 | 21.2 | 18.2 | 34.4 | GPT-4 Technical Report | |
GPT-3.5-turbo (0-shot) | 82.5 | 114 | 0 | 21.6 | 18.4 | 34.0 | GPT-4 Technical Report | |
E5 (BASE) | 83.8 ± .6 | 89 ± 6 | 1 ± 0 | 19.5 ± .4 | 16.3 ± .4 | 33.1 ± .3 | Text Embeddings by Weakly-Supervised Contrastive Pre-training | |
FastText (Crawl) | 84.2 ± .5 | 80 ± 4 | 0 ± 0 | 18.4 ± .4 | 15.2 ± .3 | 32.1 ± .3 | Learning Word Vectors for 157 Languages | |
E5 (LARGE) | 84.4 ± .7 | 76 ± 5 | 0 ± 0 | 18.5 ± .6 | 15.4 ± .5 | 32.3 ± .4 | Text Embeddings by Weakly-Supervised Contrastive Pre-training | |
GloVe | 84.9 ± .4 | 68 ± 4 | 0 ± 0 | 17.6 ± .4 | 14.4 ± .3 | 31.5 ± .3 | - | - |
FastText (News) | 85.5 ± .5 | 62 ± 3 | 0 ± 0 | 15.8 ± .3 | 13.0 ± .2 | 30.4 ± .2 | Learning Word Vectors for 157 Languages | |
ELMo (LARGE) | - | 55 ± 4 | 0 ± 0 | 14.5 ± .4 | 11.8 ± .4 | 29.5 ± .3 | Deep contextualized word representations | |
DistilBERT (BASE) | - | 49 ± 4 | 0 ± 0 | 14.0 ± .3 | 11.3 ± .3 | 29.1 ± .2 | DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | |
BERT (LARGE) | 88.3 ± .5 | 33 ± 2 | 0 ± 0 | 10.3 ± .3 | 8.2 ± .3 | 26.5 ± .2 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | |
RoBERTa (LARGE) | - | 29 ± 3 | 0 ± 0 | 9.4 ± .4 | 8.4 ± .3 | 26.7 ± .2 | RoBERTa: A Robustly Optimized BERT Pretraining Approach | |
BERT (BASE) | 89.5 ± .4 | 22 ± 2 | 0 ± 0 | 8.1 ± .4 | 6.4 ± .3 | 25.1 ± .2 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | |