Reasoning
Performance metrics of mainstream AI models across various tasks, showcasing the state-of-the-art technology
AI Model Performance Benchmarks
ARC
50 papers | 0 benchmarks
Discrete Choice Models
50 papers | 0 benchmarks
3D Human Reconstruction
48 papers | 10 benchmarks
Causal Identification
46 papers | 0 benchmarks
Common Sense Reasoning
45 papers | 24 benchmarks
Task Planning
42 papers | 0 benchmarks
StrategyQA
39 papers | 0 benchmarks
Decision Making Under Uncertainty
38 papers | 0 benchmarks
Temporal Sequences
35 papers | 1 benchmark
Physical Intuition
33 papers | 1 benchmark
Assortment Optimization
32 papers | 0 benchmarks
Natural Language Visual Grounding
32 papers | 1 benchmark
Missing Labels
30 papers | 0 benchmarks
Model-based Reinforcement Learning
30 papers | 0 benchmarks
Abstract Argumentation
25 papers | 0 benchmarks
Zero-Shot Video Question Answer
25 papers | 16 benchmarks
Visual Reasoning
24 papers | 12 benchmarks
Systematic Generalization
22 papers | 0 benchmarks
Decision Making
20 papers | 1 benchmark
Geometry Problem Solving
20 papers | 0 benchmarks
Odd One Out
20 papers | 1 benchmark
Video-based Generative Performance Benchmarking
20 papers | 1 benchmark
Abstract Algebra
18 papers | 1 benchmark
Program Repair
18 papers | 3 benchmarks
Image Paragraph Captioning
17 papers | 1 benchmark
Navigate
16 papers | 0 benchmarks
Video-based Generative Performance Benchmarking (Contextual Understanding)
16 papers | 1 benchmark
Video-based Generative Performance Benchmarking (Correctness of Information)
15 papers | 1 benchmark
Video-based Generative Performance Benchmarking (Detail Orientation)
15 papers | 1 benchmark
Video-based Generative Performance Benchmarking (Temporal Understanding)
15 papers | 1 benchmark
Video-based Generative Performance Benchmarking (Consistency)
15 papers | 1 benchmark
Date Understanding
14 papers | 0 benchmarks
Visual Commonsense Reasoning
14 papers | 7 benchmarks
Formal Logic
13 papers | 1 benchmark
Automated Theorem Proving
11 papers | 9 benchmarks
Arithmetic Reasoning
9 papers | 5 benchmarks
Error Understanding
9 papers | 2 benchmarks
Logical Sequence
9 papers | 0 benchmarks
Mathematical Induction
9 papers | 1 benchmark
Physical Commonsense Reasoning
9 papers | 1 benchmark
Analogical Similarity
7 papers | 1 benchmark
Autonomous Web Navigation
7 papers | 0 benchmarks
Causal Judgment
7 papers | 0 benchmarks
Elementary Mathematics
7 papers | 1 benchmark
Logical Reasoning
7 papers | 10 benchmarks
Theory of Mind Modeling
7 papers | 0 benchmarks
GitHub issue resolution
6 papers | 0 benchmarks
Logical Fallacy Detection
6 papers | 0 benchmarks
Math Word Problem Solving
6 papers | 13 benchmarks
Multimodal Reasoning
6 papers | 3 benchmarks
Visual Entailment
6 papers | 3 benchmarks
Human Judgment Correlation
5 papers | 2 benchmarks
Winowhy
5 papers | 0 benchmarks
Checkmate In One
4 papers | 0 benchmarks
High School Mathematics
4 papers | 1 benchmark
Penguins In A Table
4 papers | 0 benchmarks
Anachronisms
3 papers | 0 benchmarks
College Mathematics
3 papers | 1 benchmark
Conformal Prediction
3 papers | 0 benchmarks
Crass AI
3 papers | 1 benchmark
Reasoning About Colored Objects
3 papers | 0 benchmarks
Analytic Entailment
2 papers | 1 benchmark
Crash Blossom
2 papers | 1 benchmark
Entailed Polarity
2 papers | 1 benchmark
Evaluating Information Essentiality
2 papers | 1 benchmark
Human Judgment Classification
2 papers | 1 benchmark
Identify Odd Metaphor
2 papers | 1 benchmark
Logical Args
2 papers | 1 benchmark
Metaphor Boolean
2 papers | 1 benchmark
Novel Concepts
2 papers | 0 benchmarks
Presuppositions As NLI
2 papers | 1 benchmark
Code Line Descriptions
1 paper | 0 benchmarks
Commonsense Reasoning for RL
1 paper | 1 benchmark
Pre-election ratings estimation
1 paper | 0 benchmarks
Professional Accounting
1 paper | 1 benchmark