Visual Reasoning On Winoground
Metrics: Group Score, Image Score, Text Score
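These three metrics follow the scoring rules of the Winoground benchmark (Thrush et al., 2022): each example pairs two images (I0, I1) with two captions (C0, C1), and a model gets credit only when its matching scores rank the correct pairings above the incorrect ones. Below is a minimal sketch of the per-example rules; the 2x2 score matrix `s` and its values are purely illustrative, not taken from any model on this leaderboard.

```python
def text_score(s):
    """Text criterion: given each image, the correct caption must score higher.
    s[i][j] is the model's matching score for caption C_i with image I_j."""
    return s[0][0] > s[1][0] and s[1][1] > s[0][1]

def image_score(s):
    """Image criterion: given each caption, the correct image must score higher."""
    return s[0][0] > s[0][1] and s[1][1] > s[1][0]

def group_score(s):
    """Group criterion: both the text and image criteria must hold."""
    return text_score(s) and image_score(s)

# Illustrative 2x2 score matrix for a single Winoground example.
s = [[0.9, 0.4],
     [0.3, 0.8]]
print(text_score(s), image_score(s), group_score(s))  # True True True
```

Over the full dataset, each column reports the percentage of examples for which the corresponding criterion holds. Because the group criterion requires both of the others, the Group Score can never exceed the Image or Text Score, consistent with the ordering visible in the table below.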
Results
Performance of various models on this benchmark.
| Model Name | Group Score | Image Score | Text Score | Paper Title |
|---|---|---|---|---|
| GPT-4V (CoT, pick b/w two options) | 58.75 | 68.75 | 75.25 | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task |
| GPT-4V (pick b/w two options) | 39.25 | 46.25 | 69.25 | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task |
| MMICL + CoCoT | 50.75 | 52.5 | 64.25 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs |
| GPT-4V + CoCoT | 44.5 | 49.5 | 58.5 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs |
| OpenFlamingo + CoCoT | 41.5 | 55.25 | 58.25 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs |
| GPT-4V | 37.75 | 42.5 | 54.5 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs |
| FIBER (EqSim) | 27.5 | 32.00 | 51.5 | Equivariant Similarity for Vision-Language Foundation Models |
| FIBER (finetuned, Flickr30k) | 23.00 | 26.50 | 51.25 | Equivariant Similarity for Vision-Language Foundation Models |
| MMICL + CCoT | 47.5 | 48 | 51 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs |
| OpenFlamingo + DDCoT | 39 | 47.25 | 47.5 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs |
| VQ2 | 30.5 | 42.2 | 47 | What You See is What You Read? Improving Text-Image Alignment Evaluation |
| MMICL + DDCoT | 36.75 | 45 | 46.75 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs |
| X-VLM 16M | 21.2 | 24.5 | 46.7 | Measuring Progress in Fine-grained Vision-and-Language Understanding |
| PaLI (ft SNLI-VE + Synthetic Data) | 28.75 | 38 | 46.5 | What You See is What You Read? Improving Text-Image Alignment Evaluation |
| FIBER | 22.25 | 25.75 | 46.25 | Equivariant Similarity for Vision-Language Foundation Models |
| MMICL (FLAN-T5-XXL) | 43.00 | 44.99 | 45.50 | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning |
| METER (EqSim) | 18.75 | 22.75 | 45.0 | Equivariant Similarity for Vision-Language Foundation Models |
| PaLI (ft SNLI-VE) | 28.70 | 41.50 | 45.00 | What You See is What You Read? Improving Text-Image Alignment Evaluation |
| Gemini + DDCoT | 23.75 | 25 | 45 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs |
| X-VLM 4M | 21.5 | 26.7 | 44.0 | Measuring Progress in Fine-grained Vision-and-Language Understanding |
Showing 20 of 113 leaderboard entries.