Visual Reasoning On Winoground
Metrics
Group Score
Image Score
Text Score
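For readers unfamiliar with the three metrics, the sketch below shows how they are commonly computed, following the definitions in the Winoground paper: each example has two captions and two images, and a model assigns a matching score to every caption-image pair. The names `Example`, `text_correct`, `image_correct`, and `winoground_scores` are illustrative assumptions, not part of any official evaluation code, and `score` stands in for whatever similarity function a given model provides.

```python
from typing import Callable, Dict, Iterable, Tuple

# One Winoground example: (caption_0, image_0, caption_1, image_1).
# Images are represented here by opaque identifiers (an assumption for this sketch).
Example = Tuple[str, str, str, str]
ScoreFn = Callable[[str, str], float]  # score(caption, image) -> higher means better match


def text_correct(score: ScoreFn, c0: str, i0: str, c1: str, i1: str) -> bool:
    # For each image, the matching caption must outscore the other caption.
    return score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1)


def image_correct(score: ScoreFn, c0: str, i0: str, c1: str, i1: str) -> bool:
    # For each caption, the matching image must outscore the other image.
    return score(c0, i0) > score(c0, i1) and score(c1, i1) > score(c1, i0)


def winoground_scores(examples: Iterable[Example], score: ScoreFn) -> Dict[str, float]:
    """Return text, image, and group scores as percentages of examples."""
    n = text_hits = image_hits = group_hits = 0
    for c0, i0, c1, i1 in examples:
        t = text_correct(score, c0, i0, c1, i1)
        i = image_correct(score, c0, i0, c1, i1)
        n += 1
        text_hits += t
        image_hits += i
        group_hits += t and i  # group credit only when both conditions hold
    if n == 0:
        raise ValueError("no examples given")
    return {
        "text": 100.0 * text_hits / n,
        "image": 100.0 * image_hits / n,
        "group": 100.0 * group_hits / n,
    }
```

Because an example earns group credit only when both the text and image conditions hold, a model's Group Score can never exceed the lower of its Text Score and Image Score.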
Results
Performance results of various models on this benchmark
| Model Name | Group Score | Image Score | Text Score | Paper Title | Repository |
|---|---|---|---|---|---|
| ViLBERT base | 4.75 | 7.25 | 23.75 | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | - |
| METER (finetuned, Flickr30k) | 14.75 | 20.75 | 43.5 | Equivariant Similarity for Vision-Language Foundation Models | - |
| BLIP (ITM) | 13.3 | 15.8 | 35.8 | Revisiting the Role of Language Priors in Vision-Language Models | - |
| BLIP2 (SGVL) | 23.3 | 28.5 | 42.8 | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | - |
| Gemini + CoCoT | 27.75 | 32.5 | 40 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | - |
| GPT-4V (CoT, pick b/w two options) | 58.75 | 68.75 | 75.25 | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | - |
| OpenFlamingo + CoCoT | 41.5 | 55.25 | 58.25 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | - |
| COCA ViT-L14 (f.t on COCO) | 8.25 | 11.50 | 28.25 | What You See is What You Read? Improving Text-Image Alignment Evaluation | - |
| OFA large (ITM) | 7.25 | 10.25 | 30.75 | Simple Token-Level Confidence Improves Caption Correctness | - |
| VSE++ (COCO, VGG) | 3.50 | 5.50 | 18.75 | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | - |
| METER | 12.00 | 15.75 | 39.25 | Equivariant Similarity for Vision-Language Foundation Models | - |
| OpenFlamingo | 33.25 | 41.25 | 39 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | - |
| LDM-CLIP (SelfEval) | - | 7.25 | 22.75 | SelfEval: Leveraging the discriminative nature of generative models for evaluation | - |
| CLIP (ViT-L/14) | - | 8.0 | 30.25 | SelfEval: Leveraging the discriminative nature of generative models for evaluation | - |
| BLIP 129M (CapFilt/L) | 12.2 | 15.2 | 34.7 | Measuring Progress in Fine-grained Vision-and-Language Understanding | - |
| KeyComp* (GPT-4) | 18.2 | 28.7 | 43.5 | Prompting Large Vision-Language Models for Compositional Reasoning | - |
| LLaVA-7B (GPTScore) | 10.50 | 17.00 | 25.50 | An Examination of the Compositionality of Large Generative Vision-Language Models | - |
| TIFA | 11.30 | 12.50 | 19.00 | What You See is What You Read? Improving Text-Image Alignment Evaluation | - |
| Diffusion Classifier (zero-shot) | - | - | 34.00 | Your Diffusion Model is Secretly a Zero-Shot Classifier | - |
| KeyComp* (GPT-3.5) | 17.4 | 27.8 | 42.7 | Prompting Large Vision-Language Models for Compositional Reasoning | - |