Visual Reasoning On Winoground
Metrics
Group Score
Image Score
Text Score
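The page lists the three metrics without defining them. As a rough sketch, assuming the standard definitions from the Winoground paper cited in the results (each example pairs two captions with two images, and a model assigns a similarity score to every caption-image combination), they could be computed as follows. The key names (`c0_i0` and so on) and the function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# One example: similarity scores for the four caption/image combinations.
# Hypothetical keys: "c0_i0" is the score of caption 0 with image 0, etc.
# Caption 0 matches image 0, and caption 1 matches image 1.
Scores = Dict[str, float]


def text_correct(s: Scores) -> bool:
    # For each image, the matching caption must outscore the other caption.
    return s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]


def image_correct(s: Scores) -> bool:
    # For each caption, the matching image must outscore the other image.
    return s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]


def group_correct(s: Scores) -> bool:
    # An example counts for the group score only if both checks pass.
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: List[Scores]) -> Dict[str, float]:
    # Each score is the percentage of examples that pass the given check.
    n = len(examples)
    if n == 0:
        raise ValueError("need at least one example")
    return {
        "text": 100.0 * sum(text_correct(s) for s in examples) / n,
        "image": 100.0 * sum(image_correct(s) for s in examples) / n,
        "group": 100.0 * sum(group_correct(s) for s in examples) / n,
    }
```

Because the group check requires both the text and the image check to pass, a model's Group Score under these definitions can never exceed its Text Score or its Image Score.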
Results
Performance of various models on this benchmark
| Model name | Group Score | Image Score | Text Score | Paper Title | Repository |
| --- | --- | --- | --- | --- | --- |
| ViLBERT base | 4.75 | 7.25 | 23.75 | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | |
| METER (finetuned, Flickr30k) | 14.75 | 20.75 | 43.5 | Equivariant Similarity for Vision-Language Foundation Models | |
| BLIP (ITM) | 13.3 | 15.8 | 35.8 | Revisiting the Role of Language Priors in Vision-Language Models | |
| BLIP2 (SGVL) | 23.3 | 28.5 | 42.8 | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | - |
| Gemini + CoCoT | 27.75 | 32.5 | 40 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| GPT-4V (CoT, pick b/w two options) | 58.75 | 68.75 | 75.25 | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | - |
| OpenFlamingo + CoCoT | 41.5 | 55.25 | 58.25 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| COCA ViT-L14 (f.t on COCO) | 8.25 | 11.50 | 28.25 | What You See is What You Read? Improving Text-Image Alignment Evaluation | |
| OFA large (ITM) | 7.25 | 10.25 | 30.75 | Simple Token-Level Confidence Improves Caption Correctness | - |
| VSE++ (COCO, VGG) | 3.50 | 5.50 | 18.75 | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | |
| METER | 12.00 | 15.75 | 39.25 | Equivariant Similarity for Vision-Language Foundation Models | |
| OpenFlamingo | 33.25 | 41.25 | 39 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| LDM-CLIP (SelfEval) | - | 7.25 | 22.75 | SelfEval: Leveraging the discriminative nature of generative models for evaluation | - |
| CLIP (ViT-L/14) | - | 8.0 | 30.25 | SelfEval: Leveraging the discriminative nature of generative models for evaluation | - |
| BLIP 129M (CapFilt/L) | 12.2 | 15.2 | 34.7 | Measuring Progress in Fine-grained Vision-and-Language Understanding | |
| KeyComp* (GPT-4) | 18.2 | 28.7 | 43.5 | Prompting Large Vision-Language Models for Compositional Reasoning | |
| LLaVA-7B (GPTScore) | 10.50 | 17.00 | 25.50 | An Examination of the Compositionality of Large Generative Vision-Language Models | |
| TIFA | 11.30 | 12.50 | 19.00 | What You See is What You Read? Improving Text-Image Alignment Evaluation | |
| Diffusion Classifier (zero-shot) | - | - | 34.00 | Your Diffusion Model is Secretly a Zero-Shot Classifier | |
| KeyComp* (GPT-3.5) | 17.4 | 27.8 | 42.7 | Prompting Large Vision-Language Models for Compositional Reasoning | |