HyperAI

Visual Reasoning On Winoground

Metrics

Group Score
Image Score
Text Score
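
This page lists the three metrics without defining them. The sketch below follows the definitions in the Winoground paper cited in the results table: each example pairs two captions (c0, c1) with two images (i0, i1), and a model supplies a similarity score for each caption-image combination. The dictionary keys and function names are hypothetical, chosen only for illustration, and reporting the scores as percentages is an assumption made to match the scale of the numbers in the table.

```python
from typing import Dict, List

# One Winoground example: the model's similarity score for every
# caption/image combination. Key names are hypothetical.
Example = Dict[str, float]


def text_correct(s: Example) -> bool:
    # For each image, the matching caption must outscore the other caption.
    return s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]


def image_correct(s: Example) -> bool:
    # For each caption, the matching image must outscore the other image.
    return s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]


def group_correct(s: Example) -> bool:
    # An example counts toward the group score only if both checks pass.
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: List[Example]) -> Dict[str, float]:
    """Return text, image, and group scores as percentages of examples."""
    if not examples:
        raise ValueError("at least one example is required")
    n = len(examples)
    return {
        "text": 100.0 * sum(text_correct(s) for s in examples) / n,
        "image": 100.0 * sum(image_correct(s) for s in examples) / n,
        "group": 100.0 * sum(group_correct(s) for s in examples) / n,
    }
```

Because the group check requires both the text and image checks to pass on the same example, a model's group score can never exceed either of its other two scores.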

Results

Performance results of various models on this benchmark

| Model Name | Group Score | Image Score | Text Score | Paper Title | Repository |
|---|---|---|---|---|---|
| ViLBERT base | 4.75 | 7.25 | 23.75 | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | |
| METER (finetuned, Flickr30k) | 14.75 | 20.75 | 43.5 | Equivariant Similarity for Vision-Language Foundation Models | |
| BLIP (ITM) | 13.3 | 15.8 | 35.8 | Revisiting the Role of Language Priors in Vision-Language Models | |
| BLIP2 (SGVL) | 23.3 | 28.5 | 42.8 | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | - |
| Gemini + CoCoT | 27.75 | 32.5 | 40 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| GPT-4V (CoT, pick b/w two options) | 58.75 | 68.75 | 75.25 | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | - |
| OpenFlamingo + CoCoT | 41.5 | 55.25 | 58.25 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| COCA ViT-L14 (f.t on COCO) | 8.25 | 11.50 | 28.25 | What You See is What You Read? Improving Text-Image Alignment Evaluation | |
| OFA large (ITM) | 7.25 | 10.25 | 30.75 | Simple Token-Level Confidence Improves Caption Correctness | - |
| VSE++ (COCO, VGG) | 3.50 | 5.50 | 18.75 | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | |
| METER | 12.00 | 15.75 | 39.25 | Equivariant Similarity for Vision-Language Foundation Models | |
| OpenFlamingo | 33.25 | 41.25 | 39 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| LDM-CLIP (SelfEval) | - | 7.25 | 22.75 | SelfEval: Leveraging the discriminative nature of generative models for evaluation | - |
| CLIP (ViT-L/14) | - | 8.0 | 30.25 | SelfEval: Leveraging the discriminative nature of generative models for evaluation | - |
| BLIP 129M (CapFilt/L) | 12.2 | 15.2 | 34.7 | Measuring Progress in Fine-grained Vision-and-Language Understanding | |
| KeyComp* (GPT-4) | 18.2 | 28.7 | 43.5 | Prompting Large Vision-Language Models for Compositional Reasoning | |
| LLaVA-7B (GPTScore) | 10.50 | 17.00 | 25.50 | An Examination of the Compositionality of Large Generative Vision-Language Models | |
| TIFA | 11.30 | 12.50 | 19.00 | What You See is What You Read? Improving Text-Image Alignment Evaluation | |
| Diffusion Classifier (zero-shot) | - | - | 34.00 | Your Diffusion Model is Secretly a Zero-Shot Classifier | |
| KeyComp* (GPT-3.5) | 17.4 | 27.8 | 42.7 | Prompting Large Vision-Language Models for Compositional Reasoning | |