Visual Reasoning On Winoground

Evaluation Metrics

Group Score
Image Score
Text Score
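
The page lists the three metrics without defining them. As a rough sketch, assuming the standard Winoground setup in which each example pairs two captions (C0, C1) with two images (I0, I1) and a model assigns a similarity score to every caption-image combination, the scores could be computed as below. The function name `winoground_scores` and the dictionary keys (`c0_i0`, `c0_i1`, `c1_i0`, `c1_i1`) are hypothetical names chosen for illustration and do not come from any particular library.

```python
from typing import Dict, List


def winoground_scores(sims: List[Dict[str, float]]) -> Dict[str, float]:
    """Compute Text, Image, and Group Score (in percent).

    Each item in `sims` holds four model similarity scores for one example,
    keyed "c0_i0", "c0_i1", "c1_i0", "c1_i1" (caption index, image index).
    These key names are an assumption made for this sketch.
    """
    text_hits = image_hits = group_hits = 0
    for s in sims:
        # Text Score: given each image, the matching caption must score higher.
        text_ok = s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]
        # Image Score: given each caption, the matching image must score higher.
        image_ok = s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]
        text_hits += int(text_ok)
        image_hits += int(image_ok)
        # Group Score: both conditions must hold for the same example.
        group_hits += int(text_ok and image_ok)
    n = len(sims)
    return {
        "text": 100.0 * text_hits / n,
        "image": 100.0 * image_hits / n,
        "group": 100.0 * group_hits / n,
    }


examples = [
    # Both the text and the image condition hold.
    {"c0_i0": 0.9, "c0_i1": 0.1, "c1_i0": 0.2, "c1_i1": 0.8},
    # The text condition holds, but caption 0 prefers the wrong image.
    {"c0_i0": 0.9, "c0_i1": 0.95, "c1_i0": 0.2, "c1_i1": 0.97},
]
scores = winoground_scores(examples)
```

Because the Group Score requires both conditions on the same example, it can never exceed the Text Score or the Image Score for a given model.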

Evaluation Results

Performance results of each model on this benchmark

| Model | Group Score | Image Score | Text Score | Paper Title | Repository |
| --- | --- | --- | --- | --- | --- |
| ViLBERT base | 4.75 | 7.25 | 23.75 | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | - |
| METER (finetuned, Flickr30k) | 14.75 | 20.75 | 43.5 | Equivariant Similarity for Vision-Language Foundation Models | - |
| BLIP (ITM) | 13.3 | 15.8 | 35.8 | Revisiting the Role of Language Priors in Vision-Language Models | - |
| BLIP2 (SGVL) | 23.3 | 28.5 | 42.8 | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | - |
| Gemini + CoCoT | 27.75 | 32.5 | 40 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | - |
| GPT-4V (CoT, pick b/w two options) | 58.75 | 68.75 | 75.25 | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | - |
| OpenFlamingo + CoCoT | 41.5 | 55.25 | 58.25 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | - |
| COCA ViT-L14 (f.t on COCO) | 8.25 | 11.50 | 28.25 | What You See is What You Read? Improving Text-Image Alignment Evaluation | - |
| OFA large (ITM) | 7.25 | 10.25 | 30.75 | Simple Token-Level Confidence Improves Caption Correctness | - |
| VSE++ (COCO, VGG) | 3.50 | 5.50 | 18.75 | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | - |
| METER | 12.00 | 15.75 | 39.25 | Equivariant Similarity for Vision-Language Foundation Models | - |
| OpenFlamingo | 33.25 | 41.25 | 39 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | - |
| LDM-CLIP (SelfEval) | - | 7.25 | 22.75 | SelfEval: Leveraging the discriminative nature of generative models for evaluation | - |
| CLIP (ViT-L/14) | - | 8.0 | 30.25 | SelfEval: Leveraging the discriminative nature of generative models for evaluation | - |
| BLIP 129M (CapFilt/L) | 12.2 | 15.2 | 34.7 | Measuring Progress in Fine-grained Vision-and-Language Understanding | - |
| KeyComp* (GPT-4) | 18.2 | 28.7 | 43.5 | Prompting Large Vision-Language Models for Compositional Reasoning | - |
| LLaVA-7B (GPTScore) | 10.50 | 17.00 | 25.50 | An Examination of the Compositionality of Large Generative Vision-Language Models | - |
| TIFA | 11.30 | 12.50 | 19.00 | What You See is What You Read? Improving Text-Image Alignment Evaluation | - |
| Diffusion Classifier (zero-shot) | - | - | 34.00 | Your Diffusion Model is Secretly a Zero-Shot Classifier | - |
| KeyComp* (GPT-3.5) | 17.4 | 27.8 | 42.7 | Prompting Large Vision-Language Models for Compositional Reasoning | - |