HyperAI

Visual Reasoning On Winoground

Metrics

Group Score
Image Score
Text Score
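The three metrics can be sketched in code. The following is a minimal illustration, assuming the definitions from the Winoground paper listed in the results table: each example has two captions (C0, C1) and two images (I0, I1), and a model assigns a match score to every caption-image pair. The dictionary keys (`c0_i0` and so on) and the function names are hypothetical, chosen only for this sketch.

```python
from typing import Dict, List

# Hypothetical layout: s["c0_i1"] is the model's match score for
# caption 0 paired with image 1, and likewise for the other keys.
PairScores = Dict[str, float]


def text_correct(s: PairScores) -> bool:
    # For each image, the matching caption must outscore the other caption.
    return s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]


def image_correct(s: PairScores) -> bool:
    # For each caption, the matching image must outscore the other image.
    return s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]


def group_correct(s: PairScores) -> bool:
    # An example counts for the group score only if both checks pass.
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: List[PairScores]) -> Dict[str, float]:
    # Each metric is the percentage of examples that pass the check.
    n = len(examples)
    return {
        "text": 100.0 * sum(text_correct(s) for s in examples) / n,
        "image": 100.0 * sum(image_correct(s) for s in examples) / n,
        "group": 100.0 * sum(group_correct(s) for s in examples) / n,
    }
```

Because the group check requires both of the other checks to pass, a model's Group Score under these definitions can never exceed its Text Score or Image Score.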

Results

Performance results of different models on this benchmark

| Model Name | Group Score | Image Score | Text Score | Paper Title | Repository |
|---|---|---|---|---|---|
| ViLBERT base | 4.75 | 7.25 | 23.75 | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | - |
| METER (finetuned, Flickr30k) | 14.75 | 20.75 | 43.5 | Equivariant Similarity for Vision-Language Foundation Models | - |
| BLIP (ITM) | 13.3 | 15.8 | 35.8 | Revisiting the Role of Language Priors in Vision-Language Models | - |
| BLIP2 (SGVL) | 23.3 | 28.5 | 42.8 | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | - |
| Gemini + CoCoT | 27.75 | 32.5 | 40 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | - |
| GPT-4V (CoT, pick b/w two options) | 58.75 | 68.75 | 75.25 | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | - |
| OpenFlamingo + CoCoT | 41.5 | 55.25 | 58.25 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | - |
| COCA ViT-L14 (f.t on COCO) | 8.25 | 11.50 | 28.25 | What You See is What You Read? Improving Text-Image Alignment Evaluation | - |
| OFA large (ITM) | 7.25 | 10.25 | 30.75 | Simple Token-Level Confidence Improves Caption Correctness | - |
| VSE++ (COCO, VGG) | 3.50 | 5.50 | 18.75 | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | - |
| METER | 12.00 | 15.75 | 39.25 | Equivariant Similarity for Vision-Language Foundation Models | - |
| OpenFlamingo | 33.25 | 41.25 | 39 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | - |
| LDM-CLIP (SelfEval) | - | 7.25 | 22.75 | SelfEval: Leveraging the discriminative nature of generative models for evaluation | - |
| CLIP (ViT-L/14) | - | 8.0 | 30.25 | SelfEval: Leveraging the discriminative nature of generative models for evaluation | - |
| BLIP 129M (CapFilt/L) | 12.2 | 15.2 | 34.7 | Measuring Progress in Fine-grained Vision-and-Language Understanding | - |
| KeyComp* (GPT-4) | 18.2 | 28.7 | 43.5 | Prompting Large Vision-Language Models for Compositional Reasoning | - |
| LLaVA-7B (GPTScore) | 10.50 | 17.00 | 25.50 | An Examination of the Compositionality of Large Generative Vision-Language Models | - |
| TIFA | 11.30 | 12.50 | 19.00 | What You See is What You Read? Improving Text-Image Alignment Evaluation | - |
| Diffusion Classifier (zero-shot) | - | - | 34.00 | Your Diffusion Model is Secretly a Zero-Shot Classifier | - |
| KeyComp* (GPT-3.5) | 17.4 | 27.8 | 42.7 | Prompting Large Vision-Language Models for Compositional Reasoning | - |