Visual Reasoning On Winoground

Evaluation Metrics

Group Score

Image Score

Text Score
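
The three metrics above can be illustrated with a minimal sketch. It assumes the definitions from the Winoground paper listed in the results table: each example pairs two captions (C0, C1) with two images (I0, I1), and a model assigns a similarity score to every caption-image combination. The dictionary keys (`c0_i0`, `c0_i1`, `c1_i0`, `c1_i1`) and the function names are hypothetical, chosen only for illustration, and are not taken from any particular evaluation library.

```python
from typing import Dict, List

# One example = four similarity scores, keyed "<caption>_<image>" (hypothetical layout).
Example = Dict[str, float]


def text_correct(s: Example) -> bool:
    # For each image, the matching caption must score higher than the other caption.
    return s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]


def image_correct(s: Example) -> bool:
    # For each caption, the matching image must score higher than the other image.
    return s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]


def group_correct(s: Example) -> bool:
    # The group criterion requires both the text and the image criteria to hold.
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: List[Example]) -> Dict[str, float]:
    # Each score is the percentage of examples that satisfy the criterion.
    n = len(examples)
    return {
        "text": 100 * sum(text_correct(e) for e in examples) / n,
        "image": 100 * sum(image_correct(e) for e in examples) / n,
        "group": 100 * sum(group_correct(e) for e in examples) / n,
    }


if __name__ == "__main__":
    examples = [
        {"c0_i0": 0.9, "c0_i1": 0.1, "c1_i0": 0.2, "c1_i1": 0.8},
        {"c0_i0": 0.9, "c0_i1": 0.7, "c1_i0": 0.2, "c1_i1": 0.6},
    ]
    print(winoground_scores(examples))
```

Under these definitions the Group Score can never exceed either of the other two scores, since an example counts toward it only when both the text and the image criteria are met.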

Evaluation Results

Performance of each model on this benchmark

Model Name	Group Score	Image Score	Text Score	Paper Title	Repository
ViLBERT base	4.75	7.25	23.75	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
METER (finetuned, Flickr30k)	14.75	20.75	43.5	Equivariant Similarity for Vision-Language Foundation Models
BLIP (ITM)	13.3	15.8	35.8	Revisiting the Role of Language Priors in Vision-Language Models
BLIP2 (SGVL)	23.3	28.5	42.8	Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs	-
Gemini + CoCoT	27.75	32.5	40	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
GPT-4V (CoT, pick b/w two options)	58.75	68.75	75.25	The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task	-
OpenFlamingo + CoCoT	41.5	55.25	58.25	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
COCA ViT-L14 (f.t on COCO)	8.25	11.50	28.25	What You See is What You Read? Improving Text-Image Alignment Evaluation
OFA large (ITM)	7.25	10.25	30.75	Simple Token-Level Confidence Improves Caption Correctness	-
VSE++ (COCO, VGG)	3.50	5.50	18.75	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
METER	12.00	15.75	39.25	Equivariant Similarity for Vision-Language Foundation Models
OpenFlamingo	33.25	41.25	39	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
LDM-CLIP (SelfEval)	-	7.25	22.75	SelfEval: Leveraging the discriminative nature of generative models for evaluation	-
CLIP (ViT-L/14)	-	8.0	30.25	SelfEval: Leveraging the discriminative nature of generative models for evaluation	-
BLIP 129M (CapFilt/L)	12.2	15.2	34.7	Measuring Progress in Fine-grained Vision-and-Language Understanding
KeyComp* (GPT-4)	18.2	28.7	43.5	Prompting Large Vision-Language Models for Compositional Reasoning
LLaVA-7B (GPTScore)	10.50	17.00	25.50	An Examination of the Compositionality of Large Generative Vision-Language Models
TIFA	11.30	12.50	19.00	What You See is What You Read? Improving Text-Image Alignment Evaluation
Diffusion Classifier (zero-shot)	-	-	34.00	Your Diffusion Model is Secretly a Zero-Shot Classifier
KeyComp* (GPT-3.5)	17.4	27.8	42.7	Prompting Large Vision-Language Models for Compositional Reasoning
