HyperAI초신경

Zero-Shot Video Retrieval on LSMDC

Evaluation Metrics

text-to-video R@1
text-to-video R@5
text-to-video R@10
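
Recall@k (R@k) is the fraction of text queries for which the ground-truth video appears among the top-k retrieved results. The sketch below shows one common way to compute these numbers from a text-to-video similarity matrix; the function name `recall_at_k` and the assumption that each text query has exactly one matching video located on the diagonal are illustrative conventions, not taken from any of the listed papers.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Text-to-video Recall@k from a similarity matrix.

    sim: (num_texts, num_videos) array of similarity scores, where
         sim[i, i] is assumed to be the score of the ground-truth pair
         (one matching video per text query).
    """
    # Rank of the ground-truth video for each query: the number of
    # videos that score strictly higher than the true match.
    gt_scores = np.diag(sim)[:, None]        # (num_texts, 1)
    ranks = (sim > gt_scores).sum(axis=1)    # rank 0 = best possible
    return {f"R@{k}": 100.0 * float((ranks < k).mean()) for k in ks}

# Toy example with random similarities; real scores would come from a
# retrieval model such as those in the table below.
rng = np.random.default_rng(0)
sim = rng.normal(size=(1000, 1000))
print(recall_at_k(sim))  # chance level: roughly R@1=0.1, R@5=0.5, R@10=1.0
```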

Evaluation Results

Performance of each model on this benchmark:

| Model Name | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | Paper Title |
| --- | --- | --- | --- | --- |
| BT-Adapter | 19.5 | 35.9 | 45.0 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning |
| HowToCaption | 17.3 | 31.7 | 38.6 | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| HiTeA-17M | 18.3 | 36.7 | 44.2 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| Y. Ge et al. | 12.2 | 25.9 | 32.2 | Bridging Video-text Retrieval with Multiple Choice Questions |
| SSML | 4.2 | 11.6 | 17.1 | Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning |
| InternVideo2-6B | 33.8 | 55.9 | 62.2 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| CLIP4Clip | 15.1 | 28.5 | 36.4 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| UMT-L (ViT-L/16) | 25.2 | 43.0 | 50.5 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| HiTeA-5M | 15.5 | 31.1 | 39.8 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| mPLUG-2 | 24.1 | 43.8 | 52.0 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| MILES | 11.1 | 24.7 | 30.6 | MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval |
| InternVideo2-1B | 32.0 | 52.4 | 59.4 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| InternVideo | 17.6 | 32.4 | 40.2 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| Clover | 14.7 | 29.2 | 38.2 | Clover: Towards A Unified Video-Language Alignment and Fusion Model |
| Yatai Ji et al. | 17.2 | 32.4 | 39.1 | Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning |
| VAST, HowToCaption-finetuned | 27.7 | 46.5 | 54.6 | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |