Command Palette
Search for a command to run...
RadImageNet-VQA: مجموعة بيانات ضخمة للأشعة المقطعية والرنين المغناطيسي للإجابة على الأسئلة البصرية في الأشعة الطبية
RadImageNet-VQA: مجموعة بيانات ضخمة للأشعة المقطعية والرنين المغناطيسي للإجابة على الأسئلة البصرية في الأشعة الطبية
Leo Butsanets Charles Corbiere Julien Khlaut Pierre Manceron Corentin Dancette
الملخص
في هذا العمل، نقدم RadImageNet-VQA، وهو مجموعة بيانات ضخمة الحجم مُصمَّمة لدفع عجلة التقدم في مجال الإجابة على الأسئلة البصرية في التصوير الشعاعي (VQA) بناءً على فحوصات الأشعة المقطعية (CT) والرنين المغناطيسي (MRI). وعلى الرغم من محدودية حجم مجموعات البيانات الطبية الحالية للإجابة على الأسئلة البصرية، وهيمنتها على صور الأشعة السينية (X-ray) والرسوم التوضيحية الطبية الحيوية، وعرضتها للانحياز إلى الحلول المبسطة المعتمدة على النص، فإن RadImageNet-VQA بُنيت باستخدامannotations مُقَّمة من قِبل خبراء، وتوفر 750 ألف صورة مقترنة بـ7.5 مليون عينة من الأسئلة والأجوبة.تغطي المجموعة ثلاث مهام رئيسية: كشف التشوهات، والتعرف على التشريح، وتحديد الأمراض، وتغطي 8 مناطق تشريحية و97 فئة مرضية، وتدعم أنواعاً متنوعة من الأسئلة بما فيها الأسئلة المفتوحة، والمغلقة، والمتعددة الخيارات.أظهرت التجارب المكثفة أن أحدث نماذج الرؤية واللغة (Vision-Language Models) لا تزال تواجه صعوبات في تحديد الأمراض بدقة عالية (fine-grained pathology identification)، خاصة في الإعدادات ذات الأسئلة المفتوحة، وحتى بعد إجراء عملية الضبط الدقيق (fine-tuning). كما كشفت التحليلات النصية وحدها أن دقة النماذج تنهار لتكون قريبة من العشوائية في غياب مدخلات الصور، مما يؤكد أن RadImageNet-VQA خالية تماماً من الحلول اللغوية المبسطة (linguistic shortcuts).يمكن الوصول إلى مجموعة البيانات الكاملة والمرجع المرجعي (benchmark) بشكل مفتوح عبر الرابط التالي: https://huggingface.co/datasets/raidium/RadImageNet-VQA.
One-sentence Summary
The authors introduce RadImageNet-VQA, a large-scale CT and MRI dataset for radiologic visual question answering comprising 750,000 images paired with 7.5 million expert-curated QA samples spanning abnormality detection, anatomy recognition, and pathology identification across 8 anatomical regions and 97 pathology categories, which is confirmed free of linguistic shortcuts and reveals that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning.
Key Contributions
- RadImageNet-VQA provides a large-scale radiologic VQA dataset with 750K CT and MRI images, 7.5M question-answer pairs, three tasks (abnormality detection, anatomy recognition, pathology identification), eight anatomical regions, 97 pathology categories, and open-ended, closed-ended, and multiple-choice formats.
- State-of-the-art vision-language models still struggle with fine-grained pathology identification, especially in open-ended settings, and text-only analysis shows near-random accuracy without image inputs, confirming the dataset is free of linguistic shortcuts.
- Fine-tuning on RadImageNet-VQA produces substantial gains across all model families, while medical-pretrained vision encoders do not improve downstream performance.
Introduction
The authors introduce RadImageNet-VQA, a large-scale radiologic dataset for visual question answering that pairs 750,000 CT and MRI images with 7.5 million generated question-answer samples. This resource targets a gap in the field: while visual question answering offers a controlled and interpretable way to evaluate clinical reasoning in vision-language models, existing benchmarks lack sufficient coverage of cross-sectional imaging modalities like CT and MRI. Prior work has focused heavily on chest X-rays and report generation, where automatic evaluation metrics correlate poorly with clinical correctness, leaving few robust options for probing fine-grained radiological understanding. The authors’ main contribution is the release of this dataset spanning eight anatomical regions and 97 pathologies, along with a systematic evaluation showing that state-of-the-art vision-language models handle anatomy and basic abnormality detection but struggle with precise pathology identification, and that fine-tuning on their data yields substantial performance gains.
Dataset
Dataset Description: RadImageNet-VQA
The authors introduce RadImageNet-VQA, a large-scale dataset for radiologic visual question answering built from expert-curated CT and MRI images. It is designed to address the limited scale and narrow anatomical coverage of existing medical VQA benchmarks while eliminating linguistic shortcuts that let models answer without relying on images.
Dataset Composition and Sources
- The source data comes from RadImageNet, an expert-annotated medical imaging dataset where each image carries modality (CT or MRI), body part, and pathology labels.
- From this resource, the authors use only the CT and MRI subsets.
- The final training corpus contains roughly 750K images paired with 7.5M samples, consisting of 750K image-caption pairs and 6.75M QA pairs.
- A separate curated benchmark subset contains 1,000 CT/MRI images and 9,000 QA pairs.
Key Details for Each Subset
- Training set: Created by applying the full captioning and VQA generation pipeline to the entire RadImageNet training split. It covers 8 anatomical regions and 97 pathology categories.
- Benchmark subset: A curated evaluation set with 1,000 images and 9,000 QA pairs, balanced across anatomy, abnormality, and pathology tasks in open-ended, closed-ended, and multiple-choice formats.
How the Data Is Used
- The training set serves two purposes: radiologic captions for medical visual-text alignment and VQA data for instruction tuning.
- The benchmark subset is used to evaluate state-of-the-art vision-language models across three tasks:
- Anatomy recognition (identifying the imaged region)
- Abnormality detection (determining whether any abnormal finding exists)
- Pathology identification (distinguishing specific diseases within an anatomical context)
Processing and Construction Details
- Caption generation: Metadata fields (modality, anatomical region, pathology category) are converted into structured radiologic captions by sampling randomly from a diverse set of radiology-aware templates, producing formulations like "A [modality] scan of the [anatomy] showing [pathology]."
- VQA sample generation: Each image is converted into structured VQA samples using task-specific question-answer templates. For each task and question type, 2 to 7 linguistic variations are defined to prevent models from exploiting textual shortcuts.
- Closed-ended design: For anatomy and pathology questions, both positive and negative variants are generated, where the expected answer is "yes" (present) or "no" (absent). This probes whether models hallucinate organs or pathologies when they are not present.
- Multiple-choice distractors: For anatomy recognition, distractors are sampled from other anatomical regions in the dataset. For pathology identification, distractors are restricted to clinically plausible diseases from the same anatomical region, preventing models from solving questions by matching anatomy alone. The option "no pathology seen" is included in every pathology multiple-choice to reduce the bias of models assuming pathology is always present.
Experiment
The RadImageNet-VQA benchmark assesses vision-language models on CT and MRI images across three tasks—anatomy recognition, abnormality detection, and pathology identification—using open-ended, closed-ended, and multiple-choice questions. Zero-shot evaluation shows that anatomy is nearly solved but fine-grained pathology identification remains a severe bottleneck, especially in open-ended responses, where even the best models score low. Fine-tuning on radiologic data brings large improvements, nearly saturating anatomy and abnormality detection, yet pathology discrimination stays challenging, and standard vision encoders match or outperform medically pretrained ones. Text-only ablations demonstrate that the dataset effectively removes linguistic shortcuts, forcing models to rely on visual evidence rather than textual priors.
The authors evaluate state-of-the-art general-purpose and medical-oriented vision-language models on a radiologic visual question answering benchmark. Results indicate that while anatomy recognition is largely solved, fine-grained pathology identification remains a significant bottleneck, particularly for open-ended questions. General-purpose models generally achieve higher overall accuracy than medical-oriented variants, though the latter show specific advantages in open-ended pathology tasks. Anatomy recognition scores are consistently high across models, especially in multiple-choice and closed-ended formats, whereas open-ended pathology identification yields the lowest performance. General-purpose models lead in average accuracy and excel in abnormality detection and closed-ended pathology questions, suggesting broad pretraining provides a strong foundation. Medical-oriented models demonstrate superior performance specifically on open-ended pathology questions but do not consistently outperform general-purpose models across all tasks.
Thetable compares general-purpose and medical specialized models across 0%, 30%, and 100% conditions. Medical specialized models initially show significantly higher values than general-purpose models at 0%, but these values decrease as the percentage increases. The performance gap between the two model types narrows at 100%, suggesting convergence in capability under that condition. Medical specialized models like medgemma-4b-it and Lingshu-7B start with higher values at 0% compared to general-purpose models. There is a consistent downward trend in the values for medical specialized models as the percentage increases to 100%. General-purpose models maintain relatively stable values across all conditions, leading to a narrower gap with medical models at 100%.
The authors fine-tune multiple vision-language models on a radiologic corpus and observe consistent performance gains across all architectures. Anatomy recognition reaches near-saturation levels after training, while abnormality detection shows significant relative improvements. However, pathology identification remains the primary bottleneck, achieving lower accuracy even after fine-tuning. Fine-tuning yields substantial accuracy improvements across most evaluated models and task categories. Anatomy recognition tasks are nearly solved post-training, with models achieving very high accuracy across all question formats. Pathology identification remains challenging, showing the lowest performance among the three tasks despite the gains from fine-tuning.
The authors validate the reliability of the LLM-as-a-judge evaluation framework for open-ended responses by comparing it against human annotations. The results demonstrate substantial agreement between the automated system and human reviewers, supporting its use for benchmarking. Notably, the alignment is stronger for anatomy recognition tasks than for pathology identification, reflecting the varying difficulty of the tasks. Human reviewers exhibit high agreement with the LLM judge across both anatomy and pathology categories. The automated evaluation shows higher consistency for anatomy recognition compared to the more challenging pathology identification. Cohen's kappa scores indicate substantial reliability, validating the LLM-as-a-judge approach for open-ended assessment.
The authors present a detailed breakdown of zero-shot accuracy across different question templates and radiologic tasks. The data confirms that anatomical recognition and structure presence detection are highly accurate for most models, particularly in multiple-choice formats. In contrast, open-ended pathology identification proves extremely difficult, with all models scoring poorly, though medical-specific models like MedGemma show a slight edge. General-purpose models like InternVL generally outperform others in closed-ended pathology tasks for positive cases, while GPT-5 exhibits notably low performance on abnormality detection questions. Anatomical region selection and structure presence detection consistently yield high accuracy across models, suggesting these tasks are nearly solved. Open-ended pathology identification remains the most challenging task, with accuracy scores remaining very low across all evaluated models. General-purpose models like InternVL achieve high accuracy on positive pathology detection cases, whereas medical-oriented models perform significantly better on negative cases.
The authors evaluate general-purpose and medical-specific vision-language models on a radiologic visual question answering benchmark covering anatomy, abnormality detection, and pathology tasks across multiple-choice, closed-ended, and open-ended formats. Anatomy recognition is nearly solved across all models, while fine-grained pathology identification, especially in open-ended settings, remains the primary bottleneck even after fine-tuning. General-purpose models generally achieve higher overall accuracy, but medical-oriented models demonstrate specific advantages on open-ended pathology questions and in low-data regimes, though their edge narrows as training data increases. An LLM-as-a-judge evaluation protocol is validated against human annotations, showing strong agreement particularly for anatomy tasks and confirming the reliability of automated open-ended assessment.