Medical Slice Transformer: Improved Diagnosis and Explainability on 3D Medical Images with DINOv2

MRI and CT are essential clinical cross-sectional imaging techniques for diagnosing complex conditions. However, large 3D datasets with annotations for deep learning are scarce. While self-supervised methods like DINOv2 are encouraging for 2D image analysis, they have not been applied to 3D medical images. Furthermore, deep learning models often lack explainability due to their "black-box" nature. This study aims to extend 2D self-supervised models, specifically DINOv2, to 3D medical imaging while evaluating their potential for explainable outcomes. We introduce the Medical Slice Transformer (MST) framework to adapt 2D self-supervised models for 3D medical image analysis. MST combines a Transformer architecture with a 2D feature extractor, i.e., DINOv2. We evaluated its diagnostic performance against a 3D convolutional neural network (3D ResNet) across three clinical datasets: breast MRI (651 patients), chest CT (722 patients), and knee MRI (1199 patients). Both methods were tested for diagnosing breast cancer, predicting lung nodule malignancy, and detecting meniscus tears. Diagnostic performance was assessed by calculating the area under the receiver operating characteristic curve (AUC). Explainability was evaluated through a radiologist's qualitative comparison of saliency maps based on slice and lesion correctness. P-values were calculated using DeLong's test. MST achieved higher AUC values than ResNet across all three datasets: breast (0.94$\pm$0.01 vs. 0.91$\pm$0.02, P=0.02), chest (0.95$\pm$0.01 vs. 0.92$\pm$0.02, P=0.13), and knee (0.85$\pm$0.04 vs. 0.69$\pm$0.05, P=0.001). Saliency maps were consistently more precise and anatomically correct for MST than for ResNet. Self-supervised 2D models like DINOv2 can be effectively adapted to 3D medical imaging using MST, offering improved diagnostic accuracy and explainability compared to convolutional neural networks.
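
To make the slice-then-aggregate idea concrete, below is a minimal PyTorch sketch of the kind of architecture the abstract describes: each 2D slice of a volume is embedded by a frozen DINOv2 backbone, and a Transformer encoder aggregates the slice embeddings for volume-level classification. The DINOv2 backbone is loaded via the public `facebookresearch/dinov2` torch.hub entry point; the layer counts, head count, [CLS]-token pooling, and slice-position embedding scheme are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class MedicalSliceTransformer(nn.Module):
    """Sketch: frozen DINOv2 encodes each 2D slice; a Transformer
    encoder aggregates slice embeddings into a volume-level prediction.
    Hyperparameters here are assumptions for illustration."""

    def __init__(self, embed_dim=768, max_slices=32, num_classes=2):
        super().__init__()
        # Frozen 2D feature extractor (DINOv2 ViT-B/14 via torch.hub).
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Learnable classification token and slice-position embeddings (assumed design).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, max_slices + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, volume):
        # volume: (B, S, 3, H, W) -- S slices, grayscale replicated to 3
        # channels; H and W must be multiples of the ViT patch size (14).
        B, S = volume.shape[:2]
        feats = self.backbone(volume.flatten(0, 1))       # (B*S, embed_dim)
        feats = feats.view(B, S, -1)                      # (B, S, embed_dim)
        tokens = torch.cat([self.cls_token.expand(B, -1, -1), feats], dim=1)
        tokens = tokens + self.pos_embed[:, : S + 1]
        out = self.transformer(tokens)
        return self.head(out[:, 0])                       # classify via [CLS]


if __name__ == "__main__":
    model = MedicalSliceTransformer()
    dummy = torch.randn(1, 16, 3, 224, 224)  # one volume of 16 slices
    print(model(dummy).shape)                 # torch.Size([1, 2])
```

One appeal of this design, consistent with the abstract's explainability claim, is that the Transformer's attention from the [CLS] token to individual slice tokens offers a natural handle for slice-level saliency, whereas a 3D CNN entangles depth and in-plane features in its convolutions.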