DocXClassifier: High Performance Explainable Deep Network for Document Image Classification
Convolutional Neural Networks (ConvNets) have been thoroughly researched for document image classification and are known for their exceptional performance in unimodal, image-based document classification. Recently, however, the field has shifted towards multimodal approaches that learn simultaneously from the visual and textual features of documents. While this shift has led to significant advances, it has also caused interest in improving pure ConvNet-based approaches to wane. This is undesirable, as many multimodal approaches still use ConvNets as their visual backbone, so improving ConvNets remains essential to improving those approaches. In this paper, we present DocXClassifier, a ConvNet-based approach that, by combining state-of-the-art model design patterns with modern data augmentation and training strategies, not only achieves significant performance improvements in image-based document classification but also outperforms some of the recently proposed multimodal approaches. Moreover, DocXClassifier can generate transformer-like attention maps, which makes it inherently interpretable, a property not found in previous image-based classification models. Our approach achieves new peak performance in image-based classification on two popular document datasets, RVL-CDIP and Tobacco3482, with top-1 classification accuracies of 94.17% and 95.57% on the two datasets, respectively. It also sets a new record for the highest image-based classification accuracy of 90.14% on Tobacco3482 without transfer learning from RVL-CDIP. Finally, our proposed model may serve as a powerful visual backbone for future multimodal approaches by providing much richer visual features than existing counterparts.