Bridging the Performance Gap between DETR and R-CNN for Graphical Object Detection in Document Images

This paper takes an important step toward bridging the performance gap between DETR and R-CNN for graphical object detection. Existing graphical object detection approaches have benefited from recent advances in CNN-based object detection methods, achieving remarkable progress. More recently, transformer-based detectors have considerably boosted generic object detection performance: by relying on object queries, they eliminate the need for hand-crafted components and post-processing steps such as Non-Maximum Suppression (NMS). However, the effectiveness of these enhanced transformer-based detection algorithms has yet to be verified for graphical object detection. Inspired by the latest advancements in DETR, we employ an existing detection transformer with few modifications for graphical object detection. We modify the object queries in several ways, representing them as points or anchor boxes and adding positive and negative noise to the anchors to boost performance. These modifications allow better handling of objects with varying sizes and aspect ratios, greater robustness to small variations in object positions and sizes, and improved discrimination between objects and non-objects. We evaluate our approach on four graphical object detection datasets: PubTables, TableBank, NTable, and PubLayNet. Upon integrating the query modifications into DETR, we outperform prior work and achieve new state-of-the-art results with mAPs of 96.9\%, 95.7\%, and 99.3\% on TableBank, PubLayNet, and PubTables, respectively. Extensive ablations show that transformer-based methods are as effective for document analysis as they are for other applications. We hope this study draws more attention to research on detection transformers in document image analysis.
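The anchor-noising idea mentioned above (adding positive and negative noise to anchor boxes, in the spirit of denoising-query training) can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, noise scales, and box format (normalized center-x, center-y, width, height) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter_boxes(boxes, scale, rng):
    """Add uniform noise to (cx, cy, w, h) boxes, proportional to each box's size.

    A small `scale` yields positive (slightly perturbed) queries that should
    still be matched to the object; a large `scale` yields negative queries
    that the detector is trained to reject as non-objects.
    """
    boxes = np.asarray(boxes, dtype=float)
    wh = boxes[:, 2:4]
    # noise for (cx, cy) scaled by (w, h), and for (w, h) scaled by themselves
    noise = rng.uniform(-scale, scale, boxes.shape) * np.tile(wh, 2)
    return np.clip(boxes + noise, 0.0, 1.0)

# hypothetical ground-truth boxes in normalized (cx, cy, w, h) format
gt = np.array([[0.5, 0.5, 0.2, 0.1],
               [0.3, 0.7, 0.4, 0.2]])

positive = jitter_boxes(gt, scale=0.1, rng=rng)  # small noise: still "this object"
negative = jitter_boxes(gt, scale=0.4, rng=rng)  # large noise: trained as "no object"
denoising_queries = np.concatenate([positive, negative], axis=0)
```

During training, such noised anchors would be appended to the learned object queries, giving the decoder an auxiliary denoising task that sharpens the object/non-object boundary.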