Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement

DETR-like methods have significantly increased detection performance in an end-to-end manner. The mainstream two-stage frameworks among them perform dense self-attention and select a fraction of queries for sparse cross-attention, which is proven effective for improving performance but also introduces a heavy computational burden and a high dependence on stable query selection. This paper demonstrates that suboptimal two-stage selection strategies result in scale bias and redundancy due to the mismatch between selected queries and objects in two-stage initialization. To address these issues, we propose hierarchical salience filtering refinement, which performs transformer encoding only on filtered discriminative queries, for a better trade-off between computational efficiency and precision. The filtering process overcomes scale bias through a novel scale-independent salience supervision. To compensate for the semantic misalignment among queries, we introduce elaborate query refinement modules for stable two-stage initialization. Based on the above improvements, the proposed Salience DETR achieves significant improvements of +4.0% AP, +0.2% AP, and +4.4% AP on three challenging task-specific detection datasets, as well as 49.2% AP on COCO 2017 with fewer FLOPs. The code is available at https://github.com/xiuqhou/Salience-DETR.
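
To make the core idea concrete, below is a minimal PyTorch sketch of salience filtering: a head predicts a per-token salience score, only the top-scoring tokens are passed through encoder attention, and the refined tokens are scattered back into the sequence. The names (`SalienceFilter`, `salience_head`, `keep_ratio`), the single encoder layer, and the target formula in `salience_target` are illustrative assumptions, not the paper's exact hierarchical design; see the repository above for the reference implementation.

```python
import torch
import torch.nn as nn


def salience_target(xy: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
    """Hypothetical scale-independent salience target: 1 minus the token's
    distance to the box center, normalized by the box width/height so that
    large and small objects are supervised on a comparable scale."""
    cx, cy, w, h = box.unbind(-1)
    d = torch.sqrt(((xy[..., 0] - cx) / w) ** 2 + ((xy[..., 1] - cy) / h) ** 2)
    return (1.0 - d).clamp(min=0.0)


class SalienceFilter(nn.Module):
    """Sketch of salience filtering: encode only the top-k salient tokens."""

    def __init__(self, dim: int = 256, keep_ratio: float = 0.3):
        super().__init__()
        self.salience_head = nn.Linear(dim, 1)  # per-token salience score
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True
        )
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) flattened multi-scale features
        scores = self.salience_head(tokens).squeeze(-1)           # (B, N)
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        topk = scores.topk(k, dim=1).indices                      # (B, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        selected = tokens.gather(1, idx)                          # (B, k, dim)
        refined = self.encoder_layer(selected)  # attend only over salient tokens
        # Scatter refined tokens back; unselected tokens pass through unchanged.
        return tokens.scatter(1, idx, refined)


feats = torch.randn(2, 1000, 256)
out = SalienceFilter()(feats)  # same shape; only ~30% of tokens re-encoded
```

Because self-attention cost grows quadratically with the number of tokens, restricting encoding to a filtered subset is what yields the efficiency/precision trade-off described in the abstract.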