How Adding Noise Improves Object Detection in Vision Transformers: DN-DETR and Beyond
Transformers have revolutionized natural language processing, but their application to computer vision tasks such as object detection has been slower to take hold due to unique challenges. One of the early models, DETR (DEtection TRansformer), introduced by Carion et al. in 2020, used a transformer architecture to perform end-to-end object detection. However, DETR suffered from notoriously slow convergence, requiring around 500 epochs for effective training.

Early Vision Transformers and DETR

DETR used learned decoder queries to extract information from image tokens. These queries were initialized randomly and did not resemble traditional anchors, the predefined reference boxes commonly used in CNN-based object detection. While DETR achieved results comparable to Faster R-CNN, its slow training was a significant drawback. Subsequent variants addressed this: Deformable DETR by Zhu et al. (2020) introduced deformable attention, allowing each query to focus on a small set of sampled image regions, and DAB-DETR by Liu et al. (2022) used dynamic anchor boxes similar to those in anchor-based CNNs, which helped the model converge faster by encoding spatial priors directly into the queries.

Prediction to Ground Truth Matching

One of the primary challenges in training DETR and its variants is matching model predictions to ground-truth (GT) boxes. Traditional anchor-based CNNs handle this with straightforward heuristics, such as restricting matches to nearby spatial locations and using non-maximum suppression (NMS) to eliminate overlapping detections. DETR instead employs the Hungarian algorithm, a bipartite matching method, to find an optimal one-to-one assignment of predictions to GT boxes. Although effective, the algorithm has O(n^3) computational complexity and is unstable early in training.
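The matching step can be sketched with SciPy's Hungarian solver. Note that the cost matrix here is a deliberately simplified stand-in (mean L1 distance between box coordinates only); DETR's actual matching cost also includes classification and GIoU terms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions_to_gt(pred_boxes, gt_boxes):
    """One-to-one bipartite matching of predictions to GT boxes.

    Simplified stand-in cost: mean L1 distance between (cx, cy, w, h)
    coordinates. DETR's real cost adds classification and GIoU terms.
    pred_boxes: (N, 4) array, gt_boxes: (M, 4) array.
    """
    # cost[i, j] = L1 distance between prediction i and GT box j
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).mean(-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)  # O(n^3) Hungarian solve
    return [(int(i), int(j)) for i, j in zip(pred_idx, gt_idx)]

preds = np.array([[0.52, 0.50, 0.20, 0.20],
                  [0.10, 0.12, 0.30, 0.28]])
gts = np.array([[0.10, 0.10, 0.30, 0.30],
                [0.50, 0.50, 0.20, 0.20]])
pairs = match_predictions_to_gt(preds, gts)  # each prediction paired with its closest GT
```

Because the solver re-optimizes the full assignment at every step, a small perturbation of the cost matrix can flip entire rows of the matching, which is exactly the instability described next.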
Small changes in the objective function can lead to dramatic shifts in the matching result, causing object queries to jump between different objects and slowing down learning.

DN-DETR: An Elegant Solution

To address this instability, Li et al. proposed DN-DETR (DeNoising DETR) in 2022. The core idea is to create fictitious anchor boxes by adding a small amount of noise to the GT boxes. These noisy anchors are fed to the decoder as additional queries, and because each denoising (DN) query is derived from a specific GT box, it is matched to that box by construction, bypassing bipartite matching entirely. An attention mask isolates the DN queries from the ordinary learned queries to prevent information leakage between the two groups. This stabilizes training and accelerates convergence. The authors demonstrated that the technique significantly boosts performance: with a ResNet-50 backbone, DN-DETR improved average precision (AP) on the COCO detection benchmark by 1.9 points over the previous state of the art, DAB-DETR, which scored 42.2% AP. Moreover, DN-DETR reached DETR's peak performance in just one tenth of the training epochs, highlighting its efficiency.

DINO and Contrastive Denoising

Building on the success of DN-DETR, researchers developed DINO, which incorporates contrastive learning into the denoising mechanism. DINO creates not only noisy positive examples but also negative examples that are constructed to be farther from the GT boxes than the positives. By learning to differentiate between the two, the model becomes more robust and achieves even higher detection accuracy, with an AP of 49% on the COCO val2017 dataset. DINO further enhances training with multiple contrastive denoising (CDN) groups, where each GT box is paired with several noised-up anchors. This maximizes the utility of each training iteration, leading to more refined and stable query optimization.
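The denoising-query construction can be sketched as follows. This is a minimal illustration, not the papers' implementation: the hyperparameter name `lam` and the exact jitter scheme are assumptions (DN-DETR also flips class labels as label noise, and real implementations work on batched tensors with attention masks). The key property shown is the DINO-style contrast: negatives draw noise from a strictly larger range than positives, so each negative is guaranteed farther from its GT box.

```python
import numpy as np

def make_dn_queries(gt_boxes, lam=0.4, negative=False, rng=None):
    """Create noised copies of GT boxes to use as denoising queries.

    Positive queries jitter each box by a noise magnitude drawn from
    [0, lam); DINO-style negative queries draw from [lam, 2*lam), so a
    negative is always farther from its GT box than a positive.
    Each output row is matched 1:1 to the GT row it came from, so no
    bipartite matching is needed for this query group.
    Boxes are (cx, cy, w, h) in normalized coordinates; `lam` is an
    illustrative hyperparameter name, not the paper's.
    """
    rng = rng or np.random.default_rng(0)
    n = len(gt_boxes)
    lo, hi = (lam, 2 * lam) if negative else (0.0, lam)
    mag = rng.uniform(lo, hi, size=(n, 4))          # per-coordinate noise size
    sign = rng.choice([-1.0, 1.0], size=(n, 4))     # random jitter direction
    delta = sign * mag
    cx, cy, w, h = gt_boxes.T
    return np.stack([cx + delta[:, 0] * w / 2,      # center shift bounded by box size
                     cy + delta[:, 1] * h / 2,
                     w * (1 + delta[:, 2]),         # width/height rescaling
                     h * (1 + delta[:, 3])], axis=1)

gt = np.array([[0.50, 0.50, 0.20, 0.20],
               [0.20, 0.30, 0.10, 0.40]])
pos = make_dn_queries(gt)                 # positives: mild jitter, matched to gt rows
neg = make_dn_queries(gt, negative=True)  # negatives: pushed farther from gt
```

Running several such groups per image (each with independent noise draws) corresponds to DINO's multiple CDN groups, extracting more supervision from every iteration.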
Temporal Models and Tracking

Recent advancements have extended denoising techniques to temporal models, such as Sparse4Dv3, which are designed to track objects across video frames. These models store successful DN anchors alongside learned, non-DN anchors in an anchor bank and use them to enhance tracking performance. By regressing from an object's previous detection rather than from the nearest fixed anchor, these models benefit from the flexibility that denoising induces, which is especially valuable given the dynamic nature of video data.

Discussion

The denoising (DN) technique has proven effective at improving both the convergence speed and the final performance of vision transformer detectors. However, several questions remain unresolved:

Learnable vs. non-learnable anchors: Does the learnability of anchors add significant value, or would DN also enhance models with fixed, non-learnable anchors? Studies like Anchor-DETR by Wang et al. (2021) suggest that the Hungarian algorithm and spatial constraints might play a crucial role, but more research is needed to determine whether these constraints are essential.

Necessity of bipartite matching: The main advantage of DN is its ability to stabilize gradient descent by bypassing the Hungarian algorithm. If queries were manually constrained to specific image locations and a simplified matching process (e.g., patch-based matching) were used, would DN still provide a benefit?

Production considerations: Some production systems prefer to avoid NMS during inference, which makes the Hungarian algorithm attractive despite its complexity.

Denoising appears to be particularly useful in tracking scenarios. Temporal transformers rely on a continuous stream of video frames and must maintain the identity of detected objects across frames. The flexibility provided by DN, allowing the model to regress from previous detections, can be crucial in such applications.
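The anchor-bank idea behind such tracking pipelines can be illustrated with a toy sketch. All names here are invented for illustration; a real system like Sparse4Dv3 also carries learned query features, confidence scores, and ego-motion compensation.

```python
from collections import deque

class AnchorBank:
    """Toy anchor bank for temporal tracking (illustrative only).

    Stores each tracked object's most recent box so that the next
    frame's query can regress from the previous detection instead of
    from a fixed anchor. Oldest tracks are evicted when full.
    """
    def __init__(self, max_tracks=100):
        self.tracks = {}        # track_id -> last (cx, cy, w, h)
        self.order = deque()    # insertion order, used for eviction
        self.max_tracks = max_tracks

    def update(self, track_id, box):
        if track_id not in self.tracks:
            self.order.append(track_id)
            if len(self.order) > self.max_tracks:
                evicted = self.order.popleft()
                del self.tracks[evicted]
        self.tracks[track_id] = box

    def anchors_for_next_frame(self):
        # previous detections become the next frame's initial anchors
        return dict(self.tracks)

bank = AnchorBank(max_tracks=2)
bank.update("car-1", (0.4, 0.5, 0.2, 0.1))
bank.update("ped-7", (0.7, 0.6, 0.05, 0.15))
```

Because the bank's boxes already sit near the objects they track, the decoder only needs to predict a small residual per frame, which is the same "regress from a nearby noisy box" skill that denoising training exercises.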
Future research should explore these areas to better understand the full scope of DN's impact on transformer-based object detection and tracking.

Industry Insights

The introduction of denoising in transformer-based object detection represents a significant leap forward, offering both faster convergence and improved accuracy. Companies like Facebook AI Research and Google AI, which have been pivotal in the development of transformers, are likely to adopt and refine these techniques. The practical implications are vast, from enhancing real-time video analysis in autonomous vehicles to improving object recognition in surveillance systems. Further innovations in denoising and contrastive learning are expected to keep driving the performance of vision transformers, making them increasingly competitive with traditional CNNs.
