YOLOv10 Real-time End-to-end Object Detection
YOLOv10 is a new generation of the YOLO series of real-time, end-to-end object detectors, developed by researchers at Tsinghua University. It is built on the Ultralytics Python package and aims to address the shortcomings of previous YOLO versions in post-processing and model architecture. By eliminating non-maximum suppression (NMS) and optimizing various model components, YOLOv10 achieves state-of-the-art performance while significantly reducing computational overhead. The research team describes the framework in detail in the paper "YOLOv10: Real-Time End-to-End Object Detection".
Background
Over the past few years, YOLO has become the dominant paradigm in real-time object detection thanks to its effective balance between computational cost and detection performance. Researchers have explored YOLO's architecture design, optimization objectives, data augmentation strategies, and more, and have made significant progress. However, the reliance on non-maximum suppression (NMS) for post-processing hinders end-to-end deployment of YOLO and adversely affects inference latency. In addition, the individual components of YOLO have not been examined comprehensively and thoroughly, which leaves significant computational redundancy and limits the capability of the model. As a result, efficiency is suboptimal and there is considerable room for performance improvement.
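To make the post-processing cost concrete, here is a minimal sketch of the greedy NMS step that earlier YOLO versions run after the network. It is a plain-Python illustration, not the implementation used by any particular YOLO release; the box format `(x1, y1, x2, y2)`, the function names, and the 0.5 IoU threshold are assumptions chosen for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_thresh: float = 0.5) -> List[int]:
    """Return the indices of the boxes that survive, highest score first."""
    # Visit candidates from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one object plus one separate object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

Each candidate is compared against every box already kept, so the cost grows with the number of overlapping predictions, and the result depends on a hand-tuned threshold. This is the step an NMS-free detector removes from the inference path.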
YOLOv10 Research Introduction
In this work, the research team aims to push the performance-efficiency boundary of YOLO further from two directions: post-processing and model architecture. First, the team proposes consistent dual assignments for NMS-free YOLO training, which deliver competitive performance and low inference latency at the same time. Second, the team introduces a holistic efficiency-accuracy driven model design strategy, comprehensively optimizing the components of YOLO from both the efficiency and the accuracy perspective, which greatly reduces computational overhead and improves capability. The result is a new generation of the YOLO series for real-time end-to-end object detection, called YOLOv10. Extensive experiments show that YOLOv10 achieves state-of-the-art performance and efficiency across model scales. For example, YOLOv10-S is 1.8 times faster than RT-DETR-R18 at similar AP on COCO, and YOLOv10-B has 46% lower latency and 25% fewer parameters than YOLOv9-C at the same performance.
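The two comparisons above are stated in different units ("times faster" versus "percent lower latency"), so it can help to convert between them. The small sketch below does only that arithmetic; the helper names are hypothetical and no measured latencies are assumed beyond the two ratios quoted in the text.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """A fractional latency reduction r means new_latency = (1 - r) * old_latency."""
    return 1.0 / (1.0 - reduction)


def latency_reduction_from_speedup(speedup: float) -> float:
    """A speedup s means new_latency = old_latency / s."""
    return 1.0 - 1.0 / speedup


# 46% lower latency (YOLOv10-B vs. YOLOv9-C) expressed as a speedup factor.
b_speedup = speedup_from_latency_reduction(0.46)

# 1.8x faster (YOLOv10-S vs. RT-DETR-R18) expressed as a latency reduction.
s_reduction = latency_reduction_from_speedup(1.8)
```

Under these definitions, a 46% latency reduction corresponds to roughly a 1.85x speedup, and a 1.8x speedup corresponds to roughly 44% lower latency, so the two headline figures describe gains of similar size.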
The architecture of YOLOv10 includes the following key components:
- Backbone network: Responsible for feature extraction, using an enhanced version of CSPNet (Cross Stage Partial Network) to improve gradient flow and reduce computational redundancy.
- Neck: Aggregates features from different scales and performs multi-scale feature fusion through PAN (Path Aggregation Network) layers before passing them to the heads.
- One-to-many head: Generates multiple predictions for each object during training, providing rich supervision signals and improving learning accuracy.
- One-to-one head: Generates a single best prediction for each object during inference, so no NMS is needed, which reduces latency and improves efficiency.
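The difference between the one-to-many and one-to-one branches can be sketched as a label-assignment rule for a single ground-truth object. The example below assumes a matching metric of the form m = s * p^alpha * IoU^beta, where p is the predicted class score, IoU is the overlap with the ground-truth box, and s indicates whether the prediction's anchor point lies inside the object. The values alpha = 0.5 and beta = 6, the top-k size of 3, and all function names are assumptions for illustration, not the training code itself.

```python
from typing import List


def matching_metric(p: float, iou: float, inside: bool,
                    alpha: float = 0.5, beta: float = 6.0) -> float:
    """Assumed metric m = s * p**alpha * iou**beta, with s = 1 if inside else 0."""
    s = 1.0 if inside else 0.0
    return s * (p ** alpha) * (iou ** beta)


def assign(cls_scores: List[float], ious: List[float],
           inside: List[bool], topk: int) -> List[int]:
    """Indices of the top-k predictions chosen as positives for one object."""
    metrics = [matching_metric(p, u, s)
               for p, u, s in zip(cls_scores, ious, inside)]
    order = sorted(range(len(metrics)), key=lambda i: metrics[i], reverse=True)
    # Predictions with a zero metric (outside the object) are never positives.
    return [i for i in order[:topk] if metrics[i] > 0]


# Four candidate predictions for the same ground-truth object (toy values).
cls_scores = [0.9, 0.6, 0.8, 0.3]
ious = [0.7, 0.9, 0.5, 0.95]
inside = [True, True, True, False]

one_to_many = assign(cls_scores, ious, inside, topk=3)  # training-time branch
one_to_one = assign(cls_scores, ious, inside, topk=1)   # inference-time branch
```

Because both branches rank candidates with the same metric in this sketch, the single one-to-one positive is also the best one-to-many positive. That alignment is the intuition behind calling the dual assignments "consistent": the branch used at inference is supervised toward the same prediction the richer training branch ranks highest.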
YOLOv10 comes in several model sizes to meet different application needs:
- YOLOv10-N: Nano version, suitable for environments with extremely limited resources.
- YOLOv10-S: Small version, balancing speed and accuracy.
- YOLOv10-M: Medium version, suitable for general use.
- YOLOv10-B: Balanced version with increased width for higher accuracy.
- YOLOv10-L: Large version that improves accuracy at the expense of increased computational resources.
- YOLOv10-X: Extra-large version for maximum accuracy and performance.
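As a rough illustration of how one of these sizes might be selected in practice, the sketch below maps the variant letters to checkpoint names and loads one through the Ultralytics package. The `yolov10{letter}.pt` naming follows the convention Ultralytics uses for its published weights, but the exact file names, the helper functions `weights_name` and `load_variant`, and the availability of a given checkpoint are assumptions to verify against your installed version.

```python
VARIANTS = ("n", "s", "m", "b", "l", "x")  # Nano, Small, Medium, Balanced, Large, Extra-large


def weights_name(variant: str) -> str:
    """Build the assumed checkpoint file name for a variant letter."""
    v = variant.lower()
    if v not in VARIANTS:
        raise ValueError(f"unknown YOLOv10 variant: {variant!r}")
    return f"yolov10{v}.pt"


def load_variant(variant: str):
    """Load a YOLOv10 model; requires the `ultralytics` package to be installed."""
    # Imported lazily so the name helper works without the dependency.
    from ultralytics import YOLO
    return YOLO(weights_name(variant))
```

With the package installed, `load_variant("s")` would return a model object that can be called on an image path or array to run detection. Starting with a smaller variant and moving up only if accuracy is insufficient is usually the cheaper way to find the right trade-off.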
YOLOv10 has been extensively tested on standard benchmarks such as COCO, demonstrating superior performance and efficiency, with significant improvements in both latency and accuracy over previous versions and other contemporary detectors.