YOLOv2 and YOLO9000 Explained: Key Improvements, Architecture, and PyTorch Implementation
The paper titled "YOLO9000: Better, Faster, Stronger" introduced YOLOv2 as a major advancement over YOLOv1, with the goal of improving detection accuracy, speed, and recall. Although the title suggests a single model, YOLO9000 is a specialized version built on top of YOLOv2, designed to detect over 9,000 object categories using a hierarchical classification approach. YOLOv2 addresses two key limitations of YOLOv1: high localization error and low recall.

To improve stability and convergence, the authors added batch normalization after every convolutional layer, eliminating the need for dropout. This change alone improved mAP from 63.4% to 65.8%.

A more effective fine-tuning strategy was also adopted. Instead of jumping directly from 224×224 ImageNet classification to 448×448 PASCAL VOC detection, YOLOv2 first fine-tunes the classifier on 448×448 ImageNet images before switching to detection. This intermediate step improved mAP by a further 3.7 points, reaching 69.5%.

The use of anchor boxes was another critical change. Unlike YOLOv1, which predicted bounding box coordinates directly, YOLOv2 predicts offsets relative to predefined anchor boxes, and each anchor predicts its own class, increasing flexibility. In the paper's ablation this switch slightly lowered mAP (from 69.5% to 69.2%) but raised recall from 81% to 88%. The per-cell prediction vector changed from (B×5)+C to B×(5+C), enabling richer output per grid cell.

To choose anchor shapes, the authors ran K-means clustering on the bounding box dimensions of the training set, using the custom distance metric d(box, centroid) = 1 − IoU(box, centroid) instead of Euclidean distance. This produced five anchor priors that match real object shapes better than hand-picked ones; combined with the direct location prediction described next, mAP improved from 69.6% to 74.4%.

To prevent unstable predictions, YOLOv2 constrains the bounding box coordinates. The x and y offsets pass through a sigmoid so the predicted center stays within its grid cell (between 0 and 1), while width and height are obtained by scaling the anchor dimensions with an exponential, which keeps them positive.
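The coordinate transform just described can be sketched in PyTorch. This is an illustrative helper, not the paper's code: `decode_box` is a hypothetical name, `cx, cy` are the cell's top-left grid coordinates, and `pw, ph` are the anchor's prior width and height (all in grid units).

```python
import torch

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw network outputs (tx, ty, tw, th) into a box in grid units."""
    # Sigmoid squashes the center offsets into (0, 1), keeping the
    # predicted center inside its own grid cell.
    bx = torch.sigmoid(tx) + cx
    by = torch.sigmoid(ty) + cy
    # Exponential keeps width/height strictly positive, scaled by the anchor prior.
    bw = pw * torch.exp(tw)
    bh = ph * torch.exp(th)
    return bx, by, bw, bh
```

With all raw outputs at zero, the decoded box sits at the center of its cell with exactly the anchor's dimensions, which is what makes training start from a stable configuration.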
These constraints lead to more stable training and better localization.

The passthrough layer preserves fine-grained features that would otherwise be lost to downsampling. It takes the 26×26 feature map from earlier in the network and reorganizes it by stacking each 2×2 spatial block into the channel dimension, producing a 13×13 map; in this implementation the map is first reduced with a 1×1 convolution so that the reshaped result has 256 channels. This is then concatenated with the main 13×13 stream, enriching the final feature representation.

Multi-scale training randomly changes the input resolution every 10 batches, sampling from multiples of 32 between 320×320 and 608×608. This augmentation improves generalization and pushed mAP to 78.6% at 544×544 input.

YOLOv2 uses Darknet-19 as its backbone: 19 convolutional layers and 5 max-pooling layers requiring 5.58 billion operations per forward pass (vs. 8.52 billion for YOLOv1's custom backbone), enabling faster inference. For detection, the network is modified to include the passthrough layer and a detection head.

For YOLO9000, the authors combined COCO (80 classes) and ImageNet (over 22,000 classes) using a hierarchical label structure called WordTree. This lets the model detect both general and fine-grained object categories, e.g., predicting "airplane" and then subtypes like "jet" or "biplane", covering 9,418 object classes without requiring bounding box annotations for all of them.

In implementation, the YOLOv2 model is built in PyTorch from custom ConvBlock modules that chain convolution, batch normalization, and leaky ReLU. The Darknet-19 backbone is constructed in stages, with the passthrough layer reshaping the 26×26 feature map into a 13×13 tensor with 256 channels. This is concatenated with the main stream and passed through the final convolutional layers to produce a 13×13×125 prediction tensor (5 anchors × (5 + 20 classes)).
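The passthrough reshape is a stride-2 space-to-depth operation. A minimal sketch in PyTorch (the function name and the exact permute order are illustrative, assuming the 26×26 map has already been reduced to 64 channels):

```python
import torch

def passthrough(x):
    """Space-to-depth, stride 2: (N, C, H, W) -> (N, 4C, H/2, W/2).

    Each 2x2 spatial block is stacked into the channel dimension,
    so a 26x26x64 map becomes 13x13x256 with no information loss.
    """
    n, c, h, w = x.shape
    # Split H and W into (H/2, 2) and (W/2, 2) factors.
    x = x.view(n, c, h // 2, 2, w // 2, 2)
    # Move the two size-2 factors next to the channel axis.
    x = x.permute(0, 3, 5, 1, 2, 4).contiguous()
    # Merge them into the channels: 2 * 2 * C = 4C.
    return x.view(n, 4 * c, h // 2, w // 2)

features = torch.randn(1, 64, 26, 26)
print(passthrough(features).shape)  # torch.Size([1, 256, 13, 13])
```

The result can then be concatenated channel-wise (`torch.cat`) with the main 13×13 stream before the detection head.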
The complete architecture demonstrates how YOLOv2 balances speed and accuracy through architectural innovations like anchor boxes, batch normalization, multi-scale training, and feature fusion. The implementation in PyTorch provides a clear path to understanding how input images are transformed into detection outputs, making it a foundational model for modern object detection.
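The anchor clustering step described earlier can also be sketched compactly. This is a simplified NumPy version under the assumption that boxes are given as (width, height) pairs sharing a common center; function names are illustrative, not from the original implementation.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """Pairwise IoU between (w, h) pairs, assuming boxes share one center."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centroids[:, 0] * centroids[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    """K-means with the paper's distance d = 1 - IoU(box, centroid)."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the centroid with the smallest 1 - IoU distance.
        assign = np.argmin(1 - iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0)
                        if np.any(assign == i) else centroids[i]
                        for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```

Running this on a dataset's ground-truth box dimensions with k=5 yields the five anchor priors; the IoU-based distance keeps large boxes from dominating the clusters the way Euclidean distance would.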
