Review: G-RMI — Winner in 2016 COCO Detection (Object Detection)

A Guide to Select a Detection Architecture: Faster R-CNN, R-FCN and SSD

SH Tsang, Jan 11

This time, the work of G-RMI (Google Research and Machine Intelligence), which won 1st place in the 2016 MS COCO detection challenge, is reviewed.

G-RMI is the team name attending the challenge.

It is not the name of a proposed approach, because the team did not win the challenge with a novel idea such as a modified deep learning architecture.

The paper title, “Speed/accuracy trade-offs for modern convolutional object detectors”, also hints that they systematically investigated different kinds of object detectors and feature extractors.

Specifically:

3 object detectors (meta-architectures): Faster R-CNN, R-FCN, and SSD

6 feature extractors: VGG-16, ResNet-101, Inception-v2, Inception-v3, Inception-ResNet-v2 and MobileNet

They also analysed the effects of other parameters, such as input image size and number of region proposals.

Finally, an ensemble of several models achieved the state-of-the-art results and won the challenge.

The paper was published at CVPR 2017 and has more than 400 citations.

(SH Tsang @ Medium)

Outline

1. Meta-architectures
2. Feature Extractors
3. Accuracy vs Time
4. Effect of Feature Extractor
5. Effect of Object Size
6. Effect of Image Size
7. Effect of the Number of Proposals
8. FLOPs Analysis
9. Memory Analysis
10. Good localization at .75 IOU means good localization at all IOU thresholds
11. State-of-the-art Detection Results on COCO

1. Meta-architectures

The object detectors are referred to as meta-architectures here.

Three meta-architectures are investigated: Faster R-CNN, R-FCN, and SSD.

[Figure: Abstract architecture]

SSD

It uses a single feed-forward convolutional network to directly predict classes and anchor offsets without requiring a second stage per-proposal classification operation.

Faster R-CNN

In the first stage, called the region proposal network (RPN), images are processed by a feature extractor (e.g., VGG-16), and features at some selected intermediate level (e.g., “conv5”) are used to predict class-agnostic box proposals.

In the second stage, these (typically 300) box proposals are used to crop features from the same intermediate feature map (ROI pooling), which are subsequently fed to the remainder of the feature extractor (e.g., “fc6” followed by “fc7”) in order to predict a class and class-specific box refinement for each proposal.

R-FCN

Similar to Faster R-CNN, there is an RPN in the first stage.

In the second stage, position-sensitive score maps are used, so that crops (ROI pooling) are taken from the last layer of features prior to prediction.

This makes the per-ROI cost very low, as nearly all operations are shared before ROI pooling.

Thus, it achieves accuracy comparable to Faster R-CNN, often at a faster running time.
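SSD and the RPN stage both predict boxes as offsets relative to anchors. As a minimal sketch, assuming the commonly used center-size parameterization from Faster R-CNN (the function name `decode_box` and the sample numbers are illustrative, and implementation-specific scale factors are omitted), a predicted offset could be turned back into a box like this:

```python
import math

def decode_box(anchor, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor (ymin, xmin, ymax, xmax)."""
    ymin, xmin, ymax, xmax = anchor
    tx, ty, tw, th = offsets
    # Convert the anchor from corner form to center/size form.
    wa, ha = xmax - xmin, ymax - ymin
    xa, ya = xmin + wa / 2.0, ymin + ha / 2.0
    # Centers shift proportionally to the anchor size; sizes scale exponentially.
    x, y = xa + tx * wa, ya + ty * ha
    w, h = wa * math.exp(tw), ha * math.exp(th)
    return (y - h / 2.0, x - w / 2.0, y + h / 2.0, x + w / 2.0)

# Zero offsets reproduce the anchor itself.
print(decode_box((10.0, 20.0, 50.0, 80.0), (0.0, 0.0, 0.0, 0.0)))
```

Predicting offsets rather than absolute coordinates keeps the regression targets small and comparable across anchors of different sizes.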


2. Feature Extractors

Six feature extractors are tried: VGG-16, ResNet-101, Inception-v2, Inception-v3, Inception-ResNet-v2 and MobileNetV1.

[Figure: Top-1 classification accuracy on ImageNet]

For each feature extractor, a different layer is used to extract the features for object detection.

For some feature extractors, modifications are made, such as using dilated convolutions or reducing the max pooling stride, so that the effective output stride is not too large (that is, the feature map is not too coarse) after feature extraction.
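To make the stride modification concrete, here is a rough sketch. The layer strides below are illustrative placeholders, not the exact configuration of any network in the paper. The effective output stride is the product of the per-layer strides, so replacing one stride-2 layer with a stride-1 layer (with dilation in the later convolutions to preserve the receptive field) halves it.

```python
from functools import reduce

def output_stride(layer_strides):
    """Effective stride of a feature map: product of the strides of all layers before it."""
    return reduce(lambda a, b: a * b, layer_strides, 1)

def feature_map_size(image_size, layer_strides):
    """Approximate spatial size of the feature map (padding effects ignored)."""
    return image_size // output_stride(layer_strides)

# Hypothetical backbone with five stride-2 stages.
baseline = [2, 2, 2, 2, 2]
# Same backbone with the last stage's stride set to 1
# (later convolutions would use dilation 2 to compensate).
modified = [2, 2, 2, 2, 1]

print(output_stride(baseline), feature_map_size(600, baseline))  # 32 18
print(output_stride(modified), feature_map_size(600, modified))  # 16 37
```

A smaller output stride gives a denser feature map, which costs more computation but leaves more spatial detail for the detector.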


3. Accuracy vs Time

[Figure: Accuracy vs time; the dotted line is the optimality frontier. Colors: feature extractors. Marker shapes: meta-architectures]

[Table: Test-dev performance of the “critical” points along the optimality frontier]

3.1. General Observations

R-FCN and SSD are faster on average.

Faster R-CNN is slower but more accurate, requiring at least 100 ms per image.



3.2. Critical Points on Optimality Frontier

Fastest: SSD w/MobileNet

SSDs with Inception-v2 and MobileNet are the most accurate of the fastest models.

Ignoring post-processing costs, MobileNet seems to be roughly twice as fast as Inception-v2 while being slightly worse in accuracy.

Sweet Spot: R-FCN w/ResNet or Faster R-CNN w/ResNet and only 50 proposals

There is an “elbow” in the middle of the optimality frontier occupied by R-FCN models using ResNet feature extractors.

This is the best balance between speed and accuracy among the model configurations.

Most Accurate: Faster R-CNN w/Inception-ResNet at stride 8

Faster R-CNN with dense output Inception-ResNet-v2 models attain the best possible accuracy on the optimality frontier.

Yet, these models are slow, requiring nearly a second of processing time.
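The optimality frontier discussed in this section is the set of models for which no other model is both faster and more accurate. A minimal sketch of how such a frontier could be extracted from (time, mAP) measurements is shown below; the model names and numbers are made up for illustration and are not results from the paper.

```python
def optimality_frontier(models):
    """Return models for which no other model is both faster and more accurate.

    `models` is a list of (name, gpu_time_ms, mAP) tuples.
    """
    frontier = []
    best_map = float("-inf")
    # Sweep from fastest to slowest; on equal time, visit the more accurate model first.
    for name, time_ms, m_ap in sorted(models, key=lambda m: (m[1], -m[2])):
        if m_ap > best_map:
            frontier.append((name, time_ms, m_ap))
            best_map = m_ap
    return frontier

# Hypothetical measurements, for illustration only.
models = [
    ("ssd_small", 30, 19.0),
    ("ssd_big", 60, 22.0),
    ("rfcn_a", 90, 30.0),
    ("frcnn_a", 120, 29.0),  # slower and less accurate than rfcn_a, so dominated
    ("frcnn_b", 800, 35.0),
]
print([name for name, _, _ in optimality_frontier(models)])
# ['ssd_small', 'ssd_big', 'rfcn_a', 'frcnn_b']
```

The "critical" points named above (fastest, sweet spot, most accurate) are then simply particular positions along this frontier.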


4. Effect of Feature Extractor

[Figure: Accuracy of detector (mAP on COCO) vs accuracy of feature extractor]

Intuitively, stronger performance on classification should be positively correlated with stronger performance on COCO detection.

This correlation appears to only be significant for Faster R-CNN and R-FCN while the performance of SSD appears to be less reliant on its feature extractor’s classification accuracy.


5. Effect of Object Size

[Figure: Accuracy stratified by object size, meta-architecture and feature extractor; image resolution is fixed to 300]

All methods do much better on large objects.

SSDs typically have (very) poor performance on small objects, but still SSDs are competitive with Faster R-CNN and R-FCN on large objects.

DSSD was later proposed to address the small-object detection issue.


6. Effect of Image Size

[Figure: Effect of image resolution]

Decreasing resolution by a factor of two in both dimensions consistently lowers accuracy (by 15.88% on average) but also reduces inference time by a relative factor of 27.4% on average.

High resolution inputs allow for small objects to be resolved.

High resolution models lead to significantly better mAP results on small objects (by a factor of 2 in many cases) and somewhat better mAP results on large objects as well.


7. Effect of the Number of Proposals

[Figure: Faster R-CNN (left), R-FCN (right)]

The RPN (the first stage) can output a different number of proposals.

Fewer proposals give a faster running time, and more proposals a slower one.

Faster R-CNN

Inception-ResNet, which has 35.4% mAP with 300 proposals, can still have surprisingly high accuracy (29% mAP) with only 10 proposals.

The sweet spot is probably at 50 proposals, where we are able to obtain 96% of the accuracy of using 300 proposals while reducing running time by a factor of 3.
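These figures can be checked with a few lines of arithmetic. The snippet uses only the numbers stated in the text (35.4% mAP at 300 proposals, 29% mAP at 10 proposals, 96% of the accuracy retained at 50 proposals); the variable names are illustrative.

```python
map_300 = 35.4         # mAP (%) with 300 proposals
map_10 = 29.0          # mAP (%) with 10 proposals
retained_at_50 = 0.96  # fraction of the 300-proposal accuracy kept at 50 proposals

# Fraction of accuracy kept when going from 300 down to 10 proposals.
retained_at_10 = map_10 / map_300
# mAP implied by keeping 96% of the 300-proposal accuracy.
map_50 = retained_at_50 * map_300

print(f"10 proposals keep {retained_at_10:.1%} of the accuracy")
print(f"50 proposals imply about {map_50:.1f}% mAP")
```

So even a 30-fold cut in proposals keeps roughly four fifths of the accuracy, which is why the text calls the 10-proposal result surprisingly high.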

R-FCN

The computational savings from using fewer proposals in the R-FCN setting are minimal.

This is not surprising because, as mentioned, the per-ROI computation cost is low for R-FCN, since the position-sensitive score maps allow most computation to be shared.
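A toy cost model makes the contrast clear. Assume running time is a shared cost paid once per image plus a per-ROI cost paid once per proposal. The millisecond values below are invented for illustration and are not measurements from the paper.

```python
def inference_time_ms(shared_ms, per_roi_ms, num_proposals):
    """Total time = computation shared by all ROIs + per-ROI computation."""
    return shared_ms + per_roi_ms * num_proposals

# Hypothetical costs: Faster R-CNN runs part of the feature extractor per ROI,
# while R-FCN shares almost everything before ROI pooling.
detectors = {
    "faster_rcnn": {"shared_ms": 80.0, "per_roi_ms": 1.0},
    "rfcn": {"shared_ms": 100.0, "per_roi_ms": 0.05},
}

for name, cost in detectors.items():
    t300 = inference_time_ms(num_proposals=300, **cost)
    t50 = inference_time_ms(num_proposals=50, **cost)
    print(f"{name}: 300 proposals {t300:.1f} ms, 50 proposals {t50:.1f} ms, "
          f"speed-up x{t300 / t50:.2f}")
```

Under these assumed costs, cutting proposals from 300 to 50 speeds up the Faster R-CNN-style model by almost 3x but the R-FCN-style model by only about 1.1x, because the per-ROI term is a small share of its total.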

Comparison between Faster R-CNN and R-FCN

At 100 proposals, Faster R-CNN models with ResNet become roughly comparable, in both mAP and GPU speed, to equivalent R-FCN models that use 300 proposals.


8. FLOPs Analysis

[Figure: FLOPs vs time]

For denser block models such as ResNet-101, the ratio of FLOPs to GPU time is typically greater than 1.

For Inception and MobileNet models, this ratio is typically less than 1.

Perhaps factorization reduces FLOPs but adds more overhead in memory I/O, or perhaps current GPU instructions (cuDNN) are more optimized for dense convolution.


9. Memory Analysis

[Figure: Memory (Mb) vs time]

Memory usage is highly correlated with running time, with larger and more powerful feature extractors requiring much more memory.

As with speed, MobileNet is the cheapest, requiring less than 1Gb (total) memory in almost all settings.


10. Good localization at .75 IOU means good localization at all IOU thresholds

[Figure: Overall COCO mAP (@[.5:.95]) for all experiments plotted against corresponding mAP@.50IOU and mAP@.75IOU]

Both mAP@.5 and mAP@.75 performances are almost perfectly linearly correlated with mAP@[.5:.95].

mAP@.75 is slightly more tightly correlated with mAP@[.5:.95] (with R² > 0.99), so if we were to replace the standard COCO metric with mAP at a single IOU threshold, IOU=.75 is likely to be chosen.
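For reference, the standard COCO metric mAP@[.5:.95] averages AP over ten IOU thresholds from .50 to .95 in steps of .05. The sketch below builds that threshold list and computes the R² of a least-squares line between two metrics. The mAP values are invented placeholders, not numbers from the paper, and the helper names are illustrative.

```python
def coco_iou_thresholds():
    """IOU thresholds .50, .55, ..., .95 used by the standard COCO metric."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]

def mean_ap(ap_per_threshold):
    """mAP@[.5:.95] is the plain average of AP over the ten thresholds."""
    return sum(ap_per_threshold) / len(ap_per_threshold)

def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

print(coco_iou_thresholds())

# Hypothetical (mAP@.75, mAP@[.5:.95]) pairs for four models.
map_75 = [20.0, 25.0, 31.0, 36.0]
map_coco = [19.5, 23.8, 29.2, 33.9]
print(round(r_squared(map_75, map_coco), 4))
```

An R² close to 1 means a single-threshold metric ranks models almost exactly as the averaged metric does, which is the argument made above for IOU=.75.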


11. State-of-the-art Detection Results on COCO

11.1. Ensembling and Multicrop

[Table: Summary of 5 Faster R-CNN single models]

Since mAP is the main objective in the COCO detection challenge, the most accurate, though time-consuming, Faster R-CNN is considered.

The diversity of the results encourages ensembling.

[Table: Performance on the 2016 COCO test-challenge dataset]

G-RMI: The final model is an ensemble of the above 5 models with multicrop inference.

It outperforms the winner in 2015 and 2nd place in 2016.

The winner in 2015 uses ResNet + Faster R-CNN + NoCs.

The 2nd place in 2016, Trimps-Soushen, uses Faster R-CNN + an ensemble of multiple models + improvements from other papers.

(There is a paper for NoCs, but there are no details about Trimps-Soushen.)

Note: G-RMI uses no multiscale training, horizontal flipping, box refinement, box voting, or global context.

[Table: Effects of ensembling and multicrop inference]

2nd row: 6 Faster R-CNN models, 3 with ResNet-101 and 3 with Inception-ResNet-v2.

3rd row: the diverse ensemble from the first table in this section.

The diverse ensemble outperforms the hand-selected one, which is encouraging evidence that diversity helps.

Ensembling and multicrop together were responsible for almost 7 points of improvement over a single model.



11.2. Detections from 5 Different Models

[Figures: Beach, Baseball, Elephants]

References

[2017 CVPR] [G-RMI] Speed/accuracy trade-offs for modern convolutional object detectors

My Related Reviews

Image Classification: [LeNet] [AlexNet] [ZFNet] [VGGNet] [SPPNet] [PReLU-Net] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet]

Object Detection: [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [DeepID-Net] [R-FCN] [ION] [MultiPathNet] [NoC] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000]

Semantic Segmentation: [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [ParseNet] [DilatedNet] [PSPNet]

Instance Segmentation: [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN]
