Review: DCN — Deformable Convolutional Networks, 2nd Runner Up in 2017 COCO Detection (Object Detection)

With Deformable Convolution, Improved Faster R-CNN and R-FCN, Got 2nd Runner Up in COCO Detection & 3rd Runner Up in COCO Segmentation.

After reviewing STN, this time DCN (Deformable Convolutional Networks), by Microsoft Research Asia (MSRA), is reviewed.

(a) Conventional Convolution, (b) Deformable Convolution, (c) Special Case of Deformable Convolution with Scaling, (d) Special Case of Deformable Convolution with Rotation

Conventional (regular) convolution operates on a pre-defined rectangular grid over an input image or a set of input feature maps, determined by the filter size.

This grid can be of size 3×3, 5×5, etc.

However, objects that we want to detect and classify can be deformed or occluded within the image.

In DCN, the grid is deformable in the sense that each grid point is moved by a learnable offset.

The convolution then operates on these moved grid points, which is therefore called deformable convolution; similarly for the case of deformable RoI pooling.

By using these two new modules, DCN improves the accuracy of DeepLab, Faster R-CNN, R-FCN, FPN, etc.

Finally, by using DCN+FPN+Aligned Xception, MSRA took 2nd Runner Up in the COCO Detection Challenge and 3rd Runner Up in the Segmentation Challenge.

It is published in 2017 ICCV with more than 200 citations.

(SH Tsang @ Medium)

Outline
1. Deformable Convolution
2. Deformable RoI Pooling
3. Deformable Position-Sensitive (PS) RoI Pooling
4. Deformable ConvNets Using ResNet-101 & Aligned-Inception-ResNet
5. Ablation Study & Results
6. More Results on COCO Detection Challenge Using Aligned Xception

1. Deformable Convolution

Regular convolution operates on a regular grid R.

Deformable convolution also operates on R, but with each point augmented by a learnable offset ∆pn.

A separate convolution layer generates 2N feature maps, corresponding to the N 2D offsets ∆pn (one x-direction and one y-direction component per offset).

Standard Convolution (Left), Deformable Convolution (Right)

As shown above, deformable convolution picks values at different locations for the convolution, conditioned on the input image or feature maps.

Compared with atrous convolution: atrous convolution has a larger but fixed dilation value during convolution, while in deformable convolution, a different learned offset is applied to each point in the grid.

(Atrous convolution is also called dilated convolution or the hole algorithm.)

Compared with Spatial Transformer Network (STN): STN performs a transform on the input image or feature maps, while deformable convolution can be treated as an extremely light-weight STN.
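To make the sampling concrete, below is a minimal PyTorch sketch of a 3×3 deformable convolution. This is an illustration under assumptions, not the paper's implementation (which uses an efficient CUDA kernel): the class name DeformableConv2d and the per-grid-point grid_sample loop are mine, and a 1×1 convolution over the N sampled copies plays the role of the kernel weights w(pn).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableConv2d(nn.Module):
    """Illustrative 3x3 deformable convolution (stride 1, same padding)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        n = k * k
        # Offset branch: 2N channels = one (dy, dx) pair per grid point p_n.
        # Zero-initialized, so training starts as a regular convolution.
        self.offset_conv = nn.Conv2d(in_ch, 2 * n, k, padding=k // 2)
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)
        # 1x1 conv over the N sampled copies plays the role of w(p_n).
        self.weight = nn.Conv2d(in_ch * n, out_ch, 1)

    def forward(self, x):
        b, _, h, w = x.shape
        k, n = self.k, self.k * self.k
        off = self.offset_conv(x).view(b, n, 2, h, w)  # learned offsets dp_n
        ys, xs = torch.meshgrid(
            torch.arange(h, device=x.device, dtype=x.dtype),
            torch.arange(w, device=x.device, dtype=x.dtype),
            indexing="ij")                             # base locations p_0
        sampled, i = [], 0
        for dy in range(-(k // 2), k // 2 + 1):        # regular grid R
            for dx in range(-(k // 2), k // 2 + 1):
                py = ys + dy + off[:, i, 0]            # p_0 + p_n + dp_n
                px = xs + dx + off[:, i, 1]
                # Offsets are fractional, so sample by bilinear interpolation;
                # grid_sample expects coordinates normalized to [-1, 1].
                gy = 2.0 * py / max(h - 1, 1) - 1.0
                gx = 2.0 * px / max(w - 1, 1) - 1.0
                grid = torch.stack((gx, gy), dim=-1)   # (B, H, W, 2)
                sampled.append(F.grid_sample(x, grid, align_corners=True))
                i += 1
        return self.weight(torch.cat(sampled, dim=1))  # sum_n w(p_n) * x(...)

# Quick shape check.
layer = DeformableConv2d(16, 32)
print(layer(torch.randn(1, 16, 8, 8)).shape)  # torch.Size([1, 32, 8, 8])
```

Zero-initializing the offset branch means the layer behaves exactly like a regular convolution at the start of training, which matches how the paper initializes the added offset layers.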

2. Deformable RoI Pooling

Regular RoI pooling converts an input rectangular region of arbitrary size into fixed-size features.

In deformable RoI pooling, first, at the top path, we still need regular RoI pooling to generate the pooled feature map.

Then, a fully connected (fc) layer generates the normalized offsets ∆p̂ij, which are transformed into the offsets ∆pij (equation at the bottom right of the figure): ∆pij = γ·∆p̂ij∘(w, h), where γ = 0.1 and (w, h) is the RoI's width and height.

The offset normalization is necessary to make the offset learning invariant to RoI size.

Finally, at the bottom path, we perform deformable RoI pooling.

The output feature map is pooled from regions shifted by these learned offsets.
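As a small sketch of the offset transform above (the function name and tensor layout are my own assumptions):

```python
import torch

def roi_offsets(norm_offsets: torch.Tensor, roi_w: float, roi_h: float,
                gamma: float = 0.1) -> torch.Tensor:
    """Turn fc-predicted normalized offsets dp_hat_ij into pixel offsets dp_ij.

    norm_offsets: (k*k, 2) tensor, one (dx, dy) per pooling bin, as output by
    the fc layer on top of the regularly pooled feature map.
    Implements dp_ij = gamma * dp_hat_ij elementwise-multiplied by (w, h), so
    the offsets scale with the RoI, making offset learning invariant to RoI size.
    """
    scale = torch.tensor([roi_w, roi_h], dtype=norm_offsets.dtype)
    return gamma * norm_offsets * scale

# Example: 3x3 bins on a 128x64 RoI; zero offsets reduce to regular RoI pooling.
print(roi_offsets(torch.zeros(9, 2), roi_w=128.0, roi_h=64.0))
```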

3. Deformable Position-Sensitive (PS) RoI Pooling

Deformable Position-Sensitive (PS) RoI Pooling (colors are important here)

For the original Position-Sensitive (PS) RoI pooling in R-FCN, the input feature maps are first converted into k² score maps for each object class (C + 1 in total, for C object classes plus 1 background). (It is better to read the R-FCN review first to understand the original PS RoI pooling.)

In deformable PS RoI pooling, first, at the top path, similarly to the original, a conv layer generates 2k²(C+1) offset maps.

That means, for each class, there are k² pairs of offset maps, corresponding to the {top-left (TL), top-center (TC), …, bottom-right (BR)} bins whose offsets we want to learn.

At the top path, the original PS RoI pooling is applied to these offset maps: each bin is pooled from the region with the same color in the figure. This yields the offsets.

Finally, at the bottom path, we perform deformable PS RoI pooling to pool the feature maps augmented by the offsets.
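To keep the channel bookkeeping straight, here is a minimal sketch of the two conv branches and their output shapes (k, C, and in_ch are illustrative values, the names are mine, and the actual pooling with offsets is omitted):

```python
import torch
import torch.nn as nn

k, C, in_ch = 3, 80, 1024  # k*k bins, C object classes, backbone channels

# Top path: 2*k^2*(C+1) offset maps -- an (x, y) offset pair for every class
# and every one of the k^2 bins (TL ... BR).
offset_conv = nn.Conv2d(in_ch, 2 * k * k * (C + 1), kernel_size=1)

# Bottom path: the usual R-FCN branch with k^2*(C+1) position-sensitive
# score maps, later pooled at bin positions shifted by the learned offsets.
score_conv = nn.Conv2d(in_ch, k * k * (C + 1), kernel_size=1)

x = torch.randn(1, in_ch, 64, 64)
print(offset_conv(x).shape)  # torch.Size([1, 1458, 64, 64]) = 2*9*81 channels
print(score_conv(x).shape)   # torch.Size([1, 729, 64, 64]) = 9*81 channels
```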

4. Deformable ConvNets Using ResNet-101 & Aligned-Inception-ResNet

4.1. Aligned-Inception-ResNet

Aligned-Inception-ResNet Architecture (Left), Inception Residual Block (IRB) (Right)

In the original Inception-ResNet, introduced in the Inception-v4 paper, there is an alignment problem: for a cell on the feature maps close to the output, its projected spatial location on the image is not aligned with the location of its receptive field center.

In Aligned-Inception-ResNet, within the Inception Residual Block (IRB), all asymmetric convolutions used for factorization (e.g. 1×7, 7×1, 1×3, 3×1 convs) are removed.

Only one type of IRB is used as shown above.

Also, the number of IRBs differs from both Inception-ResNet-v1 and Inception-ResNet-v2.

Error Rates on ImageNet-1K validation.

Aligned-Inception-ResNet has lower error rate than ResNet-101.

Though Aligned-Inception-ResNet has a higher error rate than Inception-ResNet-v2, it solves the alignment issue.

4.2. Modified ResNet-101 & Aligned-Inception-ResNet

Now we have two backbones for feature extraction: ResNet-101 and Aligned-Inception-ResNet, both originally designed for the image classification task.

However, the output feature map is too small, which is not good for object detection and segmentation tasks.

At the beginning of the last block (conv5), the stride is changed from 2 to 1, and atrous convolution (dilated convolution) is used within conv5 to preserve the receptive field.

Thus, the effective stride in the last convolutional block is reduced from 32 pixels to 16 pixels, increasing the feature map resolution.
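For reference, a minimal sketch of this modification using torchvision's ResNet (replace_stride_with_dilation is a real torchvision option; the authors applied the same trick to their own implementation rather than this API):

```python
import torch
from torchvision.models import resnet101

# Replace the stride-2 downsampling at the start of conv5 (layer4) with
# stride 1 + dilation 2, dropping the output stride from 32 to 16.
backbone = resnet101(replace_stride_with_dilation=[False, False, True])

x = torch.randn(1, 3, 224, 224)
feat = backbone.layer4(backbone.layer3(backbone.layer2(backbone.layer1(
    backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))))))
print(feat.shape)  # torch.Size([1, 2048, 14, 14]) -- 224/16, not 224/32
```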

4.3. Different Object Detectors

After feature extraction, different object detectors or segmentation schemes are applied, such as DeepLab, class-aware RPN (which can be treated as a simplified SSD), Faster R-CNN, and R-FCN.

5. Ablation Study & Results

Semantic Segmentation

PASCAL VOC: 20 categories; VOC 2012 dataset with additional mask annotations; 10,582 images for training and 1,449 images for validation. mIoU@V is used for evaluation.

Cityscapes: 19 categories + 1 background category; 2,975 images for training and 500 images for validation. mIoU@C is used for evaluation.

Object Detection

PASCAL VOC: union of VOC 2007 trainval and VOC 2012 trainval for training, VOC 2007 test for evaluation. mAP@0.5 and mAP@0.7 are used.

COCO: 120k images in trainval, 20k images in test-dev. mAP@[0.5:0.95] and mAP@0.5 are used for evaluation.

5.1. Applying Deformable Convolution on Different Numbers of Last Layers

Using either 3 or 6 deformable convolution layers gives good results.

Finally, 3 is chosen by the authors as a good trade-off across different tasks.

We can also see that DCN improves DeepLab, class-aware RPN (or simplified SSD), Faster R-CNN, and R-FCN.

5.2. Analysis of Deformable Convolution Offset Distance

Analysis of deformable convolution in the last 3 convolutional layers

An analysis is performed, as above, to illustrate the effectiveness of DCN.

First, the deformable convolution filters are categorized into four classes: small, medium, large, and background, according to the ground truth bounding box annotation and where the filter center is.

Then, the mean and standard deviation of the dilation value (offset distance) are measured.

It is found that the receptive field sizes of deformable filters are correlated with object sizes, indicating that the deformation is effectively learned from image content.

And the filter sizes on the background region are between those on medium and large objects, indicating that a relatively large receptive field is necessary for recognizing the background regions.

Similarly, for deformable RoI pooling, the parts (bins) are now offset so that they cover the non-rigid objects.

5.3. Comparison with Atrous Convolution on PASCAL VOC

Only Deformable Convolution: DeepLab, class-aware RPN, and R-FCN with deformable convolution already outperform their counterparts with atrous convolution. Faster R-CNN with deformable convolution obtains results competitive with Faster R-CNN with atrous convolution (4,4,4).

Only Deformable RoI Pooling: RoI pooling exists only in Faster R-CNN and R-FCN. Faster R-CNN with deformable RoI pooling obtains results competitive with Faster R-CNN with atrous convolution (4,4,4).

R-FCN with deformable RoI pooling outperforms R-FCN with atrous convolution (4,4,4).

Both Deformable Convolution & RoI Pooling: Faster R-CNN and R-FCN with deformable convolution & RoI pooling are the best among all settings.

5.4. Model Complexity and Runtime on PASCAL VOC

Model Complexity and Runtime

Deformable ConvNets add only a small overhead in model parameters and computation. The significant performance improvement comes from the capability of modeling geometric transformations, not from the increased model parameters.

5.5. Object Detection on COCO

Using Deformable ConvNets consistently outperforms the plain counterparts. With Aligned-Inception-ResNet as the backbone, R-FCN with Deformable ConvNet, plus multi-scale testing and iterative bounding box averaging, obtains 37.5% mAP@[0.5:0.95].

6. More Results on COCO Detection Challenge Using Aligned Xception

The above results are from the paper. The authors also presented new results at the ICCV 2017 conference.

6.1. Aligned Xception

The updates of Aligned Xception over the original Xception are shown in blue.

To be brief, some of the max pooling operations are replaced by separable conv in the entry flow.

The number of repeated blocks in the middle flow is increased from 8 to 16.

One more conv is added in the exit flow.

6.2. COCO Detection Challenge

Object Detection on COCO test-dev

ResNet-101 as feature extractor and FPN+OHEM as object detector: 40.5% mAP is obtained, already higher than the results mentioned in the previous section.

Replacing ResNet-101 with Aligned Xception: 43.3% mAP.

With an ensemble of 6 models plus other small enhancements: 50.7% mAP.

On the COCO 2017 detection challenge leaderboard, 50.4% mAP, making it the 2nd Runner Up in the challenge.

On the COCO 2017 segmentation challenge leaderboard, 42.6% mAP, making it the 3rd Runner Up in the challenge.

The leaderboard: http://cocodataset.org/#detection-leaderboard

Reference
[2017 ICCV] [DCN] Deformable Convolutional Networks

My Previous Reviews

Image Classification
[LeNet] [AlexNet] [ZFNet] [VGGNet] [SPPNet] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet]

Object Detection
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [DeepID-Net] [R-FCN] [ION] [MultiPathNet] [NoC] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [FPN] [RetinaNet]

Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [ParseNet] [DilatedNet] [PSPNet] [DeepLabv3]

Biomedical Image Segmentation
[CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet]

Instance Segmentation
[DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution
[SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net]
