PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection

A paper summary by Arunava, Feb 9

Fig 2. PVANET entire model visualization

A summary of the paper "PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection" by Kye-Hyeon Kim, Sanghoon Hong, Byungseok Roh, Yeongjae Cheon, and Minje Park.

Overview

This paper presents a lightweight feature extraction network architecture for object detection, named PVANET, which achieves real-time object detection performance without losing accuracy.

Computational cost: 7.9 GMAC for feature extraction with a 1065x640 input
Runtime performance: 750 ms/image (1.3 FPS) on an Intel i7 and 46 ms/image (21.7 FPS) on an NVIDIA Titan X GPU
Accuracy: 83.8% mAP on VOC-2007; 82.5% mAP on VOC-2012

The key design principle is “less channels with more layers”.
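To see why trading channels for layers can pay off, it helps to count multiply-accumulates (MACs) for a single convolution layer. The sketch below uses the standard cost formula for a stride-1 convolution; the layer sizes are hypothetical and are not taken from PVANET's actual configuration.

```python
def conv_macs(height, width, in_channels, out_channels, kernel):
    """Multiply-accumulates for one stride-1 convolution that keeps the map size."""
    return height * width * in_channels * out_channels * kernel * kernel


# Hypothetical 100x100 feature map with 3x3 kernels.
wide = conv_macs(100, 100, 64, 64, 3)    # one layer with 64 channels in and out
narrow = conv_macs(100, 100, 32, 32, 3)  # one layer with 32 channels in and out

print(wide // narrow)  # 4
```

Halving both the input and output channel counts cuts the cost of a layer by a factor of four, so roughly four narrow layers fit in the budget of one wide layer.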

Additionally, the network adopts some other building blocks:

Concatenated Rectified Linear Unit (C.ReLU) is applied to the early stage of the CNN to reduce the number of computations by half without losing accuracy.
Inception is applied to the remainder of the feature generation sub-network.
A multi-scale representation combines several intermediate outputs so that multiple levels of detail and non-linearity can be considered simultaneously.

Methods

Fig 2. Model architecture

Concatenated Rectified Linear Unit

Fig 3. Concatenated Rectified Linear Unit (C.ReLU)

C.ReLU is motivated by the observation that in the early stage, output nodes tend to be paired such that one node’s activation is the opposite of another’s.

C.ReLU halves the number of output channels of the convolution and then restores the original count by concatenating the same outputs with their negation, which leads to a 2x speed-up of the early stage.
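A minimal sketch of the concatenate-with-negation idea, written in plain Python for the channel activations at a single spatial position. The function name `crelu` and the sample values are illustrative assumptions, and any further steps a real implementation applies after the concatenation are left out.

```python
def crelu(activations):
    """Apply C.ReLU to the channel activations at one spatial position.

    The input holds the (halved) convolution outputs. The result has twice as
    many channels: ReLU(x) followed by ReLU(-x).
    """
    negated = [-value for value in activations]
    return [max(value, 0.0) for value in activations + negated]


conv_outputs = [1.5, -2.0]      # two channels computed by the convolution
doubled = crelu(conv_outputs)   # four channels after C.ReLU
print(doubled)  # [1.5, 0.0, 0.0, 2.0]
```

The convolution only has to compute half of the channels; the other half comes from a cheap sign flip, which is where the saving in the early stage comes from.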


The Inception Module

Inception can be one of the most cost-effective building blocks for capturing both small and large objects.

They replace the 5×5 convolution in a common Inception block with two sequential 3×3 convolutions.
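The swap of one 5×5 convolution for two 3×3 convolutions can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions and the same channel width in every layer; the helper names are made up for illustration.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for kernel in kernel_sizes:
        field += kernel - 1
    return field


def weights_per_channel_pair(kernel_sizes):
    """Kernel weights per input/output channel pair, assuming equal widths throughout."""
    return sum(kernel * kernel for kernel in kernel_sizes)


single_5x5 = [5]
two_3x3 = [3, 3]

print(stacked_receptive_field(single_5x5), stacked_receptive_field(two_3x3))    # 5 5
print(weights_per_channel_pair(single_5x5), weights_per_channel_pair(two_3x3))  # 25 18
```

Both options see the same 5×5 window, but the stacked version uses fewer weights and adds one more non-linearity between the two layers.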

HyperNet

Multi-scale representation and its combination have proven effective in many deep learning tasks.

Combining fine-grained details with highly abstracted information in the feature extraction layer helps the subsequent region proposal network and classification network detect objects of different scales.

They combine:
1) The last layer
2) Two intermediate layers whose scales are 2x and 4x that of the last layer, respectively.
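Feature maps at three different scales have to be brought to a common size before they can be combined. This summary does not say how that alignment is done, so the sketch below makes its own assumptions: the largest map is shrunk with 2x2 max pooling, the smallest is enlarged by nearest-neighbour repetition, and the three aligned maps are stacked as channels at the middle scale. Each map is a single-channel nested list with even dimensions, and all names are hypothetical.

```python
def max_pool_2x2(fmap):
    """Halve height and width by taking the max of each 2x2 block."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]), 2)]
        for r in range(0, len(fmap), 2)
    ]


def upsample_2x(fmap):
    """Double height and width by repeating each value (nearest neighbour)."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def combine_multiscale(fine, middle, coarse):
    """Align three single-channel maps to the middle scale and stack them as channels."""
    aligned = [max_pool_2x2(fine), middle, upsample_2x(coarse)]
    shapes = {(len(m), len(m[0])) for m in aligned}
    if len(shapes) != 1:
        raise ValueError("maps do not align to a common scale")
    return aligned


fine = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x the last layer's scale
middle = [[1, 2], [3, 4]]                                 # 2x the last layer's scale
coarse = [[9]]                                            # the last layer
features = combine_multiscale(fine, middle, coarse)
```

Here `features` holds three 2x2 channels, one per source layer, so later stages see fine detail and abstract context at every position.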

Deep Network Training

They adopt residual structures for better training.

They also add residual connections to the Inception layers to stabilize the later part of the deep network.

Batch Normalization layers are added before all ReLU activation layers.

The learning rate policy is based on plateau detection: they track a moving average of the loss, and when its improvement over a period of iterations falls below a certain threshold, they decrease the learning rate by a constant factor.
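One way to picture a plateau-based schedule is the sketch below. It assumes that "plateau" means the moving average of the loss has stopped improving by at least a small margin. The class name, the window length, and the margin are hypothetical; the starting rate of 0.1 and the factor 1/sqrt(10) are the values reported in the Results section of this summary.

```python
import math
from collections import deque


class PlateauSchedule:
    """Cut the learning rate when the moving-average loss stops improving."""

    def __init__(self, lr=0.1, factor=1.0 / math.sqrt(10.0),
                 window=3, min_improvement=0.01):
        self.lr = lr
        self.factor = factor
        self.min_improvement = min_improvement  # hypothetical margin
        self.recent = deque(maxlen=window)      # hypothetical window length
        self.best_average = None

    def step(self, loss):
        """Record one loss value and return the learning rate to use next."""
        self.recent.append(loss)
        if len(self.recent) < self.recent.maxlen:
            return self.lr  # not enough history yet
        average = sum(self.recent) / len(self.recent)
        if (self.best_average is None
                or self.best_average - average >= self.min_improvement):
            self.best_average = average  # still improving
        else:
            self.lr *= self.factor       # plateau: cut the learning rate
            self.best_average = average
            self.recent.clear()          # collect a fresh window at the new rate
        return self.lr


schedule = PlateauSchedule()
rates = [schedule.step(loss) for loss in [1.0, 0.9, 0.8, 0.8, 0.8, 0.8]]
```

In this toy run the rate stays at 0.1 while the averaged loss keeps falling and is cut once, on the last step, when the average stops moving.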

Faster R-CNN with PVANET

Three intermediate outputs from conv3_4, conv4_4 and conv5_4 are combined into 512-channel multi-scale output features, which are fed into the Faster R-CNN modules.

Results

PVANET was pre-trained with ILSVRC2012 training images for 1000-class image classification.

All images were resized to 256×256, and 192×192 patches were randomly cropped and used as the network input.

The learning rate was initially set to 0.1 and then decreased by a factor of 1/sqrt(10) ≈ 0.3162 whenever a plateau was detected.

Pre-training terminated once the learning rate dropped below 1e-4 (which usually requires about 2M iterations).

PVANET was then trained with the union of MS-COCO trainval, VOC2007 trainval, and VOC2012 trainval.

Fine-tuning with VOC2007 trainval and VOC2012 trainval was also required afterwards, since the class definitions of MS-COCO and VOC are slightly different.

Training images were resized randomly such that the shorter edge of an image was between 416 and 864.

For PASCAL VOC evaluations, each input image was resized such that its shorter edge was 640.
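The shorter-edge resizing described above can be sketched as follows. The aspect-ratio-preserving scale, the rounding to whole pixels, and the uniform integer sampling of the training edge are assumptions; the summary only gives the 416 to 864 range and the evaluation edge of 640.

```python
import random


def resize_to_shorter_edge(width, height, target):
    """Scale (width, height) so the shorter edge equals target, keeping the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def random_training_size(width, height, rng, low=416, high=864):
    """Pick a random shorter-edge length in [low, high] and resize to it."""
    return resize_to_shorter_edge(width, height, rng.randint(low, high))


rng = random.Random(0)  # seeded only to make the sketch repeatable
train_size = random_training_size(1000, 500, rng)
eval_size = resize_to_shorter_edge(1000, 500, 640)
print(eval_size)  # (1280, 640)
```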

All parameters related to Faster R-CNN were set as in the original work except for the number of proposal boxes before non-maximum suppression (NMS) (=12000) and the NMS threshold (=0.4).

All evaluations were done on an Intel i7 with a single core and an NVIDIA Titan X GPU.
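For readers unfamiliar with the two non-default proposal settings mentioned above, here is a sketch of plain greedy NMS with a pre-NMS cap of 12000 boxes and an IoU threshold of 0.4. It is a generic textbook version, not the code used in the paper, and the function names and sample boxes are made up.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - intersection)
    return intersection / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.4, pre_nms_top_n=12000):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order[:pre_nms_top_n]:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, which is above 0.4, so it is suppressed; the third box is far away and survives.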

Fig 5. Performance with VOC2007

Fig 6. Performance with VOC2012

PVANET+ achieved 2nd place on the PASCAL VOC 2012 Challenge. First place went to Faster R-CNN + ResNet-101, which is much heavier than PVANET.



References

K.-H. Kim, S. Hong, B. Roh, Y. Cheon, and M. Park. PVANET: Deep but lightweight neural networks for real-time object detection. arXiv preprint arXiv:1608.08021, 2016.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Tao Kong, Anbang Yao, Yurong Chen, and Fuchun Sun. HyperNet: Towards accurate region proposal generation and joint object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Thanks for reading! Do read the paper.

I will update this post if I find more interesting insights.

