Review: PSPNet — Winner in ILSVRC 2016 (Semantic Segmentation / Scene Parsing)

Pyramid Scene Parsing Network: Pyramid Pooling Module

SH Tsang, Dec 14

This time, PSPNet (Pyramid Scene Parsing Network), by CUHK and SenseTime, is reviewed.

Semantic Segmentation assigns a category label to each pixel, for known objects only.
Scene Parsing, which is based on Semantic Segmentation, assigns a category label to ALL pixels within the image.

Figure: Scene Parsing

By using the Pyramid Pooling Module, which aggregates context from different-sized regions, PSPNet surpasses state-of-the-art approaches such as FCN, DeepLab, and DilatedNet.

And finally, PSPNet:
Won the ImageNet Scene Parsing Challenge 2016
Took 1st place on the PASCAL VOC 2012 and Cityscapes datasets at that time

It was published in CVPR 2017 and has more than 600 citations. (SH Tsang @ Medium)

What Is Covered
1. The Need for Global Information
2. Pyramid Pooling Module
3. Some Details
4. Ablation Study
5. Comparison With State-of-the-art Approaches

1. The Need for Global Information

Mismatched Relationship: FCN predicts the boat in the yellow box as a “car” based on its appearance. But common knowledge says that a car is seldom over a river.

Confusion Categories: FCN predicts the object in the box as partly skyscraper and partly building. Such mixed results should be ruled out, so that the whole object is either skyscraper or building, but not both.

Inconspicuous Classes: The pillow looks similar to the sheet. Overlooking the global scene category may cause the parser to miss the pillow.

Hence, we need some global information about the image.

2. Pyramid Pooling Module

(a) and (b)

At (a), we have an input image. At (b), ResNet is used with the dilated network strategy (DeepLab / DilatedNet) to extract features. The dilated convolution follows DeepLab. The feature map size is 1/8 of the input image here.

(c).1. Sub-Region Average Pooling

At (c), sub-region average pooling is performed for each feature map.

Red: This is the coarsest level, which performs global average pooling over each feature map to generate a single bin output.
Orange: This is the second level, which divides the feature map into 2×2 sub-regions, then performs average pooling for each sub-region.
Blue: This is the third level, which divides the feature map into 3×3 sub-regions, then performs average pooling for each sub-region.
Green: This is the finest level, which divides the feature map into 6×6 sub-regions, then performs average pooling for each sub-region.
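
The four pooling levels described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: it pools a single feature map (one channel) stored as a list of rows, and the function names `average_pool_to_bins` and `pyramid_pool` are hypothetical. The way sub-region borders are chosen (floor for the start, ceiling for the end) is an assumption borrowed from the common adaptive-pooling convention; the review does not say how the borders are computed.

```python
# Bin counts per pyramid level, as listed in the review: 1x1, 2x2, 3x3, 6x6.
PYRAMID_BINS = (1, 2, 3, 6)


def average_pool_to_bins(feature_map, bins):
    """Average-pool one 2D feature map (list of rows) into a bins x bins grid.

    Assumption: sub-region i spans [floor(i * size / bins),
    ceil((i + 1) * size / bins)) along each axis.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(bins):
        r0 = (i * height) // bins
        r1 = ((i + 1) * height + bins - 1) // bins  # integer ceiling
        row = []
        for j in range(bins):
            c0 = (j * width) // bins
            c1 = ((j + 1) * width + bins - 1) // bins  # integer ceiling
            values = [feature_map[r][c]
                      for r in range(r0, r1)
                      for c in range(c0, c1)]
            row.append(sum(values) / len(values))
        pooled.append(row)
    return pooled


def pyramid_pool(feature_map, levels=PYRAMID_BINS):
    """Return one pooled grid per pyramid level, keyed by bin count."""
    return {bins: average_pool_to_bins(feature_map, bins) for bins in levels}


# Toy 6x6 feature map holding the values 0..35 (made-up data).
toy_map = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
pyramid = pyramid_pool(toy_map)
for bins, pooled in pyramid.items():
    print(bins, "->", len(pooled), "x", len(pooled[0]), "bins")
```

On this toy map, the 1×1 level reduces to the global average of the whole map, while the 6×6 level keeps every cell because each sub-region covers exactly one value. In a real network the same pooling would be applied to every channel of the feature map independently.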
