Review: SharpMask — 1st Runner Up in COCO Segmentation (Instance Segmentation)

By concatenating the feature maps from the top-down pass with the feature maps from the bottom-up pass, the performance can be boosted further.

Object Detection: Identify the object category and locate the position using a bounding box for every known object within an image.
Semantic Segmentation: Identify the object category of each pixel for every known object within an image. Labels are class-aware.
Instance Segmentation: Identify each object instance of each pixel for every known object within an image. Labels are instance-aware.

Object Detection (Left), Semantic Segmentation (Middle), Instance Segmentation (Right)

SharpMask obtained 2nd place in the MS COCO Segmentation challenge and 2nd place in the MS COCO Detection challenge. It was published in 2016 ECCV, with over 200 citations. (SH Tsang)

Average recall on MS COCO improves by 10–20%.
By optimizing the architecture, speed is improved by 50% compared with DeepMask.
By using additional image scales, small object recall is improved by about 2 times.
By applying SharpMask to Fast R-CNN, object detection results are also improved.

What Are Covered
1. Encoder Decoder Architecture
2. Some Details
3. Architecture Optimization
4. Results

1. Encoder Decoder Architecture

Architectures for Instance Segmentation

(a) The Conventional Feedforward Network
The network contains a series of convolutional layers interleaved with pooling stages that reduce the spatial dimensions of the feature maps, followed by a fully connected layer to generate the object mask. Hence, each pixel prediction is based on a complete view of the object; however, its input feature resolution is low due to the multiple pooling stages.
This network architecture is similar to the DeepMask approach. DeepMask masks only coarsely align with the object boundaries, whereas SharpMask produces sharper, pixel-accurate object masks.

(b) Multiscale Network
This architecture is equivalent to making independent predictions from each network layer, then upsampling and averaging the results.
This network architecture is similar to the FCN and CUMedVision1 approaches (note: they are not for instance segmentation).

(c) Encoder Decoder Network & (d) Refinement Module
After a series of convolutions in the bottom-up pass (left side of the network), the feature maps are very small. These feature maps are 3×3 convolved and gradually upsampled in the top-down pass (right side of the network) using 2× bilinear interpolation. In addition, the corresponding same-size feature maps F from the bottom-up pass are concatenated with the mask-encoding feature maps M in the top-down pass before upsampling. Before each concatenation, a 3×3 convolution is also performed on F to reduce the number of feature maps, since direct concatenation is computationally expensive. Such skip concatenations are used in many other deep learning approaches as well, such as the famous U-Net.
The authors also refactored the refinement module, which leads to a more efficient implementation:

(a) Original (b) Refactored but equivalent model that leads to a more efficient implementation
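To make the refinement module concrete, here is a minimal PyTorch sketch of a single top-down refinement step in the concatenation form described above (not the authors' refactored implementation); the channel counts, layer names, and feature sizes are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RefinementModule(nn.Module):
    """One top-down refinement step (a sketch, not the authors' exact code)."""

    def __init__(self, f_channels, m_channels, reduced=64):
        super().__init__()
        # 3x3 convolution on the bottom-up skip features F to reduce their
        # channel count, since direct concatenation is computationally expensive.
        self.skip_conv = nn.Conv2d(f_channels, reduced, kernel_size=3, padding=1)
        # 3x3 convolution on the concatenation of the reduced F and the current
        # mask encoding M, producing the next mask encoding.
        self.merge_conv = nn.Conv2d(reduced + m_channels, m_channels,
                                    kernel_size=3, padding=1)

    def forward(self, f, m):
        s = torch.relu(self.skip_conv(f))                       # reduced skip features
        m = torch.relu(self.merge_conv(torch.cat([s, m], dim=1)))
        # 2x bilinear upsampling toward the input resolution
        return F.interpolate(m, scale_factor=2, mode="bilinear", align_corners=False)


# Example with assumed sizes: 256-channel bottom-up features refine an
# 80-channel mask encoding at 14x14 and return it at 28x28.
refine = RefinementModule(f_channels=256, m_channels=80)
f = torch.randn(1, 256, 14, 14)
m = torch.randn(1, 80, 14, 14)
out = refine(f, m)   # shape: (1, 80, 28, 28)
```

Stacking one such module per pooling stage of the trunk, each consuming the previous mask encoding M together with the matching bottom-up features F, gradually recovers the input resolution.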
2. Some Details

An ImageNet-pretrained 50-layer ResNet is used.

Two-Stage Training
First, the model is trained to jointly infer a coarse pixel-wise segmentation mask and an object score using the feedforward path. Second, the feedforward path is 'frozen' and the refinement modules are trained. This gives faster convergence. We can obtain a coarse mask using the forward path only, or a sharp mask using both the bottom-up and top-down paths. The gain from fine-tuning the whole network is minimal once the forward branch has converged.

During Full-Image Inference
Only the most promising locations are refined: the top N scoring proposal windows.
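As a rough illustration of this selection step (the value of N, the tensor shapes, and the refine_top_down callback below are assumptions made for the sketch, not details from the paper):

```python
import torch

def refine_most_promising(scores, refine_top_down, n=100):
    """Run the expensive top-down refinement only on the top-N scoring windows.

    scores:          (num_windows,) objectness scores from the feedforward head
    refine_top_down: callable mapping window indices -> sharp masks (assumed)
    n:               number of proposal windows to refine (assumed value)
    """
    n = min(n, scores.numel())
    top_scores, top_idx = torch.topk(scores, n)   # most promising locations
    sharp_masks = refine_top_down(top_idx)        # refine only these windows
    return top_idx, top_scores, sharp_masks


# Example with dummy data: 1000 candidate windows, refine the best 100.
scores = torch.rand(1000)
idx, s, masks = refine_most_promising(scores, lambda i: torch.zeros(i.numel(), 28, 28))
```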
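Going back to the two-stage training schedule described above, the freezing step maps directly onto disabling gradients for the feedforward parameters. Here is a minimal sketch using a toy stand-in model; the module names, layer sizes, and optimizer settings are assumptions for illustration, not the authors' choices.

```python
import torch
import torch.nn as nn

# Toy stand-in for a SharpMask-style network: a feedforward trunk plus
# separate refinement modules (names and sizes are illustrative only).
class ToySharpMask(nn.Module):
    def __init__(self):
        super().__init__()
        self.feedforward = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                         nn.Conv2d(8, 8, 3, padding=1))
        self.refinement = nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
                                        nn.Conv2d(8, 1, 3, padding=1))

    def forward(self, x):
        return self.refinement(self.feedforward(x))


model = ToySharpMask()

# Stage 1: train the feedforward path for the coarse mask and the object score.
stage1_opt = torch.optim.SGD(model.feedforward.parameters(), lr=1e-3, momentum=0.9)
# ... run the coarse-mask / score training loop until the forward branch converges ...

# Stage 2: freeze the feedforward path and train only the refinement modules.
for p in model.feedforward.parameters():
    p.requires_grad = False
stage2_opt = torch.optim.SGD(model.refinement.parameters(), lr=1e-3, momentum=0.9)
# ... run the sharp-mask training loop through the top-down path ...
```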

3. Architecture Optimization

It is necessary to reduce the complexity of the network. It is found that DeepMask spends 40% of its time on feature extraction, 40% on mask prediction, and 20% on score prediction.

3.1. More details