Review: SharpMask (Instance Segmentation)

Hence, each pixel prediction is based on a complete view of the object; however, its input feature resolution is low due to the multiple pooling stages. This network architecture is similar to the DeepMask approach. DeepMask's masks only coarsely align with the object boundaries, whereas SharpMask produces sharper, pixel-accurate object masks.

(b) Multiscale Network

This architecture is equivalent to making independent predictions from each network layer, then upsampling and averaging the results. It is similar to the FCN and CUMedVision1 approaches (note: these are not instance segmentation methods).

(c) Encoder-Decoder Network & (d) Refinement Module

After a series of convolutions in the bottom-up pass (left side of the network), the feature maps are very small. These feature maps are 3×3 convolved and gradually upsampled in the top-down pass (right side of the network) using 2× bilinear interpolation.

In addition, before each upsampling step, the same-size feature maps F from the bottom-up pass are concatenated with the mask-encoding feature maps M of the top-down pass. Before each concatenation, a 3×3 convolution is also applied to F to reduce its number of feature maps, since concatenating F directly would be computationally expensive. This kind of concatenation is used in many other deep learning approaches as well, such as the well-known U-Net.

The authors also refactored the refinement module into an equivalent form that leads to a more efficient implementation:

(a) Original (b) Refactored but equivalent model that leads to a more efficient implementation

2. Some Details

An ImageNet-pretrained 50-layer ResNet is used.

Two-stage Training

First, the model is trained to jointly infer a coarse pixel-wise segmentation mask and an object score using the feedforward path. Second, the feedforward path is 'frozen' and the refinement modules are trained.

This gives faster convergence. It also means we can obtain either a coarse mask using the feedforward path only, or a sharp mask using both the bottom-up and top-down paths. The gain from fine-tuning the whole network is minimal once the feedforward branch has converged.

During Full-image Inference

Only the most promising locations are refined: the top N scoring proposal windows are refined.

3. Architecture Optimization

The complexity of the network needs to be reduced. It is found that DeepMask spends 40% of its time on feature extraction, 40% on mask prediction, and 20% on score prediction.

3.1. Trunk Architecture

Input Size W: Reducing W decreases the stride density S, which further harms accuracy.
Pooling Layers P: More pooling layers P result in faster computation, but also in a loss of feature resolution.
Stride Density S: Doubling the stride while keeping W constant greatly reduces performance.
Depth D: When increasing D, it is observed that, in the context of instance segmentation, reducing spatial resolution hurts performance.
Feature Channels F: Using 1×1 convolutions to reduce F is shown to give large speedups.

Results for Various W, P, D, S, F

W160-P4-D39-F128 offers a good tradeoff between speed and accuracy. The top and bottom rows give the total time for DeepMask and SharpMask (i.e. W160-P4-D39-F128), respectively.

3.2. Head Architecture

The head architecture also accounts for a significant part of the model's complexity.

Various Head Architectures

(a): Original DeepMask head architecture for obtaining the mask and score.
(b) to (d): Various head designs with shared conv and fully connected layers for obtaining the mask and score.

Results for Various Head Architectures

Head C is chosen for its simplicity and speed.

3.3. Number of Feature Maps in Different Convolutions

(a) The number of feature maps is the same for all convolutions.
(b) The number of feature maps is reduced along the bottom-up path and increased back along the top-down path.

(b) has lower inference time and similar AUC (the average of AR at 10, 100, and 1000 proposals).

4. Results

4.1. MS COCO Segmentation

Results on MS COCO Segmentation

DeepMask-ours: DeepMask with the optimized trunk and head; it outperforms the original DeepMask.
SharpMask: outperforms previous state-of-the-art approaches.
SharpMaskZoom & SharpMaskZoom²: with one or two additional smaller scales, these achieve a large boost in AR for small objects.

4.2. Object Detection & Results in MS COCO Challenges 2015

Results on MS COCO

Top: By applying SharpMask on Fast R-CNN with VGGNet as the backbone for feature extraction, i.e. …
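The 2× bilinear interpolation used in the top-down pass of the refinement module (section (c)) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function name `upsample2x_bilinear` is hypothetical, and the half-pixel-centre sampling convention with clamped edges is an assumption, since the review only says "2× bilinear interpolation".

```python
import math

def upsample2x_bilinear(x):
    """2x bilinear upsampling of an H x W grid (list of rows).
    Assumption: half-pixel-centre sampling, edge coordinates clamped."""
    h, w = len(x), len(x[0])

    def src(coord, size):
        # Map an output index back to a fractional input coordinate.
        s = (coord + 0.5) / 2 - 0.5
        s = min(max(s, 0.0), size - 1)
        i0 = int(math.floor(s))
        i1 = min(i0 + 1, size - 1)
        return i0, i1, s - i0  # two neighbours and the interpolation weight

    out = []
    for oi in range(2 * h):
        y0, y1, fy = src(oi, h)
        row = []
        for oj in range(2 * w):
            x0, x1, fx = src(oj, w)
            top = x[y0][x0] * (1 - fx) + x[y0][x1] * fx
            bottom = x[y1][x0] * (1 - fx) + x[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

print(upsample2x_bilinear([[0.0, 4.0]]))
# [[0.0, 1.0, 3.0, 4.0], [0.0, 1.0, 3.0, 4.0]]
```

Each refinement step doubles both spatial dimensions, so a handful of such steps is enough to bring the very small bottom-up feature maps back towards input resolution.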
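The review does not detail how the refinement module was refactored into its more efficient but equivalent form. One equivalence that makes such a refactoring possible (an assumption here, based only on the linearity of convolution) is that a 3×3 convolution over the channel-wise concatenation of M and the skip features equals the sum of two separate 3×3 convolutions, one over M and one over the skip features, each using the matching slice of the filters. The sketch below checks this numerically with small integer tensors; `conv3x3`, `add_maps`, the channel counts, and the map size are all hypothetical.

```python
import random

def conv3x3(x, w):
    """Stride-1 3x3 convolution with zero padding ('same' output size).
    x: C_in channels, each an H x W nested list.
    w: C_out filters, each a list of C_in 3x3 kernels.
    Returns C_out channels of size H x W."""
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    out = []
    for filt in w:
        assert len(filt) == c_in, "filter depth must match input channels"
        ch = [[0] * wd for _ in range(h)]
        for i in range(h):
            for j in range(wd):
                acc = 0
                for c in range(c_in):
                    for di in (-1, 0, 1):
                        for dj in (-1, 0, 1):
                            ii, jj = i + di, j + dj
                            if 0 <= ii < h and 0 <= jj < wd:  # zero padding outside
                                acc += filt[c][di + 1][dj + 1] * x[c][ii][jj]
                ch[i][j] = acc
        out.append(ch)
    return out

def add_maps(a, b):
    """Element-wise sum of two equally shaped lists of channels."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]

def random_maps(c, h, w, rng):
    return [[[rng.randint(-3, 3) for _ in range(w)] for _ in range(h)] for _ in range(c)]

rng = random.Random(0)
k_m, k_s, k_out, H, W = 2, 3, 4, 5, 5  # hypothetical channel counts and map size
M = random_maps(k_m, H, W, rng)        # mask-encoding features (top-down pass)
S = random_maps(k_s, H, W, rng)        # skip features after the 3x3 conv on F
weights = [[[[rng.randint(-2, 2) for _ in range(3)] for _ in range(3)]
            for _ in range(k_m + k_s)] for _ in range(k_out)]

# Original form: concatenate along the channel axis, then one 3x3 convolution.
original = conv3x3(M + S, weights)

# Refactored form: split every filter by input channel, convolve each part, sum.
w_m = [f[:k_m] for f in weights]
w_s = [f[k_m:] for f in weights]
refactored = add_maps(conv3x3(M, w_m), conv3x3(S, w_s))

print(original == refactored)  # True
```

With the split form, the concatenated tensor never has to be materialised; whether this is exactly the authors' refactoring is not confirmed by the review.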
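The full-image inference rule described in Section 2 (refine only the top N scoring proposal windows) amounts to ranking proposals by object score before running the expensive top-down pass. In this sketch, the proposal tuple layout, `refine_top_n`, and the `refine` callback are hypothetical stand-ins, and the value of N is not given in the review.

```python
from typing import Callable, List, Tuple

Window = Tuple[int, int, int, int]  # hypothetical (x, y, w, h) box
# Hypothetical proposal layout: (object score, window, coarse mask).
Proposal = Tuple[float, Window, str]

def refine_top_n(proposals: List[Proposal], n: int,
                 refine: Callable[[Window, str], str]) -> List[Proposal]:
    """Refine only the n highest-scoring proposals and return them, best first."""
    top = sorted(proposals, key=lambda p: p[0], reverse=True)[:n]
    # The expensive refinement runs only on these n windows.
    return [(score, window, refine(window, coarse)) for score, window, coarse in top]

calls = []

def fake_refine(window: Window, coarse: str) -> str:
    calls.append(window)  # record which windows were refined
    return "sharp-" + coarse

props = [(0.2, (0, 0, 160, 160), "a"),
         (0.9, (16, 0, 160, 160), "b"),
         (0.5, (32, 0, 160, 160), "c")]
out = refine_top_n(props, 2, fake_refine)
print([(s, m) for s, _, m in out])  # [(0.9, 'sharp-b'), (0.5, 'sharp-c')]
print(len(calls))  # 2
```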
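The 40% / 40% / 20% time breakdown in Section 3 explains why both the trunk (Section 3.1) and the head (Section 3.2) are optimized: by Amdahl's law, speeding up one component alone is capped by the time spent in the others. The helper below is a hypothetical illustration that uses only the percentages from the review; the 2× speedup factor is made up for the example.

```python
def overall_speedup(fractions, speedups):
    """Amdahl's law.
    fractions: component -> share of total time (shares sum to 1).
    speedups: component -> factor by which that component gets faster."""
    new_time = sum(share / speedups.get(name, 1.0) for name, share in fractions.items())
    return 1.0 / new_time

deepmask_time = {"feature_extraction": 0.4, "mask_prediction": 0.4, "score_prediction": 0.2}

# Hypothetical: make only the trunk (feature extraction) 2x faster.
print(round(overall_speedup(deepmask_time, {"feature_extraction": 2.0}), 3))  # 1.25
# Upper bound if feature extraction took no time at all.
print(round(overall_speedup(deepmask_time, {"feature_extraction": float("inf")}), 3))  # 1.667
```

Even an infinitely fast trunk would leave 60% of the original time in the mask and score heads, so the head also has to be slimmed down.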
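The trunk configurations in Section 3.1 are named by their parameters, e.g. W160-P4-D39-F128. The sketch below parses such a name and estimates the trunk's output feature map size. It assumes (this is not stated in the review) that each of the P pooling stages halves the spatial resolution, so a W×W input yields a (W/2^P)×(W/2^P) map. `parse_config` and `feature_map_size` are hypothetical helpers.

```python
import re

def parse_config(name: str) -> dict:
    """Parse a trunk name such as 'W160-P4-D39-F128' into {'W': 160, 'P': 4, ...}."""
    return {key: int(value) for key, value in re.findall(r"([WPDSF])(\d+)", name)}

def feature_map_size(w: int, p: int) -> int:
    """Side length of the trunk's output map, assuming each pooling stage halves it."""
    return w // (2 ** p)

cfg = parse_config("W160-P4-D39-F128")
print(cfg)  # {'W': 160, 'P': 4, 'D': 39, 'F': 128}
print(feature_map_size(cfg["W"], cfg["P"]))      # 10
print(feature_map_size(cfg["W"], cfg["P"] + 1))  # 5
```

Under this assumption, each extra pooling layer quarters the number of spatial positions, which matches the review's point that more pooling is faster but loses feature resolution.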
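For the feature-channel parameter F in Section 3.1, a rough multiply-accumulate (MAC) count shows why a 1×1 convolution that reduces F can give large speedups: a k×k convolution on an H×W map costs H·W·C_in·C_out·k² MACs. The 512-channel input, the placement of the 1×1 convolution before a 3×3 convolution, and the 10×10 map size below are hypothetical illustration values, not figures from the paper.

```python
def conv_macs(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """Multiply-accumulates for a stride-1, 'same'-padded k x k convolution."""
    return h * w * c_in * c_out * k * k

H = W = 10  # hypothetical feature map size
# Direct 3x3 convolution keeping 512 channels.
direct = conv_macs(H, W, 512, 512, 3)
# 1x1 convolution down to F = 128 channels, then a 3x3 convolution at 128 channels.
reduced = conv_macs(H, W, 512, 128, 1) + conv_macs(H, W, 128, 128, 3)
print(direct, reduced, round(direct / reduced, 1))  # 235929600 21299200 11.1
```

Because the cost scales with the product C_in·C_out, cutting the channel count by 4× shrinks the 3×3 term by 16×, and the cheap 1×1 reduction adds comparatively little back.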
