Review: DeepMask (Instance Segmentation)

DeepMask is a CNN approach for instance segmentation. To place it among the related recognition tasks:

- Image Classification: classify the main object category within an image.
- Object Detection: identify the object category and locate the position of every known object within an image using a bounding box.
- Semantic Segmentation: identify the object category of each pixel for every known object within an image. Labels are class-aware.
- Instance Segmentation: identify each object instance of each pixel for every known object within an image. Labels are instance-aware.

Some Differences from Semantic Segmentation
- More understanding of the individual instances.
- Reasoning about occlusion.
- Essential to tasks such as counting the number of objects.

Some Differences from Object Detection
- A bounding box is a very coarse object boundary; many pixels irrelevant to the detected object are also included inside it.
- Non-Maximum Suppression (NMS) will suppress occluded or slanted objects.

Thus, instance segmentation is one level up in difficulty!

DeepMask is a 2015 NIPS paper with more than 300 citations. Though published in 2015, it is one of the earliest papers to use a CNN for instance segmentation, and it is worth studying to trace the development of deep-learning-based instance segmentation. Since a region proposal is generated from the predicted segmentation mask, the object detection task can also be performed.

What Are Covered
1. Model Architecture
2. Joint Learning
3. Full Scene Inference
4. Results

1. Model Architecture

Left Bottom: Positive Samples
A label yk = +1 is given for the k-th positive sample. To be a positive sample, two criteria need to be satisfied:
- The patch contains an object roughly centered in the input patch.
- The object is fully contained in the patch and within a given scale range.

When yk = +1, the ground-truth mask mk has positive values for the pixels that belong to the single object located at the centre of the image patch.

Right Bottom: Negative Samples
Otherwise, a label yk = -1 is given for a negative sample, even if an object is partially present. When yk = -1, the mask is not used.

Top, Model Architecture: Main Branch
Given the input image patch x, features are extracted by VGGNet. The fully connected (FC) layers of VGGNet are removed. The last max-pooling layer in VGGNet is also removed, so the output before splitting into two paths is 1/16 the size of the input. For example, with a 224×224 input (3 is the number of channels in the input image, i.e. RGB), the output at the end of the main branch is (224/16)×(224/16) = 14×14. (512 is the number of feature maps after convolution.)

There are two paths after VGGNet:
- The first path predicts the class-agnostic segmentation mask, i.e. fsegm(x).
- The second path assigns a score corresponding to how likely the patch is to contain an object, i.e. fscore(x).
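The trunk-plus-two-heads design above can be sketched in PyTorch. This is a minimal illustration, not the paper's exact configuration: the VGG16-style trunk drops the FC layers and the fifth max-pool (giving the stride-16, 14×14×512 output described above), while the channel counts and hidden sizes of the two heads are simplified placeholders.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    """n_convs 3x3 conv+ReLU layers followed by a 2x2 max-pool."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return layers

class DeepMaskSketch(nn.Module):
    """Illustrative DeepMask-style trunk with mask and score heads.

    VGG16-style trunk with the FC layers and the last max-pool removed:
    four pooling stages remain, so a 224x224x3 patch yields a 14x14x512
    feature map (stride 16). Head sizes are simplified assumptions.
    """
    def __init__(self, mask_size=56):
        super().__init__()
        trunk = (vgg_block(3, 64, 2) + vgg_block(64, 128, 2)
                 + vgg_block(128, 256, 3) + vgg_block(256, 512, 3)
                 + vgg_block(512, 512, 3)[:-1])  # drop the 5th max-pool
        self.trunk = nn.Sequential(*trunk)
        # First path: class-agnostic segmentation mask, fsegm(x).
        self.segm_head = nn.Sequential(
            nn.Conv2d(512, 128, kernel_size=1), nn.ReLU(inplace=True),
            nn.Flatten(),
            nn.Linear(128 * 14 * 14, 512), nn.ReLU(inplace=True),
            nn.Linear(512, mask_size * mask_size),
        )
        # Second path: scalar objectness score, fscore(x).
        self.score_head = nn.Sequential(
            nn.MaxPool2d(2),  # 14x14 -> 7x7
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 1),
        )
        self.mask_size = mask_size

    def forward(self, x):
        f = self.trunk(x)  # (B, 512, 14, 14) for a 224x224 input
        mask = self.segm_head(f).view(-1, self.mask_size, self.mask_size)
        score = self.score_head(f).squeeze(1)
        return mask, score

model = DeepMaskSketch()
with torch.no_grad():
    mask, score = model(torch.randn(1, 3, 224, 224))
# mask: (1, 56, 56), score: (1,)
```

Note how both heads share the same 14×14×512 feature map from the trunk, which is what makes joint learning of the mask and the score possible later on.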
