Object Detection using Google AI Open Images

For the purpose of object detection, the YOLO algorithm divides the input image into a 19×19 grid, with 5 different anchor boxes per grid cell. It then tries to detect classes within each grid cell and assigns each detected object to one of the 5 anchor boxes. The anchor boxes differ in shape and are intended to capture differently shaped objects within a grid cell.

For each anchor box, YOLO outputs a vector of the form [pc, bx, by, bh, bw, c1, …, c43], where pc is the probability that an object is present, (bx, by, bh, bw) describe the bounding box, and c1 through c43 are the class probabilities. Given that we had to train the algorithm for 43 classes, each anchor box vector has 5 + 43 = 48 entries, so the output dimensions are 19 × 19 × 5 × 48. These outputs give us the probability of observing an object in each anchor box, along with the probability of each class for that object.

To filter out anchor boxes that don't contain any class, or that capture the same object as another box, we use two thresholds: a confidence threshold to discard boxes that don't contain any class with high confidence, and an IoU threshold to discard boxes that capture the same object as another, higher-confidence box. (A sketch of this filtering step appears at the end of this post.)

[Image: Last few layers of the YOLO v2 architecture (only for illustration purposes)]

Transfer Learning

Transfer learning is the idea of taking a neural network that has already been trained to classify images and adapting it to our specific purpose. This saves computation time, since we don't need to train most of the weights ourselves. For instance, the YOLO v2 model we used has about 50 million weights; training them from scratch would easily have taken 4-5 days on the Google Cloud instance we were using.

To successfully implement transfer learning, we had to make a few updates to our model:

- Input image size: the model we downloaded used input images of size 416×416. Some of the objects we were training for, such as birds and footwear, were very small, and we didn't want to squish the input image that much, so we used input images of size 608×608 instead.
- Grid size: we changed the grid dimensions so that the model divides the image into 19×19 grid cells instead of the 13×13 default of the model we downloaded.
- Output layer: since we were training on 43 classes rather than the 80 the original model was trained on, the output layer was changed to produce the matrix dimensions discussed above.

We also re-initialized the weights of YOLO's last convolution layer before training it on our data set, which eventually helped us identify our unique classes. (A reconstruction of this re-initialization step is sketched at the end of this post.)

Cost Function

In any object detection problem, we want to identify the right object at the right place in an image, with high confidence. There are 3 major components to the cost function:

- Classification loss: the squared error of the class conditional probabilities when an object is detected. The loss function therefore penalizes classification error only if an object is present in a grid cell.
- Localization loss: the squared error between the predicted bounding box location and size and the ground-truth box, for the boxes responsible for detecting the object. The loss from bounding box coordinate predictions is weighted by a regularization parameter (λcoord). Further, to make sure that small deviations in large boxes matter less than in small boxes, the algorithm uses the square root of the bounding box width and height.
- Confidence loss: the squared error of the bounding box's confidence score.
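To make these three components concrete, here is a minimal NumPy sketch of a YOLO-style loss. The tensor layout ([confidence, x, y, w, h, class probabilities]) matches the output vector described earlier, and the λ weights are the defaults from the YOLO paper; this is an illustration of the structure, not the project's exact training code.

```python
import numpy as np

LAMBDA_COORD = 5.0   # weight on localization loss (default from the YOLO paper)
LAMBDA_NOOBJ = 0.5   # down-weights confidence loss for anchors with no object

def yolo_loss(y_true, y_pred):
    """Simplified YOLO-style loss (sketch).

    y_true, y_pred: arrays of shape (cells, anchors, 5 + num_classes),
    last axis laid out as [confidence, x, y, w, h, class probs...].
    Widths and heights are assumed non-negative.
    """
    obj_mask = y_true[..., 0]        # 1 where an anchor is responsible for an object
    noobj_mask = 1.0 - obj_mask

    # Localization loss: squared error on x, y and on sqrt(w), sqrt(h), so that
    # small deviations in large boxes matter less than in small boxes.
    xy_err = np.sum((y_true[..., 1:3] - y_pred[..., 1:3]) ** 2, axis=-1)
    wh_err = np.sum((np.sqrt(y_true[..., 3:5]) - np.sqrt(y_pred[..., 3:5])) ** 2, axis=-1)
    loc_loss = LAMBDA_COORD * np.sum(obj_mask * (xy_err + wh_err))

    # Confidence loss: squared error of the box confidence score,
    # penalized differently for object / no-object anchors.
    conf_err = (y_true[..., 0] - y_pred[..., 0]) ** 2
    conf_loss = np.sum(obj_mask * conf_err) + LAMBDA_NOOBJ * np.sum(noobj_mask * conf_err)

    # Classification loss: squared error of class probabilities,
    # counted only where an object is present.
    cls_err = np.sum((y_true[..., 5:] - y_pred[..., 5:]) ** 2, axis=-1)
    cls_loss = np.sum(obj_mask * cls_err)

    return loc_loss + conf_loss + cls_loss

# Example with the shapes from this post: 19*19 = 361 cells, 5 anchors, 43 classes.
y_true = np.zeros((361, 5, 48))
y_pred = np.random.rand(361, 5, 48)
print(yolo_loss(y_true, y_pred))
```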
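The confidence and IoU thresholding described earlier (commonly implemented as non-max suppression) can be sketched as follows. The function names and threshold values are illustrative assumptions, not the exact ones used in the project.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_boxes(boxes, scores, conf_threshold=0.6, iou_threshold=0.5):
    """Drop low-confidence boxes, then suppress overlapping duplicates."""
    keep = [i for i, s in enumerate(scores) if s >= conf_threshold]
    keep.sort(key=lambda i: scores[i], reverse=True)   # highest confidence first
    selected = []
    for i in keep:
        # Discard the box if it overlaps a stronger, already-selected box too much.
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in selected):
            selected.append(i)
    return selected

# Box 1 heavily overlaps the higher-confidence box 0, so it is suppressed.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(filter_boxes(boxes, scores))   # -> [0, 2]
```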
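Finally, here is one way the re-initialization of YOLO's last convolution layer could look in Keras. The weights file name, the layer indexing, and the layer name are assumptions for this sketch; the point is that the new detection head starts from fresh random weights sized for 5 anchors × (5 + 43) = 240 output channels per grid cell.

```python
from keras.models import Model, load_model
from keras.layers import Conv2D

NUM_ANCHORS, NUM_CLASSES = 5, 43

# Load the pre-trained YOLO v2 model; "yolov2.h5" is a hypothetical file name.
base = load_model("yolov2.h5")

# Drop the original 80-class detection head (assumed to be the final layer)
# and attach a freshly initialized 1x1 convolution whose output has
# 5 anchors * (5 box values + 43 class probabilities) = 240 channels per cell.
features = base.layers[-2].output
new_head = Conv2D(NUM_ANCHORS * (5 + NUM_CLASSES), (1, 1),
                  padding="same", activation="linear",
                  name="conv_final")(features)
model = Model(inputs=base.input, outputs=new_head)

model.summary()   # the new head's weights start from a fresh random initialization
```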
