Breaking down Mean Average Precision (mAP)

Is it just an average of the precision? This article addresses questions like this and calculates mAP for both object detection and information retrieval tasks.

It also explores why mAP is an appropriate and commonly used metric for these tasks.

Outline
1. Primer
2. Average Precision and mAP for Information Retrieval
3. Average Precision and mAP for Object Detection

1. Primer

Precision and recall are two commonly used metrics for judging the performance of a given classification model.

To understand mAP, we would need to first review precision and recall.

The precision of a given class in classification is the ratio of true positives (TP) to the total number of predicted positives, that is, TP plus false positives (FP):

$$\text{Precision} = \frac{TP}{TP + FP}$$

Similarly, the recall, a.k.a. true positive rate (TPR) or sensitivity, of a given class in classification is defined as the ratio of TP to the total number of ground truth positives, that is, TP plus false negatives (FN):

$$\text{Recall} = \frac{TP}{TP + FN}$$

Just by looking at the formulas, we could suspect that for a given classification model, there is a trade-off between its precision and recall performance.

If we are using a neural network, this trade-off can be adjusted through the threshold applied to the model's final-layer softmax output.

For our precision to be high, we would need to decrease our number of FP, for example by raising the threshold; doing so tends to increase the number of FN and hence decrease our recall.

Similarly, decreasing our number of FN would increase our recall but tend to decrease our precision.
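To make the trade-off concrete, here is a minimal sketch that computes precision and recall for one class at a few score thresholds. The scores, labels and the function name `precision_recall` are hypothetical and chosen only for illustration.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall for one class at a given score threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    # Convention (an assumption): precision is 1.0 when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores and ground-truth labels (1 = positive)
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30]
labels = [1, 1, 0, 1, 0, 1]

for t in (0.8, 0.5, 0.2):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold raises recall from 0.50 to 1.00 while precision drops from 1.00 to 0.67.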

Very often for information retrieval and object detection cases, we would want our precision to be high (our predicted positives to be TP).

Precision and recall are commonly used along with other metrics such as accuracy, F1-score, specificity (a.k.a. true negative rate, TNR), receiver operating characteristics (ROC), lift and gain.

However, these metrics fall short when it comes to determining if a model is performing well in information retrieval or object detection tasks, where the ranking of the results matters.

This is where mAP comes to the rescue!

It is important to note that the calculations of mAP for object detection and information retrieval tasks are subtly different.

The following sections demonstrate the calculations and discuss why they are done this way.

2. Average Precision and mAP for Information Retrieval

Calculating AP

A typical task in information retrieval is for a user to submit a query to a database and retrieve the information most similar to the query.

Let’s now perform a calculation for AP with an example with three ground truth positives (GTP).

Additional nomenclature: Ground truth positives are the labeled-as-positive data.

We shall define the following variables:
Q to be the user query
G to be a set of labeled data in the database
d(i,j) to be a score function to show how similar object i is to j
G' to be an ordered set of G according to score function d( , )
k to be the index of G'

[Figure: User querying G with an image Q]

After calculating d( , ) for each of the images with Q, we can sort G and get G'.

Recalling the definition of precision, we now use it to calculate the precision at each position k in G', from which the AP follows.

[Figure: Calculation of AP for a given query, Q, with GTP=3]

The overall AP for this query is 0.

One thing to note is that, since there are only three GTP, the AP@5 equals the overall AP.

A general AP@k formula can be written as such:

$$AP@k = \frac{1}{GTP} \sum_{i=1}^{k} \frac{\text{TP seen}_i}{i} \cdot rel_i$$

where GTP refers to the total number of ground truth positives for the query, TP seen_i refers to the number of true positives seen up to rank i, and rel_i is 1 if the item at rank i is a true positive and 0 otherwise, so that only ranks holding a true positive contribute to the sum.

For another query, Q, we could get a perfect AP of 1 if the returned G' is sorted as such:

[Figure: Calculation of a perfect AP for a given query, Q, with GTP=3]

What AP does, in this case, is to penalize models that are not able to sort G' with TPs leading the set.

It provides a number that quantifies the goodness of the sort based on the score function d( , ).

Dividing the sum of precisions by the total GTP, instead of by the length of G', gives a better representation for queries that only have a few GTP.
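The AP calculation described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation; the two rankings below are hypothetical lists of relevance flags (1 = true positive) and are not taken from the figures.

```python
def average_precision(relevance, gtp):
    """AP for one ranked list.

    relevance: 0/1 flags in ranked order (1 = true positive).
    gtp: total number of ground truth positives for the query.
    """
    tp_seen = 0
    precision_sum = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            tp_seen += 1
            precision_sum += tp_seen / k  # precision at rank k
    return precision_sum / gtp if gtp else 0.0


mixed = [1, 0, 0, 1, 1]    # hypothetical ranking: TPs at ranks 1, 4 and 5
perfect = [1, 1, 1, 0, 0]  # hypothetical ranking: all TPs lead the list

print(round(average_precision(mixed, gtp=3), 3))    # 0.7
print(round(average_precision(perfect, gtp=3), 3))  # 1.0
```

Only ranks that hold a true positive add to the sum, and the sum is divided by GTP rather than by the length of the list.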

Calculating mAP

For each query, Q, we can calculate a corresponding AP.

A user can make as many queries as they like against this labeled database.

The mAP is simply the mean of the APs over all the queries that the user made:

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

where N is the number of queries and AP_i is the AP of the i-th query.

Note: This is the same formula as Wikipedia's one, just written differently.
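As a rough sketch of the averaging step, the snippet below computes mAP over two hypothetical queries. The function names and the query data are assumptions made for illustration.

```python
def average_precision(relevance, gtp):
    """AP for one ranked list of 0/1 relevance flags."""
    tp_seen, precision_sum = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            tp_seen += 1
            precision_sum += tp_seen / k
    return precision_sum / gtp if gtp else 0.0


def mean_average_precision(queries):
    """queries: list of (relevance, gtp) pairs, one pair per query."""
    aps = [average_precision(rel, gtp) for rel, gtp in queries]
    return sum(aps) / len(aps)


# Two hypothetical queries against the same labeled database
queries = [
    ([1, 0, 0, 1, 1], 3),  # AP is about 0.7
    ([1, 1, 1, 0, 0], 3),  # AP is 1.0
]
print(round(mean_average_precision(queries), 3))  # 0.85
```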

3. Average Precision and mAP for Object Detection

Calculating AP (Traditional IoU = 0.5)

Intersection over Union (IoU)

To calculate AP for object detection, we first need to understand IoU.

The IoU is given by the ratio of the area of intersection and area of union of the predicted bounding box and ground truth bounding box.
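A minimal sketch of the IoU computation for axis-aligned boxes follows. The `(x_min, y_min, x_max, y_max)` box format is an assumption made here; datasets and libraries use differing conventions.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```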

The IoU is used to determine if a predicted bounding box (BB) is a TP, FP or FN.

The TN is not evaluated as each image is assumed to have an object in it.

Let us consider the following image:

[Figure: Image with a man and horse labeled with ground truth bounding boxes]

The image contains a person and a horse with their corresponding ground truth bounding boxes.

Let us ignore the horse for the moment.

We run our object detection model on this image and receive a predicted bounding box for the person.

Traditionally, we define a prediction to be a TP if the IoU is > 0.5.

The possible scenarios are described below:

True Positive (IoU > 0.5)

[Figure: IoU of predicted BB (yellow) and GT BB (blue) > 0.5 with the correct classification]

False Positive

There are two possible scenarios where a BB would be considered as FP:
IoU < 0.5
Duplicated BB

[Figure: Illustrating the different scenarios in which a predicted BB (yellow) would be considered as FP]

False Negative

When the predicted BB has an IoU > 0.5 with the ground truth but has the wrong classification, the ground truth object goes undetected for its class and is counted as an FN.

[Figure: FN BB as the predicted class is a horse instead of a person]

Precision/Recall Curve (PR Curve)

With the TP, FP and FN formally defined, we can now calculate the precision and recall of our detection for a given class across the test set.

Each BB has a confidence level, usually given by the model's softmax layer, which is used to rank the output.

Note that this is very similar to the information retrieval case, just that instead of having a similarity function d( , ) to provide the ranking, we used the model’s predicted BB’s confidence.
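The TP/FP/FN rules and the confidence ranking can be combined in a short sketch for one class in one image. This is a simplified greedy matching written for illustration, not the exact procedure of any benchmark; the function names, box format and example boxes are assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_detections(detections, gt_boxes, iou_threshold=0.5):
    """Label detections of one class as TP or FP, in descending confidence.

    detections: list of (confidence, box); gt_boxes: list of boxes.
    Returns (labels, number of ground truth boxes left unmatched, i.e. FN).
    """
    matched = set()
    labels = []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Find the ground truth box that overlaps this detection the most
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(gt_boxes):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_iou > iou_threshold and best_idx not in matched:
            matched.add(best_idx)
            labels.append("TP")
        else:
            labels.append("FP")  # IoU too low, or duplicate of a matched box
    return labels, len(gt_boxes) - len(matched)


gt_boxes = [(0, 0, 10, 10)]          # hypothetical ground truth for "person"
detections = [
    (0.9, (1, 1, 10, 10)),           # IoU 0.81 with the ground truth
    (0.8, (0, 0, 9, 9)),             # IoU 0.81 again: a duplicate
    (0.6, (20, 20, 30, 30)),         # no overlap
]
print(label_detections(detections, gt_boxes))  # (['TP', 'FP', 'FP'], 0)
```

Because the function is run per class, a ground truth box whose only overlapping prediction carries another class label stays unmatched and is counted as an FN.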

Interpolated precision

Before we plot the PR curve, we first need to know the interpolated precision introduced in [1].

The interpolated precision, p_interp, is calculated at each recall level, r, by taking the maximum precision measured at any recall level greater than or equal to r.

The formula is given as such:

$$p_{interp}(r) = \max_{\tilde{r} \geq r} p(\tilde{r})$$

where $p(\tilde{r})$ is the measured precision at recall $\tilde{r}$.

Their intention of interpolating the PR curve was to reduce the impact of “wiggles” caused by small variations in the ranking of detections.
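One simple way to compute the interpolated precision over a ranked list is a running maximum taken from the end of the list. The sketch below assumes the precisions are listed in ranked order, so recall never decreases along the list; the input values are hypothetical.

```python
def interpolate_precision(precisions):
    """Replace each precision with the maximum precision at that or any later rank.

    Recall never decreases down the ranked list, so "later rank" corresponds
    to "recall greater than or equal to the current recall".
    """
    interpolated = list(precisions)
    for i in range(len(interpolated) - 2, -1, -1):
        interpolated[i] = max(interpolated[i], interpolated[i + 1])
    return interpolated


# Hypothetical measured precisions for a ranking TP, FP, TP, FP, FP, TP, FP
precisions = [1.0, 0.5, 0.67, 0.5, 0.4, 0.5, 0.43]
print(interpolate_precision(precisions))
# [1.0, 0.67, 0.67, 0.5, 0.5, 0.5, 0.43]
```

The dips at ranks 2 and 5 are flattened out, which is the "wiggle" reduction mentioned above.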

With that out of the way, we can now start plotting the PR curve.

Consider an example for the person class with 3 TP and 4 FP.

We calculate the corresponding precision, recall, and interpolated precision given by the formulae defined above.

[Table: Calculation table for plotting PR curve with an example of 3 TP and 4 FP. Rows correspond to BB with person classification ordered by their respective softmax confidence]

[Figure: Precision-Recall Curve from the rankings given above]

The AP is then calculated by finding the area under the PR curve.

This is done by segmenting the recalls evenly into 11 parts: {0, 0.1, 0.2, ..., 0.9, 1} and averaging the interpolated precision at these recall levels.

We get the following:

[Figure: AP calculation for the above example]

For another example, I would like to refer you to this well-written article [2] by Jonathan Hui.

Calculating mAP

The mAP for object detection is the average of the AP calculated for all the classes.

It is also important to note that some papers use AP and mAP interchangeably.
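The 11-point interpolation described above, followed by the averaging over classes, can be sketched as follows. The TP/FP orderings and ground truth counts are hypothetical (the person ranking has 3 TP and 4 FP, but not necessarily in the order used in the table), and `eleven_point_ap` is a name chosen here.

```python
def eleven_point_ap(labels, num_gt):
    """11-point interpolated AP for one class.

    labels: "TP"/"FP" flags for detections sorted by descending confidence.
    num_gt: number of ground truth boxes of this class.
    """
    tp = 0
    points = []  # (recall, precision) after each detection
    for rank, label in enumerate(labels, start=1):
        if label == "TP":
            tp += 1
        points.append((tp / num_gt, tp / rank))

    total = 0.0
    for i in range(11):
        r = i / 10  # recall levels 0, 0.1, ..., 1
        # Interpolated precision: max precision at any recall >= r (0 if none)
        candidates = [p for rec, p in points if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11


# Hypothetical rankings per class
ap_per_class = {
    "person": eleven_point_ap(["TP", "FP", "TP", "FP", "FP", "TP", "FP"], num_gt=3),
    "horse": eleven_point_ap(["TP", "TP"], num_gt=2),
}
mean_ap = sum(ap_per_class.values()) / len(ap_per_class)

print(round(ap_per_class["person"], 3))  # 0.727
print(round(mean_ap, 3))                 # 0.864
```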

Calculating AP

With reference to [5], COCO provides six new methods of calculating AP.

The first three threshold the BB at different IoUs:
AP: AP at IoU = 0.50:0.05:0.95 (primary challenge metric)
AP@IoU=0.5 (traditional way of calculating as described above)
AP@IoU=0.75 (IoU of BBs needs to be > 0.75)

For the primary AP, 0.50:0.05:0.95 means starting from IoU = 0.5, with steps of 0.05, we increase to an IoU = 0.95.

This results in AP being computed at ten different IoU thresholds.

An average is done to provide a single number which rewards detectors that are better at localization.
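As a rough illustration of averaging over IoU thresholds, the sketch below recomputes a simple AP at each of the ten thresholds and takes the mean. It is a deliberate simplification and not the official COCO evaluation: it assumes each detection has already been paired with a distinct ground truth box, and it uses the non-interpolated AP from the information retrieval section. All names and numbers are hypothetical.

```python
def ap_at_threshold(detections, num_gt, iou_threshold):
    """Non-interpolated AP at one IoU threshold.

    detections: list of (confidence, iou_with_its_ground_truth) pairs,
    each assumed to be paired with a distinct ground truth box.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = 0
    precision_sum = 0.0
    for rank, (_, overlap) in enumerate(ranked, start=1):
        if overlap > iou_threshold:
            tp += 1
            precision_sum += tp / rank
    return precision_sum / num_gt


# IoU thresholds 0.50, 0.55, ..., 0.95, built from integers to avoid float drift
thresholds = [t / 100 for t in range(50, 100, 5)]

# Hypothetical detections: (confidence, IoU with the paired ground truth box)
detections = [(0.9, 0.92), (0.8, 0.72), (0.7, 0.30), (0.6, 0.57)]
num_gt = 3

aps = [ap_at_threshold(detections, num_gt, t) for t in thresholds]
averaged_ap = sum(aps) / len(aps)

print(round(aps[0], 3))       # AP at IoU 0.5: 0.917
print(round(averaged_ap, 3))  # mean over the ten thresholds: 0.517
```

The loosely localized boxes (IoU 0.57 and 0.72) stop counting as TP at the stricter thresholds, which is why the averaged value is well below the AP at IoU 0.5.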

The remaining three methods calculate AP across scales:
AP^small: AP for small objects: area < 32² px
AP^medium: AP for medium objects: 32² < area < 96² px
AP^large: AP for large objects: area > 96² px

This would allow better differentiation of models as some datasets have more small objects than others.
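The scale split can be expressed as a small helper. The function name is hypothetical, and placing areas of exactly 32² or 96² px in the medium bucket is an assumption, since the inequalities above leave the boundaries open.

```python
def size_bucket(area_px):
    """Assign an object to a scale bucket from its area in pixels."""
    if area_px < 32 ** 2:
        return "small"
    if area_px > 96 ** 2:
        return "large"
    return "medium"  # includes the boundary values by assumption


for width, height in [(20, 30), (50, 60), (200, 150)]:
    print(width * height, size_bucket(width * height))
```

AP is then computed separately over the ground truth objects that fall into each bucket.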

Special thanks to Raimi, Derek and Wai Kit for proofreading and giving me feedback on this article.

References:
[1] The PASCAL Visual Object Classes (VOC) Challenge, by Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn and Andrew Zisserman
https://medium.







