We could do that, but the department may not be very happy….
As noted above, the department can only fix 5 wells per day. Assuming 252 working days (US schedule), that caps it at 5 × 252 = 1,260 water pumps in the next year.
This means we need to prioritize which pumps to service.
Part 2: Prioritizing Our Predictions

Given that we cannot service all of the water pumps our model predicts are in need of service, we should prioritize servicing the pumps that our model is most confident are actually broken.
Doing so is relatively trivial: we simply look at the ‘non functional’ probability our model assigned to each water pump in the dataset.
Ranking these probabilities allows us to identify the 1,260 water pumps we are most confident are broken.
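The ranking step can be sketched in a few lines of Python. This is a minimal illustration rather than the code from the essay's repository: the function name `top_k_by_probability` and the six example probabilities are made up for the sketch, and in practice k would be 1,260.

```python
def top_k_by_probability(probabilities, k):
    """Return the indices of the k largest probabilities, most confident first."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: probabilities[i],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 'non functional' probabilities for six pumps (not real data).
p_non_functional = [0.10, 0.92, 0.55, 0.98, 0.30, 0.71]

# With capacity for three repairs, service the three most confident predictions.
to_service = top_k_by_probability(p_non_functional, k=3)
print(to_service)  # [3, 1, 5]
```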
This approach shifts our focus from maximizing our model’s overall accuracy score to a metric known as precision at k.
Precision at k, the share of our k most confident predictions that turn out to be correct, is why our model was ‘good enough’ with 80% accuracy.
Given our limited resources, we really care only about the predictions our model makes for the water pumps we have capacity to repair.
Thus, our precision at k, in this case k=1,260, is substantially higher than our overall accuracy score.
Effort invested in increasing the model’s accuracy beyond 80% would have been misplaced.
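To make the metric concrete, here is a small sketch of how precision at k could be computed. The function name `precision_at_k` and the toy labels and scores are assumptions for illustration only; they are not taken from the essay's data.

```python
def precision_at_k(y_true, scores, k):
    """Fraction of the k highest-scored items whose true label is 1."""
    if k <= 0 or not scores:
        raise ValueError("k must be positive and scores must be non-empty")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = ranked[:k]
    hits = sum(y_true[i] for i in top_k)
    return hits / len(top_k)


# Toy example: 1 means the pump really is non functional.
y_true = [0, 1, 1, 1, 0, 0]
scores = [0.10, 0.92, 0.55, 0.98, 0.30, 0.71]

# Overall accuracy when every score above 0.5 is called 'non functional'.
predictions = [1 if s > 0.5 else 0 for s in scores]
accuracy = sum(p == t for p, t in zip(predictions, y_true)) / len(y_true)

print(accuracy)
print(precision_at_k(y_true, scores, k=2))
```

In this toy case the model makes one mistake overall, yet both of its two most confident predictions are correct, which is the effect the essay relies on.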
With our predictions prioritized, we can now confidently present our recommendations to the department.
To do so, we likely should visualize our prioritized predictions as follows:

Luckily, as we are about to send our final presentation to the printer, a thought crept into our head … ‘I wouldn’t want to be the maintenance team driving all over the country to visit these dispersed sites.
Maybe there are other factors to take into account’….
Part 3: Maximizing Impact

While accuracy is important, there are a number of other factors that we likely want to take into account when prioritizing which water pumps to service.
For example, it may make sense to consider the following:

Population: How many people are impacted by the water pump?
Distance: How close to other non functioning water pumps is this one?

By combining these metrics with the predicted probability that a well is non functional, we can derive an ‘impact score’ for each water pump.
The formula to derive our impact score could look something like the following:

Prioritizing water pump visits based on impact score leads us to recommend the department service the following water pumps:

Further, in addition to prioritizing maintenance visits for the department, we can also quantify their potential impact:

Servicing the prioritized water pumps over the next year will increase access to clean water for approximately 600,000 people.

One final note: it is important to consider how our impact score may be biased or produce unintended results.
For example, as shown in the above visualizations, our impact score formula clearly biases toward areas with tighter concentrations of water pumps and people — meaning that populations in remote areas are more likely to go unserved.
The change in prioritized wells is shown in the below visualization:

In a real world scenario, such an unintended consequence may require us to adjust the impact score formula to better represent our intentions.
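One way to experiment with such adjustments is to make the weight of each factor explicit. The sketch below assumes a simple weighted sum of the predicted probability, the scaled population, and one minus the scaled distance; this form, the function names, and the three example pumps are illustrative assumptions and not necessarily the formula used in this essay.

```python
def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def impact_scores(p_broken, population, distance, weights=(1.0, 1.0, 1.0)):
    """Assumed weighted sum; a smaller distance contributes a larger score."""
    w_prob, w_pop, w_dist = weights
    pop_scaled = min_max_scale(population)
    dist_scaled = min_max_scale(distance)
    return [
        w_prob * p + w_pop * pop + w_dist * (1.0 - d)
        for p, pop, d in zip(p_broken, pop_scaled, dist_scaled)
    ]


def rank(scores):
    """Indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Three hypothetical pumps: probability broken, people served, distance measure.
p_broken = [0.9, 0.8, 0.6]
population = [100, 2000, 500]
distance = [0.5, 0.1, 0.3]

equal_weights = rank(impact_scores(p_broken, population, distance))
probability_only = rank(
    impact_scores(p_broken, population, distance, weights=(1.0, 0.0, 0.0))
)
print(equal_weights)     # [1, 2, 0]
print(probability_only)  # [0, 1, 2]
```

Lowering the population and distance weights moves the ranking back toward pure model confidence, which is one lever for reducing the bias against remote sites.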
Conclusion

In this essay, we answered a critical question that faces nearly every organization: how do we deploy scarce resources to maximum impact?
In this contrived example, we considered how we could prioritize opportunities identified by our model.
Further, we explored why maximizing accuracy alone is likely not the right answer.
To that end, we looked at how to incorporate factors beyond just accuracy into prioritizing recommendations.
I hope you enjoyed this essay. I’d love to connect with you on Twitter.
If you would like to dive into the code behind this essay, all of it is available in a public repository.
Footnotes

[1] This scenario is based on the situation described by DrivenData.
The analysis uses the data provided by DrivenData.
[2] The population for each site is scaled to the range [0, 1] so that it has the same magnitude as the predicted probabilities.
[3] Distance is determined by creating 5 clusters, using a KMeans estimator, based on longitude and latitude.
The five clusters each ‘represent’ one of the available workers the department has to service water pumps.
Each site’s latitude and longitude are then subtracted from those of the centroid of the cluster to which that site belongs.
The absolute values of the differences between the site’s coordinates and the centroid’s are then averaged.
Finally, the value is scaled to the range [0, 1] so that it has the same magnitude as the predicted probabilities.
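The steps in footnote [3] could be sketched roughly as follows with scikit-learn's `KMeans`. The function name `centroid_distance_feature`, the `n_init` and `random_state` settings, and the synthetic coordinates are assumptions made for this illustration; the code in the repository may differ.

```python
import numpy as np
from sklearn.cluster import KMeans


def centroid_distance_feature(longitude, latitude, n_clusters=5, random_state=0):
    """Scaled mean absolute coordinate difference to each site's own centroid."""
    coords = np.column_stack([longitude, latitude])
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    kmeans.fit(coords)

    # Look up, for every site, the centroid of the cluster it was assigned to.
    own_centroid = kmeans.cluster_centers_[kmeans.labels_]

    # Average the absolute longitude and latitude differences per site.
    mean_abs_diff = np.abs(coords - own_centroid).mean(axis=1)

    # Min-max scale to [0, 1] to match the magnitude of the probabilities.
    span = mean_abs_diff.max() - mean_abs_diff.min()
    if span == 0:
        return np.zeros_like(mean_abs_diff)
    return (mean_abs_diff - mean_abs_diff.min()) / span


# Synthetic coordinates standing in for the real site locations.
rng = np.random.default_rng(0)
lon = rng.uniform(29.0, 40.0, size=200)
lat = rng.uniform(-11.0, -1.0, size=200)

dist = centroid_distance_feature(lon, lat)
print(dist[:5])
```

Note that this measure works on raw degrees of longitude and latitude, as the footnote describes, rather than on a true geographic distance.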