The REAL Correct Way to Handle Missing Data

Simple.

You talk to your client.

Now, this answer is so short and so simple that some of you may find it downright disappointing.

Then please allow me to elaborate by sharing an experience I once had.

Several years ago, I was part of a team that built a model for a client whose business was built largely around giving quotes on large orders.

The model provided recommendations to quote specialists on how to discount the items in the quote based on what was in the order, the quantities being ordered, and previous purchasing behavior exhibited by the customer, among other things.

The idea was to build a model that could strike the ideal balance where the items were discounted just enough that the customer would purchase the order, but not any more than necessary, so that profits were still maximized.

In addition to all the Bayesian statistics and machine learning algorithms you might imagine went into the design of this model, it was also important to the client that we incorporate certain business rules.

For example, they wanted to maintain a minimum profit margin of 25 percent on any item sold.

The client also wanted to honor the published list price.

That is, the client never wanted to bid above the list price, even if the data indicated that a customer might pay a higher price.

These constraints on the model required us to know both the cost and list price of every item in the client’s catalogue.
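To make the two business rules concrete, here is a minimal sketch of how a model's proposed price could be clamped to them. The function name and its parameters are hypothetical, not the code we actually shipped. It also assumes "profit margin" means profit as a share of the selling price; if the client meant markup over cost, the floor would be computed differently.

```python
def clamp_quote_price(proposed_price, cost, list_price, min_margin=0.25):
    """Clamp a model-proposed unit price to the two business rules.

    Assumption: margin = (price - cost) / price, so the lowest
    acceptable price is cost / (1 - min_margin).
    """
    floor = cost / (1.0 - min_margin)
    if floor > list_price:
        # The two rules conflict for this item; that is a question
        # for the client, not something to resolve silently in code.
        raise ValueError("list price cannot satisfy the minimum margin")
    # Never below the margin floor, never above the list price.
    return min(max(proposed_price, floor), list_price)


if __name__ == "__main__":
    print(clamp_quote_price(50.0, cost=60.0, list_price=100.0))
    print(clamp_quote_price(120.0, cost=60.0, list_price=100.0))
```

Notice that neither bound can be computed for an item that is missing its cost or its list price, which is exactly why the gaps in the catalogue mattered.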

It certainly seemed like a given that the client would have this data.

However, once executives sent us the data set for all items in their catalogue, we found that we were missing either the cost or the list price for a significant portion of the items.

To give some context, this client sold items that were as little as 10 cents and as much as $4,000.

It was very unlikely that setting all the missing costs or list prices to the mean would have given a sensible value.

So, what did we do? We had a five-minute phone call with the client company’s head of sales.

We explained the situation, and he told us that they almost always structured their prices so that cost was around 40 percent of the list price.

So, anywhere we had list price, but no cost, we could impute cost at 40 percent of the list price.

For any item that had a cost, but no list price, we could impute the list price so that cost was 40 percent of the imputed list price.
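As a rough sketch, the rule from that phone call fits in a few lines. The record layout (`cost`, `list_price`) and the `*_imputed` flags are assumptions made for illustration; only the 40 percent ratio comes from the head of sales.

```python
COST_TO_LIST_RATIO = 0.40  # rule of thumb supplied by the client's head of sales


def impute_cost_and_list(item):
    """Return a copy of `item` with a missing cost or list price filled in.

    `item` is a dict with "cost" and "list_price" keys (hypothetical layout);
    a missing value is represented as None.
    """
    filled = dict(item)
    cost = filled.get("cost")
    list_price = filled.get("list_price")
    if cost is None and list_price is not None:
        filled["cost"] = round(list_price * COST_TO_LIST_RATIO, 2)
        filled["cost_imputed"] = True  # keep a record of what was inferred
    elif list_price is None and cost is not None:
        filled["list_price"] = round(cost / COST_TO_LIST_RATIO, 2)
        filled["list_price_imputed"] = True
    # If both values are missing, the rule cannot help; leave the item as is.
    return filled


if __name__ == "__main__":
    print(impute_cost_and_list({"sku": "A", "cost": None, "list_price": 10.0}))
    print(impute_cost_and_list({"sku": "B", "cost": 4.0, "list_price": None}))
```

Flagging the imputed values is a design choice rather than part of the client's rule, but it makes it easy to show stakeholders exactly which numbers were inferred.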

It was that easy! There was no measure of central tendency or mathematical theorem that would’ve given us the correct answer in this situation.

We only knew what to do by talking to our client, so we could understand the business logic around the variables in question.

Some readers might be thinking perhaps we could’ve built a simple model to infer the relationship between cost and list price.

While this is probably true, this is still very much an Ivory Tower answer.

Using a model to infer the relationship still doesn’t beat getting a direct insight into your client’s business logic, and, more importantly, getting the client’s blessing on an agreed-upon path forward.

This brings us to the topic of “stakeholder buy-in.”

To have a successful data science project, you need the stakeholders to approve of what you’re doing, how you’re doing it, and with which data.

I’ve seen entire avenues of analysis and modeling within a project be slowed or halted completely, because, while the mathematics and theory behind the analysis were sound, the stakeholders just couldn’t or wouldn’t get on board.

Sometimes, this can be resolved by simply finding a clearer and more concise way to explain the analysis to the stakeholders.

Sometimes, however, this isn’t possible, because the reason the stakeholders are objecting has to do with their business logic, values or goals.

The importance of this simply can’t be overstated.

We should always be considering the business logic, values and goals of our clients, ensuring that our clients’ objectives are our objectives.

Brute force methods, such as dropping observations or assuming all the missing values can be replaced with the mean, should only be used as a last resort in cases where there is no clear guidance from the client.

Even in these cases, it’s only something you should do with the approval of your client.

You definitely don’t want to find out that company leaders find this method a little too rough or haphazard after you’ve already moved forward with it.

Rather, it’s always best to make sure everyone is comfortable with how the project is moving forward at every stage.

This may seem as though it’s inefficient and slow, and deadlines are a very real part of any consulting engagement.

It can be hard to feel good about spending more time on extra analyses: seeing what kinds of results you get when you replace missing values with different measures of central tendency, determining the best measure to use, and preparing a clear explanation for your client of the pros and cons of each choice.

However, your project time scope is exactly why you need to approach this issue in this way.

It ultimately saves time to ensure you don’t waste it pursuing something your stakeholders are going to second-guess and are probably never going to agree to.

A final thought on how to proceed if your client is unable to provide any useful guidance about how to handle missing data: See if you can make solving the problem as small and simple as possible for them.

Let’s say that our conversation with the head of sales hadn’t gone so swimmingly.

Say they didn’t have a clear, consistent strategy around the relationship between cost and list price.

What then? As part of the process of brainstorming a solution, I would focus on methods to shrink what we need to ask of them in order to fix the problem.

In this particular case, we got lucky.

The items were organized into departments, and also divided into item categories within each department.

Rather than simply taking the mean cost or list price of the entire data set, I would’ve tried aggregating the items within each department and item category before taking the measure of central tendency.

This way, I could know, and tell my client, that the measure used only involved similar items.

This way, I wouldn’t be asking the stakeholders to accept something so rough.

Also, if there were certain cases where they weren’t comfortable with the result, there would be far fewer data points for which I would need a cost or list price.

My ask of the client is likely to be much smaller and simpler with this approach than just trying to find the mean of all the items.
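Here is one way that fallback might look, as a sketch only. The field names (`department`, `category`) are assumed, and the median is my pick for the measure of central tendency because prices ranging from 10 cents to $4,000 make the mean easy to skew; the right measure is something to agree on with the client.

```python
from collections import defaultdict
from statistics import median


def impute_by_category(items, field):
    """Fill missing `field` values using only items in the same group.

    Each item is a dict with "department" and "category" keys (assumed
    layout). Returns (filled_items, unresolved), where `unresolved` holds
    the items whose group had no known values to draw on.
    """
    known = defaultdict(list)
    for item in items:
        if item[field] is not None:
            known[(item["department"], item["category"])].append(item[field])

    filled, unresolved = [], []
    for item in items:
        row = dict(item)
        if row[field] is None:
            peers = known.get((row["department"], row["category"]))
            if peers:
                row[field] = median(peers)
                row[field + "_imputed"] = True
            else:
                unresolved.append(row)  # the short list to take to the client
        filled.append(row)
    return filled, unresolved


if __name__ == "__main__":
    catalogue = [
        {"sku": "S1", "department": "tools", "category": "screws", "cost": 0.10},
        {"sku": "S2", "department": "tools", "category": "screws", "cost": 0.30},
        {"sku": "S3", "department": "tools", "category": "screws", "cost": None},
        {"sku": "E1", "department": "electronics", "category": "servers", "cost": None},
    ]
    filled, unresolved = impute_by_category(catalogue, "cost")
    print(filled[2]["cost"], [row["sku"] for row in unresolved])
```

The `unresolved` list is the point of the exercise: it is the small, specific set of items I would bring back to the client instead of asking them to bless one number for the whole catalogue.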

I’m sure there will be readers who can come up with all kinds of “what ifs” about different specific scenarios in which one may find oneself.

I’m not going to be able to cover every possible scenario in this article.

These examples are simply meant to illustrate a general approach and way of thinking when dealing with this issue.

I’ll attempt to sum up the ideas with a few simple rules to follow.

1. Always talk to your client about missing values in the data.

2. If they have clear guidance to give, take it. If they don’t, see if you can gain an understanding of the process being described by the data, and how the details translate into the specific representation you see in the data. Try to gain an understanding of their business logic, values and goals around the process represented in the data. You should try to understand all of this even if they have clear guidance for you. In the event that they don’t, use this understanding to arrive at clear guidelines you can follow.

3. If all else fails, go ahead and use simple models to infer relationships, or measures of central tendency.

4. Always get the client’s approval for the method you think is best before proceeding.

Following these guidelines will always make your project run more smoothly and your clients happier than if you just resort first to the above-mentioned Ivory Tower answers.
