Data Science With No Data

Now you have some sample data that you can utilize for developing model prototypes.

Preprocess the Dataset for Input Into a Model

The next step is to preprocess the data into a form that works with machine learning models.

These are the steps for this process:

1. Visualize the data to make sure that it seems reasonable.
2. Place continuous data into buckets.
3. Convert data fields into one-hot encoded columns.

Visualizing the Data

Here we want to make sure that the dataset we generated looks the way we’re expecting.

The first field I am curious about is the age field, so I plotted it to make sure that it roughly matches the graph I showed above.
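The plotting code itself isn’t reproduced here; a minimal sketch of what it might look like, assuming the generated data lives in a pandas DataFrame named df with a numeric Age column (both names are assumptions), is:

```python
import matplotlib.pyplot as plt

# Minimal sketch: histogram of the generated ages.
# Assumes the generated data is in a pandas DataFrame named df
# with a numeric "Age" column; adjust the names to match your data.
plt.hist(df["Age"], bins=20)
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Distribution of generated ages")
plt.show()
```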

It looks pretty close to what I was expecting.

I can always go back and modify the formula to improve it as I learn more about the data and models.

Place Continuous Data Into Buckets

We have two fields that contain what I call continuous data (i.e., they’re not boolean or categorical).

Those fields are Age and BMI.

Age can contain up to 100 unique values, and BMI can have 40 unique values.

This can become unwieldy for our model, so we’ll categorize them into buckets.

For both age and BMI, we’ll place them into five buckets.
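One way to do this is with pandas’ pd.cut; the sketch below reuses the assumed DataFrame and column names from the earlier snippet and isn’t necessarily how the original code does it:

```python
import pandas as pd

# Sketch: bucket the continuous Age and BMI columns into five bins each.
# labels=False makes pd.cut return integer bucket indices (0 through 4).
df["Age_group"] = pd.cut(df["Age"], bins=5, labels=False)
df["BMI_group"] = pd.cut(df["BMI"], bins=5, labels=False)

# The raw continuous columns are no longer needed as model inputs.
df = df.drop(columns=["Age", "BMI"])
```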

Now let’s visualize the data again and make sure everything looks as expected.

Looks pretty good, so let’s proceed to the next step.

Convert Data Fields into One-Hot Encoded Columns

As you can see, Age_group and BMI_group are on a different scale than Diabetic, Smoker, and Gender.

For example, Age_group is on a scale between 0 and 4, while Diabetic is between 0 and 1.

There are many ways to address this issue, and for this one, I’ll use a technique called one-hot encoding.

The idea is that we pivot the Age_group column into five columns, each containing a zero except for the column corresponding to the age group, which contains a one.

For example, if Age_group == 2, the values of the columns would be 0, 0, 1, 0, 0.

Luckily, Python has a built-in function that will do this conversion for us.
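The conversion code isn’t shown here; one common way to do it is pandas’ get_dummies (the DataFrame and column names below are assumptions carried over from the earlier sketches):

```python
import pandas as pd

# Sketch: pivot each bucketed column into one-hot encoded columns.
# get_dummies creates Age_group_0 ... Age_group_4 (and likewise for
# BMI_group), each holding a 0 or 1. dtype=int keeps the values numeric.
df = pd.get_dummies(df, columns=["Age_group", "BMI_group"], dtype=int)
```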

Here’s what our dataset looks like after running the code above:

Now all of the input fields contain a value of zero or one.

We are finally ready to start building a model based on the data.

There are two points I want to make before we do that.

First, I didn’t add any null checking or data type checking to this model.

You would want to do this when you start working on “real” data.

Second, you may be wondering why I went through all of this data processing in the first place (i.e., why didn’t I just generate a dataset that looks like the completed process?).

The answer is that I wanted to generate a dataset that looks like what I would probably get from a real source.

It’s very unlikely that you’d find a dataset that has already been bucketized and encoded.

Train a Model

The first step in training a model is to split our data into a training set and a test set.

We will train the model on the training set, then test it by presenting data that the model hasn’t seen and comparing the predicted result with the actual result.

This difference is called the error, and it’s what we want to minimize as we train the model.

In this tutorial, we’ll compute the error using Root Mean Squared Error (RMSE), but there are numerous error functions you could use.

Here is the code to split the data (in this case we’re using 70% for training and 30% for testing).
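The original snippet isn’t reproduced here; a minimal sketch using sklearn’s train_test_split, assuming the one-hot encoded inputs from above and a target column named Cost (an assumed name), might look like:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Sketch: 70/30 split of the preprocessed data.
# "Cost" is an assumed name for the target (medical cost) column.
X = df.drop(columns=["Cost"])
y = df["Cost"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

def rmse(actual, predicted):
    """Root Mean Squared Error between actual and predicted values."""
    return np.sqrt(mean_squared_error(actual, predicted))
```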

Now that we have our data in a usable format, we can build and train a model to predict results.

For this exercise, we will be using the K-Nearest Neighbor model.

Python supports many different models we could utilize, but I feel that this is the most intuitive model for starting out in data science.

Here is how it works:

Suppose we have a dataset that looks like this (X1 and X2 are input features, while Y is the resulting value).

Now, let’s plot this data to see what it looks like:

With this, we can predict a Y value for a new point on the graph by averaging the Y values for the K nearest neighbors.

For this example, we’ll set K = 4.

The red point will be our new input point.

Since K == 4 in this example, we will take the mean Y value of the 4 points closest to our new value and use it to predict what the Y value for our new point would be.

In this case, we would be using the highlighted rows:

For our new point, the estimate would be 60.5 for the Y value.

Python’s sklearn library has a built-in K-Nearest Neighbors regressor that we will utilize.
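For reference, here is roughly how that regressor is used, reusing the split and the rmse helper from the sketch above (K is set to 4 here only to mirror the worked example):

```python
from sklearn.neighbors import KNeighborsRegressor

# Sketch: fit a K-Nearest Neighbors regressor and measure its test error.
knn = KNeighborsRegressor(n_neighbors=4)
knn.fit(X_train, y_train)
print("RMSE:", rmse(y_test, knn.predict(X_test)))
```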

The only trick is: what value should we set K to be? The simplest technique is to run the model with multiple values of K and log the error for each value.

We can use that to find the sweet spot for accuracy.

Here is the code to accomplish this; a sketch of the loop is shown below. After running it, we can plot the error rate in relation to K. To determine the best value for K, we will utilize the Elbow Method.
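A sketch of that loop, reusing the train/test split and the rmse helper from the earlier sketches (the range of K values is an arbitrary choice):

```python
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor

# Sketch: train a model for a range of K values and record the test error.
k_values = range(1, 60)
errors = []
for k in k_values:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train, y_train)
    errors.append(rmse(y_test, model.predict(X_test)))

# Plot error against K so we can look for the "elbow".
plt.plot(list(k_values), errors)
plt.xlabel("K")
plt.ylabel("RMSE")
plt.show()
```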

The Y axis of the chart is the error value.

The X axis is the K value.

Looking at the chart, it appears that somewhere around 55 would be the best value for K.

However, the “elbow” appears around 10.

After that, your results start to flatten out or even get worse.

For this tutorial, I’ll split the difference and use K = 20.

For this model and dataset, we can predict the cost of a patient’s medical care to within about $2,400 (the mean cost for our generated data was around $11K).

This isn’t great, but there are numerous enhancements we could make to our model as well as our data generation.

However, it will still be helpful in identifying potentially fraudulent claims, which we’ll do next.

Identifying Fraudulent Claims

Now let’s see if we can identify some fraudulent claims.

To do this, I generated 100 new rows of data.

The idea for this project is that a doctor is double charging for insurance claims.

So, 90 of the rows utilized the add_rows function as is and set a flag called ‘Fraud’ to zero.

I then added 10 rows but I doubled the cost and set the ‘Fraud’ flag to 1.
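The generation code isn’t shown here; a rough sketch of the idea is below. It assumes the add_rows helper from earlier in the article returns a DataFrame of generated claims and that the cost column is named Cost (both the signature and the column name are assumptions):

```python
import pandas as pd

# Sketch only: add_rows' exact signature and the "Cost" column name are assumed.
legit = add_rows(90)                 # 90 legitimate claims
legit["Fraud"] = 0

fraud = add_rows(10)                 # 10 claims where the doctor double-charges
fraud["Cost"] = fraud["Cost"] * 2
fraud["Fraud"] = 1

claims = pd.concat([legit, fraud], ignore_index=True)
```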

Here’s a chart of this new dataset.

The green dots represent fraudulent claims, while the blue dots represent legitimate claims.

It looks like we have five outliers which we would investigate.

In this case, they all happen to be fraudulent.

The other five fraudulent claims are hidden among the legitimate claims.

Finding 50% of the fraudulent claims seems pretty good, but can we do better?

For this algorithm, we will run all 100 claims through our model, divide the actual cost by the predicted cost (I called the result of this calculation the “error”), and look for outliers.
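A sketch of that calculation, reusing the trained knn model from above and assuming the new claims are in a DataFrame named claims with the same one-hot encoded input columns (the 1.5 threshold is just an illustrative cutoff):

```python
# Sketch: ratio of actual cost to predicted cost; unusually large ratios
# are candidates for a fraud investigation.
inputs = claims.drop(columns=["Cost", "Fraud"])
claims["Predicted"] = knn.predict(inputs)
claims["Error"] = claims["Cost"] / claims["Predicted"]

# Flag claims whose actual cost is far above what the model expects.
suspicious = claims[claims["Error"] > 1.5]
print(suspicious[["Cost", "Predicted", "Error", "Fraud"]])
```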

Here is a chart showing the results:

In this chart, we see that we have done a pretty good job of identifying outliers.

This is a really good result and could help streamline the process of identifying fraud.

Conclusion

First, I want to credit this article for walking through the process of using KNN for regression.

Second, I need to come clean here and let you know that I’m not a mathematician, and I’m relatively new to Python, so I’d love to get your feedback on how I can improve the code in both of these areas.

Third, the code has been posted on GitHub; feel free to use it in your projects and let me know what you think.
