Data Science with no Math

Using AI to Build Mathematical Datasets

Rich Folsom · Mar 15

This is an addendum to my last article, in which I had to add a caveat at the end that I was not a mathematician and was new to Python.

I added this because I struggled to come up with a mathematical formula to generate patient data that would follow a trend that made sense to me.

The goal of this article is to generate 10,000 patient records that correlate age with cost.

I wanted the correlation to follow a pattern that looks something like this:

[Figure: artist's rendition of the correlation pattern (not exact)]

The Y-axis is the cost multiplier; the X-axis is the age.

The idea here is that patient costs start relatively high, decrease as patients approach a certain age, then start increasing again.

After much trial and error, I came up with a formula that would generate a graph that looks like this:

[Figure: graph generated by the trial-and-error formula]

You can obviously see that there are some flaws in this formula.

The most glaring one is that it implies costs level out once the patient hits 60.

In the correlation that I wanted to use, the cost continues to increase as age increases.

For the sake of completing the article, I felt this was close enough and was ready to move on to writing actual code.

For days after I published the article, I continued to try to come up with a formula that would follow my correlation pattern, with no success.

Then one day I had an epiphany: why not let the computer figure out the formula?

If I could successfully implement this, I could focus my efforts on improving my Python knowledge (which was my goal in the first place) rather than figuring out a mathematical formula.

Let’s use machine learning to generate an approximation of the formula that I wanted, using a few values as input.

Once we have a trained model, we can use it to generate a full sample dataset for input into another machine learning model.

The first step is to set up a few values to train our model.
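Here is a minimal sketch of what those seed values might look like, assuming a handful of hand-picked (Age, Cost multiplier) pairs that are high at birth, dip through middle age, and rise again in old age; the specific numbers are just illustrative, not the exact values used in the article:

import numpy as np
import pandas as pd

# Hand-picked seed points following the desired shape:
# expensive at birth, cheapest in middle age, increasingly expensive in old age.
age_df = pd.DataFrame({
    'Age':  [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'Cost': [4.0, 2.5, 1.5, 1.0, 1.0, 1.2, 1.8, 2.8, 4.0, 5.5, 7.0]
})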

Now, if we plot this, you can see that it roughly follows the picture I drew above.

[Figure: plot of the seed Age/Cost training values]

We can now use this data as input to a neural network to build a model that we can train to predict a cost for any age that we pass in:

from sklearn.neural_network import MLPRegressor

regr = MLPRegressor(hidden_layer_sizes=(30,), activation='tanh', solver='lbfgs', max_iter=20000)
model = regr.fit(np.array(age_df['Age']).reshape(-1, 1), age_df['Cost'])

The MLP in MLPRegressor stands for Multi-Layer Perceptron, which is a type of neural network that is part of the sklearn Python library.

The sklearn library has numerous regressors built in, and it’s pretty easy to experiment with them to find the best results for your application.

All of the regressors have a fit function that trains the model with the given input.
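As one illustration (not from the article), swapping in a different sklearn regressor while keeping the rest of the workflow unchanged could look something like this:

from sklearn.ensemble import RandomForestRegressor

# Every sklearn regressor exposes the same fit(X, y) / predict(X) interface,
# so trying an alternative is essentially a one-line change.
alt_regr = RandomForestRegressor(n_estimators=100)
alt_model = alt_regr.fit(np.array(age_df['Age']).reshape(-1, 1), age_df['Cost'])

You could then compare the two models' predictions to see which curve comes closer to the drawing.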

Now that our model is trained, let’s generate a test dataset to see how our model did.

df = pd.DataFrame({
    'Age': np.arange(0, 100),
    'Cost': model.predict(np.arange(0, 100).reshape(-1, 1))
})

In this case, we're generating a DataFrame containing a row for every age between zero and 100, along with the cost that is predicted by our model for that age.

Plotting the results of this gives us:

[Figure: plot of the predicted Cost vs. Age values]

This looks much more like the picture I drew at the top of the article.
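The plotting code itself isn't shown in the article; a minimal way to produce a similar plot, assuming pandas' matplotlib-backed plotting is available, would be:

import matplotlib.pyplot as plt

# Scatter plot of the model's predicted cost multiplier for each age.
df.plot.scatter(x='Age', y='Cost')
plt.show()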

However, we don’t want our model to predict the exact cost multiplier for an age.

Instead, we want to use the prediction as a baseline to predict a random value.

In this case, we’ll adjust the data so that the cost is within ±20% of the prediction.

Here’s how to do this in Python:

import random

df['Cost'] = [i + i * random.uniform(-0.2, 0.2) for i in df['Cost']]

Now, if we plot our dataset, it looks like this:

[Figure: Cost vs. Age after applying ±20% random noise]

We have now generated 100 values that roughly follow the drawing at the top of the article.

Let’s generate a dataset of 10,000 rows using this model.

df2 = pd.DataFrame({'Age': (np.random.random_sample((10000,)) * 100).astype(int)})
df2['Cost'] = model.predict(np.array(df2['Age']).reshape(-1, 1))

Here’s a scatter plot of those 10,000 Age/Cost values, and as we can see, it still roughly follows the drawing at the top of the article.

[Figure: scatter plot of the 10,000 generated Age/Cost values]

Now we’ll add some randomness to the dataset and see what it looks like:

df2['Cost'] = [i + i * random.uniform(-0.2, 0.2) for i in df2['Cost']]

[Figure: scatter plot of the 10,000 Age/Cost values after adding ±20% noise]

We can now use this as part of a dataset to predict healthcare costs using Age as one of the inputs.

This concept could be used effectively to augment this dataset on Kaggle, which captures valuable trends but contains only 1,338 rows.

Using this technique, we could generate as many rows as we wanted to input into a model.
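For instance, one way to do that (not shown in the article, but built from the same calls used above) is to wrap the generation in a reusable helper so we can ask for any number of rows; the helper name and noise range here are just illustrative, and it relies on np, pd, random, and the trained model from earlier:

def generate_rows(n, model):
    """Generate n synthetic (Age, Cost) rows from the trained model, with ±20% noise."""
    ages = (np.random.random_sample((n,)) * 100).astype(int)
    costs = model.predict(ages.reshape(-1, 1))
    costs = [c + c * random.uniform(-0.2, 0.2) for c in costs]
    return pd.DataFrame({'Age': ages, 'Cost': costs})

# Example: produce 50,000 synthetic rows instead of 10,000.
big_df = generate_rows(50000, model)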
