Data cleaning and feature engineering in Python

Finally, we will drop usecode (e.g. house, condo, mobile home), which could be quite useful but we will not use it for this example.

def drop_geog(data, keep = []):
    remove_list = ['info', 'address', 'z_address', 'longitude', 'latitude', 'neighborhood', 'lastsolddate', 'zipcode', 'zpid', 'usecode', 'zestimate', 'zindexvalue']
    for k in keep:
        remove_list.remove(k)
    data = data.drop(remove_list, axis=1)
    # Also drop any leftover "unnamed" index columns from the CSV import
    data = data.drop(data.columns[data.columns.str.contains('unnamed', case=False)], axis=1)
    return data

housing = drop_geog(housing)

Now that we have cleaned up the data, let’s take a look at how a few algorithms perform on it.

We will use scikit-learn.

First, we need to split the data into testing and training sets, again using a function that we can reuse later.

This ensures that when we test the data, we are actually testing the model on data it has never seen before.

from sklearn.model_selection import train_test_split

def split_data(data):
    y = data['lastsoldprice']
    X = data.drop('lastsoldprice', axis=1)
    # Return (X_train, X_test, y_train, y_test)
    return train_test_split(X, y, test_size=0.2, random_state=30)

housing_split = split_data(housing)

Let’s try Linear Regression first.

import sys
from math import sqrt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.linear_model import LinearRegression

def train_eval(algorithm, grid_params, X_train, X_test, y_train, y_test):
    regression_model = GridSearchCV(algorithm, grid_params, cv=5, n_jobs=-1, verbose=1)
    regression_model.fit(X_train, y_train)
    y_pred = regression_model.predict(X_test)
    print("R2: ", r2_score(y_test, y_pred))
    print("RMSE: ", sqrt(mean_squared_error(y_test, y_pred)))
    print("MAE: ", mean_absolute_error(y_test, y_pred))
    return regression_model

train_eval(LinearRegression(), {}, *housing_split)

This train_eval function can be used with any arbitrary scikit-learn algorithm, for both training and evaluation.

This is one of the great benefits of scikit-learn.

The first line of the function sets up a grid search over the set of hyperparameters that we want to evaluate. In this case, we pass in {} so we can just use the default hyperparameters on the model.

The second and third lines of this function do the actual work, fitting the model and then running a prediction on it.

The print statements then show some stats that we can evaluate.
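To illustrate what a non-empty grid would look like (a hypothetical example for illustration only, not the run whose results follow), we could ask GridSearchCV to compare fitting with and without an intercept:

# Hypothetical grid: GridSearchCV will cross-validate each combination and keep the best one.
lr_params = {'fit_intercept': [True, False]}
lr_search = train_eval(LinearRegression(), lr_params, *housing_split)
print(lr_search.best_params_)  # the winning combination found by the grid search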

Let’s see how we fared.

R2: 0.5366066917131977
RMSE: 750678.476479495
MAE: 433245.6519384096

The first score, R², also known as the Coefficient of Determination, is a general evaluation of the model showing the percentage of variation in the prediction that can be explained by the features.

In general, a higher R² value is better than a lower one.

The other two stats are the root mean squared error (RMSE) and the mean absolute error (MAE). These two are only meaningful in relation to the same statistic computed for other models on the same data.
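To make these metrics concrete, here is a minimal sketch of how they could be computed by hand, using made-up arrays standing in for y_test and y_pred:

import numpy as np

y_true = np.array([1_000_000, 1_500_000, 2_000_000])  # hypothetical sale prices
y_hat = np.array([1_100_000, 1_400_000, 2_300_000])   # hypothetical predictions

mae = np.mean(np.abs(y_true - y_hat))                 # average absolute error, in dollars
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))        # penalizes large errors more heavily
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                              # fraction of variance explained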

Having said that, an R² of 0.53, with the other stats in the many hundreds of thousands (for houses probably costing one or two million dollars), is not great.

We can do better.

Let’s see how a few other algorithms perform.

First, K-Nearest Neighbors (KNN).

from sklearn.neighbors import KNeighborsRegressor

knn_params = {'n_neighbors': [1, 5, 10, 20, 30, 50, 75, 100, 200, 500]}
model = train_eval(KNeighborsRegressor(), knn_params, *housing_split)

If Linear Regression is mediocre, KNN is terrible!

R2: 0.15060023694456648
RMSE: 1016330.95341843
MAE: 540260.1489399293

Next we will try Decision Tree.

from sklearn.tree import DecisionTreeRegressor

tree_params = {}
train_eval(DecisionTreeRegressor(), tree_params, *housing_split)

This is even worse!

R2: 0.09635601667334437
RMSE: 1048281.1237086286
MAE: 479376.222614841

Finally, let’s look at Random Forest.

from sklearn.ensemble import RandomForestRegressor

forest_params = {'n_estimators': [1000], 'max_depth': [None], 'min_samples_split': [2]}
forest = train_eval(RandomForestRegressor(), forest_params, *housing_split)

This one is a bit better, but we can still do better.

R2: 0.6071295620858653
RMSE: 691200.04921061
MAE: 367126.8614028794

How do we improve on these results? One option is to try other algorithms, and there are many, and some will do better.

But we can also fine-tune our results by getting our hands dirty in the data with feature engineering.

Let’s reconsider some of the features that we have in our data.

Neighborhood is an interesting field.

The values are things like “Portrero Hill” and “South Beach.” These cannot be simply ordered (from most expensive to least expensive neighborhood), or at least, doing so would not necessarily produce better results.

But we all know that the same house in two different neighborhoods will have two different prices.

So we want this data.

How do we use it? Python’s Pandas library gives us a simple tool for creating a “one-hot encoding” of these values. This takes the single “neighborhood” column and creates a new column for each value that appears in it. For each of these new columns (with header names like “Portrero Hill” and “South Beach”), a row of data gets a 1 if it has that value in the original neighborhood column, and a 0 otherwise. The machine learning algorithms can now build a weight associated with each neighborhood, which is applied if the data point is in that neighborhood (if the value of that column is 1) and not applied otherwise (if it is 0).
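As a quick illustration of what get_dummies produces, here is a sketch on a made-up three-row frame (not the actual housing data):

import pandas as pd

toy = pd.DataFrame({'neighborhood': ['Portrero Hill', 'South Beach', 'Portrero Hill']})
print(pd.get_dummies(toy['neighborhood']))
# One column per distinct value; each row is marked for the value it had
# (recent pandas versions print True/False rather than 1/0):
#    Portrero Hill  South Beach
# 0              1            0
# 1              0            1
# 2              1            0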

First, we need to retrieve our check-pointed data, this time keeping the “neighborhood” field.

housing_cleaned = drop_geog(clean_data.copy(), ['neighborhood'])

Now we can create a one-hot encoding for the “neighborhood” field.

one_hot = pd.get_dummies(housing_cleaned['neighborhood'])
housing_cleaned = housing_cleaned.drop('neighborhood', axis=1)

We will hold onto the “one_hot” value and add it later.

But first, we have to do two more things.

We need to split the data into a training set and a test set.

(X_train, X_test, y_train, y_test) = split_data(housing_cleaned)

For our final step, we need to scale and center the data.

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then transform both sets with it
scaler = StandardScaler()
scaler.fit(X_train)
X_train[X_train.columns] = scaler.transform(X_train[X_train.columns])
X_train = X_train.join(one_hot)
X_test[X_test.columns] = scaler.transform(X_test[X_test.columns])
X_test = X_test.join(one_hot)

housing_split_cleaned = (X_train, X_test, y_train, y_test)

Let’s unpack this step a bit.

First, we apply StandardScaler().

This function scales and centers the data by subtracting the mean of the column and dividing by the standard deviation of the column, for all data points in each column.

This standardizes all of the data, giving each column a mean of zero and a standard deviation of one.

It also scales the data, because some fields will vary from 0 to 10,000, such as “finishedsqft,” while others will vary only from 0 to 30, such as number of rooms.

Scaling will put them all on the same scale, so that one feature does not arbitrarily play a bigger role than others just because it has a higher maximum value.

For some machine learning algorithms, as we will see below, this is critical to getting even a half decent result.
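In effect, each standardized value is just (value - column mean) / column standard deviation. A minimal sketch with a made-up column:

import numpy as np

sqft = np.array([800.0, 1500.0, 3200.0, 10000.0])   # hypothetical finishedsqft values
standardized = (sqft - sqft.mean()) / sqft.std()     # what StandardScaler computes per column
print(standardized.mean(), standardized.std())       # approximately 0.0 and 1.0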

Second, it is important to note that we have to “fit” the scaler on the training features, X_train.

That is, we take the mean and standard deviation of the training data, fit the scaler object with these values, then transform the training data AND the test data using that fitted scaler.

We do not want to fit the scaler on the test data, as that would then leak information from the test data set into the trained algorithm.

We could end up with results that appear better than they really are (because information about the test set has leaked into training) or worse than they really are (because the test data would be scaled using its own statistics rather than those of the training set).
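One convenient way to enforce this discipline, if we had wanted to skip the manual scaling above, is scikit-learn’s Pipeline, which learns the scaler’s statistics only from whatever data it is fit on. A minimal sketch, assuming unscaled X_train and X_test straight out of split_data:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# The scaler's mean/std are learned from X_train inside fit(),
# and reused (not refit) when predicting on X_test.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)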

Now, let’s rebuild our models with the newly engineered features.

model = train_eval(LinearRegression(), {}, *housing_split_cleaned)

Now, under Linear Regression, the simplest algorithm we have, the results are already better than anything we saw previously.

R2: 0.6328566983301503
RMSE: 668185.25771193
MAE: 371451.9425795053

Next is KNN.

model = train_eval(KNeighborsRegressor(), knn_params, *housing_split_cleaned)

This is a huge improvement.

R2: 0.6938710004544473
RMSE: 610142.5615480896
MAE: 303699.6739399293

Decision Tree:

model = train_eval(DecisionTreeRegressor(), tree_params, *housing_split_cleaned)

Still pretty bad, but better than before.

R2: 0.39542277744197274
RMSE: 857442.439825675
MAE: 383743.4403710247

Finally, Random Forest.

model = train_eval(RandomForestRegressor(), forest_params, *housing_split_cleaned)

Again, a decent improvement.

R2: 0.677028227379022
RMSE: 626702.4153226872
MAE: 294772.5044353021

There is certainly far more that can be done with this data, from additional feature engineering to trying additional algorithms.

But the lesson, from this short tutorial, is that seeking more data or poring over the literature for better algorithms may not always be the right next step.

It may be better to get the absolute most you can out of a simpler algorithm first, not only for comparison but because data cleaning may pay dividends down the road.

Finally, in spite of its simplicity, K-Nearest Neighbors can be quite effective, so long as we treat it with the proper care.
