Predicting House Prices using Machine Learning

One approach is to create an annotated heatmap.

This will allow us to easily see how strongly each variable is correlated with the other variables.

Each cell contains the correlation coefficient, which tells us the strength of the linear relationship between two variables.

Heatmap containing 37 numerical features

We are interested in finding which features play a significant role in determining the sale price of the house.

We are going to set a threshold value and include all the numerical features whose correlation coefficient is greater than that threshold.

We set our threshold to 0.45 and get the following features: OverallQual, YearBuilt, YearRemodAdd, MasVnrArea, TotalBsmtSF, 1stFlrSF, GrLivArea, FullBath, TotRmsAbvGrd, Fireplaces, GarageYrBlt, GarageCars, GarageArea.
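The threshold step can be sketched as follows. This is only a minimal sketch on a tiny made-up frame, not the competition data: the helper name select_by_correlation and the toy values are assumptions for illustration, and the comparison is "greater than the threshold", as described above.

```python
import pandas as pd

def select_by_correlation(df, target, threshold):
    """Return numeric columns whose correlation with `target` exceeds `threshold`."""
    # Pearson correlation of every numeric column with the target column.
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr[corr > threshold].index.tolist()

# Toy data (hypothetical values, not taken from the Kaggle dataset).
toy = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9, 10],
    "LotFrontage": [70, 60, 80, 65, 75, 62],
    "SalePrice": [120000, 150000, 185000, 230000, 310000, 400000],
})

selected = select_by_correlation(toy, "SalePrice", 0.45)
print(selected)
```

On the real data, the same call would be made on the training set with "SalePrice" as the target.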

Since LotFrontage falls below our threshold, we drop it, which also lets us ignore the NA values present in that column.

MasVnrArea and GarageYrBlt have 8 and 81 missing values, respectively.

We can replace the missing values in the MasVnrArea column with the median of this column.

This is called imputation — replacing the missing values with an estimated value based on the feature values that are present.





…median(), inplace=True) will replace the 8 null values with the median of the MasVnrArea column.
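A minimal sketch of median imputation is shown here, assuming a pandas DataFrame named trainingSet with a MasVnrArea column. The toy values are made up, and the result is assigned back to the column rather than using inplace=True, which recent pandas versions discourage on a column selection.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training set (hypothetical values).
trainingSet = pd.DataFrame({"MasVnrArea": [0.0, 100.0, np.nan, 200.0, np.nan]})

# Median of the observed values; missing values are skipped by default.
median_area = trainingSet["MasVnrArea"].median()

# Fill the gaps and assign the result back to the column.
trainingSet["MasVnrArea"] = trainingSet["MasVnrArea"].fillna(median_area)

print(trainingSet["MasVnrArea"].tolist())
```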

For GarageYrBlt, we can replace each missing value with the YearBuilt value of the same row.

Since these two variables are strongly correlated (with a correlation coefficient of 0.83), we can also choose to drop GarageYrBlt altogether.

This leads us to the notion of redundant features.

We can also use the heatmap to help us find redundant features, that is, descriptive features that are strongly correlated with another descriptive feature.

Removing them also helps us deal with the curse of dimensionality, since the model has fewer dimensions to cover with the same amount of data.

We see that GarageCars and GarageArea are strongly correlated (since a bigger garage can fit more cars) and have the same correlation to the target feature (SalePrice).

Therefore, we are going to include only one of the two, GarageCars, when building our machine learning model.

Furthermore, TotalBsmtSF and 1stFlrSF are also strongly correlated with each other and have the same correlation to SalePrice; we are going to keep TotalBsmtSF and discard 1stFlrSF.
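Redundant pairs can also be listed programmatically instead of being read off the heatmap. The sketch below is one possible way to do it; the helper name correlated_pairs, the 0.8 cut-off, and the toy values are all assumptions for illustration, not values from the article.

```python
import pandas as pd

def correlated_pairs(df, threshold):
    """List (feature_a, feature_b, r) for feature pairs with |r| above `threshold`."""
    corr = df.select_dtypes("number").corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    return pairs

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "GarageCars": [1, 1, 2, 2, 3, 3],
    "GarageArea": [250, 280, 450, 500, 800, 850],
    "Fireplaces": [1, 0, 0, 1, 1, 0],
})

pairs = correlated_pairs(toy, 0.8)
print(pairs)
```

For each reported pair, we would then keep one feature and discard the other, as done above for the garage and basement features.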

The list of numerical features we feed our model is as follows: OverallQual, YearBuilt, YearRemodAdd, TotalBsmtSF, GrLivArea, FullBath, TotRmsAbvGrd, GarageCars, Fireplaces, MSSubClass.

This method of reducing the number of descriptive features in a dataset to just the subset that is most useful is called feature selection.

The goal of feature selection is to identify the smallest subset of descriptive features that maintains the overall model performance.

Extra: We use trainingSet.plot(x="GrLivArea", y="SalePrice", kind="scatter") to check for outliers by plotting a scatter plot of GrLivArea vs. SalePrice.


The two points on the bottom right can be outliers.

We can choose to drop the rows (or instances) associated with these two points, or plot more graphs to see whether the Ids we identified as outliers in this graph match the Ids of the outliers in a second graph.

A second scatter plot, of GarageArea vs. SalePrice, also demonstrates a strong correlation between SalePrice and GarageArea.

We can mark the 4 points on the lower right as outliers.

We can use trainingSet.…GarageArea > 1200, ["SalePrice"]] to retrieve the Id of those points.

From the resulting tables, we notice that Id 1299 appears to be an outlier in both plots.

Although we cannot deduce that Id 1299 is an outlier based on only 2 descriptive features, we are going to drop this row for demonstration by using trainingSet.drop(axis=0, index=1298, inplace=True). Row index 1298 corresponds to Id 1299 because the index starts at 0 while Id starts at 1.
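The lookup-and-drop step can be sketched on toy data as follows. Several things here are assumptions: the original filtering expression is truncated, so the use of .loc is a guess; the GrLivArea cut-off of 4000 is hypothetical; and the values are made up.

```python
import pandas as pd

# Toy stand-in for the training set (hypothetical values).
# Id starts at 1 while the default row index starts at 0.
trainingSet = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "GrLivArea": [1500, 1800, 5600, 2100],
    "GarageArea": [400, 520, 1400, 600],
    "SalePrice": [180000, 220000, 160000, 250000],
})

# Rows that look extreme in both scatter plots.
mask = (trainingSet["GarageArea"] > 1200) & (trainingSet["GrLivArea"] > 4000)
suspects = trainingSet.loc[mask, ["Id", "SalePrice"]]

# Drop by index label and keep the result in a new frame.
cleaned = trainingSet.drop(index=suspects.index)

print(suspects)
print(len(cleaned))
```

Keeping the cleaned frame under a new name makes it easy to compare model results with and without the suspected outliers.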

Notice that GarageArea is not in our final list.

I have included it to demonstrate one basic method of detecting outliers.

Categorical Features

Now it’s time to deal with categorical features and see what features play a significant role in affecting the sale prices of the homes.

For the sake of simplicity, I will use 3 features (BsmtQual, ExterQual, and ExterCond) to introduce basic techniques for converting categorical values to numerical values.

We’ll check if any of our (selected) categorical features have missing values.




…sum() gives the total number of missing values in the BsmtQual column.

We notice that BsmtQual has 37 NA values; however, these NA values mean that the house has no basement based on the description file of the project.

Therefore, we replace the 37 NA values with “no”.

ExterCond and ExterQual have no NA values.

    # Replace the NA values in the BsmtQual column in both the training set and the test set
    for df in [trainingSet, testSet]:
        for i in ['BsmtQual']:
            # assign back instead of calling fillna(..., inplace=True) on a column selection
            df[i] = df[i].fillna('no')

Furthermore, the selected columns all share the same set of traits, making it easier for us to convert them into numerical values.

These traits include: Ex for excellent, Gd for good, TA for typical, Fa for fair, Po for poor.

An important part of data preprocessing is to convert our descriptive features into a language that our machine learning model can understand.

We use map to manually assign each trait a numerical value; this method of converting is called label encoding.

(Scikit-learn has a class called LabelEncoder that can do this automatically for you.)

    for df in [trainingSet, testSet]:
        for i in ['ExterQual', 'ExterCond', 'BsmtQual']:
            df[i] = df[i].map({'Ex': 1, 'Gd': 2, 'TA': 3, 'Fa': 4, 'Po': 5, 'no': 5})

The way I assigned each value a number is arbitrary; you can assign 1 to Po, 2 to Ex, and so on.
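For comparison, here is a minimal sketch of the LabelEncoder route mentioned above, run on a made-up list of grades. Note that LabelEncoder assigns codes in sorted order of the labels rather than in order of quality, and scikit-learn's documentation describes it as meant for target labels; OrdinalEncoder is its counterpart for feature columns.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of quality grades, including the "no" placeholder.
grades = ["Ex", "Gd", "TA", "Fa", "Po", "no"]

le = LabelEncoder()
codes = le.fit_transform(grades)

# classes_ holds the labels in sorted order; a label's position is its code.
print(list(le.classes_))
print(codes.tolist())
```

Because the codes follow alphabetical order, a manual map remains the simpler choice when we want the numbers to follow the quality ranking.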

The final list of descriptive features that our model will use to train is as follows:

    feature_numerical = ['OverallQual', 'YearBuilt', 'YearRemodAdd', 'TotalBsmtSF', 'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars', 'Fireplaces', 'MSSubClass']
    feature_categorical = ['ExterQual', 'ExterCond', 'BsmtQual']
    final_features = feature_numerical + feature_categorical

As a final step, we feed these features to our model for training and predict the sale price of each house in the test set using the gradient boosting algorithm.

    from sklearn.ensemble import GradientBoostingRegressor

    # Training
    gb = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.05, max_depth=3, max_features='sqrt', min_samples_leaf=15, min_samples_split=10, loss='huber').fit(….values, train_target)

    # Predictions
    predictions_gb = gb.predict(….values)



The .values argument passed together with train_target refers to our training set, restricted to our chosen features and converted to a NumPy array.

train_target is the SalePrice column (our target value).

The .values argument in the prediction step is the test set converted to a NumPy array.
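The training and prediction steps can be tried end to end with the sketch below. It is not the original script: the data is synthetic, the frame names train and test are hypothetical, only two features are used, and n_estimators is lowered to 200 (with a fixed random_state) so that it runs quickly. The remaining hyperparameters are the ones quoted above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real data; names and values are hypothetical.
final_features = ["OverallQual", "GrLivArea"]
train = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=200),
    "GrLivArea": rng.integers(600, 3000, size=200),
})
train_target = 20000 * train["OverallQual"] + 50 * train["GrLivArea"]
test = pd.DataFrame({"OverallQual": [5, 8], "GrLivArea": [1200, 2400]})

# Training: same settings as in the article, except for fewer trees.
gb = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=3, max_features="sqrt",
    min_samples_leaf=15, min_samples_split=10, loss="huber", random_state=0,
)
gb.fit(train[final_features].values, train_target.values)

# Predictions: one value per row of the test set.
predictions_gb = gb.predict(test[final_features].values)
print(predictions_gb)
```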

Final Result

After running this script, we get a Root-Mean-Squared-Error (RMSE) of 0.14859, which puts us in the top 62%.

Participants in the top 10% have an error that is only about 0.03 lower.

This small difference can be attributed to selecting only 3 categorical features out of 43 and to the lack of feature engineering.
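To get a feel for what an error of this size means, here is a small sketch of the metric. It assumes the score follows the competition's stated evaluation, namely RMSE between the logarithm of the predicted price and the logarithm of the observed price; the function name rmse_log and the prices are made up for illustration.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between the natural logs of observed and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Hypothetical observed prices.
actual = [200000, 150000, 320000]

perfect = rmse_log(actual, actual)
# Every prediction is 10% too high.
off_by_10pct = rmse_log(actual, [p * 1.1 for p in actual])

print(perfect, off_by_10pct)
```

Under this assumption, predictions that are uniformly 10% too high give an error equal to log(1.1), so the score depends on relative rather than absolute price errors.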

Nevertheless, we can add more features by inspecting the features more closely and by employing other techniques, such as PCA for dimensionality reduction or one-hot encoding for converting categorical values to numerical values, that can lead to a lower RMSE.

Thank you for reading.

Please comment if you have any questions or any suggestions.

