Transforming Categorical Data for Usability in Machine Learning Predictions

A Quick Example Using Zip Codes, Latitude, and Longitude

By Kevin Velasco & Alex Shropshire

Given a dataset of King County, WA house sales from 2014–2015 (https://www.kaggle.com/harlfoxem/housesalesprediction), we were tasked with creating a model to accurately predict home sale price.

After initial data exploration and cleaning, we discovered an abundance of usable information describing the quality of each home, but not much useful information about its location.

Aerial photo of Seattle by Thatcher Kelley

Why location is important to our dataset

We have all heard the overused phrase ‘location, location, location’ as a central theme in real estate market assessment.

It is widely believed to be the most important factor in determining where one should live.

Price and location often go hand in hand for buyers.

The condition, price, and size of any home can change.

The one constant is the home’s location.

Take, for example, the zip code feature of our dataset: in its current form, with 70 unique categorical values in the ‘zipcode’ column, a machine learning model cannot extract any of the useful information contained within each zip code to assess its relationship to price.

Within each of these categories exists a unique selection of meaningful real estate factors like parks, schools, cafes, shops, grocery stores, and access to transit and major roadways.

All are major factors that can indicate implicit differences in, say, the Wallingford neighborhood of Seattle versus a home out in the middle of the Yakima Valley.

Unlocking the intrinsic qualities of a zip code with respect to price was important for our prediction model, so we had to adapt to preserve the power of the category.

A quick glance at the information above made it obvious that we needed to engineer new features to make sense of the zip code, as well as the latitude and longitude!

Transforming & Comprehending Categorical Data

Our approach to properly applying the latitude and longitude information involved creating a feature measuring each home’s distance from the major economic centers of Bellevue and Seattle in order to improve our price predictor.

Rather than reinventing the wheel by writing our own mathematical function, we calculated distance via the Haversine formula, illustrated below:

Example of the Haversine function

The haversine module (https://pypi.org/project/haversine/) saved us plenty of time. It takes in two (latitude, longitude) tuples and calculates the geographic distance between the two points in a chosen unit.
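In case the haversine package isn’t available, here is a minimal pure-Python sketch of the same great-circle calculation; the downtown coordinates are approximate and used only for illustration:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(point_a, point_b):
    """Great-circle distance in kilometers between two (lat, lon) tuples."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    # Haversine formula
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))  # 6371 km = mean Earth radius

seattle = (47.6062, -122.3321)   # approximate downtown Seattle
bellevue = (47.6101, -122.2015)  # approximate downtown Bellevue
distance = haversine_km(seattle, bellevue)  # roughly 10 km
```

The installed module does the same work via `haversine(point_a, point_b)`, with an argument to select the output unit.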

We hypothesized that there would be a significant relationship between home price and both the Seattle downtown and Bellevue downtown areas, chosen because they are the biggest hubs of jobs & economic activity in the area.

We figured smaller commute times and good access to the resources of a major city would generally increase demand, and with it, price.

For the formula, we needed to transform the coordinates of each house.

Given the latitude and longitude in separate columns, we applied Python’s zip function to create a new column of coordinate tuples. With a single point for each home in the data set, as well as the two reference points, all the variables for the haversine formula were in place.
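A quick sketch of that zip step, using a toy stand-in for the King County data (the real set has ‘lat’ and ‘long’ columns):

```python
import pandas as pd

# Toy stand-in for the King County sales data
df = pd.DataFrame({
    "lat": [47.5112, 47.7210],
    "long": [-122.257, -122.319],
})

# zip the two columns row-wise into (lat, long) coordinate tuples
df["coordinates"] = list(zip(df["lat"], df["long"]))
```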

Applying the function directly to create a new column in the existing DataFrame proved to be tricky. In order to create a new column of distances, we had to build a Pandas Series from a list and then add it to the existing DataFrame.

Lastly, we created one more column, “distance_from_epicenter”, which takes the minimum of the two distances. This step captures proximity to whichever city is closer, so the shortest distance to an economic hub is ready for the model to interpret.
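The steps above can be sketched end to end; the helper below is a compact stand-in for the haversine module, and the coordinates are illustrative:

```python
import pandas as pd
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) tuples."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

SEATTLE = (47.6062, -122.3321)   # approximate downtown coordinates
BELLEVUE = (47.6101, -122.2015)

df = pd.DataFrame({"coordinates": [(47.5112, -122.257), (47.7210, -122.319)]})

# Build each distance column as a Pandas Series from a list of per-row results
df["distance_from_seattle"] = pd.Series([haversine_km(c, SEATTLE) for c in df["coordinates"]])
df["distance_from_bellevue"] = pd.Series([haversine_km(c, BELLEVUE) for c in df["coordinates"]])

# Keep whichever hub is closer
df["distance_from_epicenter"] = df[["distance_from_seattle", "distance_from_bellevue"]].min(axis=1)
```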

For the zip codes, we implemented one-hot encoding with the pd.get_dummies() method, which converts categorical variables into a form that ML algorithms can use to more accurately predict price. In terms of our data, we wanted to take the single zip code category and transform it into a set of binary columns: yes [1], this row is a member of this zip code, or no [0], it isn’t.

The sheer number of zip codes seemed alarming at first, adding many columns to our data frame. But by transforming the zipcode feature into multiple columns containing binary information, we were then able to include it as a feature to train the model.

To correctly interpret the coefficients that our model would produce for all of these columns, we first have to select one of the zip codes to drop after the pd.get_dummies() method, to remove redundant information.

In other words, we drop one zip code column because the presence of zeros in every other zip code column would indicate that the row is a member of our dropped zip code.

The information is inherent in the remaining n-1 columns’ binary form.

We chose to drop the zip code of 98103, which included the areas of Wallingford, Greenlake, Phinney Ridge, Greenwood, and Fremont.

We chose this zip code because it represents a centrally located, highly sought-after area of Seattle that we hope can be easily interpreted as a price benchmark by a wider Seattle-aware audience.

We wanted to drop a zip code that was neither Bill Gates’ zip code nor one close to the opposite extreme, so that plenty of other zip codes’ mean home prices lie both above and below that of 98103.

Most importantly, this entire process ensures that our resulting coefficients could be explained in relation to the 98103 neighborhoods.
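The encode-then-drop step looks like this on a toy frame (a handful of King County zip codes standing in for all 70):

```python
import pandas as pd

df = pd.DataFrame({"zipcode": [98103, 98039, 98103, 98178]})

# One binary column per zip code...
dummies = pd.get_dummies(df["zipcode"], prefix="zip")

# ...then drop the benchmark zip so the remaining columns are not redundant:
# zeros across every remaining zip column means "this row is in 98103"
dummies = dummies.drop(columns=["zip_98103"])

df = pd.concat([df.drop(columns=["zipcode"]), dummies], axis=1)
```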

The Results of our Model Improvement Efforts

After careful data cleaning and feature engineering, our model contained 10 predictors, including an initial “distance_from_seattle” predictor.

After running the dataset through statsmodels’ Ordinary Least Squares linear regression model, our summary output came out with a decent R-squared (a quick, handy way to initially assess model quality):

Initial OLS Results

At 0.735, we’re not in bad shape, but there’s room for improvement. We’re aware of the limitations of R-squared as a measure of model quality, but for the sake of illustrating before-and-after scenarios of basic model iterations it serves valiantly.

That said, here are the results after incorporating our ‘distance_from_epicenter’ feature, which integrates the distances from the economic hubs of Downtown Seattle & Bellevue:

adding distance_from_epicenter

A respectable improvement! Lastly, we transformed the awkwardly formatted zip code categories into a useful numerical format using one-hot encoding:

adding zipcode

With such a significant boost, and depending on the context of the business case, we may be ready to deploy our predictor, bring in new data, engineer new features, or simply continue iterating within the existing data set.

Our model now results in an adjusted R-squared of 0.867. In other words, 86.7% of the response variable’s variation can be explained by our linear model.
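As a sanity check on the relationship between R-squared and its adjusted form, the adjustment can be computed directly. The row and predictor counts below are illustrative, not our exact figures (roughly 10 base features plus the one-hot zip code columns):

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors used."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Illustrative: a 0.867 R-squared with ~21,600 sales and ~80 predictors
adj = adjusted_r_squared(0.867, 21600, 80)
```

With a data set this large relative to the predictor count, the adjustment barely moves the number, which is why the two figures are so close here.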

Final Thoughts

Looking back at the original data set, what at first seemed to be very little actionable information about location actually contained powerful categorical zip code data and transformable latitude and longitude data. Once the potential of each was unlocked, both produced great improvements in our machine learning model.

The old adage of ‘location, location, location’ in understanding the market price of a home seems to hold true power after all.

To level up from here, variables like time, larger geographical bounds, and data from other demand-influencing patterns could be considered, especially as high-growth companies in the Real Estate/Property Technology space enter an analytics arms race to be the most knowledgeable.

I’m sure you’re nice folks, but we’re coming for you Zillow.

For project details, code for our most accurate price predictor, and a related slideshow, check out our GitHub repos:

Alex Shropshire: as6140/kingcountyWA_home_price_predictor

Kevin Velasco: kevintheduu/dsc-1-final-project-seattle-ds-career-040119

Stay in touch on LinkedIn!

Kevin: https://www.linkedin.com/in/kevinrexisvelasco/
Alex: https://www.