Predicting Airbnb prices with deep learning, part 1: how to clean up Airbnb data

How to deal with messy location and text data in Python

Laura Lewis, May 16

Project aims and background

Airbnb is a home-sharing platform that allows home-owners and renters (‘hosts’) to put their properties (‘listings’) online, so that guests can pay to stay in them.

Hosts are expected to set their own prices for their listings.

Although Airbnb and other sites provide some general guidance, there are currently no free and accurate services which help hosts price their properties using more than just an address and a finger in the air.

Paid third-party pricing software is available, but it generally requires you to enter your own expected average nightly price (‘base price’); the algorithm then varies the daily price around that base depending on day of the week, seasonality, how far away the date is, and other factors.

Airbnb pricing is important to get right, particularly in big cities like London where there is lots of competition and even small differences in prices can make a big difference.

It is also a difficult thing to do correctly — price too high and no one will book.

Price too low and you’ll be missing out on a lot of potential income.

This project aims to solve this problem, by using machine learning and deep learning to predict the base price for properties in London.

This first blog post will explain how I went about sourcing and preparing the data, including some thoughts on dealing with UK geographic data (it’s surprisingly complicated) and extracting relevant information from long text strings.

The second blog post in this series will explore the listing data to extract interesting and useful insights, in order to help hosts maximise their earnings.

And the final post will go into the machine learning and deep learning modelling in more detail.

Additional context: I previously worked for a year and a half at an Airbnb property management company, as head of the team responsible for pricing, revenue and analysis.

Decisions made during the course of this project are therefore informed by domain expertise in this industry.

The dataset

The dataset used for this project comes from Insideairbnb.com, an anti-Airbnb lobby group that scrapes Airbnb listings, reviews and calendar data from multiple cities around the world.

The dataset was scraped on 9 April 2019 and contains information on all London Airbnb listings that were live on the site on that date (about 80,000).

A GeoJSON file of London borough boundaries was also downloaded from the same site.

The data is quite messy, and has some limitations.

The major one is that it only includes the advertised price (sometimes called the ‘sticker’ price).

The sticker price is the overall nightly price that is advertised to potential guests, rather than the actual average amount paid per night by previous guests.

The advertised prices can be set to any arbitrary amount by the host, and hosts that are less experienced with Airbnb will often set these to very low (e.g. £0) or very high (e.g. £10,000) amounts.
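As a rough illustration of how such extreme sticker prices could be flagged, here is a minimal sketch. It assumes prices arrive as strings with a currency symbol and thousands separators, and the function names and the £10 to £1,000 sanity range are hypothetical choices for the example, not values taken from this project.

```python
import re

def parse_price(raw):
    """Convert a price string such as '£1,250.00' to a float; None if unparseable."""
    cleaned = re.sub(r"[^\d.]", "", str(raw))  # drop currency symbols and commas
    try:
        return float(cleaned)
    except ValueError:
        return None

def is_plausible(price, low=10.0, high=1000.0):
    """True if the nightly price falls inside the (hypothetical) sanity range."""
    return price is not None and low <= price <= high

# Made-up example values, not rows from the real dataset.
raw_prices = ["£0.00", "£85.00", "£10,000.00", "£1,250.00", "n/a"]
parsed = [parse_price(p) for p in raw_prices]
kept = [p for p in parsed if is_plausible(p)]
```

Where to draw the cut-offs is a judgement call; the point is only that obviously arbitrary prices can be separated out before modelling.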

Nevertheless, this dataset can still be used as a proof of concept.

A more accurate version could be built using data on the actual average nightly rates paid by guests, e.g. from sites like AirDNA that scrape and sell higher quality Airbnb data.

Cleaning and preparing the data

For the full exciting details (disclosure: level of excitement may vary) of data cleaning, feel free to check out my GitHub repo.

In the interests of brevity, I’ll just mention here three particular areas of data pre-processing that might be of interest (if they’re not, feel free to skip to part 2 with the colourful graphs).

Features that weren’t included (but I would have liked to include)

The original dataset contained 106 features, including quite a few text columns of all the different description fields that you can fill in for an Airbnb listing.

Due to time constraints I did not do any natural language processing (NLP) in this model, so all these features were dropped.

However, an interesting avenue of future development for this model would be to augment it with NLP — perhaps for sentiment analysis, or looking for keywords, or some sort of fancy Word2Vec type situation that looks for similar listing descriptions and uses this to help guess price based on similar listings.

Another potential direction of future work could include reviews.

Insideairbnb.com also scrapes reviews, which can be matched to listings with their listing IDs.

Although most guests tend to give most listings high ratings (more on this in part 2), more nuanced ratings could perhaps be derived from the reviews themselves.
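Matching reviews to listings by ID could look something like the sketch below. The field names `id`, `listing_id` and `comments`, the function name `attach_reviews` and the rows themselves are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows; the field names are assumptions about the scraped files.
listings = [
    {"id": 101, "borough": "Southwark"},
    {"id": 102, "borough": "Camden"},
]
reviews = [
    {"listing_id": 101, "comments": "Great location, very clean."},
    {"listing_id": 101, "comments": "Noisy at night."},
    {"listing_id": 999, "comments": "Review for a listing that is no longer live."},
]

def attach_reviews(listings, reviews):
    """Return a copy of each listing with a list of its review texts."""
    by_listing = defaultdict(list)
    for review in reviews:
        by_listing[review["listing_id"]].append(review["comments"])
    # Listings without reviews get an empty list; orphaned reviews are ignored.
    return [
        {**listing, "reviews": by_listing.get(listing["id"], [])}
        for listing in listings
    ]

merged = attach_reviews(listings, reviews)
```

The grouped review texts would then be the input to whatever NLP step derives the more nuanced ratings.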

Dealing with London geography (TLDR: London was not mapped with data scientists in mind)

Postcodes in the UK are complicated and messy.

They can be various lengths, and consist of letters and numbers in various orders.

The first half of a postcode is called the outcode or postcode district, and refers to areas as shown here:

[Figure: London postcode districts. Source: https://en.wikipedia.org/wiki/London_postal_district]

Just to make things more complicated, the main geographic division of London is into 32 boroughs plus the City of London (technically a Corporation rather than a borough due to some quirks of 12th Century English history), which do not align with postcode districts (because that would be too easy, right?):

[Figure: London postcode districts (red), layered over London boroughs (black lines). Source: https://en.wikipedia.org/wiki/London_postal_district]

There also aren’t any easy ways of classifying London areas on a less granular level.

In fact, there is not even agreement on what counts as ‘inner London’:

[Figure: What even is London? Source: https://en.wikipedia.org/wiki/Inner_London]

And to make matters even worse, it turns out Airbnb allows hosts to enter postcodes in a free text entry box, precluding any easy separation of parts of postcodes, and allowing hosts to write all kinds of nonsense (my favourite is just the word ‘no’).

In the end, after discarding a bunch of regex experimentation with postcodes, I settled on using borough as the unit of geography.
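To give a flavour of why the postcode route is fiddly, here is a minimal sketch of pulling an outcode out of free-text input. It is not the discarded experimentation itself: the function name `extract_outcode` and the simplified patterns are assumptions, and they ignore the special cases in the official postcode format.

```python
import re

# Simplified shapes: outcode = 1-2 letters, a digit, then an optional digit or letter;
# incode = a digit followed by two letters.
FULL_POSTCODE = re.compile(r"([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})")
OUTCODE_ONLY = re.compile(r"[A-Z]{1,2}\d[A-Z\d]?")

def extract_outcode(raw):
    """Return the outcode from a free-text postcode, or None if it cannot be parsed."""
    if not raw:
        return None
    text = str(raw).strip().upper()
    full = FULL_POSTCODE.fullmatch(text)
    if full:
        return full.group(1)
    if OUTCODE_ONLY.fullmatch(text):
        return text
    return None  # e.g. the word 'no', or anything with extra text around it

examples = ["SE1 9SG", "se19sg", " sw1a 1aa ", "E1", "no"]
outcodes = [extract_outcode(p) for p in examples]
```

Even this version silently gives up on entries with extra words around the postcode, which is the kind of leakage that makes borough a more dependable unit.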

Location is very important for Airbnb listings, and so I was not entirely happy about having to use borough.

It is not on a particularly fine-grained level, and does not always express well whether a property is in central London or out in the sticks — which makes a huge difference to price.

For example, the famous Shard skyscraper is in Southwark, but so is Dulwich, where the tube doesn’t even reach (disclaimer: Dulwich is actually lovely, but is probably less well known to tourists in London).

I did also experiment with using latitude and longitude instead of borough in order to get more fine-grained results — but as part 3 of this blog will show, it was not entirely successful.

Amenities (so very many amenities)

In the dataset from Insideairbnb.com, amenities were stored as one big block of text.

In order to figure out what the various options were and which listings had them, I first made a giant string of all the amenities values, tidied it up a bit, split out the individual amenities separated by commas, and created a set of the resultant list (fortunately the dataset was small enough to allow this, but I would have needed a more efficient way to do this with a much larger dataset).

And here’s a list of all the amenities it is possible to have:

'24-hour check-in', 'Accessible-height bed', 'Accessible-height toilet', 'Air conditioning', 'Air purifier', 'Alfresco bathtub', 'Amazon Echo', 'Apple TV', 'BBQ grill', 'Baby bath', 'Baby monitor', 'Babysitter recommendations', 'Balcony', 'Bath towel', 'Bathroom essentials', 'Bathtub', 'Bathtub with bath chair', 'Beach essentials', 'Beach view', 'Beachfront', 'Bed linens', 'Bedroom comforts', 'Bidet', 'Body soap', 'Breakfast', 'Breakfast bar', 'Breakfast table', 'Building staff', 'Buzzer/wireless intercom', 'Cable TV', 'Carbon monoxide detector', 'Cat(s)', 'Ceiling fan', 'Ceiling hoist', 'Central air conditioning', 'Changing table', "Chef's kitchen", 'Children’s books and toys', 'Children’s dinnerware', 'Cleaning before checkout', 'Coffee maker', 'Convection oven', 'Cooking basics', 'Crib', 'DVD player', 'Day bed', 'Dining area', 'Disabled parking spot', 'Dishes and silverware', 'Dishwasher', 'Dog(s)', 'Doorman', 'Double oven', 'Dryer', 'EV charger', 'Electric profiling bed', 'Elevator', 'En suite bathroom', 'Espresso machine', 'Essentials', 'Ethernet connection', 'Exercise equipment', 'Extra pillows and blankets', 'Family/kid friendly', 'Fax machine', 'Fire extinguisher', 'Fire pit', 'Fireplace guards', 'Firm mattress', 'First aid kit', 'Fixed grab bars for shower', 'Fixed grab bars for toilet', 'Flat path to front door', 'Formal dining area', 'Free parking on premises', 'Free street parking', 'Full kitchen', 'Game console', 'Garden or backyard', 'Gas oven', 'Ground floor access', 'Gym', 'HBO GO', 'Hair dryer', 'Hammock', 'Handheld shower head', 'Hangers', 'Heat lamps', 'Heated floors', 'Heated towel rack', 'Heating', 'High chair', 'High-resolution computer monitor', 'Host greets you', 'Hot tub', 'Hot water', 'Hot water kettle', 'Indoor fireplace', 'Internet', 'Iron', 'Ironing Board', 'Jetted tub', 'Keypad', 'Kitchen', 'Kitchenette', 'Lake access', 'Laptop friendly workspace', 'Lock on bedroom door', 'Lockbox', 'Long term stays allowed', 'Luggage dropoff allowed', 'Memory foam mattress', 'Microwave', 'Mini fridge', 'Mobile hoist', 'Mountain view', 'Mudroom', 'Murphy bed', 'Netflix', 'Office', 'Other', 'Other pet(s)', 'Outdoor kitchen', 'Outdoor parking', 'Outdoor seating', 'Outlet covers', 'Oven', 'Pack ’n Play/travel crib', 'Paid parking off premises', 'Paid parking on premises', 'Patio or balcony', 'Pets allowed', 'Pets live on this property', 'Pillow-top mattress', 'Pocket wifi', 'Pool', 'Pool cover', 'Pool with pool hoist', 'Printer', 'Private bathroom', 'Private entrance', 'Private gym', 'Private hot tub', 'Private living room', 'Private pool', 'Projector and screen', 'Propane barbeque', 'Rain shower', 'Refrigerator', 'Roll-in shower', 'Room-darkening shades', 'Safe', 'Safety card', 'Sauna', 'Security system', 'Self check-in', 'Shampoo', 'Shared gym', 'Shared hot tub', 'Shared pool', 'Shower chair', 'Single level home', 'Ski-in/Ski-out', 'Smart TV', 'Smart lock', 'Smoke detector', 'Smoking allowed', 'Soaking tub', 'Sound system', 'Stair gates', 'Stand alone steam shower', 'Standing valet', 'Steam oven', 'Step-free access', 'Stove', 'Suitable for events', 'Sun loungers', 'TV', 'Table corner guards', 'Tennis court', 'Terrace', 'Toilet paper', 'Touchless faucets', 'Walk-in shower', 'Warming drawer', 'Washer', 'Washer / Dryer', 'Waterfront', 'Well-lit path to entrance', 'Wheelchair accessible', 'Wide clearance to bed', 'Wide clearance to shower', 'Wide doorway', 'Wide entryway', 'Wide hallway clearance', 'Wifi', 'Window guards', 'Wine cooler', 'toilet',

In the list above, some amenities are more important than others (e.g. a balcony is more likely to increase price than a fax machine), and some are likely to be fairly uncommon (e.g. ‘Electric profiling bed’).
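The set-building step described above (join everything, tidy, split on commas, de-duplicate) can be sketched roughly as follows. This is a minimal illustration rather than the code from the repo: the function name `all_amenities` is hypothetical, and the raw format of curly braces with optional double quotes is an assumption about how the scraped values look.

```python
def all_amenities(amenity_strings):
    """Collect the distinct amenity names across every listing."""
    combined = ",".join(amenity_strings)   # one giant string of all values
    for ch in '{}"':                       # tidy up: strip braces and quotes
        combined = combined.replace(ch, "")
    # Split on commas, trim whitespace, drop empties; the set removes duplicates.
    return {item.strip() for item in combined.split(",") if item.strip()}

# Made-up example values in the assumed raw format.
raw = [
    '{TV,"Cable TV",Wifi,"Air conditioning"}',
    '{Wifi,Kitchen,"Hair dryer"}',
    '{}',
]
amenities = all_amenities(raw)
```

For a much larger dataset, updating the set one listing at a time would avoid building the combined string in memory.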

Based on previous experience in the industry, and further research into which amenities guests consider most important, a selection of the more important amenities was extracted.

A subset of these was then included in the final model, depending on how sparse the data was.

For example, if it turns out that almost all properties have/do not have a particular amenity, that feature will not be very useful in differentiating between listings or helping explain differences in prices.
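A minimal sketch of that kind of sparsity check is shown below, using a 90% cut-off. The function name `drop_sparse_features`, the feature names and the toy rows are hypothetical, and it assumes each amenity has already been turned into a true/false value per listing.

```python
def drop_sparse_features(rows, threshold=0.9):
    """Return the boolean features whose majority value covers at most
    `threshold` of the rows; features above that are too one-sided to keep."""
    if not rows:
        return []
    kept = []
    for name in rows[0]:
        share_true = sum(1 for row in rows if row[name]) / len(rows)
        # Remove the feature if over 90% of listings have it, or over 90% lack it.
        if max(share_true, 1 - share_true) <= threshold:
            kept.append(name)
    return kept

# Ten made-up listings: everyone has wifi, no one has a fax machine,
# and three of the ten have a balcony.
rows = [{"wifi": True, "fax_machine": False, "balcony": i < 3} for i in range(10)]
kept = drop_sparse_features(rows)
```

Here only `balcony` survives, since the other two features take the same value for every listing and cannot explain any price differences.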

The whole convoluted code for this can be found on GitHub; its final section removes columns where over 90% of the listings either had or did not have a particular amenity.

These are the amenities that I ended up keeping:

- Balcony
- Bed linen
- Breakfast
- TV
- Coffee machine
- Basic cooking equipment
- White goods (specifically a washer, dryer and/or dishwasher)
- Child-friendly
- Parking
- Outdoor space
- Greeted by host
- Internet
- Long term stays allowed
- Pets allowed
- Private entrance
- Safe or security system
- Self check-in

Cliffhanger ending

After these (and many other) cleaning and pre-processing steps, the Airbnb data was in suitable form to begin exploration and modelling.

In the interests of not losing useful insights amidst code chunks and long bulleted lists of random things an Airbnb might have, the data exploration section has been separated out into part 2 of this blog series (coming very soon).

Thanks for reading so far, and stay tuned for the next instalment!

If you found this post interesting or helpful, please let me know via the medium of claps and/or comments, and you can follow me in order to be notified when parts 2 and 3 are published; I’ll be posting them very soon.

Thanks for reading!
