Where in the world is Carmen Sandiego?

Developing a classification model for which country a new user will make their first booking

Steven Liu, Jan 17

(Image: Airbnb home page)

Growing up, Where in the World is Carmen Sandiego? was a staple in my rotation of early video games alongside Oregon Trail and SimCity.

For those of you who aren’t familiar with this game, let me briefly describe it for you.

You are a rookie detective employed by ACME Detection Agency to track down villains from the V.I.L.E. organization by collecting and solving clues from country to country.

In the end, you confront the boss lady herself, Carmen Sandiego.

It was such an awesome educational game, complete with hilarious cartoon physics and beautiful panoramas.

But how does this relate to classifying where a new user will make their first travel booking on Airbnb (this challenge first appeared on Kaggle as part of a recruiting competition)?

Not unlike the video game, we are provided a bunch of clues about the user (language, age, web session records, etc.) and our job is to figure out which country they will most likely visit.

While it may not be as thrilling as busting up a global crime syndicate, figuring out where new guests will spend their first travel experience does pose a serious challenge for Airbnb.

It’s great if the guest already knows where they’re going to stay: they can just run a search and book a home.

What we’re really focused on are those users who don’t necessarily have a set destination in mind and are either browsing or looking for some travel inspiration.

We need to know what the new guests want before they even know it.

The advantages of having this prior knowledge are:

Share more personalized content to rapidly deliver relevant findings to the guest, keeping them engaged on the platform and more likely to complete a travel booking.

Decrease average time to first booking to more quickly earn money from that transaction.

Better forecast demand to increase understanding of rental markets and ensure guest demands are met.

Start here

The data is provided on Kaggle as a series of CSV files.

Grab the train_users_2 and countries files, which contain user/country level information, and join them together.
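As a rough sketch of that join, assuming both tables share a country_destination column: the tiny in-memory frames below stand in for the real CSV files, and their column names and values are illustrative assumptions rather than the actual data.

```python
import pandas as pd

# Stand-ins for pd.read_csv("train_users_2.csv") and pd.read_csv("countries.csv").
# Column names and values are assumptions for illustration only.
users = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "language": ["en", "fr", "en"],
    "country_destination": ["US", "FR", "NDF"],
})
countries = pd.DataFrame({
    "country_destination": ["US", "FR"],
    "destination_km2": [9_800_000.0, 640_000.0],  # rough, illustrative areas
})

# A left join keeps every user, including those (such as NDF)
# that have no matching row in the countries table.
df = users.merge(countries, on="country_destination", how="left")
print(df)
```

A left join is used here so that users without a destination are not silently dropped; their country-level columns simply come back as missing.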

Clean it up

The next step is to tidy up the data so that we can explore it in greater detail and pass it to the classification model later.

Luckily for this particular dataset, we don’t need to do too much cleaning.

All we need to do is (1) convert the date variables to a datetime format and (2) handle missing values.

datetime format

To go from string to datetime, use the pd.to_datetime() function.

For the timestamp_first_active variable, this requires a little more work because it is just a jumble of characters.

We need to specify the format so the function understands how to read it.
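Here is a minimal sketch of both conversions. It assumes timestamp_first_active is a run of digits laid out as year, month, day, hour, minute, second; the sample values and the format string are assumptions to check against the actual file.

```python
import pandas as pd

# Illustrative sample rows; the real values come from the Kaggle CSV.
df = pd.DataFrame({
    "date_account_created": ["2010-06-28", "2011-05-25"],
    "timestamp_first_active": [20090319043255, 20090523174809],
})

# A standard date string parses without extra help.
df["date_account_created"] = pd.to_datetime(df["date_account_created"])

# The packed timestamp needs an explicit format (assumed: YYYYMMDDHHMMSS).
# Casting to str first covers the case where pandas read the column as integers.
df["timestamp_first_active"] = pd.to_datetime(
    df["timestamp_first_active"].astype(str), format="%Y%m%d%H%M%S"
)
print(df.dtypes)
```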

missing values

There are 9 variables with missing values and most of them are missing over 40% of the data.

Ideally, we would like to impute these missing values (i.e. infer them from the known values) because otherwise we would discard the rows or columns containing them and that would incur a hefty amount of data loss.

In some cases though, the missingness of some of these variables actually tells us something.

Imputation would not be valid in these situations (e.g. the missing dates in date_first_booking tell us the user never made a travel booking) because it would not preserve the original meaning.

What’s more, imputation with the mean or median can reduce variability in the data leading to unreliable predictions.

Instead, we take a greedy approach and keep everything by replacing null values with their own separate category using the replace() function.

The only exceptions are the date_first_booking and age variables.

For the former variable, we keep the null values in the datetime format for some feature engineering down the road and fill in the missing values right before we are ready to model.

As for the age variable, we employ a method known as discretization (or binning) to partition the age variable into several age groups which allows us to treat the missing values as their own category.
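The two ideas above (a separate category for nulls, and binning age so that missing becomes its own group) might look something like this. The column first_affiliate_tracked, the "missing" label, and the bin edges are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative sample rows with gaps in a categorical column and in age.
df = pd.DataFrame({
    "first_affiliate_tracked": ["untracked", np.nan, "linked"],
    "age": [25.0, np.nan, 62.0],
})

# Replace nulls in a categorical column with their own category.
df["first_affiliate_tracked"] = df["first_affiliate_tracked"].replace(np.nan, "missing")

# Discretize age into groups (bin edges are an assumption, not the original ones).
bins = [0, 20, 30, 40, 50, 60, 120]
labels = ["<20", "20s", "30s", "40s", "50s", "60+"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

# Missing ages fall outside every bin, so give them an explicit category.
df["age_group"] = df["age_group"].cat.add_categories("missing").fillna("missing")
print(df)
```

Binning trades some precision in age for the ability to keep every row, which matches the keep-everything approach described above.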

Check it out

One of the first things we can do is develop a general intuition for where users like to travel.

At its simplest, if all users prefer traveling to one country, then that’d make things easy for the classification model.

We can reasonably infer that a new guest will also make a travel booking in that country.

To visualize where travelers like to visit, create a choropleth map, where areas are shaded in proportion to the statistic displayed on the map.

It is important to normalize by country size to produce a density map because the U.S. is so large that it overshadows the smaller European countries.

If we hadn’t normalized the values, we would’ve been led to believe there was no travel interest in Europe, which is far from true!

The Netherlands, United Kingdom, France and Italy have the highest density of Airbnb travelers per square mile.
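To make the normalization concrete, here is a small sketch that divides booking counts by country area. The bookings and the areas are rough, made-up illustrative numbers, not figures from the dataset.

```python
import pandas as pd

# Illustrative destinations for six bookings (not real data).
bookings = pd.Series(["US", "US", "US", "US", "NL", "NL"], name="country_destination")

# Rough areas in square miles, for illustration only.
area_sq_mi = pd.Series({"US": 3_800_000.0, "NL": 16_000.0})

# Raw counts favor the larger country...
counts = bookings.value_counts()

# ...while dividing by area (aligned on the country code) gives travelers per square mile.
density = (counts / area_sq_mi).sort_values(ascending=False)
print(counts)
print(density)
```

In this toy example the U.S. leads on raw counts, but the Netherlands comes out on top once area is taken into account, which is the effect the choropleth relies on.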

This falls in line with expectations because Airbnb is primarily driven by millennials who place a high value on seeking new, unique travel experiences (can confirm as a millennial!).

In addition, the recent rise of low-price airlines helps fuel our travel obsession by breaking down the traditionally high cost of international travel.

Fully overcoming this barrier will unlock additional growth by creating more opportunities for hosts and guests around the world to connect.

Another interesting thing we can look at is user behavior.

What’s different between guests who make a travel booking and those who do not?

For this particular analysis, we take a look at Airbnb web and app usage.

Uncovering differences between the web and app user interface will help us understand why some guests don’t make a booking and others do, which allows for an opportunity to improve.

Airbnb is very clearly a web-first platform: more users signed up on the web than on all mobile platforms (iOS, Android, mobile web) combined.

This pattern is observed in both groups, so we can’t conclude that the type of platform a guest uses affects their decision to make a travel booking.

The upside is that this leaves a lot of room for further mobile adoption.

Mobile devices are more ubiquitous than traditional desktops, allowing them to reach a greater audience, especially in developing countries.

This is obviously great for Airbnb because it represents an opportunity to onboard more potential hosts and guests.

One last thing we will check out before beginning to model is how age influences where a new guest will travel to.

To address this, we create a radar plot, which is good for visualizing and comparing multiple quantitative variables.

Be careful not to overcrowd the plot with too many variables and groups though because that will make it difficult to read.
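A radar plot can be drawn with matplotlib's polar axes. The sketch below uses placeholder shares per age group for two destinations; these numbers are invented purely to show the mechanics and do not come from the Airbnb data.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

age_groups = ["20s", "30s", "40s", "50s", "60+"]

# Placeholder shares per age group (made up for illustration only).
shares = {
    "Spain": [0.35, 0.30, 0.15, 0.10, 0.10],
    "U.K.": [0.20, 0.20, 0.25, 0.15, 0.20],
}

# One spoke per age group, evenly spaced around the circle.
angles = np.linspace(0, 2 * np.pi, len(age_groups), endpoint=False).tolist()
closed_angles = angles + angles[:1]  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for country, values in shares.items():
    closed_values = values + values[:1]
    ax.plot(closed_angles, closed_values, label=country)
    ax.fill(closed_angles, closed_values, alpha=0.2)

ax.set_xticks(angles)
ax.set_xticklabels(age_groups)
ax.legend(loc="upper right")
```

Keeping it to two groups and five spokes follows the advice above about not overcrowding the plot.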

Notice how easy it is to detect a deviation in guest age between Spain and the U.K.

Users who visit Spain tend to be younger, in their 20s and 30s.

Contrast this with the U.K., which has more users visiting in their 40s and 60s.

There are several possible reasons for this:

Millennials prefer an active vacation and there may be more of these types of experiences in Spain (mountain skiing in the Pyrenees, sailing in Barcelona, horseback riding in Montserrat Natural Park, etc.). Many of the Airbnb experiences in the U.K. tend to be historic walking tours, which are not as appealing to millennials.

Millennials would rather travel to unfamiliar places like Spain because the U.S. and U.K. are more culturally alike.

Millennials prefer experiencing local spots over learning about the history of their destination and visiting major tourist attractions, and there seem to be a lot of the latter types of Airbnb experiences in the U.K.

Building the model

Preprocess

Before we begin to model, there are a few preprocessing steps that we have to take care of.

Create a new feature called lag_account_booking, which is simply the lag time between the date an account was first created and the date the user made their first booking.

This is why we didn’t fill in the null values in date_first_booking earlier: we need the column in datetime format to do some datetime math now.
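A sketch of that datetime math, assuming the lag is measured in whole days. The sentinel value of -1 for users who never booked is an assumption made here for illustration; the fill value actually used is not stated.

```python
import pandas as pd

# Illustrative rows: one user who booked and one who never did.
df = pd.DataFrame({
    "date_account_created": pd.to_datetime(["2014-01-01", "2014-02-10"]),
    "date_first_booking": pd.to_datetime(["2014-01-04", None]),
})

# Subtracting datetimes gives a timedelta; .dt.days extracts whole days.
lag = (df["date_first_booking"] - df["date_account_created"]).dt.days

# Users with no booking have no lag; fill with a sentinel (-1 is an assumed choice).
df["lag_account_booking"] = lag.fillna(-1).astype(int)
print(df)
```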

Next, encode the categorical variables as dummy variables: while decision trees can in principle handle categorical variables, scikit-learn’s decision tree implementation assumes that all features are numeric.

We use a one-two punch of OneHotEncoder() and LabelEncoder() for the features and target variable, respectively.
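As a minimal sketch of the two encoders on a toy frame (the columns and values are illustrative assumptions):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy features and target (illustrative values only).
X = pd.DataFrame({
    "language": ["en", "fr", "en"],
    "signup_method": ["basic", "facebook", "basic"],
})
y = pd.Series(["US", "FR", "NDF"])

# One-hot encode the features; the result is a SciPy sparse matrix by default.
feature_encoder = OneHotEncoder(handle_unknown="ignore")
X_encoded = feature_encoder.fit_transform(X)

# Map each target label to an integer code.
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print(X_encoded.shape)
print(label_encoder.classes_, y_encoded)
```

Keeping the fitted label_encoder around is handy later, since inverse_transform() turns predicted integer codes back into country labels.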

As a result of encoding, we end up with a ton of features (217K!) and a very sparse matrix.

To reduce a lot of this dimensionality, we use the TruncatedSVD() function, which is very similar to PCA, but more efficient when working with sparse matrices.

TruncatedSVD() accepts sparse matrices without densifying them, unlike PCA, which would rapidly fill up your memory.

Check out the scikit-learn documentation for TruncatedSVD for more details.

To choose a value for n_components, we plot the number of components against the cumulative explained variance.

We find that with 230 components, we are able to capture 80% of the variance.

After 230 components, we only observe incremental gains, so we won’t pursue any additional components.
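The selection procedure described above can be sketched on a small random sparse matrix. The matrix size and the 49 components fitted here are arbitrary stand-ins; on the real data the fit would be over the one-hot encoded features.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Small random sparse matrix standing in for the one-hot encoded features.
X_sparse = sparse.random(200, 50, density=0.05, format="csr", random_state=0)

# Fit with a generous number of components, then inspect the variance curve.
svd = TruncatedSVD(n_components=49, random_state=0)
svd.fit(X_sparse)

# Cumulative share of variance explained by the first k components.
cumulative = np.cumsum(svd.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 80%.
n_components = int(np.searchsorted(cumulative, 0.80) + 1)
print(n_components, cumulative[n_components - 1])
```

Plotting cumulative against the component index gives the curve used to pick the cutoff; the searchsorted call just reads the 80% crossing off that curve.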

Model

To model, we use a random forest classifier because random forests are simple to train and tune.

mean cross validation score: 0.7807555168312879

Note how even a basic implementation yields a fairly decent accuracy score.
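A baseline along these lines might look like the sketch below. It runs on synthetic data from make_classification, so its score will not match the one reported above; the five folds and default hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic multi-class data standing in for the reduced Airbnb features.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Default hyperparameters, evaluated with 5-fold cross validation (accuracy).
clf = RandomForestClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print("mean cross validation score:", scores.mean())
```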

Let’s see if we can do better though by fiddling with the hyperparameters using GridSearchCV().

The hyperparameters simply refer to things we can adjust to optimize the performance of the model.

These dictate the overall architecture of the model, such as the number of decision trees or the depth of each tree (the scikit-learn documentation for RandomForestClassifier lists the available hyperparameters).

With GridSearchCV(), we perform an exhaustive search over a grid of specified parameters and then select the best candidate.

Note that this may take a while because we are considering 72 combinations of settings.

So just be patient while it’s running and watch an episode of Narcos.
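For a sense of how the search is set up, here is a deliberately tiny grid on synthetic data. The grid below has 8 combinations rather than 72, and its values are assumptions chosen so the sketch runs quickly, not the grid used in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Synthetic data standing in for the reduced Airbnb features.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_classes=3, random_state=0
)

# A small illustrative grid: 2 x 2 x 2 = 8 combinations.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 6],
    "n_estimators": [20, 40],
}
n_combinations = len(ParameterGrid(param_grid))

# Every combination is fitted once per fold, so cost grows quickly with grid size.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(n_combinations, search.best_params_)
```

After fitting, search.best_estimator_ holds a classifier refit on all the data with the winning settings.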

{'criterion': 'entropy', 'max_depth': 6, 'max_features': 'sqrt', 'min_samples_split': 2, 'n_estimators': 120}

After obtaining the best hyperparameter values, we fit the classifier with these new values.

training mean cross validation score: 0.9948090874713612

test mean cross validation score: 0.9940221899302687

As a result of tuning the hyperparameters, we get an accuracy score of 99%! Running the model on the test data yields a similar accuracy score, which suggests that the model generalizes well and avoids overfitting.

How did we do?

A popular method for evaluating classifier performance is to create a confusion matrix, which compares the predicted values against the true values for each label.

The off-diagonal elements are mislabeled while the diagonal elements represent a correct prediction.
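Here is a small sketch of reading a confusion matrix, using a handful of made-up true and predicted labels (not results from the model above):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; rows are true labels, columns are predictions.
labels = ["NDF", "US", "NL"]
y_true = ["US", "US", "NL", "NDF", "NL", "NDF"]
y_pred = ["US", "NL", "NL", "NDF", "NL", "US"]

cm = confusion_matrix(y_true, y_pred, labels=labels)

# Diagonal entries are correct predictions; everything else is a mislabel.
correct = cm.trace()
mislabeled = cm.sum() - correct
print(cm)
print(correct, mislabeled)
```

Passing labels explicitly fixes the row and column order, which makes the matrix easier to line up with country names when plotting it.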

It looks like the classifier had some minor issues with the Netherlands, where it mislabeled two users each in Italy, Spain and Germany.

What’s neat though is that these errors are localized to Europe and the classifier was not too far off the target.

The NDF group gave the classifier the most trouble and there are multiple mislabels, but it isn’t a worrying amount.

Conclusion

In summary, we developed an accurate classification model for which country a new user will make their first travel booking.

The gain is that Airbnb gets a competitive advantage by showing new guests what they want before they even know it.

In doing so, this will hopefully nudge new users towards making a travel booking faster!

Thanks for reading! Stay tuned for more as I continue on my path to become a data scientist! ✌️
