Using Machine Learning to Predict Value of Homes On Airbnb

However, these projects each required a lot of dedicated data science and engineering time and effort. Recently, advances in Airbnb's machine learning infrastructure have significantly lowered the cost of deploying new machine learning models to production, and data scientists have started to incorporate several AutoML tools into their workflows to speed up model selection and performance benchmarking.

At marketplace companies like Airbnb, knowing users' LTVs enables us to allocate budget across different marketing channels more efficiently, calculate more precise bidding prices for online marketing based on keywords, and create better listing segments. While one can use past data to calculate the historical value of existing listings, we took one step further and used machine learning to predict the LTV of new listings.

Machine Learning Workflow For LTV Modeling

Data scientists are typically accustomed to machine learning tasks such as feature engineering, prototyping, and model selection. However, taking a model prototype to production often requires an orthogonal set of data engineering skills that data scientists might not be familiar with. Luckily, at Airbnb we have machine learning tools that abstract away the engineering work behind productionizing ML models. The remainder of this post is organized into four topics, along with the tools we used to tackle each task:

- Feature Engineering: define relevant features
- Prototyping and Training: train a model prototype
- Model Selection & Validation: perform model selection and tuning
- Productionization: take the selected model prototype to production

Feature Engineering

Tool used: Zipline, Airbnb's internal feature repository

One of the first steps of any supervised machine learning project is to define relevant features that are correlated with the chosen outcome variable, a process called feature engineering. In sum, there were over 150 features in our model, including:

- Location: country, market, neighborhood, and various geography features
- Price: nightly rate, cleaning fees, price point relative to similar listings
- Availability: total nights available, % of nights manually blocked
- Bookability: number of bookings or nights booked in the past X days
- Quality: review scores, number of reviews, and amenities

[Figure: an example training dataset]

With our features and outcome variable defined, we can now train a model to learn from our historical data.

Prototyping and Training

Tool used: scikit-learn, a machine learning library in Python

As in the example training dataset above, we often need to perform additional data processing before we can fit a model. Data imputation is one such step: we need to check if any data is missing, and whether that data is missing at random. By exploring different estimators, data scientists can perform model selection to pick the best model and improve its out-of-sample error.

Performing Model Selection

Tool used: various AutoML frameworks

As mentioned in the previous section, we need to decide which candidate model is best to put into production. For example, we learned that eXtreme gradient boosted trees (XGBoost) significantly outperformed benchmark models such as mean response models, ridge regression models, and single decision trees.

[Figure: comparing RMSE allows us to perform model selection]

Given that our primary goal was to predict listing values, we felt comfortable productionizing our final model using XGBoost, which favors flexibility over interpretability.

Taking Model Prototypes to Production

Tool used: ML Automator, Airbnb's notebook translation framework

As we alluded to earlier, building a production pipeline is quite different from building a prototype on a local laptop. In fact, we believe that these tools will unlock a new paradigm for developing machine learning models at Airbnb.

First, the cost of model development is significantly lower: by combining the strengths of individual tools (Zipline for feature engineering, Pipeline for model prototyping, AutoML for model selection and benchmarking, and finally ML Automator for productionization), we have shortened the development cycle tremendously.
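To make the feature engineering discussion concrete, here is a minimal sketch of what a training table with listing features and an LTV outcome could look like. The column names and values are hypothetical illustrations, not actual Zipline output:

```python
import pandas as pd

# Hypothetical example of a training dataset: one row per listing.
# The real model used over 150 features; only one per category is shown.
df = pd.DataFrame({
    "listing_id":         [101, 102, 103],
    "market":             ["Paris", "Tokyo", "Austin"],  # Location
    "nightly_rate":       [120.0, 95.0, None],           # Price (one value missing)
    "pct_nights_blocked": [0.10, 0.35, 0.05],            # Availability
    "bookings_past_30d":  [8, 2, 5],                     # Bookability
    "review_score":       [4.8, 4.2, 4.6],               # Quality
    "ltv":                [3100.0, 900.0, 2400.0],       # outcome variable
})

# Separate the outcome variable from the feature matrix.
features = df.drop(columns=["listing_id", "ltv"])
outcome = df["ltv"]
print(features.shape)  # (3, 5)
```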
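The data imputation step described in the Prototyping and Training section can be sketched with scikit-learn. The median strategy and the missing-value indicator below are illustrative choices, not necessarily the ones used in the actual model:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A nightly-rate column with one missing entry.
X = np.array([[120.0], [95.0], [np.nan], [150.0]])

# Impute missing values with the column median; add_indicator appends a
# binary "was missing" column, so the model can learn from the missingness
# pattern itself, which matters when data is NOT missing at random.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
# The median of the observed values [120, 95, 150] is 120,
# so the missing row becomes [120.0, 1.0].
```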
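The RMSE-based benchmarking described under Performing Model Selection can be sketched as below, comparing cross-validated RMSE across candidate estimators on synthetic data. Scikit-learn's GradientBoostingRegressor stands in for XGBoost here to keep the example self-contained:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data as a stand-in for the listing training set.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

candidates = {
    "mean response": DummyRegressor(strategy="mean"),
    "ridge regression": Ridge(),
    "single decision tree": DecisionTreeRegressor(random_state=0),
    "gradient boosted trees": GradientBoostingRegressor(random_state=0),
}

# Lower RMSE is better; scikit-learn exposes it as a negated score.
rmse = {
    name: -cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    ).mean()
    for name, model in candidates.items()
}

for name, score in sorted(rmse.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE {score:.1f}")
```

On this synthetic linear target the ridge model may well come out ahead, which is exactly why the comparison has to be run on one's own data rather than assumed.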
Second, the notebook-driven design lowers the barrier to entry: data scientists who are not familiar with the framework have immediate access to a plethora of real-life examples. By bridging the gap between prototyping and productionization, we can truly enable data scientists and engineers to pursue end-to-end machine learning projects and make our product better.

Want to use or build these ML tools? More details
