AirBnB in two cities: Seattle vs Boston

These houses would be the ones that are highly preferred due to multiple factors such as location, value and cleanliness.

Chart 2: Availability of Boston vs Seattle

The number of listings recorded in the dataset is flat throughout the year for both cities, at 3586 for Boston and 3818 for Seattle, while availability is provided as a boolean value that changes daily. The most straightforward feature of the Proportion of Listings Available graph below is the steady upward trend over the first three months of the data. The reason is that the data is a snapshot taken on a particular date, and dates closer to the snapshot date tend to have a higher booking rate.

Graph 3: Proportion of Listings Available

Question 3: Can we predict the review score from comments using regression?

Do the comments that visitors leave on a listing give us enough information to guess that listing's review score? Reviews.csv contains user comments for bookings, and Listings.csv contains the average review score for each listing. As an exercise, I wanted to look into the potential of predicting review scores from these comments.

For this, I will concatenate all comments provided for a listing into a single string, apply text cleaning and feature extraction, and use ML regression to predict my response variable, the review score rating. The feature extraction method we will use, TF-IDF, is similar to the bag-of-words approach, but it takes term frequency into account and reduces the impact of words that are more common in the corpus, i.e. the whole universe of words.

After feature extraction, we will use cross validation to test three machine learning algorithms at different learning rates: Stochastic Gradient Descent Regressor, XGBoostRegressor and CatBoostRegressor. By looking for the lowest validation MSE, we should be able to pick the regressor and parameters to use in our NLP
pipeline.

Table 1: Regression Results for XGBoost, CatBoost and SGDR

As you can see above, the XGBoost and CatBoost algorithms were quite close in terms of validation error in the cross validation results; XGBoost performed slightly better at a learning_rate of 0.1 and became the winner. At this stage we don't really look at the train errors, simply because we want to compare the performance of our models on data they have not seen.

Note: at very low learning rates such as 0.0001, XGBoost and CatBoost appear not to have converged, which produced very high error rates.

Finally, I fit my pipeline of the TFIDF Vectorizer to the X_intermediate data, transform it, and feed it to an XGB Regressor with a learning rate of 0.1. We then make predictions on X_test and compare them with the y_test actuals using mean-squared-error scoring.

For comparison, we can look at the graph of XGB predicted scores against the actual review scores. As you can see, our model performs quite well in predicting the scores from the comments provided.

Graph 4: XGB Predictions vs Actual Review Score Average
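The end-to-end approach described in this section can be sketched as follows. This is a minimal illustration on toy in-memory data rather than the real Reviews.csv and Listings.csv files; the column names (`listing_id`, `comments`, `id`, `review_scores_rating`) are assumptions, and scikit-learn's GradientBoostingRegressor stands in for XGBRegressor so the sketch has no dependency beyond pandas and scikit-learn.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Toy stand-ins for Reviews.csv and Listings.csv (column names are assumed)
reviews = pd.DataFrame({
    "listing_id": [1, 1, 2, 2, 3, 3, 4, 5, 6],
    "comments": ["great location", "very clean place", "noisy street",
                 "rude host", "spotless and great value", "lovely view",
                 "broken heater", "awful smell", "clean rooms great host"],
})
listings = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "review_scores_rating": [96.0, 55.0, 97.0, 50.0, 45.0, 94.0],
})

# Step 1: concatenate all comments for each listing into a single string
docs = (reviews.groupby("listing_id")["comments"]
        .apply(" ".join)
        .rename("all_comments"))
df = listings.merge(docs, left_on="id", right_index=True)

# Step 2: TF-IDF features feeding a gradient-boosted regressor;
# TF-IDF down-weights words that are common across the whole corpus
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("gbr", GradientBoostingRegressor(learning_rate=0.1, random_state=0)),
])

# Repeat the toy rows so a train/test split is meaningful
X = pd.concat([df["all_comments"]] * 10, ignore_index=True)
y = pd.concat([df["review_scores_rating"]] * 10, ignore_index=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 3: fit on the training split, score the held-out split with MSE
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, preds))
```

On real data, the same pipeline structure applies; swapping the final step for an XGBRegressor or CatBoostRegressor only changes the last entry in the Pipeline.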
