In our case, we have hundreds of specialties and hundreds of typos, so…we’ll come back and extract these features if we absolutely have to.
Please visit the github link for more detail on implementing text feature engineering.

Next, since we know NYC is a diverse place with quickly-changing neighborhoods, let’s see how we can better incorporate location information into our data.
We have access to each physician’s primary workplace zipcode (awesome), which is fairly useful categorical information in itself, especially in conjunction with information like languages, age, and specialty.
Let’s take it two steps further: NYU has actually published some in-depth demographic information online, which includes details about the population, housing, and services by district.
And with a little (actually, a lot) of manual data entry we can do a fairly straightforward merge to tie all of that to our single zipcode feature.
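The merge itself is just a pandas join on the zipcode. Here’s a minimal sketch, assuming the manually entered NYU demographics live in a CSV with one row per zipcode (the file and column names are illustrative):

import pandas as pd

# hypothetical table built from the manually entered NYU district demographics,
# one row per zipcode (file and column names are illustrative)
demo_df = pd.read_csv('nyc_demographics_by_zip.csv')

# left-join the district demographics onto each physician via their primary zipcode
data = data.merge(demo_df, how='left', left_on='provider.zip', right_on='zip')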
Please visit the github link for more detail on implementing location feature engineering.

Using mapbox we can clearly see the relationship between location and our target.
loc_df = data.groupby(['provider.zip']).mean()
loc_df['n'] = data.groupby(['provider.zip'])["n_member_feedback_facet_infos"].sum()

px.set_mapbox_access_token(open('.mapbox_token').read())
px.scatter_mapbox(loc_df, lat="LAT", lon="LNG", color='rating', size="n",
                  color_continuous_scale=px.colors.sequential.Viridis)

Although our dataset is biased towards high ratings, we can definitely see there are portions of Brooklyn and Queens which, on average, have lower scores.
The plot also emphasizes the high concentration of doctors in lower Manhattan.
I want to draw attention to a couple more features in particular: the ‘provider.INTERNAL’ features and ‘provider.member_feedback_facet_infos’.
These internal score features are calculated by Oscar and the latter feature is actually another form of review — reviewers can leave specific feedback such as “Easy to book” and “Nice office environment”.
My primary concern is that they may not be independent from the patient review score even though we’ve already removed any rows without reviews.
Luckily, the covariance plot below shows that the only features which are strongly correlated are the review badges and the number of reviews.
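As a sketch of how such a plot can be produced with the plotly express tooling used throughout this post, the pairwise correlations can be rendered directly as a heatmap:

# pairwise correlations across the numeric columns, including the internal
# scores, facet feedback counts, and the review target
corr = data.corr(numeric_only=True)

# heatmap view: strongly correlated pairs stand out immediately
px.imshow(corr, color_continuous_scale=px.colors.sequential.Viridis)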
However, that doesn’t mean we can necessarily use all this data.
For one — as we’re trying to predict reviews when none already exist — ‘provider.num_reviews’ obviously has to go.
Also, the positive correlation between ‘member_scoreALL’ and our target indicates a possible data leak, and, as we can’t be sure whether it’s calculated with our target directly, it’s best to exclude it from our dataset.
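In code, dropping both columns is a one-liner (a sketch; exactly where in the pipeline this happens is up to you):

# drop the review count (unknowable for unreviewed doctors) and the possibly
# target-derived internal score before modeling
data = data.drop(columns=['provider.num_reviews', 'member_scoreALL'])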
Finally, we’re ready to begin fitting models.
We still have to be cognizant of our small number of data samples. Our approach will involve fitting a number of different regression models, which have the potential to find disparate patterns in the data, and then ensembling the results into a higher-level model to leverage all of the distinct outputs.
The bulk of this code is shown below, in which we import our model architectures from the sklearn library, fit them on slices of our data, predict on the remaining data, repeat until we’ve predicted the entire dataset, and then fit with our top-level model.
from math import sqrt

import pandas as pd
import plotly.express as px
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# kfold predictions for each data sample for each model type so we can ensemble
kf = KFold(n_splits=10)

# selection of regression models from sklearn with updated default values
models = [KNeighborsRegressor(n_neighbors=100),
          BayesianRidge(n_iter=300, tol=0.001),
          DecisionTreeRegressor(),
          RandomForestRegressor(n_estimators=100),
          GradientBoostingRegressor(),
          AdaBoostRegressor()]

# remove rating from the feature set
X, y = data.loc[:, :'LNG'], data.loc[:, 'rating']

# initialize scaler
scaler = StandardScaler()

# create k-fold split predictions for each model
for train_index, test_index in kf.split(X):
    # split data according to the fold
    x_train, y_train = X.loc[train_index], y[train_index]
    x_test, y_test = X.loc[test_index], y[test_index]

    # learn scaling from the training data, then scale both splits
    scaler.fit(x_train)
    x_train = scaler.transform(x_train)
    x_test = scaler.transform(x_test)

    # loop over every model on every fold and generate predictions
    # based on the available training data
    for model in models:
        print(type(model).__name__)
        # KNeighborsRegressor does not accept sample_weight
        # (sample_weight is the per-row weight array from earlier in the post)
        if type(model).__name__ != 'KNeighborsRegressor':
            model.fit(x_train, y_train, sample_weight=sample_weight[train_index])
        else:
            model.fit(x_train, y_train)
        # generate test predictions
        y_preds = model.predict(x_test)
        # save predictions to the corresponding rows
        data.loc[test_index, type(model).__name__] = y_preds
        # generate an overall score for this model/fold (for observation)
        rmse = sqrt(mean_squared_error(y_test, y_preds))

# move the rating column to the back
data['rating'] = data.pop('rating')

# create the meta model: keep only the base-model prediction columns and the rating
meta_X, meta_y = data.iloc[:, -(len(models)+1):-1], data.iloc[:, -1]

# split data into train and test
meta_x_train, meta_x_test, meta_y_train, meta_y_test = train_test_split(meta_X, meta_y, test_size=0.2)

# initialize and fit the adaboost meta model
meta_model = AdaBoostRegressor()
meta_model.fit(meta_x_train, meta_y_train)

# generate predictions
meta_preds = meta_model.predict(meta_x_test)

# calculate the overall ensemble rmse score
meta_score = sqrt(mean_squared_error(meta_y_test, meta_preds))

px.histogram(pd.DataFrame({'pred': meta_preds}), x='pred')

And our new predicted values look very similar to our training set! With an overall RMSE of 0.181 on a 5-star rating system, we would, on average, predict a doctor’s rating to within about a fifth of a star.
We can also see how each model contributed to the ensemble with the following line of code:

print(meta_model.feature_importances_)

[0.18907524 0.03818779 0.17733793 0.20602109 0.26546008 0.12391788]

With each value corresponding to its model, in list order, we can see that our meta-level model gained very little information from the BayesianRidge prediction features, while the remaining models all contributed significantly, especially GradientBoosting.
And that’s it! I can now predict ratings for every unreviewed doctor and find the best fit for me.
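That final step looks something like the sketch below, assuming the unreviewed doctors sit in a DataFrame called unreviewed with the same engineered feature columns (the name is hypothetical, and the base models are refit on the full reviewed dataset first, since the k-fold loop leaves them fit on only the last fold; sample_weight is omitted here for brevity):

# refit each base model on all reviewed data before scoring new doctors
x_all = scaler.fit_transform(X)
for model in models:
    model.fit(x_all, y)

# scale the unreviewed doctors' features the same way
x_new = scaler.transform(unreviewed.loc[:, :'LNG'])

# one prediction column per base model, named and ordered as the meta model expects
base_preds = pd.DataFrame({type(m).__name__: m.predict(x_new) for m in models})

# the meta model combines the base predictions into a final rating estimate
unreviewed['predicted_rating'] = meta_model.predict(base_preds)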
That said, this model still has significant room for growth with further investment.
First, there are a number of features which we did not address. From a doctor’s name or picture we can extract information such as age and race, which are especially helpful for connecting patients with doctors from a similar background.
Medical school, residency program, and all other education were also excluded, even though education is obviously a significant factor used by hospitals to recruit.
Combining their education with a list of internationally ranked schools would absolutely be useful, as would simply label encoding the schools and measuring how school and patient reviews are correlated in practice, though that would need a much larger sample size to execute.
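As a quick sketch of that label-encoding idea (the education column name here is hypothetical), sklearn makes the encoding itself trivial:

from sklearn.preprocessing import LabelEncoder

# encode each school as an integer id (column name is illustrative)
data['school_id'] = LabelEncoder().fit_transform(data['provider.medical_school'])

# then eyeball how the average rating varies by school
print(data.groupby('school_id')['rating'].mean())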
However, the biggest improvement left to be made is to transform our unpersonalized recommendations into a formal recommender system.
What’s prevented us from doing that so far is that the information publicly available from Oscar does not include patient-specific reviews, just an aggregate across all patients.
With patient information, their history, and their individual reviews we would be able to build a model which can predict whether a doctor is a good fit for that specific patient, without the need for filters.
There’s definitely more potential here than just my own gain, too. In the case of finding a physician, helping new doctors and private clinics quickly attract patients distributes demand away from already-established doctors, ideally bringing down healthcare costs.
And, in a larger scope, having a positive or negative recommendation for new businesses and products allows consumers to make quick, educated decisions on otherwise uncurated options.
We see this kind of recommendation in services such as Netflix, where we’re unclear as to how they’ve decided that we might like a specific film, but we can see the percent match regardless.
On the other hand, Amazon does not offer recommendations for unreviewed products, leaving many consumers to do a significant amount of research or ignore the product entirely (which then leads businesses to desperately bring in reviews, sometimes by bribing or discounting their products).