MachineHack, Predict A Doctor’s Consultation Hackathon

Benjamin Lau · Mar 28 · MachineHack.com

Recently I took part in an online machine learning hackathon on MachineHack.com, where the goal was to use machine learning to predict the price of a doctor's consultation in India.

Even though I did not achieve a great result, it served as a learning session to consolidate what I had learned over the past few months.

Anyway, for those who didn't know, I come from a pharmacy background and am picking up data analytics skills with my own curriculum.

If you are interested in getting involved in this field, make sure to check out my curriculum here.

Let’s jump right into the hackathon.

This is a rather small dataset with 5961 rows and 7 unique features.

- Qualification of the doctor
- Experience of the doctor in years
- Profile of the doctor
- Ratings that were given by patients
- Miscellaneous_Info that contains other information about the doctor
- Place (area and city of the doctor's location)
- Fees charged by the doctor (dependent variable)

Firstly, import all dependencies and the dataset.

import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer

df = pd.read_csv("Final_Train.csv")

View the data to get a sense of the information given.

df.head()

Clearly, some data cleaning needs to be done before any modelling.

Let’s start by looking at the number of missing values in this training dataset.

round(df.isnull().sum()/len(df)*100, 2)

The output shows the percentage of missing values for each column.

I am going to deal with these missing values individually, so just keep those values in mind for now.

I have always liked to start with the simplest tasks to get going before tackling the more complex issues.

The “Experience” column here seems simple enough as it just requires extracting the integer values from the string.

# Extract years of experience
df["Experience"] = df["Experience"].str.split()
df["Experience"] = df["Experience"].str[0].astype("int")

The first line of code splits the string into a list, while the second extracts the first element of the list and converts it into an integer.

Next, the “Place” column can be easily processed by separating the City from the area.

# Extract cities
df["Place"].fillna("Unknown,Unknown", inplace=True)
df["Place"] = df["Place"].str.split(",")
df["City"] = df["Place"].str[-1]
df["Place"] = df["Place"].str[0]

Before extraction, I replaced all missing values in this column with the string 'Unknown,Unknown' to represent them.

Side note: sometimes it is a good idea to give missing values their own class instead of relying on imputation techniques like mean/median/mode. For example, in this dataset, some regions in India might not have listed their location during data collection, yet they could all have come from the same region.
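To make the contrast concrete, here is a minimal sketch on a toy column (the values are made up), comparing mode imputation with a separate 'Unknown' class:

# Toy column with two missing entries
s = pd.Series(["Delhi", None, "Mumbai", None])

imputed = s.fillna(s.mode()[0])   # mode imputation: both missing rows become "Delhi"
separate = s.fillna("Unknown")    # separate class: missing rows keep their own label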

Next, I split the string at ',' and created a new column 'City' using the last element of the list.

Moving on to the 'Ratings' column, remember that this column has more than 50% missing values.

We have to deal with the missing values before any other processing.

# Separate Ratings into bins
df["Rating"].fillna("-99%", inplace=True)
df["Rating"] = df["Rating"].str[:-1].astype("int")
bins = [-99, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = [i for i in range(11)]
df["Rating"] = pd.cut(df["Rating"], bins=bins, labels=labels, include_lowest=True)

Missing values were replaced with -99% to differentiate them.

Then, assuming a rating of 91% is not significantly different from a rating of 99%, I grouped the ratings into bins of size 10.

Missing values fall under class 0, 0-9% under class 1, 10-19% under class 2, and so on.
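A quick sanity check of this binning on a few made-up ratings (reusing the bins and labels defined above):

# -99 is the placeholder for missing; 7% falls in class 1, 91% and 100% in class 10
toy = pd.Series([-99, 7, 91, 100])
pd.cut(toy, bins=bins, labels=labels, include_lowest=True)  # -> 0, 1, 10, 10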

df["Rating"].value_counts().sort_index() displayed the distribution.

The 'Qualification' column contains the various qualifications of the doctor without any standardized reporting method.

I started off with the usual split and tried to get an idea of the frequency of the different terms appearing in this column.

# Extract relevant qualifications
df["Qualification"] = df["Qualification"].str.split(",")

# Count how often each qualification appears
Qualification = {}
for x in df["Qualification"].values:
    for each in x:
        each = each.strip()
        if each in Qualification:
            Qualification[each] += 1
        else:
            Qualification[each] = 1

To be honest, I was quite lost at this stage.

If you look at the Qualification dictionary, a large proportion of the qualifications have only one occurrence in the entire dataset, and some of the terms actually refer to similar qualifications but were counted separately.

For example, there were entries of ‘MBA -Healthcare’ and ‘MBA’ which I think referred to the same qualification.

This is the problem of non-standardized data entry or data collection, and I believe data scientists/analysts see this on a daily basis.

I decided to go with the simplest approach and simply identify the 10 qualifications that occur most often.

# Keep the 10 most frequent qualifications
most_qua = sorted(Qualification.items(), key=lambda x: x[1], reverse=True)[:10]
final_qua = []
for tup in most_qua:
    final_qua.append(tup[0])

# Create an indicator column for each of them
for title in final_qua:
    df[title] = 0

for x, y in zip(df["Qualification"].values, np.array([idx for idx in range(len(df))])):
    for q in x:
        q = q.strip()
        if q in final_qua:
            df[q][y] = 1

df.drop("Qualification", axis=1, inplace=True)

The final result is a set of dummy variables for the 10 highest-frequency qualifications in the dataset.

Now for the ‘Profile’ column.

If you recall, we do not have any missing values in this column.

A quick value_counts() check yielded this.

That is actually pretty neat.

Since the whole column only consists of 6 classes, one-hot encoding the column should do the trick.

Before that, a quick check on the ‘City’ column we created showed that it also contains a small number of classes (10).

However, something weird popped up.

There is an 'e' entry out of nowhere, and I guessed it was a mistake (a wrong entry).

I found that the problem occurred in row 3980, so I changed the 'City' and 'Place' columns for that row to 'Unknown' instead.

df["City"][3980] = "Unknown"df["Place"][3980] = "Unknown"A final check.

Much better.

Then, using only one line of code, we can generate dummy variables for the 'Profile' and 'City' columns concurrently.

# Get dummies
df = pd.get_dummies(df, columns=["City","Profile"], prefix=["City","Profile"])

Lastly, here is a short preview of the 'Miscellaneous_Info' column. Taking into account the high percentage of missing values (43.95%), and the fact that I could not find any relevance in the column (I am no NLP expert), I decided to forgo the column and just drop it.

Not the best approach, of course, but it will do for now.

df.drop("Miscellaneous_Info", axis=1, inplace=True)

Once we are done with data preprocessing, the next step should naturally be data visualization.

Note: some people prefer not to look at the data before modelling to prevent any bias introduced by the analyst, but to each his own.

Most people would have assumed there is some association between the experience of the doctor and the fees they charged.

Indeed there is, but it might not be what we expect it to be.

Average fees increased with experience but peaked at approximately 25 years of experience; after that, average fees decreased as experience continued to increase.
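The chart itself was made in Tableau, but the same summary can be pulled straight out of pandas; a minimal sketch on the preprocessed df (before the 'Fees' column is separated out for modelling):

# Average fee charged at each experience level
avg_fee_by_exp = df.groupby("Experience")["Fees"].mean().sort_index()
avg_fee_by_exp.head(30)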

Ratings is another interesting variable to look at.

If you remember, we grouped ratings into bins of size 10, inclusive of the smallest value.

E.g., bin 5 is a rating of 40-49%, bin 10 is 90-100%, and bin 0 is just the missing values in the dataset.

As you can see, a high rating does not correlate with a higher fee charged (in fact, a lower fee might be the reason for a high rating!), and the highest average fees were actually charged by doctors rated 30-60%.

The colour scheme depicts the median experience level in each bin, with dark green representing a higher median experience.

Median experience in bins 4 and 5 was 27 years and 31 years respectively, while bin 10 only had a median experience of 14 years, justifying the ability of the doctors in those bins to charge a higher fee.
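Again, these figures came from the Tableau dashboard, but a rough pandas equivalent (using the column names from the preprocessing above) would be:

# Mean fee and median experience within each rating bin
df.groupby("Rating").agg(
    mean_fee=("Fees", "mean"),
    median_experience=("Experience", "median"),
)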

There is a lot to unwrap here.

- Fees charged by doctors differ between cities.
- The distribution of the different doctor profiles within each city is similar for most cities.
- All of the entries with an unknown city are actually Dermatologists!

Previously, I mentioned that missing data might not be random and can be due to the data collection process; this is a very good example of it.

Somehow, dermatologists in some cities are not recording their location!

Note: all visualizations were done in Tableau, just because I am learning the tool.

You can use python or any other visualization tools.

Finally, we can model our data and do some cool machine learning.

I decided to use a support vector machine for this task as it can be used for both linear and non-linear problems.

The small amount of data also does not warrant the use of a neural network.

Before implementing the algorithm, we have to encode the categorical variables and scale the features.

X = df.drop("Fees", axis=1)
y = df["Fees"]

# Encoding
enc = OrdinalEncoder()
X = enc.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Feature scaling (fit the scaler on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

Implementing SVR:

# Support vector machine
from sklearn.svm import SVR

m = SVR(gamma="scale")
m.fit(X_train, y_train)  # X_train has already been scaled above

Based on the hackathon site, submissions are evaluated on the Root Mean Squared Log Error (RMSLE); more specifically, the score is 1 - RMSLE.
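Written out (as I read the metric, and as implemented in the scoring function below), the score for $n$ predictions $\hat{y}_i$ against true fees $y_i$ is:

$$\text{score} = 1 - \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log \hat{y}_i - \log y_i\right)^2}$$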

def score(y_pred, y):
    y_pred = np.log(y_pred)
    y = np.log(y)
    return 1 - ((np.sum((y_pred - y)**2))/len(y))**(1/2)

# Prediction
y_pred = m.predict(scaler.transform(X_test))
score(y_pred, y_test)

Our prediction on the testing set gave us a score of 0.7733490738717279.

If you take a look at the leaderboard, the winner's best score is 0.76162342.

Of course, our testing set is not the real testing set used for the leaderboard and is not comparable.

But it gave us a metric to use for further optimization.

Doing a GridSearchCV is a great way to do hyperparameter tuning, but take note of the computational power needed, especially for algorithms like SVM that do not scale well.

# Define own scorer
scorer = make_scorer(score, greater_is_better=True)

# Hyperparameter tuning
parameters = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf", "poly"]}
reg = GridSearchCV(m, param_grid=parameters, scoring=scorer, n_jobs=-1, cv=5)
reg.fit(X_train, y_train)

Running reg.best_params_ gives the combination of hyperparameters that provides the best score.

The best hyperparameters in this case are C=10 and kernel="rbf".

Note: define your own scorer to be used in GridSearchCV so that it optimizes using that scoring metric.

Finally.

y_pred_tuned = reg.predict(scaler.transform(X_test))
score(y_pred_tuned, y_test)

The score obtained here is 0.8034644306855361.

A slight improvement but an improvement nonetheless.

This translates to a score of 0.72993077 using the test set for the leaderboard, and a rank of 62/169 (top 40th percentile). There you have it.

A full walk-through of a machine learning competition.

There is a lot more I could have done to achieve a better score, such as more feature engineering or trying other algorithms, but unfortunately, I did not manage to do so before the competition ended.

Doing such competitions is useful for consolidating what you have learned and is a great way to practice.

Unless you are aiming for the prizes, winning often doesn't matter and it is the journey that is the most fulfilling.

All the code can be found in my GitHub here.
