Are Starbucks app users responding to offers?

print('Number of offer sendout days: {}'.format(
    transcript[transcript['event'] == 'offer received']['time'].nunique()))

There were 17,000 users that engaged in 306,534 events during the 29-day test period.

The offers were sent out on 6 different days over that period.
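For reference, these summary figures can be read straight off the raw data frames. Here is a minimal sketch, assuming the profile and transcript data frames of the Starbucks Capstone data set are already loaded (with profile['id'] identifying users and transcript['time'] recorded in hours):

# Number of users in the demographic data
print('Number of users : {}'.format(profile['id'].nunique()))

# Number of recorded events (transactions plus offers received/viewed/completed)
print('Number of events: {}'.format(transcript.shape[0]))

# Length of the test period in days (event times are given in hours)
print('Test length (days): {}'.format(transcript['time'].max() // 24))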

Before proceeding to machine learning modeling, I removed the duplicates in the transcript data frame.

# Check for duplicates in the transcript data set and remove them if present
if transcript.drop_duplicates().shape != transcript.shape:
    number_rows = transcript.shape[0]
    transcript = transcript.drop_duplicates()
    print('Dropped {} duplicate rows from transcript dataset.'.format(number_rows - transcript.shape[0]))
else:
    print('No duplicates found.')
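As a quick aside, pandas can also count exact duplicates directly, which is handy for inspecting the data before deciding to drop anything; the following one-liner is equivalent in effect to the check above:

# Count exact duplicate rows in the transcript data frame
print('Duplicate rows: {}'.format(transcript.duplicated().sum()))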

2. Machine Learning Modeling

Predict whether a consumer completes an offer.

Shuffle and Split Data

X = result[['income', 'offer_type_bogo', 'offer_type_discount', 'offer_type_informational',
            'time', 'reward', 'gender_F', 'gender_M', 'gender_O',
            'social', 'email', 'web', 'mobile']]
y = result.event.apply(lambda x: 1 if x == 'offer completed' else 0)

# Import Stratified Shuffle Split
#from sklearn.cross_validation import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit

# Split the features and labels into training and testing sets
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Note that with n_splits=5, only the last of the five stratified splits is retained as the train/test split used below.

Performance Measure

As a baseline performance measure, I built a naive predictor that always predicts that an offer is completed.
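The F-score used throughout this post is the F-beta score with beta = 0.5, which weights precision more heavily than recall:

F_0.5 = (1 + 0.5^2) * (precision * recall) / (0.5^2 * precision + recall)

This is exactly the expression computed in the baseline code below.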

Naive Predictor as Baseline Predictor

import numpy as np

TP = np.sum(y)        # Counting the ones, as this is the naive (always-positive) case
FP = y.count() - TP   # Specific to the naive case
TN = 0                # No predicted negatives in the naive case
FN = 0                # No predicted negatives in the naive case

# Calculate accuracy, precision and recall
accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)
precision = TP / (TP + FP)

# Calculate the F-score for beta = 0.5 using the precision and recall values above
fscore = (1 + (0.5)**2) * (precision * recall) / (((0.5)**2 * precision) + recall)

# Print the results
print("Naive Predictor: [Accuracy score: {:.4f}, F-score: {:.4f}]".format(accuracy, fscore))
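Because this baseline never predicts a negative (FN = TN = 0), the numbers simplify: recall is 1, and accuracy and precision both equal the fraction p of offers in y that were completed, so the F-0.5 score reduces to 1.25 * p / (0.25 * p + 1).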

Training and Predicting Pipeline

# Import two metrics from sklearn - fbeta_score and accuracy_score
from sklearn.metrics import fbeta_score, accuracy_score
from time import time

def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the number of samples to be drawn from the training set
       - X_train: features training set
       - y_train: labels training set
       - X_test: features testing set
       - y_test: labels testing set
    '''
    results = {}

    # Fit the learner to the training data, slicing with 'sample_size'
    start = time()  # Get start time
    learner = learner.fit(X_train[:sample_size], y_train[:sample_size])
    end = time()    # Get end time

    # Calculate the training time
    results['train_time'] = end - start

    # Get predictions on the test set (X_test),
    # then predictions on the first 300 training samples (X_train)
    start = time()  # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time()    # Get end time

    # Calculate the total prediction time
    results['pred_time'] = end - start

    # Compute accuracy on the first 300 training samples
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)

    # Compute accuracy on the test set
    results['acc_test'] = accuracy_score(y_test, predictions_test)

    # Compute F-score on the first 300 training samples
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)

    # Compute F-score on the test set
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5)

    # Success
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))

    # Return the results
    return results

Model Results Metrics

# Import two supervised learning models from sklearn
from IPython.display import display
from time import time
from sklearn.ensemble import AdaBoostClassifier
#from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Initialize the models
clf_A = AdaBoostClassifier(random_state=15)
#clf_B = LogisticRegression(random_state=44)
clf_C = SVC()

# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_100 = len(y_train)
samples_10 = int(0.1 * len(y_train))
samples_1 = int(0.01 * len(y_train))

# Collect results on the learners
results = {}
for clf in [clf_A, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = train_predict(clf, samples, X_train, y_train, X_test, y_test)

# Run metrics visualization for the two supervised learning models chosen
evaluate(results, accuracy, fscore)
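A note on the last line: evaluate appears to be a plotting helper from the project template (visuals.py) and is not defined in this post. If that helper is not available, a minimal stand-in that simply prints the collected test-set metrics next to the naive baseline could look like this sketch, which assumes the results dictionary built by train_predict above:

def evaluate(results, accuracy, fscore):
    # results[learner][i] holds the metrics for the 1%, 10% and 100% training-set sizes
    for learner, runs in results.items():
        for i, metrics in runs.items():
            print('{} (size index {}): test accuracy = {:.4f}, test F-0.5 = {:.4f}'.format(
                learner, i, metrics['acc_test'], metrics['f_test']))
    print('Naive baseline: accuracy = {:.4f}, F-0.5 = {:.4f}'.format(accuracy, fscore))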

I will focus primarily on how well the algorithms perform on the testing set. Accuracy-wise, AdaBoost is slightly better. In terms of F-score, though, SVC did a bit better when trained on 100% of the training set. Nonetheless, both models achieved an F-score of only around 0.5, which is quite low.

Fine Tuning the Chosen Model (AdaBoost)

Since AdaBoost did slightly better, I fine-tuned the model to achieve better performance metrics.

# Import 'GridSearchCV', 'make_scorer', and any other necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Initialize the classifier
clf = AdaBoostClassifier(random_state=14)

# Create the dictionary of parameters to tune
parameters = {'n_estimators': [50, 100, 200], 'learning_rate': [1.0, 2.0]}

# Make an fbeta_score scoring object using make_scorer()
scorer = make_scorer(fbeta_score, beta=0.5)

# Perform grid search on the classifier using 'scorer' as the scoring method
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

# Fit the grid search object to the training data and find the optimal parameters
grid_fit = grid_obj.fit(X_train, y_train)

# Get the best estimator
best_clf = grid_fit.best_estimator_

# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)

# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta=0.5)))
print("\nOptimized model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta=0.5)))
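Before comparing scores, it is also worth looking at which parameter combination the grid search actually picked; the fitted GridSearchCV object exposes this directly:

# Inspect the winning hyperparameters and their mean cross-validated F-0.5 score
print(grid_fit.best_params_)
print('Best CV F-0.5 score: {:.4f}'.format(grid_fit.best_score_))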

The accuracy did not change much, but there is some improvement in the F-score, from 0.4555 to 0.659.

Thank you very much for reading.

Stay blessed :)
