Implementing a Profitable Promotional Strategy for Starbucks with Machine Learning (Part 2)

Josh Xin Jie Lee, Jan 8

In this series, we will design a promotional strategy for Starbucks and walk through the entire process from data pre-processing to modelling.

This is my solution for the Udacity Data Scientist Nanodegree Capstone project.

In the final part of the series, we will cover feature engineering, implementation of the uplift model, additional model and data adjustments made and results of the project.

Link to part 1 of this article.

The code accompanying this article can be found here.


Feature Engineering

With only 4 demographic attributes to work with, feature engineering could prove beneficial.

Customers often went through long periods of time without receiving promotions.

Hence cumulative values and moving averages would be used to capture past transactional behaviors of customers.

Cumulative sums would be calculated for the following statistics:

- total spending
- number of transactions
- profits

For example, the cumulative profit at time N will be:

Cumulative Profit at Time N = Profit at Time 0 + … + Profit at Time N-1

Note that the computation of cumulative sums for month N will be based on values from month 0 to N-1 to avoid data leakage.

Likewise the moving averages (rolling means) of the same statistics were calculated.

For example:

Moving Average of Profit at Time N = Cumulative Profit at Time N / Number of Months

The cumulative statistics and moving averages would be computed for each ‘offer id’, including non-promotional situations (represented by ‘offer id 10’), as well as on an aggregate basis (all offers and no offer).
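The cumulative sum and moving average defined above could be computed along the following lines. This is only a minimal sketch: the column names (`person_id`, `month`, `profit`) and the helper `add_history_features` are assumptions made for illustration, not names taken from the project code, and it assumes one row per customer per month with no gaps.

```python
import pandas as pd

def add_history_features(df, value_cols=("profit",)):
    """Add leak-free cumulative sums and moving averages per customer.

    Assumes one row per (person_id, month); column names are hypothetical.
    """
    df = df.sort_values(["person_id", "month"]).copy()
    grouped = df.groupby("person_id")
    n_prior = grouped.cumcount()  # number of earlier months for this customer
    for col in value_cols:
        # cumsum() includes the current month, so subtract it to keep
        # only months 0..N-1 and avoid leaking the current month's value.
        cum = grouped[col].cumsum() - df[col]
        df[f"cum_{col}"] = cum
        # Moving average = cumulative value / number of earlier months;
        # left as NaN when there are no earlier months.
        df[f"avg_{col}"] = (cum / n_prior).where(n_prior > 0)
    return df

monthly = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2],
    "month": [0, 1, 2, 0, 1],
    "profit": [1.0, 2.0, 3.0, 5.0, 0.0],
})
features = add_history_features(monthly)
```

Computing the same statistics for each offer id would presumably only require grouping by both the customer and the offer id instead of the customer alone.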

For example, we will compute:

- Cumulative Profit at Time N for Offer id 0
- Cumulative Profit at Time N for Offer id 1
- …
- Cumulative Profit at Time N for Offer id 10
- Total Cumulative Profit at Time N for All Offers id 0–10

In addition, the cumulative spending per transaction (total spending / total number of transactions) and cumulative profit per transaction (total profit / total number of transactions) will be added as well.

Any missing values will be filled with 0, since null values indicate that the customer had not yet made any transactions.

Lastly, 1-month lags of these engineered features will be computed to allow our model to capture recent changes in transactional behaviors.

Since there is no previous month before month 0, the engineered features for month 0 and the 1-month lag features for month 1 will be comprised entirely of null values.

Thus, we will discard months 0 and 1 from our training data.
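The lag step and the removal of the first two months can be sketched as below. The column names and the helper `add_one_month_lags` are hypothetical, and the sketch assumes each customer has one row for every consecutive month.

```python
import pandas as pd

def add_one_month_lags(df, feature_cols, min_month=2):
    """Add 1-month lags of engineered features and drop the earliest months."""
    df = df.sort_values(["person_id", "month"]).copy()
    for col in feature_cols:
        # shift(1) within each customer picks up the previous month's value
        df[f"{col}_lag1"] = df.groupby("person_id")[col].shift(1)
    # Months 0 and 1 have no usable history or lag values, so discard them
    return df[df["month"] >= min_month].reset_index(drop=True)

data = pd.DataFrame({
    "person_id": [1, 1, 1, 1],
    "month": [0, 1, 2, 3],
    "cum_profit": [0.0, 1.0, 3.0, 6.0],
})
lagged = add_one_month_lags(data, ["cum_profit"])
```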

Once again, due to the large number of features created, I will abstain from listing all of them in this article.

For more details, refer to the code on my GitHub repository, located under the file generate_monthly_data.


[Figure: Examples of engineered features]

VIII. Indicator Uplift Model And Promotional Strategy

We will be using a single model to predict the probabilities of profits from both promotional and non-promotional exposure.

During the training phase, an indicator variable is created to track if a data point from monthly_data belongs to a promotion or not.

Each type of offer will have its own model, so a single indicator variable for each model will be sufficient.

The rationale for using separate models will be discussed shortly.

[Figure: Training features for promotion ‘offer id 0’.]

The column ‘offer_id_0’ serves as an indicator variable that tracks if a data point belongs to ‘offer_id_0’ (indicator=1), or if it belongs to a non-offer instance (indicator=0).

Note that the other training features have been reduced with PCA, a process which we will discuss shortly.

Once the model is trained, it can be used to formulate our promotional strategy.

To predict whether an individual should receive a promotion when testing our strategy, we can predict the individual’s profit probability when given the promotion by setting the indicator variable to 1.

Next, we can predict the individual’s profit probability when he/she is not given the promotion by setting the indicator variable to 0.

Note that the same model is used to predict the probability of profits during promotional and non-promotional periods.

Only the inputs, specifically the indicator variable, are changed during the procedure.

If the difference in probability (also known as the uplift effect) is larger than 0, we will send the promotion.

This is because the individual is more likely to generate profits when given a promotion as opposed to no promotions.

Uplift Effect = Probability of Profit When Given a Promotion - Probability of Profit When Not Given a Promotion

Alternatively, regression models can be used to model the expected amount of profits in promotional events versus non-promotional events.

This can potentially tell us how much more profit we can expect to gain by sending an offer to an individual.

For this project, I decided to focus on modelling the probability of profits, rather than the expected amount of profits.
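The indicator-variable procedure described above can be sketched as follows. The function `uplift_scores` and the fixed-weight `toy_model` are hypothetical stand-ins: in the project the probabilities would come from the trained classifier, and the feature layout here is an assumption.

```python
import numpy as np

def uplift_scores(predict_proba, X, indicator_col):
    """Score uplift with one model by toggling the indicator column.

    predict_proba: callable mapping a 2-D feature array to P(profit) per row.
    """
    X_promo = X.copy()
    X_promo[:, indicator_col] = 1  # pretend every customer gets the offer
    X_none = X.copy()
    X_none[:, indicator_col] = 0   # pretend nobody gets the offer
    return predict_proba(X_promo) - predict_proba(X_none)

def toy_model(X):
    """Hypothetical stand-in for a trained classifier (fixed weights)."""
    # Column 0: a customer feature; column 1: the promotion indicator.
    # The interaction term makes the offer help some customers and hurt others.
    logits = -1.0 + 0.5 * X[:, 0] + X[:, 1] * (1.0 - X[:, 0])
    return 1.0 / (1.0 + np.exp(-logits))

customers = np.array([[0.0, 0.0], [2.0, 0.0]])
uplift = uplift_scores(toy_model, customers, indicator_col=1)
send_offer = uplift > 0  # send only when the promotion raises P(profit)
```

Only the indicator column changes between the two predictions, so any difference in the predicted probability is attributed to the promotion.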

In addition, there are other types of uplift models that can be implemented for this task.

One such example will be to use two separate models to measure the uplift effect.

In this scenario, one model will be trained on the promotional data while the other model will be trained on the non-promotional data.

The difference between the predicted probabilities of the two models will indicate the uplift effect.

For more information about other uplift models, check out this article.


Additional Data and Model Adjustments

Before we discuss the modelling results, there are a couple of final adjustments that we will make.

Using Individual Models for Each Offer Type

Initial experiments with a single model for all offer types led to unsatisfactory results.

This could be due to the fact that the number of profitable instances differed significantly between different offer types.

Hence, the positive instances of some offer types might be weighted more heavily than those of other offer types.

[Figure: Distribution of labels between the different offer types.]

There was also the possibility that different offer types shared very few common signals that could be used to identify profitable offers.

Hence, the decision was made to create separate models for each offer type.

Each model would focus on modelling the differences in promotional and non-promotional spending for a single offer type.

Using a Subset of Monthly Data

In addition, a reduced subset of monthly data was used to train each model.

The primary goal will be to model the transactional behaviors of individuals during months when they received offers, and identify which of them are likely to spend more money during promotional periods as opposed to non-promotional periods.

Only months in which the relevant offer was sent would be included in the dataset.

In addition, we would only include individuals for whom we had transaction records for both promotional and non-promotional situations during those months.

For example, assume that we are working on a model for offer id 0.

If person id 1 received ‘offer id 0’ in month 1, then person id 1’s promotional and non-promotional expenditures in month 1 will be included in the dataset.

If person id 2 did not receive offer id 0 in month 1, then person id 2’s information (non-promotional transactional records) for month 1 will not be included.

Likewise, if person id 1 did not receive offer id 0 in month 9, then his/her transaction information for that month will not be used.

Hence, every offer would have its own unique subset of monthly data.

Taking a subset of the monthly data will allow us to accurately compare the differences in monthly expenditures between promotional and non-promotional situations for the same individuals.

In addition, this approach will help ensure that the model is seeing an equal number of promotional and non-promotional exposures each month.

This will help reduce the possibility of over-fitting to a specific exposure.
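One way to build such a subset is sketched below. The table layout (one row per customer, month and offer id, with offer id 10 holding non-promotional spending) and the helper `subset_for_offer` are assumptions for illustration; the project's actual data structure may differ.

```python
import pandas as pd

NO_OFFER_ID = 10  # the article uses offer id 10 for non-promotional spending

def subset_for_offer(monthly_data, offer_id):
    """Keep person-months that have both an offer row and a no-offer row."""
    promo = monthly_data[monthly_data["offer_id"] == offer_id]
    no_promo = monthly_data[monthly_data["offer_id"] == NO_OFFER_ID]
    # Person-months present in both the promotional and non-promotional rows
    keys = promo[["person_id", "month"]].merge(
        no_promo[["person_id", "month"]], on=["person_id", "month"]
    ).drop_duplicates()
    both = pd.concat([promo, no_promo])
    return both.merge(keys, on=["person_id", "month"]).reset_index(drop=True)

monthly_data = pd.DataFrame({
    "person_id": [1, 1, 2, 1, 1],
    "month":     [1, 1, 1, 9, 9],
    "offer_id":  [0, 10, 10, 4, 10],
    "profit":    [-2.0, 3.0, 4.0, 2.0, 1.0],
})
subset = subset_for_offer(monthly_data, offer_id=0)
```

In this toy data only person 1 in month 1 qualifies for offer id 0, so the subset holds that person's promotional row and matching non-promotional row.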

Imbalance in Labels

As previously mentioned, there is an imbalance in the value counts of labels.

Data points are more likely to be non-profitable (has_profit labels of 0) than profitable (has_profit labels of 1).

If we look at the distribution of the has_profit labels among the promotions, the imbalance is even more pronounced when compared to non-promotional exposures.

This is especially true for promotions ‘offer id 0’ and ‘offer id 3’, which have an extremely low number of profitable instances, whereas no-offer data points have a much higher number of profitable instances.

[Figure: Distribution of labels between the different offer types.]

Hence, we will need to address the imbalance between the labels, so that they will remain consistent between promotional and non-promotional exposures.

If the imbalance is left unaddressed, the model will have a greater tendency to predict 0 labels for promotions, especially in the case of ‘offer id 0’ and ‘offer id 3’.

Synthetic Minority Over-sampling Technique (SMOTE) will be used to oversample the profitable class.

In other words, we will be adding artificially created person-month instances with has_profit labels of 1.

SMOTE allows us to create new observations with slightly different feature values from the original observations.

To create a new sample, it will take a data point from the dataset and select one of its k-nearest neighbors.

It will then take the vector between the chosen neighbor and the current data point, and multiply this vector by a random number that lies between 0 and 1.

Finally, it will add the results to the current data point to create the new sample.
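The interpolation step just described can be written out in a few lines of NumPy. This is a bare-bones sketch of the idea for a single synthetic sample, with a hypothetical helper name `smote_sample`; it is not the routine used in the project.

```python
import numpy as np

def smote_sample(X_minority, k=5, rng=None):
    """Create one synthetic minority sample by SMOTE-style interpolation."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(X_minority))   # pick a minority point at random
    point = X_minority[i]
    # Distances from the chosen point to every minority point
    dists = np.linalg.norm(X_minority - point, axis=1)
    # Indices of the k nearest neighbours, skipping rank 0 (the point itself)
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = X_minority[rng.choice(neighbours)]
    gap = rng.random()                  # random number in [0, 1)
    # Move from the point towards the neighbour by the random fraction
    return point + gap * (neighbour - point)

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
sample = smote_sample(X_minority, k=2, rng=np.random.default_rng(0))
```

Each synthetic point lies on the line segment between an existing minority point and one of its neighbours. In practice a library implementation, such as `SMOTE` in imbalanced-learn, would normally be preferred over hand-rolled code.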

[Figure: An overview of SMOTE. Taken from Rich Data.]

This is often a better approach than just resampling the original data, which will create too many duplicated data points and lead to over-fitting in the machine learning model.

Since oversampling often increases recall at the cost of precision, I chose to oversample only promotional data points.

This is because non-promotional data points already have a higher ratio of profit to no-profit labels than promotional data points.

Hence by oversampling only the promotional data points, the ratio of profit to no-profit labels in promotional situations will be brought closer to non-promotional situations.

Lastly, oversampling will be performed only on the training data.

We want our validation and test data to mimic actual customer behavior in the real world, where it is likely that only the minority of the customers will generate profits for the firm every month.

Scaling and Dimensionality Reduction

SMOTE works best with continuous data.

Since our data is a mixture of categorical and continuous variables, we will need to convert them to continuous variables.

One approach will be to scale the dataset and perform dimensionality reduction.

This will generate a dataset comprised of only continuous variables.

Dimensionality reduction has another benefit. Most customers respond to only a single type of offer during the study’s duration.

Customers might receive a few types of offers, but most will generally act on only 1 type of offer.

Hence the amount of historical spending for most offer types will be 0 for many individuals.

Since we have engineered new features based on historical spending behaviors for each offer type, a large proportion of these engineered features will be sparse (0 for a lot of features).

Hence, dimensionality reduction will help reduce the sparsity of the dataset.

Normalization and dimensionality reduction were performed for each offer type separately.

Standard scaling was used to normalize all variables to a mean of 0 and a standard deviation of 1, while Principal Component Analysis (PCA) was used to reduce the dimensions of the dataset.

For most offer types, 40 to 50 dimensions were sufficient to capture the majority of variance in the dataset.

Since the original number of features was approximately 200, this indicated a high degree of sparsity in the dataset.
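The scaling and reduction steps can be illustrated with plain NumPy, as in the sketch below. The helper `scale_and_reduce` and the 90% variance threshold are assumptions for illustration; the article only states that 40 to 50 components captured the majority of the variance.

```python
import numpy as np

def scale_and_reduce(X, variance_to_keep=0.9):
    """Standardise columns, then project onto the top principal components."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                  # constant columns stay at 0 after scaling
    Z = (X - mean) / std                 # mean 0, standard deviation 1
    # SVD of the scaled data: the rows of Vt are the principal directions
    _, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    # Smallest number of components whose cumulative variance meets the target
    n_components = int(np.searchsorted(np.cumsum(explained), variance_to_keep) + 1)
    n_components = min(n_components, len(explained))
    return Z @ Vt[:n_components].T, explained

a = np.arange(10.0)
X = np.column_stack([a, 2 * a, -a])      # three perfectly correlated columns
reduced, explained = scale_and_reduce(X)
```

Because the three toy columns carry the same information, a single component is enough here. The cumulative sum of `explained` is what a scree plot visualises.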

[Figure: Scree plot for the Discount 10/20/5 (Offer ID 0) promotion]

Metric

The performance of our promotion strategy will be determined using the Net Incremental Revenue (NIR), where:

NIR = Promotional Revenue - Cost of Promotion - Non-Promotional Revenue

which can also be expressed as

NIR = Promotional Profit - Non-Promotional Profit

The NIR will be calculated based on individuals who should receive the offer according to our strategy.

In other words, these are individuals with positive uplift values.

Thus, the NIR measures how much is made (or lost) by sending out the promotion to these individuals.
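As a small illustration of the metric, the NIR over the targeted customers could be computed as below. The function name, the dictionary layout and all of the numbers are made up for this sketch.

```python
def net_incremental_revenue(promo_profit, non_promo_profit, targeted_ids):
    """NIR = promotional profit - non-promotional profit over targeted customers."""
    promo_total = sum(promo_profit.get(pid, 0.0) for pid in targeted_ids)
    non_promo_total = sum(non_promo_profit.get(pid, 0.0) for pid in targeted_ids)
    return promo_total - non_promo_total

# Hypothetical profits for one month, keyed by customer id
promo_profit = {101: 12.0, 102: -5.0, 103: 4.0}
non_promo_profit = {101: 7.0, 102: 3.0, 103: 10.0}

# Only customers 101 and 103 are predicted to have positive uplift
nir = net_incremental_revenue(promo_profit, non_promo_profit, targeted_ids=[101, 103])
```

Here the targeted customers earned $16 under the promotion against $17 without it, so the NIR is -$1 and sending the offer to them would have lost money.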

For example, let us assume that we are calculating the NIR for month 19.

Suppose that our promotional strategy predicted that customers with ids 15 and 5550 will have positive uplift values and should receive the promotion, and that the actual transaction records for these individuals during month 19 are as follows:

[Table: Offer id 0 is a Discount 10/20/5 promotion. Offer id 10 tracks non-promotional spending.]

The NIR will be calculated as such:

NIR = ($0 + $23.20) - ($8.[…]76) = -$2.25

Grid Search

An XGBoost classifier was used to model the probability of profits, and early stopping was employed to reduce overfitting of the models.

The area under the precision-recall curves was used to decide when training should stop, instead of the area under the ROC curve.

This choice was made due to the imbalance of classes in the dataset, which meant that using the area under the ROC curve might lead to an overly optimistic picture.

To identify the optimal promotional strategy, a grid search was conducted over the following parameters: up-sampling ratio, maximum depth of tree and minimum child weight.

The grid search would evaluate the validation and test NIRs for each set of parameters.

The up-sampling ratio controls how much we should oversample the profit instances (has_profit label of 1) for the promotional data points.

Maintaining an equal balance in profit to non-profit instances between promotional and non-promotional situations did not always lead to optimal results.

Hence there was a need to vary the up-sampling ratio.

The larger the maximum tree depth and the lower the minimum child weight, the higher the modelling power.

This means that the tree is more capable of learning relations very specific to a particular sample.

On the other hand, smaller maximum tree depths and higher minimum child weights will make a model more conservative and control overfitting better.

Since the offers were sent in irregular months, each offer’s test month would be different.

In general, the final month during which the offer was sent would be used as the test month, while the second last month would be used as the validation month.

Finally, the rest of the months would be assigned to the training data.
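The month assignment described above amounts to a simple time-ordered split. A sketch with a hypothetical helper `split_months` and made-up month numbers:

```python
def split_months(offer_months):
    """Last offer month -> test, second last -> validation, rest -> training."""
    months = sorted(set(offer_months))
    if len(months) < 3:
        raise ValueError("need at least 3 offer months for train/validation/test")
    return {"train": months[:-2], "validation": months[-2], "test": months[-1]}

# Hypothetical months in which one offer was sent
split = split_months([7, 2, 12, 17, 19, 14])
```

Splitting by time rather than at random keeps the validation and test months strictly after the training months, which mirrors how the strategy would be used in practice.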

In most cases, there were approximately 3 or 4 training months available for each offer.

For this project, the chosen promotional strategy was not necessarily the one that produced the best validation NIR.

It was observed that the best performing strategies during the validation month might not produce positive NIRs during the test month.

Hence, the selected strategy would be one that produced the highest validation NIR while still producing a positive test NIR.

If no strategies were found to produce positive NIRs during both the validation and test months, the strategy that produced the highest validation NIR would be reported.
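The selection rule can be expressed compactly as below. The result records, parameter values and NIR figures are invented for illustration, and `select_strategy` is a hypothetical helper.

```python
def select_strategy(results):
    """Pick the highest validation NIR among strategies profitable in both months.

    Falls back to the highest validation NIR overall if none qualify.
    """
    profitable = [r for r in results if r["val_nir"] > 0 and r["test_nir"] > 0]
    candidates = profitable if profitable else results
    return max(candidates, key=lambda r: r["val_nir"])

# Hypothetical grid search output: one record per parameter combination
results = [
    {"params": {"max_depth": 6}, "val_nir": 90.0, "test_nir": -40.0},
    {"params": {"max_depth": 4}, "val_nir": 60.0, "test_nir": 5.0},
    {"params": {"max_depth": 2}, "val_nir": 20.0, "test_nir": 8.0},
]
chosen = select_strategy(results)
```

The first record has the best validation NIR but loses money in the test month, so the second record is chosen instead.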

Normally, it is not ideal to use the test results to tune the model.

However, we do not have sufficient monthly data to increase the number of months used for the validation and testing periods.

If more data were available, we could set aside additional months for the validation and test periods.

This might lead to greater consistency in the results and allow us to avoid using the test results to tune our strategy.

Hence, this project will serve only to demonstrate the viability of a profitable promotional strategy.

Further refinements will be needed if we want to obtain a promotional strategy that is reliable and profitable.

As we shall see shortly, regardless of the strategies that we picked, the uplift models generally produced results much better than what was originally attained in the experiment.


Results

We will now compare the results obtained from the baseline strategies and our uplift models.

The baseline strategy will be the original strategy employed during the study.

In other words, everyone who received the offer during the actual experiment will receive the offer in the baseline strategy.

Our model’s goal would be to identify a smaller subset of these individuals who were likely to spend more when given a promotion as opposed to when they were not given a promotion.

In other words, the uplift model will send the promotions only to individuals with positive uplift values.

Ideally, Starbucks can maximize its profits by restricting the promotions only to the most promising customers.

Discount 10/20/5 (Offer ID 0)

Offer ID 0 is a discount promotion with a difficulty of $20, a reward of $5, and a validity period of 10 days.

Baseline Strategy ~ Validation NIR: $108.70, Test NIR: -$4,889.48
Uplift Model ~ Validation NIR: $72.83, Test NIR: -$2,163.47

Discount 7/7/3 (Offer ID 1)

Offer ID 1 is a discount promotion that has a difficulty of $7, a reward of $3 and a validity of 7 days.

Baseline Strategy ~ Validation NIR: $185.14, Test NIR: -$4,732.18
Uplift Model ~ Validation NIR: $60.41, Test NIR: $4.61

Discount 7/10/2 (Offer ID 2)

Offer id 2 is a discount promotion with a difficulty of $10 and a reward of $2.

The offer has a 7-day validity period.

Baseline Strategy ~ Validation NIR: $65.88, Test NIR: -$5,519.62
Uplift Model ~ Validation NIR: $12.40, Test NIR: $3.17

Informational 4/0/0 (Offer ID 3)

Offer id 3 is an informational promotion with no difficulty and no reward.

It has a validity of 4 days.

According to Starbucks, this means that the customer will “feel” its impact for 4 days.

The probable explanation is that the customer will be able to view the offer in the app for a period of 4 days.

Baseline Strategy ~ Validation NIR: -$4,193.67, Test NIR: -$8,754.95
Uplift Model ~ Validation NIR: $29.39, Test NIR: -$34.26

BOGO 5/10/10 (Offer ID 4)

Offer id 4 is a buy-one-get-one-free promotion with a difficulty of $10 and a reward of $10.

It has a validity period of 5 days.

Baseline Strategy ~ Validation NIR: -$4,634.69, Test NIR: -$7,027.36
Uplift Model ~ Validation NIR: $12.39, Test NIR: $10.20

Informational 3/0/0 (Offer ID 5)

Offer id 5 is an informational promotion with a validity of 3 days.

These are the results for the models:

Baseline Strategy ~ Validation NIR: -$5,188.06, Test NIR: -$6,707.87
Uplift Model ~ Validation NIR: $2.19, Test NIR: -$130.91

BOGO 7/5/5 (Offer ID 6)

Offer id 6 is a buy-one-get-one-free promotion with a difficulty of $5 and a reward of $5.

It is valid for a period of 7 days.

These are the results for the models:

Baseline Strategy ~ Validation NIR: $121.58, Test NIR: -$6,542.62
Uplift Model ~ Validation NIR: $21.81, Test NIR: $10.15

BOGO 7/10/10 (Offer ID 7)

Offer id 7 is a buy-one-get-one-free promotion with a difficulty of $10 and a reward of $10.

Offer id 7 is similar to offer id 4 with the exception that it has a validity period of 7 days.

Baseline Strategy ~ Validation NIR: $65.13, Test NIR: -$6,207.28
Uplift Model ~ Validation NIR: $24.29, Test NIR: $0.73

BOGO 5/5/5 (Offer ID 8)

Offer id 8 is a buy-one-get-one-free promotion with a difficulty of $5 and a reward of $5.

It is identical to offer id 6 with the exception of a shorter validity period of only 5 days.

Baseline Strategy ~ Validation NIR: -$5,779.91, Test NIR: -$7,508.97
Uplift Model ~ Validation NIR: $481.78, Test NIR: -$786.3

Discount 10/10/2 (Offer ID 9)

Offer id 9 is the final promotion that we will discuss.

It is a discount promotion with a difficulty of $10, a reward of $2 and a validity period of 10 days.

It is similar to offer id 2 except that it has a longer validity period of 10 days compared to 7 days for offer id 2.

Baseline Strategy ~ Validation NIR: $104.30, Test NIR: -$5,006.65
Uplift Model ~ Validation NIR: $51.87, Test NIR: $3.02

In all cases, we were able to make significant improvements over the baseline strategies’ test-month NIRs.

For 6 out of the 10 promotion types, we were able to find strategies that were profitable during the validation and test months.

The 4 types of promotions for which we were not able to do so were the discount 10/20/5, informational 4/0/0, informational 3/0/0 and BOGO 5/5/5.

There are two possible explanations for our strategies’ poor performance on the informational offers.

The first is that since informational offers lack a reward, they have limited effectiveness.

Thus, their impact on customers’ spending is negligible.

Alternatively, their relatively short validity period, coupled with the fact that customers are not incentivized to ‘complete’ them quickly, means that the true impact of these promotions will not be felt until later.

Customers may respond to these promotions, but only after the promotions have expired.

In addition, our strategies’ poor performance for the discount 10/20/5 promotion suggests that the promotion’s difficulty ($20) may be too high to elicit a meaningful customer response.

Even though the promotional strategies for the 4 aforementioned promotions were not profitable, they still represent significant improvements upon the baseline strategies.

Hence, their adoption would improve Starbucks’ bottom line.

In a number of promotions, our uplift model strategies achieved lower NIRs in the validation months than what was originally attained in the experiment.

However, these strategies did manage to improve the test months’ NIRs dramatically.

Hence, the trade-off was acceptable.


Conclusion

Answering our Question

Let us now get back to our question at the beginning:

Can we increase Starbucks’ profits by adopting a more selective promotional strategy?

We have shown that it is definitely possible to improve the effectiveness of the original promotional strategy and achieve better returns.

Profitable strategies were found for 6 of the 10 promotions, and we also managed to substantially reduce losses in the other 4 promotions.

However, our current approach does not generate positive NIRs for all offers.

In addition, there are issues regarding the consistency of the results between the validation and test months.

Further improvements have to be made in order to attain strategies that are reliable and profitable.

As previously noted, uplift models can be tricky to implement.

The key takeaway from this experiment is that promotions don’t seem to generate significantly higher profits in the short-run.

Most customers are generally loyal and are often willing to purchase products regardless of the presence of promotions.

Hence, we need to be more selective when identifying individuals to send promotions.

Otherwise, we might adversely impact the company’s profits.

Potential Improvements

We noted that strategies producing the highest validation NIRs did not always produce positive test NIRs.

The inconsistencies between validation and test results might suggest that either the signals were not strong or they were not consistent throughout the different months.

Considering that only a small proportion of customers responded to the offers, we did not have a lot of transaction data to work with.

In addition, only 4 demographic attributes were available.

Hence acquiring more transaction and demographic data could help improve the signal.

Alternatively, we can improve our uplift models in a number of ways:

- Send promotions only to individuals with uplift values above a certain percentile, rather than to all individuals with positive uplift.
- Use regression models to model the amount of profits in promotional and non-promotional situations.
- Try other uplift models, such as the two-model approach, the four-quadrant approach, etc.
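The first idea in the list, targeting by percentile, would only need a small change to the decision rule. A sketch with made-up uplift values and a hypothetical helper `target_top_uplift`; the 80th percentile is an arbitrary choice for illustration:

```python
import numpy as np

def target_top_uplift(uplift, percentile=80):
    """Select customers whose uplift is positive and at or above a percentile."""
    cutoff = np.percentile(uplift, percentile)
    return (uplift >= cutoff) & (uplift > 0)

# Hypothetical predicted uplift values for five customers
uplift = np.array([-0.10, 0.02, 0.05, 0.20, 0.01])
mask = target_top_uplift(uplift, percentile=80)
```

With these values the cutoff falls between 0.05 and 0.20, so only the customer with the largest uplift is targeted, whereas the positive-uplift rule would have targeted four of the five.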

Trying all of these alternative approaches will be relatively time-consuming, hence I did not explore them for this project.

Furthermore, there is also the possibility that sending these promotions may lead to lower profits in the short-run due to the cost incurred, but they may build up loyalty in customers and encourage them to spend more money on future transactions.

Our current approach does not model the long-term impact of these promotions.

Hence, an alternative approach to this problem will be to design a strategy that maximizes future profits rather than short-term profits gained.

In this scenario, our goal will be to identify individuals who are likely to spend more money in the coming months after receiving a promotion.

The code accompanying this article can be found here.

Thank you for reading this article! If you have any thoughts or feedback, leave a comment below or send me an email at leexinjie@gmail.


I’d love to hear from you.

