Funk SVD hands-on experience on Starbucks data set

Photo by Austin Distel on Unsplash

Talgat Kussainov · Jun 14

Trying to personalize mobile application offer distribution using collaborative filtering.

Table of contents:

1. Introduction
2. Data overview, cleaning and transformation
3. A role of the offers in the company's revenue
4. What kind of offers really excite people?
5. Collaborative filtering using Funk SVD
6. Model fine-tuning
7. Conclusion

Introduction

Nowadays internet resources and mobile applications are designed to personalize promo offers so as to increase loyalty and exceed expectations on the one hand, and to boost revenue on the other.

I was given several possible data sets as part of the Udacity Data Scientist Nanodegree, among which the Starbucks project excited me the most, as it contains simulated data that mimics customer behavior in the real Starbucks rewards mobile app.

I was curious whether the following points of interest could be addressed:

1. Do offers really play a significant role in the company's cash inflows?
2. What kind of offers really excite people and bring more revenue?
3. How can customer experience with promo offers be improved through personalization of offer distribution using a collaborative filtering technique — Funk SVD (more info about Funk SVD can be found here and here)?

Data overview, cleaning and transformation

The data is contained in three files:

- portfolio.json — offer ids and metadata about each offer.
- profile.json — demographic data for each customer.
- transcript.json — records for transactions, offers received, offers viewed, and offers completed.

Portfolio contains only 10 rows describing the possible offers and details such as duration, difficulty (the amount to be spent to complete the offer), channels, offer id, offer type (discount, BOGO — 'buy one get one free', informational) and the reward to be received once an offer is completed.

portfolio data transformed into pandas

Profile contains data about 17,000 customers; gender and income values are missing for 2175 persons.
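Loading such files into pandas takes one call per file; a minimal sketch using an in-memory sample (the column names mirror portfolio.json, but the rows are invented, and I assume the files are line-delimited JSON, hence lines=True):

```python
import io

import pandas as pd

# Hypothetical two-row sample mimicking the structure of portfolio.json
sample = io.StringIO(
    '{"id": "offer_a", "offer_type": "bogo", "difficulty": 10, "duration": 7, "reward": 10}\n'
    '{"id": "offer_b", "offer_type": "informational", "difficulty": 0, "duration": 3, "reward": 0}\n'
)
# For the real files, replace the StringIO buffer with the file path
portfolio = pd.read_json(sample, orient="records", lines=True)
```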

profile data transformed into pandas (first 10 rows)

Since nobody reported zero income, let's fill the NaN income values with zeros and check how this affects the diagram.

Missing values in gender will likewise be filled with 'S' for investigation.

Income before (left) and with missing values filled with zeroes (right)

From the scatterplots below we can see that the 2175 consumers with missing gender values are the same customers whose income is missing.

They form a special group of people at age 118.

Thus we may fill the missing gender values with 'S' and the missing income values with zeroes, as this will not interfere with the other groups of customers.
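In pandas this fill is a pair of fillna calls; a sketch on a toy frame (the values are hypothetical, only the pattern of missingness matches the data set):

```python
import numpy as np
import pandas as pd

# Toy stand-in for profile.json: the age-118 rows miss both gender and income
profile = pd.DataFrame({
    "gender": ["F", None, "M", None],
    "age": [35, 118, 42, 118],
    "income": [60000.0, np.nan, 72000.0, np.nan],
})
# Nobody reported zero income, so 0 is a safe sentinel; 'S' marks unknown gender
profile["income"] = profile["income"].fillna(0)
profile["gender"] = profile["gender"].fillna("S")
```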

Transcript contains records for transactions, offers received, offers viewed, and offers completed.

It also includes time (in hours) and person id.

Offer id, reward and amount can be extracted from the value column.
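One way to pull those fields out is to flatten the dicts in the value column; a sketch on toy rows, assuming the keys vary by event type and the offer id appears under two spellings (the row values are invented):

```python
import pandas as pd

# Toy transcript rows; the value column holds dicts whose keys vary by event type
transcript = pd.DataFrame({
    "event": ["offer received", "transaction", "offer completed"],
    "value": [
        {"offer id": "offer_a"},
        {"amount": 12.5},
        {"offer_id": "offer_a", "reward": 10},
    ],
})
# Flatten the dicts into columns, then merge the two offer-id spellings
values = pd.json_normalize(transcript["value"].tolist())
values["offer_id"] = values["offer_id"].fillna(values["offer id"])
extracted = pd.concat(
    [transcript.drop(columns="value"), values.drop(columns="offer id")], axis=1
)
```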

The main challenge of the transcript data set is that it contains no direct link between transactions and the offers that influenced them.

To establish this link, I prepared a function that checks, for each customer, whether a particular offer was received and viewed, and whether transactions happened and the offer was completed within the offer's effective time.

However, it turned out that a person may receive different offers simultaneously. If both offers are viewed by a customer and a transaction happens within the timeline of both, we must decide which offer actually led to the transaction.

Example of the above-mentioned case

As the data set gave no clear direction and no background information, I assumed that the offer viewed closest in time to the transaction counts as the offer that affected it.

In case one of the offers is informational and the other gives a reward, the rewarding offer prevails, as I presumed that customers are more attracted by rewards than by informational offers that give no additional benefit.
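The actual transformation function is longer, but its tie-breaking rule can be sketched roughly like this (function and field names are hypothetical):

```python
# Pick the offer that most plausibly drove a transaction at txn_time:
# rewarding offers beat informational ones, ties broken by the most recent view.
def attribute_transaction(txn_time, viewed_offers):
    """viewed_offers: dicts with 'id', 'offer_type', 'view_time', 'expires' (hours)."""
    active = [o for o in viewed_offers if o["view_time"] <= txn_time <= o["expires"]]
    if not active:
        return None  # transaction not influenced by any offer
    return max(active, key=lambda o: (o["offer_type"] != "informational", o["view_time"]))

offers = [
    {"id": "info_3", "offer_type": "informational", "view_time": 10, "expires": 82},
    {"id": "discount_7_7", "offer_type": "discount", "view_time": 5, "expires": 173},
]
```

With both offers active, the discount wins even though the informational offer was viewed more recently, which is exactly the preference described above.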

Code of the transformation function.

After applying the trans_affected_func function we have a clear view of customer behavior patterns in terms of reaction to the received offers.

Returning to the special group found at the beginning.

Number of offer types received by all customers (left) and by the special group (right)

The proportions look similar. It means that the Starbucks app sends offers without distinguishing this group.

Number of affected transactions within offer types for all customers (left) and for the special group (right)

When it comes to the number of transactions, it appears that the special group is more affected by informational offers than by BOGO, compared to the overall population. Discount offers are in first place for both.

Overall amount spent by all customers (left) and by the special group (right)

The proportion of the amounts spent is different for the special group. This can be explained by BOGO being less popular with the 'S' group: fewer transactions and less money spent.

What is interesting here is that although the number of transactions influenced by BOGO is noticeably lower than for discount, the overall sums for BOGO and discount are pretty close to each other across the general customer community.

We have seen that the special group differs in behavior from the general customer community and should probably be treated in a special way.

A role of the offers in the company's revenue

The majority of the revenue is generated by transactions not influenced by offers: 65.53% of purchase inflows come from transactions not related to any offer. Discount and BOGO give relatively similar figures: 14.06% and 13.33% respectively. Informational accounts for at most 7.08%.

How is this 34.47% of revenue distributed among customers? The cash inflows influenced by offers are generated by almost 72% of the consumers that received offers at least once.

Within the given client population, about 28% were not affected at all, although they received at least one offer during the experiment.

Main takeaway

The offers' role in the company's revenue is not vital, although it is noticeable.

What kind of offers really excite people?

We can say that promos really excite people if they have a high number of views, completions (if applicable) and associated transactions.

Distribution of offers and the number of affected transactions

Overall sum of amount spent within each offer

From the above diagrams we can see that all offers were distributed almost equally, while some of them were viewed and completed more frequently than others.

6 out of 10 offers have the highest view rates; among them, 2 rewarding offers are at the top — discount_10_10 (valid for 10 days, difficulty 10) and discount_7_7 (valid for 7 days, difficulty 7).

They also have the highest number of influenced transactions.

Among the informational offers, the one valid for 3 days is the most exciting, as it influenced a high number of transactions (6223) — pretty close to discount_7_7 (6335).

These 3 offers are also champions in terms of overall sum of amount spent by customers.

Main takeaway

It was found that discount offers with difficulty 7 and 10 and duration 7 and 10 days respectively are the offers that really excite people. Among the informational offers, the one with a 3-day duration also achieved a high influence rate on consumers.

These 3 offers are also leading in terms of overall sum of amount spent by the customers.

Hence they are beneficial for both consumers and Starbucks.

Collaborative filtering using Funk SVD

The first step is to prepare a user-item matrix with users in the index and offers in the columns. After applying a set of transformations, the following matrix was prepared.

Offers and users were encoded into integer values.

'1' in cell (i, j) indicates that the i-th user completed the j-th offer or performed transactions as a result of an informational offer's influence.

We can say that i-th user is positively reacting to the j-th offer.

Zero means that the i-th user received the j-th offer at least once but never made transactions that could be considered influenced by the offer.

Missing values in a row mean that the user never received the offers in the corresponding columns.
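Such a matrix can be built from the reaction log with a pandas pivot; a toy sketch (user and offer ids and reactions are invented):

```python
import pandas as pd

# Toy reaction log: 1 = positive reaction, 0 = received but ignored
reactions = pd.DataFrame({
    "user": [0, 0, 1, 2],
    "offer": [0, 1, 0, 2],
    "reacted": [1, 0, 1, 1],
})
# Users in the index, offers in the columns; never-received offers stay NaN
user_item = reactions.pivot_table(index="user", columns="offer", values="reacted")
```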

missing values in user item matrix

The matrix is quite sparse, as 63% of the data is missing in each column.

Here is where I would expect benefits from using Funk SVD since it could fill all the missing values with ones and zeroes.

Next step is to define basic form of Funk SVD without regularization.
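This basic form can be sketched in a few lines of NumPy: stochastic gradient descent over only the known cells, with no regularization term (the matrix values and hyperparameters here are illustrative, not the ones from the project):

```python
import numpy as np

def funk_svd(ratings, latent_features=2, learning_rate=0.005, iters=300):
    """Basic Funk SVD without regularization: SGD over the known cells only."""
    n_users, n_items = ratings.shape
    rng = np.random.default_rng(42)
    user_mat = rng.random((n_users, latent_features))
    item_mat = rng.random((latent_features, n_items))
    known = np.argwhere(~np.isnan(ratings))
    for _ in range(iters):
        for i, j in known:
            # move both factor vectors along the gradient of the squared error
            err = ratings[i, j] - user_mat[i] @ item_mat[:, j]
            user_mat[i] += learning_rate * 2 * err * item_mat[:, j]
            item_mat[:, j] += learning_rate * 2 * err * user_mat[i]
    return user_mat, item_mat

# Tiny user-item matrix; the NaN cells get filled by the product of the factors
R = np.array([[1.0, 0.0, np.nan],
              [1.0, np.nan, 1.0],
              [np.nan, 0.0, 1.0]])
user_mat, item_mat = funk_svd(R)
preds = user_mat @ item_mat
```

The product user_mat @ item_mat then yields a prediction for every cell, including the missing ones.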

Then we can apply Funk SVD to the user item matrix.

Applying Funk SVD

Let's look at the result of the matrix factorization next to the original user-item matrix.

Predictions and the original matrix

So we have a matrix of recommendations reflecting possible positive ('1') and negative/ignore ('0') reactions to the offers.

But how are we doing? Is the prediction good enough? In order to validate the results, we should split the user-item matrix into train and test sets.

We train the model on the train set and verify how well it does on the test set, compared to a naive predictor.

We can assume naively that all offers will be completed/used by all customers.

Keep in mind that we can verify predictions only for customers and offers that are present in both data sets.

Metrics and estimators

An estimator for the Funk SVD function will be Mean Squared Error (MSE). MSE measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. It shows how the squared difference decreases across all predictions with more gradient descent iterations.

Accuracy was chosen as the performance metric, since both positive and negative reactions to offers are equally important to us: there is no big harm whether we send an offer to a customer or refrain from doing so. The classes are not imbalanced. With all that in mind, the accuracy metric is considered an acceptable choice.
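On rounded predictions, these metrics reduce to a few NumPy one-liners; a toy illustration with invented reaction vectors:

```python
import numpy as np

# Hypothetical actual vs predicted reactions for cells present in the test set
actual = np.array([1, 0, 1, 1, 0, 1])
preds = np.array([1, 0, 0, 1, 0, 1])   # rounded Funk SVD predictions
naive = np.ones_like(actual)           # naive: every customer uses every offer

accuracy = np.mean(preds == actual)
naive_accuracy = np.mean(naive == actual)
mse = np.mean((preds - actual) ** 2)
rmse = np.sqrt(mse)
```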

Splitting user item matrix into train and test data sets.
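One way such a split can be sketched is by cutting the interaction log by time and pivoting each half (this is an illustrative approach with invented ids and times, not the project's exact splitting code):

```python
import pandas as pd

# Toy interaction log; later interactions form the test set
log = pd.DataFrame({
    "user": [0, 0, 1, 2, 2, 3],
    "offer": [0, 1, 0, 0, 1, 1],
    "reacted": [1, 0, 1, 1, 0, 1],
    "time": [1, 2, 3, 4, 5, 6],
})
train_log, test_log = log[log["time"] <= 4], log[log["time"] > 4]
train = train_log.pivot_table(index="user", columns="offer", values="reacted")
test = test_log.pivot_table(index="user", columns="offer", values="reacted")
# Only users present in both matrices can be validated; user 3 is a cold start
common_users = train.index.intersection(test.index)
```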

Splitting user item matrix into train and test sets

We can make predictions for 5456 common users, while for 10 users we cannot due to the cold start problem: they are not present in both sets simultaneously.

Fitting model to the train set.

I applied the model on the train set and validated the results on both the train and test sets. Additionally, accuracy, MSE and RMSE were compared to the naive predictor's figures.

MSE, RMSE, Accuracy on the train and test sets (left) and naive model metrics and performance (right)

Well, my Funk SVD model is not doing very well, as the accuracy difference on the test set between the model and the naive predictor is only 3%.

Model fine-tuning

Can the above result be improved? I prepared and applied a custom gridsearch_funkSVD function that verified how well the model does with all possible parameter combinations.
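The original gridsearch_funkSVD code is not shown here; a generic sketch of the idea, with the training and evaluation functions passed in as stand-ins and illustrative parameter grids:

```python
from itertools import product

# Generic grid search over Funk SVD hyperparameters; train_fn and eval_fn
# stand in for the project's actual training and accuracy functions
def gridsearch_funk_svd(train, test, train_fn, eval_fn):
    results = []
    for lf, lr, iters in product([5, 10, 15], [0.001, 0.005, 0.01], [50, 100, 250]):
        model = train_fn(train, latent_features=lf, learning_rate=lr, iters=iters)
        results.append({"latent_features": lf, "learning_rate": lr,
                        "iters": iters, "test_accuracy": eval_fn(model, test)})
    # keep the combination with the best test-set accuracy
    return max(results, key=lambda r: r["test_accuracy"])
```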

Achieved results are reflected below.

Note: the "overfitting" column is filled with "Yes" whenever the prediction gives values other than 0s and 1s.

The same conclusion is supported by the model's performance on the train set.

Observations: the more latent features and iterations we use and the larger the learning rate, the higher the chance of the model overfitting the train set; conversely, fewer iterations, fewer latent features and a smaller learning rate make the model perform worse on the test set. So there should be a trade-off between parameters and performance.

From the above table the best parameters are: number of latent features = 5, learning rate = 0.005, number of iterations = 100. These parameters give the optimized model.

Conclusion

I analyzed the given Starbucks data and applied data cleaning, transformation and visualization.

65.53% of purchase inflows are generated by transactions not related to any offer. Discount and BOGO give relatively similar figures: 14.06% and 13.33% respectively. Informational accounts for at most 7.08%.

The majority of the revenue is generated by transactions not influenced by the offers. The roughly 35% of cash inflows influenced by offers are generated by almost 72% of the customers that received offers at least once. Within the given consumer population, about 28% were not affected at all, although they received at least one offer during the experiment.

It was found that discount offers with difficulty 7 and 10 and duration 7 and 10 days respectively are the offers that really excite people. Among the informational offers, the one with a 3-day duration also achieved a high influence rate on consumers.

These 3 offers are leading in terms of overall sum of amount spent by the customers.

Based on the transformed transactional information, I formed a user-item matrix that reflects customers' positive or negative (ignore) reactions to the received offers.

The basic form of Funk SVD without regularization was selected to fill in the missing values (ratings) in the user-item matrix, since not all customers received all possible offers.

In order to assess how well the model does, I split the data into train and test sets.

As expected, the tuned model does better on the test set than a naive prediction (sending offers to all customers as if all of them were happy to receive and use offers).

Keep in mind that for 45 customers we could not make predictions due to the cold start problem, as they were not present in both sets simultaneously.

I could not achieve accuracy greater than 0.7093, as the more the model is trained on the train set, the more it overfits. So there was a trade-off between training the model on the train set and its predictive power on the test set. Although 0.7093 does not look very bad, it does not look very promising either, so we should think over possible further steps.

What else can be done? Possible further analysis and improvements

Performance can be compared with supervised learning algorithms that take customer data as input and predict whether a consumer will respond positively to an offer or not.

The special group of people can be removed from the dataset and the model's performance compared with the previously achieved results. The special group probably adds variance to the data, so the model cannot generalize well.

As an alternative to the offline approach we used here, we could do an online approach where we run an experiment to determine the impacts of implementing one or more recommendation systems into our user base (one can be based on Funk SVD and the second one based on supervised learning algorithm for example).

A simple experiment for this situation might be to randomly assign users to a control group that receives additional offers they have never seen.

Then we capture reaction to them and compare it with the predictions of the selected algorithms and measure performance.

Thank you for reading this.

This blog post reflects only a small part of the analysis done; you can find more on GitHub.
