Starbucks Promotion Strategy

Adlef

An analysis based on the take-home assignment exercise for Starbucks job candidates.


Introduction

Starbucks conducted a randomized experiment to test whether an advertising promotion would bring more customers to purchase a specific product priced at $10. Since it costs the company $0.15 to send out each promotion, it would be best to limit the promotion to those customers who are most receptive to it.

The goal of this exercise is to build a model to select the best customers to target, maximizing the two following metrics:

  • Incremental Response Rate (IRR)

IRR depicts how many more customers purchased the product with the promotion, as compared to if they didn’t receive the promotion.

Mathematically, it’s the ratio of the number of purchasers in the promotion group to the total number of customers in the promotional group (treatment) minus the ratio of the number of purchasers in the non-promotional group to the total number of customers in the non-promotional group (control).

Fig 1 — IRR definition
  • Net Incremental Revenue (NIR)

NIR depicts how much is made (or lost) by sending out the promotion.

Mathematically, this is $10 times the total number of purchasers that received the promotion, minus $0.15 times the number of promotions sent out, minus $10 times the number of purchasers who were not given the promotion.

Fig 2 — NIR definition
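
Written out in my own notation (reconstructed from the two definitions above):

$$\mathrm{IRR} = \frac{purch_{treat}}{cust_{treat}} - \frac{purch_{ctrl}}{cust_{ctrl}}$$

$$\mathrm{NIR} = \left(10 \cdot purch_{treat} - 0.15 \cdot cust_{treat}\right) - 10 \cdot purch_{ctrl}$$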

The data for this exercise consists of about 120,000 data points split in a 2:1 ratio between training and test files. The training dataset will be referred to as the dataset used to set up the models. Each data point includes one column indicating whether or not an individual was sent a promotion for the product, and one column indicating whether or not that individual eventually purchased it. Each individual also has seven additional features associated with them, provided abstractly as V1-V7.

In order to answer the following question:

Can we build a model to select the best customers to target for the promotion?

I will try two different approaches.

  • Thanks to the experiment done on two groups of people (A/B testing), I will on the one hand establish a model to predict whether a customer who received a promotion is willing to purchase, and do the same for customers not in the promotional group.
    Our final model will therefore be the combination of two predicting models.
    This will be the Two Model Approach.
  • The second approach is simply to consider all customers, whether in the promotional group or not, and to predict whether they are willing to purchase.
    As in the end we are only concerned with finding out whether the promotion works, I added a new feature called ‘receptive’, which assigns the value 1 to customers who received promotions and made purchases, and 0 to everyone else. The objective will be to predict these ‘receptive’ customers.
    This will be the One Model Approach.

However, before talking about implementing models, we must have a look at the data itself.

Part 1. The Data Overview

I started by analyzing the structure of the data.

Fig 3 — First five rows of the data
  • The data has no missing values.
  • I found out that “V1”, “V4”, “V5”, “V6” and “V7” are categorical features. I transformed them into dummy variables via one-hot encoding (a minimal sketch follows this list).
    Below is a picture explaining the concept in a simple way.
Fig 4 — One Hot Encoding concept
  • At first glance, “purchase” is what we are interested in predicting: knowing whether a customer is likely to buy the product or not.
  • “V2” and “V3” are respectively approximately normally and uniformly distributed, without extreme values.
Fig 5 — V2 distribution
Fig 6 — V3 distribution
  • Number of customers in experiment: 42364 / Purchasers: 721
    Number of customers not in experiment: 42170 / Purchasers: 319
    The data is imbalanced; the non-purchaser class is dominant.
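
As referenced in the list above, here is a minimal sketch of the encoding step with pandas; the file name is hypothetical:

```python
import pandas as pd

train = pd.read_csv("training.csv")  # hypothetical file name

# Turn the categorical features into dummy (one-hot) columns.
categorical = ["V1", "V4", "V5", "V6", "V7"]
train = pd.get_dummies(train, columns=categorical, prefix=categorical)
```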

A two-sided z-test on the group sizes (checking that customers were evenly assigned to the two groups) yields a p-value of 0.506 (z = -0.66), which is within a reasonable range under the null hypothesis. Since we lack sufficient reason to reject the null, we can move on to performing a hypothesis test on the evaluation metrics and setting up models.
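
One way such a check can be reproduced is to test whether customers were assigned to the two groups with equal probability; a minimal sketch (the author's exact procedure is not shown):

```python
import numpy as np
from scipy import stats

n_treat, n_ctrl = 42364, 42170  # group sizes from the data
n = n_treat + n_ctrl

# Under H0, each customer lands in the control group with p = 0.5, so the
# control group size is approximately Normal(n/2, n/4).
z = (n_ctrl - n * 0.5) / np.sqrt(n * 0.25)
p_value = 2 * stats.norm.cdf(-abs(z))
print(f"z = {z:.2f}, p-value = {p_value:.3f}")  # z ≈ -0.66, p ≈ 0.506
```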

Here is a link to understand the concept of p- and z-values.

Part 2. Determining the Sample Size

So, we have a dataset. We will need to sample parts of it, and the samples must be large enough for the analyses to be statistically sound.

Fig 7 — Sampling a population
  1. Statistically, our sample size must be large enough to represent the original population accurately. The degree of certainty we require is called the confidence level; the most common value is 95%.
  2. Next to that, our analyses shouldn't be overly sensitive to small differences, so we accept a degree of flexibility called the margin of error, or Type I error rate, close to ~5%.
    Putting both together, we can be 95% confident that statistics calculated on our sample match the original population within a range of ±5%.
  3. The statistical power of a hypothesis test is the probability that the test will detect an effect that actually exists. I will go with a power of 80%, meaning in other words that the test has an 80% chance of detecting an effect that exists (we expect a difference between the promotional and non-promotional groups).

Our desired minimum IRR is 1.5%, and we want to detect this change with a type 1 error rate of 5% and a power of 80%.

I used statistical power calculations for a z-test on two independent samples to come up with a necessary sample size of 5089 entries in order to reach the target. The dataset is made of 120,000 data points.
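
The mechanics of such a calculation can be sketched with statsmodels. The baseline purchase rate below is my assumption (the control group's observed rate); the exact inputs behind the 5089 figure are not shown in the exercise, so the result will depend on the values chosen:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Assumed baseline: the purchase rate observed in the control group.
p_base = 319 / 42170       # ~0.76%
p_target = p_base + 0.015  # we want to detect a minimum IRR of 1.5%

# Cohen's h effect size for the two proportions, then solve for the
# required sample size per group at alpha = 0.05 and power = 0.80.
effect = proportion_effectsize(p_target, p_base)
n_per_group = NormalIndPower().solve_power(effect_size=effect,
                                           alpha=0.05, power=0.80)
print(round(n_per_group))
```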

Hence, we have enough data to play with different models and approaches.

Part 3. Handling the Imbalanced Data

As already stated, the goal of this exercise is to build a model to select the best customers to target, maximizing the NIR and IRR metrics.

However, the data is imbalanced: there are far more “no purchase” than “yes purchase” entries. The classification of these data is therefore biased towards the majority class, which practically means that the model tends to mistakenly attribute “non-purchaser” status even to purchasers.
One method to balance it is SMOTE. It will be applied to the training dataset (the dataset used to train our models).

Fig 8 — SMOTE concept

SMOTE is an over-sampling method. It creates synthetic (not duplicate) samples of the minority class, hence making the minority class equal in size to the majority class. SMOTE does this by selecting similar records and altering them one column at a time by a random amount within the difference to the neighbouring records.

Thanks to the over-sampling method, we end up with the same number of purchasers and non-purchasers.
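
A minimal sketch with the imbalanced-learn library, assuming X_train and y_train hold the encoded features and the “purchase” labels of the training split:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Over-sample the minority (purchaser) class with synthetic examples.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(np.bincount(y_resampled))  # both classes are now equally represented
```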

Part 4. The Two Model Approach

Even if finding out who will buy is our target, ensuring that we don't send a promotion to someone who is not willing to buy is also crucial, as it would be costly for the company. Since the vast majority of customers are non-purchasers, I establish an optimal scorer function:

  • where 80% of the scorer focuses on minimizing false positives (real non-purchasers seen as purchasers). This is the Specificity (the higher it is, the fewer false positives are found). Additionally, if the value is below a given threshold, a penalty is applied.
  • where 20% of the scorer focuses on maximizing true positives (real purchasers). This is called Recall.

Consequently, models with a high Specificity are kept and then run for the best Recall possible.

The Scorer Optimized
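
A minimal sketch of such a scorer with scikit-learn's make_scorer; the 80/20 weights come from the text, while the specificity floor and penalty magnitude are my assumptions:

```python
from sklearn.metrics import confusion_matrix, make_scorer

def optimized_score(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    specificity = tn / (tn + fp)               # fewer false positives -> higher
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    score = 0.8 * specificity + 0.2 * recall   # weights from the text
    if specificity < 0.8:                      # assumed floor and penalty
        score -= 0.5
    return score

optimized_scorer = make_scorer(optimized_score)
```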

Our first approach is to train two models:

  • one on the customers involved in the experiment (promotional group), to find out their purchasing profile;
  • one on the customers not involved in the promotional group, to find out their purchasing profile.

The positive reception to the promotion will be found by comparing the two models, built on customers belonging to the promotional and non-promotional groups respectively.

To be exact, the difference in the probabilities, which we will call the lift:

Fig 9 — The Lift definition

will tell us the probability that sending a promotion to an individual will increase his or her willingness to make a purchase versus not sending one. We can then send promotions to individuals with lift values higher than a predefined cutoff percentile. For example, we can send promotions to individuals in the top 3 deciles.
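
In symbols (my notation), for an individual with features x:

$$\mathrm{lift}(x) = P(\mathrm{purchase} \mid x,\ \mathrm{promotion}) - P(\mathrm{purchase} \mid x,\ \mathrm{no\ promotion})$$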

This approach was described by Victor S. Y. Lo in 2002, in his paper “The True Lift Model — A Novel Data Mining Approach to Response Modeling in Database Marketing”.

Fig 10 — Victor S. Y. Lo's proposed methodology, 2002

For this dataset, we can manually try a few cutoff percentiles or perform a grid search to find the optimal cut-off percentile.
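
A minimal sketch of the lift and cutoff logic, assuming model_treat and model_ctrl have been fitted on the promotional and non-promotional groups and X_test holds the encoded features of the customers to score:

```python
import numpy as np

# Difference between the two models' predicted purchase probabilities.
p_treat = model_treat.predict_proba(X_test)[:, 1]
p_ctrl = model_ctrl.predict_proba(X_test)[:, 1]
lift = p_treat - p_ctrl

# Send the promotion to the top 3 deciles (top 30%) by lift.
cutoff = np.percentile(lift, 70)
send_promotion = lift >= cutoff
```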

Different classification models were considered, guided by the ‘scikit-learn algorithm cheat-sheet’ (which I really appreciate for the hints it gives in choosing a model):

  • My first pick was the LinearSVC model. I was really concerned about training time, as my computer had started to become really slow. A linear classifier is relatively faster than a non-linear one, has mainly one parameter to tune (C: the regularization parameter), and is memory-efficient (again, advantages for my laptop…).
    Besides, the dataset is not so big, and the Support Vector Machine algorithm is known to be unsuitable for large ones.
    However, it requires a clear margin of separation between classes, and it was found later that this boundary is actually not so clearly delimited…
  • Secondly, the Gradient Boosting Classifier. It produces a prediction model as an ensemble of weak prediction models, typically decision trees, combining the weak “learners” into a single strong learner in an iterative fashion.
    Unlike the LinearSVC model, training takes longer because the trees are built sequentially.
    One of its disadvantages is its tendency to overfit; therefore, more parameters are taken into consideration during tuning, such as the learning rate (shrinkage) or the tree depth, to avoid this problem.
  • Last but not least, I picked the Logistic Regression model. It is one of the simplest machine learning algorithms; consequently, training it requires neither heavy computation nor many parameters to tune (C: the regularization parameter).
    Logistic Regression outputs well-calibrated probabilities along with classification results, which is needed to determine our lift.
    Nevertheless, like the LinearSVC model, it is highly dependent on a clear linear separation between classes, since it has a linear decision surface.

My expectation is then that the Gradient Boosting Classifier outperforms the others according to our metrics (balanced accuracy, specificity and the optimized scorer), even though it has a risk of overfitting.

After splitting the data into training and test datasets (ratio 2.5:1), I implemented a GridSearchCV to tune the parameters described previously and get the best models possible.

Hyperparameters tuned in GridSearchCV
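
A sketch of the tuning step; the grid values are illustrative assumptions rather than the exact ones used, and optimized_scorer is the scorer sketched above:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.01, 0.1, 0.5],  # shrinkage
    "max_depth": [2, 3, 5],             # tree depth, to limit overfitting
    "n_estimators": [100, 200],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, scoring=optimized_scorer, cv=3)
search.fit(X_resampled, y_resampled)    # the SMOTE-balanced training data
print(search.best_params_)
```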

The results below are the best models found after the tuning process. As expected, the Gradient Boosting Classifier finished first on all the metrics.

Fig 11— Results from the trained models
  • The balanced accuracy is calculated as the average of the proportion of correct predictions for each class individually: real purchasers / real non-purchasers.
  • The closer a model's value is to 1 on each criterion, the better.
  • “Optimize” corresponds to the score given by the optimal scorer function established above.
Fig 12 — Confusion matrix

Only 3 of the 80 real purchasers were identified as such, which explains the model's really low Recall.

Our remaining step is then to try our Two Model Approach (using the Gradient Boosting Classifier as base) to verify whether we can target the right customers for our promotion. Let's first recall the ‘lift’:

Lift definition

If the lift computed for a customer is high enough, he or she is appended to the group receiving the promotion. A test dataset for this promotion strategy has been provided; it compares the candidates from our predicted promotion group to the real test group. It turned out that the performance of this model was worse than expected. Taking our promotion metrics IRR and NIR as reference:

  • an IRR of 1.16%
  • a NIR of $-21.35

are achieved, which is not as good as Starbucks' model.
As reference figures, the company came up with:

  • an IRR of 1.88%.
  • a NIR of $189.45.

It should also be noted that this method is extremely sensitive to changes in the cut-off percentile used for this dataset: adding or subtracting a few percentiles resulted in drastically different NIRs. Hence, the Two Model Approach is not recommended for this dataset.

Part 5. The One Model Approach

In this approach, we add a new label (new feature: “receptive”), which consists of assigning 1 to customers who received promotions and made purchases, and 0 to everyone else. In other words, we want the model to find the individuals who are likely to purchase only after they receive a promotion. Our task is to see if the new promotion has a positive impact on customers.

Instead of implementing a model which tries to find purchasers (attribute “purchase”), we will try this time to find receptive customers (attribute “receptive”).
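
A minimal sketch of the labelling step; the “Promotion” column name and its “Yes”/“No” values are my assumptions based on the data description:

```python
# 1 for customers who received the promotion AND purchased, 0 otherwise.
train["receptive"] = ((train["Promotion"] == "Yes") &
                      (train["purchase"] == 1)).astype(int)
```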

After splitting the data into training and test datasets (ratio 2.5:1) and searching for optimal hyperparameters via GridSearchCV, the GradientBoostingClassifier once again outperforms the LinearSVC and LogisticRegression models:

Fig 13 — Results from the trained models

Compared to the Two Model Approach, the Recall is higher (+0.43) but the Specificity lower (-0.25).

Fig 14 — Confusion matrix

99 of the 214 real receptive customers were correctly detected (115 were missed). However, 6758 non-receptive customers were wrongly classified as receptive. Overall, the classification works quite well, as reflected by the balanced accuracy of 0.59.

This One Model Approach outperforms Starbucks' strategy with:

  • an IRR of 2.0%.
  • a NIR of $296.

The performance gained by assigning this “receptive” label might partly be luck, but the overall results are very optimistic.
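
As a closing sketch, both evaluation metrics can be computed directly from their definitions in the introduction, assuming two binary arrays (promotion received, product purchased) for the evaluated customers:

```python
import numpy as np

def promotion_metrics(promotion, purchase):
    """IRR and NIR, following the definitions from the introduction."""
    promotion = np.asarray(promotion)
    purchase = np.asarray(purchase)

    n_treat = (promotion == 1).sum()
    n_ctrl = (promotion == 0).sum()
    purch_treat = purchase[promotion == 1].sum()
    purch_ctrl = purchase[promotion == 0].sum()

    irr = purch_treat / n_treat - purch_ctrl / n_ctrl
    nir = 10 * purch_treat - 0.15 * n_treat - 10 * purch_ctrl
    return irr, nir
```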

Conclusion

After preparing the data, we proved that we had enough of it to produce sound analyses. We also handled its imbalanced property and trained models based on two different approaches.

The simplest approach showed optimistic results and even did better than Starbucks' model. We definitely reached our target, and our model predicted the profiles of receptive customers. On the other hand, the complexity of the Two Model Approach, combined with its sensitivity to the cut-off, prevented it from working.

The biggest challenge was handling the imbalanced data. It is true that SMOTE helped us overcome this issue, but it also has disadvantages: while generating synthetic examples, SMOTE does not take into consideration that neighbouring examples can be from other classes. This can increase the overlap between classes and introduce additional noise. And as found before, the decision boundary between our two classes is not so clearly defined, so this might have biased our results.

However, many other approaches could have been explored too: for instance, revisiting the One Model Approach without the receptive label and predicting purchasers directly. More hyperparameters of the GradientBoostingClassifier could have been tuned via GridSearchCV, or these results simply compared to other classifier algorithms.

For more detailed information, the project is available here.
