Marketing A/B Testing at Zalando

Zalando Office Tamara-Danz-Straße, Berlin-FriedrichshainIn-Depth AnalysisMarketing A/B Testing at ZalandoEnabling Location Based A/B Tests Using Cluster AnalysisVictoria OsterlohBlockedUnblockFollowFollowingJun 24Contributors: Carsten Rasch, Thomas Perl, Martin Kasten, Jean de BressyThe goal of marketing A/B test analysis at Zalando is to derive the incremental impact of marketing actions.

The results of these analyses form the basis for an optimal allocation of budget across all marketing channels and thus an efficient marketing ROI steering.

This is the mission of our A/B Testing Team in Demand Planning & Analytics at Zalando.

One of our main testing methods are geo experiments, in which we split a market into highly correlated regional groups.

Highly correlated sales behaviors between the test and control groups are a precondition for geo A/B experiments, as low correlations between test and control regions lead to noise in the prediction and thus to a lower probability to identify an effect, if there is one.

Finland is the only market where marketing channels are not ROI-driven, due to the fact that geo location based A/B testing is not straightforward in this country.

The reason for this is that the majority of the Finnish population lives in the south, with about 30% in the region Uusimaa alone [1].

Consequently, there is a large regional difference of the total order amount.

The low number of daily orders in northern Finland regions make it almost impossible to find highly correlated groups when comparing time series based on given regional aggregates, e.


federal state, municipality.

As a result, we need to find a different location definition for Finland.

One solution to overcome this issue, is to split the country into smaller clusters compared to the existing region definition.

In this way, the region splitting combinations can be increased and therefore also the possibility to find “regional twins”.

This could be achieved by using Google Marketing Areas (GMA’s).

However, GMA definitions are not available for Finland.

For this reason, we create city cluster definitions in order to make Finland testable.

This is realized with the K-Means Cluster Analysis method.

Finding clusters in city location dataThe K-Means Clustering Algorithm is a common method to find K different classes of similar objects in data sets [2].

In this case, the clustering approach relies on publicly available city location data [3], including the GPS coordinates with the latitude and longitude information of 317 cities in Finland [Fig.


Figure 1: GPS data of 317 Finnish locations, shown in a Cartesian coordinate system (left) and in a cylindrical projection map [4] (right).

In order to show the regional distribution of the Finnish cities, the Cartesian coordinates were transformed into cylindrical coordinates and plotted on the map of Finland by means of the Geo Point Plotter [4].

The used data set is restricted to the list of cities which are steerable by ad delivery systems as Google and Facebook.

If the input data would include also not steerable cities, there could be the risk of a non-useable clustering result due to unclean A/B test splits, and probably insufficient number of observations within the clusters.

Besides the raw data, the algorithm [Fig.

2] requires the cluster number K as input.

The reason is that clustering is an unsupervised learning method [5], meaning that the algorithm “uncover[s] hidden structure from unlabeled data” [6] and does not derive the optimum K parameter automatically [7].

Figure 2: K-Means Clustering Algorithm.

After the K parameter has been defined, the K-Means Clustering Algorithm proceeds in 3 steps [8]:Initialization of K cluster centers (centroids) at random locations;Grouping of the data based on minimum Euclidean distance between the data points and the centroids;Recalculation of the cluster centers by averaging all data points which have been assigned to the respective cluster.

The last two steps are iteratively repeated until the algorithm converges into a stable cluster assignment.

The convergence criteria is reached, when the intra-cluster variance cannot be lowered anymore [7], so that the clusters are as compact as possible [9].

More formally, given the data points {x1,…,xn} and the centroids {c1,…,ck} of k clusters, this means “to minimize […] the squared error function” [7]:In other words, the objective of the equation 1.

1 is to minimize the sum over all groups of the sum of squared distances within the clusters.

The threshold of convergence is 1e-4 [10].

The expressionin the above formula is the Euclidean distance function and can also be written as [11]:The Euclidean distance is a common metric for distance measuring and defined as “the square root of the sum of squared differences between corresponding elements of the two vectors” [11] x and c.

This metric is used to assign the data points to the nearest centroid based on the minimum distance between them [12].

It is worth mentioning that the algorithm converges in any case to a local minimum, which is not necessarily the global minimum [12].

This means it cannot be guaranteed that the current result is the best possible output, since the random choice of the initial centroids can lead to different clustering results for each run.

In order to find a potentially better outcome, the algorithm execution should be repeated several times, as shown in figure 3.

Figure 3: Clustering score for a fixed K (K=7) after 1000 iterations.

Figure 3 indicates that the sum of the square distances (SSE) is not completely stable over the 1000 iteration steps, meaning that each run can yield to slightly different SSE due to random start parameters.

Choosing the optimum number of clustersThe value of the K parameter determines the number of clusters, and thus also the assignment of the data points to the clusters.

In our case, the “elbow method” is used as a common technique to estimate the optimum number of clusters K.

As a second step, we validate the result of the elbow method by comparing the number of orders between the clusters.

Balancing the order volume of the clusters is important, as noisy data produced by too small clusters diminishes the probability to achieve significant results in geo A/B testing.

The “elbow method” includes the following steps [13]:Executing the K-Means algorithm on the data set for different values of K -in this case K ranges between 1 and 20;Computing the SSE between the cities and the centroids for each K;Plotting the results in a line chart.

The SSE decreases with increasing K parameter.

The chosen K parameter should be at a point where its value and also the SSE are still small.

There is one K at which the rate of the SSE decrease sharply changes and the curve starts to level off.

This is the optimal K parameter and called “elbow point”.

Figure 4: Clustering score for a range of different K parameters.

Figure 4 shows that the elbow point is not clear to identify in this case but a range of K parameters can be considered, as the rate of the SSE decrease is higher for smaller K values from 1 up to about 8, before the curve starts to flatten.

For this reason, the K-Means algorithm was executed for this value range of different K parameters.

In this way, the optimal K parameter could be found by comparing the respective clustering results and choosing the K value which yields a cluster allocation where the intra cluster variance of the total order amount is as low as possible.

The lower the intra cluster variance, the higher the probability for highly correlated clusters and thus to detect a potential effect in the geo A/B test.

Since at least two clusters are needed as control or test group, a K parameter of 1 can be neglected.

K values between 2 and 6 result in a suboptimal grouping of the cities, which is reflected by the fact that the order volume varies considerably between the clusters.

High variation in the clusters order level would mean that the country is rather split into high populated regions and rural areas.

This could lead to lower intra cluster correlations due to higher variations in the sales behaviors between the regions and thus to a smaller probability to detect a potential effect.

For a K value greater than 6, the biggest cities Tampere and Helsinki are grouped separately and the number of orders per cluster is more balanced, which increases the chance for highly correlated clusters.

The K parameter K=7 turns out to be the best value for the used data set, as a K parameter of K=8 would not lead to a further improvement of the clustering output.

The reason is that this simply results in a further splitting of northern Finland into an additional cluster, meaning that the less populated part of the country is grouped into even smaller clusters.

The consequence is an again increasing intra cluster variance of the order volume.

Validation of the K parameterAs the initial centroids are randomly located, the algorithm could deliver different results for every run [14].

The magnitude of the deviation of the sum of the square distances between several iterations can indicate how stable the clustering algorithm is.

To figure this out, the algorithm has been executed 1000 times and the sum of the square distances was plotted for every run [see Fig.


The line plotted in Figure 3 shows that although the output of the algorithm is not identical for each run, the variation of the sum of square distances between the 1000 iterations is small.

That means the clustering results differ only slightly and the clustering score is quite stable for K=7.

In order to analyze the variability of the cluster assignments between the cluster results of several runs, 6 clustering outputs have been generated by repeating the algorithm for a fixed K value of K=7 (Fig.


Figure 5: Clustering results of 6 runs of the K-Means algorithm for K=7 with the 7 city clusters and their centroids (marked as stars).

The comparison perspective is the variability of the centroids locations over the 6 algorithm runs to figure out which cluster variations are responsible for the small fluctuations in the clustering score.

The comparison of the cluster outputs shows that the locations of the centroids remain quite stable over the 6 runs.

This implies that the cluster assignments change just for a few individual cities, the largest proportion of the data points stays in the same cluster across the iterations.

Only the most northern cluster which mainly includes the Lapland region shows visible changes in the centroid location.

This is due to the low number of observations in this region and the resulting larger distances between them.

Thus, the variations in the sum of square distances between the 1000 iterations (as described above) is probably mainly caused by the relatively unstable group assignment of the northern located cities.

Since the cluster of these cities will be grouped with its neighboring clusters, as shown below, the variation in the cluster assignment can be neglected.

Clustering result: Testable geo splitThe K-Means classification yielded seven city clusters with its centers (Fig.

6 left).

In order to ensure that the daily order amount in each of these clusters is on a testable level, five smaller clusters were grouped to form a larger region (Fig.

6 right).

Figure 6: Result of the K-Means Cluster Analysis, showing the output with 7 clusters (left) and the final grouping into 3 geo testing regions (right).

The lower the number of daily orders, the higher the risk for noise in the data.

The consequences of noisy data are lower correlations between the clusters and thus a lower probability to detect a potential effect.

As a result, we got three geo split regions: “Helsinki” (region 1), “Tampere” (region 2) and “Northern Finland” (region 3).

This region split proves to be a testable setup, since the correlations between the regional time series reach values about 98% (in terms of total orders) for the time period of October 2018 until mid-January 2019.

In addition, the values roughly remain on the same scale (96%-98%) when calculating the correlations for the entire year 2018.

Since Helsinki is the biggest region in terms of total orders, it was assigned to the test group and Tampere and Northern Finland to the control group.

The test region should have the highest total order share, as this is the group in which the tested channel is turned on.

This ensures that the tested campaign runs in a large part of the country whereby the reduction of the total impression number by the geo A/B test is kept as small as possible.

This is important to guarantee the optimal campaign reach despite testing, as testing always requires that some regions have to be excluded from the campaign targeting.

To validate the split, the regions are tested on possible effects.

The relative effect differences between Helsinki (test group) and the two regions Tampere and Northern Finland (control group) were tested by an uplift analysis.

The analysis resulted in a not significant effect of 0.

2% (52% significance) with a correlation of 99.

5% between both groups.

This result confirms the highly correlated total orders behavior and that there is no significant difference between the groups before the test, as the measured effect is close to zero and clearly below the significance level of 90%.

The clustering result was used to implement the first geo split setup in Finland.

The objective of this geo test is to measure the incremental performance of Display Programmatic during the Zalando Season Start campaign 2019.

The region Helsinki is defined as test group in which the tested channel is turned on and running for 4 weeks.

The other two regions Tampere and Northern Finland are set as control regions, meaning that Display Programmatic is not turned on in the corresponding cities.

After the test stop the incremental impact on total orders can be calculated which is the cumulated daily difference between the observed data in the test group and the model prediction.

This is the beginning of ROI steering for Finland at Zalando, as the marketing budget allocation in this country could be based on A/B test results for the first time.

Limitations and next stepsThe analysis of the Display Programmatic geo split test during the Zalando Season Start campaign 2019 yields significant and plausible results.

That means that the illustrated approach has proved to be a useful method to create test and control regions for geo A/B tests in Finland.

Although the used test setup resulted in highly correlated groups, there is still room for optimizing the clustering.

Given the fact that the current split contains only 3 groups and that the optimal campaign reach should be ensured despite testing, the options for regrouping are highly limited.

This increases the risk for the occurrence of systematic errors, as the region Helsinki could cause a metropolitan effect due to a potential different behavior between people in metropolises and rural areas.

Reclustering and using a more variable clustering with a higher number of testable groups can help to overcome this issue.

For this purpose, the most populated regions Helsinki and Tampere could be further split into smaller clusters.

A higher number of regions increases the region splitting combinations and thus the geo A/B testing capacity.

References[1] European Commission, Helsinki-Uusimaa region (2/15/2019), Regional Innovation Monitor Plus.

[2] M.

Khan, KMeans Clustering for Classification (8/2/2017), Towards Data Science.

[3] Finland City & Town Population (2004–2019), Tageo – geographic coordinate information.

[4] D.

Watkins, Geo Point Plotter.

A tool to quickly map out a list of geographic coordinates (n.



[5] P.

Sayak, K-Means Clustering in Python with scikit-learn (7/5/2018), DataCamp Tutorials.

[6] G.

Selding, New Course: Unsupervised Learning in Python (2/22/2017), DataCamp Tutorials.

[7] S.

Sayad, An Introduction to Data Science.

K-means Clustering (2010–2019).

[8] F.

Doukkali, Clustering using K-means algorithm (12/19/2017), Towards Data Science.

[9] B.

Boehmke, UC Business Analytics R Programming Guide.

K-means Cluster Analysis (n.


), University of Cincinnati, University Lecture.

[10] Scikit-learn developers.



KMeans (2007–2018).

[11] S.

Borgatti, Distance and Correlation (Spring 2007), Multivariate Statistics, Boston College, University Lecture.

[12] A.

Trevino, Learn Data Science, Machine Learning.

Introduction to K-means Clustering (12/6/2016).

[13] R.

Gove, Using the elbow method to determine the optimal number of clusters for k-means clustering (12/26/2017), Robert Gove’s Blocks.

[14] M.




Santhi, V.



Sai Leela, P.


Anitha, & D.

Nagamalleswari, Enhancing K-Means Clustering Algorithm (2011).

International Journal of Computer Science & Technology, IJCST, 2(4), 73–77.


. More details

Leave a Reply