The ultimate guide to A/B testing. Part 1: experiment design

Maybe it would have been even worse if the game didn’t have that new mode.

In this case, the best way to be sure would have been to run an A/B test for the new feature, releasing it to only a limited part of the audience and keeping the rest as a control group.

In this article, we will talk about approaches that are most suitable for this and similar cases.

Experiment design

It’s always better to think about experiment design before starting an A/B test. This includes several considerations:

1. Formulate the null and alternative hypotheses

Let’s go back to the example with the game: in this case, we are releasing the new mode and expecting that it will change users’ behavior.

In other words, there are two possible outcomes after releasing the feature: either it affects players’ behavior or it does not.

So the null and alternative hypotheses, in this case, should be formulated as follows:

H0: the new game mode hasn’t changed anything within the game, so the players’ metrics in both groups (test and control) come from one population with the same distribution.

H1: the new feature has actually changed people’s behavior, so the metrics have either increased or decreased (here, you may consider a one-sided test if you are sure about the direction of the effect, i.e. that the metrics can only have increased or only have decreased).

In this case, you will expect the groups to belong to two different populations with different attributes (mean, standard deviation).

The A/B test aims to reject the null hypothesis at a chosen significance level: H0 is rejected when the p-value of the test falls below that level.

2. Plan the metrics you are going to check and the possible outcomes of the test

After formulating the null and alternative hypotheses you already more or less know what to expect, but it is always better to think about the exact metrics to be used in the test.

This lets you calculate the sample sizes needed to detect a significant difference.

Usually, there are three possible categories of metrics:

Type 1: a simple case with only two possible alternatives (yes/no, churned/returned, etc.).

Type 2: a more complicated case with more than two mutually exclusive alternatives.

Type 3: continuous variables (average session time, number of sessions, win rates, etc.).

For the first two categories, the results are expressed as percentages, while the third one is usually summarized in means and standard deviations.

The reason you want to know in advance which category your experiment falls under is that different types of metrics call for different statistical methods.

The first two types usually require bigger sample sizes than the third one.

In our case of the new gameplay mode, we could go for a bunch of metrics: Retention, Average session time, Number of sessions per player, etc.

Let’s say that the main target, in this case, would be to increase both Retention rates on day-1 and average session time.

This means we have one metric of type 1 (returned/churned) and one continuous variable (session time).
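To make the two metric types concrete, here is a minimal Python sketch that summarizes a handful of player records. The `players` list and the `summarize` helper are made up for illustration; real data would come from your own analytics tables.

```python
from statistics import mean, stdev

# Hypothetical per-player records:
# (returned on day 1?, average session time in minutes)
players = [
    (True, 9.5),
    (False, 4.0),
    (True, 12.0),
    (False, 6.5),
    (True, 8.0),
]

def summarize(records):
    """Return day-1 retention (a proportion) plus mean and sample
    standard deviation of session time (a continuous variable)."""
    returned = [flag for flag, _ in records]
    minutes = [m for _, m in records]
    retention = sum(returned) / len(returned)  # share of players who came back
    return retention, mean(minutes), stdev(minutes)

retention, avg_minutes, sd_minutes = summarize(players)
print(f"day-1 retention: {retention:.0%}")
print(f"session time: mean={avg_minutes:.1f} min, sd={sd_minutes:.1f} min")
```

The retention figure is a proportion (type 1), while session time is described by a mean and a standard deviation (type 3), which is why the two need different sample size estimates.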

3. Estimate the sample sizes you need for the test

As soon as we have decided on the target, it’s easier to estimate the sample size needed to spot a statistically significant difference. For the different metric categories there are different approaches to this problem.

3.1 Confidence interval for a proportion

This approach is applicable to the first and second types of metrics defined above, when we are looking for a change in proportion between groups with acceptable precision.

The general formula is as follows:

$$n = \frac{Z_{1-\alpha/2}^{2}\, p\,(1-p)}{E^{2}}$$

In this formula, $n$ is the required sample size, $p$ is the hypothesized population proportion, $Z_{1-\alpha/2}$ is the value from the standard normal distribution table corresponding to half of the alpha level, and $E$ is half of the width of the desired confidence interval. The alpha level is the probability of rejecting the null hypothesis when it’s true: for example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.

For the example with Retention on day-1 we can estimate the sample size. Let’s say we know that current retention is 40% and expect the new game mode to increase it by at least 2 percentage points (our confidence interval, in this case, will be 4 points wide: 2 above and 2 below the estimate). This means $p = 0.4$ and $E = 0.02$, with $\alpha = 0.05$, so $Z_{1-\alpha/2} = 1.96$:

$$n = \frac{1.96^{2} \cdot 0.4 \cdot 0.6}{0.02^{2}} \approx 2305$$

So, in our case there should be at least 2305 users in the test group to be sure that a 2-point difference in day-1 retention is statistically significant.
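The same estimate can be sketched in a few lines of Python. This is only an illustration: the function name `sample_size_for_proportion` and its parameters are assumptions, and it implements the usual confidence-interval formula n = z² · p · (1 − p) / E², where E is the half-width of the interval.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(p, half_width, alpha=0.05):
    """Users needed so that a (1 - alpha) confidence interval around
    proportion p is no wider than +/- half_width."""
    # z-value for a two-sided interval, e.g. about 1.96 for alpha = 0.05
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(z ** 2 * p * (1 - p) / half_width ** 2)

# Day-1 retention example: baseline 40%, precision of 2 percentage points
n_retention = sample_size_for_proportion(p=0.40, half_width=0.02)
print(n_retention)  # 2305
```

Halving the half-width quadruples the required sample, which is why small expected changes in a proportion get expensive quickly.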

3.2 Power for the test of the difference between two sample means

This type of estimation can be used in cases where the target of a test is a continuous variable.

In our example, it will be the average session time.

The formula looks like this:

$$n = \frac{2\,(Z_{1-\alpha/2} + Z_{1-\beta})^{2}}{\Delta^{2}}$$

where the two Z-values depend on the $\alpha$-level (just like in the previous example, $Z_{1-\alpha/2} = 1.96$ with $\alpha = 5\%$) and on the level of statistical power $1-\beta$. Power ranges from 0 to 1, and as it increases, the probability of wrongly failing to reject the null hypothesis (the so-called type II error) decreases. For a type II error with probability equal to $\beta$, the corresponding statistical power is $1-\beta$.

We will compute the sample size required for 80% power, so $Z_{1-\beta}$ will be 0.84.

$\Delta$ is the effect size, and it equals the difference between the test and control group means divided by the standard deviation $\sigma$:

$$\Delta = \frac{|\mu_{test} - \mu_{control}|}{\sigma}$$

$\sigma$ can be calculated using the following formula (you can use available historical data in the control group):

$$\sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^{2}}$$

So, let’s say, in our example, we know that the average session time is 8 minutes and we expect the new feature to increase it by 1 minute.

Also, from historical data, the standard deviation for the average session time is 4 minutes.

In this case, the effect size is 1/4 = 0.25, so:

$$n = \frac{2\,(1.96 + 0.84)^{2}}{0.25^{2}} \approx 251$$

This means we’ll need about 250 users in the test group to be sure that the 1-minute difference in average session time is statistically significant.
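Here is a hedged Python sketch of this power calculation. It assumes the common two-sample formula n = 2 · (z₁₋α/₂ + z₁₋β)² / Δ², with Δ equal to the expected difference divided by the standard deviation; the function name `sample_size_for_means` and its parameters are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_for_means(diff, sd, alpha=0.05, power=0.80):
    """Users per group needed to detect a difference `diff` between two
    means, given standard deviation `sd`, significance level `alpha`
    and statistical power `power` (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    effect = diff / sd                             # standardized effect size
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect ** 2)

# Session time example: +1 minute expected, historical sd of 4 minutes
n_session = sample_size_for_means(diff=1.0, sd=4.0)
print(n_session)
```

Because the code uses unrounded z-values and rounds up, it returns a figure a couple of users above the hand calculation with 1.96 and 0.84; the order of magnitude is what matters for planning.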

As we see, the retention test needs many more users (about 2300 vs 250), so the sample size for the experiment should be taken from that estimate.

In the next article, we will talk about the results of a test and statistical techniques suitable for different situations.

Stay tuned!