If we interpreted the p-value as “Given that the true difference between the two versions is zero, the probability of observing the difference of 4.32% is 1.39%”, we would be completely wrong.

False positive rate shouldn’t be confused with the false discovery rate

If we could run the above t-test more than once, say tens of thousands of times, we would get a much clearer picture of how likely we are to give a wrong answer, like the one in the example above.

Luckily, R (through RStudio, for example) lets us do that with simulations.¹

Let’s do 10,000 t-tests.

First, we need two distributions of the conversion rates A and B.

I’ll take the beta distribution.

The two distributions will be defined by their shape parameters, which I will set to:

Version A

# conversions (i.e. the number of visitors who subscribed)
> shape1 <- 2
# visitors who chose not to subscribe (complement of shape1)
> shape2 <- 131

Version B

# conversions (i.e. the number of visitors who subscribed)
> shape1 <- 4
# visitors who chose not to subscribe (complement of shape1)
> shape2 <- 137

Please note that I picked these distributions as the ground truth. Meaning, the true mean of the difference is 1.3%.

The example in Figure 1 is sampled from these, hence the difference in the conversion rates.
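The ground-truth difference can be checked directly from the shape parameters, since a beta distribution with parameters a and b has mean a / (a + b). The sketch below is only an illustration; the helper names `beta_mean` and `beta_sd` are made up for this example.

```r
# Mean and standard deviation of a Beta(a, b) distribution
beta_mean <- function(a, b) a / (a + b)
beta_sd   <- function(a, b) sqrt(a * b / ((a + b)^2 * (a + b + 1)))

mean_a <- beta_mean(2, 131)   # version A
mean_b <- beta_mean(4, 137)   # version B

# True difference between the versions, about 0.0133 (1.3%)
true_diff <- mean_b - mean_a

# Spread of the difference of two independent draws, about 0.0175
sd_diff <- sqrt(beta_sd(2, 131)^2 + beta_sd(4, 137)^2)

true_diff
sd_diff
```

The spread of the difference is larger than the difference itself, which is the first hint that a single comparison will be noisy.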

The beta distribution is nearly normal if the two parameters are sufficiently large and close to each other.

But the number of visitors that we have is quite small, and the number of subscribers relative to the number of visitors who chose not to subscribe is even more problematic.

This means that one of the two conditions assumed by the t-test is not satisfied.

Nevertheless, this is our simulation, our playground.

Let’s see what would happen if we assumed the two samples are nearly normal.²

Figure 2. Density plots after 10,000 simulated t-tests. The distribution of the null hypothesis is presented in grey. Simulated data with the mean being the true difference between versions A and B is shown in blue. Red dotted lines represent the 95% confidence interval of the null hypothesis. The full grey line represents the example observed effect size of 4.32%.

Figure 2 shows how small and noisy the true effect size is.

It almost overlaps with the null hypothesis.

Small true effect size plus noisy measurement always yields unreliable tests.

Below, Figure 3 shows the power of our t-test.

It is just 11%.

Given the fact that the null hypothesis is false, only 11% of our t-tests would correctly reject it.

This is an extreme example of an underpowered test.
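A power of this size can be approximated with a few lines of base R. This is only a sketch under a simplifying assumption, not the code referenced in footnote 1: each simulated test is treated as one draw of the difference between the two beta distributions, judged against a normal null distribution centred at zero with the same spread. The names `n_tests`, `diff_ab` and `se_diff` are illustrative.

```r
set.seed(2019)                       # arbitrary seed, only for reproducibility
n_tests <- 10000

# One simulated conversion rate per version and per test (ground truth from above)
rate_a  <- rbeta(n_tests, shape1 = 2, shape2 = 131)
rate_b  <- rbeta(n_tests, shape1 = 4, shape2 = 137)
diff_ab <- rate_b - rate_a

# Assumption: the null distribution is normal, centred at 0, with the same spread
se_diff  <- sd(diff_ab)
critical <- qnorm(0.975) * se_diff   # two-sided test at the 0.05 level

# Power = share of simulated tests that reject the (false) null hypothesis
significant    <- abs(diff_ab) > critical
power_estimate <- mean(significant)
power_estimate
```

Under these assumptions the estimate should come out close to the 11% quoted above, although the exact value depends on the seed and on how well the normal approximation holds.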

In order to calculate the true positive rate using the power of our test, we need to postulate the fraction of tests that show the real effect (null hypothesis is false).

This fraction can’t be measured precisely.

It is basically a Bayesian prior.

In general, we would have to postulate it before the start of the experiment.

I’m going to use 10%, since 1,064 of our 10,000 t-tests show a positive difference whose 95% confidence interval doesn’t contain zero.

So we have the following:

# number of true positives
> true_positive <- 10000 * 0.1 * 0.11
> true_positive
[1] 110

# number of false positives
> false_positive <- 10000 * 0.9 * 0.05
> false_positive
[1] 450

# where 10000 is the number of simulated samples
# 0.1 is the fraction of samples that contain the real effect
# 0.11 is the power
# 0.9 is the fraction of samples that don't contain the real effect
# 0.05 is the significance level

Now we can calculate the false discovery rate as the ratio of false positives to the total number of positive findings:

# false discovery rate
> false_discovery_rate <- false_positive / (false_positive + true_positive)
> false_discovery_rate
[1] 0.8036

Because the power of our test is so low, 80% of the time our “statistically significant” finding will be wrong.

So in our example, with the observed difference of 4.32% and the p-value of 0.0139, the chance that we made a type-I error is not 1.39% but 80%.
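To see how strongly this number depends on the power, the same arithmetic can be wrapped in a small function. This is a sketch only; the function name `fdr` and the 80% power used for comparison are assumptions for illustration, not values from the experiment.

```r
# False discovery rate from the prior fraction of real effects,
# the power of the test and the significance level
fdr <- function(prior, power, alpha = 0.05) {
  true_positive  <- prior * power          # real effects that are detected
  false_positive <- (1 - prior) * alpha    # null effects flagged as significant
  false_positive / (false_positive + true_positive)
}

fdr_low_power  <- fdr(prior = 0.1, power = 0.11)  # same inputs as above, about 0.80
fdr_high_power <- fdr(prior = 0.1, power = 0.80)  # hypothetical well-powered test, 0.36

fdr_low_power
fdr_high_power
```

Even a well-powered test leaves a sizeable false discovery rate when only 10% of tested ideas have a real effect, but low power makes it far worse.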

Type M (exaggeration) and Type S (sign) errors

In underpowered tests, the problem doesn’t end with false discoveries.

In Figure 4 we can see the distribution of p-values of our 10,000 t-tests. 1,105 of them are equal to or less than 0.05, which is expected since the power is 11%. The mean value of the differences between the two versions of the website that are found to be significant is 4.08%. That means that, on average, a t-test would produce a significant result that is 3 times larger than the true effect size. Even further, out of those 1,105 values for the difference, 57 have the wrong sign.
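Both quantities can be estimated from simulated differences. The sketch below makes the same simplifying assumption as before (one draw of the difference per test, normal null distribution with matching spread), so its numbers will not reproduce the figures above exactly; the variable names are illustrative.

```r
set.seed(42)                         # arbitrary seed, only for reproducibility
n_tests <- 10000

# Simulated differences between version B and version A
diff_ab   <- rbeta(n_tests, 4, 137) - rbeta(n_tests, 2, 131)
true_diff <- 4 / 141 - 2 / 133       # difference of the two beta means

# Assumption: a result is "significant" if it lies outside the central
# 95% of a normal null distribution with the same spread
se_diff     <- sd(diff_ab)
significant <- abs(diff_ab) > qnorm(0.975) * se_diff
sig_diffs   <- diff_ab[significant]

# Type M: how much significant results overstate the true effect on average
exaggeration <- mean(abs(sig_diffs)) / true_diff

# Type S: share of significant results that point in the wrong direction
wrong_sign <- mean(sig_diffs < 0)

exaggeration
wrong_sign
```

Because only the most extreme draws pass the significance filter, the surviving estimates are necessarily far from the small true difference.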

There was a small chance we could have observed version A performing better than version B of our website!

There is a library in R which can be used for calculating these two errors and the power. It’s called retrodesign.³ The output for our true effect size and pooled standard error is:

> library(retrodesign)
> retrodesign(0.013, 0.017)
$power
[1] 0.1156893

$typeS
[1] 0.02948256

$exaggeration
[1] 3.278983

What can be done?

Underpowered statistical tests are always misleading.

In order not to draw wrong conclusions from the calculated p-value, a few things have to be done before conducting an A/B test:

An estimate for the true effect size has to be set. Due to bias and the tendency of larger effects to have small p-values, observed effect sizes tend to be exaggerated.

Based on that estimate and the measured noise, the proper sample size has to be set in order to achieve an acceptable level of power. As seen above, low power makes the test unreproducible.

The significance level should be set to a value that is less than 0.003. According to James Berger [5,6], a p-value of 0.0027 corresponds to a false discovery rate of 4.2%, which is close to the false positive rate at the 0.05 significance level.
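The last two steps can be sketched in base R. The sample size part uses `power.prop.test` from the stats package, which assumes a two-sample test of proportions with a normal approximation; the conversion rates plugged in are the ground-truth means from this example, and the 80% target power is an assumption. The second part uses the calibration from Sellke, Bayarri and Berger [5], where -e · p · ln(p) bounds the Bayes factor for p < 1/e and the null and the alternative are given equal prior probability. The function name `calibrated_error` is made up for this illustration.

```r
# Assumed conversion rates: the ground-truth means of versions A and B
p_a <- 2 / 133
p_b <- 4 / 141

# Visitors needed per version for 80% power at the 0.05 significance level
plan <- power.prop.test(p1 = p_a, p2 = p_b, sig.level = 0.05, power = 0.80)
n_per_version <- ceiling(plan$n)
n_per_version

# Calibrated lower bound on the error probability for a given p-value,
# valid for 0 < p < 1/e and equal prior probability of null and alternative
calibrated_error <- function(p) {
  stopifnot(all(p > 0), all(p < exp(-1)))
  bayes_factor_bound <- -exp(1) * p * log(p)
  1 / (1 + 1 / bayes_factor_bound)
}

error_strict  <- calibrated_error(0.0027)  # about 0.042
error_classic <- calibrated_error(0.05)    # about 0.29

error_strict
error_classic
```

The required sample size per version is far larger than the roughly 130 to 140 visitors behind the shape parameters used in this example.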

¹ Code for these simulations is available on my GitHub.

² These shape parameters will give us a large standard error and low power, which is what we need for this example.

Non-normal data, caused by insufficient sample sizes or skewness, occur all too often.

Sampling distributions of conversion rates are frequently assumed to be nearly normal without carefully checking the conditions.

This leads, as the example shows, to false findings.

³ The retrodesign library is based on the function introduced by Andrew Gelman and John Carlin in their paper [1].

References

[1] Andrew Gelman and John Carlin (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science. Vol. 9(6) 641–651. (DOI: 10.1177/1745691614551642)

[2] David Colquhoun (2014). An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. open sci. 1: 140216. http://dx.doi.org/10.1098/rsos.140216

[3] Chris Stucchio (2013). Analyzing conversion rates with Bayes Rule (Bayesian statistics tutorial). https://www.chrisstucchio.com/blog/2013/bayesian_analysis_conversion_rates.html

[4] Peter Borden (2014). How Optimizely (Almost) Got Me Fired. https://blog.sumall.com/journal/optimizely-got-me-fired.html

[5] Sellke T, Bayarri MJ, Berger JO. (2001). Calibration of p values for testing precise null hypotheses. Am. Stat. 55, 62–71. (DOI: 10.1198/000313001300339950)

[6] Berger JO, Sellke T. (1987). Testing a point null hypothesis: the irreconcilability of p-values and evidence. J. Am. Stat. Assoc. 82, 112–122. (DOI: 10.1080/01621459.1987.10478397)