Even if the lady had no ability to distinguish milk-first from tea-first, she would be expected to get a couple of the milk-first cups by chance.

Fisher’s insight was to quantify the relative likelihood of the experimental outcome, given the lady had no ability.

In essence, this was the null hypothesis that the experiment was designed to collect evidence about.

Ms. Bristol was to choose 4 cups out of 8 as milk-first.

Using a little bit of combinatorics, it turns out that there are (8 choose 4) = 70 distinct ways of choosing 4 out of 8 without regard to order.

(To see this, note that there are 8 × 7 × 6 × 5 = 1680 ways of choosing 4 out of 8 in order, but 4 × 3 × 2 × 1 = 24 different ways of ordering each set of 4, so we get 1680/24 = 70 ways of selecting 4 out of 8 without regard to order.)
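This count can be checked in a couple of lines of Python (a quick sketch using the standard library's math module):

```python
from math import comb, perm

ordered = perm(8, 4)  # 8 * 7 * 6 * 5 = 1680 ordered choices of 4 cups out of 8
reorder = perm(4, 4)  # 4 * 3 * 2 * 1 = 24 orderings of any given set of 4

print(ordered, reorder, ordered // reorder)  # 1680 24 70
print(comb(8, 4))  # 70, directly via the binomial coefficient
```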

With no ability to distinguish between cups, and no ability to distinguish milk-first, each of the 70 possible selections by the lady is equally likely.

Implicit here is the concept of a test statistic T, the quantity that the test is designed to observe.

T in this case is the number of cups the lady correctly selects as being milk-first, and can take on integer values between 0 (all wrong) and 4 (all right).

Under the null hypothesis of no ability, one can compute the likelihood of observing each potential outcome of T: the probabilities are 1/70, 16/70, 36/70, 16/70, and 1/70 for T = 0, 1, 2, 3, and 4, respectively. That is, if Ms. Bristol had no ability, a result as extreme as 4 correct choices would be expected 1.4% of the time (1/70 = 0.014), which is quite rare and evidence against the null hypothesis. A result at least as extreme as 3 out of 4 would occur nearly 1/4 of the time (17/70 = 0.243), a much less rare result under H0.
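These tail probabilities follow from counting how many of the 70 equally likely selections contain exactly k of the true milk-first cups; a minimal sketch in Python:

```python
from math import comb

total = comb(8, 4)  # 70 equally likely selections under the null hypothesis

# P(T = k): choose k of the 4 true milk-first cups and 4 - k of the other 4
null_dist = {k: comb(4, k) * comb(4, 4 - k) / total for k in range(5)}

p_all_four = null_dist[4]                      # 1/70, about 1.4%
p_three_or_more = null_dist[3] + null_dist[4]  # 17/70, about 24.3%
print(p_all_four, p_three_or_more)
```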

The actual outcome of the experiment has always been under some dispute, although many believe that Ms. Bristol did indeed correctly identify all four milk-first cups (take that, you scoffers).

What is known is that Ms. Bristol and Mr. Roach were later married, so perhaps the test was not as objective as it was made out to be! What is also known is that this exceedingly simple experiment contains many of the basic principles of statistical experimental design, which Fisher expounded on in the ensuing years.

By the middle of the twentieth century, designed experiments had taken hold of scientific research.

Statistical inference became an accepted field of research, with designs and analyses becoming far more complex than the simple experiment described here.

Now, in the world of big data and fast computers, the same principles still apply, and are being extended to a much richer set of problems.

An example of how testing has changed is the concept of a permutation test.

In a permutation test, a treatment is applied and a result observed, as in our tea example.

Assume that the null hypothesis is, as before, that of no treatment effect.

If H0 is true, then a positive result is as likely as a negative one.

A permutation test randomly permutes not the observations themselves but their signs, and repeats this random permutation many times, to obtain a reference distribution that the observed value can be compared against.

An example of a permutation test is as follows.

Let 80 strawberry plants be matched in 40 pairs.

Within each pair, one of the plants is self-fertilized, while the other is cross-fertilized.

Under the null hypothesis of no differences between fertilization methods on plant growth, positive and negative differences in growth within a pair should be equally likely.

Let the test statistic T be the observed difference in mean growth between self- and cross-fertilized plants. There are 2^40 different ways to permute the signs of the 40 differences in growth between self- and cross-fertilized plants.

While it is possible to enumerate all of them, it is much easier to simply generate a large number (say 10000) of random permutations of 40 signs, and recalculate the difference in means T for each permutation.
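That simulation might be sketched as follows; since the article's plant data are not given, the differences below are randomly generated stand-ins, so the resulting p-value is illustrative only:

```python
import random
from statistics import mean

random.seed(42)

# Hypothetical within-pair growth differences (self minus cross) for 40 pairs;
# the article's actual measurements are not given, so these are stand-ins.
diffs = [random.gauss(-3.0, 6.0) for _ in range(40)]
t_obs = mean(diffs)  # observed test statistic T

# Randomly flip the sign of each difference 10,000 times and recompute T
# to build a reference distribution under the null hypothesis.
n_perm = 10_000
t_perm = []
for _ in range(n_perm):
    flipped = [d * random.choice((-1.0, 1.0)) for d in diffs]
    t_perm.append(mean(flipped))

# p-value: fraction of permuted T values at least as extreme as the observed one
p_value = sum(abs(t) >= abs(t_obs) for t in t_perm) / n_perm
print(round(t_obs, 3), p_value)
```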

The resulting distribution (seen in Figure 2) can then be used to get a reasonable estimate of how extreme the observed result is, just as with the lady tasting tea.

In this example, with an observed T = -3.007 (as denoted by the red line on the plot), only 3.5% of simulated permutations were more extreme, providing strong evidence against the null hypothesis of no fertilization-method effect.

Furthermore, as in the previous example, no assumption was made about the distribution of the differences in growth, nor was one needed.

The design of the experiment and the random permutation of results created the mechanism used to assess the evidence.

It’s worth noting that the permutation distribution of T in this example looks a lot like a normal distribution, and in fact, as the number of pairs grows large, it will resemble a normal distribution more and more.

But had the number of pairs been smaller, or the observed distribution of differences more skewed, the normal approximation would not have been as good; inferences from the simulated permutation distribution, however, would remain just as valid.

Bio: Blair Fleming, Director of Digital Marketing at Open Data Group.
