# Statistics For Real World Data

Statistics For Real World DataSome useful statistical tools for imperfect dataRyan FarmarBlockedUnblockFollowFollowingJun 14IntroductionIn any introductory statistics course, you’ll pretty much always encounter the basic z-tests, t-tests, as well as the occasional chi-squared.

Now these tests can be extremely helpful and very good practice for understanding basic statistical methodologies, but the real world isn’t as perfect (a.

k.

a.

“normal”) as statisticians would like it to be.

A lot of the time you’ll have trouble picking out a parametric test because one or more of its assumptions don’t hold up against the provided data which can lead to a lot of wasted time and statistical analyses with shaky foundations.

This is where non-parametric and randomization tests will be your new best friend.

Unlike the z-test and t-tests you may have learned in intro statistics, these tests don’t have strict assumptions and will be much more applicable to imperfect data.

I’ll be going over Monte Carlo simulations and bootstrapping as well as numerous non-parametric tests that can be applied to different types of data in a wide range of scenarios.

Also, I’ll be using R for these analyses since it was built for statistical computing and I’ll be posting code chunks along the way that you are more than welcome to take so you can experiment with it.

Monte Carlo SimulationsA Monte Carlo simulation is pretty much just repeated sampling from a population or pseudo-population.

This repeated sampling is used to estimate the sampling distribution of a relevant test statistic and determine its behavior over lots of iterations.

Samples are taken from the population such that the size yields an approximately normal sampling distribution held up by the Central Limit Theorem.

A useful tool to check normality is the qq-line that compares the theoretically normal values to the sample values.

The more normal your sample is, the closer the points are to the line.

I’ve put a chunk of code below and the graphical output to visualize this step in the process.

## Sample sizes of 15, 30, 45, 60, and 75 are put through the functionsizes <- c(15,30,45,60,75)samp.

means <- lapply(sizes, function(size) replicate(K, mean(rchisq(size,4))))qq.

means <- function(x){mean_data <- data.

frame(X=samp.

means[[which(sizes==x)]])qqmean <-ggplot(mean_data, aes(sample=X)) + stat_qq() + stat_qq_line() + labs(title=paste(“n=”,x,sep=””))print(qqmean)}lapply(sizes, qq.

means)Using a chi-squared distribution with 4 degrees of freedom, it looks like a sample size of around 60 or 70 would get us as close to normality as we need.

The last steps are repeating the sampling and calculating the relevant test statistic.

In the chunk below I used a Monte Carlo simulation with 10,000 repetitions that estimates the sampling distribution of a sample mean and returns the Type 1 error rate.

N <- 10000 mu <- 2 sd <- 2 alpha <- 0.

05z_test_func <- function(x){ samp <- rchisq(x, df=2) samp_mean <- mean(samp) test_statistic <- (abs(samp_mean – mu))/(sd/sqrt(x)) 2*(1-pnorm(test_statistic)) <= alpha }z_test_func2 <- function(x){ rep_tests <- replicate(N, z_test_func(x)) sum(rep_tests)/N }sizes <- c(10,30,50) vector1 <- sapply(sizes, z_test_func2) vector1BootstrappingBootstrapping is pretty similar to Monte Carlo simulations in the sense that it also uses repeated sampling to determine the behavior of a relevant test statistic.

The only big difference is that we use bootstrapping when we only have a sample of the population.

Several samples are taken from the available data with replacement in order to estimate a sampling distribution.

However, you have to make sure that the data you’re sampling is an independent, representative sample of the actual population.

Also, bootstrapping doesn’t work well on distributions with heavy tails so be on the lookout for that.

I’ve included an example below where I use bootstrapping to estimate the sampling distribution of 95th percentile and constructed a 99% confidence interval around it.

samp <- c(1.

77, 10.

48, 4.

30, 0.

96, 2.

67, 2.

76, 2.

02)N <- 10000boot_samp <- replicate(N , sample(df,replace = T))percentile_func <- function(x){quantile(x,probs = c(0.

95))}p95 <- apply(boot_samp,2, percentile_func)sort_p95 <- sort(p95)mean_p95 <- mean(p95)boot_ci <- c(quantile(sort_p95,probs=c(0.

005)), quantile(sort_p95,probs = c(0.

995)))boot_ciThe Sign TestThis particular non-parametric test is used to draw conclusions about the median of a continuous distribution.

The null hypothesis is that the median equals a specified null value while the alternate hypothesis can be right-tailed, left-tailed, or two-tailed.

Since the median is the center of the data, the number of points above and below the actual value of the median should be approximately equal.

Furthermore, the null hypothesis is assumed to be true, so the null value is assumed to be the center of the data.

The test statistic is the number of sampled data points greater than the null value, which follows a binomial distribution with the sample size as the number of trials and 0.

5 as the probability of success.

The usual method for obtaining the test statistic is obtaining the logical vector (from determining which points lie above the null value) and summing it.

The p-value is found by determining the probability of obtaining a test statistic at least as extreme as the one found in the same direction as the test.

For two-tailed tests, the probability to the left of a small number (less than half) or to the right of a large number (more than half) is multiplied by two.

The code below is a simple left-tailed sign test.

You can use either pbinom or binom.

test to calculate the p-value, but I would suggest binom.

test since you can specify the alternate hypothesis in the function itself.

sample <- c(2,1,3,3,3,3,1,3,16,2,2,12,20,3,1)## Test statisticv <- sum(data>5)print(len(v))## Sample sizen <- length(data)## p-valuepbinom(v,n,0.

5)##or use binom.

testbinom.

test(v,n,alternative=”less”)Sign-Rank TestThe non-parametric sign-rank test is also used to test the median of a continuous distribution but, it has the condition that the distribution must be symmetric.

The sign-rank test works in a similar way to the sign test.

Instead of each value being compared directly to the null value, first, the null value is subtracted from each sampled value, which will yield a positive difference if the sampled value is above the assumed median and a negative difference if the sampled value is below the assumed median.

Next, the absolute values of the differences are ranked and any tied values receive a rank that is the average of the would-be ranks of all the tied values.

Finally, the ranks of the positive differences are summed to find the test statistic.

The p-value can be calculated either through approximation or exact calculation.

The exact calculation utilizes simulations to determine the proportion of all possible test statistics that are at least as large as the one we already got from our data.

The approximate calculation uses a normal distribution to estimate the test statistic, but should only be used when there are ties and/or a large sample size.

Here’s an example of the sign-rank test being used and the different options that come with the sign-rank test (“Wilcox test”)sample <- c(17, 50, 45, 59.

8, 21.

74, 16, 9, 15.

43, 5.

12, 40, 35,13.

35, 13.

4)##Conducting the test with different inputswilcox.

test(sample, mu=8.

5, alternative=”two.

sided”)wilcox.

test(sample, mu=10, alternative=”two.

sided”, exact=FALSE)wilcox.

test(sample, mu=14, alternative=”two.

sided”, exact=FALSE,correct=FALSE)On top of the basic inputs, the wilcox.

test() function requires that you specify whether you want the exact or approximate method for calculation and the accompanying continuity correction.

When the approximation method is used, users have the option to include a continuity correction to adjust the integer-based test statistics to the continuous normal scale.

The Wilcox sign-rank test can also be used to determine if there is a difference between two continuous distributions.

The only conditions are that the distributions have the same shape and that they are dependent.

In this case, the process for determining the test statistic is similar to the previous sign-rank example.

First, the difference between each pair of observations is calculated.

The absolute values of these differences are given ranks, ignoring differences of zero and adjusting ranks for any ties.

Since the order of the pairs changes the signs of the differences, the sum of the positive ranks and the sum of the negative ranks are calculated.

The smaller of the sums is used as the test statistic.

You can see in the code below that the wilcox.

test() function can also take a second set of sample data which we will be using in this scenario.

Other than that the syntax is exactly the same as before.

samp1 <- c(25, 29, 10, 31, 27, 24, 27, 29, 30, 32, 20, 5)samp2 <- c(32, 30, 7, 36, 20, 32, 26, 33, 32, 32, 30, 32)## Conduct the testwilcox.

test(samp1, samp2, alternative=”two.

sided”, paired=TRUE, exact=TRUE)wilcox.

test(samp1, samp2, alternative=”two.

sided”, paired=TRUE, exact=FALSE)wilcox.

test(samp1, samp2, alternative=”two.

sided”, paired=TRUE, exact=FALSE,correct=FALSE)Rank-Sum TestThe rank-sum test also tests the difference between the medians of two continuous populations only this time the populations have to be the same shape and independent of each other.

For the Wilcoxon rank-sum test, the test statistic determination begins by combining both samples and ranking all of the values together.

If the null hypothesis is true, then the sums of the ranks in both groups should be about equal.

Thus, the ranks are summed for each group.

An adjustment is made to each sum to accountfor any difference in sample size between the two groups.

The test statistic is just the smaller adjusted sum.

Again, we use the wilcox.

test() and two samples of data, but now we don’t specify the “paired=TRUE” since the data is independent.

As a result, you’ll notice the chunk below is almost identical to the previous one.

##Independent sample datasamp1 <- c(5.

8, 1, 1.

1, 2.

1, 2.

5, 1.

1, 1, 1.

2, 3.

2, 2.

7)samp2 <- c(1.

5, 2.

7, 6.

6, 4.

6, 1.

1, 1.

2, 5.

7, 3.

2, 1.

2, 1.

3)## Conduct the test with some different optionswilcox.

test(samp1,samp2, alternative=”two.

sided”)wilcox.

test(samp1, samp2, alternative=”two.

sided”, exact=FALSE)wilcox.

test(samp1, samp2, alternative=”two.

sided”, exact=FALSE, correct=FALSE)Fisher’s Exact TestFisher’s test determines whether there is an association between categorical variables that can be classified in a two way table with fixed marginals.

It’s similar to the chi-squared test in that the null hypothesis assumes no association between the two variables.

However, it doesn’t use a test statistic and instead calculates the probability of observing values in the table that indicate an even stronger association.

This probability value is used at the p-value.

sample_matrix <- matrix(c(3,11,9,4), nrow=2, ncol=2, byrow=TRUE)fisher.

test(exp.

res)As the table becomes larger than a 2 by 2, the computational intensity of Fisher’s test skyrockets because it iterates through all possible tables for the given marginal totals.

If you want to specify the number of repetitions there is a simulate.

p.

value that will use any number of simulations that you want (the default is 2000).

Here’s a simple example of how to do this in R.

samp <- matrix(c(3,6,8,2,4,1), nrow=2, ncol=3, byrow=TRUE)#regular wayfisher.

test(samp)#or we can specify the number of simulationsfisher.

test(samp, simulate.

p.

value=TRUE, B=10000)ConclusionThese are just a few of all the non-parametric tests out there, but they have proven extremely useful as I encounter more and more real world data.

Hopefully these testing and simulation techniques will help you when you encounter imperfect data as well.

Here’s a link to a website that goes over a whole bunch of non-parametric methodologies as well as a few randomization also.

DictionaryNon-parametric — Non-parametric statistics is based on either being distribution-free or having a specified distribution but with the distribution’s parameters unknown.

Pseudo-Population — A sample of observations that is representative of a larger super-populationNormality — An assumption of many basic statistical tests.

Assumes that a distribution has the characteristics of a normal distribution (e.

g.

symmetric).