Election Poll Simulation, Margin of Error and Central Limit Theorem with Python
Waldecir Faria
Mar 14

While learning about the Central Limit Theorem (CLT), I missed something more basic: graphs showing how the sample size affects the results described by the theorem, how it relates to concepts like election polls, and some code to go with it.
There are lots of nice articles explaining what the CLT is, why it is important and the math behind it. Here are some good links that I found:
Understanding The Central Limit Theorem
What Can a Small Sample Teach Us About a Big Population? — Part 1
Central limit theorem — Wikipedia
Video Lesson from Khan Academy Statistic Course

In this article I focus on simulating an election poll, so that we have samples of variable size to use as material for working with the Central Limit Theorem.
By the end of this article you should have learned a bit about:
How to generate random samples and simple graphs using Python;
What the Central Limit Theorem is;
How the number of people in an election poll impacts its results and the confidence in them;
How to choose the size of your sample to get the margin of error and confidence you want in your polls or tests.

To get the most out of this article you should also be familiar with the Normal Distribution and Z-testing, since we will use them to analyze our election poll results with the support of the Central Limit Theorem.
The code used in this story is available on Kaggle.

Simulating an election poll

Note: This code is focused more on being demonstrative than on performance; there are other, faster methods to simulate this.
Let’s say that we have a presidential election between two candidates, Alice and Bob, and that we want to try to predict who will be the winner.
Suppose that there is a method which returns 1 if a person will vote for Alice and 0 otherwise.
Note that I used a Binomial Distribution Generator to create a Bernoulli random variable. This function should return 0 or 1 according to a Bernoulli Distribution with mean 0.53.
It is like a biased coin simulation.
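The embedded function is not reproduced in this text, so here is a minimal stand-in. It assumes Python's standard `random` module instead of the binomial generator mentioned above (with NumPy, `np.random.binomial(1, 0.53)` would play the same role). The names `get_vote` and `P_ALICE` are hypothetical, chosen only for this sketch.

```python
import random

P_ALICE = 0.53  # share of the population voting for Alice, as stated in the article


def get_vote(rng):
    """Return 1 if the sampled person votes for Alice, 0 otherwise (one Bernoulli trial)."""
    return 1 if rng.random() < P_ALICE else 0


rng = random.Random(42)  # fixed seed so the run is repeatable
votes = [get_vote(rng) for _ in range(10)]
print(votes)  # a list of ten 0/1 answers
```

Passing the generator in as an argument is only a convenience here: it keeps the example repeatable without touching the global random state.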
If we could ask every voter whom they would vote for, we could discover the winner and the distribution behind the vote of the entire population.
However, that is usually impossible, so we run election polls: we ask a small group of people for their opinion and use it to estimate the real population result.
Let’s say that on each day we are able to get the voting intention of n people. This method could return a list like this one: [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]
Then suppose that, over an interval of 5 days, we interviewed 100 people per day about which candidate they would vote for.
This way, we would have 5 different samples of the same distribution, each one with size 100.
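As a rough sketch of that setup (not the author's original code), the five daily polls could be generated as below. `run_poll` is a hypothetical helper, and the seed is arbitrary, so the numbers will differ from the ones reported in the article.

```python
import random
from statistics import mean

P_ALICE = 0.53  # population share for Alice, taken from the article


def run_poll(n, rng):
    """Ask n people; each answer is 1 (Alice) or 0 (Bob)."""
    return [1 if rng.random() < P_ALICE else 0 for _ in range(n)]


rng = random.Random(0)
daily_samples = [run_poll(100, rng) for _ in range(5)]  # 5 days, 100 people per day
daily_means = [mean(day) for day in daily_samples]      # share of Alice votes per day
print(daily_means)
```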
Since it isn’t simple to read a set of 500 elements, I’ll display the mean value of each day instead. In my case I got this output: [0.51, 0.48, 0.45, 0.51, 0.54]. Could we say that Alice would win with only those samples?

Now imagine that we have infinite money and time, so we could do the same for 5, 50, 500, 1000, 10000 and 100000 days.
Let’s see what would happen, using a histogram to summarize the results:
[Figure: histograms of the daily sample means for an increasing number of days]
For only 5 days, the histogram isn’t very helpful, but as we increase the number of days (samples), we can start to see a bell-shaped distribution.
It’s kind of intuitive that, as we increase the number of days, the histogram should become more and more bell-shaped, with its mean closer to the population’s real mean, since on most days the sample mean should be similar to the mean of the original distribution.
However, as you can see, we don’t need an infinite amount of days to guess who will win in this case.
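The histograms themselves are not part of this text. A dependency-free way to see the same effect is a text histogram of the daily means, sketched below. The helper names are hypothetical, and the number of days is kept smaller than in the article so that it runs quickly.

```python
import random
from collections import Counter

P_ALICE = 0.53  # population share for Alice, taken from the article


def daily_mean(n, rng):
    """Share of Alice votes among n simulated answers."""
    return sum(rng.random() < P_ALICE for _ in range(n)) / n


def simulate_days(days, n=100, seed=1):
    """Sample means for `days` independent polls of n people each."""
    rng = random.Random(seed)
    return [daily_mean(n, rng) for _ in range(days)]


def text_histogram(means, width=50):
    """Print one bar per observed mean value (rounded to two decimals)."""
    counts = Counter(round(m, 2) for m in means)
    peak = max(counts.values())
    for value in sorted(counts):
        print(f"{value:.2f} {'#' * max(1, counts[value] * width // peak)}")


means_5 = simulate_days(5)
means_5000 = simulate_days(5000)
text_histogram(means_5000)  # the bars pile up around the population mean
```

Calling `text_histogram(means_5)` instead shows how little five days tell us about the shape.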
The “intuition” that I mentioned previously can be explained by the Central Limit Theorem (CLT).
According to Wikipedia:
The central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a “bell curve”) even if the original variables themselves are not normally distributed.
The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.
There are lots of articles explaining the mathematical background of this theorem and its consequences, but now I would like to focus on the practical use of this theorem for our election poll example.
For us, the important consequence of this theorem is that if you draw multiple samples from a probability distribution (with finite variance), the means of those samples tend towards a normal distribution as the sample size grows.
This gives us a mathematical basis for treating the mean of a sufficiently large sample as approximately normally distributed.
Now you may be asking “Then what?”.
For any given normal distribution, if we know its mean and variance, we can find an interval of width 2*Y, centered on the mean, that contains the random variable with probability C%.
For example, going back to our election poll, we could find that Alice would have X% of the votes, with a margin of error of Y% and C% confidence in our results.
So, if I ran the election poll 100 times, in about C of those runs the result would be that Alice gets X% of the votes, give or take the margin of error of Y%.
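To make the relation between C and Y concrete, here is a small sketch using `statistics.NormalDist` from the standard library (Python 3.8+). The function name and the example values (mean 0.53, standard deviation 0.05) are illustrative assumptions, not figures from the article's code.

```python
from statistics import NormalDist


def normal_interval(mean, std, confidence):
    """Symmetric interval around `mean` that a normal variable falls into
    with probability `confidence` (a value between 0 and 1)."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # standard deviations on each side
    half_width = z * std                            # the "Y" of the text
    return mean - half_width, mean + half_width


low, high = normal_interval(0.53, 0.05, 0.95)
print(f"{low:.3f} to {high:.3f}")
```

For 95% confidence the multiplier `z` comes out at roughly 1.96, which is why "two standard deviations" is a common rule of thumb.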
Let’s apply those concepts using an interesting formula derived from the Central Limit Theorem to find those values.
It lets us estimate the standard deviation (σ) of our sample mean around the real population mean, using the standard deviation of our sample (s) and the size of the sample (n): σ ≈ s / √n. The estimated standard deviation σ is proportional to the error of our estimate when compared with the real population mean.
This way, the ideal scenario would be one where σ is zero.
One way to get closer to that is to increase the sample size.
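As a worked example of that estimate, assuming the usual form s / √n (sample standard deviation divided by the square root of the sample size), the sketch below uses a hand-built sample rather than random data. `standard_error` is a hypothetical name, and `pstdev` (the population formula) is an assumption, picked because it is consistent with the outputs reported later in the article.

```python
from math import sqrt
from statistics import pstdev


def standard_error(sample):
    """Estimated standard deviation of the sample mean: s / sqrt(n)."""
    return pstdev(sample) / sqrt(len(sample))


# A hand-built day of 100 answers with 53 votes for Alice (not random data).
sample = [1] * 53 + [0] * 47
print(standard_error(sample))    # about 0.0499

# Same proportion of Alice votes, but 100 times more people.
bigger = sample * 100
print(standard_error(bigger))    # about 0.00499, ten times smaller
```

Multiplying the sample size by 100 divides the estimate by 10, which is the square-root behaviour that makes very precise polls expensive.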
In the following histograms I’ll simulate multiple polls, keeping the number of samples constant (100) but increasing the sample size n, unlike what I did in the previous histogram:
[Figure: histograms of 100 sample means for increasing sample size n]
Note how the x-axis interval becomes smaller and the bars get closer to the mean as we increase the sample size, showing that we are getting more precise results.
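Without the figure, the same narrowing can be checked numerically. The sketch below is an approximation of the experiment, not the original code: it runs 100 polls for each sample size and prints the range and spread of the resulting means. The sample sizes are assumptions, since the article does not list the ones used in its histograms.

```python
import random
from statistics import pstdev

P_ALICE = 0.53  # population share for Alice, taken from the article


def poll_mean(n, rng):
    """Share of Alice votes in one simulated poll of n people."""
    return sum(rng.random() < P_ALICE for _ in range(n)) / n


rng = random.Random(7)
spread_by_size = {}
for n in (10, 100, 1000, 10000):
    means = [poll_mean(n, rng) for _ in range(100)]  # 100 polls of size n
    spread_by_size[n] = pstdev(means)
    print(f"n={n:>5}  min={min(means):.3f}  max={max(means):.3f}  "
          f"spread={spread_by_size[n]:.4f}")
```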
In real life we usually can’t get a sample that large but, thanks to the Central Limit Theorem, we can calculate the margin of error for a sample of any size n, check its significance and increase n if the margin isn’t good enough for us.
To do that, we need to work with our data as we do with the normal distribution.
We use the estimated standard deviation to apply a z-test, finding the interval of values that satisfies our desired margin of error and the desired confidence.
Going back to day 1: from my sample of 100 people, what is the margin of error for which I can say with 95% confidence that Alice is going to win?
After getting those values, we can just consult a standard normal table to see how many standard deviations satisfy the desired confidence.
Based on the empirical (68-95-99.7) rule, we know that an interval of two standard deviations around the mean should cover about 95% of all cases.
Steps of the z-test based margin of error calculation:
1. Get the sample;
2. Calculate the sample mean and sample standard deviation;
3. Use the Central Limit Theorem to find its estimated standard deviation from the population’s real mean;
4. Use a standard normal table or the 68-95-99.7 rule to find how many standard deviations our interval should contain to satisfy our desired confidence;
5. Since we want 95% confidence, we will need 2 standard deviations.
Here is a method to run the z-test over our election poll. It returns the mean, with some margin of error, from our election poll simulation.
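The embedded method is not included in this text. A minimal sketch that follows the steps listed above could look like the code below. The names `run_poll` and `poll_with_margin` are hypothetical. The choices of `z=2` and the population standard deviation (`pstdev`) are assumptions, made because they are consistent with the output the article reports for a mean of 0.53 at n = 100.

```python
import random
from math import sqrt
from statistics import mean, pstdev

P_ALICE = 0.53  # population share for Alice, taken from the article


def run_poll(n, rng):
    """Step 1: get the sample (n answers, 1 = Alice, 0 = Bob)."""
    return [1 if rng.random() < P_ALICE else 0 for _ in range(n)]


def poll_with_margin(sample, z=2):
    """Steps 2-5: sample mean plus a margin of z estimated standard deviations."""
    std_error = pstdev(sample) / sqrt(len(sample))  # s / sqrt(n)
    return {"error": z * std_error, "mean": mean(sample)}


result = poll_with_margin(run_poll(100, random.Random(3)))
print(result)
```

Because the poll is random, each seed gives a slightly different mean and margin.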
For example, I got this output when I executed it on my machine: {'error': 0.09981983770774223, 'mean': 0.53}
Looking at the error value from that execution with n = 100, you will notice that it is pretty big (almost 10%), so this wouldn’t be too helpful to us, because we would have an interval between 43% and 63%.
If you remember the formula to estimate the standard deviation from the population’s real mean, you’ll notice that as we increase n, we reduce the standard deviation.
So, this explains how polling more and more people gives us a more precise result.
Let’s analyze the impact of increasing the sample size in our election poll scenario:
[Figure: sample mean and margin of error for increasing sample sizes]
With a sample size of 10 we have a margin of error of almost 30% (ergh!), with 100 people we have 10%, and we start to get nice values with just 1000 people (3%).
The sample mean also becomes stable around 0.53 when the sample size is greater than or equal to 1000.
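The graph is not part of this text, but the numbers behind it can be approximated with a short loop. This is a sketch under the same assumptions as before (two estimated standard deviations, population formula for s). The exact values depend on the seed and will not match the article's figures digit for digit.

```python
import random
from math import sqrt
from statistics import mean, pstdev

P_ALICE = 0.53  # population share for Alice, taken from the article
rng = random.Random(11)

margins = {}
for n in (10, 100, 1000, 10000):
    sample = [1 if rng.random() < P_ALICE else 0 for _ in range(n)]
    margins[n] = 2 * pstdev(sample) / sqrt(n)  # two estimated standard deviations
    print(f"n={n:>5}  mean={mean(sample):.3f}  margin=+/-{margins[n]:.3f}")
```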
This is a nice result, since we can also use a pre-calculated table to look up which sample size we should use to get the desired margin of error, and the values in such tables are almost the same as the ones we got in the previous graph.
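Such tables can be rebuilt by solving the margin formula for n. The sketch below assumes a margin of z · s / √n with z = 2, and uses s = 0.5, the largest standard deviation a yes/no question can have, as a worst case. The function name and defaults are illustrative only.

```python
from math import ceil


def required_sample_size(margin, z=2, std=0.5):
    """Smallest n for which z * std / sqrt(n) is at most `margin`.
    std=0.5 is the worst case for a two-option question."""
    return ceil((z * std / margin) ** 2)


for margin in (0.10, 0.05, 0.03, 0.01):
    print(f"margin {margin:.0%}: n >= {required_sample_size(margin)}")
```

With these defaults the formula reduces to n ≈ 1 / margin², so a 10% margin needs about 100 people, 3% a bit more than 1100, and 1% about 10000, in line with the figures discussed above.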
Finally, we could say that if we had polled 10000 people then, with 95% confidence, Alice would win with 53% of the votes, with a margin of error of ±1%.
Let’s use our method to check this result. This method call returned: {'error': 0.009978328316907597, 'mean': 0.5329}, which is what I said in the previous paragraph, yay!

What comes after the CLT?

Now you have learned a bit more about how to measure your polls, control the sample size and discuss the results.
An interesting application of this knowledge is null hypothesis testing (the source of that famous p-value which you find in lots of academic studies and the like).
Check this video from the Khan Academy about how to apply the Null hypothesis test over a question to check if a new drug is really effective or not.
Also, please let me know what you thought about this story: leave a comment if there is an error in the text, or click the Clap button if you enjoyed reading it.
See you!