# What is the Chi-Square Test and How Does it Work? An Intuitive Explanation with R Code

Step 1: First, import the data.

Step 2: Validate it for correctness in R.

Output:

Count of rows and columns: 1470 2

Top 10 rows of the dataset:

| | age.intervals | Experience.intervals |
|---|---|---|
| 1 | 41 – 50 | 6 – 10 Years |
| 2 | 41 – 50 | 6 – 10 Years |
| 3 | 31 – 40 | 6 – 10 Years |
| 4 | 31 – 40 | 6 – 10 Years |
| 5 | 18 – 30 | 6 – 10 Years |
| 6 | 31 – 40 | 6 – 10 Years |
| 7 | 51 – 60 | 11 – 20 Years |
| 8 | 18 – 30 | Upto 5 Years |
| 9 | 31 – 40 | 6 – 10 Years |
| 10 | 31 – 40 | 11 – 20 Years |

Step 3: Create a proportion table for expected frequencies.

Output:

| 11 – 20 Years | 21 – 40 Years | 6 – 10 Years | Upto 5 Years |
|---|---|---|---|
| 0.2312925 | 0.1408163 | 0.4129252 | 0.2149660 |

Step 4: Calculate the chi-square value.

Output: Chi-square test for given probabilities; data: `table(data$Experience.intervals)`; X-squared = 14.762, df = 3, p-value = 0.002032

The p-value here is less than 0.05. Therefore, we reject our null hypothesis. Hence, the distribution of experience of the employees of different departments differs from what the organization states.
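The goodness-of-fit steps above can be sketched in a self-contained way. This is a minimal sketch, not the article's original Gist. The observed counts are the column totals of the age-by-experience contingency table printed later in this article, and they reproduce the proportions shown in Step 3. The stated proportions in `stated_prop` are hypothetical placeholders, since the organization's claimed distribution is not shown in this section, so the statistic will not match the 14.762 reported above.

```r
# Observed counts per experience band (column totals of the contingency
# table shown in the test-for-independence section; they sum to 1470).
# ASCII hyphens are used in the names instead of the en dashes in the data.
observed <- c("11 - 20 Years" = 340, "21 - 40 Years" = 207,
              "6 - 10 Years"  = 607, "Upto 5 Years"  = 316)

# Share of each band in the sample
observed_prop <- observed / sum(observed)
print(round(observed_prop, 7))

# HYPOTHETICAL proportions claimed by the organization (an assumption for
# illustration only; replace with the real stated distribution).
stated_prop <- c(0.20, 0.17, 0.41, 0.22)

# Goodness-of-fit test: compares observed counts with the counts implied
# by stated_prop; df = number of categories - 1
gof <- chisq.test(x = observed, p = stated_prop)
print(gof)
```

With the raw data frame loaded, the equivalent call would presumably be `chisq.test(table(data$Experience.intervals), p = stated_prop)`, which matches the `data:` line in the output above.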

## Chi-Square Test for Association/Independence

The second type of chi-square test is Pearson’s chi-square test of association. This test is used when we have two categorical variables and want to see whether there is any relationship between them.

Let’s take another example to understand this. A teacher wants to know whether the outcome of a mathematics test is related to the gender of the person taking the test. In other words, she wants to know if males show a different pattern of pass/fail rates than females. So, here are two categorical variables: gender (male and female) and mathematics test outcome (pass or fail).

Let us now look at the contingency table:

| | Boys | Girls |
|---|---|---|
| Pass | 17 | 20 |
| Fail | 8 | 5 |

By looking at the above contingency table, we can see that the girls have a comparatively higher pass rate than the boys (20 of 25 versus 17 of 25). However, to test whether this observed difference is significant or not, we will carry out the chi-square test.

The steps to calculate the chi-square value are as follows:

Step 1: Calculate the row and column totals of the above contingency table:

| | Boys | Girls | Total |
|---|---|---|---|
| Pass | 17 | 20 | 37 |
| Fail | 8 | 5 | 13 |
| Total | 25 | 25 | 50 |

Step 2: Calculate the expected frequency for each individual cell by multiplying its row total by its column total and dividing by the grand total:

Expected Frequency = (Row Total × Column Total) / Grand Total

For the first cell, the expected frequency would be (37 × 25) / 50 = 18.5. Now, write the expected frequencies below the observed frequencies in brackets:

| | Boys | Girls | Total |
|---|---|---|---|
| Pass | 17 (18.5) | 20 (18.5) | 37 |
| Fail | 8 (6.5) | 5 (6.5) | 13 |
| Total | 25 | 25 | 50 |

Step 3: Calculate the value of chi-square using the formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency and E is the expected frequency of a cell. Calculate the term (O − E)² / E for each cell.

For example, for the first cell, (17 − 18.5)² / 18.5 = 0.1216.

Step 4: Then, add all the values obtained for each cell. In this case, the values are:

0.1216 + 0.1216 + 0.3462 + 0.3462 = 0.9356

Step 5: Calculate the degrees of freedom, i.e. (Number of rows − 1) × (Number of columns − 1) = 1 × 1 = 1.

The next task is to compare it with the critical chi-square value from the table we saw above. The calculated chi-square value is 0.9356, which is less than the critical value of 3.84 (df = 1 at the 0.05 significance level). So in this case, we fail to reject the null hypothesis. This means there is no significant association between the two variables, i.e., boys and girls have a statistically similar pattern of pass/fail rates on their mathematics tests.
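The manual calculation above can be checked with a few lines of R. This is a minimal sketch using only base R. The variable names are illustrative, and the statistic comes out at about 0.936 (small differences in the fourth decimal come from rounding the per-cell terms by hand).

```r
# Observed counts from the boys/girls contingency table above
observed <- matrix(c(17, 20,
                      8,  5),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Outcome = c("Pass", "Fail"),
                                   Gender  = c("Boys", "Girls")))

# Expected frequency = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Critical value at the 0.05 significance level
critical <- qchisq(0.95, df = df)

cat("chi-square:", round(chi_sq, 4), " df:", df,
    " critical value:", round(critical, 2), "\n")
cat("reject null hypothesis:", chi_sq > critical, "\n")

# The built-in test gives the same statistic once Yates' continuity
# correction (applied by default to 2 x 2 tables) is switched off
builtin <- chisq.test(observed, correct = FALSE)
print(builtin)
```

Note that calling `chisq.test(observed)` without `correct = FALSE` reports a smaller statistic, because R applies the continuity correction to 2 × 2 tables by default. The conclusion here is the same either way.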

Let’s further solidify our understanding by performing the chi-square test in R.

### Test for Independence in R

#### Problem statement

A Human Resources department of an organization wants to check whether age and experience of the employees are dependent on each other. For this purpose, a random sample of 1470 employees is collected with their age and experience.

#### Setting up the hypothesis

Null hypothesis: Age and Experience are two independent variables

Alternative hypothesis: Age and Experience are two dependent variables

Let’s begin!

Step 1: First, import the data.

Step 2: Validate it for correctness in R.

Output:

Count of rows and columns: 1470 2

Top 10 rows of the dataset:

| | age.intervals | Experience.intervals |
|---|---|---|
| 1 | 41 – 50 | 6 – 10 Years |
| 2 | 41 – 50 | 6 – 10 Years |
| 3 | 31 – 40 | 6 – 10 Years |
| 4 | 31 – 40 | 6 – 10 Years |
| 5 | 18 – 30 | 6 – 10 Years |
| 6 | 31 – 40 | 6 – 10 Years |
| 7 | 51 – 60 | 11 – 20 Years |
| 8 | 18 – 30 | Upto 5 Years |
| 9 | 31 – 40 | 6 – 10 Years |
| 10 | 31 – 40 | 11 – 20 Years |

Step 3: Construct a contingency table and calculate the chi-square value.

Output:

`ct <- table(data$age.intervals, data$Experience.intervals)`

`ct`

| | 11 – 20 Years | 21 – 40 Years | 6 – 10 Years | Upto 5 Years |
|---|---|---|---|---|
| 18 – 30 | 22 | 0 | 172 | 192 |
| 31 – 40 | 190 | 20 | 308 | 101 |
| 41 – 50 | 85 | 112 | 110 | 15 |
| 51 – 60 | 43 | 75 | 17 | 8 |

`chisq.test(ct)`

Pearson's Chi-squared test; data: ct; X-squared = 679.97, df = 9, p-value < 2.2e-16

The p-value here is less than 0.05. Therefore, we reject our null hypothesis. We can conclude that age and experience are dependent variables. The test itself does not indicate the direction of the relationship, but the contingency table shows that longer experience is concentrated in the older age groups.
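Since the raw data file is not included here, the test can be reproduced from the printed contingency table alone. This is a minimal sketch that types the counts in by hand. The dimension names use ASCII hyphens and are otherwise assumptions for readability. If the counts were transcribed correctly, the statistic should come out close to the X-squared = 679.97 with df = 9 reported above.

```r
# Counts copied from the age-by-experience contingency table above
ct <- matrix(c( 22,   0, 172, 192,
               190,  20, 308, 101,
                85, 112, 110,  15,
                43,  75,  17,   8),
             nrow = 4, byrow = TRUE,
             dimnames = list(
               age        = c("18 - 30", "31 - 40", "41 - 50", "51 - 60"),
               experience = c("11 - 20 Years", "21 - 40 Years",
                              "6 - 10 Years", "Upto 5 Years")))

# Pearson's chi-squared test of independence; df = (4 - 1) * (4 - 1) = 9
result <- chisq.test(ct)
print(result)

# Expected counts under independence, for comparison with the observed table
print(round(result$expected, 1))

# Row-wise shares: how each age group is spread across experience bands.
# This shows the direction of the association, which the test does not.
print(round(prop.table(ct, margin = 1), 3))
```

The row-wise shares make the pattern visible: none of the 18 to 30 group falls in the 21 to 40 years band, while more than half of the 51 to 60 group does.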

## End Notes

In this article, we learned how to test for significant differences in categorical data with the help of chi-square tests. We covered the use of chi-square, the assumptions involved in carrying out the test, and how to conduct different types of chi-square tests both manually and in R.

If you are new to statistics, want to cover your basics, and also want to get a start in data science, I recommend taking the Introduction to Data Science course.

It gives you a comprehensive overview of both descriptive and inferential statistics before diving into data science techniques.