# Chi Square – Explanation & Examples

The definition of the chi-square test is:

“The chi-square test compares two variables in a contingency table to see if they are related.”

In this topic, we will discuss the chi-square test from the following aspects:

1. What is the chi-square test?
2. Hypothesis testing using the chi-square test.
3. Steps of hypothesis testing performed by the chi-square test.
4. How to calculate the chi-square test?
5. Practice questions.

## 1. What is the chi-square test?

The chi-square test of independence, also called the χ^2 test, is used to analyze the contingency table formed by two categorical variables.

The chi-square test evaluates whether there is a significant association between the categories of the two variables.

An R × C contingency table is a table with R rows and C columns. It displays the relationship between two variables, where the variable in the rows has R categories and the variable in the columns has C categories.

### – Example of 2 X 2 contingency table

A study in 1994 examined 491 dogs that had developed cancer and 945 dogs as a control group (without cancer) to determine whether there is an increased risk of cancer in dogs that are exposed to the herbicide 2,4-Dichlorophenoxyacetic acid (2,4-D).

Reference:

Hayes HM, Tarone RE, Cantor KP, Jessen CR, McCurnin DM, and Richardson RC. 1991. Case-Control Study of Canine Malignant Lymphoma: Positive Association With Dog Owner’s Use of 2, 4- Dichlorophenoxyacetic Acid Herbicides. Journal of the National Cancer Institute 83(17):1226-1231.

The results of this study are shown in the following table.

| | cancer | no cancer | Sum |
|---|---|---|---|
| 2,4-D | 191 | 304 | 495 |
| no 2,4-D | 300 | 641 | 941 |
| Sum | 491 | 945 | 1436 |

We see that:

• The 2,4-D exposure categories are shown along the rows and the cancer status categories along the columns.
• The data form a 2 × 2 contingency table because 2,4-D exposure has 2 categories and cancer status also has 2 categories.
• The “cancer” and “no cancer” columns are for dogs that did and did not develop cancer, respectively.
• The “2,4-D” and “no 2,4-D” rows are for dogs that were and were not exposed to 2,4-D, respectively.
• 191 dogs were exposed to 2,4-D and developed cancer.
• 304 dogs were exposed to 2,4-D and did not develop cancer.
• 300 dogs were not exposed to 2,4-D and developed cancer.
• 641 dogs were not exposed to 2,4-D and did not develop cancer.

We want to test for the relationship between exposure to 2,4-D and developing cancer. We use the χ^2 test to test if this relationship truly exists.
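As a minimal sketch, the 2 × 2 table above can be held in Python as a list of rows, and the "Sum" margins recomputed from the four cell counts. The variable names are illustrative only.

```python
# Observed counts from the 2,4-D study, one inner list per table row.
# Rows: exposed to 2,4-D, not exposed. Columns: cancer, no cancer.
observed = [
    [191, 304],
    [300, 641],
]

# Row totals, column totals, and the table total (the "Sum" margins).
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

print(row_totals)   # [495, 941]
print(col_totals)   # [491, 945]
print(grand_total)  # 1436
```

The recomputed margins match the "Sum" row and column of the table, which is a quick check that the counts were entered correctly.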

### – Example of 2 X 5 contingency table

Survey responses from 20,000 respondents to the Behavioral Risk Factor Surveillance System (BRFSS), classified by health coverage and health status.

Source: Office of Surveillance, Epidemiology, and Laboratory Services, Behavioral Risk Factor Surveillance System, BRFSS 2010 Survey Data.

The results of this study are shown in the following table.

| | Excellent | Fair | Good | Poor | Very good | Sum |
|---|---|---|---|---|---|---|
| No | 459 | 385 | 854 | 99 | 727 | 2524 |
| Yes | 4198 | 1634 | 4821 | 578 | 6245 | 17476 |
| Sum | 4657 | 2019 | 5675 | 677 | 6972 | 20000 |

We see that:

• The health coverage categories are shown along the rows and the health status categories along the columns.
• The data form a 2 × 5 contingency table because health coverage has 2 categories and health status has 5 categories.
• The “Excellent”, “Fair”, “Good”, “Poor”, and “Very good” columns are for the person’s health status.
• The “No” and “Yes” rows are for whether the person had health coverage or not.
• 459 persons had excellent health status and no health coverage.
• 4198 persons had excellent health status and health coverage.
• 385 persons had fair health status and no health coverage.
• 1634 persons had fair health status and health coverage.
• 854 persons had good health status and no health coverage.
• 4821 persons had good health status and health coverage.
• 99 persons had poor health status and no health coverage.
• 578 persons had poor health status and health coverage.
• 727 persons had very good health status and no health coverage.
• 6245 persons had very good health status and health coverage.

We want to test for the relationship between health status and health coverage. We use the χ^2 test to test if this relationship truly exists.

### – Example of 4 X 3 contingency table

A 2010 Pew Research poll asked 1,306 Americans, “From what you’ve read and heard, is there solid evidence that the average temperature on earth has been getting warmer over the past few decades, or not?”

Source: Pew Research Center, Majority of Republicans No Longer See Evidence of Global Warming, data collected on October 27, 2010.

The results of this study are shown in the following table.

| | Don’t know / refuse to answer | Earth is warming | Not warming | Sum |
|---|---|---|---|---|
| Conservative Republican | 45 | 248 | 450 | 743 |
| Liberal Democrat | 23 | 405 | 23 | 451 |
| Mod/Cons Democrat | 45 | 563 | 158 | 766 |
| Mod/Lib Republican | 23 | 135 | 135 | 293 |
| Sum | 136 | 1351 | 766 | 2253 |

We see that:

• The party categories are shown along the rows and the response categories along the columns.
• The data form a 4 × 3 contingency table because party has 4 categories and response has 3 categories.
• The “Don’t know / refuse to answer”, “Earth is warming”, and “Not warming” columns are the response categories.
• The “Conservative Republican”, “Liberal Democrat”, “Mod/Cons Democrat”, and “Mod/Lib Republican” rows are the party categories.
• 45 “Conservative Republican” persons responded “Don’t know / refuse to answer”, compared to 23 “Liberal Democrat” persons, 45 “Mod/Cons Democrat” persons, and 23 “Mod/Lib Republican” persons.
• 248 “Conservative Republican” persons responded “Earth is warming”, compared to 405 “Liberal Democrat” persons, 563 “Mod/Cons Democrat” persons, and 135 “Mod/Lib Republican” persons.
• 450 “Conservative Republican” persons responded “Not warming”, compared to 23 “Liberal Democrat” persons, 158 “Mod/Cons Democrat” persons, and 135 “Mod/Lib Republican” persons.

We want to test for the relation between the party categories and the response categories. We use the χ^2 test to test if this relationship truly exists.

## 2. Hypothesis testing using the chi-square test

In hypothesis testing, you start with two mutually exclusive possibilities for the unknown truth and then use the sample to choose between them. The two possibilities are the null hypothesis, Ho, and the alternative hypothesis, Ha.

• The null hypothesis, Ho: There is no association between the two categorical variables, so the difference between the populations is zero.
• The alternative hypothesis, Ha: There is an association between the two categorical variables, so the difference between the populations is not zero.

In symbols, the hypotheses are written as:

• Ho: p_1=p_2 or p_1-p_2=0. The proportions of one variable are the same for different values of the other variable.

In testing the relation between exposure to 2,4-D and developing cancer, this means that the proportion of dogs that developed cancer is the same for dogs exposed and not exposed to 2,4-D.

• Ha: p_1≠p_2 or p_1-p_2≠0. In testing the relation between exposure to 2,4-D and developing cancer, this means that the proportion of dogs that developed cancer is different for dogs exposed and not exposed to 2,4-D.

Note: Although the hypothesis testing for the chi-square test compares proportions, the chi-square test uses the actual count to test that.

### – Steps of hypothesis testing performed by the chi-square test

• The chi-square test uses the contingency table of data to calculate an expected table.

The expected table contains the theoretical data counts that would be expected when there is no relation between the rows and the columns i.e. the null hypothesis is true, Ho: p_1=p_2.

• The test calculates the discrepancies between each observed and expected value and aggregates them.
• If the null hypothesis is true, the aggregated value, called the χ^2 statistic, has a chi-square distribution. Find the probability, under this chi-square distribution, of a value at least as large as the one observed. This is the p-value.

The p-value is the probability of obtaining a χ^2 statistic at least as large as the one from our sample if the null hypothesis is true.

The null hypothesis means that the proportion of developing cancer is similar for dogs exposed and not exposed to 2,4-D.

Generally in research, the cut-off used is 0.05. This 0.05 is called the rejection level, α level, or significance level.

• Make a decision: reject Ho or fail to reject Ho.

If the p-value < 0.05, the result is statistically significant at the 0.05 level. We reject the null hypothesis and conclude that results like our sample data are unlikely under Ho: they have a probability of less than 0.05.

If the p-value >= 0.05, the result is not statistically significant at the 0.05 level and we fail to reject the null hypothesis.

We say “fail to reject” rather than “accept” the null hypothesis because a large p-value does not prove that the null hypothesis is true. For example, a p-value of 0.25 means that results at least as extreme as ours have a probability of 25% under the null hypothesis, which is too large to rule the null hypothesis out. The 0.05 cut-off is a convention, so another researcher may choose a different α level.

Note: The chi-square test requires that no expected value in the expected table is less than 5 (sometimes known as “the rule of five”). If any expected value is less than 5, the chi-square test is not appropriate and another test, such as Fisher’s exact test, is used instead.
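The rule of five can be checked mechanically. The sketch below uses the expected-count formula (row total × column total / table total) that is worked through in the calculation section; the function names and the small second table are hypothetical and serve only as illustration.

```python
def expected_counts(observed):
    """Expected counts under no association: row total * column total / table total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def passes_rule_of_five(observed):
    """True when every expected count is at least 5."""
    return all(e >= 5 for row in expected_counts(observed) for e in row)


dogs = [[191, 304], [300, 641]]   # the 2,4-D study table
small = [[3, 2], [1, 4]]          # hypothetical tiny sample, for contrast

print(passes_rule_of_five(dogs))   # True
print(passes_rule_of_five(small))  # False
```

For the hypothetical small table every expected count is 2 or 3, so the chi-square test would not be appropriate there.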

## 4. How to calculate the chi-square test?

### – Example of 2 X 2 contingency table

A study in 1994 examined 491 dogs that had developed cancer and 945 dogs as a control group to determine whether there is an increased risk of cancer in dogs that are exposed to the herbicide 2,4-Dichlorophenoxyacetic acid (2,4-D).

The following 2 X 2 contingency table is obtained:

| | cancer | no cancer | Sum |
|---|---|---|---|
| 2,4-D | 191 | 304 | 495 |
| no 2,4-D | 300 | 641 | 941 |
| Sum | 491 | 945 | 1436 |

To see the different proportions of cancer development per 2,4-D exposure, we can use the following table:

| | cancer | no cancer |
|---|---|---|
| 2,4-D | 0.39 | 0.61 |
| no 2,4-D | 0.32 | 0.68 |

The sum of every row is 1.00 or 100%.

We see that 0.39 or 39% of dogs exposed to 2,4-D developed cancer compared to 0.32 or 32% of dogs not exposed to 2,4-D.
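The row proportions above come from dividing each cell by its row total. A minimal sketch, with illustrative names:

```python
observed = [
    [191, 304],  # exposed to 2,4-D: cancer, no cancer
    [300, 641],  # not exposed:      cancer, no cancer
]


def row_proportions(table):
    """Divide every count by the total of its own row."""
    return [[count / sum(row) for count in row] for row in table]


props = row_proportions(observed)
for label, row in zip(["2,4-D", "no 2,4-D"], props):
    print(label, [round(p, 2) for p in row])
# 2,4-D [0.39, 0.61]
# no 2,4-D [0.32, 0.68]
```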

We can plot these proportions in the following bar plot.

To test for the relationship between exposure to 2,4-D and developing cancer, we follow these steps:

• Use the 2 X 2 table to calculate the expected count of each cell.

The expected count in the (i, j) cell = (the total count in the ith row × the total count in the jth column) / the total count in the table.

The expected counts are the counts we would see if there were no association between the rows and columns. In other words, if there were no association between exposure to 2,4-D and cancer development.

The sum of the expected values across any row or column must equal the corresponding row or column total.

The expected count for dogs exposed to 2,4-D and developed cancer = (total in 1st row X total in 1st column)/table total = (495X491)/1436 = 169.2514.

The expected count for dogs not exposed to 2,4-D and developed cancer = (total in 2nd row X total in 1st column)/table total = (941X491)/1436 = 321.7486.

The expected count for dogs exposed to 2,4-D and did not develop cancer = (total in 1st row X total in 2nd column)/table total = (495X945)/1436 = 325.7486.

The expected count for dogs not exposed to 2,4-D and did not develop cancer = (total in 2nd row X total in 2nd column)/table total = (941X945)/1436 = 619.2514.

The following table will be produced:

| | cancer | no cancer | Sum |
|---|---|---|---|
| 2,4-D | 169.2514 | 325.7486 | 495 |
| no 2,4-D | 321.7486 | 619.2514 | 941 |
| Sum | 491.0000 | 945.0000 | 1436 |

We see that all expected values are larger than 5 so the chi-square test can be used.
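The four expected counts can be reproduced with a few lines of Python. This is a sketch of the formula stated above, not a library routine, and the names are illustrative.

```python
observed = [
    [191, 304],  # 2,4-D
    [300, 641],  # no 2,4-D
]

row_totals = [sum(row) for row in observed]        # [495, 941]
col_totals = [sum(col) for col in zip(*observed)]  # [491, 945]
total = sum(row_totals)                            # 1436

# Expected count in cell (i, j) = row total i * column total j / table total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 4) for e in row])
# [169.2514, 325.7486]
# [321.7486, 619.2514]
```

Each row and column of the expected table sums to the same margin as the observed table, as the text requires.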

To see the different proportions of cancer development per 2,4-D exposure in the expected table:

| | cancer | no cancer |
|---|---|---|
| 2,4-D | 0.34 | 0.66 |
| no 2,4-D | 0.34 | 0.66 |

The sum of every row is 1.00 or 100%.

We see that 0.34 or 34% of dogs exposed to 2,4-D developed cancer and also 0.34 or 34% of dogs not exposed to 2,4-D.

• Make a table with 2 columns for the different cells and their observed counts.

We have 4 cells in this 2 × 2 table:

• A cell for dogs exposed to 2,4-D that had cancer.
• A cell for dogs exposed to 2,4-D that did not have cancer.
• A cell for dogs not exposed to 2,4-D that had cancer.
• A cell for dogs not exposed to 2,4-D that did not have cancer.

| category | observed |
|---|---|
| 2,4-D, cancer | 191 |
| no 2,4-D, cancer | 300 |
| 2,4-D, no cancer | 304 |
| no 2,4-D, no cancer | 641 |

• Add a column for the expected count of each cell.

| category | observed | expected |
|---|---|---|
| 2,4-D, cancer | 191 | 169.2514 |
| no 2,4-D, cancer | 300 | 321.7486 |
| 2,4-D, no cancer | 304 | 325.7486 |
| no 2,4-D, no cancer | 641 | 619.2514 |

• Subtract the expected value from the observed value and place the result in the “obs-exp” column.

| category | observed | expected | obs-exp |
|---|---|---|---|
| 2,4-D, cancer | 191 | 169.2514 | 21.75 |
| no 2,4-D, cancer | 300 | 321.7486 | -21.75 |
| 2,4-D, no cancer | 304 | 325.7486 | -21.75 |
| no 2,4-D, no cancer | 641 | 619.2514 | 21.75 |

• Square the differences from the previous step and place the result in the “(obs-exp)^2” column.

| category | observed | expected | obs-exp | (obs-exp)^2 |
|---|---|---|---|---|
| 2,4-D, cancer | 191 | 169.2514 | 21.75 | 473.06 |
| no 2,4-D, cancer | 300 | 321.7486 | -21.75 | 473.06 |
| 2,4-D, no cancer | 304 | 325.7486 | -21.75 | 473.06 |
| no 2,4-D, no cancer | 641 | 619.2514 | 21.75 | 473.06 |

• Divide the squared differences by their respective expected values and place the result in the “(obs-exp)^2/exp” column.

| category | observed | expected | obs-exp | (obs-exp)^2 | (obs-exp)^2/exp |
|---|---|---|---|---|---|
| 2,4-D, cancer | 191 | 169.2514 | 21.75 | 473.06 | 2.80 |
| no 2,4-D, cancer | 300 | 321.7486 | -21.75 | 473.06 | 1.47 |
| 2,4-D, no cancer | 304 | 325.7486 | -21.75 | 473.06 | 1.45 |
| no 2,4-D, no cancer | 641 | 619.2514 | 21.75 | 473.06 | 0.76 |

• Sum all the values in the last column to get the chi-square statistic:

The χ^2 statistic = 2.80+1.47+1.45+0.76 = 6.48.

The last column is summed to get an overall measure of agreement between the observed and expected tables.

• If the null hypothesis is true, the χ^2 statistic has a chi-square distribution with (R − 1) × (C − 1) degrees of freedom or df.

Define the probability (or the p-value) of the χ^2 statistic under this chi-square distribution.

The p-value is given by the area to the right of the χ^2 statistic under this chi-square distribution.

In our 2X2 contingency table, the df = (2-1)X(2-1) = 1.

The following is the chi-square distribution with 1 df.

The total area under the curve is 1.00 or 100%.

In the first plot, we see that when the χ^2 value = 3.84, the area to the right or the p-value = 0.05.

In our contingency table, the χ^2 value = 6.48 (plotted as a vertical line in the second plot), so the p-value is smaller than 0.05.

• Make a decision: reject Ho or fail to reject Ho.

The p-value < 0.05, so it is a statistically significant result. We reject the null hypothesis and conclude that results like our sample data are unlikely under Ho: they have a probability of less than 0.05.

We conclude that there is a significant relationship between exposure to 2,4-D and cancer development in dogs.
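The statistic and p-value for this 2 × 2 example can be reproduced with the Python standard library. The sketch relies on one fact specific to 1 degree of freedom: the area to the right of x under the chi-square distribution equals erfc(sqrt(x / 2)). It does not apply to other degrees of freedom, and the variable names are illustrative.

```python
import math

# Observed and expected counts, in the same cell order as the step tables.
observed = [191, 300, 304, 641]
expected = [169.2514, 321.7486, 325.7486, 619.2514]

# Chi-square statistic: sum of (obs - exp)^2 / exp over all cells.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Right-tail area for 1 degree of freedom only.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2))   # 6.48
print(p_value < 0.05)   # True
```

The same formula evaluated at 3.84 gives a tail area of about 0.05, which matches the cut-off value quoted for 1 df.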

### – Example of 2 X 5 contingency table

Survey responses from 20,000 respondents to the Behavioral Risk Factor Surveillance System.

The results of this study are shown in the following table.

| | Excellent | Fair | Good | Poor | Very good | Sum |
|---|---|---|---|---|---|---|
| No | 459 | 385 | 854 | 99 | 727 | 2524 |
| Yes | 4198 | 1634 | 4821 | 578 | 6245 | 17476 |
| Sum | 4657 | 2019 | 5675 | 677 | 6972 | 20000 |

To see the different proportions of health status per health coverage, we can use the following table:

| | Excellent | Fair | Good | Poor | Very good |
|---|---|---|---|---|---|
| No | 0.18 | 0.15 | 0.34 | 0.04 | 0.29 |
| Yes | 0.24 | 0.09 | 0.28 | 0.03 | 0.36 |

The sum of every row is 1.00 or 100%.

We see that 0.18 or 18% of persons who do not have health coverage had excellent health status compared to 0.24 or 24% of persons who do have health coverage.

We see that 0.15 or 15% of persons who do not have health coverage had fair health status compared to 0.09 or 9% of persons who do have health coverage, and so on.

We can plot these proportions in the following bar plot.

To test for the relationship between health coverage and health status, we follow these steps:

• Use the 2 X 5 table to calculate the expected count of each cell.

The expected counts are the counts we would see if there were no association between the rows and columns. In other words, if there were no association between health coverage and health status.

The expected count in the (i, j) cell = (the total count in the ith row × the total count in the jth column) / the total count in the table.

For example, the expected count for persons who do not have a health coverage and with excellent health status = (row total X column total) / table total = (2524 X 4657)/20000 = 587.7134.

The following table will be produced:

| | Excellent | Fair | Good | Poor | Very good | Sum |
|---|---|---|---|---|---|---|
| No | 587.7134 | 254.7978 | 716.185 | 85.4374 | 879.8664 | 2524 |
| Yes | 4069.2866 | 1764.2022 | 4958.815 | 591.5626 | 6092.1336 | 17476 |
| Sum | 4657.0000 | 2019.0000 | 5675.000 | 677.0000 | 6972.0000 | 20000 |

We see that all expected values are larger than 5 so the chi-square test can be used.

To see the different proportions of health statuses per health coverage in the expected table:

| | Excellent | Fair | Good | Poor | Very good |
|---|---|---|---|---|---|
| No | 0.23 | 0.10 | 0.28 | 0.03 | 0.35 |
| Yes | 0.23 | 0.10 | 0.28 | 0.03 | 0.35 |

We see that 0.23 or 23% of persons who do or do not have health coverage had excellent health status.

All other proportions of different health statuses across the health coverage are equal.

• Make a table with 2 columns for the different cells and their observed counts.

We have 10 cells in this 2 × 5 table, which are shown in the following table:

| category | observed |
|---|---|
| No, Excellent | 459 |
| Yes, Excellent | 4198 |
| No, Fair | 385 |
| Yes, Fair | 1634 |
| No, Good | 854 |
| Yes, Good | 4821 |
| No, Poor | 99 |
| Yes, Poor | 578 |
| No, Very good | 727 |
| Yes, Very good | 6245 |

For example, the “No, Excellent” category means persons without health coverage and with excellent health status.

• Add a column for the expected count of each cell.

| category | observed | expected |
|---|---|---|
| No, Excellent | 459 | 587.7134 |
| Yes, Excellent | 4198 | 4069.2866 |
| No, Fair | 385 | 254.7978 |
| Yes, Fair | 1634 | 1764.2022 |
| No, Good | 854 | 716.1850 |
| Yes, Good | 4821 | 4958.8150 |
| No, Poor | 99 | 85.4374 |
| Yes, Poor | 578 | 591.5626 |
| No, Very good | 727 | 879.8664 |
| Yes, Very good | 6245 | 6092.1336 |

• Subtract the expected value from the observed value and place the result in the “obs-exp” column.

| category | observed | expected | obs-exp |
|---|---|---|---|
| No, Excellent | 459 | 587.7134 | -128.71 |
| Yes, Excellent | 4198 | 4069.2866 | 128.71 |
| No, Fair | 385 | 254.7978 | 130.20 |
| Yes, Fair | 1634 | 1764.2022 | -130.20 |
| No, Good | 854 | 716.1850 | 137.82 |
| Yes, Good | 4821 | 4958.8150 | -137.81 |
| No, Poor | 99 | 85.4374 | 13.56 |
| Yes, Poor | 578 | 591.5626 | -13.56 |
| No, Very good | 727 | 879.8664 | -152.87 |
| Yes, Very good | 6245 | 6092.1336 | 152.87 |

• Square the differences from the previous step and place the result in the “(obs-exp)^2” column.

| category | observed | expected | obs-exp | (obs-exp)^2 |
|---|---|---|---|---|
| No, Excellent | 459 | 587.7134 | -128.71 | 16566.26 |
| Yes, Excellent | 4198 | 4069.2866 | 128.71 | 16566.26 |
| No, Fair | 385 | 254.7978 | 130.20 | 16952.04 |
| Yes, Fair | 1634 | 1764.2022 | -130.20 | 16952.04 |
| No, Good | 854 | 716.1850 | 137.82 | 18994.35 |
| Yes, Good | 4821 | 4958.8150 | -137.81 | 18991.60 |
| No, Poor | 99 | 85.4374 | 13.56 | 183.87 |
| Yes, Poor | 578 | 591.5626 | -13.56 | 183.87 |
| No, Very good | 727 | 879.8664 | -152.87 | 23369.24 |
| Yes, Very good | 6245 | 6092.1336 | 152.87 | 23369.24 |

• Divide the squared differences by their respective expected values and place the result in the “(obs-exp)^2/exp” column.

| category | observed | expected | obs-exp | (obs-exp)^2 | (obs-exp)^2/exp |
|---|---|---|---|---|---|
| No, Excellent | 459 | 587.7134 | -128.71 | 16566.26 | 28.19 |
| Yes, Excellent | 4198 | 4069.2866 | 128.71 | 16566.26 | 4.07 |
| No, Fair | 385 | 254.7978 | 130.20 | 16952.04 | 66.53 |
| Yes, Fair | 1634 | 1764.2022 | -130.20 | 16952.04 | 9.61 |
| No, Good | 854 | 716.1850 | 137.82 | 18994.35 | 26.52 |
| Yes, Good | 4821 | 4958.8150 | -137.81 | 18991.60 | 3.83 |
| No, Poor | 99 | 85.4374 | 13.56 | 183.87 | 2.15 |
| Yes, Poor | 578 | 591.5626 | -13.56 | 183.87 | 0.31 |
| No, Very good | 727 | 879.8664 | -152.87 | 23369.24 | 26.56 |
| Yes, Very good | 6245 | 6092.1336 | 152.87 | 23369.24 | 3.84 |

• Sum all the values in the last column to get the chi-square statistic:

The χ^2 statistic = 28.19+ 4.07+ 66.53+ 9.61+ 26.52+ 3.83+ 2.15+ 0.31+ 26.56+ 3.84 = 171.61.

• If the null hypothesis is true, the χ^2 statistic has a chi-square distribution with (R − 1) × (C − 1) degrees of freedom.

In our 2X5 contingency table, the df = (2-1)X(5-1) = 4.

The following is the chi-square distribution with 4 df.

In the first plot, we see that when the χ^2 value = 9.49, the area to the right or the p-value = 0.05.

In our contingency table, the χ^2 value = 171.61 (plotted as a vertical line in the second plot), so the p-value is far smaller than 0.05.

• Make a decision: reject Ho or fail to reject Ho.

The p-value < 0.05, so it is a statistically significant result. We reject the null hypothesis and conclude that results like our sample data are unlikely under Ho, the null hypothesis.

We conclude that there is a significant relationship between health statuses and health coverage in the persons surveyed.
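The whole 2 × 5 calculation can be sketched end to end in Python. The statistic and degrees of freedom follow the steps above. The p-value helper uses a closed form for the right-tail area that holds only when the degrees of freedom are even: exp(-x/2) × Σ (x/2)^i / i! for i from 0 to df/2 − 1. The function names are illustrative and this is not a replacement for a statistics library.

```python
import math


def chi_square_statistic(observed):
    """Return (statistic, degrees of freedom) for an R x C table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total
            stat += (obs - exp) ** 2 / exp
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df


def right_tail_even_df(stat, df):
    """Area to the right of stat; valid only for an even number of df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = stat / 2
    series = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * series


coverage = [
    [459, 385, 854, 99, 727],       # No health coverage
    [4198, 1634, 4821, 578, 6245],  # Has health coverage
]

stat, df = chi_square_statistic(coverage)
p_value = right_tail_even_df(stat, df)

print(round(stat, 2), df)  # 171.61 4
print(p_value < 0.05)      # True
```

Evaluating `right_tail_even_df(9.49, 4)` returns roughly 0.05, in line with the cut-off quoted for 4 df.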

### – Example of 4 X 3 contingency table

A 2010 Pew Research poll asked 1,306 Americans, “From what you’ve read and heard, is there solid evidence that the average temperature on earth has been getting warmer over the past few decades, or not?”

The results of this study are shown in the following table.

| | Don’t know / refuse to answer | Earth is warming | Not warming | Sum |
|---|---|---|---|---|
| Conservative Republican | 45 | 248 | 450 | 743 |
| Liberal Democrat | 23 | 405 | 23 | 451 |
| Mod/Cons Democrat | 45 | 563 | 158 | 766 |
| Mod/Lib Republican | 23 | 135 | 135 | 293 |
| Sum | 136 | 1351 | 766 | 2253 |

To see the different proportions of responses per different parties, we can use the following table:

| | Don’t know / refuse to answer | Earth is warming | Not warming |
|---|---|---|---|
| Conservative Republican | 0.06 | 0.33 | 0.61 |
| Liberal Democrat | 0.05 | 0.90 | 0.05 |
| Mod/Cons Democrat | 0.06 | 0.73 | 0.21 |
| Mod/Lib Republican | 0.08 | 0.46 | 0.46 |

The sum of every row is 1.00 or 100%.

We see that:

• 0.33 or 33% of “Conservative Republican” persons responded that Earth is warming, compared to 0.90 or 90% of “Liberal Democrat” persons, 0.73 or 73% of “Mod/Cons Democrat” persons, and 0.46 or 46% of “Mod/Lib Republican” persons.
• 0.61 or 61% of “Conservative Republican” persons responded that Earth is not warming, compared to only 0.05 or 5% of “Liberal Democrat” persons, 0.21 or 21% of “Mod/Cons Democrat” persons, and 0.46 or 46% of “Mod/Lib Republican” persons.

We can plot these proportions in the following bar plot.

To test for the relationship between parties and responses, we follow these steps:

• Use the 4 X 3 table to calculate the expected count of each cell.

The expected counts are the counts we would see if there were no association between the rows and columns. In other words, if there were no association between the parties and the responses.

The expected count in the (i, j) cell = (the total count in the ith row × the total count in the jth column) / the total count in the table.

For example, the expected count for “Conservative Republican” persons who responded that Earth is warming = (row total X column total) / table total = (743 X 1351)/2253 = 445.5362.

The following table will be produced:

| | Don’t know / refuse to answer | Earth is warming | Not warming | Sum |
|---|---|---|---|---|
| Conservative Republican | 44.85042 | 445.5362 | 252.6134 | 743 |
| Liberal Democrat | 27.22415 | 270.4399 | 153.3360 | 451 |
| Mod/Cons Democrat | 46.23879 | 459.3280 | 260.4332 | 766 |
| Mod/Lib Republican | 17.68664 | 175.6960 | 99.6174 | 293 |
| Sum | 136.00000 | 1351.0000 | 766.0000 | 2253 |

We see that all expected values are larger than 5 so the chi-square test can be used.

To see the different proportions of responses per parties in the expected table:

| | Don’t know / refuse to answer | Earth is warming | Not warming |
|---|---|---|---|
| Conservative Republican | 0.06 | 0.60 | 0.34 |
| Liberal Democrat | 0.06 | 0.60 | 0.34 |
| Mod/Cons Democrat | 0.06 | 0.60 | 0.34 |
| Mod/Lib Republican | 0.06 | 0.60 | 0.34 |

All proportions of different responses across the different parties are the same.

• Make a table with 2 columns for the different cells and their observed counts.

We have 12 cells in this 4 × 3 table, which are shown in the following table:

| category | observed |
|---|---|
| Conservative Republican, Don’t know / refuse to answer | 45 |
| Liberal Democrat, Don’t know / refuse to answer | 23 |
| Mod/Cons Democrat, Don’t know / refuse to answer | 45 |
| Mod/Lib Republican, Don’t know / refuse to answer | 23 |
| Conservative Republican, Earth is warming | 248 |
| Liberal Democrat, Earth is warming | 405 |
| Mod/Cons Democrat, Earth is warming | 563 |
| Mod/Lib Republican, Earth is warming | 135 |
| Conservative Republican, Not warming | 450 |
| Liberal Democrat, Not warming | 23 |
| Mod/Cons Democrat, Not warming | 158 |
| Mod/Lib Republican, Not warming | 135 |

For example, the “Conservative Republican, Earth is warming” category means “Conservative Republican” persons who responded that Earth is warming.

• Add a column for the expected count of each cell.

| category | observed | expected |
|---|---|---|
| Conservative Republican, Don’t know / refuse to answer | 45 | 44.85042 |
| Liberal Democrat, Don’t know / refuse to answer | 23 | 27.22415 |
| Mod/Cons Democrat, Don’t know / refuse to answer | 45 | 46.23879 |
| Mod/Lib Republican, Don’t know / refuse to answer | 23 | 17.68664 |
| Conservative Republican, Earth is warming | 248 | 445.53617 |
| Liberal Democrat, Earth is warming | 405 | 270.43986 |
| Mod/Cons Democrat, Earth is warming | 563 | 459.32801 |
| Mod/Lib Republican, Earth is warming | 135 | 175.69596 |
| Conservative Republican, Not warming | 450 | 252.61340 |
| Liberal Democrat, Not warming | 23 | 153.33600 |
| Mod/Cons Democrat, Not warming | 158 | 260.43320 |
| Mod/Lib Republican, Not warming | 135 | 99.61740 |

• Subtract the expected value from the observed value and place the result in the “obs-exp” column.

| category | observed | expected | obs-exp |
|---|---|---|---|
| Conservative Republican, Don’t know / refuse to answer | 45 | 44.85042 | 0.15 |
| Liberal Democrat, Don’t know / refuse to answer | 23 | 27.22415 | -4.22 |
| Mod/Cons Democrat, Don’t know / refuse to answer | 45 | 46.23879 | -1.24 |
| Mod/Lib Republican, Don’t know / refuse to answer | 23 | 17.68664 | 5.31 |
| Conservative Republican, Earth is warming | 248 | 445.53617 | -197.54 |
| Liberal Democrat, Earth is warming | 405 | 270.43986 | 134.56 |
| Mod/Cons Democrat, Earth is warming | 563 | 459.32801 | 103.67 |
| Mod/Lib Republican, Earth is warming | 135 | 175.69596 | -40.70 |
| Conservative Republican, Not warming | 450 | 252.61340 | 197.39 |
| Liberal Democrat, Not warming | 23 | 153.33600 | -130.34 |
| Mod/Cons Democrat, Not warming | 158 | 260.43320 | -102.43 |
| Mod/Lib Republican, Not warming | 135 | 99.61740 | 35.38 |

• Square the differences from the previous step and place the result in the “(obs-exp)^2” column.

| category | observed | expected | obs-exp | (obs-exp)^2 |
|---|---|---|---|---|
| Conservative Republican, Don’t know / refuse to answer | 45 | 44.85042 | 0.15 | 0.02 |
| Liberal Democrat, Don’t know / refuse to answer | 23 | 27.22415 | -4.22 | 17.81 |
| Mod/Cons Democrat, Don’t know / refuse to answer | 45 | 46.23879 | -1.24 | 1.54 |
| Mod/Lib Republican, Don’t know / refuse to answer | 23 | 17.68664 | 5.31 | 28.20 |
| Conservative Republican, Earth is warming | 248 | 445.53617 | -197.54 | 39022.05 |
| Liberal Democrat, Earth is warming | 405 | 270.43986 | 134.56 | 18106.39 |
| Mod/Cons Democrat, Earth is warming | 563 | 459.32801 | 103.67 | 10747.47 |
| Mod/Lib Republican, Earth is warming | 135 | 175.69596 | -40.70 | 1656.49 |
| Conservative Republican, Not warming | 450 | 252.61340 | 197.39 | 38962.81 |
| Liberal Democrat, Not warming | 23 | 153.33600 | -130.34 | 16988.52 |
| Mod/Cons Democrat, Not warming | 158 | 260.43320 | -102.43 | 10491.90 |
| Mod/Lib Republican, Not warming | 135 | 99.61740 | 35.38 | 1251.74 |

• Divide the squared differences by their respective expected values and place the result in the “(obs-exp)^2/exp” column.

| category | observed | expected | obs-exp | (obs-exp)^2 | (obs-exp)^2/exp |
|---|---|---|---|---|---|
| Conservative Republican, Don’t know / refuse to answer | 45 | 44.85042 | 0.15 | 0.02 | 0.00 |
| Liberal Democrat, Don’t know / refuse to answer | 23 | 27.22415 | -4.22 | 17.81 | 0.65 |
| Mod/Cons Democrat, Don’t know / refuse to answer | 45 | 46.23879 | -1.24 | 1.54 | 0.03 |
| Mod/Lib Republican, Don’t know / refuse to answer | 23 | 17.68664 | 5.31 | 28.20 | 1.59 |
| Conservative Republican, Earth is warming | 248 | 445.53617 | -197.54 | 39022.05 | 87.58 |
| Liberal Democrat, Earth is warming | 405 | 270.43986 | 134.56 | 18106.39 | 66.95 |
| Mod/Cons Democrat, Earth is warming | 563 | 459.32801 | 103.67 | 10747.47 | 23.40 |
| Mod/Lib Republican, Earth is warming | 135 | 175.69596 | -40.70 | 1656.49 | 9.43 |
| Conservative Republican, Not warming | 450 | 252.61340 | 197.39 | 38962.81 | 154.24 |
| Liberal Democrat, Not warming | 23 | 153.33600 | -130.34 | 16988.52 | 110.79 |
| Mod/Cons Democrat, Not warming | 158 | 260.43320 | -102.43 | 10491.90 | 40.29 |
| Mod/Lib Republican, Not warming | 135 | 99.61740 | 35.38 | 1251.74 | 12.57 |

• Sum all the values in the last column to get the chi-square statistic:

The χ^2 statistic = 507.52.

• If the null hypothesis is true, the χ^2 statistic has a chi-square distribution with (R − 1) × (C − 1) degrees of freedom.

In our 4X3 contingency table, the df = (4-1)X(3-1) = 6.

The following is the chi-square distribution with 6 df.

In the first plot, we see that when the χ^2 value = 12.59, the area to the right or the p-value = 0.05.

In our contingency table, the χ^2 value = 507.52 (plotted as a vertical line in the second plot), so the p-value is far smaller than 0.05.

• Make a decision: reject Ho or fail to reject Ho.

The p-value < 0.05, so it is a statistically significant result. We reject the null hypothesis and conclude that results like our sample data are unlikely under the null hypothesis.

We conclude that there is a significant relationship between the different parties and the response type in the persons surveyed.
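The “(obs-exp)^2/exp” column for this 4 × 3 example can be reproduced in one pass, which also shows which cells contribute most to the statistic. This is a sketch with illustrative names; the labels are copied from the table.

```python
parties = ["Conservative Republican", "Liberal Democrat",
           "Mod/Cons Democrat", "Mod/Lib Republican"]
responses = ["Don't know / refuse to answer", "Earth is warming", "Not warming"]

observed = [
    [45, 248, 450],
    [23, 405, 23],
    [45, 563, 158],
    [23, 135, 135],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Contribution of each cell to the statistic: (obs - exp)^2 / exp.
contributions = {}
for i, party in enumerate(parties):
    for j, response in enumerate(responses):
        exp = row_totals[i] * col_totals[j] / total
        contributions[(party, response)] = (observed[i][j] - exp) ** 2 / exp

chi2 = sum(contributions.values())
largest = max(contributions, key=contributions.get)

print(round(chi2, 1))  # 507.5
print(largest)         # ('Conservative Republican', 'Not warming')
```

The largest contribution comes from the “Conservative Republican, Not warming” cell, in agreement with the 154.24 entry in the last table.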

## 5. Practice questions

1. Data from the 2010 General Social Survey are shown in the following table.

| | LEGAL | NOT LEGAL | Sum |
|---|---|---|---|
| BACHELOR | 119 | 112 | 231 |
| GRADUATE | 73 | 63 | 136 |
| HIGH SCHOOL | 304 | 307 | 611 |
| JUNIOR COLLEGE | 42 | 44 | 86 |
| LT HIGH SCHOOL | 65 | 130 | 195 |
| Sum | 603 | 656 | 1259 |

The rows are for the educational degree and the columns for answering the question “Do you think the use of marijuana should be made legal, or not?”.

Is there a relationship between educational degree and the answer type?

2. A sample of categorical variables from the General Social Survey showed the following table.

| | Other | Black | White | Sum |
|---|---|---|---|---|
| $25000 or more | 621 | 886 | 5856 | 7363 |
| $20000 – 24999 | 112 | 220 | 951 | 1283 |
| $15000 – 19999 | 134 | 180 | 734 | 1048 |
| $10000 – 14999 | 126 | 210 | 832 | 1168 |
| $8000 to 9999 | 41 | 56 | 243 | 340 |
| $7000 to 7999 | 24 | 27 | 137 | 188 |
| $6000 to 6999 | 26 | 35 | 154 | 215 |
| $5000 to 5999 | 27 | 40 | 160 | 227 |
| $4000 to 4999 | 34 | 38 | 154 | 226 |
| $3000 to 3999 | 35 | 59 | 182 | 276 |
| $1000 to 2999 | 47 | 71 | 277 | 395 |
| Lt $1000 | 36 | 51 | 199 | 286 |
| Sum | 1263 | 1873 | 9879 | 13015 |

The rows are for the reported income and the columns for race categories.

Is there a relationship between race and the reported income?

3. A study from the 1970s about whether gender influences hiring recommendations showed the following table.

| | not promoted | promoted | Sum |
|---|---|---|---|
| female | 10 | 14 | 24 |
| male | 3 | 21 | 24 |
| Sum | 13 | 35 | 48 |

The rows are for gender and the columns for the promotion decision.

All expected values are larger than 5 so the chi-square test can be used.

The χ^2 statistic = 5.1692.

The following is the chi-square distribution with 1 df.

Is that a significant result?

4. Demographic information on a random sample of 1,000 members of the US armed forces is shown in the following table.

| | female | male | Sum |
|---|---|---|---|
| air force | 40 | 175 | 215 |
| army | 54 | 351 | 405 |
| marine corps | 6 | 148 | 154 |
| navy | 34 | 192 | 226 |
| Sum | 134 | 866 | 1000 |

The rows are for the branch of the armed forces: air force, army, marine corps, or navy, and the columns for the gender.

All expected values are larger than 5 so the chi-square test can be used.

The χ^2 statistic = 17.534.

The following is the chi-square distribution with 3 df.

Is that a significant result?

5. Demographic information on a random sample of 2,000 members of the US armed forces is shown in the following table.

| | asian | black | white | Sum |
|---|---|---|---|---|
| air force | 11 | 91 | 364 | 466 |
| army | 38 | 176 | 586 | 800 |
| marine corps | 6 | 39 | 250 | 295 |
| navy | 30 | 99 | 310 | 439 |
| Sum | 85 | 405 | 1510 | 2000 |

The rows are for the branch of the armed forces: air force, army, marine corps, or navy, and the columns for the race.

All expected values are larger than 5 so the chi-square test can be used.

The χ^2 statistic = 30.051.

The following is the chi-square distribution with 6 df.

Is that a significant result?

### – Answer key

1. We follow the same steps as above to calculate the expected table.

| | LEGAL | NOT LEGAL | Sum |
|---|---|---|---|
| BACHELOR | 110.63781 | 120.36219 | 231 |
| GRADUATE | 65.13741 | 70.86259 | 136 |
| HIGH SCHOOL | 292.63940 | 318.36060 | 611 |
| JUNIOR COLLEGE | 41.18983 | 44.81017 | 86 |
| LT HIGH SCHOOL | 93.39555 | 101.60445 | 195 |
| Sum | 603.00000 | 656.00000 | 1259 |

We see that all expected values are larger than 5 so the chi-square test can be used.

Then, we follow the above steps to get the chi-square statistic.

The χ^2 statistic = 20.48.

In our 5X2 contingency table, the df = (5-1)X(2-1) = 4.

The following is the chi-square distribution with 4 df.

In the first plot, we see that when the χ^2 value = 9.49, the area to the right or the p-value = 0.05.

In our contingency table, the χ^2 value = 20.48 (plotted as a vertical line in the second plot), so the p-value is much smaller than 0.05.

The p-value < 0.05, so it is a statistically significant result.

We reject the null hypothesis and conclude that there is a significant relationship between the different educational degrees and the response type in the persons surveyed.

This means that the proportions of response type are different across different educational degrees.

2. We follow the same steps above to calculate the expected table.

| | Other | Black | White | Sum |
|---|---|---|---|---|
| $25000 or more | 714.51932 | 1059.61575 | 5588.8649 | 7363 |
| $20000 – 24999 | 124.50473 | 184.63765 | 973.8576 | 1283 |
| $15000 – 19999 | 101.69988 | 150.81859 | 795.4815 | 1048 |
| $10000 – 14999 | 113.34491 | 168.08790 | 886.5672 | 1168 |
| $8000 to 9999 | 32.99424 | 48.92970 | 258.0761 | 340 |
| $7000 to 7999 | 18.24387 | 27.05524 | 142.7009 | 188 |
| $6000 to 6999 | 20.86400 | 30.94084 | 163.1952 | 215 |
| $5000 to 5999 | 22.02851 | 32.66777 | 172.3037 | 227 |
| $4000 to 4999 | 21.93146 | 32.52386 | 171.5447 | 226 |
| $3000 to 3999 | 26.78356 | 39.71940 | 209.4970 | 276 |
| $1000 to 2999 | 38.33154 | 56.84479 | 299.8237 | 395 |
| Lt $1000 | 27.75398 | 41.15851 | 217.0875 | 286 |
| Sum | 1263.00000 | 1873.00000 | 9879.0000 | 13015 |

We see that all expected values are larger than 5 so the chi-square test can be used.

Then, we follow the above steps to get the chi-square statistic.

The χ^2 statistic = 148.13.

In our 12X3 contingency table, the df = (12-1)X(3-1) = 22.

The following is the chi-square distribution with 22 df.

In the first plot, we see that when the χ^2 value = 33.92, the area to the right or the p-value = 0.05.

In our contingency table, the χ^2 value = 148.13 (plotted as a vertical line in the second plot), so the p-value is far smaller than 0.05.

The p-value < 0.05, so it is a statistically significant result.

We reject the null hypothesis and conclude that there is a significant relationship between race and the reported income in the persons surveyed.

This means that the proportions of incomes are different across different races.

3. We see, from the plot, that when the χ^2 value = 3.84, the area to the right or the p-value = 0.05.

In our contingency table, the χ^2 value = 5.1692, so the p-value is smaller than 0.05.

The p-value < 0.05, so it is a statistically significant result.

We reject the null hypothesis and conclude that there is a significant relationship between gender and being promoted.

This means that the proportions of promotions are different across the 2 sexes.
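The statistic quoted in question 3 can be checked directly from the four counts. The sketch below assumes that no continuity correction is applied, which is an assumption here but does reproduce the stated value of 5.1692. It uses the same 1-df tail formula, erfc(sqrt(x / 2)), as a standard-library stand-in for a chi-square table.

```python
import math

# Rows: female, male. Columns: not promoted, promoted.
observed = [
    [10, 14],
    [3, 21],
]

row_totals = [sum(row) for row in observed]        # [24, 24]
col_totals = [sum(col) for col in zip(*observed)]  # [13, 35]
total = sum(row_totals)                            # 48

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / total
        chi2 += (obs - exp) ** 2 / exp

# Right-tail area for 1 degree of freedom (2 x 2 table).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 4))  # 5.1692
print(p_value < 0.05)  # True
```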

4. We see, from the plot, that when the χ^2 value = 7.81, the area to the right or the p-value = 0.05.

In our contingency table, the χ^2 value = 17.534, so the p-value is smaller than 0.05.

The p-value < 0.05, so it is a statistically significant result.

We reject the null hypothesis and conclude that there is a significant relationship between gender and the branch of the armed forces.

This means that the proportions of members serving in each branch of the US armed forces are different across the 2 sexes.
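With 3 degrees of freedom the simple 1-df formula no longer applies. As a hedged sketch, the right-tail area for an odd number of degrees of freedom can be written with the standard normal tail and density: erfc(χ/√2) + 2·Z(χ)·Σ χ^(2r−1) / (1·3·…·(2r−1)), where χ is the square root of the statistic and r runs from 1 to (df − 1)/2. The function name is illustrative, and a statistics library would normally be used instead.

```python
import math


def right_tail_odd_df(stat, df):
    """Area to the right of stat under a chi-square curve; odd df only."""
    if df <= 0 or df % 2 != 1:
        raise ValueError("this closed form needs a positive, odd df")
    chi = math.sqrt(stat)
    # Two-sided standard normal tail beyond chi.
    tail = math.erfc(chi / math.sqrt(2))
    # Standard normal density at chi.
    density = math.exp(-stat / 2) / math.sqrt(2 * math.pi)
    series = 0.0
    odd_product = 1.0  # running product 1 * 3 * ... * (2r - 1)
    for r in range(1, (df - 1) // 2 + 1):
        odd_product *= 2 * r - 1
        series += chi ** (2 * r - 1) / odd_product
    return tail + 2 * density * series


p_at_cutoff = right_tail_odd_df(7.81, 3)   # close to 0.05
p_question4 = right_tail_odd_df(17.534, 3)

print(round(p_at_cutoff, 2))  # 0.05
print(p_question4 < 0.05)     # True
```

For df = 1 the series is empty and the function reduces to erfc(sqrt(x / 2)), the form used for the 2 × 2 tables.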

5. We see, from the plot, that when the χ^2 value = 12.59, the area to the right or the p-value = 0.05.

In our contingency table, the χ^2 value = 30.051, so the p-value is smaller than 0.05.

The p-value < 0.05, so it is a statistically significant result.

We reject the null hypothesis and conclude that there is a significant relationship between race and the branch of the armed forces.

This means that the proportions of branches are different across the different races.