10  The Chi-Square Test

In previous sections, we used cross-tabulations to describe the relationship between two categorical variables. But how do we know whether an observed association is statistically significant — i.e., unlikely to be due to chance alone? This is what the chi-square test (χ²) tells us.

An interactive introduction

Before reading further, we recommend exploring this interactive tool which illustrates the logic of the chi-square test visually: https://istats.shinyapps.io/ChiSquaredTest/


10.1 How Does the Chi-Square Test Work?

The chi-square test compares what we observe in the data to what we would expect if there were no association between the two variables (the null hypothesis of independence).

For each cell of the cross-tabulation, the expected count is:

\[E_{ij} = \frac{\text{Row total}_i \times \text{Column total}_j}{\text{Grand total}}\]

The test statistic is then:

\[\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

where \(O_{ij}\) is the observed count and \(E_{ij}\) the expected count in each cell. The larger the discrepancy between observed and expected, the larger the χ² statistic, and the less likely the association is due to chance.

The p-value is derived from comparing this statistic to a chi-square distribution with \((r-1)(c-1)\) degrees of freedom (d.f.), where \(r\) is the number of rows and \(c\) the number of columns.

The chi-square test is sensitive to sample size

A very large sample will almost always produce a significant chi-square, even for trivially small associations. Conversely, a very small sample may fail to detect a real association. This is why the chi-square test should be calculated on normalized survey weights. It can be accompanied by a measure of effect size (such as Cramér’s V) and a careful look at the residuals to understand where the association lies.


10.2 Setup

We continue with the HIES data, using our binary variable walk (whether a child walks to school) and examining its association with household wealth, parental education, province, and sex.

library(tidyverse)
library(questionr)
library(kableExtra)
library(survey)
library(srvyr)
library(gtsummary)

setwd("/home/groups/3genquanti/SoMix/HIES for workshop")
edu <- readRDS("edu.rds")

edu_school <- edu |>
  mutate(walk = if_else(transport_medium == 1, 1, 0)) |>
  filter(!is.na(distance))

10.3 Method 1: The Simple Approach

This approach follows the logic introduced in the cross-tabulation section of the tutorial.

10.3.1 Step 1: Row percentages

edu_school |> 
  freqtable(hhwealthcat, walk, weights = finalweight_25per) |>
  rprop() |>
  kbl(digits = 0) |> 
  kable_classic(full_width = FALSE)
0 1 Total
Poorest 49 51 100
Poor 64 36 100
Middle 70 30 100
Rich 76 24 100
Richest 90 10 100
All 71 29 100

10.3.2 Step 2: Normalize the weights before running the test

The chi-square test is based on weighted counts. If we feed it the raw survey weights, the sample size becomes artificially inflated, which will produce a significant result almost by construction.

The solution is to normalize the weights so that they sum to the actual sample size:

edu_school <- edu_school |>
  mutate(weightnorm = finalweight_25per / sum(finalweight_25per) * n())

# Check: normalized weights should sum to the sample size
sum(edu_school$weightnorm)
[1] 3927

10.3.3 Step 3: Run the chi-square test

tabfreqwn <- edu_school |> 
  freqtable(hhwealthcat, walk, weights = weightnorm)

chi <- chisq.test(tabfreqwn)
chi

    Pearson's Chi-squared test

data:  tabfreqwn
X-squared = 308.09, df = 4, p-value < 2.2e-16

10.3.4 Step 4: Inspect observed, expected, and residuals

Beyond the p-value, the chi-square test produces useful diagnostics that tell us where the association is strongest:

# Observed counts (weighted)
chi$observed
           walk
hhwealthcat         0         1
    Poorest 286.77405 295.79092
    Poor    551.12499 309.36991
    Middle  556.09568 242.39552
    Rich    665.40623 207.19191
    Richest 730.75305  82.09774
# Expected counts under independence
chi$expected
           walk
hhwealthcat        0        1
    Poorest 413.9155 168.6495
    Poor    611.3861 249.1088
    Middle  567.3322 231.1590
    Rich    619.9855 252.6126
    Richest 577.5347 235.3161
# Pearson residuals: (observed - expected) / sqrt(expected)
chi$residuals
           walk
hhwealthcat          0          1
    Poorest -6.2492970  9.7902648
    Poor    -2.4371343  3.8180598
    Middle  -0.4717499  0.7390522
    Rich     1.8241603 -2.8577635
    Richest  6.3756084 -9.9881465
# Standardized residuals: values > |2| indicate cells that deviate significantly
chi$stdres
           walk
hhwealthcat           0           1
    Poorest -12.5857723  12.5857723
    Poor     -5.1258713   5.1258713
    Middle   -0.9823206   0.9823206
    Rich      3.8442357  -3.8442357
    Richest  13.3064440 -13.3064440
Reading the standardized residuals

Standardized residuals greater than +2 indicate cells where the observed count is significantly higher than expected; values below -2 indicate cells where it is significantly lower. This tells us exactly which combinations of categories drive the overall association.

For example, if Richest × walk=0 has a standardized residual of +13.3, it means that children from the richest households are far more likely not to walk to school than we would expect under independence.


10.4 Method 2: Using the Survey Design

The approach above uses normalized weights, which is a pragmatic solution. The more rigorous approach is to use svychisq() from the survey package, which correctly accounts for the complex sampling design. Here, we do not need to use normalized sampling weights, the function will do it for us.

10.4.1 Define the survey design

design <- edu_school |> 
  as_survey_design(weights = finalweight_25per)

10.4.2 Descriptive table with tbl_svysummary

design |> 
  mutate(walk = factor(walk)) |>
  tbl_svysummary(
    include    = hhwealthcat,
    by         = walk,
    statistic  = all_categorical() ~ "{p}% ({n_unweighted} obs.)",
    percent    = "row"
  ) |> 
  add_overall(last = TRUE) |> 
  modify_header(
    all_stat_cols() ~ "**{level}** ({n_unweighted} obs.)"
  )
Characteristic 0 (2739 obs.)1 1 (1188 obs.)1 Overall (3927 obs.)1
hhwealthcat


    Poorest 49% (328 obs.) 51% (336 obs.) 100% (664 obs.)
    Poor 64% (564 obs.) 36% (326 obs.) 100% (890 obs.)
    Middle 70% (543 obs.) 30% (244 obs.) 100% (787 obs.)
    Rich 76% (648 obs.) 24% (198 obs.) 100% (846 obs.)
    Richest 90% (656 obs.) 10% (84 obs.) 100% (740 obs.)
1

10.4.3 Survey-corrected chi-square test

svychisq(~hhwealthcat + walk, 
         design    = design, 
         statistic = "Chisq")

    Pearson's X^2: Rao & Scott adjustment

data:  NextMethod()
X-squared = 308.09, df = 4, p-value < 2.2e-16
chisq.test vs svychisq

svychisq() uses a Rao-Scott correction that adjusts for the survey design (weights, clustering, stratification).

10.4.4 Combine table and test in one step

gtsummary allows us to add the p-value from svychisq directly to the summary table with add_p():

design |> 
  mutate(walk = factor(walk)) |>
  tbl_svysummary(
    include    = hhwealthcat,
    by         = walk,
    statistic  = all_categorical() ~ "{p}% ({n_unweighted} obs.)",
    percent    = "row",
    missing_stat = "{N_miss_unweighted} obs."
  ) |> 
  add_overall(last = TRUE) |> 
  add_p() |>
  modify_header(
    all_stat_cols() ~ "**{level}** ({n_unweighted} obs.)"
  )
Characteristic 0 (2739 obs.)1 1 (1188 obs.)1 Overall (3927 obs.)1 p-value2
hhwealthcat


<0.001
    Poorest 49% (328 obs.) 51% (336 obs.) 100% (664 obs.)
    Poor 64% (564 obs.) 36% (326 obs.) 100% (890 obs.)
    Middle 70% (543 obs.) 30% (244 obs.) 100% (787 obs.)
    Rich 76% (648 obs.) 24% (198 obs.) 100% (846 obs.)
    Richest 90% (656 obs.) 10% (84 obs.) 100% (740 obs.)
1
2 Pearson’s X^2: Rao & Scott adjustment

10.5 Testing Several Variables Simultaneously

One of the great advantages of tbl_svysummary is that we can test the association of walk with multiple variables at once, each with its own chi-square test:

design |> 
  mutate(walk = factor(walk)) |>
  tbl_svysummary(
    include    = c(hhwealthcat, edu_parent, province, sex),
    by         = walk,
    statistic  = all_categorical() ~ "{p}% ({n_unweighted} obs.)",
    percent    = "row",
    missing_stat = "{N_miss_unweighted} obs."
  ) |> 
  add_overall(last = TRUE) |> 
  add_p() |>
  modify_header(
    all_stat_cols() ~ "**{level}** ({n_unweighted} obs.)"
  )
Characteristic 0 (2739 obs.)1 1 (1188 obs.)1 Overall (3927 obs.)1 p-value2
hhwealthcat


<0.001
    Poorest 49% (328 obs.) 51% (336 obs.) 100% (664 obs.)
    Poor 64% (564 obs.) 36% (326 obs.) 100% (890 obs.)
    Middle 70% (543 obs.) 30% (244 obs.) 100% (787 obs.)
    Rich 76% (648 obs.) 24% (198 obs.) 100% (846 obs.)
    Richest 90% (656 obs.) 10% (84 obs.) 100% (740 obs.)
Educational attainment


<0.001
    Less than primary 52% (53 obs.) 48% (51 obs.) 100% (104 obs.)
    Primary 53% (202 obs.) 47% (187 obs.) 100% (389 obs.)
    Junior secondary 68% (892 obs.) 32% (422 obs.) 100% (1,314 obs.)
    Senior secondary 74% (528 obs.) 26% (197 obs.) 100% (725 obs.)
    Collegiate 83% (448 obs.) 17% (97 obs.) 100% (545 obs.)
    Tertiary 87% (116 obs.) 13% (18 obs.) 100% (134 obs.)
    Unknown 500 obs. 216 obs. 716 obs.
province


<0.001
    Western 79% (662 obs.) 21% (178 obs.) 100% (840 obs.)
    Central 60% (306 obs.) 40% (221 obs.) 100% (527 obs.)
    Southern 74% (414 obs.) 26% (136 obs.) 100% (550 obs.)
    Northern 72% (303 obs.) 28% (125 obs.) 100% (428 obs.)
    Eastern 49% (226 obs.) 51% (235 obs.) 100% (461 obs.)
    North Western 83% (309 obs.) 17% (65 obs.) 100% (374 obs.)
    North Central 82% (184 obs.) 18% (46 obs.) 100% (230 obs.)
    Uva 60% (161 obs.) 40% (99 obs.) 100% (260 obs.)
    Sabaragamuwa 68% (174 obs.) 32% (83 obs.) 100% (257 obs.)
sex


0.7
    Male 71% (1,302 obs.) 29% (568 obs.) 100% (1,870 obs.)
    Female 71% (1,437 obs.) 29% (620 obs.) 100% (2,057 obs.)
1
2 Pearson’s X^2: Rao & Scott adjustment

This table gives us a clear overview of which variables are significantly associated with walking to school — and serves as a natural bivariate screening before running the multivariate logistic regression of Session 3.

Exercise
  1. Run the chi-square test for the association between edu_parent and walk. Examine the standardized residuals — which parental education categories show the strongest deviation from independence?

  2. Is the association between sex and walk statistically significant?

  3. Run tbl_svysummary with add_p() for the variables hhwealthcat, edu_parent, province, and sex on the likelihood of going to school by bus. Comment.

  4. Bonus: Compute Cramér’s V for the association between hhwealthcat and walk using the rcompanion package (cramerV(tabfreqwn)). How does the effect size compare to what the p-value alone suggested?