10 The Chi-Square Test

In previous sections, we used cross-tabulations to describe the relationship between two categorical variables. But how do we know whether an observed association is statistically significant — i.e., unlikely to be due to chance alone? This is what the chi-square test (χ²) tells us.

An interactive introduction

Before reading further, we recommend exploring this interactive tool which illustrates the logic of the chi-square test visually: https://istats.shinyapps.io/ChiSquaredTest/

10.1 How Does the Chi-Square Test Work?

The chi-square test compares what we observe in the data to what we would expect if there were no association between the two variables (the null hypothesis of independence).

For each cell of the cross-tabulation, the expected count is:

\[E_{ij} = \frac{\text{Row total}_i \times \text{Column total}_j}{\text{Grand total}}\]

The test statistic is then:

\[\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

where \(O_{ij}\) is the observed count and \(E_{ij}\) the expected count in each cell. The larger the discrepancy between observed and expected, the larger the χ² statistic, and the less likely the association is due to chance.

The p-value is derived from comparing this statistic to a chi-square distribution with \((r-1)(c-1)\) degrees of freedom (d.f.), where \(r\) is the number of rows and \(c\) the number of columns.

The chi-square test is sensitive to sample size

A very large sample will almost always produce a significant chi-square, even for trivially small associations. Conversely, a very small sample may fail to detect a real association. This is why the chi-square test should be calculated on normalized survey weights. It can be accompanied by a measure of effect size (such as Cramér’s V) and a careful look at the residuals to understand where the association lies.

10.2 Setup

We continue with the HIES data, using our binary variable walk (whether a child walks to school) and examining its association with household wealth, parental education, province, and sex.

library(tidyverse)
library(questionr)
library(kableExtra)
library(survey)
library(srvyr)
library(gtsummary)

setwd("/home/groups/3genquanti/SoMix/HIES for workshop")
edu <- readRDS("edu.rds")

edu_school <- edu |>
  mutate(walk = if_else(transport_medium == 1, 1, 0)) |>
  filter(!is.na(distance))

10.3 Method 1: The Simple Approach

This approach follows the logic introduced in the cross-tabulation section of the tutorial.

10.3.1 Step 1: Row percentages

edu_school |> 
  freqtable(hhwealthcat, walk, weights = finalweight_25per) |>
  rprop() |>
  kbl(digits = 0) |> 
  kable_classic(full_width = FALSE)

	0	1	Total
Poorest	49	51	100
Poor	64	36	100
Middle	70	30	100
Rich	76	24	100
Richest	90	10	100
All	71	29	100

10.3.2 Step 2: Normalize the weights before running the test

The chi-square test is based on weighted counts. If we feed it the raw survey weights, the sample size becomes artificially inflated, which will produce a significant result almost by construction.

The solution is to normalize the weights so that they sum to the actual sample size:

edu_school <- edu_school |>
  mutate(weightnorm = finalweight_25per / sum(finalweight_25per) * n())

# Check: normalized weights should sum to the sample size
sum(edu_school$weightnorm)

[1] 3927

10.3.3 Step 3: Run the chi-square test

tabfreqwn <- edu_school |> 
  freqtable(hhwealthcat, walk, weights = weightnorm)

chi <- chisq.test(tabfreqwn)
chi


    Pearson's Chi-squared test

data:  tabfreqwn
X-squared = 308.09, df = 4, p-value < 2.2e-16

10.3.4 Step 4: Inspect observed, expected, and residuals

Beyond the p-value, the chi-square test produces useful diagnostics that tell us where the association is strongest:

# Observed counts (weighted)
chi$observed

           walk
hhwealthcat         0         1
    Poorest 286.77405 295.79092
    Poor    551.12499 309.36991
    Middle  556.09568 242.39552
    Rich    665.40623 207.19191
    Richest 730.75305  82.09774

# Expected counts under independence
chi$expected

           walk
hhwealthcat        0        1
    Poorest 413.9155 168.6495
    Poor    611.3861 249.1088
    Middle  567.3322 231.1590
    Rich    619.9855 252.6126
    Richest 577.5347 235.3161

# Pearson residuals: (observed - expected) / sqrt(expected)
chi$residuals

           walk
hhwealthcat          0          1
    Poorest -6.2492970  9.7902648
    Poor    -2.4371343  3.8180598
    Middle  -0.4717499  0.7390522
    Rich     1.8241603 -2.8577635
    Richest  6.3756084 -9.9881465

# Standardized residuals: values > |2| indicate cells that deviate significantly
chi$stdres

           walk
hhwealthcat           0           1
    Poorest -12.5857723  12.5857723
    Poor     -5.1258713   5.1258713
    Middle   -0.9823206   0.9823206
    Rich      3.8442357  -3.8442357
    Richest  13.3064440 -13.3064440

Reading the standardized residuals

Standardized residuals greater than +2 indicate cells where the observed count is significantly higher than expected; values below -2 indicate cells where it is significantly lower. This tells us exactly which combinations of categories drive the overall association.

For example, if Richest × walk=0 has a standardized residual of +13.3, it means that children from the richest households are far more likely not to walk to school than we would expect under independence.

10.4 Method 2: Using the Survey Design

The approach above uses normalized weights, which is a pragmatic solution. The more rigorous approach is to use svychisq() from the survey package, which correctly accounts for the complex sampling design. Here, we do not need to use normalized sampling weights, the function will do it for us.

10.4.1 Define the survey design

design <- edu_school |> 
  as_survey_design(weights = finalweight_25per)

10.4.2 Descriptive table with `tbl_svysummary`

design |> 
  mutate(walk = factor(walk)) |>
  tbl_svysummary(
    include    = hhwealthcat,
    by         = walk,
    statistic  = all_categorical() ~ "{p}% ({n_unweighted} obs.)",
    percent    = "row"
  ) |> 
  add_overall(last = TRUE) |> 
  modify_header(
    all_stat_cols() ~ "**{level}** ({n_unweighted} obs.)"
  )

Characteristic	0 (2739 obs.)¹	1 (1188 obs.)¹	Overall (3927 obs.)¹
hhwealthcat
Poorest	49% (328 obs.)	51% (336 obs.)	100% (664 obs.)
Poor	64% (564 obs.)	36% (326 obs.)	100% (890 obs.)
Middle	70% (543 obs.)	30% (244 obs.)	100% (787 obs.)
Rich	76% (648 obs.)	24% (198 obs.)	100% (846 obs.)
Richest	90% (656 obs.)	10% (84 obs.)	100% (740 obs.)
¹

10.4.3 Survey-corrected chi-square test

svychisq(~hhwealthcat + walk, 
         design    = design, 
         statistic = "Chisq")


    Pearson's X^2: Rao & Scott adjustment

data:  NextMethod()
X-squared = 308.09, df = 4, p-value < 2.2e-16

chisq.test vs svychisq

svychisq() uses a Rao-Scott correction that adjusts for the survey design (weights, clustering, stratification).

10.4.4 Combine table and test in one step

gtsummary allows us to add the p-value from svychisq directly to the summary table with add_p():

design |> 
  mutate(walk = factor(walk)) |>
  tbl_svysummary(
    include    = hhwealthcat,
    by         = walk,
    statistic  = all_categorical() ~ "{p}% ({n_unweighted} obs.)",
    percent    = "row",
    missing_stat = "{N_miss_unweighted} obs."
  ) |> 
  add_overall(last = TRUE) |> 
  add_p() |>
  modify_header(
    all_stat_cols() ~ "**{level}** ({n_unweighted} obs.)"
  )

Characteristic	0 (2739 obs.)¹	1 (1188 obs.)¹	Overall (3927 obs.)¹	p-value²
hhwealthcat				<0.001
Poorest	49% (328 obs.)	51% (336 obs.)	100% (664 obs.)
Poor	64% (564 obs.)	36% (326 obs.)	100% (890 obs.)
Middle	70% (543 obs.)	30% (244 obs.)	100% (787 obs.)
Rich	76% (648 obs.)	24% (198 obs.)	100% (846 obs.)
Richest	90% (656 obs.)	10% (84 obs.)	100% (740 obs.)
¹
² Pearson’s X^2: Rao & Scott adjustment

10.5 Testing Several Variables Simultaneously

One of the great advantages of tbl_svysummary is that we can test the association of walk with multiple variables at once, each with its own chi-square test:

design |> 
  mutate(walk = factor(walk)) |>
  tbl_svysummary(
    include    = c(hhwealthcat, edu_parent, province, sex),
    by         = walk,
    statistic  = all_categorical() ~ "{p}% ({n_unweighted} obs.)",
    percent    = "row",
    missing_stat = "{N_miss_unweighted} obs."
  ) |> 
  add_overall(last = TRUE) |> 
  add_p() |>
  modify_header(
    all_stat_cols() ~ "**{level}** ({n_unweighted} obs.)"
  )

Characteristic	0 (2739 obs.)¹	1 (1188 obs.)¹	Overall (3927 obs.)¹	p-value²
hhwealthcat				<0.001
Poorest	49% (328 obs.)	51% (336 obs.)	100% (664 obs.)
Poor	64% (564 obs.)	36% (326 obs.)	100% (890 obs.)
Middle	70% (543 obs.)	30% (244 obs.)	100% (787 obs.)
Rich	76% (648 obs.)	24% (198 obs.)	100% (846 obs.)
Richest	90% (656 obs.)	10% (84 obs.)	100% (740 obs.)
Educational attainment				<0.001
Less than primary	52% (53 obs.)	48% (51 obs.)	100% (104 obs.)
Primary	53% (202 obs.)	47% (187 obs.)	100% (389 obs.)
Junior secondary	68% (892 obs.)	32% (422 obs.)	100% (1,314 obs.)
Senior secondary	74% (528 obs.)	26% (197 obs.)	100% (725 obs.)
Collegiate	83% (448 obs.)	17% (97 obs.)	100% (545 obs.)
Tertiary	87% (116 obs.)	13% (18 obs.)	100% (134 obs.)
Unknown	500 obs.	216 obs.	716 obs.
province				<0.001
Western	79% (662 obs.)	21% (178 obs.)	100% (840 obs.)
Central	60% (306 obs.)	40% (221 obs.)	100% (527 obs.)
Southern	74% (414 obs.)	26% (136 obs.)	100% (550 obs.)
Northern	72% (303 obs.)	28% (125 obs.)	100% (428 obs.)
Eastern	49% (226 obs.)	51% (235 obs.)	100% (461 obs.)
North Western	83% (309 obs.)	17% (65 obs.)	100% (374 obs.)
North Central	82% (184 obs.)	18% (46 obs.)	100% (230 obs.)
Uva	60% (161 obs.)	40% (99 obs.)	100% (260 obs.)
Sabaragamuwa	68% (174 obs.)	32% (83 obs.)	100% (257 obs.)
sex				0.7
Male	71% (1,302 obs.)	29% (568 obs.)	100% (1,870 obs.)
Female	71% (1,437 obs.)	29% (620 obs.)	100% (2,057 obs.)
¹
² Pearson’s X^2: Rao & Scott adjustment

This table gives us a clear overview of which variables are significantly associated with walking to school — and serves as a natural bivariate screening before running the multivariate logistic regression of Session 3.

Exercise

Run the chi-square test for the association between edu_parent and walk. Examine the standardized residuals — which parental education categories show the strongest deviation from independence?
Is the association between sex and walk statistically significant?
Run tbl_svysummary with add_p() for the variables hhwealthcat, edu_parent, province, and sex on the likelihood of going to school by bus. Comment.
Bonus: Compute Cramér’s V for the association between hhwealthcat and walk using the rcompanion package (cramerV(tabfreqwn)). How does the effect size compare to what the p-value alone suggested?

10.1 How Does the Chi-Square Test Work?

10.2 Setup

10.3 Method 1: The Simple Approach

10.3.1 Step 1: Row percentages

10.3.2 Step 2: Normalize the weights before running the test

10.3.3 Step 3: Run the chi-square test

10.3.4 Step 4: Inspect observed, expected, and residuals

10.4 Method 2: Using the Survey Design

10.4.1 Define the survey design

10.4.2 Descriptive table with tbl_svysummary

10.4.3 Survey-corrected chi-square test

10.4.4 Combine table and test in one step

10.5 Testing Several Variables Simultaneously

10.4.2 Descriptive table with `tbl_svysummary`