library(tidyverse)
library(questionr)
library(kableExtra)
library(survey)
library(srvyr)
library(gtsummary)
setwd("/home/groups/3genquanti/SoMix/HIES for workshop")
edu <- readRDS("edu.rds")
edu_school <- edu |>
mutate(walk = if_else(transport_medium == 1, 1, 0)) |>
filter(!is.na(distance))10 The Chi-Square Test
In previous sections, we used cross-tabulations to describe the relationship between two categorical variables. But how do we know whether an observed association is statistically significant — i.e., unlikely to be due to chance alone? This is what the chi-square test (χ²) tells us.
Before reading further, we recommend exploring this interactive tool which illustrates the logic of the chi-square test visually: https://istats.shinyapps.io/ChiSquaredTest/
10.1 How Does the Chi-Square Test Work?
The chi-square test compares what we observe in the data to what we would expect if there were no association between the two variables (the null hypothesis of independence).
For each cell of the cross-tabulation, the expected count is:
\[E_{ij} = \frac{\text{Row total}_i \times \text{Column total}_j}{\text{Grand total}}\]
The test statistic is then:
\[\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]
where \(O_{ij}\) is the observed count and \(E_{ij}\) the expected count in each cell. The larger the discrepancy between observed and expected, the larger the χ² statistic, and the less likely the association is due to chance.
The p-value is derived from comparing this statistic to a chi-square distribution with \((r-1)(c-1)\) degrees of freedom (d.f.), where \(r\) is the number of rows and \(c\) the number of columns.
A very large sample will almost always produce a significant chi-square, even for trivially small associations. Conversely, a very small sample may fail to detect a real association. This is why the chi-square test should be calculated on normalized survey weights. It can be accompanied by a measure of effect size (such as Cramér’s V) and a careful look at the residuals to understand where the association lies.
10.2 Setup
We continue with the HIES data, using our binary variable walk (whether a child walks to school) and examining its association with household wealth, parental education, province, and sex.
10.3 Method 1: The Simple Approach
This approach follows the logic introduced in the cross-tabulation section of the tutorial.
10.3.1 Step 1: Row percentages
edu_school |>
freqtable(hhwealthcat, walk, weights = finalweight_25per) |>
rprop() |>
kbl(digits = 0) |>
kable_classic(full_width = FALSE)| 0 | 1 | Total | |
|---|---|---|---|
| Poorest | 49 | 51 | 100 |
| Poor | 64 | 36 | 100 |
| Middle | 70 | 30 | 100 |
| Rich | 76 | 24 | 100 |
| Richest | 90 | 10 | 100 |
| All | 71 | 29 | 100 |
10.3.2 Step 2: Normalize the weights before running the test
The chi-square test is based on weighted counts. If we feed it the raw survey weights, the sample size becomes artificially inflated, which will produce a significant result almost by construction.
The solution is to normalize the weights so that they sum to the actual sample size:
edu_school <- edu_school |>
mutate(weightnorm = finalweight_25per / sum(finalweight_25per) * n())
# Check: normalized weights should sum to the sample size
sum(edu_school$weightnorm)[1] 3927
10.3.3 Step 3: Run the chi-square test
tabfreqwn <- edu_school |>
freqtable(hhwealthcat, walk, weights = weightnorm)
chi <- chisq.test(tabfreqwn)
chi
Pearson's Chi-squared test
data: tabfreqwn
X-squared = 308.09, df = 4, p-value < 2.2e-16
10.3.4 Step 4: Inspect observed, expected, and residuals
Beyond the p-value, the chi-square test produces useful diagnostics that tell us where the association is strongest:
# Observed counts (weighted)
chi$observed walk
hhwealthcat 0 1
Poorest 286.77405 295.79092
Poor 551.12499 309.36991
Middle 556.09568 242.39552
Rich 665.40623 207.19191
Richest 730.75305 82.09774
# Expected counts under independence
chi$expected walk
hhwealthcat 0 1
Poorest 413.9155 168.6495
Poor 611.3861 249.1088
Middle 567.3322 231.1590
Rich 619.9855 252.6126
Richest 577.5347 235.3161
# Pearson residuals: (observed - expected) / sqrt(expected)
chi$residuals walk
hhwealthcat 0 1
Poorest -6.2492970 9.7902648
Poor -2.4371343 3.8180598
Middle -0.4717499 0.7390522
Rich 1.8241603 -2.8577635
Richest 6.3756084 -9.9881465
# Standardized residuals: values > |2| indicate cells that deviate significantly
chi$stdres walk
hhwealthcat 0 1
Poorest -12.5857723 12.5857723
Poor -5.1258713 5.1258713
Middle -0.9823206 0.9823206
Rich 3.8442357 -3.8442357
Richest 13.3064440 -13.3064440
Standardized residuals greater than +2 indicate cells where the observed count is significantly higher than expected; values below -2 indicate cells where it is significantly lower. This tells us exactly which combinations of categories drive the overall association.
For example, if Richest × walk=0 has a standardized residual of +13.3, it means that children from the richest households are far more likely not to walk to school than we would expect under independence.
10.4 Method 2: Using the Survey Design
The approach above uses normalized weights, which is a pragmatic solution. The more rigorous approach is to use svychisq() from the survey package, which correctly accounts for the complex sampling design. Here, we do not need to use normalized sampling weights, the function will do it for us.
10.4.1 Define the survey design
design <- edu_school |>
as_survey_design(weights = finalweight_25per)10.4.2 Descriptive table with tbl_svysummary
design |>
mutate(walk = factor(walk)) |>
tbl_svysummary(
include = hhwealthcat,
by = walk,
statistic = all_categorical() ~ "{p}% ({n_unweighted} obs.)",
percent = "row"
) |>
add_overall(last = TRUE) |>
modify_header(
all_stat_cols() ~ "**{level}** ({n_unweighted} obs.)"
)| Characteristic | 0 (2739 obs.)1 | 1 (1188 obs.)1 | Overall (3927 obs.)1 |
|---|---|---|---|
| hhwealthcat | |||
| Poorest | 49% (328 obs.) | 51% (336 obs.) | 100% (664 obs.) |
| Poor | 64% (564 obs.) | 36% (326 obs.) | 100% (890 obs.) |
| Middle | 70% (543 obs.) | 30% (244 obs.) | 100% (787 obs.) |
| Rich | 76% (648 obs.) | 24% (198 obs.) | 100% (846 obs.) |
| Richest | 90% (656 obs.) | 10% (84 obs.) | 100% (740 obs.) |
| 1 | |||
10.4.3 Survey-corrected chi-square test
svychisq(~hhwealthcat + walk,
design = design,
statistic = "Chisq")
Pearson's X^2: Rao & Scott adjustment
data: NextMethod()
X-squared = 308.09, df = 4, p-value < 2.2e-16
chisq.test vs svychisq
svychisq() uses a Rao-Scott correction that adjusts for the survey design (weights, clustering, stratification).
10.4.4 Combine table and test in one step
gtsummary allows us to add the p-value from svychisq directly to the summary table with add_p():
design |>
mutate(walk = factor(walk)) |>
tbl_svysummary(
include = hhwealthcat,
by = walk,
statistic = all_categorical() ~ "{p}% ({n_unweighted} obs.)",
percent = "row",
missing_stat = "{N_miss_unweighted} obs."
) |>
add_overall(last = TRUE) |>
add_p() |>
modify_header(
all_stat_cols() ~ "**{level}** ({n_unweighted} obs.)"
)| Characteristic | 0 (2739 obs.)1 | 1 (1188 obs.)1 | Overall (3927 obs.)1 | p-value2 |
|---|---|---|---|---|
| hhwealthcat | <0.001 | |||
| Poorest | 49% (328 obs.) | 51% (336 obs.) | 100% (664 obs.) | |
| Poor | 64% (564 obs.) | 36% (326 obs.) | 100% (890 obs.) | |
| Middle | 70% (543 obs.) | 30% (244 obs.) | 100% (787 obs.) | |
| Rich | 76% (648 obs.) | 24% (198 obs.) | 100% (846 obs.) | |
| Richest | 90% (656 obs.) | 10% (84 obs.) | 100% (740 obs.) | |
| 1 | ||||
| 2 Pearson’s X^2: Rao & Scott adjustment | ||||
10.5 Testing Several Variables Simultaneously
One of the great advantages of tbl_svysummary is that we can test the association of walk with multiple variables at once, each with its own chi-square test:
design |>
mutate(walk = factor(walk)) |>
tbl_svysummary(
include = c(hhwealthcat, edu_parent, province, sex),
by = walk,
statistic = all_categorical() ~ "{p}% ({n_unweighted} obs.)",
percent = "row",
missing_stat = "{N_miss_unweighted} obs."
) |>
add_overall(last = TRUE) |>
add_p() |>
modify_header(
all_stat_cols() ~ "**{level}** ({n_unweighted} obs.)"
)| Characteristic | 0 (2739 obs.)1 | 1 (1188 obs.)1 | Overall (3927 obs.)1 | p-value2 |
|---|---|---|---|---|
| hhwealthcat | <0.001 | |||
| Poorest | 49% (328 obs.) | 51% (336 obs.) | 100% (664 obs.) | |
| Poor | 64% (564 obs.) | 36% (326 obs.) | 100% (890 obs.) | |
| Middle | 70% (543 obs.) | 30% (244 obs.) | 100% (787 obs.) | |
| Rich | 76% (648 obs.) | 24% (198 obs.) | 100% (846 obs.) | |
| Richest | 90% (656 obs.) | 10% (84 obs.) | 100% (740 obs.) | |
| Educational attainment | <0.001 | |||
| Less than primary | 52% (53 obs.) | 48% (51 obs.) | 100% (104 obs.) | |
| Primary | 53% (202 obs.) | 47% (187 obs.) | 100% (389 obs.) | |
| Junior secondary | 68% (892 obs.) | 32% (422 obs.) | 100% (1,314 obs.) | |
| Senior secondary | 74% (528 obs.) | 26% (197 obs.) | 100% (725 obs.) | |
| Collegiate | 83% (448 obs.) | 17% (97 obs.) | 100% (545 obs.) | |
| Tertiary | 87% (116 obs.) | 13% (18 obs.) | 100% (134 obs.) | |
| Unknown | 500 obs. | 216 obs. | 716 obs. | |
| province | <0.001 | |||
| Western | 79% (662 obs.) | 21% (178 obs.) | 100% (840 obs.) | |
| Central | 60% (306 obs.) | 40% (221 obs.) | 100% (527 obs.) | |
| Southern | 74% (414 obs.) | 26% (136 obs.) | 100% (550 obs.) | |
| Northern | 72% (303 obs.) | 28% (125 obs.) | 100% (428 obs.) | |
| Eastern | 49% (226 obs.) | 51% (235 obs.) | 100% (461 obs.) | |
| North Western | 83% (309 obs.) | 17% (65 obs.) | 100% (374 obs.) | |
| North Central | 82% (184 obs.) | 18% (46 obs.) | 100% (230 obs.) | |
| Uva | 60% (161 obs.) | 40% (99 obs.) | 100% (260 obs.) | |
| Sabaragamuwa | 68% (174 obs.) | 32% (83 obs.) | 100% (257 obs.) | |
| sex | 0.7 | |||
| Male | 71% (1,302 obs.) | 29% (568 obs.) | 100% (1,870 obs.) | |
| Female | 71% (1,437 obs.) | 29% (620 obs.) | 100% (2,057 obs.) | |
| 1 | ||||
| 2 Pearson’s X^2: Rao & Scott adjustment | ||||
This table gives us a clear overview of which variables are significantly associated with walking to school — and serves as a natural bivariate screening before running the multivariate logistic regression of Session 3.
Run the chi-square test for the association between
edu_parentandwalk. Examine the standardized residuals — which parental education categories show the strongest deviation from independence?Is the association between
sexandwalkstatistically significant?Run
tbl_svysummarywithadd_p()for the variableshhwealthcat,edu_parent,province, andsexon the likelihood of going to school by bus. Comment.Bonus: Compute Cramér’s V for the association between
hhwealthcatandwalkusing thercompanionpackage (cramerV(tabfreqwn)). How does the effect size compare to what the p-value alone suggested?