10  Logistic Regression and result scales

In Session 2, we used linear regression to model a continuous outcome (distance to school) and introduced the Linear Probability Model (LPM) as a simple way to handle binary outcomes. In this session, we go further with the proper tool for binary outcomes: logistic regression.

We need to load the packages and data as before, and to create a survey design object:

# Load necessary packages
load.lib <- c("tidyverse", "survey", "srvyr","gtsummary","performance","ggstats","broom.helpers","marginaleffects")
install.lib <- load.lib[!load.lib %in% installed.packages()]
for (lib in install.lib) install.packages(lib, dependencies = TRUE)
sapply(load.lib, require, character.only = TRUE)
      tidyverse          survey           srvyr       gtsummary     performance 
           TRUE            TRUE            TRUE            TRUE            TRUE 
        ggstats   broom.helpers marginaleffects 
           TRUE            TRUE            TRUE 
# Load data
setwd("/home/groups/3genquanti/SoMix/HIES for workshop")
edu <- readRDS("edu.rds")

edu_school <- edu |>
  filter(!is.na(distance))

design_edu <- edu_school |>
  srvyr::as_survey_design(
    weights = finalweight_25per    # The weight variable
  )

We continue with the same binary outcome introduced at the end of Session 2: whether a child walks to school (walk), which we constructed as:

design_edu <- design_edu |>
  mutate(walk = if_else(transport_medium == 1, 1L, 0L))

About 29% of children walk to school (svymean(~walk, design_edu, na.rm = TRUE)), making it a well-suited outcome for logistic regression.
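
For intuition, the weighted share that svymean() returns is simply a weighted average of the 0/1 indicator. A minimal base-R sketch with made-up toy values (not the survey data):

```r
# Hypothetical toy values: a weighted proportion is
# sum(weight * indicator) / sum(weight)
walk_toy <- c(1, 0, 0, 1, 0)   # 1 = walks to school
w_toy    <- c(2, 1, 1, 3, 1)   # survey weights

sum(w_toy * walk_toy) / sum(w_toy)   # weighted share who walk
```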

By the end of this session, you will be able to:

  • explain why logistic regression is preferred to the LPM for a binary outcome;
  • estimate a weighted logistic regression with svyglm() and family = quasibinomial();
  • interpret exponentiated coefficients as odds ratios;
  • compare nested models and assess their goodness of fit;
  • translate results onto the probability scale with marginal predictions.

10.1 Why Logistic Regression?

Recall from Session 2 that the LPM applies linear regression directly to a binary outcome. Its main limitation is that predicted values can fall outside [0, 1] — which is theoretically incoherent for a probability.

Logistic regression solves this by modelling the log-odds of the outcome, using a logistic transformation to keep all predicted probabilities between 0 and 1:

\[\log\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots\]

The trade-off is that coefficients are on the log-odds scale, which is less directly interpretable than the LPM’s percentage point differences — but we will see how to convert them into more intuitive quantities.
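
To see how the logistic transformation bounds probabilities, note that plogis() (the inverse logit) maps any log-odds value into (0, 1), and qlogis() maps a probability back to the log-odds scale. A base-R sketch with arbitrary values:

```r
# plogis() is the inverse logit: exp(z) / (1 + exp(z))
plogis(0)              # log-odds of 0 correspond to a probability of 0.5
plogis(c(-10, 10))     # even extreme log-odds stay strictly inside (0, 1)
qlogis(0.29)           # back-transform: the log-odds of a 29% probability
```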


10.2 Estimating the Model

With weighted survey data, we use svyglm() with family = quasibinomial():

model_logit <- svyglm(
  walk ~ sex + age + edu_parent + hhwealthcat + province,
  design = design_edu,
  family  = quasibinomial()
)

summary(model_logit)

Call:
svyglm(formula = walk ~ sex + age + edu_parent + hhwealthcat + 
    province, design = design_edu, family = quasibinomial())

Survey design:
Called via srvyr

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 0.55354    0.30233   1.831 0.067205 .  
sexFemale                  -0.01417    0.08871  -0.160 0.873056    
age                        -0.03395    0.01165  -2.913 0.003601 ** 
edu_parentPrimary          -0.05279    0.25539  -0.207 0.836241    
edu_parentJunior secondary -0.36598    0.23743  -1.541 0.123313    
edu_parentSenior secondary -0.54647    0.24845  -2.199 0.027915 *  
edu_parentCollegiate       -0.74053    0.26762  -2.767 0.005688 ** 
edu_parentTertiary         -0.88338    0.36467  -2.422 0.015474 *  
hhwealthcatPoor            -0.56425    0.13042  -4.326 1.56e-05 ***
hhwealthcatMiddle          -0.65508    0.13972  -4.689 2.86e-06 ***
hhwealthcatRich            -0.98726    0.14794  -6.674 2.93e-11 ***
hhwealthcatRichest         -1.77988    0.20025  -8.888  < 2e-16 ***
provinceCentral             0.55647    0.15130   3.678 0.000239 ***
provinceSouthern            0.01338    0.15684   0.085 0.932013    
provinceNorthern           -0.26094    0.17227  -1.515 0.129950    
provinceEastern             0.96948    0.15470   6.267 4.18e-10 ***
provinceNorth Western      -0.57639    0.19108  -3.016 0.002577 ** 
provinceNorth Central      -0.43617    0.23087  -1.889 0.058948 .  
provinceUva                 0.24346    0.19318   1.260 0.207657    
provinceSabaragamuwa        0.13064    0.20220   0.646 0.518257    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 0.9868737)

Number of Fisher Scoring iterations: 4

Why quasibinomial and not binomial?

With survey data, observations are not independent. Using quasibinomial() avoids the assumption of binomial variance and produces correct standard errors for weighted data. The point estimates (coefficients) are identical to binomial().
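
The same point can be checked with a plain glm() on simulated data (a sketch, not survey-weighted): the coefficient vectors under binomial() and quasibinomial() are identical; only the dispersion used for standard errors differs.

```r
# Simulated check: identical point estimates, different dispersion handling
set.seed(1)
d <- data.frame(x = rnorm(500))
d$y <- rbinom(500, 1, plogis(-1 + 0.8 * d$x))

fit_b <- glm(y ~ x, data = d, family = binomial())
fit_q <- glm(y ~ x, data = d, family = quasibinomial())

all.equal(coef(fit_b), coef(fit_q))   # TRUE: same coefficients
summary(fit_q)$dispersion             # estimated rather than fixed at 1
```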


10.3 Interpreting Coefficients: Odds Ratios

The raw coefficients are on the log-odds scale, which is not intuitive. We exponentiate them to obtain odds ratios (OR):
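
tbl_regression(exponentiate = TRUE) does this exponentiation for us; doing it by hand on a small simulated glm() makes the operation explicit (a sketch with made-up data, not the survey model):

```r
# Exponentiate log-odds coefficients (and Wald CI bounds) to get odds ratios
set.seed(5)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.5 * d$x))
fit <- glm(y ~ x, data = d, family = binomial())

exp(cbind(OR = coef(fit), confint.default(fit)))   # ORs with 95% CIs
```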

tbl_regression(model_logit, exponentiate = TRUE)
Characteristic                 OR      95% CI        p-value
sex
    Male
    Female                     0.99    0.83, 1.17    0.9
age                            0.97    0.94, 0.99    0.004
Educational attainment
    Less than primary
    Primary                    0.95    0.57, 1.57    0.8
    Junior secondary           0.69    0.44, 1.10    0.12
    Senior secondary           0.58    0.36, 0.94    0.028
    Collegiate                 0.48    0.28, 0.81    0.006
    Tertiary                   0.41    0.20, 0.85    0.015
hhwealthcat
    Poorest
    Poor                       0.57    0.44, 0.73    <0.001
    Middle                     0.52    0.39, 0.68    <0.001
    Rich                       0.37    0.28, 0.50    <0.001
    Richest                    0.17    0.11, 0.25    <0.001
province
    Western
    Central                    1.74    1.30, 2.35    <0.001
    Southern                   1.01    0.75, 1.38    >0.9
    Northern                   0.77    0.55, 1.08    0.13
    Eastern                    2.64    1.95, 3.57    <0.001
    North Western              0.56    0.39, 0.82    0.003
    North Central              0.65    0.41, 1.02    0.059
    Uva                        1.28    0.87, 1.86    0.2
    Sabaragamuwa               1.14    0.77, 1.69    0.5
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

10.3.1 Interpreting an odds ratio

  • OR = 1: no association with the outcome
  • OR > 1: this category is more likely to walk to school
  • OR < 1: this category is less likely to walk to school

For example, an OR of 0.17 for hhwealthcatRichest means children from the richest households have 6 times lower odds of walking to school compared to the poorest (1/0.17 ≈ 6) — consistent with wealthier families being more likely to use motorised transport. We can also say that children from the richest households have 83% lower odds of walking to school compared to the poorest [(1 - 0.17) × 100 = 83%].

Similarly, children from the Central province have 1.74 times higher odds of walking to school than children from the Western province.
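
The arithmetic behind these statements, using the hhwealthcatRichest coefficient from the output above:

```r
# From the log-odds coefficient to the phrasings used in the text
b_richest <- -1.77988        # coefficient from summary(model_logit)
or <- exp(b_richest)

or                 # about 0.17: the odds ratio
1 / or             # about 5.9: "roughly 6 times lower odds"
(1 - or) * 100     # about 83: "83% lower odds"
```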

Odds ratios vs. percentage point differences

Odds ratios describe relative effects and can be counterintuitive, especially when the outcome is common. An OR of 0.5 does not mean the probability is halved — the actual difference in probability depends on the baseline probability. For a more direct interpretation, see the note on marginal effects at the end of this session.

Odds ratios below 1 are less intuitive to grasp. It may be useful to change the reference category of the variable of interest so as to obtain ORs above 1!
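
A quick numeric illustration of why an OR does not translate directly into a probability change: applying the same OR of 0.5 at different baseline probabilities yields different percentage point shifts.

```r
# The same OR = 0.5 shifts the probability by different amounts
or <- 0.5
baseline <- c(0.10, 0.50, 0.90)

odds_new <- baseline / (1 - baseline) * or   # multiply the odds by the OR
p_new <- odds_new / (1 + odds_new)           # convert odds back to probability

round(p_new - baseline, 3)   # percentage-point change depends on the baseline
```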


10.4 Nested Model Comparison

As in Session 2, we compare nested models to assess the contribution of each block of variables:

m1 <- svyglm(walk ~ sex + age, design = design_edu, family = quasibinomial())
m2 <- svyglm(walk ~ sex + age + edu_parent + hhwealthcat, design = design_edu, family = quasibinomial())
m3 <- svyglm(walk ~ sex + age + edu_parent + hhwealthcat + province, design = design_edu, family = quasibinomial())

tbl_merge(
  list(
    tbl_regression(m1, exponentiate = TRUE),
    tbl_regression(m2, exponentiate = TRUE),
    tbl_regression(m3, exponentiate = TRUE)
  ),
  tab_spanner = c("Model 1", "Model 2", "Model 3")
)
Characteristic              Model 1                    Model 2                    Model 3
                       OR    95% CI      p-value  OR    95% CI      p-value  OR    95% CI      p-value
sex
    Male
    Female             0.98  0.84, 1.13  0.7      0.97  0.82, 1.15  0.8      0.99  0.83, 1.17  0.9
age                    0.98  0.96, 1.00  0.035    0.97  0.95, 0.99  0.004    0.97  0.94, 0.99  0.004
Educational attainment
    Less than primary
    Primary                                       1.00  0.62, 1.60  >0.9     0.95  0.57, 1.57  0.8
    Junior secondary                              0.64  0.41, 1.00  0.048    0.69  0.44, 1.10  0.12
    Senior secondary                              0.57  0.36, 0.90  0.015    0.58  0.36, 0.94  0.028
    Collegiate                                    0.47  0.28, 0.77  0.003    0.48  0.28, 0.81  0.006
    Tertiary                                      0.45  0.23, 0.91  0.026    0.41  0.20, 0.85  0.015
hhwealthcat
    Poorest
    Poor                                          0.57  0.44, 0.72  <0.001   0.57  0.44, 0.73  <0.001
    Middle                                        0.48  0.37, 0.63  <0.001   0.52  0.39, 0.68  <0.001
    Rich                                          0.35  0.27, 0.46  <0.001   0.37  0.28, 0.50  <0.001
    Richest                                       0.14  0.10, 0.21  <0.001   0.17  0.11, 0.25  <0.001
province
    Western
    Central                                                                  1.74  1.30, 2.35  <0.001
    Southern                                                                 1.01  0.75, 1.38  >0.9
    Northern                                                                 0.77  0.55, 1.08  0.13
    Eastern                                                                  2.64  1.95, 3.57  <0.001
    North Western                                                            0.56  0.39, 0.82  0.003
    North Central                                                            0.65  0.41, 1.02  0.059
    Uva                                                                      1.28  0.87, 1.86  0.2
    Sabaragamuwa                                                             1.14  0.77, 1.69  0.5
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

10.4.1 Goodness of fit

The results confirm the pattern observed in Session 2 with linear regression. Model 1, with only sex and age, fits the data very poorly (R² ≈ 0.001) — individual characteristics alone tell us almost nothing about who walks to school. The addition of socioeconomic variables in Model 2 produces a substantial improvement (AIC drops from 4725 to 3546, R² rises to 0.08), confirming that household wealth and parental education are the main drivers of transport mode. Adding province in Model 3 brings a further modest gain (AIC = 3447, R² = 0.11), suggesting that geographic context plays an additional but secondary role — consistent with the linear regression findings in Session 2.

Note

Lower AIC indicates better fit, penalising model complexity. The R² reported here is a survey-adapted approximation (correlation between fitted values and observed outcome, squared), comparable to the one used in Session 2. For logistic regression, it should be interpreted cautiously — what matters most for model comparison is the AIC.
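
The approximation described above can be computed by hand: square the correlation between the fitted probabilities and the observed 0/1 outcome. A sketch on simulated data (not the survey models):

```r
# Survey-style R² approximation: cor(fitted, observed)^2
set.seed(2)
d <- data.frame(x = rnorm(300))
d$y <- rbinom(300, 1, plogis(d$x))
fit <- glm(y ~ x, data = d, family = binomial())

r2 <- cor(fitted(fit), d$y)^2
r2   # between 0 and 1; higher means fitted probabilities track the outcome better
```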

library(performance)
compare_performance(m1, m2, m3, metrics = c("AIC", "R2", "R2_adj"))
# Comparison of Model Performance Indices

Name |  Model |  AIC (weights) |    R2 | R2 (adj.)
--------------------------------------------------
m1   | svyglm | 4725.1 (<.001) | 0.001 | 5.462e-04
m2   | svyglm | 3546.0 (<.001) | 0.079 |     0.076
m3   | svyglm | 3447.3 (>.999) | 0.109 |     0.104

10.5 Visualising Results: A Forest Plot

A forest plot displays each odds ratio as a point with its 95% confidence interval; categories whose interval crosses 1 are not statistically significant at the 5% level.

m3 |> 
  ggstats::ggcoef_model(exponentiate = TRUE)


10.6 LPM vs. Logistic Regression: A Comparison

It is instructive to compare the LPM and logistic regression on the same outcome. In many cases, the two approaches give very similar substantive conclusions.

Here, in both cases, household wealth is the strongest predictor of walking to school, with a clear gradient from poorest to richest; parental education shows a significant negative effect only from Senior secondary upward; and province plays an important role, with Eastern province children being much more likely to walk (OR = 2.64, +22 percentage points) and North Western children much less so.

The LPM coefficients offer a more direct interpretation: for instance, children from the richest households walk to school 31 percentage points less often than the poorest. Logistic regression, in turn, guarantees that predicted probabilities stay within [0, 1], which we can verify by computing marginal predictions from the logistic model.
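
The bounding property is easy to demonstrate on simulated data with a strong predictor (a sketch, not the survey models): the LPM's fitted values can escape [0, 1], while the logistic fit's cannot.

```r
# Fitted values: LPM (lm) vs logistic regression (glm)
set.seed(3)
d <- data.frame(x = rnorm(300))
d$y <- rbinom(300, 1, plogis(2 * d$x))

f_lpm <- fitted(lm(y ~ x, data = d))                        # can leave [0, 1]
f_log <- fitted(glm(y ~ x, data = d, family = binomial()))  # strictly inside (0, 1)

range(f_lpm)
range(f_log)
```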

library(marginaleffects)
lpm <- svyglm(walk ~ sex + age + edu_parent + hhwealthcat + province,
              design = design_edu)

tbl_merge(
  list(
    tbl_regression(lpm) |> 
      modify_header(estimate ~ "**Beta**"),
    tbl_regression(m3, exponentiate = TRUE) |> 
      modify_header(estimate ~ "**OR**"),
    tbl_regression(m3,
      tidy_fun = broom.helpers::tidy_marginal_predictions,
      type = "response",
      estimate_fun = scales::label_percent(accuracy = 0.1),
      wts = weights(m3)
    ) |>
      modify_table_body(
        ~ .x |> filter(!grepl("age", variable))
      ) |>
      modify_header(estimate ~ "**Marginal pred**")
  ),
  tab_spanner = c("LPM", "Logistic (OR)", "Logistic (AME)")
)
Characteristic               LPM                          Logistic (OR)              Logistic (AME)
                       Beta   95% CI        p-value  OR    95% CI      p-value  Marg. pred  95% CI          p-value
sex
    Male                                                                        29.0%       26.8%, 31.3%    <0.001
    Female             0.00   -0.03, 0.03   0.9      0.99  0.83, 1.17  0.9      28.8%       26.7%, 30.9%    <0.001
age                   -0.01   -0.01, 0.00   0.004    0.97  0.94, 0.99  0.004
Educational attainment
    Less than primary                                                           37.2%       27.8%, 46.5%    <0.001
    Primary           -0.01   -0.12, 0.10   0.9      0.95  0.57, 1.57  0.8      36.1%       31.1%, 41.1%    <0.001
    Junior secondary  -0.08   -0.19, 0.02   0.12     0.69  0.44, 1.10  0.12     29.9%       27.5%, 32.4%    <0.001
    Senior secondary  -0.12   -0.23, -0.01  0.030    0.58  0.36, 0.94  0.028    26.6%       23.4%, 29.9%    <0.001
    Collegiate        -0.14   -0.25, -0.03  0.010    0.48  0.28, 0.81  0.006    23.3%       19.2%, 27.5%    <0.001
    Tertiary          -0.16   -0.28, -0.04  0.008    0.41  0.20, 0.85  0.015    21.1%       13.0%, 29.2%    <0.001
hhwealthcat
    Poorest                                                                     44.8%       39.9%, 49.7%    <0.001
    Poor              -0.13   -0.19, -0.08  <0.001   0.57  0.44, 0.73  <0.001   32.3%       28.7%, 35.9%    <0.001
    Middle            -0.15   -0.21, -0.09  <0.001   0.52  0.39, 0.68  <0.001   30.5%       26.9%, 34.1%    <0.001
    Rich              -0.22   -0.28, -0.16  <0.001   0.37  0.28, 0.50  <0.001   24.3%       20.8%, 27.7%    <0.001
    Richest           -0.31   -0.37, -0.24  <0.001   0.17  0.11, 0.25  <0.001   13.1%       9.6%, 16.5%     <0.001
province
    Western                                                                     26.4%       22.6%, 30.1%    <0.001
    Central            0.11   0.06, 0.17    <0.001   1.74  1.30, 2.35  <0.001   37.3%       32.9%, 41.8%    <0.001
    Southern           0.00   -0.05, 0.05   >0.9     1.01  0.75, 1.38  >0.9     26.6%       22.5%, 30.7%    <0.001
    Northern          -0.05   -0.11, 0.01   0.085    0.77  0.55, 1.08  0.13     21.9%       17.9%, 26.0%    <0.001
    Eastern            0.22   0.16, 0.28    <0.001   2.64  1.95, 3.57  <0.001   46.4%       41.5%, 51.2%    <0.001
    North Western     -0.09   -0.14, -0.04  <0.001   0.56  0.39, 0.82  0.003    17.3%       13.1%, 21.5%    <0.001
    North Central     -0.07   -0.13, 0.00   0.036    0.65  0.41, 1.02  0.059    19.3%       13.4%, 25.1%    <0.001
    Uva                0.05   -0.03, 0.13   0.2      1.28  0.87, 1.86  0.2      30.9%       24.9%, 37.0%    <0.001
    Sabaragamuwa       0.02   -0.05, 0.10   0.5      1.14  0.77, 1.69  0.5      28.8%       22.5%, 35.0%    <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
# We removed the marginal prediction for age from this table,
# as age is a continuous variable.

When results are consistent across both models, this strengthens confidence in the findings. The LPM coefficients have the advantage of being directly interpretable as percentage point differences from the reference category.


10.7 Going Further: Marginal Effects and Interactions

Marginal predictions

Odds ratios can be difficult to communicate to a non-technical audience. Average marginal predictions convert logistic model results back into probability scale, giving directly interpretable percentage point differences — similar to LPM coefficients but computed correctly from the logistic model. The broom.helpers package provides a simple implementation:

m3 |> 
  tbl_regression(
    tidy_fun = broom.helpers::tidy_marginal_predictions,
    type = "response",
    estimate_fun = scales::label_percent(accuracy = 0.1),
    wts = weights(m3)
  )
Characteristic Average Marginal Predictions 95% CI p-value
sex


    Male 29.0% 26.8%, 31.3% <0.001
    Female 28.8% 26.7%, 30.9% <0.001
age


    5 33.2% 29.9%, 36.5% <0.001
    9 30.6% 28.7%, 32.5% <0.001
    12 28.8% 27.2%, 30.3% <0.001
    15 27.0% 24.9%, 29.0% <0.001
    19 24.7% 21.5%, 27.8% <0.001
Educational attainment


    Less than primary 37.2% 27.8%, 46.5% <0.001
    Primary 36.1% 31.1%, 41.1% <0.001
    Junior secondary 29.9% 27.5%, 32.4% <0.001
    Senior secondary 26.6% 23.4%, 29.9% <0.001
    Collegiate 23.3% 19.2%, 27.5% <0.001
    Tertiary 21.1% 13.0%, 29.2% <0.001
hhwealthcat


    Poorest 44.8% 39.9%, 49.7% <0.001
    Poor 32.3% 28.7%, 35.9% <0.001
    Middle 30.5% 26.9%, 34.1% <0.001
    Rich 24.3% 20.8%, 27.7% <0.001
    Richest 13.1% 9.6%, 16.5% <0.001
province


    Western 26.4% 22.6%, 30.1% <0.001
    Central 37.3% 32.9%, 41.8% <0.001
    Southern 26.6% 22.5%, 30.7% <0.001
    Northern 21.9% 17.9%, 26.0% <0.001
    Eastern 46.4% 41.5%, 51.2% <0.001
    North Western 17.3% 13.1%, 21.5% <0.001
    North Central 19.3% 13.4%, 25.1% <0.001
    Uva 30.9% 24.9%, 37.0% <0.001
    Sabaragamuwa 28.8% 22.5%, 35.0% <0.001
Abbreviation: CI = Confidence Interval

For a full treatment of marginal effects, the marginaleffects package provides a systematic implementation; see its documentation.
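
As a pointer, marginaleffects::avg_comparisons() computes the average difference in predicted probability between categories, i.e. a percentage-point contrast comparable to an LPM coefficient. A self-contained sketch on simulated data (the analogous call on the survey model above would use m3 and the variable of interest):

```r
library(marginaleffects)

# Simulated data: binary outcome, one grouping factor, one covariate
set.seed(4)
d <- data.frame(g = factor(sample(c("A", "B"), 400, replace = TRUE)),
                x = rnorm(400))
d$y <- rbinom(400, 1, plogis(-0.5 + 1 * (d$g == "B") + 0.5 * d$x))
fit <- glm(y ~ g + x, data = d, family = binomial())

# Average difference in predicted probability between g = "B" and g = "A"
res <- avg_comparisons(fit, variables = "g")
res
```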

Interactions

Interaction terms allow you to test whether the effect of one variable differs across subgroups — for example, whether the effect of age on walking to school differs across households of different wealth status. Interactions are easier to interpret on the probability scale, using the LPM or marginal effects/predictions.

Here is a quick example with the logistic model, suggesting that the effect of children’s age on the probability of walking to school matters mostly among the poorest households.

library(marginaleffects)
m4 <- svyglm(walk ~ sex + age + edu_parent + hhwealthcat + province + age:hhwealthcat,
             design = design_edu,
             family = quasibinomial())

plot_predictions(m4, condition = c("age", "hhwealthcat")) + theme_bw()


10.8 Exercise

  1. After discretizing age, and adding sector (rural, urban, estate) as a new independent variable among geographical predictors, re-estimate the LPM and logistic models. Compare the results and interpret.
  2. Children who go to school by bus are coded as 6 in the variable transport_medium. How does the probability of going to school by bus vary according to socioeconomic and geographical markers?
  3. For your own group project, estimate logistic regressions. How do the results confirm/enhance your previous bivariate analyses?