4  A short example with HIES 2019

In this section, we provide an example of statistical analysis using data from the HIES 2019. We focus on a short analysis of school attendance in Sri Lanka, and of the socio-demographic characteristics of individuals who attend school and of those who drop out.

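In this section, you will find an introduction to the data and the recoding of the main variable; analyses of school non-attendance by sex and age, by locality, and by family background; an analysis of the reasons for dropping out; and a discussion of limits, possible qualitative extensions, and a preliminary conclusion.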

This is the kind of work you will need to do on the topic you have chosen. Of course, you will have to adapt the variables used and the kind of analysis (tables, figures, etc.) to your specific research question!

In this tutorial, we focus on the final statistical outputs that we decided to keep. We therefore show neither all the steps we took to explore the database and our variables (for example, univariate statistics) nor the (many!) attempts it took to end up with only 3-4 outputs that we think best summarize the information needed for our research question. In particular, we do not present the unweighted-N frequency tables here, but it is always advisable to run them, as cells with a very small N require more cautious interpretation.
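
As a minimal sketch of such a check, the block below builds an unweighted cross-tabulation on a small made-up data frame and flags the cells with few observations. The toy data and the threshold of 3 are hypothetical choices made only for illustration; with the real data, the equivalent would presumably be to call freqtable without the weights argument.

```r
# Toy data standing in for the survey (hypothetical values)
toy <- data.frame(
  age    = c(18, 18, 18, 18, 19, 19, 19, 19, 19),
  status = c("Attending", "Attending", "Attending", "Past",
             "Attending", "Past", "Past", "Past", "Past")
)

# Unweighted counts: how many respondents sit in each cell?
n_tab <- table(toy$age, toy$status)
n_tab

# Flag cells below an arbitrary threshold (3 here, chosen only for the example)
small_cells <- n_tab < 3
small_cells
```

Percentages computed on the flagged cells rest on very few respondents and should be read with caution.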

Keep in mind that this is only one possible analysis: other outputs focusing on the same research question would have been possible. What matters is that, however you eventually decide to present your results, the key information is clear and accessible (in this example, who attends school and who drops out).

4.1 Introduction

As we explained in previous sections, we first install (and load if necessary) the different packages that we will need to perform our statistical analyses. We then load the edu.rds database that we are going to use for our research question.

#If not installed already, you need to install them using the following commands:
#(remove the # in that case)
#install.packages("tidyverse")
#install.packages("questionr")
#install.packages("Hmisc")
#install.packages("esquisse")
#install.packages("kableExtra")

#When opening a new session you need to load packages:
library(tidyverse)
library(questionr)
library(Hmisc)
library(esquisse)
library(kableExtra)

#Find the path of the folder on your computer where the data are stored
#and set it as the working directory
setwd("/home/groups/3genquanti/SoMix/HIES for workshop")
#Load data
edu<-readRDS("edu.rds")
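
As an optional variant of the commented install.packages lines above, the following sketch lists which of the packages are not installed yet, so that only those need to be installed. The object names pkgs and missing_pkgs are hypothetical; the install line is left commented out on purpose.

```r
# Packages used in this section
pkgs <- c("tidyverse", "questionr", "Hmisc", "esquisse", "kableExtra")

# Which of them are not installed yet?
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
missing_pkgs

# Uncomment to install only the missing ones
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```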

Our main dependent variable is r2_school_education which refers to the third column of the table on “School Education (For persons aged from 5 to 19 years)” in the questionnaire Section 2 (p.324 of the report available on the shared Drive).

It is coded as follows: modality 1 stands for “Currently attending school”, 2 for “Never attended school” and 3 for “Attended school in the past”. Yet, in the edu.rds database, we only have these modalities as numbers (1, 2, or 3) and not what they mean. In order to make our tables and graphs easier to read, we apply the appropriate labels to each value of this variable.

This is what the following script does. Note that you can also use the function irec, introduced in the second chapter, to recode your variables. Besides, it is always good practice to check that your recoding went well by cross-tabulating the initial and the recoded variable, here: edu |> freqtable(r2_school_education,r2_school_education_rec).

#Recode the variable on school attendance
edu$r2_school_education_rec <- edu$r2_school_education  |>
  as.character() |>
  fct_recode(
    "Currently attending" = "1",
    "Never attended" = "2",
    "Attended school in the past" = "3"
  )
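
The check mentioned above can be sketched on a few made-up values. This example uses base table() rather than freqtable so that it only depends on forcats; the toy data frame is hypothetical. If the recoding went well, every original code falls in exactly one labelled column.

```r
library(forcats)

# Toy data standing in for edu (hypothetical values)
toy <- data.frame(r2_school_education = c(1, 1, 2, 3, 3, 1))

# Same recoding as above
toy$r2_school_education_rec <- toy$r2_school_education |>
  as.character() |>
  fct_recode(
    "Currently attending" = "1",
    "Never attended" = "2",
    "Attended school in the past" = "3"
  )

# Cross-tabulate the original codes against the new labels
check <- table(toy$r2_school_education, toy$r2_school_education_rec,
               useNA = "ifany")
check
```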

As in the previous section on bivariate and multivariate statistics, we need to create a function that displays the cross-tabulation of three variables as a single table, with one block of rows per category of the third variable:

# Function ----------------------------------------------------------------

kbl_grouped_3way <- function(tab, digits = 1) {
  
  # Convert to a data frame without dropping anything: NA levels are kept
  df <- as.data.frame(tab, stringsAsFactors = FALSE, drop = FALSE)
  
  # Identify the three dimensions
  dims <- names(df)[names(df) != "Freq"]
  dim1 <- dims[1]   # rows
  dim2 <- dims[2]   # columns
  dim3 <- dims[3]   # groups
  
  # To character
  df[ dims ] <- lapply(df[ dims ], function(x) ifelse(is.na(x), NA, as.character(x)))
  
  # Pivot wider
  df_wide <- df |>
    tidyr::pivot_wider(
      names_from = all_of(dim2),
      values_from = Freq,
      values_fill = 0
    )
  
  # Keep order
  grouping_var <- df_wide[[dim3]]
  
  # Construct kable
  tab_out <- df_wide |>
    dplyr::select(-all_of(dim3)) |>
    kbl(digits = digits)
  
  # Group levels (including NA)
  group_levels <- unique(grouping_var)
  
  # Line index by group
  groups <- split(seq_len(nrow(df_wide)),
                  f = ifelse(is.na(grouping_var), "NA", grouping_var))
  
  # Pack_rows for each group in kable
  for (g in names(groups)) {
    rows <- groups[[g]]
    tab_out <- tab_out |>
      pack_rows(group = g,
                start_row = min(rows),
                end_row = max(rows))
  }
  
  tab_out
}
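
To see what the reshaping step inside this function does, here is a minimal sketch on made-up data. It uses base xtabs in place of freqtable, and the toy data frame and its weight column w are hypothetical. The three-way table is first turned into a long data frame (one row per cell), then pivoted so that the second dimension becomes the columns.

```r
library(tidyr)

# Toy data (hypothetical values and weights)
toy <- data.frame(
  age    = c(16, 16, 17, 17, 16, 17),
  status = c("Attending", "Past", "Attending", "Past", "Attending", "Past"),
  sex    = c("Male", "Male", "Male", "Male", "Female", "Female"),
  w      = c(2, 1, 1, 1, 3, 2)
)

# Three-way weighted table: rows = age, columns = status, groups = sex
tab <- xtabs(w ~ age + status + sex, data = toy)

# Long format: one row per cell, with the cell value in Freq
long <- as.data.frame(tab, stringsAsFactors = FALSE)

# Wide format: one column per status, one row per age within each sex
wide <- pivot_wider(long, names_from = status, values_from = Freq,
                    values_fill = 0)
wide
```

The function above then removes the grouping column and uses pack_rows to print each group of rows under its own label.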

4.2 School non-attendance by sex and age

We first explore how school attendance varies by sex and age. Indeed, there are reasons to think that dropping out is more likely among older students. In Sri Lanka, school is compulsory until age 14, so we can expect attendance to decrease only from this age. Is this what we observe in the data? And are there differences between male and female students?

We use the following code to produce a frequency table (with weighted percentages) showing how the distribution of our dependent variable (school attendance, in columns) varies according to the two independent variables we just mentioned: sex and age.

# Frequency table of school attendance by age and sex ----------------------------------------------------------------

m<-edu |> freqtable(age,r2_school_education_rec,sex,weights=finalweight_25per) |>
  lprop() 

kbl_grouped_3way(m,digits=0)|>
  kable_classic_2(full_width = FALSE)
age     Currently attending  Never attended  Attended school in the past  Total
Male
5                        58              42                            0    100
6                       100               0                            0    100
7                       100               0                            0    100
8                       100               0                            0    100
9                       100               0                            0    100
10                      100               0                            0    100
11                      100               0                            0    100
12                      100               0                            0    100
13                       99               0                            1    100
14                      100               0                            0    100
15                       98               0                            2    100
16                       74               1                           25    100
17                       60               0                           40    100
18                       61               2                           37    100
19                       22               0                           78    100
All                      84               3                           13    100
Female
5                        54              46                            0    100
6                       100               0                            0    100
7                       100               0                            0    100
8                       100               0                            0    100
9                       100               0                            0    100
10                      100               0                            0    100
11                      100               0                            0    100
12                       99               1                            1    100
13                       99               0                            1    100
14                       97               0                            3    100
15                       98               0                            2    100
16                       86               0                           14    100
17                       80               2                           18    100
18                       68               0                           32    100
19                       37               1                           62    100
All                      88               3                            9    100
#Note the digits=0 argument here, which means that percentages are shown without decimals
#We don’t need that level of precision (especially since the data come from a survey and
#have a margin of error), and the extra digits can obscure the message of the table
  • This table shows specific patterns by age.

    • Until age 15, the vast majority of children are currently attending school. This is true for both male and female students. After age 15, however, dropouts increase. Among male respondents aged 15 at the time of the survey, 97.6% were currently attending school. This proportion then shrinks: it is only 74.1% among male respondents aged 16, and 21.9% among those aged 19.

    • Conversely, the percentage of individuals who dropped out of school increases with age: it rises from only 2.4% of 15-year-old men to 77.6% of 19-year-old men.

  • Overall, we observe a similar pattern for women: school attendance rates are very high until age 15, and then consistently decrease with age. Interestingly, we also see some differences by sex. Dropout rates are higher for men than for women: while the dropout rate among male respondents aged 19 is around 78%, it only reaches 62% among female respondents.

  • Finally, note that the proportion of children who never attended school is negligible, except at age 5 (42% of boys and 46% of girls), presumably because many 5-year-olds have not started school yet.
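
To make explicit what a weighted row percentage is, here is a small worked sketch in base R. It does not use freqtable or lprop: xtabs sums the weights in each cell and prop.table turns each row into percentages. The toy data and the weight column w are hypothetical.

```r
# Toy data (hypothetical values and weights)
toy <- data.frame(
  age    = c(16, 16, 16, 19, 19, 19),
  status = c("Attending", "Attending", "Past", "Attending", "Past", "Past"),
  w      = c(2, 1, 1, 1, 2, 2)
)

# Weighted counts: sum of the weights in each age x status cell
w_tab <- xtabs(w ~ age + status, data = toy)

# Row percentages: each row (age) sums to 100
row_pct <- prop.table(w_tab, margin = 1) * 100
round(row_pct)
```

In this toy example, the 16-year-olds carry a total weight of 4, of which 3 is "Attending", hence 75%; each respondent counts in proportion to their weight rather than as one unit.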

4.3 School non-attendance and locality

Are there differences in school attendance by place of residence? We saw in previous sections that distance to school tends to be higher in rural areas. In that context, can we expect school attendance to be higher in urban areas than in rural ones?

To investigate this question, we check whether the rate of school attendance is similar in urban areas, rural areas and estates. As we observed in the previous table, school attendance varies greatly by age, so we take this variable into account when exploring the differences by locality.

For the sake of concision, and to keep the table readable, we need to select the variables we want to study. We have therefore decided not to include the sex variable: while we observed some differences by sex, the main driver of school attendance in the first table was age. Further analyses could later explore school attendance by type of locality and sex. In the same vein, we run the analyses only on individuals aged 14 and above, as we saw in the previous table that school attendance hardly varies among younger respondents: it is best to keep only the important information so as not to overwhelm the reader with too much detail!

# First, recode the sector (residence) variable

edu$sector_rec <- edu$sector |>
  as.character() |>
  fct_recode(
    "Urban" = "1",
    "Rural" = "2",
    "Estate" = "3"
  )


# Second, we keep only individuals in our database aged at least 14--------------------------
edu14<- edu |> 
  filter(age>=14)

# Third, frequency table of school attendance by age and locality--------------------------
m<-edu14 |> freqtable(age,r2_school_education_rec,sector_rec,weights=finalweight_25per) |>
  lprop() 
kbl_grouped_3way(m,digits=0)|>
  kable_classic_2(full_width = FALSE)
age     Currently attending  Never attended  Attended school in the past  Total
Urban
14                       97               0                            3    100
15                       98               0                            2    100
16                       74               0                           26    100
17                       77               0                           23    100
18                       64               3                           33    100
19                       30               0                           70    100
All                      72               0                           28    100
Rural
14                       99               0                            1    100
15                       98               0                            2    100
16                       82               0                           18    100
17                       70               2                           29    100
18                       65               1                           34    100
19                       29               1                           70    100
All                      74               1                           26    100
Estate
14                      100               0                            0    100
15                      100               0                            0    100
16                       79               0                           21    100
17                       58               0                           42    100
18                       57               0                           43    100
19                       36               0                           64    100
All                      76               0                           24    100

Interestingly, there do not appear to be strong differences in school attendance status across residential locality types. If anything, urban children seem slightly more likely to be out of school than rural or estate children. This pattern is somewhat surprising, as we might have expected rural children to be more prone to dropping out due to more limited access to schools at higher grade levels.

4.4 School non-attendance and family background

School attendance is likely to be associated with family’s social background for at least two reasons.

  • First, cultural capital may play a role: parents with higher levels of education are often better equipped to support and value continued schooling.

  • Second, economic factors matter: for families with fewer resources, the opportunity cost of keeping a child in school—rather than contributing to household income or domestic work—may be higher.

Do these hypotheses ring a bell with specific sociological theories? Say, maybe Bourdieu’s theory of cultural reproduction or Boudon’s theory of opportunity (secondary effects)?

We can check whether parental education and household wealth matter.

For parental education, notice that for about 16 percent of children parental education is unknown (in fact, these children are not the children of the household head so in the survey we could not match them to any parent).
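
A share such as this one can be computed as the sum of the weights of the cases with a missing value divided by the total sum of weights. The sketch below illustrates the calculation on made-up values; the toy data frame and its weight column w are hypothetical and do not reproduce the 16 percent figure.

```r
# Toy data (hypothetical values and weights)
toy <- data.frame(
  edu_parent = c("Primary", NA, "Tertiary", "Primary", NA),
  w          = c(2, 1, 3, 3, 1)
)

# Weighted share (%) of cases with unknown parental education
share_unknown <- 100 * sum(toy$w[is.na(toy$edu_parent)]) / sum(toy$w)
share_unknown
```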

Let’s examine how school dropout varies with parental education at each age:

m<-edu14 |> freqtable(edu_parent,r2_school_education_rec,age,weights=finalweight_25per) |>
  lprop() 
kbl_grouped_3way(m,digits=0)|>
  kable_classic_2(full_width = FALSE)
edu_parent         Currently attending  Never attended  Attended school in the past  Total
14
Less than primary                   91               0                            9    100
Primary                             99               1                            0    100
Junior secondary                    99               0                            1    100
Senior secondary                   100               0                            0    100
Collegiate                         100               0                            0    100
Tertiary                           100               0                            0    100
NA                                  95               0                            5    100
All                                 98               0                            2    100
15
Less than primary                   82               0                           18    100
Primary                             96               0                            4    100
Junior secondary                    99               0                            1    100
Senior secondary                   100               0                            0    100
Collegiate                         100               0                            0    100
Tertiary                           100               0                            0    100
NA                                  98               0                            2    100
All                                 98               0                            2    100
16
Less than primary                   66               0                           34    100
Primary                             77               0                           23    100
Junior secondary                    83               0                           17    100
Senior secondary                    89               0                           11    100
Collegiate                          70               3                           28    100
Tertiary                            95               0                            5    100
NA                                  79               0                           21    100
All                                 80               0                           19    100
17
Less than primary                   52               8                           40    100
Primary                             64               3                           33    100
Junior secondary                    66               1                           33    100
Senior secondary                    88               1                           11    100
Collegiate                          88               0                           12    100
Tertiary                           100               0                            0    100
NA                                  45               0                           55    100
All                                 71               1                           28    100
18
Less than primary                   17               0                           83    100
Primary                             40               0                           60    100
Junior secondary                    59               1                           41    100
Senior secondary                    93               0                            7    100
Collegiate                          88               3                            9    100
Tertiary                           100               0                            0    100
NA                                  58               3                           39    100
All                                 64               1                           34    100
19
Less than primary                   10               0                           90    100
Primary                             25               0                           75    100
Junior secondary                    24               1                           75    100
Senior secondary                    31               0                           69    100
Collegiate                          41               1                           58    100
Tertiary                            60               0                           40    100
NA                                  27               0                           73    100
All                                 29               1                           70    100
  • Below age 16, the share of children not attending school is much higher among those whose parents have less than a primary education; for children of more educated parents, non-attendance is virtually nonexistent.

  • From age 16 onward, the overall pattern follows a clear gradient: children with lower-educated parents are more likely to leave school, while those with highly educated parents are more likely to remain enrolled.

  • It is also noteworthy that children whose parents’ education is unknown show above-average levels of non-attendance. This group may have a distinct background—for instance, some may not be the household head’s own children and might be growing up in large joint families or without their parents.

We now run the same analysis by household wealth category:

m<-edu14 |> freqtable(hhwealthcat,r2_school_education_rec,age,weights=finalweight_25per) |>
  lprop() 
kbl_grouped_3way(m,digits=0)|>
  kable_classic_2(full_width = FALSE)
hhwealthcat  Currently attending  Never attended  Attended school in the past  Total
14
Poorest                       95               1                            4    100
Poor                         100               0                            0    100
Middle                        95               0                            5    100
Rich                         100               0                            0    100
Richest                      100               0                            0    100
All                           98               0                            2    100
15
Poorest                       96               0                            4    100
Poor                         100               0                            0    100
Middle                        93               0                            7    100
Rich                         100               0                            0    100
Richest                      100               0                            0    100
All                           98               0                            2    100
16
Poorest                       80               0                           20    100
Poor                          84               0                           16    100
Middle                        75               1                           23    100
Rich                          86               0                           14    100
Richest                       77               0                           23    100
All                           80               0                           19    100
17
Poorest                       50               6                           44    100
Poor                          64               1                           35    100
Middle                        71               0                           29    100
Rich                          81               1                           19    100
Richest                       83               0                           17    100
All                           71               1                           28    100
18
Poorest                       47               0                           53    100
Poor                          63               2                           35    100
Middle                        61               1                           38    100
Rich                          72               2                           27    100
Richest                       80               0                           20    100
All                           64               1                           34    100
19
Poorest                       30               0                           70    100
Poor                          25               1                           74    100
Middle                        26               0                           74    100
Rich                          27               1                           71    100
Richest                       39               0                           61    100
All                           29               1                           70    100

Wealth differences become more pronounced in late adolescence:

  • At ages 14–15, past attendance remains very low across all categories, indicating that early dropout is rare regardless of household wealth.

  • From age 16 onward, however, the proportion of children who have left school rises steadily, especially among poorer households.

  • By ages 17–18, children in the poorest group show the highest levels of past attendance (44–53%), while those in the richest households are much less likely to have exited school.

  • At age 19, past attendance becomes the majority status for all groups, though it remains somewhat lower among the richest, highlighting that wealth increasingly shapes school persistence as children grow older.
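
The widening and then narrowing of the gap can be made explicit with a little arithmetic. The sketch below copies the "Attended school in the past" percentages of the poorest and richest groups from the wealth table above and computes the difference in percentage points at each age; the object and column names are hypothetical.

```r
# "Attended school in the past" (%), copied from the wealth table above
past <- data.frame(
  age     = 14:19,
  poorest = c(4, 4, 20, 44, 53, 70),
  richest = c(0, 0, 23, 17, 20, 61)
)

# Gap in percentage points between the poorest and the richest group
past$gap <- past$poorest - past$richest
past
```

The gap is small until age 16, reaches 27 and 33 points at ages 17 and 18, and falls back to 9 points at age 19, when most young people in every group have left school.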

4.5 Reasons for dropping out

We can dig a bit deeper into the students who attended school in the past by checking their reasons for leaving school. Again, we need to recode the categories of the relevant variable, called reason_leave_school.

edu14notcurrently<-edu14 |> filter(r2_school_education_rec=="Attended school in the past")
edu14notcurrently$reason_leave_school_rec <- edu14notcurrently$reason_leave_school  |>
  as.character() |>
  fct_recode(
    "Further schooling not available or too far away" = "1",
    "Financial problems" = "2",
    "House keeping / Family business" = "3",
    "Disability" = "4",
    "Illness" = "5",
    "Not willing / poor academic progress" = "6",
    "Pending results (GCE)" = "7",
    "Complete GCE / Grade 13" = "8",
    "Engaged in an economic activity" = "9",
    "Other" = "99"
  )

edu14notcurrently |> freqtable(reason_leave_school_rec,weights=finalweight_25per) |>
  freq() |> select(-n) |>
  kbl(digits=0) |> kable_classic(full_width = F)
                                                   %  val%
Further schooling not available or too far away    1     1
Financial problems                                 7     7
House keeping / Family business                    4     4
Disability                                         1     1
Illness                                            1     1
Not willing / poor academic progress              37    37
Pending results (GCE)                             14    14
Complete GCE / Grade 13                           14    14
Engaged in an economic activity                   13    13
Other                                              7     7

This variable is rather subjective, and its categories may overlap with one another. Ideally, we would also have preferred to rely on more objective indicators of school achievement. Despite the presence of ten distinct response options, the reasons for leaving school can be grouped into a few broader blocks.

  • Opportunity-cost reasons include financial problems (7%), household or family business responsibilities (4%), and engagement in economic activity (13%).

  • Academic-related reasons are dominated by ‘not willing / poor academic progress,’ which is by far the most frequently cited reason (37%).

  • Institutional or availability issues cover further schooling being unavailable or too far away (only 1%) and waiting for GCE results before continuing (14%).

  • Personal health issues, including disability (1%) and illness (1%), account for a small share of the cited reasons.

  • The Other category represents 7% of the children who left school; unfortunately, we have no further information on what it covers.

  • Having completed GCE / Grade 13 (14%) should be kept in mind: these children have already completed secondary education, and this section of the survey does not record whether they go on to tertiary education, so they should be left out of further analyses.

Let us recode the reasons for leaving school into these broader categories (noting that this recoding may be open to criticism or revision) and filter out the children having completed secondary school in order to streamline the analysis.

edu14nocurnocomp <-edu14notcurrently |> filter(reason_leave_school_rec!="Complete GCE / Grade 13")
  
edu14nocurnocomp$reason_leave_streamlined <- edu14nocurnocomp$reason_leave_school_rec |>
  fct_recode(
    "Institutional / Availability" = "Further schooling not available or too far away",
    "Opportunity-cost" = "Financial problems",
    "Opportunity-cost" = "House keeping / Family business",
    "Personal health issues" = "Disability",
    "Personal health issues" = "Illness",
    "Academic achievement-related" = "Not willing / poor academic progress",
    "Institutional / Availability" = "Pending results (GCE)",
    "Opportunity-cost" = "Engaged in an economic activity"
  )
#Remove unused levels (Complete GCE / Grade 13)
edu14nocurnocomp$reason_leave_streamlined <- droplevels(edu14nocurnocomp$reason_leave_streamlined)

edu14nocurnocomp |> freqtable(reason_leave_streamlined,weights=finalweight_25per) |>
  freq() |> select(-n) |>
  kbl(digits=0) |> kable_classic(full_width = F)
                                %  val%
Institutional / Availability   18    18
Opportunity-cost               29    29
Personal health issues          2     2
Academic achievement-related   44    44
Other                           8     8
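
As an alternative way of writing the same grouping, forcats also provides fct_collapse, which takes one vector of old levels per new level. This is a different function from the fct_recode call used above, shown here only as a sketch on a made-up factor (the reasons vector is hypothetical); levels that are not mentioned, such as Other, are left unchanged.

```r
library(forcats)

# Toy factor with one case per reason (hypothetical data)
reasons <- factor(c(
  "Financial problems", "Illness", "Pending results (GCE)",
  "Not willing / poor academic progress", "Engaged in an economic activity",
  "Disability", "Other", "House keeping / Family business",
  "Further schooling not available or too far away"
))

# One vector of old levels per new, broader category
streamlined <- fct_collapse(
  reasons,
  "Institutional / Availability" = c(
    "Further schooling not available or too far away",
    "Pending results (GCE)"
  ),
  "Opportunity-cost" = c(
    "Financial problems",
    "House keeping / Family business",
    "Engaged in an economic activity"
  ),
  "Personal health issues" = c("Disability", "Illness"),
  "Academic achievement-related" = "Not willing / poor academic progress"
)

table(streamlined)
```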

With this streamlined variable, we then investigate how the reasons for leaving school vary according to parental education and household wealth:

edu14nocurnocomp |> freqtable(edu_parent,reason_leave_streamlined,weights=finalweight_25per) |>
  rprop() |> 
  kbl(digits=0) |> kable_classic(full_width = F)
                   Institutional / Availability  Opportunity-cost  Personal health issues  Academic achievement-related  Other  Total
Less than primary                             9                37                       0                            54      0    100
Primary                                      13                26                       1                            51      9    100
Junior secondary                             13                33                       2                            47      5    100
Senior secondary                             39                34                       3                            20      4    100
Collegiate                                   35                15                       3                            24     24    100
Tertiary                                      0                 0                       0                             0    100    100
NA                                           16                24                       3                            46     11    100
All                                          18                29                       2                            44      8    100
edu14nocurnocomp |> freqtable(hhwealthcat,reason_leave_streamlined,weights=finalweight_25per) |>
  rprop() |> 
  kbl(digits=0) |> kable_classic(full_width = F)
         Institutional / Availability  Opportunity-cost  Personal health issues  Academic achievement-related  Other  Total
Poorest                            11                39                       1                            47      2    100
Poor                               13                28                       2                            50      8    100
Middle                             18                26                       2                            44     10    100
Rich                               20                20                       2                            45     14    100
Richest                            33                30                       2                            26      9    100
All                                18                29                       2                            44      8    100

Taken together, the two tables show that both parental education and household wealth shape the reasons why children leave school, but in somewhat different ways.

  • Among children from less educated or poorer households, school exit is most often linked to academic-achievement–related issues and opportunity-cost pressures, suggesting that learning difficulties and the need to contribute economically remain key constraints. As parental education and wealth increase, however, academic and economic pressures diminish.

    • E.g., more than half (54%) of the children whose parents have less than primary education cite academic-achievement-related issues, compared with only 24% of children with collegiate-educated parents.

    • E.g., about 40% of the poorest children cite opportunity-cost pressures, compared with 30% of the richest children.

  • As parental education and wealth increase, institutional or availability factors—such as limited options for further schooling—become more prominent. At the top of the socioeconomic spectrum, these structural constraints partly replace academic or financial motivations as the main reasons for leaving school.

    • E.g., 33% of the richest children cite this reason, but only 11% of the poorest children do.

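Notice that, once the children who completed GCE / Grade 13 (i.e. those who have completed secondary school) are filtered out, the only out-of-school children with tertiary-educated parents left in the sample all report an “Other” reason (see the Tertiary row of the parental education table above). In other words, none of them report leaving school for academic, economic, institutional or health reasons, which reinforces the association between parents’ education and children’s dropout.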

4.6 Limits (and possible ways to solve them)

Our analysis has some limitations (or observations to bear in mind).

  • We adopted an age-cohort approach, implicitly assuming that children of different ages (e.g., 14 vs. 15) can be compared directly. This could imply, for example, that children who dropped out at age 15 are assumed to have done so in that year. But it would not be accurate to think that way—children aged 17, for instance, may have left school several years earlier. This issue could be explored further using the variable when_stop_schooling, which indicates the year in which each child dropped out.

  • We also did not consider another issue beyond dropping out: grade for age. Even if children are attending school, they may be behind for their age, because they started school later than their peers or because they repeated a grade. The variables grade_this_year and grade_last_year may be of interest to examine this.
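
As a rough sketch of how the first limitation could be explored, the block below derives an approximate age at dropout from the age at the survey and the year of dropout. It assumes that when_stop_schooling is stored as a four-digit calendar year and that 2019 can be used as the survey year; both are assumptions to check against the questionnaire, and the toy values are made up. The result is only approximate, since the month of birth and the month of dropout are ignored.

```r
# Assumed survey year (HIES 2019)
survey_year <- 2019

# Toy data (hypothetical values)
toy <- data.frame(
  age                 = c(17, 19, 16),
  when_stop_schooling = c(2016, 2019, 2018)
)

# Approximate age at dropout = age at survey minus years elapsed since dropout
toy$age_at_dropout <- toy$age - (survey_year - toy$when_stop_schooling)
toy
```

In this toy example, the respondent aged 17 at the survey left school around age 14, which is exactly the kind of case that a simple comparison of ages at the time of the survey misses.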

4.7 Possible qualitative extensions

Qualitative field material could enrich our understanding of school attendance and dropout. For example, interviews with students, parents, and teachers could provide insights into:

  • The reasons why children leave school that are not captured by survey categories (and whether they sometimes overlap);

  • How parents’ expectations shape educational decisions (to drop out);

  • How parents and children perceive the value of education (and whether it is worth going to school if it is not compulsory);

  • Localized barriers to school attendance (even if living in a rural area does not seem to be associated with more dropouts, are there specific areas with fewer schools offering higher grades?).

4.8 Preliminary conclusion

Overall, school attendance in Sri Lanka is high at younger ages but declines steadily from age 16, with dropouts more common among children from poorer households or with less-educated parents. Academic challenges and opportunity costs are the main reasons for leaving school among disadvantaged children, while institutional constraints become more important for children from wealthier or more educated families.

For a report:

  • This analysis should be extended with further academic literature on school dropout; for Sri Lanka, see for instance Lindberg, 2010 or Arunatilake, 2006 (these are now a bit old, and you should also search for more recent articles).

  • We have not included graphics here, but they are always a useful way to convey information to an audience. You can use esquisse, as introduced in previous chapters, or export the tables you want to visualize into your preferred spreadsheet software to create graphics when finalizing your analysis.

  • In a final report, all tables and graphs must have a title, as well as some explanation of the data used (HIES 2019 here) and of the restrictions applied to the analytical sample (for example, for the later tables, individuals aged between 14 and 19, etc.).
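
As an illustration of the point on graphics, here is a minimal ggplot2 sketch (ggplot2 is loaded with tidyverse). It plots the "Currently attending" percentages for ages 14 to 19, copied by hand from the table by sex and age in section 4.2; the object names att and p are hypothetical, and in practice you would build the data frame from your own table rather than retype it.

```r
library(ggplot2)

# "Currently attending" (%) by age and sex, copied from the table in section 4.2
att <- data.frame(
  age = rep(14:19, times = 2),
  sex = rep(c("Male", "Female"), each = 6),
  pct = c(100, 98, 74, 60, 61, 22,
           97, 98, 86, 80, 68, 37)
)

# One line per sex, with a title and labelled axes
p <- ggplot(att, aes(x = age, y = pct, colour = sex)) +
  geom_line() +
  geom_point() +
  labs(
    title = "School attendance by age and sex (HIES 2019, weighted %)",
    x = "Age",
    y = "Currently attending school (%)",
    colour = "Sex"
  )
p
```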