The required R packages are explicitly listed here.
library(raster)
library(forcats)
library(ggmap)
library(lme4)
library(nlme)
library(pander)
library(haven)
library(knitr)
library(fields)
library(rgdal)
library(sf)
library(randomForest)
library(tidyverse)
To meet the high demand on the geographic dis-aggregation of key indicators, UNFPA Small Area Estimation (SAE) work has initially focused on the combination of census data and survey data. But to be rigorously valid, one necessary condition is that both the survey and the census should be carried out at the same moment so that the distributions of the variables from the survey converge toward their census counterpart. But in the real world, the census is generally “old” and the survey data contemporaneous. In this more frequent situation, the method suffers from different biases that have been pointed out by numerous authors (Tarozzi & Deaton, 2009; Molina & Rao 2010; Nguyen, 2012). So a new SAE approach needs to be developed for the countries without recent census data available. This document describes a method that fills this gap by relying only on the survey Data and borrowing strength from the spatial correlation within the sample.
This document describes the detailed steps for unit level model small area estimation. Each step is accompanied with the associated R code, followed by the detailed description for the code. The Ghana DHS data is taken as the example.
unsae
packageModeling requires an R package called unsae
available on here.
The development of unsae
package has been complete,
yet documentation may not be fully provided.
To install unsae
, run the following code:
devtools::install_github("ydhwang/unsae")
Then it can be loaded by running:
library(unsae)
could not find function
error messages or something
similar, it means the package may have been updated. So you need to run
the devtools::install_github("ydhwang/unsae")
again../data/
(locations are all relative).file.choose()
and locate the
file manually.../data/2011 DHS/
.GHHR72FL.SAV
: individual dataGHIR72FL.SAV
: Household datahaven::read_sav
function to read the data files
stored in SAV
format.SAV
file to “flattened”
data table form can be removed by choosing the variables that don’t have
“$” in their names.$
string and excludes them from the data set.Multiple occurrence variables are placed one after the other by occurrence; all variables for occurrence 1 precede all variables for occurrence 2 and so on; multiple occurrence variables have a numeric sub- index that indicate the occurrence number.
V003
is the Respondent’s line number in the household
schedule, so removed from the analysis.indiv_ini <- read_sav("sae-geospatial/data/2011 DHS/GHIR72FL.SAV")
indiv <- indiv_ini %>% select(!matches("[\\$]")) %>%
select(-V000) %>% # same country
select(-V003)
indiv <- indiv %>% select_if(function(x) mean(is.na(x)) < 0.3)
info <- lapply(indiv, function(x) attr(x, "label"))
ind_var_tbl <- tibble(COL = names(info), DESC = as.character(info))
ind_var_tbl %>% head %>% pander
COL | DESC |
---|---|
CASEID | Case Identification |
V001 | Cluster number |
V002 | Household number |
V004 | Ultimate area unit |
V005 | Women’s individual sample weight (6 decimals) |
V006 | Month of interview |
We first conduct the data cleaning for household data in a similar manner.
household_ini <- read_sav("sae-geospatial/data/2011 DHS/GHHR72FL.SAV")
household <- household_ini %>% select(!matches("[\\$]"))
household <- household %>%
rename(V000 = HV000, V001 = HV001, V002 = HV002, V003 = HV003, V004 = HV004)
The matching key information can be found in HV000
,
HV001
, HV0002
, HV003
,
HV004
. They are renamed to be used for joining with the
individual data table.
household <- household %>% select_if(function(x) mean(is.na(x)) < 0.3) %>% select(-V000) %>%
select(-V003) # same country
Same missing rules are applied to select the relevant variables. We can see that 3028 columns are reduced to 142.
Similarly to the individual data, only first few are presented here.
info <- lapply(household, function(x) attr(x, "label"))
house_var_tbl <- tibble(COL = names(info), DESC = as.character(info))
house_var_tbl %>% head %>% pander
COL | DESC |
---|---|
HHID | Case Identification |
V001 | Cluster number |
V002 | Household number |
V004 | Ultimate area unit |
HV005 | Household sample weight (6 decimals) |
HV006 | Month of interview |
dhs_expanded
.V000
, V001
, V0002
,
V003
, V004
as key variables.dhs_expanded <- left_join(indiv, household, by = c("V001", "V002", "V004")) %>%
arrange(V001)
HV004
: Ultimate area unit is a number assigned to each sample point to identify the ultimate area units used in the collection of data. This variable is usually the same as the cluster number, but may be a sequentially numbered variable for samples with a more complicated structure.
V218
Number of living children; convert it into three
categories and rename it as V218R
V013
: Age in 5-year groups; convert it into three
categories and rename it as V013R
V218R
0 1-2 3-4 5 and more
0 1 3 5
V013R
15-24 25-34 35-49
1 2 3
The age group variables are re-coded into a fewer categories.
dhs_expanded <- dhs_expanded %>% mutate(V013R =
ifelse(V013 %in% 1:2, 1,
ifelse(V013 %in% 3:4, 2,
ifelse(V013 > 4, 3, V013))))
dhs_expanded <- dhs_expanded %>%
mutate(V218R =
ifelse(V218 == 0, 0, ifelse(V218 == 1|V218 == 2, 1,
ifelse(V218 ==3|V218 ==4, 3,
ifelse(V218 >=5, 5, V218)))))
dhs_expanded <- dhs_expanded %>% select_if(function(x) mean(is.na(x)) < 0.9) # remove if more than 10% missing
Additional steps followed to remove columns with too much missing info.
V313
:
combine 1, 2, 3 as “Yes”; 0 as “No.V313
: combine 0, 1, 2, 3 as “No”; 3 as “Yes”.V626
:
combining 1, 2 as “Yes” (unmet need for family planning); combining all
other answers as “No”.dhs_reduced <-
dhs_expanded %>%
select(V013R, V106, V218R, V025, V120, V127, V128, V128, V129, V113,
V001, V005, # survey info part
V119,
HV206,
V130,
V116,
V151,
V150,
V626, V313) %>%
mutate(CPR = ifelse(V313 == 0, 0, 1)) %>%
mutate(CPRm = ifelse(V313 == 3, "Yes", "No")) %>%
mutate(UNR = ifelse(V626 == 1|V626 == 2, "Yes", "No")) %>%
select(-V313, -V626)
dhs_reduced <- dhs_reduced %>%
mutate_all(as_factor) %>%
mutate(V218R = ifelse(V218R == "5 and more"|V218R == "3-4",
"3 or more", V218R))
attr(dhs_reduced$V218R, "label") <- attr(dhs_expanded$V218, "label")
V013R, V106, V218R, V025, V120, V127, V128, V128, V129, V113, V119, HV206, V130, V116, V151, V150
are used as the candidates, and choose the seemingly important six
variables among them.temp_set <- dhs_reduced %>% na.omit
temp_rf <- randomForest(CPR ~ V013R + V106 + V218R + V025 + V120 + V127 + V128 + V128 + V129 + V113 + V119 +
HV206 + V130 + V116 + V151 + V150, data = temp_set)
imp_tbl <- tibble(imp = as.numeric(importance(temp_rf)), key = rownames(importance(temp_rf))) %>% mutate(q = rank(-imp))
selected_06 <- imp_tbl %>% filter(q <= 6) %>% arrange(-imp)
info_v <- lapply(temp_set, function(x) attr(x, "label")) %>% bind_rows %>% gather
left_join(selected_06, info_v, by = "key") %>% select(-imp) %>% pander()
key | q | value |
---|---|---|
V113 | 1 | Source of drinking water |
V130 | 2 | Religion |
V128 | 3 | Main wall material |
V116 | 4 | Type of toilet facility |
V127 | 5 | Main floor material |
V218R | 6 | Number of living children |
fct_lump_prop
combines levels that appear less than 10
% and create a new category named “Other”.target <- selected_06$key
reduced_tbl <-
temp_set %>%
mutate_if(function(x) n_distinct(x) < 20, as.factor) %>%
mutate_at(all_of(target), function(x) fct_lump_prop(x, 0.1, other_level = "Other"))
reduced_tbl
now has columns with fewer categories after
lumping levels with less than 10% proportion.temp_set %>% group_by(V130) %>% tally %>% pander
V130 | n |
---|---|
Catholic | 1341 |
Anglican | 72 |
Methodist | 547 |
Presbyterian | 513 |
Pentecostal/charismatic | 3456 |
Other Christian | 1239 |
Islam | 1726 |
Traditional/spiritualist | 226 |
No religion | 273 |
Other | 1 |
reduced_tbl %>% group_by(V130) %>% tally %>% pander
V130 | n |
---|---|
Catholic | 1341 |
Pentecostal/charismatic | 3456 |
Other Christian | 1239 |
Islam | 1726 |
Other | 1632 |
# automation of this needs a carefully examined set of the variables
fe_form <- paste(selected_06$key, collapse = " + ")
full_form <- as.formula(paste("CPR ~ ", fe_form, "+ (1|V001)"))
# full model (with all data)
model_out <- glmer(formula = full_form, data = reduced_tbl, family = binomial(link = "logit"))
CPR
as our
target indicator, and the survey cluster as a random effect.full_form
(“full” formula including fixed effects and
random effect).print(full_form)
## CPR ~ V113 + V130 + V128 + V116 + V127 + V218R + (1 | V001)
glmer
, generalized linear
mixed effects model. Users can use this in the later part as a
reference.country_shp <- st_read("sae-geospatial/data/2011 DHS/Spatial/GHGE71FL/GHGE71FL.shp", quiet = TRUE)
## handling missing coord
mis_mark <- country_shp %>% filter((LATNUM == 0 & LONGNUM ==0))
country_shp <- country_shp %>% filter(!(LATNUM == 0 & LONGNUM ==0)) %>% filter(!is.na(LATNUM))
bounding_box <- st_bbox(country_shp) %>% as.numeric # getting the bounding box
country_map <- get_stamenmap(bbox = bounding_box, messaging = FALSE, zoom = 8,
maptype = "toner-lite", format = c("png"))
ggmap(country_map, extent = "device")
country_map
object will be used as a base layer for
plotting in the remaining part.In this section, we will describe the detail of the model fitting process.
reduced_tbl
, DHSCLUST information was stored as
V001
(see DHS
webpage.)reduced_tbl <- reduced_tbl %>%
rename(DHSCLUST = V001) %>%
mutate(DHSCLUST = as.integer(DHSCLUST))
cluster_result <- reduced_tbl %>%
group_by(DHSCLUST) %>%
summarize(Prop_raw = mean(CPRm == "Yes"))
To see the first few clusters,
DHSCLUST | Prop_raw |
---|---|
1 | 0.03704 |
2 | 0.1429 |
3 | 0 |
4 | 0 |
5 | 0.1538 |
6 | 0.2 |
coord_info
of the
coordinate information associated with the DHS data.valid_set <- anti_join(reduced_tbl, mis_mark, by = "DHSCLUST")
coord_info <- tibble(DHSCLUST = country_shp$DHSCLUST, lon = country_shp$LONGNUM, lat = country_shp$LATNUM)
coord_info
is a tibble with three columns –
DHSCLUST
, lon
, lat
. This tibble
will be used to join the spatial coordinates to the survey data.valid_set_with_sp <- left_join(valid_set, coord_info, by = "DHSCLUST")
valid_set_with_sp <- valid_set_with_sp %>%
mutate(DHSCLUST = as.factor(DHSCLUST))
valid_set_with_sp
now has spatial coordinates.mod_lme
.mod_lme <-
glmer(CPR ~
V218R + V013R + (1 | DHSCLUST),
family = binomial,
data = valid_set_with_sp)
multilevel_EM
, which gives a spatial
model.spatial_model_fit <-
multilevel_EM(formula = "CPR ~ V218R + V013R + (1 | DHSCLUST)",
data = valid_set_with_sp, coordinates = c("lon", "lat"))
## ...EM iteration: 1
## ...EM iteration: 2
## ...EM iteration: 3
formula
: (string) formula that follows
glmer
syntax. It only takes one random effects
(cluster).data
: name of the data set.coordinates
: the name of coordinates within the data.
These columns should exist within the data set provided.multilevel_EM
mod_lme
from glmer
,
multilevel_EM
uses a spatial model for \(\mathbf{b}\); specifically, \[
\mathbf{b} \sim N(\mathbf{0}, \mathbf{C}),
\] whereThe code below does the following.
grid_coord
, with the coordinates assigned to
each grid points.country_border <- st_read("sae-geospatial/data/shps/sdr_subnational_boundaries.shp", quiet = TRUE)
ggplot(country_border) + geom_sf()
tibble
so that we can
use it with spatial_model_fit
. The tibble is named
grid_coord
.lon
and lat
as the
coordinates’ name, we need to rename X
and Y
to lon
and lat
.grid <- country_border %>% st_make_grid(cellsize = 0.1, what = "centers") %>% st_intersection(country_border)
ggplot() + geom_sf(data = grid)
grid_coord <- st_coordinates(grid) %>% as_tibble() %>%
rename(lon = X, lat = Y)
Once grid_coord
is available, the code below chooses
the nearest 10% of the (survey) clusters for each grid, sample 40% of
them, and make a prediction for each grid point using unit level model
by treating the sampled survey as observed data.
We want to get a set of predictions for each grid in
grid_coord
, so for
loop runs over
grid_coord
.
All clusters in DHS data are considered for each grid point.
Once the loop is complete, it combines into a single
tibble
and then indexed for every grid point – named
grid_tbl
.
cluster_coord <- coord_info %>% dplyr::select(lon, lat) %>% na.omit
covariates_list <- list()
for (c_i in 1:nrow(grid_coord)){
cluster_dist_temp <- coord_info
cluster_dist_temp$p_rank <- rdist(cluster_coord, grid_coord[c_i,]) %>% percent_rank()
chosen_coord <- cluster_dist_temp %>% filter(p_rank <= 0.1)
chosen_tbl <- valid_set_with_sp %>% filter(DHSCLUST %in% chosen_coord$DHSCLUST)
covariates_list[[c_i]] <- chosen_tbl %>% sample_frac(0.4) %>%
select(V013R, V218R, V119, HV206)
}
grid_coord$grid_index <- 1:nrow(grid_coord) %>% as.character()
grid_samples <- bind_rows(covariates_list, .id = "grid_index")
grid_tbl <- left_join(grid_samples, grid_coord, by = "grid_index")
grid_tbl
can be used to create a grid map.predict_EM
function is used.predict_EM
is a wrapper taking two inputs: name of the
spatial model obtained from and multilevel_EM
and a new
data set.multilevel_EM
, as well as two coordinates (in this example,
lon
and lat
).# this is grid level aggregation
# predict function
g_index <- grid_tbl$grid_index %>% unique
grid_tbl$yhat <- NA
for (g in seq_along(g_index)){
g_tbl <- grid_tbl %>% filter(grid_index == g_index[g])
grid_tbl$yhat[grid_tbl$grid_index == g_index[g]] <-
predict_em(spatial_model_fit, test_set = g_tbl)
}
raster_tbl <- grid_tbl %>%
group_by(lat, lon) %>%
summarise(m_yhat = mean(yhat), .groups = "drop")
Plotting the grid points.
ggplot(raster_tbl) + aes(y = lat, x = lon, colour = m_yhat) +
xlab("") + ylab("") +
geom_point(size = 3, alpha = 0.5) + scale_colour_viridis_c("contraception", option = "C") + coord_map()
Overlaying the points over the map.
(out <-
ggmap(country_map, extent = "device") +
xlab("") + ylab("") +
geom_point(data = raster_tbl, aes(y = lat, x = lon, colour = m_yhat), size = 3, alpha = 0.3) + scale_colour_viridis_c("contraception", option = "C"))
One can get the raster_map.png
file by running
ggsave(out, filename = "raster_map.png", width = 10, height = 8)
.
grid_coord
we created from the raster
map.level_2_geom <- st_read("sae-geospatial/data/admin_shapefiles/gha_admbnda_adm2_gss_20210308.shp", quiet = TRUE)
V218R
and V013R
, which can be found in the
code. If the model uses different variables, one can change the modeling
part.outcome <-
produce_m_v_sae(spatial_model_fit, level_2_geom,
coord_info, frac = 0.1, resampling = 0.3, counter = FALSE)
sae_map <- ggplot(outcome) +
geom_sf(aes(fill = m)) +
scale_fill_viridis_c(name = "Modern \nContraception Use", option = "B")
print(sae_map)
sae_map2 <- ggplot(outcome) +
geom_sf(aes(fill = v)) +
scale_fill_viridis_c(name = "Modern \nContraception Use (MSE)", option = "B")
print(sae_map2)
One can create the ghana_sae_map_1.png
by running
ggsave(sae_map, filename = "ghana_sae_map_1.png", width = 8, height = 8)
.