Demographic and Health Surveys (DHS) /Malaria Indicators Surveys (MIS) Multi-Country Malaria Analytics
Harmonization, cleaning, survey weighting, and estimation of malaria indicators (fever, testing, care-seeking) across 20+ African countries.
Author
Ousmane Diallo, MPH-PhD
Published
November 16, 2025
Overview
This project harmonizes and analyzes 20+ DHS/MIS surveys from sub-Saharan Africa to estimate malaria-related indicators among children under five. The workflow integrates:
API-based acquisition of indicator-level data from the DHS REST API
Microdata download and management using the {rdhs} R package
Cross-country harmonization of DHS/MIS datasets
Standardization of regional categories (v024) that differ by country and survey year
Construction of malaria indicators (fever, testing, care-seeking)
Survey-weighted estimation using DHS complex design
Generation of Power BI / Shiny-ready analytical datasets
This project supports malaria program decision-making, RWE analyses, and subnational monitoring.
Objectives
Use both the DHS REST API and DHS microdata to build a multi-country malaria analytics pipeline
Harmonize heterogeneous DHS/MIS datasets into a unified analytical structure
Recode region classifications to ensure comparability across countries and years
Compute malaria indicators among children <5:
Fever
Testing (RDT/ Microscopy)
Care-seeking behavior
Apply survey weights and complex sampling design
Produce cross-country analytical tables and visualizations
Methods
Data source and acquisition
Used DHS KR (children under 5) microdata for malaria-related surveys in multiple African countries.
Survey IDs were pre-selected (e.g., via DHS Download Manager) and all corresponding KR.DTA files were stored in a project directory.
R scripts automatically discovered and loaded all KR files using list.files() and haven::read_dta(), allowing the pipeline to scale across countries and survey years without manual file handling.
Show code: Data source and acquisition
# Set working directory and data pathDataDir =setwd('~/Library/CloudStorage/OneDrive-Personnel/Ousmane DIALLO/Africa_care_seeking_rate/')# Load required libraries for data manipulation and survey analysislibrary(haven) # For reading Stata/SPSS/SAS fileslibrary(labelled) # For handling variable labelslibrary(dplyr) # For data manipulationlibrary(survey) # For complex survey data analysislibrary(purrr) # For functional programming toolslibrary(plyr) # For data transformation (note: conflicts with dplyr)library(data.table) # For fast data manipulationlibrary(tidyr) # For data tidying## Function to read multiple files matching a pattern# Parameters:# filepat: regex pattern to match filenames# path: directory path to search in# fun: function to apply to each file (e.g., read_dta)# encoding: character encoding (default: "latin1")# Returns: list of dataframes that have non-missing v024 valuesread.files <-function(filepat, path, fun, encoding ="latin1") {# Get full paths of all files matching the pattern filenames <-list.files(path = path, pattern = filepat, recursive =TRUE, full.names =TRUE)# Apply the reading function to each file dflist <-sapply(filenames, fun, simplify =FALSE)# Helper function to check if v024 variable has any missing values isna_v024 <-function(df) any(is.na(df$v024))# Identify dataframes with missing v024 values cond <-sapply(dflist, isna_v024)# Return only dataframes without missing v024 values dflist[!cond]}# Get list of all DTA (Stata) files in the data directoryfilenames_all <-list.files(path = DataDir, pattern =".*.*\\.DTA", recursive =TRUE, full.names =TRUE)# Read all DHS (Demographic and Health Surveys) files# that are KR (Kids Recode) filesdhs =read.files(".*KR.*\\.DTA", DataDir, read_dta)# COMMENTED OUT: Country patterns for batch processing# This list contains regex patterns for all countries to be analyzed# Uncomment and use in loop for multi-country processing# cntrs <- list(".*NGKR21.*\\.DTA", ".*NGKR4B.*\\.DTA", ".*NGKR53.*\\.DTA", # ".*NGKR61.*\\.DTA", ".*NGKR6A.*\\.DTA", ".*NGKR71.*\\.DTA", ...)
Data ingestion and harmonization
Implemented a custom read.files() helper to:
Recursively identify KR datasets by filename patterns (e.g., .*KR.*\\.DTA).
Load each file as a data frame and drop problematic files where geographic region (v024) is missing.
For each dataset, selected and harmonized key DHS variables:
Fever and treatment variables: h22 (fever in last 2 weeks), h32a–h32x (source of care)
Handled differences between surveys (e.g., when b19 is unavailable) by:
Using hw1 as a proxy for age in months.
Creating a unified b19 and region variable across all datasets.
Show code: Data ingestion and harmonization
# Initialize list to store processed datasets from each KR fileall_countries_raw <-list()# Loop through each DHS datasetfor(i in1:length(dhs)){# Extract current dataset dhs_hhs = dhs[i]# Select relevant variables from the Kids Recode files:# - caseid: unique case identifier# - v000: country code and phase# - v001: cluster number# - v002: household number# - v003: respondent's line number# - v007: year of interview# - v012: respondent's age# - v101: region (often used as backup for v024)# - v02*: household characteristics (v021, v022, v024, v025)# - v005: sample weight# - b5: child alive (Yes/No)# - b*: birth history variables# - b8: current age of child in months# - h*: health variables (fever, treatment, etc.)# - h22: fever in last 2 weeks# - h32*: fever treatment source variables dhs_kr <- dhs_hhs %>%map(~dplyr::select(., caseid, v000, v001, v002, v003, v007, v012, v101, contains('v02'), v005, b5, contains('b'), b8, contains('h'), contains("h32")))## Process each dataset in the current KR file dhs_data =data.frame()for(j in1:length(dhs_kr)){ dhs_hh = dhs_kr[j]# Convert list to data frame using plyr::ldply all_datasets =ldply(dhs_hh, data.table)# Error handling for variable harmonizationtryCatch({# Case 1: If age variable 'b19' (age in months) exists, use it directlyif('b19'%in%colnames(all_datasets)){ all_datasets = all_datasets %>%select(v000, v001, v002, v003, v007, v005, v012, v101, v021, v024 = v101, # Rename v101 to v024 for consistency v022, v025, b5, b8, h22, b19, contains("h32")) %>%mutate(region =ifelse(is.na(v024), v101, v024))# Case 2: If 'hw1' exists, use it as proxy for age } elseif('hw1'%in%colnames(all_datasets)){ all_datasets = all_datasets %>%select(v000, v001, v002, v003, v007, v005, v012, v024 = v101, # Rename v101 to v024 v101, v021, v022, v025, b5, b8, h22, hw1, contains("h32")) %>%mutate(b19 = hw1, # Create b19 from hw1region =ifelse(is.na(v024), v101, v024)) }# Append harmonized dataset to dhs_data { dhs_data =list(bind_rows(dhs_data, all_datasets)) } }, error =function(e){# Print error message if harmonization failscat("ERROR :", conditionMessage(e), "\n") }) }# Store harmonized dataset for this KR file all_countries_raw[[i]] <- dhs_data}# Combine all KR datasets into one large harmonized data frameall_datasets_final <- dplyr::bind_rows(all_countries_raw)
Cohort definition and derived indicators
Restricted the analysis to:
Living children (b5 == 1)
Age < 60 months (b19 < 60 where available)
Non-missing fever information (h22 != 8).
Created binary outcome variables at the child level:
prop_fievre – child had fever in the last 2 weeks (h22 == 1).
prop_publ – sought care in the public sector (any of h32a–h32i > 0).
prop_priv – sought care in the private sector (any of h32j–h32x > 0).
Re-coded v022 when missing using the constructed stratum.
View code
#| label: cohort-definition-indicators#| eval: false#| code-fold: true#| code-summary: "Show code: Cohort definition and derived indicators"# Initialize empty data frames to store final resultsall_countries_fievre =data.frame() # For fever prevalence estimatesall_countries =data.frame() # For care-seeking estimates# Loop through each harmonized dataset to create analysis cohortsfor(i in1:length(dhs)){ dhs_hhs = dhs[i]# Get the processed data from harmonization step dhs_data = all_countries_raw[[i]]# Create analysis cohort with derived indicatorsfor(z in1:length(dhs_data)){ dhs_data1 = dhs_data[z]# Case 1: If age variable exists, filter for children under 5if('b19'%in%colnames(dhs_data1)){ dhs_data2 = dhs_data1 %>%# Apply eligibility criteria:# - b5 == 1: child is alive# - b19 < 60: child is under 5 years (under 60 months)# - h22 != 8: exclude "don't know" responses for fevermap(~filter(., b5 ==1& b19 <60& h22 !=8)) %>%map(~mutate(., # Create binary fever indicator (1 = had fever in last 2 weeks)prop_fievre =ifelse(h22 ==1, 1, 0),# Calculate survey weight (DHS weights divided by 1,000,000)wt = v005/1000000,# Create sampling strata: region*10 + urban/ruralstrate =as.integer(region*10+as.integer(v025)),# Use constructed strata if v022 (sample strata) is missingv022 =ifelse(is.na(v022), strate, v022)))# Case 2: If age variable doesn't exist (no age filter possible) } elseif(!'b19'%in%colnames(dhs_data1)){ dhs_data2 = dhs_data1 %>%# Apply eligibility criteria (no age filter)map(~filter(., b5 ==1& h22 !=8)) %>%map(~mutate(., prop_fievre =ifelse(h22 ==1, 1, 0),wt = v005/1000000,# Create region variable (use v101 if v024 is missing)region =ifelse(is.na(v024), v101, v024),strate =as.integer(region*10+as.integer(v025)),v022 =ifelse(is.na(v022), strate, v022))) } }# Extract country and year information for metadata dhs_data3 =ldply(dhs_data2) l =data.frame(country_name =unique(dhs_data3$v000)) countries_sel =as.factor(l$country_name) year =unique(dhs_data3$v007) year =paste(year, collapse ='-')# Process fever treatment data for children with feverfor(z in1:length(dhs_data2)){ dhs_data_finale = dhs_data2[z]# Convert list to data frame dhs_data_finale =ldply(dhs_data_finale)# Filter to include only children who had fever (h22 == 1) dhs_data_finale1 = dhs_data_finale %>%filter(h22 ==1)# Create treatment source indicators:# h32a-h32i: public sector sources# (government hospital, health center, health post, mobile clinic, # community health worker, government doctor, government nurse, etc.)# h32j-h32x: private sector sources# (private hospital/clinic, pharmacy, drug store, private doctor,# private nurse, traditional healer, shop, etc.)# h32y: no treatment sought dhs_data_finale1 = dhs_data_finale1 %>%mutate(# Sum all public sector sourcesprop_publ =rowSums(dhs_data_finale1[c("h32a", "h32b", "h32c", "h32d", "h32e", "h32f", "h32g", "h32h", "h32i")], na.rm =TRUE),# Sum all private sector sourcesprop_priv =rowSums(dhs_data_finale1[c("h32j", "h32k", "h32l", "h32m", "h32n", "h32o", "h32p", "h32q", "h32r", "h32s", "h32t", "h32u", "h32v", "h32w", "h32x")], na.rm =TRUE),# Convert to binary indicators (1 if any source in sector used)prop_publ =ifelse(prop_publ >0, 1, 0),prop_priv =ifelse(prop_priv >0, 1, 0),# Create no treatment indicatorno_trea =ifelse(h32y ==0, 0, 1))# Convert back to list format for estimation functions dhs_data_finale1 =list(dhs_data_finale1) }}
Survey design and estimation
Defined complex survey designs with the survey package:
Exported final indicator tables to CSV (prop_fever_finale.csv, treatement_seeking.csv, etc.) to be used in mapping tools, dashboards (Shiny, Power BI), or reporting.
Show code: Survey design and estimation
# Loop through each country/survey to calculate weighted estimatesfor(i in1:length(dhs)){# Get processed data from previous steps dhs_data = all_countries_raw[[i]] dhs_data2 =# ... (from cohort definition step) dhs_data_finale1 =# ... (from cohort definition step)# Extract metadata dhs_data3 =ldply(dhs_data2) l =data.frame(country_name =unique(dhs_data3$v000)) countries_sel =as.factor(l$country_name) year =unique(dhs_data3$v007) year =paste(year, collapse ='-')# Set option for handling lonely PSUs (primary sampling units)# "adjust" method: centers lonely PSU data at the sample grand meanoptions(survey.lonely.psu ="adjust")# ===========================================================================# PART 4A: Estimate fever prevalence by region# =========================================================================== est_data <-data.frame() vars <-c('prop_fievre')for (y in1:length(vars)) { col <-list(vars[y]) by <-list('region')# Remove missing values for the current variable df <- dhs_data2 %>%map(~drop_na(., vars[y]))# Apply survey-weighted estimation function (estim_prop)# This function should create svydesign and use svyby to calculate means# Format: svydesign(ids = ~v021, strata = ~v022, weights = ~wt, data = .)# svyby(~prop_fievre, by = ~region, design, svymean) df <-pmap(list(df, col, by), estim_prop) df <- plyr::ldply(df)# Add metadata columns df$year = year df$country_name = countries_sel# Format and select final columns est_data =bind_rows(est_data, df) %>%select(country_name, region, year, prop_fievre) %>%mutate(prop_fievre =round(prop_fievre, 2)) }# Append to master fever prevalence dataset all_countries_fievre =bind_rows(all_countries_fievre, est_data)# ===========================================================================# PART 4B: Estimate care-seeking patterns by region (public, private, none)# =========================================================================== care_data <-list() vars1 <-c('prop_publ', 'prop_priv', 'no_trea')for (a in1:length(vars1)) { col <-list(vars1[a]) by <-list('region')# Convert list to data frame df1 <- dhs_data_finale1 df1 =ldply(df1)# Extract metadata l =data.frame(country_name =unique(df1$v000)) countries_sel =as.factor(l$country_name) year =unique(df1$v007) year =paste(year, collapse ='-')# Convert back to list for estimation df1 =list(df1)# Apply survey-weighted estimation df1 <-pmap(list(df1, col, by), estim_prop) df1 <- plyr::ldply(df1)# Add metadata df1$year = year df1$country_name = countries_sel df1[, vars1[a]] = df1[, vars1[a]]# Store in list care_data[[a]] = df1 }# Combine all care-seeking estimates into one dataset prop_estimation =bind_cols(care_data[[1]], care_data[[2]], care_data[[3]]) %>%rename_at(1, ~"region") %>%rename_at(4, ~"year") %>%rename_at(5, ~"country_name") %>%select(country_name, region, prop_publ, prop_priv, no_trea, year) %>%mutate(prop_publ =round(prop_publ, 2),prop_priv =round(prop_priv, 2),no_trea =round(no_trea, 2))# Append to master care-seeking dataset all_countries =bind_rows(all_countries, prop_estimation)}# ===========================================================================# FINAL OUTPUT: Merge fever prevalence with care-seeking estimates# ===========================================================================all_countries_finale = all_countries_fievre %>%left_join(all_countries, by =c('country_name', 'region', 'year'))# Export final datasets (uncomment to save)# write.csv(all_countries_finale, # 'data_care_seeking_final.csv', # row.names = FALSE)
Outputs
Produced standardized, region-level datasets with:
Fever prevalence among children under 5.
Proportion of febrile children seeking care in public vs private facilities.
Proportion of febrile children not receiving any treatment.
Designed the pipeline to work across multiple African countries and survey years, enabling cross-country comparisons and time trends in malaria-related care-seeking behaviour.
Ousmane Diallo, MPH-PhD – Biostatistician, Data Scientist & Epidemiologist based in Chicago, Illinois, USA. Specializing in SAS programming, CDISC standards, and real-world evidence for clinical research.