Multi-Country Malaria Analytics Using DHS and MIS Data
Harmonization, cleaning, survey weighting, and estimation of malaria indicators (fever, testing, care-seeking) across 20+ African countries.
Author
Ousmane Diallo, MPH-PhD
Published
November 16, 2025
Overview
This project harmonizes and analyzes 20+ DHS/MIS surveys from sub-Saharan Africa to estimate malaria-related indicators among children under five. The workflow integrates:
Microdata download and management using the {rdhs} R package
Cross-country harmonization of DHS/MIS datasets
Standardization of regional categories (v024) that differ by country and survey year
Construction of malaria indicators (fever, testing, care-seeking)
Survey-weighted estimation using DHS complex design
Generation of Power BI / Shiny-ready analytical datasets
This project supports malaria program decision-making, RWE analyses, and subnational monitoring.
Key skills demonstrated
Advanced R programming for large-scale survey microdata
Complex survey design and weighting with the {survey} package
Cross-country data harmonization and regional standardization
Derivation of malaria-related RWE indicators (fever, treatment, sector of care)
Preparation of analysis-ready datasets for dashboards (Power BI, Shiny)
Objectives
Use both the DHS REST API and DHS microdata to build a multi-country malaria analytics pipeline
Harmonize heterogeneous DHS/MIS datasets into a unified analytical structure
Recode region classifications to ensure comparability across countries and years
Compute malaria indicators among children <5:
Fever
Care-seeking behavior
Apply survey weights and complex sampling design
Produce cross-country analytical tables and visualizations
Methods
Data source and acquisition
Used DHS KR (children under 5) microdata for malaria-related surveys in multiple African countries.
Survey IDs were pre-selected (e.g., via DHS Download Manager) and all corresponding KR.DTA files were stored in a project directory.
R scripts automatically discovered and loaded all KR files using list.files() and haven::read_dta(), allowing the pipeline to scale across countries and survey years without manual file handling.
Show code: Data source and acquisition
# Set working directory and data pathDataDir =setwd('~/Library/CloudStorage/OneDrive-Personnel/Ousmane DIALLO/Africa_care_seeking_rate/')# Load required libraries for data manipulation and survey analysislibrary(haven) # For reading Stata/SPSS/SAS fileslibrary(labelled) # For handling variable labelslibrary(dplyr) # For data manipulationlibrary(survey) # For complex survey data analysislibrary(purrr) # For functional programming toolslibrary(plyr) # For data transformation (note: conflicts with dplyr)library(data.table) # For fast data manipulationlibrary(tidyr) # For data tidying## Function to read multiple files matching a pattern# Parameters:# filepat: regex pattern to match filenames# path: directory path to search in# fun: function to apply to each file (e.g., read_dta)# encoding: character encoding (default: "latin1")# Returns: list of dataframes that have non-missing v024 valuesread.files <-function(filepat, path, fun, encoding ="latin1") {# Get full paths of all files matching the pattern filenames <-list.files(path = path, pattern = filepat, recursive =TRUE, full.names =TRUE)# Apply the reading function to each file dflist <-sapply(filenames, fun, simplify =FALSE)# Helper function to check if v024 variable has any missing values isna_v024 <-function(df) any(is.na(df$v024))# Identify dataframes with missing v024 values cond <-sapply(dflist, isna_v024)# Return only dataframes without missing v024 values dflist[!cond]}# Get list of all DTA (Stata) files in the data directoryfilenames_all <-list.files(path = DataDir, pattern =".*.*\\.DTA", recursive =TRUE, full.names =TRUE)# Read all DHS (Demographic and Health Surveys) files# that are KR (Kids Recode) filesdhs =read.files(".*KR.*\\.DTA", DataDir, read_dta)# COMMENTED OUT: Country patterns for batch processing# This list contains regex patterns for all countries to be analyzed# Uncomment and use in loop for multi-country processing# cntrs <- list(".*NGKR21.*\\.DTA", ".*NGKR4B.*\\.DTA", ".*NGKR53.*\\.DTA", # ".*NGKR61.*\\.DTA", ".*NGKR6A.*\\.DTA", ".*NGKR71.*\\.DTA", ...)
Data ingestion and harmonization
Implemented a custom read.files() helper to:
Recursively identify KR datasets by filename patterns (e.g., .*KR.*\\.DTA)
Load each file as a data frame and drop problematic files where geographic region (v024) is missing
For each dataset, selected and harmonized key DHS variables:
Fever and treatment variables: h22 (fever in last 2 weeks), h32a–h32x (source of care)
Handled differences between surveys (e.g., when b19 is unavailable) by:
Using hw1 as a proxy for age in months
Creating a unified b19 and region variable across all datasets
Standardized region names across countries and survey years:
Region identifiers (v024) vary by country and sometimes by survey year
A comprehensive lookup table with 200+ region mappings was created to standardize region names across all countries
This lookup table is maintained as a separate data asset in the project repository for reusability and documentation.
Note: Supporting files (lookup tables, helper functions) are available in the project’s GitHub repository.
Show code: Data ingestion and harmonization
# Initialize list to store processed datasets from each KR fileall_countries_raw <-list()# Loop through each DHS datasetfor(i in1:length(dhs)){# Extract current dataset dhs_hhs = dhs[i]# Select relevant variables from the Kids Recode files:# - caseid: unique case identifier# - v000: country code and phase# - v001: cluster number# - v002: household number# - v003: respondent's line number# - v007: year of interview# - v012: respondent's age# - v101: region (often used as backup for v024)# - v02*: household characteristics (v021, v022, v024, v025)# - v005: sample weight# - b5: child alive (Yes/No)# - b*: birth history variables# - b8: current age of child in months# - h*: health variables (fever, treatment, etc.)# - h22: fever in last 2 weeks# - h32*: fever treatment source variables dhs_kr <- dhs_hhs %>% purrr::map(~dplyr::select(., caseid, v000, v001, v002, v003, v007, v012, v101, contains('v02'), v005, b5, contains('b'), b8, contains('h'), contains("h32")))## Process each dataset in the current KR file dhs_data =data.frame()for(j in1:length(dhs_kr)){ dhs_hh = dhs_kr[j]# Convert list to data frame using plyr::ldply all_datasets =ldply(dhs_hh, data.table)# Error handling for variable harmonizationtryCatch({# Case 1: If age variable 'b19' (age in months) exists, use it directlyif('b19'%in%colnames(all_datasets)){ all_datasets = all_datasets %>% dplyr::select(v000, v001, v002, v003, v007, v005, v012, v101, v021, v024 = v101, # Rename v101 to v024 for consistency v022, v025, b5, b8, h22, b19, contains("h32")) %>% dplyr::mutate(region =ifelse(is.na(v024), v101, v024))# Case 2: If 'hw1' exists, use it as proxy for age } elseif('hw1'%in%colnames(all_datasets)){ all_datasets = all_datasets %>% dplyr::select(v000, v001, v002, v003, v007, v005, v012, v024 = v101, # Rename v101 to v024 v101, v021, v022, v025, b5, b8, h22, hw1, contains("h32")) %>% dplyr::mutate(b19 = hw1, # Create b19 from hw1region =ifelse(is.na(v024), v101, v024)) }# Append harmonized dataset to dhs_data { dhs_data =list(bind_rows(dhs_data, all_datasets)) } }, error =function(e){# Print error message if harmonization failscat("ERROR :", conditionMessage(e), "\n") }) }# Store harmonized dataset for this KR file all_countries_raw[[i]] <- dhs_data}# Combine all KR datasets into one large harmonized data frameall_datasets_final <- dplyr::bind_rows(all_countries_raw)# ============================================================================# STANDARDIZE REGION NAMES USING LOOKUP TABLE# ============================================================================# Load region standardization lookup table from repository# (See: lookup/dhs_region_standardization.csv for complete mappings)region_lookup <-read.csv('lookup/dhs_region_standardization.csv', stringsAsFactors =FALSE)# Apply standardization via left join# Handles both year-specific mappings (e.g., Angola, Liberia) # and year-independent mappings (e.g., Guinea, Benin)all_datasets_final <- all_datasets_final %>%left_join(region_lookup, by =c("country_name"="country_name", "year"="year", "v024"="v024")) %>% dplyr::mutate(# Use standardized region name if available, otherwise keep v024 coderegion_name =coalesce(region_name, as.character(v024)) )
Cohort definition and derived indicators
Restricted the analysis to:
Living children (b5 == 1)
Age < 60 months (b19 < 60 where available)
Non-missing fever information (h22 != 8).
Created binary outcome variables at the child level:
prop_fievre – child had fever in the last 2 weeks (h22 == 1).
prop_publ – sought care in the public sector (any of h32a–h32i > 0).
prop_priv – sought care in the private sector (any of h32j–h32x > 0).
Re-coded v022 when missing using the constructed stratum.
Show code: Cohort definition and indicators
# Initialize empty data frames to store final resultsall_countries_fievre =data.frame() # For fever prevalence estimatesall_countries =data.frame() # For care-seeking estimates# Loop through each harmonized dataset to create analysis cohortsfor(i in1:length(dhs)){ dhs_hhs = dhs[i]# Get the processed data from harmonization step dhs_data = all_countries_raw[[i]]# Create analysis cohort with derived indicatorsfor(z in1:length(dhs_data)){ dhs_data1 = dhs_data[z]# Case 1: If age variable exists, filter for children under 5if('b19'%in%colnames(dhs_data1)){ dhs_data2 = dhs_data1 %>%# Apply eligibility criteria:# - b5 == 1: child is alive# - b19 < 60: child is under 5 years (under 60 months)# - h22 != 8: exclude "don't know" responses for fever purrr::map(~dplyr::filter(., b5 ==1& b19 <60& h22 !=8)) %>% purrr::map(~dplyr::mutate(., # Create binary fever indicator (1 = had fever in last 2 weeks)prop_fievre =ifelse(h22 ==1, 1, 0),# Calculate survey weight (DHS weights divided by 1,000,000)wt = v005/1000000,# Create sampling strata: region*10 + urban/ruralstrate =as.integer(region*10+as.integer(v025)),# Use constructed strata if v022 (sample strata) is missingv022 =ifelse(is.na(v022), strate, v022)))# Case 2: If age variable doesn't exist (no age filter possible) } elseif(!'b19'%in%colnames(dhs_data1)){ dhs_data2 = dhs_data1 %>%# Apply eligibility criteria (no age filter) purrr::map(~dplyr::filter(., b5 ==1& h22 !=8)) %>% purrr::map(~dplyr::mutate(., prop_fievre =ifelse(h22 ==1, 1, 0),wt = v005/1000000,# Create region variable (use v101 if v024 is missing)region =ifelse(is.na(v024), v101, v024),strate =as.integer(region*10+as.integer(v025)),v022 =ifelse(is.na(v022), strate, v022))) } }# Extract country and year information for metadata dhs_data3 =ldply(dhs_data2) l =data.frame(country_name =unique(dhs_data3$v000)) countries_sel =as.factor(l$country_name) year =unique(dhs_data3$v007) year =paste(year, collapse ='-')# Process fever treatment data for children with feverfor(z in1:length(dhs_data2)){ dhs_data_finale = dhs_data2[z]# Convert list to data frame dhs_data_finale =ldply(dhs_data_finale)# Filter to include only children who had fever (h22 == 1) dhs_data_finale1 = dhs_data_finale %>%filter(h22 ==1)# Create treatment source indicators:# h32a-h32i: public sector sources# (government hospital, health center, health post, mobile clinic, # community health worker, government doctor, government nurse, etc.)# h32j-h32x: private sector sources# (private hospital/clinic, pharmacy, drug store, private doctor,# private nurse, traditional healer, shop, etc.)# h32y: no treatment sought dhs_data_finale1 = dhs_data_finale1 %>% dplyr::mutate(# Sum all public sector sourcesprop_publ =rowSums(dhs_data_finale1[c("h32a", "h32b", "h32c", "h32d", "h32e", "h32f", "h32g", "h32h", "h32i")], na.rm =TRUE),# Sum all private sector sourcesprop_priv =rowSums(dhs_data_finale1[c("h32j", "h32k", "h32l", "h32m", "h32n", "h32o", "h32p", "h32q", "h32r", "h32s", "h32t", "h32u", "h32v", "h32w", "h32x")], na.rm =TRUE),# Convert to binary indicators (1 if any source in sector used)prop_publ =ifelse(prop_publ >0, 1, 0),prop_priv =ifelse(prop_priv >0, 1, 0),# Create no treatment indicatorno_trea =ifelse(h32y ==0, 0, 1))# Convert back to list format for estimation functions dhs_data_finale1 =list(dhs_data_finale1) }}
Survey design and estimation
Applied survey weights to child-level data to account for complex sampling design
Defined complex survey designs with the survey package:
Data transformation: Child-level microdata (one row per child) was aggregated to region level using survey-weighted estimation (svyby) to produce population-representative estimates.
This dataset feeds into interactive dashboards (Power BI/R Shiny) for:
Cross-country malaria burden comparisons
Treatment-seeking behavior trends
Market access gap analysis
Show code: Export analysis-ready dataset
# ============================================================================# EXPORT FINAL ANALYSIS-READY DATASET# ============================================================================# Final dataset structure:# - One row per child (under 5 years)# - Standardized region names across countries# - Binary indicators: fever, public care, private care, no treatment# - Survey weights and design variables# - Ready for visualization in Power BI or R Shiny# Export to CSV for dashboard ingestionwrite.csv(all_countries_finale, 'data/malaria_fever_careseeking_final.csv', row.names =FALSE)# Preview final dataset structureglimpse(all_countries_finale)# Data summary for documentationcat("Total records:", nrow(all_countries_finale), "\n")cat("Countries:", length(unique(all_countries_finale$country_name)), "\n")cat("Survey years:", paste(range(all_countries_finale$year), collapse ="-"), "\n")
Final Dataset Schema
Variable
Type
Description
Example
country_name
Character
Country name
“Benin”
region
Character
Standardized region name
“alibori”
year
Character
Survey year
“2017-2018”
prop_fievre
Numeric
Fever prevalence among children <5
0.23
prop_publ
Numeric
% of febrile children who sought care in public sector
0.60
prop_priv
Numeric
% of febrile children who sought care in private sector
0.07
no_treat
Numeric
% of febrile children who did not seek any treatment
0.25
type_survey
Character
Survey type
“DHS” or “MIS”
Important Note on Care-Seeking Categories:
These categories are not mutually exclusive. Children may seek care from multiple sources during a single illness episode (e.g., public health center for consultation + private pharmacy for medications). This reflects real-world healthcare-seeking behavior, particularly in contexts where:
Stock-outs are common in public facilities
Patients utilize a mix of public and private services based on availability, cost, and perceived quality
Treatment pathways involve sequential visits to multiple providers
Interpretation: In regions where prop_publ + prop_priv + No_treat > 1.0, this indicates overlap where some children accessed both public and private care. This is an important indicator of health system performance and patient behavior patterns.
Sample Data Preview
country_name
region
prop_fievre
prop_publ
prop_priv
no_treat
type_survey
year
Angola
cabinda
0.16
0.60
0.17
0.35
DHS
2015-2016
Angola
zaire
0.15
0.64
0.21
0.12
DHS
2015-2016
Benin
alibori
0.23
0.10
0.01
0.83
DHS
2017-2018
Benin
atacora
0.16
0.28
0.19
0.52
DHS
2017-2018
Burkina Faso
Boucle du Mouhoun
0.13
0.60
0.01
0.39
MIS
2021
Note: Some regions show totals > 1.0 (e.g., Angola-Cabinda: 0.60 + 0.17 + 0.35 = 1.12), indicating dual care-seeking from both sectors.
Note: This dataset is designed for direct ingestion into Power BI/Shiny dashboards.
Ousmane Diallo, MPH-PhD – Biostatistician, Data Scientist & Epidemiologist based in Chicago, Illinois, USA. Specializing in SAS programming, CDISC standards, and real-world evidence for clinical research.