Lead investigators: David Haziza, Zeinab Mashreghi, and Changbao Wu

Collaborative Research Team Projects – Project 31

Statistical Inference in Survey Sampling with Machine Learning Methods

Machine learning (ML) helps national statistical offices (NSOs) improve accuracy by finding patterns in large, complex datasets. Despite the widespread use of ML techniques, the literature on statistical inference from ML-based predictions remains sparse. This project will develop statistical tools for making valid inferences from ML-based predictions under complex sampling designs in three settings: handling missing data, model-assisted estimation, and the integration of probability and non-probability samples. The project team aims to establish foundational frameworks, along with user-friendly software that makes these procedures accessible to practitioners.

Research Category:
Region:
National
Date:
2026–2029

Why Do We Need Better Tools for Statistical Inference from ML-based Predictions?

In recent years, there has been growing interest in applying machine learning procedures within national statistical offices. The increasing availability of big data sources and administrative files has enabled the use of sophisticated ML algorithms, which help NSOs improve accuracy by finding patterns in large, complex datasets. Yet the literature on statistical inference from ML-based predictions remains sparse, and the properties of point and variance estimators derived from these methods are not well understood. It is therefore critical to develop statistical tools that yield valid inferences from ML-based predictions under complex sampling designs across different settings: handling missing data, model-assisted estimation, and the integration of probability and non-probability samples.

To this end, the Collaborative Research Team (CRT) will provide education for students, researchers, and survey methodologists working in NSOs and other agencies such as the Bank of Canada. It will also promote the use of modern statistical methods in the survey community by demonstrating the applicability of the proposed methodologies to real-world datasets. The CRT aims to establish foundational frameworks, along with user-friendly software that makes these procedures accessible to practitioners.


Research Aims and Activities

ML is increasingly used to obtain accurate predictions in surveys. Most often, the goal is to estimate finite population parameters (e.g., a finite population total, mean, or quantile), and this is the focus of this research.

The CRT will investigate four problems in this area:

  1. Debiased machine learning for imputation and model-assisted estimation: Projects will be undertaken to investigate double debiased imputation procedures for a population mean, double debiased imputation procedures for general parameters, and model-assisted estimation based on debiased machine learning methods.
  2. Machine learning for the treatment of unit nonresponse: The team will examine finite-population bootstrap procedures to identify the settings in which they perform well and those in which they break down; analyze the choice of architecture, defined by a set of hyperparameters, to identify an optimal one; explore the effect of weight trimming procedures based on adaptive thresholds chosen, for example, to minimize the estimated mean squared error (MSE) of inverse probability weighting estimators; and address the lack of Neyman orthogonality by considering a weighting approach in which separate outcome regression models are fitted for each of the key survey variables.
  3. Integration of probability and non-probability samples: The objective of this project is to study how, and under what assumptions, to obtain approximately unbiased predictors of finite population totals or means when little or no probability survey data are available, and how to estimate the quality (prediction variance) of these predictors.
  4. Prediction-powered inference with survey data: This work will address a number of technical and fundamental issues related to the use of prediction-powered inference for data integration in survey sampling and official statistics.
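As a rough illustration of the kind of estimator these problems build on, the sketch below compares a design-weighted (Horvitz-Thompson) estimate of a finite population total with a model-assisted difference estimate that adds the population sum of predictions to a weighted sum of sample residuals. It is not the CRT's methodology: the simulated population, the simple random sampling design, and all function names are assumptions made for illustration, and a least-squares line stands in for an ML predictor. The predictor is also fitted on the same sample it corrects, so no cross-fitting or debiasing step is shown.

```python
import random


def horvitz_thompson_total(y_sample, weights):
    """Design-weighted estimate of a population total."""
    return sum(w * y for w, y in zip(weights, y_sample))


def model_assisted_total(pred_population, pred_sample, y_sample, weights):
    """Population sum of predictions plus design-weighted sample residuals."""
    residual_term = sum(
        w * (y - m) for w, y, m in zip(weights, y_sample, pred_sample)
    )
    return sum(pred_population) + residual_term


def fit_line(x, y):
    """Least-squares line; a stand-in for any ML predictor (assumption)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical finite population: x is known for every unit, y only in the sample.
rng = random.Random(0)
N, n = 5000, 250
x = [rng.uniform(0.0, 10.0) for _ in range(N)]
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Simple random sample without replacement, so every design weight is N / n.
sample_idx = rng.sample(range(N), n)
weights = [N / n] * n
x_s = [x[k] for k in sample_idx]
y_s = [y[k] for k in sample_idx]

# Fit the predictor on the sample, then predict for the whole population.
intercept, slope = fit_line(x_s, y_s)
pred_population = [intercept + slope * xi for xi in x]
pred_sample = [pred_population[k] for k in sample_idx]

ht = horvitz_thompson_total(y_s, weights)
ma = model_assisted_total(pred_population, pred_sample, y_s, weights)

print("true total:          ", round(sum(y), 1))
print("Horvitz-Thompson:    ", round(ht, 1))
print("model-assisted total:", round(ma, 1))
```

Two limiting cases show how the estimator behaves: with predictions identically zero it reduces to the Horvitz-Thompson estimate, and with perfect predictions it returns the true total. The inferential questions raised above, such as variance estimation when the predictor is a flexible ML method, are not addressed by this sketch.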

People Behind the Project

Project Team

David Haziza | University of Ottawa

Zeinab Mashreghi | University of Winnipeg

Changbao Wu | University of Waterloo

Collaborators

Jean-François Beaumont | Statistics Canada

Sixia Chen | University of Oklahoma

Mehdi Dagdoug | McGill University

Audrey-Anne Vallée | Université Laval

Angelika Welte | Bank of Canada

Project Partners

Statistics Canada

Bank of Canada
