Lead investigators (clockwise from top left): Archer Yang, Eric Kolaczyk, Kaiqiong Zhao, and Hui Peng.

Collaborative Research Team Projects – Project 30

Collaborative Innovations in Statistical Learning for Next-generation Drug Discovery

Modern drug discovery faces steep challenges, including rising costs, complexity, and high failure rates. This collaborative research program will develop a synergistic suite of advanced statistical and machine learning (ML) methods to transform critical decisions across the early drug discovery pipeline. The goal is to equip researchers with powerful, integrated tools for confident compound selection, efficient experimentation, and rigorous drug target validation.

Research Category:
Region: National
Date: 2026–2029

Why Do We Need Novel Statistical and Machine Learning Tools for Drug Discovery?

The rising costs, growing complexity, and high failure rates of modern drug discovery demand bold, quantitative innovation.

The team’s approach comprises three interlocking themes: (a) Conformal Inference for Compound Screening will provide rigorous false discovery rate (FDR) control in virtual screening, ensuring high-quality starting points crucial for subsequent stages. Building on this, (b) Dynamic Decision Making with Individualized Variable Selection will use reinforcement learning to develop adaptive, cost-effective experimental strategies for characterizing promising compounds. Finally, to ensure these efforts target biologically sound mechanisms, (c) Advanced Mendelian Randomization (MR) will integrate multi-omic and biomarker data to rigorously assess the causal validity of drug targets. Together, these themes create a cohesive program that will improve decision-making from initial screening to pre-clinical target validation.

Research Aims and Activities

The central aim of this project is to build interdependent novel statistical and ML frameworks, devise adaptive experimental strategies, and advance causal inference techniques for drug discovery. Anticipated outcomes include innovative algorithms, open-source software, and validated applications that will lead to faster discovery cycles, reduced costs, and safer, more effective medicines.

The methods resulting from the first of the project’s three themes, compound screening with confidence via conformal inference, will be validated on two cutting-edge drug discovery datasets with the aim of improving the efficiency and reliability of early-stage drug discovery.
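To make the idea of screening with FDR control concrete, the following is a minimal sketch of one standard recipe, not the team's method: conformal p-values computed against a calibration set, followed by the Benjamini-Hochberg procedure. It assumes a calibration set of compounds known to be inactive, scored by the same fixed model as the candidate compounds (higher score meaning "more likely active"), and exchangeable with the inactive candidates. The function names `conformal_p_values` and `benjamini_hochberg` are hypothetical and chosen for illustration.

```python
import numpy as np


def conformal_p_values(cal_scores, test_scores):
    """Conformal p-values for candidates, using scores of known inactives.

    Assumption: a higher score means the model considers the compound
    more likely to be active.
    """
    cal_sorted = np.sort(np.asarray(cal_scores, dtype=float))
    test_scores = np.asarray(test_scores, dtype=float)
    n = cal_sorted.size
    # Number of calibration scores that are >= each candidate's score.
    n_ge = n - np.searchsorted(cal_sorted, test_scores, side="left")
    return (1.0 + n_ge) / (n + 1.0)


def benjamini_hochberg(p_values, q):
    """Boolean mask of hypotheses rejected by BH at target FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    passed = np.nonzero(p[order] <= thresholds)[0]
    selected = np.zeros(m, dtype=bool)
    if passed.size > 0:
        # Reject every hypothesis up to the largest rank that passes.
        selected[order[: passed[-1] + 1]] = True
    return selected
```

In use, one would pass the calibration and candidate scores to `conformal_p_values`, then pass the resulting p-values and a target level such as `q=0.1` to `benjamini_hochberg`; the returned mask marks the compounds shortlisted for follow-up.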

The methods resulting from the second theme, dynamic decision making with individualized variable selection, will be tested on public benchmarks as well as a rich dataset provided by Merck to efficiently predict important molecular activities and safety profiles, optimizing sequential decision-making while minimizing costly experiments.

Finally, the investigations related to the third theme, robust causal discovery via Mendelian randomization for drug target validation, will use a biomarker-informed MR approach to clarify whether cholesterol-lowering drugs genuinely affect diabetes-related pathways, as well as a multi-omic MR strategy applied to data from the UK Biobank to discover and prioritize repurposing opportunities for existing drugs.
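For readers unfamiliar with Mendelian randomization, the sketch below shows the textbook fixed-effect inverse-variance weighted (IVW) estimator computed from per-variant summary statistics. It is a baseline illustration only, not the biomarker-informed or multi-omic approach described above, and it assumes independent, valid genetic instruments. The function name `ivw_estimate` and its argument names are hypothetical.

```python
import numpy as np


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate and its standard error.

    Inputs are per-variant summary statistics: the variant-exposure
    effects, the variant-outcome effects, and the standard errors of
    the variant-outcome effects.
    """
    bx = np.asarray(beta_exposure, dtype=float)
    by = np.asarray(beta_outcome, dtype=float)
    se = np.asarray(se_outcome, dtype=float)
    weights = 1.0 / se**2
    denom = np.sum(weights * bx**2)
    # Weighted regression of outcome effects on exposure effects
    # through the origin.
    estimate = np.sum(weights * bx * by) / denom
    std_error = np.sqrt(1.0 / denom)
    return float(estimate), float(std_error)
```

Each variant contributes a ratio estimate `by / bx`; the IVW estimate is the precision-weighted average of those ratios, so variants with stronger exposure effects and more precise outcome effects count for more.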

People Behind the Project

Project Team

Archer Yang | McGill University
Eric Kolaczyk | McGill University
Kaiqiong Zhao | York University
Hui Peng | University of Toronto

Collaborators

Celia Greenwood | Lady Davis Institute for Medical Research
Jian Tang | HEC Montréal, Mila – Quebec AI Institute
Yeying Zhu | University of Waterloo
Linbo Wang | University of Toronto, Vector Institute
David Stephens | McGill University
Marc-André Legault | Université de Montréal, Mila – Quebec AI Institute
Qiang Sun | University of Toronto, Vector Institute, Mohamed bin Zayed University of Artificial Intelligence

Project Partners

Stanford University (Lu Tian, Ying Cui)
University of Minnesota (Hui Zou)
Merck Research Lab (Xiang Yu)
