Leakage and the Reproducibility Crisis in ML-based Science

Data leakage causes reproducibility failures in ML-based science

The running list below consists of papers that highlight reproducibility failures or pitfalls in ML-based science. We find 22 papers from 17 fields where errors have been found, collectively affecting 294 papers and in some cases leading to wildly overoptimistic conclusions. In each case, data leakage causes errors in the modeling process.

Field	Paper	Year	Num. papers reviewed	Num. papers with pitfalls	Pitfalls
Medicine	Bouwmeester et al.	2012	71	27	No train-test split
Neuroimaging	Whelan et al.	2014	—	14	No train-test split; Feature selection on train and test set
Bioinformatics	Blagus et al.	2015	—	6	Pre-processing on train and test sets together
Autism Diagnostics	Bone et al.	2015	—	3	Duplicates across train-test split; Sampling bias
Nutrition research	Ivanescu et al.	2016	—	4	No train-test split
Software engineering	Tu et al.	2018	58	11	Temporal leakage
Toxicology	Alves et al.	2019	—	1	Duplicates across train-test split
Clinical epidemiology	Christodoulou et al.	2019	71	48	Feature selection on train and test set
Satellite imaging	Nalepa et al.	2019	17	17	Non-independence between train and test sets
Tractography	Poulin et al.	2019	4	2	No train-test split
Brain-computer interfaces	Nakanishi et al.	2020	—	1	No train-test split
Histopathology	Oner et al.	2020	—	1	Non-independence between train and test sets
Neuropsychiatry	Poldrack et al.	2020	100	53	No train-test split; Pre-processing on train and test sets together
Neuroimaging	Ahmed et al.	2021	—	1	Non-independence between train and test sets
Neuroimaging	Li et al.	2021	122	18	Non-independence between train and test sets
IT Operations	Lyu et al.	2021	9	3	Temporal leakage
Medicine	Filho et al.	2021	—	1	Illegitimate features
Radiology	Roberts et al.	2021	62	16	No train-test split; Duplicates in train and test sets; Sampling bias
Neuropsychiatry	Shim et al.	2021	—	1	Feature selection on training and test sets
Medicine	Vandewiele et al.	2021	24	21	Feature selection on train and test sets; Non-independence between train and test sets; Sampling bias
Computer security	Arp et al.	2022	30	22	No train-test split; Pre-processing on train and test sets together; Illegitimate features; others
Genomics	Barnett et al.	2022	41	23	Feature selection on training and test sets

Data leakage has long been recognized as a leading cause of errors in ML applications. In formative work on leakage, Kaufman et al. provide an overview of different types of errors and give several recommendations for mitigating these errors. Since this paper was published, the ML community has investigated the impact of leakage in several engineering applications and modeling competitions. However, leakage occurring in ML-based science has not been comprehensively investigated. As a result, mitigations for data leakage in scientific applications of ML remain understudied.

Towards a solution: A taxonomy of data leakage

A taxonomy of data leakage can enable a better understanding of why leakage occurs in ML-based science and inform potential solutions. We present a fine-grained taxonomy of 8 types of leakage that range from textbook errors to open research problems. Our taxonomy is comprehensive and addresses data leakage arising during the data collection, pre-processing, modeling and evaluation steps. In particular, our taxonomy addresses all cases of data leakage that we found in our survey. We provide an overview of the top-level categories of leakage here; a more detailed taxonomy is included in our paper.

1. Lack of clean separation of training and test set: If the training dataset is not separated from the test dataset during all pre-processing, modeling and evaluation steps, the model has access to information in the test set before its performance is evaluated.

2. Model uses features which are not legitimate: The model has access to features that should not be legitimately available for use in the modeling exercise, for instance if they are a proxy for the outcome variable.

3. Test set is not drawn from the distribution of interest: The distribution of data on which the performance of an ML model is evaluated differs from the distribution of data about which the scientific claims are made.
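Two of the pitfalls named in the table above, duplicates across the train-test split and temporal leakage, lend themselves to simple automated checks. The following is a minimal sketch in Python, not the procedure used in the survey. The function names `shared_records` and `has_temporal_leakage` and the toy records are hypothetical, and the temporal check assumes the strict convention that every training example must predate every test example.

```python
from datetime import date


def shared_records(train, test):
    """Return records that appear in both the training and the test set."""
    return sorted(set(train) & set(test))


def has_temporal_leakage(train_dates, test_dates):
    """True if any training example is dated on or after the earliest test example."""
    return max(train_dates) >= min(test_dates)


# Toy records as hashable tuples: (subject_id, measurement). Hypothetical data.
train = [("s1", 0.4), ("s2", 1.3), ("s3", 0.9)]
test = [("s3", 0.9), ("s4", 2.1)]  # ("s3", 0.9) is duplicated across the split

duplicates = shared_records(train, test)
print(duplicates)  # [('s3', 0.9)]

train_dates = [date(2019, 1, 5), date(2019, 6, 1), date(2020, 3, 1)]
test_dates = [date(2020, 1, 1), date(2020, 9, 1)]
leaky = has_temporal_leakage(train_dates, test_dates)
print(leaky)  # True: the 2020-03-01 training example postdates a test example
```

Exact-match checks like these only catch the simplest cases. Near-duplicates or non-independent samples (for example, several records from the same subject) need domain-specific grouping before the split.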

Model info sheets for addressing leakage

Our taxonomy of data leakage highlights several failure modes which are prevalent in ML-based science. To address leakage, researchers using ML methods need to connect the performance of their ML models to their scientific claims. To detect cases of leakage, we provide a template for a model info sheet which should be included when making a scientific claim using predictive modeling. The template consists of precise arguments needed to justify the absence of leakage, and is inspired by Mitchell et al.'s model cards for increasing the transparency of ML models.

Model info sheets can be voluntarily used by researchers to detect leakage. Of course, model info sheets can’t prevent researchers from making false claims, but we hope they can make errors more apparent. Note that for model info sheets to be verified, the analysis must be computationally reproducible. Also, model info sheets don’t address reproducibility issues other than leakage.
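To illustrate the idea of pairing each leakage category with an explicit argument, here is a minimal sketch in Python. It is not the actual model info sheet template: the class `ModelInfoSheet`, its three field names, and the example justifications are all hypothetical, chosen only to mirror the three top-level categories of the taxonomy above.

```python
from dataclasses import dataclass, fields


@dataclass
class ModelInfoSheet:
    """Hypothetical container: one free-text justification per leakage category."""
    train_test_separation: str = ""
    feature_legitimacy: str = ""
    test_distribution: str = ""

    def missing_arguments(self):
        """Names of the justifications that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


sheet = ModelInfoSheet(
    train_test_separation="Split made before imputation; scaler fit on training rows only.",
    feature_legitimacy="",
    test_distribution="Test set covers the years after the training period.",
)
missing = sheet.missing_arguments()
print(missing)  # ['feature_legitimacy']
```

A check like `missing_arguments` only confirms that an argument was written down. Whether the argument is valid still has to be verified against the reproducible analysis.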

A case study of irreproducibility in civil war prediction

We find that prominent studies on civil war prediction claiming superior performance of ML models over baseline Logistic Regression models fail to reproduce. Our results provide two reasons to be skeptical of the use of ML methods in this research area: they call the usefulness of these methods into question, and they show how difficult the methods are to apply correctly. While none of these errors could have been caught by reading the papers, our model info sheets enable the detection of leakage in each case.

A comparison of reported and corrected results in the 4 civil war prediction papers, published in top Political Science journals, that compare the performance of ML models and Logistic Regression models. The main findings of each of these papers are invalid due to various forms of data leakage: Muchlinski et al. impute the training and test data together, Colaresi & Mahmood and Wang incorrectly reuse an imputed dataset, and Kaufman et al. use proxies for the target variable, which causes data leakage. The use of model info sheets would detect leakage in every paper. When we correct these errors, complex ML models (such as AdaBoost and Random Forests) do not perform substantively better than decades-old Logistic Regression models for civil war prediction in each case. Each column in the table outlines the impact of leakage on the results of a paper.
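The imputation error described above can be made concrete with a small sketch in Python. It uses mean imputation purely as a stand-in (an assumption for illustration, not the imputation method used in the papers), and the helper names `fit_mean` and `impute` and the toy values are hypothetical. The point is only that the fill value must be fitted on the training split alone.

```python
from statistics import mean


def fit_mean(values):
    """Mean of the observed (non-None) values."""
    return mean(v for v in values if v is not None)


def impute(values, fill):
    """Replace missing entries (None) with a previously fitted fill value."""
    return [fill if v is None else v for v in values]


train = [1.0, None, 3.0]
test = [None, 11.0]

# Leaky: the fill value is computed from training and test data together,
# so the observed test value (11.0) influences the imputed training data.
leaky_fill = fit_mean(train + test)  # (1 + 3 + 11) / 3 = 5.0

# Clean: fit on the training split only, then apply to both splits.
clean_fill = fit_mean(train)  # (1 + 3) / 2 = 2.0
train_imputed = impute(train, clean_fill)
test_imputed = impute(test, clean_fill)

print(leaky_fill, clean_fill)       # 5.0 2.0
print(train_imputed, test_imputed)  # [1.0, 2.0, 3.0] [2.0, 11.0]
```

With joint imputation, the training data would carry information about the test set before evaluation, which is the lack of clean separation that the first category of the taxonomy describes.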

Reproduction materials on CodeOcean

List of papers in our systematic review

A note on the term reproducibility crisis

We acknowledge that there isn't consensus about the term reproducibility, and there have been a number of recent attempts to define the term and create consensus. One possible definition is computational reproducibility: the results in a paper can be replicated using the exact code and dataset provided by the authors. We argue that this definition is too narrow, because a paper whose code contains outright bugs would still count as reproducible under it as long as rerunning that code yields the same results. Therefore, we advocate for a standard where bugs and other errors in data analysis that change or challenge a paper's findings constitute irreproducibility. We elaborate on this perspective here.

Reproducibility failures don’t mean a claim is wrong, just that the evidence presented falls short of the accepted standard or that the claim only holds in a narrower set of circumstances than asserted. We don’t view reproducibility failures as signs that individual authors or teams are careless, and we don’t think any researcher is immune. One of us (Narayanan) has had multiple such failures in his applied-ML work and expects that it will probably happen again.

We call it a crisis for two related reasons. First, reproducibility failures in ML-based science are systemic. In nearly every scientific field that has carried out a systematic study of reproducibility issues, papers are plagued by common pitfalls. In many systematic reviews, a majority of the papers reviewed suffer from these pitfalls. Second, despite the urgency of addressing reproducibility failures, there aren’t yet any systemic solutions.

Citation

To cite this work, please use this BibTeX entry.

About us

This is a project by Sayash Kapoor and Arvind Narayanan. We are researchers in the Department of Computer Science and the Center for Information Technology Policy at Princeton University.

Our interest in this topic arose during a graduate seminar on Limits to Prediction. Narayanan offered this course together with Prof. Matthew Salganik in Fall 2020, and Kapoor took the course. The course aimed to critically examine the narrative about the ability to predict the future with ever-increasing accuracy given bigger datasets and more powerful algorithms. The work on reproducibility pitfalls is one aspect of our broader interest in limits to prediction.
