Irreproducibility in Machine Learning

We seek to document the reproducibility of applied Machine Learning research. We aim to do systematic, methodologically critical reviews of research in fields adopting ML methods, and use in-depth code review to investigate their reproducibility.

Draft paper: (Ir)Reproducible Machine Learning: A Case Study

Princeton University

Center for Information Technology Policy

Hypotheses and perspectives that motivate this project

Many quantitative science fields are adopting the paradigm of predictive modeling using machine learning. We welcome this development. At the same time, as researchers whose interests include the strengths and limits of machine learning, we have concerns about reproducibility and overoptimism. There are many reasons for caution: performance evaluation is notoriously tricky in machine learning; ML code tends to be complex and as yet lacks standardization; subtle pitfalls arise from the differences between explanatory and predictive modeling; finally, the hype and overoptimism about commercial AI may spill over into applied machine learning research. All these, of course, are in addition to the pressures and publication biases present in all disciplines that have led to reproducibility crises.

Indeed, we found that systematic reviews have exposed reproducibility issues in many applied-ML fields. Motivated by this, we seek to undertake more systematic reviews, and we have released our first study focusing on civil war prediction in political science.

Reproducibility failures don’t mean a claim is wrong, only that the evidence presented falls short of the accepted standard or that the claim holds in a narrower set of circumstances than asserted. We don’t view reproducibility failures as signs that individual authors or teams are careless, and we don’t think any researcher is immune. One of us (Narayanan) has had multiple such failures in his applied-ML work and expects that it will probably happen again.

In fact, we view reproducibility difficulties as the expected state of affairs given the complexities of the new paradigm of prediction-for-understanding. We should see frequent reproducibility failures as the norm until best practices become better established and understood. Thus, the spate of reproducibility failures we have compiled highlights the immaturity of applied-ML research, the critical need for ongoing work on methods and best practices, and the importance of treating the results from this body of work with caution.

We recognize that there is substantial inconsistency in the use of the term reproducibility, justify our choice of the term below, and welcome feedback on this point. Regardless of terminology, it is clear that there have been exaggerated claims of predictive performance of machine learning in many scientific fields. One goal of our project is to understand what systemic interventions might be most effective. We provide tentative suggestions in the Discussion section of our draft paper.

(Ir)Reproducible Machine Learning: A Case Study

Sayash Kapoor, Arvind Narayanan
We find that prominent studies on civil war prediction claiming superior performance of ML models over baseline Logistic Regression models fail to reproduce. Our results provide two reasons to be skeptical of the use of ML methods in this research area: they call the usefulness of these methods into question, and they highlight how difficult it is to apply them correctly.


Figure 1. A comparison of reported results vs. corrected results in the 4 papers on civil war prediction that compare the performance of ML models and Logistic Regression models. The main findings of each of these papers are invalid due to methodological pitfalls: Muchlinski et al. impute the training and test data together, Colaresi and Mahmood as well as Wang incorrectly reuse an imputed dataset, and Kaufman et al. use proxies for the target variable which causes data leakage. When we correct these errors, ML models do not perform substantively better than Logistic Regression models for civil war prediction in each case. The metric for Kaufman et al. is accuracy; for all other papers, it is AUC.

Note: The reproduction materials are zipped into a 1.3 GB zip file containing our reproductions of all papers with reproducibility issues. The large size of the reproduction materials is due to the inclusion of all models that were created during our reproduction in the zip file for quicker reproductions. In case the size of the zip file is an issue for you, please get in touch at sayashk AT

Draft paper      Supplement      Reproduction materials

To cite this work, please use this BibTeX entry.

Why do we call these reproducibility issues?

We acknowledge that there isn't consensus about the term reproducibility, and there have been a number of recent attempts to define the term and create consensus. One possible definition is computational reproducibility — when the results in a paper can be replicated using the exact code and dataset provided by the authors. We argue that this definition is too narrow because even cases of outright bugs in the code would not be considered irreproducible under this definition. Therefore we advocate for a standard where bugs and other errors in data analysis that change or challenge a paper's findings constitute irreproducibility.

What constitutes an error?

The goal of predictive modeling is to estimate (and improve) the accuracy of predictions that one might make in a real-world scenario. This is true regardless of the specific research question one wishes to study by building a predictive model. In practice one sets up the data analysis to mimic this real-world scenario as closely as possible. There are limits to how well we can do this and consequently there is always methodological debate on some issues, but there are also some clear rules. If an analysis choice can be shown to lead to incorrect estimates of predictive accuracy, there is usually consensus in the ML community that it is an error. For example, violating the train-test split (or the learn-predict separation) is an error because the test set is intended to provide an accurate estimate of 'out-of-sample' performance — model performance on a dataset that was not used for training. Thus, to define what is an error, we look to this consensus in the ML community (e.g. in textbooks) and offer our own arguments when necessary.

Explanatory vs. predictive modeling

It's important to bear in mind that the goals of explanatory modeling and predictive modeling differ, and a valid methodological choice for explanatory modeling can be an error in a predictive modeling setting. For example, imputing the entire dataset together may be acceptable in explanatory modeling but it is a clear violation of the train-test split in predictive modeling. We created a simulated example to show how data leakage affects performance evaluation when we impute the training and test sets together. We describe the simulation below:

  • The dataset consists of two variables — the target variable y and the independent variable x.

  • y is a binary variable.

  • x is drawn from a normal distribution whose mean depends on y: x = y + ε, where ε ~ N(0, 1).

  • We generate 1000 samples with y=0 and 1000 samples with y=1 to create the dataset.

  • We randomly split the data into training (50%) and test (50%) sets, train a random forest model on the training set, and evaluate it on the test set.

  • To observe the impact of imputing the training and test sets together, we delete a certain percentage of the values of x and impute them using the method used in some of the papers we review in our study: imputing the training and test datasets together.

  • We vary the proportion of missing values from 0% to 95% in increments of 5% and plot the accuracy of the random forest classifier on the test set.

  • We run the entire process 100 times and report the mean and 95% CI of the accuracy in Figure 2.
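The steps above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the code behind the study: a 1-D nearest-neighbour classifier stands in for the random forest, the leaky imputer is a hypothetical label-aware class-mean imputer fitted on the pooled data (the page does not specify the imputation method), values are deleted independently with a fixed probability, and far fewer repetitions are run. All function names are made up for this example.

```python
import bisect
import random
import statistics


def make_data(n_per_class, rng):
    """Rows (x, y) with x = y + N(0, 1), following the simulation description."""
    return [(rng.gauss(0.0, 1.0) + y, y) for y in (0, 1) for _ in range(n_per_class)]


def impute_pooled_by_class(rows):
    """ASSUMPTION: label-aware class-mean imputer fitted on train and test pooled."""
    means = {}
    for label in (0, 1):
        observed = [x for x, y in rows if y == label and x is not None]
        means[label] = statistics.fmean(observed) if observed else 0.0
    return [(means[y] if x is None else x, y) for x, y in rows]


def fit_1nn(train):
    """1-D nearest-neighbour model (stand-in for a random forest).

    Duplicate x values are merged and labelled by majority vote.
    """
    votes = {}
    for x, y in train:
        votes.setdefault(x, []).append(y)
    xs = sorted(votes)
    labels = [1 if 2 * sum(votes[x]) > len(votes[x]) else 0 for x in xs]
    return xs, labels


def predict_1nn(model, x):
    """Return the label of the training point closest to x."""
    xs, labels = model
    i = bisect.bisect_left(xs, x)
    if i == 0:
        return labels[0]
    if i == len(xs):
        return labels[-1]
    return labels[i] if xs[i] - x < x - xs[i - 1] else labels[i - 1]


def run_once(missing_frac, leaky, rng, n_per_class=1000):
    """Return test accuracy for one simulated dataset."""
    rows = make_data(n_per_class, rng)
    rng.shuffle(rows)
    # Delete each x independently with probability missing_frac.
    rows = [(None if rng.random() < missing_frac else x, y) for x, y in rows]
    half = len(rows) // 2
    if leaky:
        # Pitfall: impute before splitting, so test labels reach the imputer.
        filled = impute_pooled_by_class(rows)
        train, test = filled[:half], filled[half:]
    else:
        # Learn-predict separation: the fill value comes from training x only.
        train, test = rows[:half], rows[half:]
        observed = [x for x, _ in train if x is not None]
        fill = statistics.fmean(observed) if observed else 0.0
        train = [(fill if x is None else x, y) for x, y in train]
        test = [(fill if x is None else x, y) for x, y in test]
    model = fit_1nn(train)
    return sum(predict_1nn(model, x) == y for x, y in test) / len(test)


if __name__ == "__main__":
    rng = random.Random(0)
    for frac in (0.0, 0.5, 0.95):
        leaky = statistics.fmean([run_once(frac, True, rng) for _ in range(5)])
        clean = statistics.fmean([run_once(frac, False, rng) for _ in range(5)])
        print(f"missing={frac:.2f}  leaky={leaky:.3f}  separated={clean:.3f}")
```

In this sketch the leaky pipeline's test accuracy climbs as more values are missing, because every imputed value encodes its own label, while the pipeline that fits the imputer on the training set alone shows no such gain.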

We find that imputing the training and test sets together inflates the purportedly “out-of-sample” accuracy of the model, and that the inflation grows as the percentage of missing values increases. Estimates of model performance in this case are artificially high: with no missing data, model accuracy is around 60%; with 95% of values missing, it rises above 95%.

Figure 2. Results of a simulation showing how imputing the training and test sets together leads to overoptimistic estimates of model performance. The 95% CI is too small to be seen.

A running list of reproducibility failures and overoptimistic claims in applied ML research

The list below consists of papers (especially systematic reviews) that highlight reproducibility failures or pitfalls in applied ML research. We distinguish applied ML research, where the goal is to use ML methods to study some scientific question, from ML research, where the goal is to develop new ML methods, for example the typical NeurIPS paper. We are interested in the former.

Field | Paper | Year | Num. papers reviewed | Num. papers w/ pitfalls | Pitfalls
----- | ----- | ---- | -------------------- | ----------------------- | --------
Medicine | Bouwmeester et al. | 2012 | 71 | 27 | No train-test split
Neuroimaging | Whelan et al. | 2014 | | 14 | No train-test split; Feature selection on train and test set
Autism Diagnostics | Bone et al. | 2015 | | 3 | Duplicates across train-test split; Sampling bias
Bioinformatics | Blagus et al. | 2015 | | 6 | Pre-processing on train and test sets together
Nutrition research | Ivanescu et al. | 2016 | | 4 | No train-test split
Software engineering | Tu et al. | 2018 | 58 | 11 | Temporal leakage
Toxicology | Alves et al. | 2019 | | 1 | Duplicates across train-test split
Satellite imaging | Nalepa et al. | 2019 | 17 | 17 | Non-independence between train and test sets
Clinical epidemiology | Christodoulou et al. | 2019 | 71 | 48 | Feature selection on train and test set
Tractography | Poulin et al. | 2019 | 4 | 2 | No train-test split
Brain-computer interfaces | Nakanishi et al. | 2020 | | 1 | No train-test split
Histopathology | Oner et al. | 2020 | | 1 | Non-independence between train and test sets
Computer security | Arp et al. | 2020 | 30 | 30 | No train-test split; Pre-processing on train and test sets together; Illegitimate features; others
Neuropsychiatry | Poldrack et al. | 2020 | 100 | 53 | No train-test split; Pre-processing on train and test sets together
Medicine | Vandewiele et al. | 2021 | 24 | 21 | Feature selection on train and test sets; Non-independence between train and test sets; Sampling bias
Radiology | Roberts et al. | 2021 | 62 | 62 | No train-test split; Duplicates in train and test sets; Sampling bias
IT Operations | Lyu et al. | 2021 | 9 | 3 | Temporal leakage
Medicine | Filho et al. | 2021 | | 1 | Illegitimate features
Neuropsychiatry | Shim et al. | 2021 | | 1 | Feature selection on training and test sets
Genomics | Barnett et al. | 2022 | 41 | 23 | Feature selection on training and test sets
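
Two of the pitfalls that recur in the list, duplicates across the train-test split and temporal leakage, can be checked mechanically. The sketch below is a hypothetical illustration and is not drawn from any of the listed papers; the function names, the record layout (timestamp first), and the toy data are assumptions made for this example.

```python
from datetime import date


def duplicated_across_split(train_rows, test_rows):
    """Return the test rows that also appear verbatim in the training set.

    Rows must be hashable (for example, tuples of feature values and label).
    """
    seen = set(train_rows)
    return [row for row in test_rows if row in seen]


def temporal_split(records, cutoff):
    """Split records so that every training record strictly precedes the cutoff.

    Each record is assumed to be a tuple whose first element is its timestamp.
    Splitting by time, instead of at random, keeps future information out of
    the training set.
    """
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test


if __name__ == "__main__":
    # Toy rows: (feature, label). One row is shared between the two sets.
    train_rows = [("a", 1), ("b", 0)]
    test_rows = [("b", 0), ("c", 1)]
    print("duplicates:", duplicated_across_split(train_rows, test_rows))

    # Toy records: (timestamp, feature, label).
    records = [
        (date(2020, 1, 1), "x", 0),
        (date(2020, 9, 1), "y", 1),
        (date(2021, 1, 1), "z", 1),
    ]
    train, test = temporal_split(records, cutoff=date(2020, 6, 1))
    print("train:", train)
    print("test:", test)
```

Exact-match duplicate detection is the simplest possible check; near-duplicates and non-independent samples (for example, several records from the same patient) need domain-specific grouping that this sketch does not attempt.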

About us

This is a project by Sayash Kapoor and Arvind Narayanan. We are researchers in the Department of Computer Science and the Center for Information Technology Policy at Princeton University.

Our interest in this topic arose during a graduate seminar on Limits to Prediction. Narayanan offered this course together with Prof. Matthew Salganik in Fall 2020, and Kapoor took the course. The course aimed to critically examine the narrative about the ability to predict the future with ever-increasing accuracy given bigger datasets and more powerful algorithms. The work on reproducibility pitfalls is one aspect of our broader interest in limits to prediction.