We seek to document the reproducibility of applied Machine Learning research. We aim to do systematic, methodologically critical reviews of research in fields adopting ML methods, and use in-depth code review to investigate their reproducibility.
Draft paper: (Ir)Reproducible Machine Learning: A Case Study
Many quantitative science fields are adopting the paradigm of predictive modeling using machine learning. We welcome this development. At the same time, as researchers whose interests include the strengths and limits of machine learning, we have concerns about reproducibility and overoptimism. There are many reasons for caution: performance evaluation is notoriously tricky in machine learning; ML code tends to be complex and as yet lacks standardization; subtle pitfalls arise from the differences between explanatory and predictive modeling; finally, the hype and overoptimism about commercial AI may spill over into applied machine learning research. All these, of course, are in addition to the pressures and publication biases present in all disciplines that have led to reproducibility crises.
Indeed, we found that systematic reviews have exposed reproducibility issues in many applied-ML fields. Motivated by this, we seek to undertake more systematic reviews, and we have released our first study focusing on civil war prediction in political science.
Reproducibility failures don’t mean a claim is wrong, just that the evidence presented falls short of the accepted standard or that the claim only holds in a narrower set of circumstances than asserted. We don’t view reproducibility failures as signs that individual authors or teams are careless, and we don’t think any researcher is immune. One of us (Narayanan) has had multiple such failures in his applied-ML work and expects that it will probably happen again.
In fact, we view reproducibility difficulties as the expected state of affairs given the complexities of the new paradigm of prediction-for-understanding. We should see frequent reproducibility failures as the norm until best practices become better established and understood. Thus, the spate of reproducibility failures we have compiled highlights the immaturity of applied-ML research, the critical need for ongoing work on methods and best practices, and the importance of treating the results from this body of work with caution.
We recognize that there is substantial inconsistency in the use of the term reproducibility, justify our choice of the term below, and welcome feedback on this point. Regardless of terminology, it is clear that there have been exaggerated claims of predictive performance of machine learning in many scientific fields. One goal of our project is to understand what systemic interventions might be most effective. We provide tentative suggestions in the Discussion section of our draft paper.
Sayash Kapoor, Arvind Narayanan
We find that prominent studies on civil war prediction claiming superior performance of ML models over baseline logistic regression models fail to reproduce. Our results provide two reasons to be skeptical of the use of ML methods in this research area: they call the usefulness of these methods into question, and they highlight how difficult it is to apply them correctly.
Note: The reproduction materials are provided as a 1.3 GB zip file containing our reproductions of all papers with reproducibility issues. The file is large because it includes every model created during our reproduction, which makes re-running the reproductions quicker. If the size of the zip file is an issue for you, please get in touch at sayashk AT princeton.edu.
Draft paper, Supplement, Reproduction materials
We acknowledge that there isn't consensus about the term reproducibility, and there have been a number of recent attempts to define the term and create consensus. One possible definition is computational reproducibility — when the results in a paper can be replicated using the exact code and dataset provided by the authors. We argue that this definition is too narrow because even cases of outright bugs in the code would not be considered irreproducible under this definition. Therefore we advocate for a standard where bugs and other errors in data analysis that change or challenge a paper's findings constitute irreproducibility.
What constitutes an error?
The goal of predictive modeling is to estimate (and improve) the accuracy of predictions that one might make in a real-world scenario. This is true regardless of the specific research question one wishes to study by building a predictive model. In practice one sets up the data analysis to mimic this real-world scenario as closely as possible. There are limits to how well we can do this and consequently there is always methodological debate on some issues, but there are also some clear rules. If an analysis choice can be shown to lead to incorrect estimates of predictive accuracy, there is usually consensus in the ML community that it is an error. For example, violating the train-test split (or the learn-predict separation) is an error because the test set is intended to provide an accurate estimate of 'out-of-sample' performance — model performance on a dataset that was not used for training. Thus, to define what is an error, we look to this consensus in the ML community (e.g. in textbooks) and offer our own arguments when necessary.
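As a minimal illustration of why the train-test split matters, the sketch below fits a 1-nearest-neighbor classifier to pure noise: the labels are random, so no model can genuinely predict them. It is a hypothetical example, not taken from any paper discussed here, and all names in it (`nearest_label`, `accuracy`, the sample sizes) are our own assumptions. Evaluated on its own training data, the model looks perfect; evaluated on held-out data, it should do no better than chance.

```python
import random


def nearest_label(train, x):
    """Predict the label of the training point whose feature is closest to x."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(train, rows):
    """Fraction of rows whose label matches the 1-nearest-neighbor prediction."""
    return sum(nearest_label(train, x) == y for x, y in rows) / len(rows)


rng = random.Random(0)
# Hypothetical data: a uniform feature and a coin-flip label, so x carries no signal.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(400)]
train, test = data[:200], data[200:]

in_sample = accuracy(train, train)      # each point is its own nearest neighbor
out_of_sample = accuracy(train, test)   # honest estimate on unseen data

print(f"in-sample accuracy:     {in_sample:.2f}")
print(f"out-of-sample accuracy: {out_of_sample:.2f}")
```

The gap between the two numbers is entirely an artifact of evaluating on data the model has already seen, which is why only the held-out figure estimates real-world performance.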
Explanatory vs. predictive modeling
It's important to bear in mind that the goals of explanatory modeling and predictive modeling differ, and a valid methodological choice for explanatory modeling can be an error in a predictive modeling setting. For example, imputing the entire dataset together may be acceptable in explanatory modeling but it is a clear violation of the train-test split in predictive modeling. We created a simulated example to show how data leakage affects performance evaluation when we impute the training and test sets together. We describe the simulation below:
The dataset consists of two variables — the target variable y and the independent variable x.
y is a binary variable.
x is drawn from a normal distribution and depends on y as x = N(0,1) + y
We generate 1000 samples with y=0 and 1000 samples with y=1 to create the dataset.
We randomly split the data into training (50%) and test (50%) sets, and create a random forest model that is trained on the training set and evaluated on the test set.
In order to observe the impact of imputing the training and test sets together, we delete a certain percentage of the values of x and impute them using the method used in some of the papers we review in our study: imputing the training and test datasets together.
We vary the proportion of missing values from 0% to 95% in increments of 5% and plot the accuracy of the random forests classifier on the test set.
We run the entire process 100 times and report the mean and 95% CI of the accuracy in Figure 2.
We find that imputing the training and test sets together inflates the purportedly “out-of-sample” accuracy of the model, and the inflation grows as the percentage of missing values increases. Estimates of model performance in this case are artificially high: when no data was missing, model accuracy was around 60%; with 95% of values missing, accuracy rises to >95%.
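The simulation described above can be approximated with the following sketch, which uses only the Python standard library. It is not the code behind Figure 2 and rests on several assumptions of ours: the text does not name the imputation method, so we assume a label-aware imputer (missing x is replaced by the mean of observed x among rows with the same y, computed over the pooled training and test data); a single-threshold classifier stands in for the random forest, so the baseline accuracy will differ from the figure quoted above; and each setting is run once rather than 100 times. All function names are hypothetical. For contrast, the sketch also includes a label-free imputer fitted on the training set only.

```python
import random
import statistics


def make_data(n_per_class, rng):
    """Return shuffled (x, y) rows with y in {0, 1} and x = N(0, 1) + y."""
    data = [(rng.gauss(0.0, 1.0) + y, y) for y in (0, 1) for _ in range(n_per_class)]
    rng.shuffle(data)
    return data


def mean_or_zero(values):
    """Mean of the values, or 0.0 if nothing was observed."""
    return statistics.fmean(values) if values else 0.0


def fill(rows, value_for):
    """Replace each missing x (None) with value_for(y)."""
    return [(value_for(y) if x is None else x, y) for x, y in rows]


def impute_leaky(train, test):
    """ASSUMED imputer: class-conditional means over train and test pooled.
    Test rows are filled using their own labels, which leaks y into x."""
    pooled = train + test
    means = {c: mean_or_zero([x for x, y in pooled if y == c and x is not None])
             for c in (0, 1)}
    return fill(train, lambda y: means[y]), fill(test, lambda y: means[y])


def impute_clean(train, test):
    """Label-free mean imputation, fitted on the training set only."""
    overall = mean_or_zero([x for x, _ in train if x is not None])
    return fill(train, lambda _y: overall), fill(test, lambda _y: overall)


def fit_stump(train):
    """Threshold t maximizing training accuracy of the rule 'predict 1 if x > t'.
    This stump is a stand-in for the random forest used in the text."""
    rows = sorted(train)
    correct = sum(y for _, y in rows)  # t = -inf predicts 1 for every row
    best_t, best_correct = float("-inf"), correct
    for i, (x, y) in enumerate(rows):
        correct += 1 if y == 0 else -1  # raising t to x flips this row to 'predict 0'
        last_of_tie = i + 1 == len(rows) or rows[i + 1][0] != x
        if last_of_tie and correct > best_correct:
            best_t, best_correct = x, correct
    return best_t


def accuracy(threshold, rows):
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)


def run_once(missing_frac, seed=0, n_per_class=1000):
    """Test accuracy under both imputation schemes for one random dataset."""
    rng = random.Random(seed)
    data = make_data(n_per_class, rng)
    data = [(None if rng.random() < missing_frac else x, y) for x, y in data]
    half = len(data) // 2
    train, test = data[:half], data[half:]  # rows are shuffled, so this is a random 50/50 split
    results = {}
    for name, impute in (("leaky", impute_leaky), ("clean", impute_clean)):
        imputed_train, imputed_test = impute(train, test)
        results[name] = accuracy(fit_stump(imputed_train), imputed_test)
    return results


if __name__ == "__main__":
    for frac in (0.0, 0.5, 0.95):
        print(frac, run_once(frac))
```

With no missing values the two schemes see identical data and agree. As the missing fraction grows, the leaky estimate should climb toward 1 because most test values of x now encode the label, while the clean estimate should fall toward chance, since a feature that is mostly missing genuinely carries little information.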
The list below consists of papers (especially systematic reviews) that highlight reproducibility failures or pitfalls in applied ML research. We distinguish applied ML research, where the goal is to use ML methods to study some scientific question, from ML research, where the goal is to develop new ML methods, for example the typical NeurIPS paper. We are interested in the former.
| Field | Paper | Year | Num. papers reviewed | Num. papers with pitfalls | Pitfalls |
|---|---|---|---|---|---|
| Medicine | Bouwmeester et al. | 2012 | 71 | 27 | No train-test split |
| Neuroimaging | Whelan et al. | 2014 | — | 14 | No train-test split; Feature selection on train and test set |
| Autism Diagnostics | Bone et al. | 2015 | — | 3 | Duplicates across train-test split; Sampling bias |
| Bioinformatics | Blagus et al. | 2015 | — | 6 | Pre-processing on train and test sets together |
| Nutrition research | Ivanescu et al. | 2016 | — | 4 | No train-test split |
| Software engineering | Tu et al. | 2018 | 58 | 11 | Temporal leakage |
| Toxicology | Alves et al. | 2019 | — | 1 | Duplicates across train-test split |
| Satellite imaging | Nalepa et al. | 2019 | 17 | 17 | Non-independence between train and test sets |
| Clinical epidemiology | Christodoulou et al. | 2019 | 71 | 48 | Feature selection on train and test set |
| Tractography | Poulin et al. | 2019 | 4 | 2 | No train-test split |
| Brain-computer interfaces | Nakanishi et al. | 2020 | — | 1 | No train-test split |
| Histopathology | Oner et al. | 2020 | — | 1 | Non-independence between train and test sets |
| Computer security | Arp et al. | 2020 | 30 | 30 | No train-test split; Pre-processing on train and test sets together; Illegitimate features; others |
| Neuropsychiatry | Poldrack et al. | 2020 | 100 | 53 | No train-test split; pre-processing on train and test sets together |
| Medicine | Vandewiele et al. | 2021 | 24 | 21 | Feature selection on train-test sets; Non-independence between train and test sets; Sampling bias |
| Radiology | Roberts et al. | 2021 | 62 | 62 | No train-test split; duplicates in train and test sets; sampling bias |
| IT Operations | Lyu et al. | 2021 | 9 | 3 | Temporal leakage |
| Medicine | Filho et al. | 2021 | — | 1 | Illegitimate features |
| Neuropsychiatry | Shim et al. | 2021 | — | 1 | Feature selection on training and test sets |
| Genomics | Barnett et al. | 2022 | 41 | 23 | Feature selection on training and test sets |