No, you still cannot predict hit songs using machine learning

Sayash Kapoor and Arvind Narayanan
August 16, 2023

A June 2023 paper claimed that machine learning can predict hit songs with 97% accuracy. News outlets including Scientific American and Axios published pieces about how this "frightening accuracy" could revolutionize the music industry. Earlier studies have found that it is hard to predict in advance whether a song will be successful, so this paper seemed to be a dramatic achievement.

Unfortunately, we found that the study's results are bogus.

The model presented in the paper exhibits one of the most common pitfalls in machine learning: data leakage. Roughly, this means the model is evaluated on the same, or very similar, data as it was trained on, which inflates accuracy estimates. In the real world, the model would perform far worse. It is like teaching to the test, or worse, handing out the answers before the exam takes place.

The authors used data about how 33 listeners reacted to 24 songs. Their initial dataset consisted of just 24 samples, one for each song. For each song, the model relied on only three features to predict whether it would be a hit, with the feature values averaged across listeners. From this dataset, they created a synthetic (fake) dataset of 10,000 samples through a process called oversampling. One of the main requirements when testing an ML model is that the data it is trained on must be entirely separate from the data it is evaluated on. The crucial error in this paper is that this train-test split was performed after the data had already been oversampled. As a result, the training and test data were far more similar to each other than, say, to a new dataset containing other songs. In other words, the paper provides no evidence of how well the model would perform on new songs.
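To see why the order of operations matters, here is a minimal sketch. This is our own illustration, not the paper's code, and it uses simple duplication-based oversampling rather than whatever method the paper used: 24 songs with three features and randomly assigned labels, so there is genuinely nothing to learn. A simple 1-nearest-neighbour classifier nonetheless looks highly accurate when the split happens after oversampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# 24 songs, 3 features each; labels are assigned at random,
# so no model should beat chance on new songs
X = rng.normal(size=(24, 3))
y = rng.integers(0, 2, size=24)

def knn_accuracy(X_tr, y_tr, X_te, y_te):
    """Accuracy of a 1-nearest-neighbour classifier."""
    dists = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=-1)
    return float((y_tr[dists.argmin(axis=1)] == y_te).mean())

# WRONG: oversample first (duplicate songs at random), then split.
# Copies of the same song land on both sides of the split.
idx = rng.integers(0, 24, size=2_000)
X_big, y_big = X[idx], y[idx]
acc_leaky = knn_accuracy(X_big[:1_600], y_big[:1_600],
                         X_big[1_600:], y_big[1_600:])

# RIGHT: split the 24 songs first, then oversample the training side only.
idx_tr = rng.integers(0, 19, size=1_600)
acc_clean = knn_accuracy(X[:19][idx_tr], y[:19][idx_tr], X[19:], y[19:])

print(f"leaky accuracy: {acc_leaky:.2f}")  # close to 1.0
print(f"clean accuracy: {acc_clean:.2f}")  # no better than chance in expectation
```

The inflated number comes entirely from the split order: with the leaky split, almost every test row has an exact duplicate in the training set, so the classifier simply looks up the answer.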

When we corrected this error and evaluated the model on the original data released by the authors, its accuracy was little better than random. We also found that on the authors' synthetic dataset, it is in fact possible to achieve 100% accuracy. This shouldn't be surprising: with such heavy oversampling, the original data can likely be reconstructed from either the train or the test split alone. In other words, the authors were training and testing on essentially the same data.
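The reconstruction point can be checked directly. In the sketch below (again our own illustration, using duplication-based oversampling rather than the paper's exact method), 24 songs are oversampled to 10,000 rows and then split 80/20; essentially every test row turns out to be a bit-for-bit copy of some training row.

```python
import numpy as np

rng = np.random.default_rng(1)

X = rng.normal(size=(24, 3))                 # the 24 original songs
rows = X[rng.integers(0, 24, size=10_000)]   # oversampled to 10,000 rows
train, test = rows[:8_000], rows[8_000:]     # 80/20 split, done too late

# Fraction of test rows that also appear, exactly, in the training rows
train_rows = {tuple(r) for r in train}
dup_frac = sum(tuple(r) in train_rows for r in test) / len(test)
print(f"test rows duplicated in training: {dup_frac:.0%}")
```

With 8,000 training rows drawn from only 24 originals, the chance that any song is missing from the training side is vanishingly small, so the overlap is effectively total.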

Similar errors have been uncovered in many other fields. For example, Vandewiele et al. found that the same error—oversampling a dataset before splitting it into the training and test sets—led to vastly exaggerated accuracy numbers in various studies that used AI for predicting the risk of a preterm birth. In our past research, we have found that leakage affects hundreds of papers across over a dozen fields.

The authors also obfuscate the description of their analysis, making the leakage hard to spot when reading the paper. They used cardiac sensors to infer neural states, yielding only three features per song. But their abstract and introduction mention only that they used "neural measures", suggesting that the data is far more complex and signal-rich than it is. In fact, the words "neural" and "brain" are collectively mentioned 70 times in the paper; the terms "heart" and "cardiac" are collectively mentioned only five times. Further, the authors did not release the code needed to reproduce their analysis. The REFORMS checklist could help make such omissions more obvious and obfuscation harder.