Albert J. Menkveld

My two cents

Reproducible Science

Technology makes almost everything reproducible these days. The voice of my father is reproducible. A scammer could call me with a machine that converts his voice to my father’s in real time: “My credit card fails and I need to pay this driver, can you please give me your credit card details?”

This example of reproducibility might come as a surprise to you, but, surely, you can trust scientific research to be reproducible, no? Increasingly, online repositories hold the code and the data that produced the results in published papers. This is a reassuring trend, but does running this code on this data produce the same results on your computer? This type of reproducibility is known as computational reproducibility.
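
To make the question concrete, here is a minimal sketch in Python of what such a check boils down to: rerun the code, collect the numbers it produces, and compare them with the published ones. Everything in it is an assumption made for illustration. The function name compare_results, the result names, the tolerance, and the numbers are hypothetical and do not describe how any verifier actually works.

```python
import math


def compare_results(published, reproduced, rel_tol=1e-6):
    """Return the names of results that are missing or differ beyond rel_tol."""
    mismatches = []
    for name, published_value in published.items():
        reproduced_value = reproduced.get(name)
        # A result fails if the rerun did not produce it at all,
        # or if it deviates from the published number beyond the tolerance.
        if reproduced_value is None or not math.isclose(
            published_value, reproduced_value, rel_tol=rel_tol
        ):
            mismatches.append(name)
    return mismatches


# Made-up numbers for illustration only; not taken from any paper.
published = {"coef_x": 0.1234, "std_err_x": 0.0456}
reproduced = {"coef_x": 0.1234, "std_err_x": 0.0461}

mismatches = compare_results(published, reproduced)
print("reproduced" if not mismatches else f"mismatch in: {mismatches}")
```

The tolerance is a judgment call in this sketch: a strict one flags harmless rounding differences across machines, while a loose one can hide a genuine discrepancy.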

How much of the research in Economics is computationally reproducible? About two-thirds. Verifiers at Cascad specialize in computationally reproducing published research for, among others, the journals of the American Economic Association (AEA) and the Review of Economic Studies (RES). These verifiers are unable to reproduce the findings of about one-third of the papers. They did the same for #fincap, an experiment in which 166 teams independently tested the same hypotheses on the same data. The teams uploaded their code, which was made available to Cascad. Cascad’s analysis of #fincap reproducibility yields results that are similar to those for AEA/RES.

Failure rate AEA/RES	Failure rate #fincap experiment
30.3%	29.1%

When do results fail to reproduce?

The relative merit of the #fincap experiment is that the environment is controlled. Research teams were all running the same tests on the same data. Any variation in reproducibility across teams must, therefore, be due to how each team executed the task. This is why the experiment can yield insight into when results fail to reproduce.

Regressions with #fincap reproducibility as the dependent variable yield several significant results. Reproducibility tends to fail for:

Code lacking loops or matrix operations
Complex code
Poorly documented code
Outlier results

While these covariates are statistically significant, others are not. There is no evidence for a gender effect, a geographic effect, or a choice-of-software effect. (Although the difference is insignificant, I was pleased to see above-average reproducibility of Python code.)

These results and others are available in a working paper that we released recently (link below). I believe the paper gives empiricists a lot to think about. Perhaps most useful is the section that discusses implications for researchers and scientific journals. It includes guidelines and references a reproducibility kit that might be useful for your next project.

Reproducible Science should be a pleonasm, just like “white snow.”

The paper is here: Computational Reproducibility in Finance: Evidence from 1,000 Tests.