
they had similar issues and were ultimately able to receive data and code from about 35% of their ≈200 queries.
Even when data was provided, the authors were able to reproduce the scientific results from only ≈60%. In
cases where data was not shared, the reasons varied from institutional/ethical restrictions to outright refusal
as their “code was not written with an eye toward distributing for other people to use.” (Ref. [5], p. 2585).
This can create a slew of problems. Trisovic et al. [6] recently demonstrated that only ≈25% of code released
alongside research papers could be run without error. This reflects a view of computation held by many
scientists: it's an exercise in personal research, never intended to be used by someone else. This pulls me back to
that fateful night in preparing for my candidacy exam. Not only did I write that code without an eye towards
sharing with others, I didn’t even write it for my future self.
Recent years have seen a flurry of excellent papers outlining best practices for reproducible research, spanning
from scientific programming guidelines [7–10], to general and specialized data annotation [11,12], to instructions
for bundling entire projects as “reproducible packages” [13], and I encourage the reader to give them a look.
However, I take a different approach in this essay and give my perspective as a practicing biologist who thinks
about how to maximize reproducibility alongside designing, executing, and analyzing experiments.
Data as modern scientific currency
I view research as a journey with the generation, manipulation, visualization, and interpretation of data as
the overarching themes. Here, I take a very general definition of “data” to mean “a collection of qualitative or
quantitative facts” such that results from simulations, mathematical analysis, and bench-top experiments are
treated equivalently as data-generating processes. While we often remark that the “data speak for themselves”,
this is never truly the case. Not only do you give the data their voices, you give them the language they speak.
Reproducibility requires a Rosetta stone such that anyone can perform the translation and come to the same
results.
Consider the “typical” cycle of science as depicted in Figure 1. Beginning with hypotheses, experiments are
designed to thoroughly test and falsify them¹, resulting in the generation of new data. These data, whether
they come from tangible or computational experiments, often need to be manipulated through processing,
cleaning, and analysis pipelines before they can be truly understood. In all cases, these data must be visualized
in a way that lets the experimenter use their expertise and logical creativity to interpret the results, allowing
conclusions to be drawn and the hypothesis to be confirmed, refuted, or refined. In the modern scientific
enterprise, each of these steps requires a combination of instructions that are physical and targeted to humans
(protocols, observations, notes, etc.) and digital records that are computer-readable (code, instrument settings,
accession numbers, etc.). For this process to be reproducible, the instructions for each step must be
meticulously kept and clearly documented. With enough care, these instructions come together to
serve as your Rosetta stone.
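As a concrete illustration of what such a computer-readable record might look like, the short Python sketch below writes a small metadata file alongside the outputs of one analysis step, capturing instrument settings, accession numbers, and the version of the code that was run. The field names, file paths, and example values here are hypothetical placeholders of my own choosing, not a prescription from this essay; the point is only that such a record can be generated automatically at the moment the analysis runs.

import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def write_run_record(output_dir, instrument_settings, accession_numbers):
    """Save a machine-readable record of one analysis step next to its outputs.

    All field names and values are illustrative; adapt them to whatever your
    own experiment or pipeline actually needs in order to be rerun.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "instrument_settings": instrument_settings,  # e.g. exposure time, magnification
        "accession_numbers": accession_numbers,      # e.g. public data deposits used
    }
    # Record the exact version of the analysis code, if run inside a git repository.
    try:
        record["code_version"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        record["code_version"] = "unknown (not run from a git repository)"

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "run_record.json").write_text(json.dumps(record, indent=2))

# Hypothetical usage for a single imaging experiment:
write_run_record(
    "processed/2021-03-15_growth_curves",
    instrument_settings={"exposure_ms": 100, "magnification": "100x"},
    accession_numbers=["PRJNA000000"],
)

A record like this is cheap to produce, lives next to the data it describes, and can be read by both a human and a script, which is exactly the dual role the instructions above must play.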
Philosophical pillars for reproducibility
“Making your research reproducible” is easier to say than to do. Through my years of experience in prioritizing
reproducibility in my own work, I’ve found four key principles to be critical to performing my research in a
reproducible manner [Figure 2(A)]. While the detailed structure or the questions I pose may not be appropriate
for your particular project or experiment, the philosophy behind them will likely still apply. This allows you to build
a tailor-made reproducible workflow from the ground up in a way that others can follow.
¹ In exploratory research, experiments are designed to properly collect data from which hypotheses will be drawn. In meta-analyses,
the “experiments” may be the collection of data from previously published papers or other resources. In either case, the cycle shown in
Figure 1 still applies.