The radical increase in the scale and availability of data has
led to an equally radical paradigm shift in its use. Data sci-
entists build complex systems on top of complex, biased,
and generally incomprehensible data. To do this, they consume
many more software tools than classical statisticians. For a
user of many tools, it is naturally more vital to know how to
interface with each tool than to understand its internal
workings, which becomes ever less feasible. Hence the underlying
software must be trustworthy; one has to assume it is almost
bug-free, with any remaining bugs being insignificant to any
conclusions.
Expressing and structuring an analysis plan in code is the
bedrock of every data science project, and because of this
multitude of tools, modern data scientists must write increasing
amounts of custom ‘glue code’ when developing DSSs. However, SE
is a challenging discipline, and building on vast unfamiliar
codebases often leads to unexpected consequences.
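As a purely hypothetical illustration of such glue code, consider the
following Python sketch, which wires data through pandas and
scikit-learn; the synthetic data, column names, and model choice are
assumptions made for exposition only and are not part of any particular
DSS.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for loading messy external data (here: synthetic data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"feature_a": rng.normal(size=500),
                   "feature_b": rng.normal(size=500)})
df["label"] = (df["feature_a"] + 0.5 * df["feature_b"] > 0).astype(int)

# Glue: hand-written plumbing between the data format and the library APIs.
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_a", "feature_b"]], df["label"], random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

Even this tiny script interleaves three libraries, and every line of
plumbing is a place where a misunderstood interface can silently corrupt
the analysis.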
From the perspective of both data and algorithms, this paradigm shift
resembles a Cambrian explosion in the quantity and intrinsic complexity
of data and code.
Why is the problem challenging?
In this section, we want to discuss some significant challenges
data scientists face when developing a correct and effective
DSS. Some of these challenges stem from human nature,
whereas others are technical.
Challenge 1: Missing SE skills
Most data scientists only learn to write small codebases,
whereas SE focuses on working with large codebases. As
mentioned above, code is the interface to many data science
tools, and SE is the discipline of organizing interfaces me-
thodically. For this paper, we define SE as the discipline of managing
the complexity of code and data, with interfaces as one of its primary
tools [Par72]. While many SE
practices focus on enterprise software and do not trivially
apply to all components of DSSs, it is our conviction that
SE methodologies must play a more prominent role in future
data science projects.
Challenge 2: Correctness and efficacy
A DSS must work correctly, i.e., it does what you think it
does. It also must be efficacious, i.e., produce relevant and
usable predictions. Without SE, and following the arguments above,
development tends to follow this chain of implications:
Multiple Experiments ⇒ Messy Code ⇒ Incorrect Conclusions
So why do we truly need correctness and efficacy for a trustworthy,
high-performing model? Firstly, as mentioned, published, executable
code can provide computational reproducibility, but repeatability
requires correctness. Secondly,
while an incorrect DSS can be efficacious due to a lucky bug,
it is uninterpretable and hard to modify.
Without correctness, it is impossible to understand, interpret, or trust
the outputs of, and conclusions based on, a DSS. See Figure 1
for a visualization of why we need correctness and efficacy.
Challenge 3: Perverse incentives in academia
Software engineers, industrial data scientists, and academic
data scientists produce different products within wildly dif-
ferent incentive structures. Software engineers are rewarded
for creating high-performing, well-documented, and reusable
codebases; data scientists are rewarded based on their DSS
outputs. Like software engineers, industrial data scientists are
rewarded based on the system’s usefulness to the company.
Academic data scientists, however, aim to use their results to
write marketable papers to further their field, apply for grants,
and enhance their reputation.
In academia, there is a conflict between short-term and
long-term incentives.
Academic careers are peripatetic in nature, and most positions for
early-career researchers, who tend to be those developing DSSs, are
temporary. Therefore, in the short term it is rewarding to publish
papers quickly and to give less attention to the reusability of the
codebase, as careful, reusable development leads only to delayed
gratification. The short-term academic incentive structure might even
discourage producing and publishing code comprehensible to a broad
audience, so as to avoid getting ‘scooped’ by competitors.
In the long term, however, a clear incentive to develop
reusable DSSs is that this increases the probability that the
paper will become influential and be well cited. For example,
if two similar papers are published, but only one provides
good code, it is almost certain that future papers will compare
directly to this one. Over time, this will dramatically (and
multiplicatively) separate the popularity of the two papers. Failing
to incentivise this, however, leads to enormous value destruction for
society.
However, the grant system offers one potential mechanism to counter
these perverse incentives and to encourage the realisation of the
long-term ones: requiring that proposals involving the development of
DSSs include resources for the construction of reusable and deployable
codebases.
Interestingly, this is not a new phenomenon;
Knuth [Knu84] discussed it in the 1980s when he was
advised to publish TeX’s source code. However, if a field’s
incentive structure and goals are misaligned (see, e.g., the
positive publication bias [BW19; ZC21]), the path of least
resistance easily wins out.
Challenge 4: Short-circuits
The democratization of powerful data analysis and machine
learning tools allows for short-circuits, as keen amateurs can
develop complex DSSs relatively quickly. This is not to say
that using powerful, publicly available tools or taking short-circuits
is inherently bad. On the contrary, if every practitioner were
writing private versions of common toolkits, this would be a
major source of bugs.
However, powerful tools reduce the accidental complexity,
not the intrinsic complexity, of a DSS. Thus, they make