The radical increase in the scale and availability of data has
led to an equally radical paradigm shift in its use. Data sci-
entists build complex systems on top of complex, biased,
and generally incomprehensible data. To do this, they consume
many more software tools than classical statisticians. For a
user of many tools, it is naturally more vital to know how to
interface with each tool than to understand its internal
workings, which becomes ever less feasible. Hence the underlying
software must be trustworthy; one has to assume it is almost
bug-free, with any remaining bugs being insignificant to any
conclusions.
Expressing and structuring an analysis plan in code is the
bedrock of every data science project, and because of this
multitude of tools, modern data scientists must write increasing
amounts of custom ‘glue code’ when developing DSSs. However, SE
is a challenging discipline, and building on vast unfamiliar
codebases often leads to unexpected consequences.
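As a purely hypothetical illustration of such glue code, consider the
following Python sketch, which wires data through pandas and
scikit-learn; the synthetic data, column names, and model choice are
assumptions made for exposition only and are not part of any particular
DSS.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for loading messy external data (here: synthetic data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"feature_a": rng.normal(size=500),
                   "feature_b": rng.normal(size=500)})
df["label"] = (df["feature_a"] + 0.5 * df["feature_b"] > 0).astype(int)

# Glue: hand-written plumbing between the data format and the library APIs.
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_a", "feature_b"]], df["label"], random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

Even this tiny script interleaves three libraries, and every line of
plumbing is a place where a misunderstood interface can silently corrupt
the analysis.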
From the perspective of both data and algorithms, this paradigm shift
resembles a Cambrian explosion in the quantity and intrinsic complexity
of data and code.
Why is the problem challenging?
In this section, we want to discuss some significant challenges
data scientists face when developing a correct and effective
DSS. Some of these challenges stem from human nature,
whereas others are technical.
Challenge 1: Missing SE skills
Most data scientists only learn to write small codebases,
whereas SE focuses on working with large codebases. As
mentioned above, code is the interface to many data science
tools, and SE is the discipline of organizing interfaces me-
thodically. For this paper, we define SE as the discipline of managing
the complexity of code and data, with interfaces as one of its primary
tools [Par72]. While many SE
practices focus on enterprise software and do not trivially
apply to all components of DSSs, it is our conviction that
SE methodologies must play a more prominent role in future
data science projects.
Challenge 2: Correctness and efficacy
A DSS must work correctly, i.e., it does what you think it
does. It also must be efficacious, i.e., produce relevant and
usable predictions. Without SE, and following the arguments above,
development tends to follow this chain of implications:
Multiple Experiments ⇒ Messy Code ⇒ Incorrect Conclusions
So why do we truly need correctness and efficacy for a trustworthy,
high-performing model? Firstly, as mentioned, published, executable
code can provide computational reproducibility, but repeatability
requires correctness. Secondly,
while an incorrect DSS can be efficacious due to a lucky bug,
it is uninterpretable and hard to modify.
Without correctness, it is impossible to understand, interpret, or trust
the outputs of, and conclusions based on, a DSS. See Figure 1
for a visualization of why we need correctness and efficacy.
Challenge 3: Perverse incentives in academia
Software engineers, industrial data scientists, and academic
data scientists produce different products within wildly dif-
ferent incentive structures. Software engineers are rewarded
for creating high-performing, well-documented, and reusable
codebases; data scientists are rewarded based on their DSS
outputs. Like software engineers, industrial data scientists are
rewarded based on the system’s usefulness to the company.
Academic data scientists, however, aim to use their results to
write marketable papers to further their field, apply for grants,
and enhance their reputation.
In academia, there is a conflict between short-term and
long-term incentives.
Academic careers are peripatetic in nature, and most positions for
early-career researchers, who tend to be those developing DSSs, are
temporary. Therefore, in the short term it is rewarding to publish
papers quickly and to give less attention to the reusability of the
codebase, as careful, reusable development leads only to delayed
gratification. The short-term academic incentive structure might even
discourage producing and publishing code comprehensible to a broad
audience, so as to avoid getting ‘scooped’ by competitors.
In the long term, however, a clear incentive to develop
reusable DSSs is that this increases the probability that the
paper will become influential and be well cited. For example,
if two similar papers are published, but only one provides
good code, it is almost certain that future papers will compare
directly to this one. Over time, this will dramatically (and
multiplicatively) separate the popularity of the two papers. Failing
to incentivise this, however, leads to enormous value destruction for
society.
However, the grant system offers one potential mechanism to counter
these perverse incentives and to encourage the realisation of the
long-term ones: requiring that proposals involving the development of
DSSs include resources for the construction of reusable and deployable
codebases.
Interestingly, this is not a new phenomenon;
Knuth [Knu84] discussed it in the 1980s when he was
advised to publish TeX’s source code. However, if a field’s
incentive structure and goals are misaligned (see, e.g., the
positive publication bias [BW19; ZC21]), the path of least
resistance easily wins out.
Challenge 4: Short-circuits
The democratization of powerful data analysis and machine
learning tools allows for short-circuits, as keen amateurs can
develop complex DSSs relatively quickly. This is not to say
that using powerful, publicly available tools or taking short-circuits
is inherently bad. On the contrary, if every practitioner were
writing private versions of common toolkits, this would be a
major source of bugs.
However, powerful tools reduce the accidental complexity,
not the intrinsic complexity, of a DSS. Thus, they make