Navigating the challenges in creating complex data
systems: a development philosophy
Sören Dittmer1,6,+,*, Michael Roberts1,+,*, Julian Gilbey1, Ander Biguri1, AIX-COVNET
Collaboration2, Jacobus Preller3, James H.F. Rudd4, John A.D. Aston5, and
Carola-Bibiane Schönlieb1
1Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
2A list of authors and their affiliations appears at the end of the paper
3Addenbrooke’s Hospital, Cambridge University Hospitals NHS Trust, Cambridge, UK
4Department of Medicine, University of Cambridge, Cambridge, UK
5Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge, UK
6ZeTeM, University of Bremen, Bremen, Germany
+these authors contributed equally to this work
*corresponding authors: sd870@cam.ac.uk and mr808@cam.ac.uk
ABSTRACT
In this perspective, we argue that despite the democratization of powerful tools for data science and machine learning over
the last decade, developing the code for a trustworthy and effective data science system (DSS) is getting harder. Perverse
incentives and a lack of widespread software engineering (SE) skills are among many root causes we identify that naturally
give rise to the current systemic crisis in the reproducibility of DSSs. We analyze why SE, and the building of large
complex systems in general, is hard. Based on these insights, we identify how SE addresses those difficulties and how we
can apply and generalize SE methods to construct DSSs that are fit for purpose. We advocate two key development
philosophies, namely that one should incrementally grow – not biphasically plan and build – DSSs, and one should always
employ two types of feedback loops during development: one that tests the code’s correctness and another that evaluates its efficacy.
Machine learning is in a reproducibility crisis [Hai+20; Pin+21; Bak16]. We argue that a primary driver is poor code
quality, with two root causes: weak incentives to produce good code and a widespread lack of Software Engineering
(SE) skills. The crisis also demonstrates that Data Science Systems (DSSs) can, and will, fail silently if no continual
verification infrastructure exists throughout their development [Kar19; AW19; Bha+19].
We consider two questions important to all data scientists.
Firstly, why is it so hard to build complex systems? Here we
blame the intrinsic fragility and sheer number of components
in modern DSSs; this holds for both the code and the data involved.
Therefore, we reason that the development of DSSs must follow
Gall’s law – one cannot build complex systems; one has
to grow them [Gal75]. Note that this aligns well with agile
development in modern SE. Secondly, we ask, how can we
write trustworthy and effective DSSs? We argue that having
two types of feedback loops in place is a critical necessity:
one to assess the DSS’s correctness and another to assess its
efficacy.
In fact, we believe the crisis to be a natural corollary of an
environment in which data scientists develop complex DSSs
without growing them or establishing a careful and continual
assessment of their code’s correctness. While incorrect
code can be computationally reproducible – i.e., rerunning
the code produces identical results – correctness is crucial
for replicability and for general reproducibility using independent
implementations. Perhaps even more crucially, reusability –
standing on the shoulders of giants – demands correctness;
without it, every downstream task inherits the lack of
correctness and reproducibility. The sketch below illustrates how
reproducibility alone falls short of correctness.
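
To make this concrete, consider a minimal Python sketch (our own illustration, assuming numpy and scikit-learn are available; the data and all names are hypothetical, not from any real study). The pipeline is fully seeded, so every rerun yields bit-identical output – it is computationally reproducible – yet it contains a classic preprocessing-leakage bug, so its evaluation is incorrect:

```python
# A deliberately flawed pipeline: computationally reproducible, yet incorrect.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)          # fixed seed: reruns give identical numbers
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# BUG: scaling statistics are computed on the *full* dataset, so information
# about the test set leaks into preprocessing before the split is made.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

# Bit-identical on every rerun (computational reproducibility) -- but the
# evaluation procedure is wrong, so the number should not be trusted.
print(model.score(X_te, y_te))
```

Publishing such code lets anyone reproduce the number exactly; only an independent check of correctness would reveal that the number should not be trusted.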
The problem: a Cambrian explosion
Until relatively recently, statisticians had a monopoly on data
analysis. They were, and are, highly trained to appreciate the
intricate relationships and biases in data and to use relatively
simple methods (in the best sense of the word) to analyze the
data and fit models to it. Data collection was often done under
their guidance to ensure biases were understood, documented,
and mitigated.
Nowadays, data is ubiquitous and often claimed to be
the new oil. However, real-world datasets more often resemble
an oil spill, containing a plethora of unknown (and
often unknowable) biases. Without sufficient statistics and
SE skills, the development of a DSS tends to lead to the
following chain of implications:
Big Data ⇒ Messy Data ⇒ Big Code ⇒ Messy Code ⇒ Incorrect Conclusions
The radical increase in the scale and availability of data has
led to an equally radical paradigm shift in its use. Data scientists
build complex systems on top of complex, biased,
and generally incomprehensible data. To do this, they are
the consumers of many more software tools than classical
statisticians. As users of many tools, they naturally need to
know how to interface with each tool, while understanding
every tool’s internal workings becomes less feasible. Hence the underlying
software must be trustworthy; one has to assume it is almost
bug-free, with any remaining bugs being insignificant to any
conclusions.
Expressing and structuring an analysis plan in code is the
bedrock of all data science projects, and due to these many
tools, modern data scientists must write increasing amounts
of custom ‘glue code’ when developing DSSs. However, SE
is a challenging discipline, and building on vast unfamiliar
codebases often leads to unexpected consequences; the sketch
below illustrates how easily implicit assumptions hide in glue code.
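
As a hedged illustration (the function and its checks are our own hypothetical example, not a recommendation from any specific library): a few lines of glue between pandas and scikit-learn, where defensive assertions turn the implicit interface assumptions into explicit, testable ones.

```python
# Glue code between two libraries; the assertions make explicit the assumptions
# that the downstream tool would otherwise make silently about its input.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def fit_on_frame(df: pd.DataFrame, feature_cols: list[str], target_col: str):
    # Check the assumptions before crossing the tool boundary, and fail loudly.
    missing = set(feature_cols + [target_col]) - set(df.columns)
    assert not missing, f"columns missing from input frame: {missing}"
    assert not df[feature_cols].isna().any().any(), "unexpected NaNs in features"

    X = df[feature_cols].to_numpy()   # fixes the column order explicitly
    y = df[target_col].to_numpy()
    return RandomForestClassifier(random_state=0).fit(X, y)
```

The point is not these particular checks but that the boundary between tools is exactly where silent failures breed.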
From the perspective of both data and algorithms, this
paradigm shift resembles a Cambrian explosion in the quantity
and intrinsic complexity of data and code.
Why is the problem challenging?
In this section, we discuss some significant challenges that
data scientists face when developing a correct and effective
DSS. Some of these challenges stem from human nature,
whereas others are technical.
Challenge 1: Missing SE skills
Most data scientists only learn to write small codebases,
whereas SE focuses on working with large codebases. As
mentioned above, code is the interface to many data science
tools, and SE is the discipline of organizing interfaces
methodically. For this paper, we define SE as the discipline
of managing the complexity of code and data, with interfaces
as one of its primary tools [Par72]. While many SE
practices focus on enterprise software and do not trivially
apply to all components of DSSs, it is our conviction that
SE methodologies must play a more prominent role in future
data science projects. The sketch below shows what this
interface-first mindset can look like in practice.
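
A minimal sketch of such an interface in a DSS codebase (the Protocol and all names are our own illustration, not a prescription): the modeling code depends only on a narrow, explicit contract, so a data source can be swapped or mocked without touching anything downstream.

```python
# The modeling code depends only on a narrow, explicit interface;
# data sources can be exchanged or mocked without touching it.
from typing import Protocol
import numpy as np

class DataSource(Protocol):
    def load(self) -> tuple[np.ndarray, np.ndarray]:
        """Return (features, labels) -- the only contract the rest of the code sees."""
        ...

def train(source: DataSource) -> np.ndarray:
    X, y = source.load()             # relies on the contract, not the implementation
    return X[y == 1].mean(axis=0)    # stand-in for real modeling code

class CsvSource:                     # one interchangeable implementation
    def __init__(self, path: str):
        self.path = path

    def load(self) -> tuple[np.ndarray, np.ndarray]:
        data = np.loadtxt(self.path, delimiter=",")
        return data[:, :-1], data[:, -1]
```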
Challenge 2: Correctness and efficacy
A DSS must work correctly, i.e., do what you think it
does. It must also be efficacious, i.e., produce relevant and
usable predictions. Without SE, following the earlier arguments,
development tends to produce the following chain of implications:

Multiple Experiments ⇒ Messy Code ⇒ Incorrect Conclusions
So why do we truly need correctness and efficacy for a trustworthy,
high-performing model? Firstly, as mentioned,
published, executable code can provide computational reproducibility,
but replicability requires correctness. Secondly,
while an incorrect DSS can be efficacious due to a lucky bug,
it is uninterpretable and hard to modify. Without correctness,
it is impossible to understand, interpret, or trust the
outputs of, and conclusions based on, a DSS. See Figure 1
for a visualization of why we need both correctness and efficacy;
the sketch below makes the two feedback loops concrete.
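
A hedged Python sketch of the two loops (the split into a pytest-style correctness test and a held-out efficacy check is our illustration of the idea, not the paper’s implementation; all names are hypothetical):

```python
import numpy as np

# Feedback loop 1 -- correctness: does the code do what we think it does?
# Small, deterministic tests run on every change (e.g., via pytest in CI).
def dice_score(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice coefficient of two binary masks."""
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

def test_dice_score_known_values():
    mask = np.array([1, 1, 0, 0], dtype=bool)
    assert dice_score(mask, mask) == 1.0      # identical masks -> perfect overlap
    assert dice_score(mask, ~mask) == 0.0     # disjoint masks -> no overlap

# Feedback loop 2 -- efficacy: are the outputs actually useful?
# Evaluated on held-out data against a meaningful baseline and tracked over time.
def efficacy_report(model, X_val, y_val, baseline_score: float) -> dict:
    score = model.score(X_val, y_val)         # e.g., any scikit-learn-style model
    return {"val_score": score, "beats_baseline": score > baseline_score}
```

The first loop catches code that does not do what we think it does; the second catches code that does exactly what we asked for but produces nothing useful. Neither loop can substitute for the other.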
Challenge 3: Perverse incentives in academia
Software engineers, industrial data scientists, and academic
data scientists produce different products within wildly different
incentive structures. Software engineers are rewarded
for creating high-performing, well-documented, and reusable
codebases; data scientists are rewarded based on their DSS
outputs. Like software engineers, industrial data scientists are
rewarded based on the system’s usefulness to the company.
Academic data scientists, however, aim to use their results to
write marketable papers to further their field, apply for grants,
and enhance their reputation.
For academia, there is a conflict between short-term and
long-term incentives.
Academic careers are peripatetic in nature, and most positions
for early-career researchers – who tend to be the ones developing
DSSs – are temporary. Therefore, in the short term it is
rewarding to publish papers quickly and give less attention
to the reusability of the codebase, as careful, reusable development
leads to delayed gratification. The short-term academic
incentive structure might even discourage producing and publishing
code comprehensible to a broad audience, to avoid
getting ‘scooped’ by competitors.
In the long term, however, a clear incentive to develop
reusable DSSs is that doing so increases the probability that the
paper will become influential and well cited. For example,
if two similar papers are published but only one provides
good code, it is almost certain that future papers will compare
directly to that one. Over time, this will dramatically (and
multiplicatively) separate the popularity of the two papers. Failing
to incentivize this, however, leads to enormous value
destruction for society.
However, the grant system provides one potential mechanism
to encourage the realization of these long-term incentives:
proposals involving the development of DSSs could be required
to include resources for the construction of reusable and
deployable codebases.
Interestingly, this is not a new phenomenon;
Knuth [Knu84] discussed it in the 1980s when he was
advised to publish TeX’s source code. However, if a field’s
incentive structure and goals are misaligned – see, e.g., the
positive publication bias [BW19; ZC21] – the path of least
resistance easily gains the upper hand.
Challenge 4: Short-circuits
The democratization of powerful data analysis and machine
learning tools allows for short-circuits, as keen amateurs can
develop complex DSSs relatively quickly. This is not to say
that using powerful publicly available tools or taking short-circuits
is inherently bad. On the contrary, if every practitioner were
writing private versions of common toolkits, this would be a
major source of bugs.
However, powerful tools reduce the accidental complexity,
not the intrinsic complexity, of DSSs. Thus, they make