Fostering better coding practices for data scientists_2

2025-04-27 0 0 589.33KB 27 页 10玖币
侵权投诉
Fostering better coding practices for data scientists
Randall Pruim, Maria-Cristiana Gîrjău,, Nicholas J. Horton,*
Calvin University
Amherst College
Columbia University
*nhorton@amherst.edu
Abstract.
Many data science students and practitioners don’t see the value in making
time to learn and adopt good coding practices as long as the code “works”. However, code
standards are an important part of modern data science practice, and they play an essential
role in the development of data acumen. Good coding practices lead to more reliable code
and save more time than they cost, making them important even for beginners. We believe
that principled coding is vital for quality data science practice. To effectively instill these
practices within academic programs, instructors and programs need to begin establishing
these practices early, to reinforce them often, and to hold themselves to a higher standard
while guiding students. We describe key aspects of good coding practices for data science,
illustrating with examples in R and in Python, though similar standards are applicable to
other software environments. Practical coding guidelines are organized into a top ten list.
Keywords: data acumen, data science, data science practice, data science education, code
quality, code style
This article is ©2023 by author(s) as listed above. The article is licensed under a Creative Commons
Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode),
except where otherwise indicated with respect to particular material included in the article. The article
should be attributed to the author(s) identified above.
Media summary
Many data science students and practitioners are reluctant to adopt good coding practices as long
as the code “works”. Yet meticulous attention to detail is an important characteristic of a data
scientist.
Code standards are an important part of modern data science, and they play an essential role in
ensuring the quality of data science in research and in the workforce. Responsible coding practices
lead to more reliable code and save more time than they cost, making them important even for
beginners. We believe that principled coding is vital for quality data science practice.
To effectively instill these habits of mind within academic programs, instructors and programs need
to begin establishing these practices early, to reinforce them often, and to hold themselves to a
higher standard while guiding students. We describe key aspects of good coding for data science,
arXiv:2210.03991v2 [stat.CO] 25 Aug 2023
Fostering better coding practices for data scientists 2
illustrating them with examples and motivation. Practical coding guidelines are organized into a
top ten list.
1. Introduction
Coding is an increasingly important part of statistical analyses (Deboran Nolan and Temple Lang
2010; Hardin et al. 2021). The goal of code is not just to solve an immediate analysis problem but
also to establish reusable workflows and to communicate. As projects and analyses become more
sophisticated, it’s important that structures and expectations be set to facilitate data science as a
team sport in a sustainable and reproducible way (Horton et al. 2022).
Consensus reports (National Academies of Science, Engineering, and Medicine 2018) have highlighted
the importance of workflow and reproducibility as a component of data science practice:
“Modern data science has at its core the creation of workflows—pipelines of processes
that combine simpler tools to solve larger tasks. Documenting, incrementally
improving, sharing, and generalizing such workflows are an important part of data
science practice owing to the team nature of data science and broader significance
of scientific reproducibility and replicability. Documenting and sharing workflows
enable others to understand how data have been used and refined and what steps
were taken in an analysis process. This can increase the confidence in results and
improve trust in the process as well as enable reuse of analyses or results in a
meaningful way (page 2-12).
Adhering to established coding standards is an important step in fostering effective analyses and an
important component of data acumen. Unfortunately, attention to these issues has not been central
to many curricula in statistics and data science.
1.1. Why bother? Many data science students, and some practitioners, do not make time to learn
and adopt good coding practices. As long as the code “works”, they are satisfied and ready to move
on. But how do they know that “it works”? For data analysts, these issues are an important part of
data science practice, because it affects the bottom line:
Good coding practices lead to more reliable and maintainable code.
It is easier to notice and fix errors in well written code. And it is less likely that the errors occur in
the first place if the authors are using good practices. Trisovic et al. (2022) carried out a large-scale
study on code quality and found that 74% of R files did not run successfully. After incorporating
some automated code cleaning targeting “some of the most common execution errors”, 56% still
failed.
Even when the code is correct, following good coding practices makes the code easier to read and
understand, saving time and promoting good communication among team members.
In a blog post, Lyman (2021) wrote:
The value of high-quality code can be difficult to communicate. Some managers
see it as a boondoggle, an expensive hobby for overly fastidious programmers, since
investing in code quality can slow development over the short term and doesn’t
appear to alter the user experience. But nothing could be further from the truth.
Fostering better coding practices for data scientists 3
Learning and consistently using good coding practices takes some effort and some attention. But
in the end, we agree that they are likely to save far more time than they consume (Ball et al.
2022). Well written code is more likely to be correct, saving the time of redoing things, and easier
to maintain, saving effort when it becomes necessary to modify or adapt the code in the future.
Collaborations with other team members are likely to be more positive if the quality of individual
contributions is higher.
We are convinced that
Good coding practice is important, even for beginners.
It is easier and more efficient to learn good coding practices as one learns to program than to unlearn
bad habits later. This makes it especially important that the code beginners see meets the highest
standards for coding practice. We can’t expect beginners to mimic these practices perfectly from
the start, and we recommend focusing student attention (and feedback) on just a few key coding
practices early on. But if they don’t have a good model to emulate, we are impeding their progress
unnecessarily. As an additional benefit, modeling good coding practices will make it easier for
students (and others) to learn not only good coding technique but also the concepts and applications
that the code is illustrating.
In this paper, we will motivate the importance of principled coding, illustrate key aspects of good
coding practices, and suggest ways that these practices can be included in the data science and
statistics curriculum.
1.2. Prior work. We acknowledge that much of what we discuss is not novel, but it is nonetheless
important (and, we argue, under-appreciated and under-emphasized).
Many calls for better coding practices and enumerations of such practices exist. Computer science
curricula have long emphasized these practices beginning in introductory programming courses and
continuing throughout the curriculum (Keuning, Heeren, and Jeuring 2017; Borstler et al. 2017),
especially in courses like software engineering or in capstone projects courses (e.g., Berkeley’s CS169,
https://bcourses.berkeley.edu/courses/1507976). Stegeman et al. (2014) and (2016) have described
rubrics and assessment for code quality in programming courses.
The importance of good coding practices is also recognized in industry (“Google Style Guide”
2019; Ghani 2022) and across the sciences (Wilson et al. 2017; Aruliah et al. 2012; Filazzola and
Lortie 2022) and social sciences (Gentzkow and Shapiro 2022). Dogucu and Çetinkaya-Rundel
(2022) motivates the importance of code quality, style guides, file organization, and related topics.
Related work by Carey and Papin (2018) that describes rules for new programmers has relevance
for teaching data analysis. Deborah Nolan and Stoudt (2021) offer a “Dirty Dozen” set of helpful
code recommendations, and Abouzekry (2012) provides ten tips for better coding.
Code quality has been an area where some previous research has been undertaken. Schulte (2008)
introduced a block model to help study comprehension of program components (atoms, blocks,
relations, and macrostructure). Keuning, Heeren, and Jeuring (2017) and Keuning, Heeren, and
Jeuring (2019) have explored other aspects of teaching code quality.
While the particular coding practices enumerated vary some by author, programming language, and
application area, the overall message is clear: Good coding practices are important across a wide
Fostering better coding practices for data scientists 4
range of contexts to ensure that people, especially those working in teams, are productive and that
their work is reliable, maintainable, and reproducible. Furthermore, there is broad agreement about
the basic contours of what constitutes good coding practice. Unfortunately, the abundance of such
calls indicates that practice continues to fall short of principle.
1.3. A Motivating Example. An April 2021 twitter post (Meyer 2021) commented:
“It is really painful when taking a graduate level data science course and the
instructor’s code is considerably below any acceptable standard in the real world.
Here is some real life code from a demo offered for the current homework. . .
data <- read.csv(fname)
train <- data[0:(length(data[,1])-8),]
train.ts <- ts(train[,c(2,3,4,5)],,start=c(2014,1),freq=52)
test <- data[(length(data[,1])-7):length(data[,1]),]
We concur that this code, while short, is exceptionally hard to read. There are many ways that this
could be improved, some of which are demonstrated below.
# R
n_test <- 8
data <- readr::read_csv(fname) 1
train <- head(data, -n_test)
test <- tail(data, n_test)
train_ts <- ts(
dplyr::select(train, 2:5), 2
start = c(2014,1),
freq = 52)
1:
Alternatively we could load the entire
readr
package with
library(readr)
and avoid the
::
.
For demonstration code, the explicit package reference can help the reader know which
functions come from which packages. When using packages that are already familiar to the
reader or when using many functions from the same package, loading with
library()
is
more appropriate. We will demonstrate both styles in the various examples presented here.
2:
Ideally we would use column names rather than numerical indices here.
dplyr::select()
would be more useful in that case, especially in conjunction with functions like
matches()
,
contains(),begins_with(), etc.
The suggested revisions add white space to improve readability and clarify the type of subsetting
that is happening by taking advantage of the
head()
and
tail()
functions. In addition to improved
formatting, we specify that the
select()
function is coming from the
dplyr
package.
1
We note
that there are still aspects of the code that are brittle (e.g., assumptions regarding the ordering of
the five variables in the dataset). The code snippet in the tweet does not provide enough context
to appropriately address these. Using the native pipe in R, we could rearrange the nested tasks,
1
We could also have loaded the
dplyr
package with
library(dplyr)
, and we would have done so had we
used several functions from that package.
Fostering better coding practices for data scientists 5
making the order of operations more transparent. We will see examples of this shortly, and of a
similar approach called method chaining (see Augspurger 2016), in Python.
The revised code is easier to read and understand (and maintain) than the original.
We believe that examples of this kind are all too common. This particular example is compelling
because it shows that students notice the quality of the code instructors present and motivates why
academics need to live up to industry standards.
2. Establishing Good Coding Practices
2.1. The Four C’s. As mentioned in the introduction, many lists of good coding practices have
been published. These lists can provide useful guidance as one progresses — or, in the case of an
instructor, leads students in their progression — toward better coding practices.
In addition to providing students with specific coding guidelines like our top ten list below, we think
it is also important that students understand the higher level goals that the specific guidelines are
intended to support. These higher level goals take into account that computer code simultaneously
communicates both to humans and to the computer and provide a framework for establishing a set
of specific coding practices. We describe these as the four C’s for good code:
Correctness: It is important that the code be correct so that the computer does what is
intended. This, in itself, is not profound. But we emphasize two things about correctness:
First, correctness is a necessary but insufficient metric for good code. Second, the other
goals support and promote correctness.
Clarity: It is also important that the code be clear, so that humans reading and writing
the code can tell what it is intended to do, and easily make modifications as necessary.
(This advice applies both to other humans and to the same human at some later date.)
Containment: It is helpful if the code is appropriately contained, to keep separate things
that should be separate and together things that should be together. Vartanian (2022) refers
to this idea as “low coupling, high cohesion. Other authors refer to this as modularization.
Proper containment includes things like preferring a data frame over several individual
vectors, using functions to contain reusable code, and keeping code used across files or
projects in a module or package.
Consistency: Finally, it is useful if code exhibits internal consistency of style, naming
conventions, and other coding practices.
The specific guidelines outlined below serve as concrete advice for developing practices that promote
creating code that satisfies the 4 C’s.
2.2. A Progressive Approach. Before revealing our top ten list, we want to emphasize the
importance of taking a progressive approach, both for oneself and for students. Developing good
coding practices takes time and attention. Any list of coding guidelines can exceed the available
cognitive resources to take them on. As is true for any behavioral intervention, change takes time
and effort.
摘要:

FosteringbettercodingpracticesfordatascientistsRandallPruim†,Maria-CristianaGîrjău‡,⋄,NicholasJ.Horton‡,*†CalvinUniversity‡AmherstCollege⋄ColumbiaUniversity*nhorton@amherst.eduAbstract.Manydatasciencestudentsandpractitionersdon’tseethevalueinmakingtimetolearnandadoptgoodcodingpracticesaslongasthecod...

展开>> 收起<<
Fostering better coding practices for data scientists_2.pdf

共27页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:27 页 大小:589.33KB 格式:PDF 时间:2025-04-27

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 27
客服
关注