Fostering better coding practices for data scientists_2

2025-04-27 0 0 589.33KB 27 页 10玖币

侵权投诉

Fostering better coding practices for data scientists

Randall Pruim†, Maria-Cristiana Gîrjău‡,⋄, Nicholas J. Horton‡,*

†Calvin University

‡Amherst College

⋄Columbia University

*nhorton@amherst.edu

Abstract.

Many data science students and practitioners don’t see the value in making

time to learn and adopt good coding practices as long as the code “works”. However, code

standards are an important part of modern data science practice, and they play an essential

role in the development of data acumen. Good coding practices lead to more reliable code

and save more time than they cost, making them important even for beginners. We believe

that principled coding is vital for quality data science practice. To eﬀectively instill these

practices within academic programs, instructors and programs need to begin establishing

these practices early, to reinforce them often, and to hold themselves to a higher standard

while guiding students. We describe key aspects of good coding practices for data science,

illustrating with examples in R and in Python, though similar standards are applicable to

other software environments. Practical coding guidelines are organized into a top ten list.

Keywords: data acumen, data science, data science practice, data science education, code

quality, code style

Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode),

except where otherwise indicated with respect to particular material included in the article. The article

should be attributed to the author(s) identiﬁed above.

Media summary

Many data science students and practitioners are reluctant to adopt good coding practices as long

as the code “works”. Yet meticulous attention to detail is an important characteristic of a data

scientist.

Code standards are an important part of modern data science, and they play an essential role in

ensuring the quality of data science in research and in the workforce. Responsible coding practices

lead to more reliable code and save more time than they cost, making them important even for

beginners. We believe that principled coding is vital for quality data science practice.

To eﬀectively instill these habits of mind within academic programs, instructors and programs need

to begin establishing these practices early, to reinforce them often, and to hold themselves to a

higher standard while guiding students. We describe key aspects of good coding for data science,

arXiv:2210.03991v2 [stat.CO] 25 Aug 2023

Fostering better coding practices for data scientists 2

illustrating them with examples and motivation. Practical coding guidelines are organized into a

top ten list.

1. Introduction

Coding is an increasingly important part of statistical analyses (Deboran Nolan and Temple Lang

2010; Hardin et al. 2021). The goal of code is not just to solve an immediate analysis problem but

also to establish reusable workﬂows and to communicate. As projects and analyses become more

sophisticated, it’s important that structures and expectations be set to facilitate data science as a

team sport in a sustainable and reproducible way (Horton et al. 2022).

Consensus reports (National Academies of Science, Engineering, and Medicine 2018) have highlighted

the importance of workﬂow and reproducibility as a component of data science practice:

“Modern data science has at its core the creation of workﬂows—pipelines of processes

that combine simpler tools to solve larger tasks. Documenting, incrementally

improving, sharing, and generalizing such workﬂows are an important part of data

science practice owing to the team nature of data science and broader signiﬁcance

of scientiﬁc reproducibility and replicability. Documenting and sharing workﬂows

enable others to understand how data have been used and reﬁned and what steps

were taken in an analysis process. This can increase the conﬁdence in results and

improve trust in the process as well as enable reuse of analyses or results in a

meaningful way (page 2-12).”

Adhering to established coding standards is an important step in fostering eﬀective analyses and an

important component of data acumen. Unfortunately, attention to these issues has not been central

to many curricula in statistics and data science.

1.1. Why bother? Many data science students, and some practitioners, do not make time to learn

and adopt good coding practices. As long as the code “works”, they are satisﬁed and ready to move

on. But how do they know that “it works”? For data analysts, these issues are an important part of

data science practice, because it aﬀects the bottom line:

Good coding practices lead to more reliable and maintainable code.

It is easier to notice and ﬁx errors in well written code. And it is less likely that the errors occur in

the ﬁrst place if the authors are using good practices. Trisovic et al. (2022) carried out a large-scale

study on code quality and found that 74% of R ﬁles did not run successfully. After incorporating

some automated code cleaning targeting “some of the most common execution errors”, 56% still

failed.

Even when the code is correct, following good coding practices makes the code easier to read and

understand, saving time and promoting good communication among team members.

In a blog post, Lyman (2021) wrote:

The value of high-quality code can be diﬃcult to communicate. Some managers

see it as a boondoggle, an expensive hobby for overly fastidious programmers, since

investing in code quality can slow development over the short term and doesn’t

appear to alter the user experience. But nothing could be further from the truth.

Fostering better coding practices for data scientists 3

Learning and consistently using good coding practices takes some eﬀort and some attention. But

in the end, we agree that they are likely to save far more time than they consume (Ball et al.

2022). Well written code is more likely to be correct, saving the time of redoing things, and easier

to maintain, saving eﬀort when it becomes necessary to modify or adapt the code in the future.

Collaborations with other team members are likely to be more positive if the quality of individual

contributions is higher.

We are convinced that

Good coding practice is important, even for beginners.

It is easier and more eﬃcient to learn good coding practices as one learns to program than to unlearn

bad habits later. This makes it especially important that the code beginners see meets the highest

standards for coding practice. We can’t expect beginners to mimic these practices perfectly from

the start, and we recommend focusing student attention (and feedback) on just a few key coding

practices early on. But if they don’t have a good model to emulate, we are impeding their progress

unnecessarily. As an additional beneﬁt, modeling good coding practices will make it easier for

students (and others) to learn not only good coding technique but also the concepts and applications

that the code is illustrating.

In this paper, we will motivate the importance of principled coding, illustrate key aspects of good

coding practices, and suggest ways that these practices can be included in the data science and

statistics curriculum.

1.2. Prior work. We acknowledge that much of what we discuss is not novel, but it is nonetheless

important (and, we argue, under-appreciated and under-emphasized).

Many calls for better coding practices and enumerations of such practices exist. Computer science

curricula have long emphasized these practices beginning in introductory programming courses and

continuing throughout the curriculum (Keuning, Heeren, and Jeuring 2017; Borstler et al. 2017),

especially in courses like software engineering or in capstone projects courses (e.g., Berkeley’s CS169,

https://bcourses.berkeley.edu/courses/1507976). Stegeman et al. (2014) and (2016) have described

rubrics and assessment for code quality in programming courses.

The importance of good coding practices is also recognized in industry (“Google Style Guide”

2019; Ghani 2022) and across the sciences (Wilson et al. 2017; Aruliah et al. 2012; Filazzola and

Lortie 2022) and social sciences (Gentzkow and Shapiro 2022). Dogucu and Çetinkaya-Rundel

(2022) motivates the importance of code quality, style guides, ﬁle organization, and related topics.

Related work by Carey and Papin (2018) that describes rules for new programmers has relevance

for teaching data analysis. Deborah Nolan and Stoudt (2021) oﬀer a “Dirty Dozen” set of helpful

code recommendations, and Abouzekry (2012) provides ten tips for better coding.

Code quality has been an area where some previous research has been undertaken. Schulte (2008)

introduced a block model to help study comprehension of program components (atoms, blocks,

relations, and macrostructure). Keuning, Heeren, and Jeuring (2017) and Keuning, Heeren, and

Jeuring (2019) have explored other aspects of teaching code quality.

While the particular coding practices enumerated vary some by author, programming language, and

application area, the overall message is clear: Good coding practices are important across a wide

Fostering better coding practices for data scientists 4

range of contexts to ensure that people, especially those working in teams, are productive and that

their work is reliable, maintainable, and reproducible. Furthermore, there is broad agreement about

the basic contours of what constitutes good coding practice. Unfortunately, the abundance of such

calls indicates that practice continues to fall short of principle.

1.3. A Motivating Example. An April 2021 twitter post (Meyer 2021) commented:

“It is really painful when taking a graduate level data science course and the

instructor’s code is considerably below any acceptable standard in the real world.

Here is some real life code from a demo oﬀered for the current homework. . . ”

data <- read.csv(fname)

train <- data[0:(length(data[,1])-8),]

train.ts <- ts(train[,c(2,3,4,5)],,start=c(2014,1),freq=52)

test <- data[(length(data[,1])-7):length(data[,1]),]

We concur that this code, while short, is exceptionally hard to read. There are many ways that this

could be improved, some of which are demonstrated below.

# R

n_test <- 8

data <- readr::read_csv(fname) 1

train <- head(data, -n_test)

test <- tail(data, n_test)

train_ts <- ts(

dplyr::select(train, 2:5), 2

start = c(2014,1),

freq = 52)

Alternatively we could load the entire

readr

package with

library(readr)

and avoid the

For demonstration code, the explicit package reference can help the reader know which

functions come from which packages. When using packages that are already familiar to the

reader or when using many functions from the same package, loading with

library()

more appropriate. We will demonstrate both styles in the various examples presented here.

Ideally we would use column names rather than numerical indices here.

dplyr::select()

would be more useful in that case, especially in conjunction with functions like

matches()

contains(),begins_with(), etc.

The suggested revisions add white space to improve readability and clarify the type of subsetting

that is happening by taking advantage of the

head()

and

tail()

functions. In addition to improved

formatting, we specify that the

select()

function is coming from the

dplyr

package.

We note

that there are still aspects of the code that are brittle (e.g., assumptions regarding the ordering of

the ﬁve variables in the dataset). The code snippet in the tweet does not provide enough context

to appropriately address these. Using the native pipe in R, we could rearrange the nested tasks,

We could also have loaded the

dplyr

package with

library(dplyr)

, and we would have done so had we

used several functions from that package.

Fostering better coding practices for data scientists 5

making the order of operations more transparent. We will see examples of this shortly, and of a

similar approach called method chaining (see Augspurger 2016), in Python.

The revised code is easier to read and understand (and maintain) than the original.

We believe that examples of this kind are all too common. This particular example is compelling

because it shows that students notice the quality of the code instructors present and motivates why

academics need to live up to industry standards.

2. Establishing Good Coding Practices

2.1. The Four C’s. As mentioned in the introduction, many lists of good coding practices have

been published. These lists can provide useful guidance as one progresses — or, in the case of an

instructor, leads students in their progression — toward better coding practices.

In addition to providing students with speciﬁc coding guidelines like our top ten list below, we think

it is also important that students understand the higher level goals that the speciﬁc guidelines are

intended to support. These higher level goals take into account that computer code simultaneously

communicates both to humans and to the computer and provide a framework for establishing a set

of speciﬁc coding practices. We describe these as the four C’s for good code:

•

Correctness: It is important that the code be correct so that the computer does what is

intended. This, in itself, is not profound. But we emphasize two things about correctness:

First, correctness is a necessary but insuﬃcient metric for good code. Second, the other

goals support and promote correctness.

•

Clarity: It is also important that the code be clear, so that humans reading and writing

the code can tell what it is intended to do, and easily make modiﬁcations as necessary.

(This advice applies both to other humans and to the same human at some later date.)

•

Containment: It is helpful if the code is appropriately contained, to keep separate things

that should be separate and together things that should be together. Vartanian (2022) refers

to this idea as “low coupling, high cohesion.” Other authors refer to this as modularization.

Proper containment includes things like preferring a data frame over several individual

vectors, using functions to contain reusable code, and keeping code used across ﬁles or

projects in a module or package.

•

Consistency: Finally, it is useful if code exhibits internal consistency of style, naming

conventions, and other coding practices.

The speciﬁc guidelines outlined below serve as concrete advice for developing practices that promote

creating code that satisﬁes the 4 C’s.

2.2. A Progressive Approach. Before revealing our top ten list, we want to emphasize the

importance of taking a progressive approach, both for oneself and for students. Developing good

coding practices takes time and attention. Any list of coding guidelines can exceed the available

cognitive resources to take them on. As is true for any behavioral intervention, change takes time

and eﬀort.

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

FosteringbettercodingpracticesfordatascientistsRandallPruim†,Maria-CristianaGîrjău‡,⋄,NicholasJ.Horton‡,*†CalvinUniversity‡AmherstCollege⋄ColumbiaUniversity*nhorton@amherst.eduAbstract.Manydatasciencestudentsandpractitionersdon’tseethevalueinmakingtimetolearnandadoptgoodcodingpracticesaslongasthecod...

展开>> 收起<<

Fostering better coding practices for data scientists_2.pdf

共27页,预览5页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Fostering better coding practices for data scientists_2

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: