Learning and consistently using good coding practices takes some effort and some attention. But
in the end, we agree that they are likely to save far more time than they consume (Ball et al.
2022). Well-written code is more likely to be correct, saving the time spent redoing work, and is easier
to maintain, saving effort when it becomes necessary to modify or adapt the code in the future.
Collaborations with other team members are likely to be more positive if the quality of individual
contributions is higher.
We are convinced that good coding practice is important, even for beginners.
It is easier and more efficient to learn good coding practices as one learns to program than to unlearn
bad habits later. This makes it especially important that the code beginners see meets the highest
standards for coding practice. We can’t expect beginners to mimic these practices perfectly from
the start, and we recommend focusing student attention (and feedback) on just a few key coding
practices early on. But if they don’t have a good model to emulate, we are impeding their progress
unnecessarily. As an additional benefit, modeling good coding practices will make it easier for
students (and others) to learn not only good coding technique but also the concepts and applications
that the code is illustrating.
In this paper, we will motivate the importance of principled coding, illustrate key aspects of good
coding practices, and suggest ways that these practices can be included in the data science and
statistics curriculum.
1.2. Prior work. We acknowledge that much of what we discuss is not novel, but it is nonetheless
important (and, we argue, under-appreciated and under-emphasized).
Many calls for better coding practices and enumerations of such practices exist. Computer science
curricula have long emphasized these practices beginning in introductory programming courses and
continuing throughout the curriculum (Keuning, Heeren, and Jeuring 2017; Borstler et al. 2017),
especially in courses such as software engineering or in capstone project courses (e.g., Berkeley’s CS169,
https://bcourses.berkeley.edu/courses/1507976). Stegeman et al. (2014, 2016) have described
rubrics and assessment for code quality in programming courses.
The importance of good coding practices is also recognized in industry (“Google Style Guide”
2019; Ghani 2022) and across the sciences (Wilson et al. 2017; Aruliah et al. 2012; Filazzola and
Lortie 2022) and social sciences (Gentzkow and Shapiro 2022). Dogucu and Çetinkaya-Rundel
(2022) motivate the importance of code quality, style guides, file organization, and related topics.
Related work by Carey and Papin (2018), which describes rules for new programmers, is relevant
to teaching data analysis. Nolan and Stoudt (2021) offer a “Dirty Dozen” set of helpful
code recommendations, and Abouzekry (2012) provides ten tips for better coding.
Code quality has also been the subject of prior research. Schulte (2008)
introduced a block model to help study comprehension of program components (atoms, blocks,
relations, and macrostructure). Keuning, Heeren, and Jeuring (2017, 2019) have explored
other aspects of teaching code quality.
While the particular coding practices enumerated vary somewhat by author, programming language, and
application area, the overall message is clear: Good coding practices are important across a wide