
How Do Data Science Workers Communicate Intermediate Results?

Rock Yuren Pang, Ruotong Wang, Joely Nelson, Leilani Battle

Abstract

Data science workers increasingly collaborate on large-scale projects before communicating insights to a broader audience in the form of visualization. While prior work has modeled how data science teams, oftentimes with distinct roles and work processes, communicate knowledge to outside stakeholders, we know little about how data science workers communicate intermediately before delivering the final products. In this work, we contribute a nuanced description of the intermediate communication process within data science teams. By analyzing interview data from 8 self-identified data science workers, we characterized the data science intermediate communication process with four factors: the types of audience, communication goals, shared artifacts, and mode of communication. We also identified overarching challenges in the current communication process and discussed design implications that might inform better tools that facilitate intermediate communication within data science teams.

Index Terms—Data Science Collaboration, Data Science Communication

1 INTRODUCTION

Data science communication often refers to conveying the final analysis insights to a broader audience [11, 20, 52]. For example, researchers and companies increasingly communicate information through high-quality interactive visualizations and dashboards. The New York Times leverages its rich visualizations to inform the general public about topics including elections, climate change, and sports. To support more effective final-stage communication, researchers and organizations have developed powerful visualization tools, such as D3.js, Vega-Lite, Idyll, Tableau, and Microsoft PowerBI, to simplify the development cycle, condense large-scale datasets, and communicate the final information to the audience through highly polished visualizations.

However, communication among team members also occurs throughout the lifecycle of large-scale data science projects, increasingly in a collaborative fashion [29, 42, 52]. Today, such collaboration involves multiple team players and separate analysis stages ranging from data cleaning to visualizing sophisticated findings [20, 42]. Like final-stage communication, communication during the project also involves explaining technical terms to a non-technical audience (e.g., managers). However, compared to final-stage communication, intermediate communication can focus more on communicating and receiving feedback. To add more complexity, individuals might adopt unique tools at the intermediate stages of a data science project [52]. The resulting intermediate artifacts tend to be far less polished and could be produced by a wider range of tools. We seek to answer the question:

How do data science workers communicate data intermediately before shipping their final product?

We define intermediate communication as the synchronous or asynchronous decision-making process where team members build and iterate on the end artifacts for the target audience. In contrast to prior work that categorizes communication as the final step in the data science workflow [20, 46], we argue that intermediate communication should be a distinct collaboration element where data science workers develop, share, reuse, document, and store analysis with other team members throughout the project lifecycle. Advanced visual analysis authoring tools enable faster data visualization prototyping [4, 9, 44], but they largely tackle the engineering side of the problem (i.e., how to make it easier to make interactive visualizations). However, data science is practiced as an exploration process more than an engineering process [18, 29, 52]. This exploration process can be more flexible and diverse, involving different goals, shared artifacts, and audiences.

• Rock Yuren Pang is with the University of Washington. E-mail: ypang2@cs.washington.edu.

• Ruotong Wang is with the University of Washington. E-mail: ruotongw@cs.washington.edu.

• Joely Nelson is with the University of Washington. E-mail: joelyn@cs.washington.edu.

• Leilani Battle is with the University of Washington. E-mail: leibatt@cs.washington.edu.

Manuscript received 14 July 2022; accepted 16 Aug. 2022. Date of publication 26 Aug. 2022; date of current version 26 Aug. 2022.

In this paper, we contribute a more nuanced understanding of the intermediate communication process, for example, how teams share resources, resolve conflicts, and seek help. We conducted eight in-depth interviews with people who self-identify as data scientists or analysts (in industry and academia) and regularly need to communicate and get feedback on their data science work. To answer the overarching research question (i.e., how do data science workers communicate data intermediately), we guided our interviews with the following questions:

• Who do data science workers communicate with?

• Why do they communicate with others?

• What forms of communication take place in their projects?

• What challenges, if any, hinder intermediate communication?

In the interviews, we focused on participants’ experiences of communicating and receiving feedback, as well as the challenges they encountered in the process. In particular, we identified four major factors influencing intermediate communication in data science projects: (1) common goals of intermediate communication, (2) types of artifacts that are shared in the communication process, (3) modes of communication (i.e., synchronous versus asynchronous), and (4) common audience configurations observed during intermediate communication.
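
One way to make the four factors concrete is to treat each reported communication episode as a record with one field per factor. The sketch below is illustrative only: the names `CommunicationEpisode`, `Mode`, and `count_by_mode`, as well as the sample values, are hypothetical and are not taken from the study's data or analysis procedure.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    """Factor 3: mode of communication."""
    SYNCHRONOUS = "synchronous"
    ASYNCHRONOUS = "asynchronous"


@dataclass(frozen=True)
class CommunicationEpisode:
    """One episode of intermediate communication (hypothetical schema)."""
    goal: str       # factor 1: why the worker communicated
    artifact: str   # factor 2: what was shared
    mode: Mode      # factor 3: synchronous or asynchronous
    audience: str   # factor 4: who received it


def count_by_mode(episodes):
    """Tally episodes per communication mode."""
    return Counter(episode.mode for episode in episodes)


# Made-up sample values, for illustration only.
episodes = [
    CommunicationEpisode("seek feedback", "notebook screenshot",
                         Mode.ASYNCHRONOUS, "teammate"),
    CommunicationEpisode("report progress", "draft chart",
                         Mode.SYNCHRONOUS, "manager"),
    CommunicationEpisode("seek help", "code snippet",
                         Mode.ASYNCHRONOUS, "teammate"),
]

counts = count_by_mode(episodes)
```

Keeping the mode as an enumeration while leaving the other three factors as free text reflects that only the synchronous/asynchronous split is a closed set in the description above; the other categories are left open in this sketch.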

2 RELATED WORK

2.1 Data Science Workers and Stages

Although data science has become a popular term and has gained increasing attention over recent years [38], there is no agreement on the definition of data scientists, including their necessary skills and related work tasks [8]. Prior literature has used data scientists [11, 45], data analysts [20, 24], and data science workers [29] interchangeably. Mueller et al. [29] posited that data science is a human activity and people involved in data science are therefore data science workers. People who do the work of data science span multiple job categories and titles, and the definition of data scientists (or data science workers) is likely to become more diverse over time [29]. In our study, we use the term data science workers to refer to anyone whose primary job function deals with or draws meaningful insights from large datasets.

In addition to a variety of data science roles, prior research has proposed frameworks that capture the data analysis process over time. In the knowledge discovery in databases (KDD) community, the Cross-Industry Standard Process for Data Mining (CRISP-DM) established six standard phases for data mining based on a prior KDD model [16]: business understanding, data understanding, data preparation, modeling, evaluation, and deployment [48]. Though capturing a generic data science workflow, this early data science model did not have explicit references to communication. Visualization research has examined data science workers with similar phases. In interviews with 35 data analysts, Kandel et al. [20] formalized the data analysis process as consisting of five phases: discover, wrangle, profile, model, and report. Batch and Elmqvist [3] conducted a contextual inquiry with 8 practitioners and echoed a similar process but found that visualization is optional at best in analysts’ work. Based on the framework of Kandel et al., Alspaugh et al. [2] formed a six-step pipeline by considering the increasingly exploratory data analysis process in practice. Recently, Crisan et al. [11] synthesized a comprehensive model from human-computer interaction, visualization, and data science literature with retrospective analysis, resulting in four higher-order processes: preparation, analysis, deployment, and communication.

Fig. 1: Collaboration among roles in data science projects from a survey of 183 IBM employees. The thickness of each arc represents the normalized proportion of people who reported each directed type of collaboration. Communicator was categorized as a separate role [52].

arXiv:2210.03305v1 [cs.HC] 7 Oct 2022

However, prior work models communication as a separate stage in the data science workflow, often focusing on the final asynchronous communication to the public, akin to digital journalism. In this work, we argue that communication should be considered throughout a data science project and beyond presentation (explained in Section 2.2).

2.2 Collaboration and Communication in Data Science

Recent work recognized that data science is a collaborative process [5, 28] and explored how to support data science stakeholders in the collaboration process [52]. Zhang et al. [52] performed an extensive investigation to understand how data science workers collaborate. In a large-scale survey of 183 employees at IBM, they identified 5 major roles and 6 main stages in the collaborative data science workflow, as well as tooling and practices for collaboration such as asynchronous data and code documentation. Wang et al. [46] identified a "scatter-gather" pattern of collaboration among data science workers. Workers function individually in the "scatter" phase and, in the ensuing "gather" phases, communicate high-level ideas rather than artifacts. Previous work has also presented in-depth case studies of practices in the contexts of civic data hackathons [19] and collaboration between domain experts and data scientists [26, 33]. Crisan et al. [12] also identified collaboration as an emerging type of data science work within their data science model with an emphasis on data visualization.

As an important element of collaboration, communication highlights the circulation of artifacts and knowledge. Crisan et al. [11] categorize two communication processes: documentation and dissemination. Documentation records and describes data science work and its artifacts, often by capturing the data and analysis provenance [11, 20]. In response, recent tools have been developed to address the documentation process, mostly by asynchronously capturing the data provenance or documenting analysis processes [13, 30, 35]. For example, code gathering tools enabled data analysts to review all archived versions of code outputs and recover the subsets of code that produce the outputs [17]. URSPRUNG is another transparent provenance collection system designed for data science environments [37]. Tracking the data provenance has also been implemented and tested on computational notebooks to support collaborative data science work [21, 22, 34, 36, 41–43, 50, 51]. Notably, StoryFacets [31] was designed to maintain the visual provenance of the analysis to mitigate barriers to collaboration among audiences (i.e., expert analysts, managers, and laypersons).

Dissemination, on the other hand, conveys the insights derived from the data science work, usually in the form of final presentations [7, 15], reports [49], and interactive visualizations [39]. The emphasis on communicating insights to a larger audience motivated enterprise tools, such as Microsoft PowerBI, Tableau, D3.js, and Vega-Lite, as well as research products that enable rapid visualization development, such as Idyll [10], Falx [44], PI2 [9], and Symphony [4]. In that regard, tooling largely supports the engineering side of the dissemination process, with communication often formalized as the end step [11].

Although collaboration has been categorized as an independent emerging type of work [11], recent work on collaborative data science also emphasized communicating to the lay audience in the final stage of the pipeline by roles such as the communicator [52]. However, Mueller et al. [29] suggested that communication might take place throughout the data analysis process. Brehmer and Kosara [7] contextualized intermediate communication by interviewing 23 professionals at Tableau. They identified three scenarios involving presentations of data and provided design suggestions to integrate visualization tools for data analysis and slideware tools for data presentation. More recent work has started to address the challenges in intermediate communication. For example, Voder [40] treats data facts as interactive widgets to provide insights throughout the analysis, though they are restricted to system-generated descriptive statistics. NB2Slides [53] generates presentation slides from computational notebooks. Yet, this line of work largely focuses on performative presentation, which requires a presenter to narrate and step through the content. Our work expands on prior work and covers other aspects of communication such as asynchronous messaging and screen sharing.

3 METHODS

We conducted semi-structured interviews with self-identified data science workers (see Section 2.1 for details) to better understand the intermediate communication process and needs.

3.1 Participants

We recruited eight self-identified data science workers (Table 1) through internal message boards and individual contacts. Of the 8 participants, 4 are current data science professionals; the other 4 are current Ph.D. students who have also been employed at large multinational technology companies. All participants are based in the US. We intentionally recruited participants who have engaged in large data science projects in both academia and industry to holistically extract common themes. Note that our study aims to surface the common goals, practices, and challenges to better formalize intermediate communication in data science projects. In the future, a large-scale survey would be important to quantitatively verify our findings.

3.2 Interview Procedures

We conducted an hour-long semi-structured interview with each participant, either remotely via Zoom or in person with a recording device. Each interview started with our project goal and interview instructions. To contextualize the concept of intermediate communication, we asked participants to recall a recent data science project where they communicated their findings with other team members. We asked open-ended questions and encouraged interviewees to describe their lived experiences [20]. We organized each interview as follows:

• Background: 1) What was a recent data science project you collaborated on with other people? 2) Do you communicate your intermediate results?
