
How Do Data Science Workers
Communicate Intermediate Results?
Rock Yuren Pang, Ruotong Wang, Joely Nelson, Leilani Battle
Abstract
— Data science workers increasingly collaborate on large-scale projects before communicating insights to a broader audience
in the form of visualization. While prior work has modeled how data science teams, oftentimes with distinct roles and work processes,
communicate knowledge to outside stakeholders, we have little knowledge of how data science workers communicate intermediately
before delivering the final products. In this work, we contribute a nuanced description of the intermediate communication process
within data science teams. By analyzing interview data with 8 self-identified data science workers, we characterized the data science
intermediate communication process with four factors: the types of audience, communication goals, shared artifacts, and
mode of communication. We also identified overarching challenges in the current communication process and discussed design
implications that might inform better tools that facilitate intermediate communication within data science teams.
Index Terms—Data Science Collaboration, Data Science Communication
1 INTRODUCTION
Data science communication often refers to conveying the final analysis
insights to a broader audience [11, 20, 52]. For example, researchers
and companies increasingly communicate information through high-
quality interactive visualizations and dashboards. The New York Times
leverages its rich visualization to inform the general public about topics
including elections, climate change, and sports. To support more ef-
fective final-stage communication, researchers and organizations have
developed powerful visualization tools, such as D3.js, Vega-Lite,
Idyll, Tableau, and Microsoft PowerBI, to simplify the development
cycle, condense large-scale datasets, and communicate the final
information to the audience through highly polished visualizations.
However, communication among team members also exists through-
out the lifecycle of large-scale data science projects, increasingly in a
collaborative fashion [29, 42, 52]. Today, such collaboration involves
multiple different team players and separate analysis stages ranging
from data cleaning to visualizing sophisticated findings [20, 42]. Similar
to final-stage communication, communication during the project
also involves explaining technical terms to a non-technical audience
(e.g., managers). However, compared to final-stage communication,
intermediate communication can focus more on communicating and
receiving feedback. To add more complexity, individuals might adopt
unique tools at the intermediate stages of a data science project [52].
These resulting intermediate artifacts tend to be far less polished and
could be produced by a wider range of tools. We seek to answer the
question:
How do data science workers communicate data interme-
diately before shipping their final product?
We define intermediate communication as the synchronous or asyn-
chronous decision-making process where team members build and
iterate on the end artifacts for the target audience. In contrast to prior
work that categorizes communication as the final step in the data sci-
ence workflow [20, 46], we argue that intermediate communication
should be a distinct collaboration element where data science workers
develop, share, reuse, document, and store analysis with other team
members throughout the project lifecycle. Advanced visual analysis
authoring tools enable faster data visualization prototyping [4, 9, 44],
but they largely tackle the engineering side of the problem (i.e., how
to make it easier to create interactive visualizations). However, data
science is more an exploration process than an engineering
process [18, 29, 52]. This exploration process can be more flexible and
diverse, involving different goals, shared artifacts, and audiences.
Rock Yuren Pang is with the University of Washington. E-mail:
ypang2@cs.washington.edu.
Ruotong Wang is with the University of Washington. E-mail:
ruotongw@cs.washington.edu.
Joely Nelson is with the University of Washington. E-mail:
joelyn@cs.washington.edu.
Leilani Battle is with the University of Washington. E-mail:
leibatt@cs.washington.edu.
Manuscript received 14 July 2022; accepted 16 Aug. 2022. Date of Publication
26 Aug. 2022; date of current version 26 Aug. 2022.
In this paper, we contribute a more nuanced understanding of the
intermediate communication process, for example, how teams share
resources, resolve conflicts, and seek help. We conducted eight in-depth
interviews with people who self-identify as data scientists/analysts
(in industry and academia) and regularly need to communicate and
get feedback on their data science work. To answer the overarching
research question (i.e., how do data science workers communicate data
intermediately), we guided our interview with the following questions:
Who do data science workers communicate with?
Why do they communicate with others?
What forms of communication take place in your project?
What challenges, if any, hinder intermediate communication?
In the interviews, we focused on participants’ experiences of com-
municating and receiving feedback, as well as the challenges they en-
countered in the process. In particular, we identified four major factors
influencing intermediate communication in data science projects:
(1) common goals of intermediate communication, (2) types of arti-
facts that are shared in the communication process, (3) modes of com-
munication (i.e., synchronous versus asynchronous), and (4) common
audience configurations observed during intermediate communication.
2 RELATED WORK
2.1 Data Science Workers and Stages
Although data science has become a popular term and has gained an
increasing amount of attention over recent years [38], there is no
agreed-upon definition of data scientists, including their
necessary skills and related work tasks [8]. Prior literature has used data
scientists [11, 45], data analysts [20, 24], and data science workers [29]
interchangeably. Mueller et al. [29] posited that data science is a human
activity and people involved in data science are therefore data science
workers. People who do the work of data science span across multiple
job categories and titles, and the definition of data scientists (or data
science workers) is likely to become more diverse over time [29]. In
our study, we use the term data science workers to refer to anyone
whose primary job function deals with or draws meaningful insights
from large datasets.
In addition to a variety of data science roles, prior research has pro-
posed frameworks that capture the data analysis process over time. In
the knowledge discovery in databases (KDD) community, the Cross-Industry
Standard Process for Data Mining (CRISP-DM) established
six standard phases for data mining based on a prior KDD model [16]:
business understanding, data understanding, data preparation, modeling,
evaluation, and deployment [48]. Though capturing a generic data science
workflow, this early data science model did not have explicit references to
communication. Visualization research studies have examined data science
workers with similar phases. In interviews with 35 data analysts,
Kandel et al. [20] formalized the data analysis process as consisting
of five phases: discover, wrangle, profile, model, and report. Batch
and Elmqvist [3] conducted a contextual inquiry with 8 practitioners
and echoed a similar process but found that visualization is optional
at best in analysts’ work. Based on the framework of Kandel et al., Alspaugh
et al. [2] formed a six-step pipeline by considering the increasingly
exploratory data analysis process in practice. Recently, Crisan et al. [11]
synthesized a comprehensive model from human-computer interaction,
visualization, and data science literature with retrospective analysis,
resulting in four higher-order processes: preparation, analysis,
deployment, and communication.
arXiv:2210.03305v1 [cs.HC] 7 Oct 2022
Fig. 1: Collaboration among roles in data science projects from a
survey of 183 IBM employees. The thickness of each arc represents the
normalized proportion of people who reported each directed-type of
collaboration. Communicator was categorized as a separate role. [52]
However, prior work models communication as a separate stage in
the data science workflow, often focusing on the final, asynchronous
communication to the public, akin to digital journalism. In this work,
we argue that communication should be considered throughout a data
science project and beyond presentation (explained in Section 2.2).
2.2 Collaboration and Communication in Data Science
Recent work recognized that data science is a collaborative process
[5, 28] and explored how to support data science stakeholders
in the collaboration process [52]. Zhang et al. [52] performed an extensive
investigation to understand how data science workers collaborate.
In a large-scale survey with 183 employees at IBM, they identified 5
major roles and 6 main stages in collaborative data science workflow
as well as tooling and practices for collaboration such as asynchronous
data and code documentation. Wang et al. [46] identified a "scatter-gather"
pattern of collaboration among data science workers. In the
"scatter" phase, workers function individually and communicate high-level
ideas rather than artifacts in the ensuing "gather" phases. Previous
work has also presented in-depth case studies of practices in the contexts
of civic data hackathons [19] and collaboration between domain
experts and data scientists [26, 33]. Crisan et al. [12] also identified
collaboration as an emerging type of data science work within their
data science model with an emphasis on data visualization.
As an important element of collaboration, communication highlights
the circulation of artifacts and knowledge. Crisan et al. [11]
categorize two communication processes: documentation and dissemination.
Documentation records and describes data science work
and its artifacts, often in the form of capturing the data and analysis
provenance [11, 20]. In response, recent tools have been developed to
address the documentation process, mostly asynchronously capturing
the data provenance or documenting analysis processes [13, 30, 35].
For example, code gathering tools enabled data analysts to review all
archived versions of code outputs and recover the subsets of code that
produce the outputs [17]. URSPRUNG is another transparent prove-
nance collection system designed for data science environments [37].
Tracking the data provenance has also been implemented and tested
on computational notebooks to support collaborative data science
work [21, 22, 34, 36, 41–43, 50, 51]. Notably, StoryFacets [31] was
designed to maintain the visual provenance of the analysis to miti-
gate barriers to collaboration among audiences (i.e., expert analysts,
managers, and laypersons).
Dissemination, on the other hand, conveys the insights derived from
the data science work, usually in the form of final presentations [7,
15], reports [49], and interactive visualizations [39]. The emphasis on
communicating insights to a larger audience motivated enterprise tools,
such as Microsoft PowerBI, Tableau, D3.js, and Vega-Lite, as well as
research products that enable rapid visualization development, such as
Idyll [10], Falx [44], PI2 [9], and Symphony [4]. In that regard, tooling
largely supports the engineering side of the dissemination process, in which
communication is often formalized as the end step [11].
Although collaboration has been categorized as an independent
emerging type of work [11], recent work on collaborative data science
also emphasized communicating to the lay audience in the final stage of
the pipeline by the roles such as communicator [52]. However, Mueller
et al. [29] suggested that communication might take place through-
out the data analysis process. Brehmer and Kosara [7] contextualized
the intermediate communication by interviewing 23 professionals at
Tableau. They identified three scenarios involving presentations of data
and provided design suggestions to integrate visualization tools for
data analysis and slideware tools for data presentation. More recent
work started to address the challenges in intermediate communica-
tion. For example, Voder [40] treats data facts as interactive widgets
to provide insights throughout the analysis, though they are restricted
to system-generated descriptive statistics. NB2Slides [53] generates
presentation slides from computational notebooks. Yet, this line of
work largely focused on performative presentation which requires a
presenter to narrate and step through the content. Our work expands
on prior works and covers other aspects of communication such as
asynchronous messaging and screen sharing.
3 METHODS
We conducted semi-structured interviews with self-identified data sci-
ence workers (see Section 2.1 for details) to better understand the
intermediate communication process and needs.
3.1 Participants
We recruited eight self-identified data science workers (Table 1) through
internal message boards and individual contacts. Of the 8 participants,
4 are current data science professionals; the other 4 are current Ph.D.
students but have also been employed at large multinational technology
companies. All participants are based in the US. We intentionally
recruited participants who have engaged in large data science projects
both in academia and industry to holistically extract common themes.
Note that our study aims to surface the common goals, practices, and
challenges to better formalize intermediate communication in data
science projects. In the future, a large-scale survey would be important
to quantitatively verify our findings.
3.2 Interview Procedures
We conducted one-hour, semi-structured interviews with each participant,
either remotely via Zoom or in person with a recording device. Each
interview started with an introduction to our project goal and interview instructions. To
contextualize the concept of intermediate communication, we asked
participants to recall a recent data science project where they communicated
their findings with other team members. We asked open-ended
questions and encouraged interviewees to describe their lived
experiences [20]. We organized each interview as follows:
Background: 1) What was a recent data science project you collaborated
on with other people? 2) Do you communicate your
intermediate results?