
Fig. 1: Collaboration among roles in data science projects from a
survey of 183 IBM employees. The thickness of each arc represents the
normalized proportion of people who reported each directed type of
collaboration. Communicator was categorized as a separate role. [52]
deployment [48]. Though capturing a generic data science workflow,
this early data science model did not explicitly reference communication.
Visualization research has examined the work of data science workers and found similar phases.
From interviews with 35 data analysts,
Kandel et al. [20] formalized the data analysis process as consisting
of the phases: discover, wrangle, profile, model, and report. Batch
and Elmqvist [3] conducted a contextual inquiry with 8 practitioners
and echoed a similar process but found that visualization is optional
at best in analysts’ work. Building on the framework of Kandel et al., Alspaugh
et al. [2] formed a six-step pipeline that accounts for the increasingly
exploratory nature of data analysis in practice. Recently, Crisan et al. [11]
synthesized a comprehensive model from human-computer interaction,
visualization, and data science literature with retrospective analysis,
resulting in four higher-order processes: preparation, analysis,
deployment, and communication.
However, prior work models communication as a separate stage in
the data science workflow, often focusing on final, asynchronous
communication to the public, akin to digital journalism. In this work,
we argue that communication should be considered throughout a data
science project and beyond presentation (explained in Section 2.2).
2.2 Collaboration and Communication in Data Science
Recent work recognized that data science is a collaborative process [5, 28]
and explored how to support data science stakeholders
in the collaboration process [52]. Zhang et al. [52] performed an exten-
sive investigation to understand how data science workers collaborate.
In a large-scale survey with 183 employees at IBM, they identified 5
major roles and 6 main stages in the collaborative data science workflow
as well as tooling and practices for collaboration such as asynchronous
data and code documentation. Wang et al. [46] identified a “scatter-gather”
pattern of collaboration among data science workers. In the
“scatter” phase, workers function individually; in the ensuing “gather”
phases, they communicate high-level ideas rather than artifacts. Previous
work has also presented in-depth case studies of practices in the
contexts of civic data hackathons [19] and collaboration between domain
experts and data scientists [26, 33]. Crisan et al. [12] also identified
collaboration as an emerging type of data science work within their
data science model with an emphasis on data visualization.
As an important element of collaboration, communication highlights
the circulation of artifacts and knowledge. Crisan et al. [11]
categorize two communication processes: documentation and dissemination.
Documentation records and describes data science work
and its artifacts, often in the form of capturing the data and analysis
provenance [11, 20]. In response, recent tools have been developed to
address the documentation process, mostly asynchronously capturing
the data provenance or documenting analysis processes [13, 30, 35].
For example, code gathering tools enabled data analysts to review all
archived versions of code outputs and recover the subsets of code that
produce the outputs [17]. URSPRUNG is another transparent prove-
nance collection system designed for data science environments [37].
Tracking the data provenance has also been implemented and tested
on computational notebooks to support collaborative data science
work [21, 22, 34, 36, 41–43, 50, 51]. Notably, StoryFacets [31] was
designed to maintain the visual provenance of the analysis to miti-
gate barriers to collaboration among audiences (i.e., expert analysts,
managers, and laypersons).
Dissemination, on the other hand, conveys the insights derived from
the data science work, usually in the form of final presentations [7,
15], reports [49], and interactive visualizations [39]. The emphasis on
communicating insights to a larger audience motivated enterprise tools,
such as Microsoft PowerBI, Tableau, D3.js, and Vega-Lite, as well as
research products that enable rapid visualization development, such as
Idyll [10], Falx [44], PI2 [9], and Symphony [4]. In that regard, tooling
largely supports the engineering side of the dissemination process, in which
communication is often formalized as the end step [11].
Although collaboration has been categorized as an independent
emerging type of work [11], recent work on collaborative data science
also emphasized communicating to a lay audience in the final stage of
the pipeline by roles such as the communicator [52]. However, Mueller
et al. [29] suggested that communication might take place through-
out the data analysis process. Brehmer and Kosara [7] contextualized
the intermediate communication by interviewing 23 professionals at
Tableau. They identified three scenarios involving presentations of data
and provided design suggestions to integrate visualization tools for
data analysis and slideware tools for data presentation. More recent
work has started to address the challenges in intermediate communication.
For example, Voder [40] treats data facts as interactive widgets
to provide insights throughout the analysis, though they are restricted
to system-generated descriptive statistics. NB2Slides [53] generates
presentation slides from computational notebooks. Yet, this line of
work has largely focused on performative presentation, which requires a
presenter to narrate and step through the content. Our work expands
on prior work and covers other aspects of communication, such as
asynchronous messaging and screen sharing.
3 METHODS
We conducted semi-structured interviews with self-identified data sci-
ence workers (see Section 2.1 for details) to better understand the
intermediate communication process and needs.
3.1 Participants
We recruited eight self-identified data science workers (Table 1) through
internal message boards and individual contacts. Of the 8 participants,
4 are current data science professionals; the other 4 are current Ph.D.
students who have also been employed at large multinational technology
companies. All participants are based in the US. We intentionally
recruited participants who have engaged in large data science projects
both in academia and industry to holistically extract common themes.
Note that our study aims to surface the common goals, practices, and
challenges to better formalize intermediate communication in data
science projects. In the future, a large-scale survey would be important
to quantitatively verify our findings.
3.2 Interview Procedures
We conducted hour-long, semi-structured interviews with each participant,
either remotely via Zoom or in person with a recording device. Each
interview started with our project goal and interview instructions. To
contextualize the concept of intermediate communication, we asked
participants to recall a recent data science project in which they
communicated their findings to other team members. We asked
open-ended questions and encouraged interviewees to describe their lived
experiences [20]. We organized each interview as follows:
• Background: 1) What was a recent data science project you
collaborated on with other people? 2) Do you communicate your
intermediate results?