2 Related Works
Our work is most closely related to document-grounded dialog systems (DGDS). Based on the conversation objective, related tasks can be roughly categorized into chitchat, comprehension, and information seeking.
Document-grounded chitchat datasets such as WoW (Dinan et al., 2019), Holl-E (Moghe et al., 2018), and CMU-DoG (Zhou et al., 2018) aim to enhance early chitchat systems by using information from grounded textual passages for answer generation. The goal is similar to that of an open chitchat system: the dialog agent tries to keep users engaged in long, informative, and interactive conversations.
This is different from our setting because users
of our system often have clear goals (information
needs), and the dialog agent needs to provide users
with accurate information as soon as possible.
For document-grounded “comprehension” tasks such as CoQA (Reddy et al., 2019), Abg-CoQA (Guo et al., 2021), and ShARC (Saeidi et al., 2018), the agent is given a textual paragraph and must answer users’ questions about it. This setting is similar to Machine Reading Comprehension (MRC), except that questions in MRC may not form a coherent dialog. Notably, Abg-CoQA and ShARC target specific questioning strategies. In Abg-CoQA, systems can ask clarifying questions to resolve different types of ambiguity; in ShARC, the authors created conversations in which the system learns to ask “yes/no” questions to elicit users’ information and provide appropriate answers.
The questioning strategy in ShARC is derived from textual rules that define the relationship between “conditions” and “solutions” in the given paragraph. Although we also address questioning strategies, our task is more challenging because we focus on multiple documents.
The third type of DGDS (Penha et al., 2019; Feng et al., 2020, 2021) is closest to our setting, where the agent needs to provide answers to information seekers in the shortest possible time. MANtIS (Penha et al., 2019) was collected from online forums, and the grounded documents are not given in advance. As a result, MANtIS does not come with the detailed annotations needed to study an agent’s ability to understand documents. In
contrast, given a set of documents, Doc2dial (Feng et al., 2020) and Multidoc2dial (Feng et al., 2021) were collected in two stages: 1) dialog flows are first generated by labeling and linking paragraphs; 2) crowdworkers then write conversations based on the suggested flows. Note that Multidoc2dial was built by rearranging dialogues from Doc2dial so that one conversation can contain information from
multiple documents. Although we follow similar steps in constructing our dataset, our dialog flow generation is fundamentally different: it addresses the coherence of the generated dialogues and the multi-document grounding issue by design. In addition, our dataset exceeds Doc2dial and Multidoc2dial in scale while highlighting new challenges such as under-specified user requests.
3 Dataset Collection
This section details the process of collecting Doc2Bot, which consists of four stages: 1) document collection, which selects target domains and documents; 2) document graph construction, which unifies heterogeneous structures from multiple domains to build document graphs; 3) dialog flow generation, which simulates the agenda of a user seeking information from a document graph; and 4) dialog collection, where crowdworkers write dialogs based on the generated dialog flows.
3.1 Document Collection
For document collection, we examine several potential domains and select five representative ones: public services, technology, insurance, health care services, and wikiHow. For each domain, documents are selected based on two criteria: 1) a document should be rich in structural types; 2) a document should link to other documents so that we can test the ability of machines to reason over multiple documents. We design a simple ranking score based on these criteria and select the top-ranked documents for each domain.
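The selection step above can be sketched as follows. This is a minimal illustration, assuming each criterion simply contributes a count; the exact scoring function is not specified here, and the field names (`structure_types`, `outgoing_links`) are hypothetical.

```python
def rank_score(doc):
    """Score a candidate document by structural richness and connectivity."""
    # Criterion 1: richness in structural types (e.g., tables, lists, sections).
    structure_score = len(doc["structure_types"])
    # Criterion 2: links to other documents, enabling multi-document reasoning.
    link_score = len(doc["outgoing_links"])
    return structure_score + link_score

docs = [
    {"id": "d1", "structure_types": {"table", "list", "section"},
     "outgoing_links": ["d2", "d3"]},
    {"id": "d2", "structure_types": {"section"}, "outgoing_links": []},
]
# Keep the top-ranked documents for the domain.
top = sorted(docs, key=rank_score, reverse=True)
print([d["id"] for d in top])  # ['d1', 'd2']
```

In practice the two criteria could be weighted differently; a plain sum is the simplest instantiation.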
3.2 Document Graph Construction
Documents from different domains or sources come in vastly different formats (HTML, PDF, etc.). Towards building scalable dialog systems across domains, it is important to have a unified format for encoding the heterogeneous semantic structures in documents. Note that our goal is to preserve those structures in their document context. This is unlike knowledge graphs and event graphs (Fu et al., 2020; Ma et al., 2021; Hogan et al., 2021), in which only entities or events are extracted while other contextual information is discarded.
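As a rough sketch of what such a unified format might look like, each structural element can be represented as a graph node that keeps its text, its place in the intra-document hierarchy, and its cross-document links. The node types, field names, and example content below are illustrative assumptions, not the paper’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    node_type: str  # e.g., "section", "table", "list-item", "paragraph" (assumed types)
    text: str
    children: list = field(default_factory=list)  # child node ids (intra-document structure)
    links: list = field(default_factory=list)     # cross-document references

# A tiny graph preserving structure: a section containing a paragraph,
# with a link out to a node in another document.
graph = {
    "doc1/sec1": Node("doc1/sec1", "section", "Insurance claims",
                      children=["doc1/p1"]),
    "doc1/p1": Node("doc1/p1", "paragraph", "Submit form A ...",
                    links=["doc2/sec3"]),
    "doc2/sec3": Node("doc2/sec3", "section", "Form A instructions"),
}
```

Unlike a knowledge graph, the text spans stay attached to their nodes, so the surrounding structural context is preserved rather than discarded.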