2 Related Works
Our work is most closely related to document-grounded dialog systems (DGDS). Based on the conversation objective, related tasks can be roughly categorized into chitchat, comprehension, and information seeking.
Document-grounded chitchat datasets such as WoW (Dinan et al., 2019), Holl-E (Moghe et al., 2018), and CMU-DoG (Zhou et al., 2018) aim to enhance early chitchat systems by using information from grounded textual passages for answer generation. The goal is similar to that of an open chitchat system: the dialog agent tries to keep users engaged in long, informative, and interactive conversations.
This is different from our setting because users
of our system often have clear goals (information
needs), and the dialog agent needs to provide users
with accurate information as soon as possible.
For document-grounded “comprehension” tasks such as CoQA (Reddy et al., 2019), Abg-CoQA (Guo et al., 2021), and ShARC (Saeidi et al., 2018), the agent is given a textual paragraph and must answer users’ questions about it. This setting is similar to Machine Reading Comprehension (MRC), except that questions in MRC may not form a coherent dialog. Notably, Abg-CoQA and ShARC target specific questioning strategies. In Abg-CoQA, systems can ask clarifying questions to resolve different types of ambiguity; in ShARC, the authors created conversations in which the system learns to ask “yes/no” questions to elicit users’ information and provide appropriate answers.
The questioning strategy in ShARC is derived from textual rules that define the relationship between “conditions” and “solutions” in the given paragraph. Although we also address questioning strategies, our task is more challenging because we focus on multiple documents.
The third type of DGDS (Penha et al., 2019; Feng et al., 2020, 2021) is closest to our setting, where the agent needs to provide answers to information seekers in the shortest possible time. MANtIS (Penha et al., 2019) was collected from online forums, and the grounded documents are not given in advance. As a result, MANtIS does not come with the detailed annotations needed to study an agent’s ability to understand documents. In
contrast, given a set of documents, Doc2dial (Feng et al., 2020) and Multidoc2dial (Feng et al., 2021) were collected in two stages: 1) dialog flows are first generated by labeling and linking paragraphs; 2) crowdworkers then write conversations based on the suggested flows. Note that Multidoc2dial was built by rearranging dialogues from Doc2dial so that one conversation can contain information from
multiple documents. Although we follow similar steps in constructing our dataset, our dialog flow generation is fundamentally different: it addresses the coherence of the generated dialogues and the multi-document grounding issue by design. In addition, our dataset exceeds Doc2dial and Multidoc2dial in scale while highlighting new challenges such as under-specified user requests.
3 Dataset Collection
This section details the process of collecting Doc2Bot, which consists of four stages: 1) document collection, which selects target domains and documents; 2) document graph construction, which unifies heterogeneous structures from multiple domains to build document graphs; 3) dialog flow generation, which simulates the agenda of a user seeking information from a document graph; and 4) dialog collection, where crowdworkers write dialogs based on the generated dialog flows.
3.1 Document Collection
For document collection, we examine several potential domains and select five representative ones: public services, technology, insurance, health care services, and wikiHow. For each domain, documents are selected based on two criteria: 1) a document should be rich in structural types; 2) a document should link to other documents so that we can test the ability of machines to reason over multiple documents. We design a simple ranking score based on these criteria and select the top-ranked documents for each domain.
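The selection step above can be sketched as follows. This is a minimal illustration, assuming each criterion simply contributes a count; the exact scoring function is not specified here, and the field names (`structure_types`, `outgoing_links`) are hypothetical.

```python
def rank_score(doc):
    """Score a candidate document by structural richness and connectivity."""
    # Criterion 1: richness in structural types (e.g., tables, lists, sections).
    structure_score = len(doc["structure_types"])
    # Criterion 2: links to other documents, enabling multi-document reasoning.
    link_score = len(doc["outgoing_links"])
    return structure_score + link_score

docs = [
    {"id": "d1", "structure_types": {"table", "list", "section"},
     "outgoing_links": ["d2", "d3"]},
    {"id": "d2", "structure_types": {"section"}, "outgoing_links": []},
]
# Keep the top-ranked documents for the domain.
top = sorted(docs, key=rank_score, reverse=True)
print([d["id"] for d in top])  # ['d1', 'd2']
```

In practice the two criteria could be weighted differently; a plain sum is the simplest instantiation.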
3.2 Document Graph Construction
Documents from different domains or sources come in vastly different formats (HTML, PDF, etc.). Towards building scalable dialog systems across domains, it is important to have a unified format for encoding the heterogeneous semantic structures in documents. Note that our goal is to preserve those structures in their document context. This is unlike knowledge graphs and event graphs (Fu et al., 2020; Ma et al., 2021; Hogan et al., 2021), in which only entities or events are extracted while other contextual information is discarded.
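As a rough sketch of what such a unified format might look like, each structural element can be represented as a graph node that keeps its text, its place in the intra-document hierarchy, and its cross-document links. The node types, field names, and example content below are illustrative assumptions, not the paper’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    node_type: str  # e.g., "section", "table", "list-item", "paragraph" (assumed types)
    text: str
    children: list = field(default_factory=list)  # child node ids (intra-document structure)
    links: list = field(default_factory=list)     # cross-document references

# A tiny graph preserving structure: a section containing a paragraph,
# with a link out to a node in another document.
graph = {
    "doc1/sec1": Node("doc1/sec1", "section", "Insurance claims",
                      children=["doc1/p1"]),
    "doc1/p1": Node("doc1/p1", "paragraph", "Submit form A ...",
                    links=["doc2/sec3"]),
    "doc2/sec3": Node("doc2/sec3", "section", "Form A instructions"),
}
```

Unlike a knowledge graph, the text spans stay attached to their nodes, so the surrounding structural context is preserved rather than discarded.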