Are Current Task-oriented Dialogue Systems Able
to Satisfy Impolite Users?
Zhiqiang Hu, Roy Ka-Wei Lee, Nancy F. Chen
Singapore University of Technology and Design, Singapore
zhiqianghu@mymail.sutd.edu.sg, roylee@sutd.edu.sg
Institute of Infocomm Research (I2R), A*STAR, Singapore
nfychen@i2r.a-star.edu.sg
Abstract—Task-oriented dialogue (TOD) systems have assisted
users on many tasks, including ticket booking and service
inquiries. While existing TOD systems have shown promising per-
formance in serving customer needs, these systems mostly assume
that users would interact with the dialogue agent politely. This
assumption is unrealistic as impatient or frustrated customers
may also interact with TOD systems impolitely. This paper aims
to address this research gap by investigating impolite users’
effects on TOD systems. Specifically, we constructed an impolite
dialogue corpus and conducted extensive experiments to evaluate
the state-of-the-art TOD systems on our impolite dialogue corpus.
Our experimental results show that existing TOD systems are
unable to handle impolite user utterances. We also present
a data augmentation method to improve TOD performance
in impolite dialogues. Nevertheless, handling impolite dialogues
remains a very challenging research task. We hope by releasing
the impolite dialogue corpus and establishing the benchmark
evaluations, more researchers are encouraged to investigate this
new challenging research task.
Index Terms—Task-oriented dialogue systems, impolite users,
data augmentation.
I. INTRODUCTION
Motivation. Task-oriented dialogue (TOD) systems play a
vital role in many businesses and service operations. Specif-
ically, these systems are deployed to assist users with spe-
cific tasks such as ticket booking and restaurant reservations
through natural language conversations. TOD systems are
usually built through a pipeline architecture that consists of
four sequential modules, including natural language under-
standing (NLU), dialogue state tracking (DST), policy learning
(POL), and natural language generation (NLG) [1]–[4]. More
recently, researchers have also explored leveraging large pre-
trained language models to improve the performance of TOD
systems [5]–[7]. These TOD systems have demonstrated their
effectiveness in understanding and responding to the users’
needs through conversations.
As most TOD systems are developed to serve and assist
humans in performing specific tasks, the politeness of the
TOD systems remains a key design consideration. For instance,
Gupta et al. [8] presented POLLy (Politeness for Language
Learning), a system that combines a spoken language genera-
tor with an AI Planner to model Brown and Levinson’s theory
of politeness in TOD. Bothe et al. [9] developed a dialogue-
based navigation approach incorporating politeness and so-
ciolinguistic features for robotic behavioral modeling. More
Fig. 1. Examples of two dialogue interactions between PPTOD and two types
of users: normal user (top) and impolite user (bottom).
User: "What is the phone number and address?"
Bot (PPTOD): "The phone number for the Portuguese restaurant is 01223 361355
and the address is Cambridge Leisure Park Clifton Way."
Impolite user: "phone number and address? Do I really have to chase you for
every single detail?"
Bot (PPTOD): "I'm sorry, I don't."
recently, Mishra et al. [10] proposed a politeness adaptive
dialogue system (PADS) that can interact with users politely
and showcases empathy.
Nevertheless, the above studies have focused on generating
polite dialogues and ignored the users' politeness (or impoliteness)
in the conversation. Therefore, it is unclear how TOD
systems would respond when users interact with them
in an impolite manner, especially when the users are
in a rush to get information or are frustrated because the TOD
systems provide irrelevant responses. Considering the example in Figure 1,
we notice that the TOD system PPTOD [11] is able to provide
a proper response to a user who presents the question in
a normal or polite manner. However, when encountering an
impolite user, PPTOD is not able to provide a proper and
relevant response. Ideally, TOD systems should be robust
in handling user requests regardless of the users' politeness.
A straightforward approach to improving TOD systems’
ability to handle impolite users is to train the dialogue systems
with impolite user utterances. Unfortunately, most of the
existing TOD datasets [12]–[15] only capture user utterances
that are neutral or polite. The lack of an impolite dialogue
dataset also limits the evaluation of TOD systems; to the
best of our knowledge, there are no existing studies on the
robustness of TOD systems in handling problematic users.
Research Objectives. To address the research gaps, we aim
to investigate the effects of impolite user utterances on TOD
systems. Working towards this goal, we collect and annotate
an impolite dialogue corpus by manually rewriting the user
utterances of the MultiWOZ 2.2 dataset [13]. Specifically,
arXiv:2210.12942v1 [cs.CL] 24 Oct 2022
human annotators are recruited to rewrite the user utterances
with role-playing scenarios that could encourage impolite user
utterances. For example, "imagine you are in a rush and
frustrated that the system has given the wrong response for
the second time." In total, the human annotators have rewritten
over 10K impolite user utterances. Statistical and linguistic
analyses of the impolite user utterances are also performed to
understand the constructed dataset better.
The impolite dialogue corpus is subsequently used to eval-
uate the performance and limitations of state-of-the-art TOD
systems. Specifically, we have designed experiments to evalu-
ate the robustness of TOD systems in handling impolite users
and understand the effects of impolite user utterances on these
systems. We have also explored solutions to improve TOD
systems’ robustness in handling impolite users. A possible
solution is to train the TOD systems with more data. However,
the construction of a large-scale impolite dialogue corpus is a
laborious and expensive process. Therefore, we propose a data
augmentation method that utilizes text style transfer techniques
to improve TOD systems’ performance in impolite dialogues.
Contributions. We summarize our contributions as follows:
• We collect and annotate an impolite dialogue corpus to
support the evaluation of TOD systems' performance
when handling impolite users. We hope that the impolite
dialogue dataset will encourage researchers to propose
TOD systems that are robust in handling users' requests.
• We evaluate the performance of six state-of-the-art TOD
systems using our impolite dialogue corpus. The evaluation
results show that existing TOD systems have
difficulty handling impolite users' requests.
• We propose a simple data augmentation method that
utilizes text style transfer techniques to improve TOD
systems' performance in impolite dialogues.
II. RELATED WORK
1) Task-Oriented Dialogue Systems: With the rapid ad-
vancement of deep learning techniques, TOD systems have
shown promising performance in handling user requests and
interactions. Recent studies have focused on end-to-end TOD
systems to train a general mapping from user utterance to the
system’s natural language response [16]–[20]. Yang et al. [5]
proposed UBAR by fine-tuning the large pre-trained unidirec-
tional language model GPT-2 [21] on the entire dialog session
sequence. The dialogue session consists of user utterances,
belief states, database results, system actions, and system re-
sponses of every dialog turn. Su et al. [11] proposed PPTOD to
effectively leverage pre-trained language models with a multi-
task pre-training strategy that increases the model’s ability
with heterogeneous dialogue corpora. Lin et al. [7] proposed
Minimalist Transfer Learning (MinTL) to plug-and-play large-
scale pre-trained models for domain transfer in dialogue task
completion. Zang et al. [13] proposed the LABES model,
which treated the dialogue states as discrete latent variables to
reduce the reliance on turn-level DST labels. Kulhánek et al.
[22] proposed AuGPT with modified training objectives for
language model fine-tuning and data augmentation via back-
translation [23] to increase the diversity of the training data.
Existing studies have also leveraged knowledge bases to track
pivotal and critical information required in generating TOD
system agent’s responses [24]–[27]. For instance, Madotto
et al. [28] dynamically updated a knowledge base via fine-
tuning by directly embedding it into the model parameters.
Other studies have also explored reinforcement learning to
build TOD systems [29]–[31]. For instance, Zhao et al. [29]
utilized a Deep Recurrent Q-Networks (DRQN) for building
TOD systems.
2) Modeling Politeness in Dialogue Systems: Recent stud-
ies have also attempted to improve dialogue systems to gen-
erate responses in a more empathetic manner [32]–[35]. Yu et
al. [33] proposed to include user sentiment obtained through
multimodal information (acoustic, dialogic, and textual) in
the end-to-end learning framework to make TOD systems
more user-adaptive and effective. Feng et al. [34] constructed
a corpus containing task-oriented dialogues with emotion
labels for emotion recognition in TOD systems. However, the
impoliteness of users is not modeled as too few instances
exist in the MultiWOZ dataset. The lack of impolite dialogue
data motivates us to construct an impolite dialogue corpus to
facilitate downstream analysis.
Politeness is a human virtue and a crucial aspect of com-
munication [36], [37]. Danescu-Niculescu-Mizil et al. [37]
proposed a computational framework to identify the linguistic
aspects of politeness with application to social factors. Re-
searchers have also attempted to model and include politeness
in TOD systems [38]–[40]. For instance, Golchha et al. [38]
utilized a reinforced pointer generator network to transform
a generic response into a polite response. More recently,
Madaan et al. [40] adopted a text style transfer approach
to generate polite sentences while preserving the intended
content. Nevertheless, most of these studies have focused
on generating polite responses, neglecting the handling of
impolite inputs, i.e., impolite user utterances. This study aims
to fill this research gap by extensively evaluating state-of-the-
art TOD systems’ ability to handle impolite dialogues.
3) Data Augmentation in Dialogue Systems: Data augmen-
tation, which aims to enlarge training data size in machine
learning systems, is a common solution to the data scarcity
problem. Data augmentation methods have also been widely
used in dialogue systems [22], [41], [42]. For instance, Kurata
et al. [43] trained an encoder-decoder to reconstruct the utter-
ances in training data. To augment training data, the encoder’s
output hidden states are perturbed randomly to yield different
utterances. Hou et al. [41] proposed a sequence-to-sequence
generation-based data augmentation framework that models
relations between utterances of the same semantic frame in
the training data. Gritta et al. [42] proposed the Conversation
Graph (ConvGraph), which is a graph-based representation of
dialogues, to augment data volume and diversity by generating
dialogue paths. In this paper, we propose a simple data aug-
mentation method that utilizes text style transfer techniques to
generate impolite user utterances for training data to improve
TOD systems’ performance in dealing with impolite users.
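The augmentation loop described above can be sketched as follows. Note that `transfer_to_impolite` below is a hypothetical, keyword-based stand-in for a trained polite-to-impolite style transfer model; the real method would use a learned model, and the function names here are illustrative assumptions, not the paper's implementation.

```python
import random

# Hypothetical stand-in for a trained polite-to-impolite text style
# transfer model. A real system would rewrite the whole sentence while
# preserving its content; here we only prepend an impolite marker.
IMPOLITE_PREFIXES = [
    "Hurry up. ",
    "Do I really have to spell it out? ",
    "Ugh, again: ",
]

def transfer_to_impolite(utterance: str) -> str:
    """Toy style transfer: prepend an impolite marker to the utterance."""
    return random.choice(IMPOLITE_PREFIXES) + utterance

def augment_dialogues(dialogues, ratio=0.5, seed=0):
    """Return training data enlarged with style-transferred copies.

    dialogues: list of (user_utterance, system_response) pairs.
    ratio: fraction of pairs to duplicate with an impolite user side.
    """
    random.seed(seed)
    n = int(len(dialogues) * ratio)
    sampled = random.sample(dialogues, n)
    # Only the user side is style-transferred; responses are kept intact.
    augmented = [(transfer_to_impolite(u), r) for u, r in sampled]
    return dialogues + augmented

data = [("What is the phone number?", "It is 01223 361355.")]
print(len(augment_dialogues(data, ratio=1.0)))  # 2: original plus one copy
```

The system response is left unchanged so that the TOD model learns to produce the same correct answer regardless of the user's tone.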
III. IMPOLITE DIALOGUE CORPUS
We construct an impolite dialogue dataset to support our
evaluation of TOD systems’ ability to interpret and respond
to impolite users. Specifically, we recruited eight native English
speakers to rewrite the user utterances in the MultiWOZ
2.2 dataset [13] in an impolite manner. To the best of our
knowledge, this is the first impolite task-oriented dialogue
corpus. In the subsequent sections, we will discuss the corpus
construction process and provide a preliminary analysis of the
constructed impolite dialogue corpus.
A. Corpus Construction
MultiWOZ 2.2 [13] is a large-scale multi-domain task-
oriented dialogue benchmark that contains dialogues in seven
domains, including attraction, hotel, hospital, bus, restaurant,
train, and taxi. This dataset is also popular and commonly
used to evaluate existing TOD systems [5], [7], [11], [17],
[18], [22]. We performed a preliminary analysis using the
Stanford Politeness classifier trained on Wikipedia requests
data [37] to assign a politeness score to the user utterances in
MultiWOZ 2.2. We found that 99% of the user utterances are
classified as polite. Hence, we aim to rewrite the user utterances
in MultiWOZ 2.2 to create our impolite dialogue corpus.
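The preliminary audit can be sketched as follows. The study uses the Stanford politeness classifier trained on Wikipedia requests [37]; the `politeness_score` function below is a simple keyword-based stand-in for illustration only, and its marker list and threshold are assumptions, not the actual classifier.

```python
# Minimal sketch of the preliminary politeness audit over user utterances.
# politeness_score is a toy stand-in for the Stanford politeness classifier.
POLITE_MARKERS = ("please", "thank", "could you", "would you")

def politeness_score(utterance: str) -> float:
    """Return a score in [0, 1]; a score of 0.5 or more counts as polite."""
    text = utterance.lower()
    hits = sum(marker in text for marker in POLITE_MARKERS)
    return min(1.0, 0.5 + 0.25 * hits) if hits else 0.4

def polite_fraction(utterances):
    """Fraction of utterances classified as polite."""
    polite = [u for u in utterances if politeness_score(u) >= 0.5]
    return len(polite) / len(utterances)

sample = [
    "Could you please book a table for two? Thank you!",
    "I need a train to Cambridge, please.",
]
print(polite_fraction(sample))  # 1.0 for this toy sample
```

Running this kind of audit over MultiWOZ 2.2 is what revealed that 99% of its user utterances are classified as polite, motivating the rewriting effort.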
Impolite Rewriting. The goal is to rewrite the user ut-
terance in the MultiWOZ 2.2 dataset and present the user
utterance in a rude and impolite manner. We randomly sampled
a subset of dialogues from MultiWOZ 2.2 for rewriting.
Next, we recruited eight native English speakers to rewrite
the user utterances. For each user utterance, the annotators
are presented with the entire dialogue history to have the
conversation’s overall context. The annotators are tasked to
rewrite the user utterances with three objectives: (i) the
rewritten sentences should be impolite, (ii) the content of
the rewritten sentence should be semantically close to the
original sentence, and (iii) the rewritten sentences should be
fluent. To further encourage the diversity of the impolite user
utterance, we also prescribed six role-playing scenarios to aid
the annotators in the rewriting tasks. For instance, we asked
the annotators to imagine they were customers in a bad mood
or impatient customers who wanted to get the information
quickly. The details of the role-playing scenarios are shown
in Table I, and the annotation system interface is included in
the Appendix A-A.
Annotation Quality Control. Impoliteness is subjective,
and the annotators may have different interpretations of im-
politeness. Therefore, we implement iterative checkpoints to
evaluate the quality of the rewritten user utterance. Specifi-
cally, we conducted peer evaluation at various checkpoints to
allow annotators to rate the quality of each other’s rewritten
sentences. The annotators are tasked to rate the rewritten user
utterance based on the following three criteria:
• Politeness. Rate the sentence's politeness using a 5-point
Likert scale. 1: strongly opined that the sentence is
impolite; 5: strongly opined that the sentence is polite.
• Content Preservation. Compare the original user utterance
and the rewritten sentence, and rate the amount of content
preserved in the rewritten sentence using a 5-point Likert
scale. 1: the original and rewritten sentences have very
different content; 5: the original and rewritten sentences
have the same content.
• Fluency. Rate the fluency of the rewritten sentence using
a 5-point Likert scale. 1: unreadable with too many
grammatical errors; 5: perfectly fluent sentence.
TABLE I
Role-playing scenarios for impolite user annotation.
No. Scenario
1   Imagine that the customer is a sarcastic person in a bad mood.
2   Imagine that the customer is impatient and wants to get the
    information fast.
3   Imagine that the customer is in a bad mood as something bad has
    just happened (e.g., just had an argument with friends or spouse).
4   Imagine that the customer is tired and hungry after a long-haul
    flight and needs to get this information fast.
5   Imagine that the customer is a spoilt brat with a lot of money.
6   Imagine that the customer is getting help from the CSA for the
    third time and did not get the right information previously.
Each user utterance is evaluated by two annotators. Nevertheless,
we recognize that it is unnatural for all utterances in a
dialogue to be impolite. Thus, we consider a dialogue
impolite if at least 50% of the user utterances in the conversation
are rated as impolite (i.e., with a Politeness score of 2 or less). This
exercise allows the annotators to align their understanding of
the rewriting task. The annotators will revise the unqualified
dialogues until they are rated impolite in the peer evaluation.
While the annotators are tasked to assess each other's work,
they are unaware of their own evaluation scores, which mitigates
biases. As a result, the annotators might learn new ways to write
impolite dialogues from each other, but they do not write
to "optimize" any assessment scores in the human evaluation.
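The dialogue-level acceptance rule described above can be expressed as a small decision function. This is a sketch under the stated assumptions (a dialogue is impolite when at least 50% of its user utterances receive a Politeness rating of 2 or less); the function name and signature are illustrative, not the paper's code.

```python
def is_impolite_dialogue(politeness_ratings, threshold=0.5):
    """Decide whether a rewritten dialogue counts as impolite.

    politeness_ratings: one 5-point Likert politeness score per user
    utterance in the dialogue. An utterance is rated impolite if its
    score is 2 or less; the dialogue is impolite when at least
    `threshold` of its utterances are rated impolite.
    """
    impolite = sum(1 for score in politeness_ratings if score <= 2)
    return impolite / len(politeness_ratings) >= threshold

print(is_impolite_dialogue([1, 2, 4, 2]))  # 3/4 impolite -> True
print(is_impolite_dialogue([4, 5, 2, 3]))  # 1/4 impolite -> False
```

Dialogues that fail this check are sent back to the annotators for revision until the peer evaluation rates them impolite.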
B. Corpus Analysis
In total, the annotators rewrote 1,573 dialogues, comprising
10,667 user utterances. Table II shows the distributions of the
MultiWOZ 2.2 dataset and our impolite dialogue corpus. As
we have sampled a substantial number of dialogues from
MultiWOZ 2.2, we notice that the rewritten impolite dialogues
follow similar domain distributions as the original dataset.
Table III shows the results of the final peer evaluation
of all rewritten impolite user utterances. Specifically, the
average politeness, content preservation, and fluency scores of
the rewritten impolite user utterances are reported. The high
average content preservation and fluency scores suggest that
the rewritten utterances are of high quality and retain the original
users' intentions in the conversations. More importantly, the average
politeness score is 1.96, indicating that most of the rewrites
are impolite but not overly "offensive". We further examine and
show the politeness score distribution of the rewritten impolite
dialogue in Figure 2. Note that the politeness score of dialogue