Are Current Task-oriented Dialogue Systems Able
to Satisfy Impolite Users?
Zhiqiang Hu, Roy Ka-Wei Lee, Nancy F. Chen
Singapore University of Technology and Design, Singapore
zhiqianghu@mymail.sutd.edu.sg, roylee@sutd.edu.sg
Institute of Infocomm Research (I2R), A*STAR, Singapore
nfychen@i2r.a-star.edu.sg
Abstract—Task-oriented dialogue (TOD) systems have assisted
users on many tasks, including ticket booking and service
inquiries. While existing TOD systems have shown promising per-
formance in serving customer needs, these systems mostly assume
that users would interact with the dialogue agent politely. This
assumption is unrealistic as impatient or frustrated customers
may also interact with TOD systems impolitely. This paper aims
to address this research gap by investigating impolite users’
effects on TOD systems. Specifically, we constructed an impolite
dialogue corpus and conducted extensive experiments to evaluate
the state-of-the-art TOD systems on our impolite dialogue corpus.
Our experimental results show that existing TOD systems are
unable to handle impolite user utterances. We also present
a data augmentation method to improve TOD performance
in impolite dialogues. Nevertheless, handling impolite dialogues
remains a very challenging research task. We hope by releasing
the impolite dialogue corpus and establishing the benchmark
evaluations, more researchers are encouraged to investigate this
new challenging research task.
Index Terms—Task-oriented dialogue systems, impolite users,
data augmentation.
I. INTRODUCTION
Motivation. Task-oriented dialogue (TOD) systems play a
vital role in many businesses and service operations. Specif-
ically, these systems are deployed to assist users with spe-
cific tasks such as ticket booking and restaurant reservations
through natural language conversations. TOD systems are
usually built through a pipeline architecture that consists of
four sequential modules, including natural language under-
standing (NLU), dialogue state tracking (DST), policy learning
(POL), and natural language generation (NLG) [1]–[4]. More
recently, researchers have also explored leveraging large pre-
trained language models to improve the performance of TOD
systems [5]–[7]. These TOD systems have demonstrated their
effectiveness in understanding and responding to the users’
needs through conversations.
As most TOD systems are developed to serve and assist
humans in performing specific tasks, the politeness of the
TOD systems remains a key design consideration. For instance,
Gupta et al. [8] presented POLLy (Politeness for Language
Learning), a system that combines a spoken language genera-
tor with an AI Planner to model Brown and Levinson’s theory
of politeness in TOD. Bothe et al. [9] developed a dialogue-
based navigation approach incorporating politeness and so-
ciolinguistic features for robotic behavioral modeling. More
Fig. 1. Examples of two dialogue interactions between PPTOD and two types
of users: normal user (top) and impolite user (bottom).
User: "What is the phone number and address?"
Bot (PPTOD): "The phone number for the Portuguese restaurant is 01223 361355
and the address is Cambridge Leisure Park Clifton Way."
Impolite user: "phone number and address? Do I really have to chase you for
every single detail?"
Bot (PPTOD): "I'm sorry, I don't."
recently, Mishra et al. [10] proposed a politeness adaptive
dialogue system (PADS) that can interact with users politely
and showcases empathy.
Nevertheless, the above studies have focused on generating
polite dialogues and ignored the users' politeness (or impoliteness)
in the conversation. Therefore, it is unclear how TOD
systems would respond when users interact with them
in an impolite manner, especially when the users are
in a rush to get information or are frustrated because the TOD
systems provide irrelevant responses. Considering the example in Figure 1,
we notice that the TOD system PPTOD [11] is able to provide
a proper response to a user who presents the question in
a normal or polite manner. However, when encountering an
impolite user, PPTOD is not able to provide a proper and
relevant response. Ideally, TOD systems should be robust
in handling user requests regardless of the users' politeness.
A straightforward approach to improving TOD systems’
ability to handle impolite users is to train the dialogue systems
with impolite user utterances. Unfortunately, most of the
existing TOD datasets [12]–[15] only capture user utterances
that are neutral or polite. The lack of an impolite dialogue
dataset also limits the evaluation of TOD systems; to the
best of our knowledge, there are no existing studies on the
robustness of TOD systems in handling problematic users.
Research Objectives. To address the research gaps, we aim
to investigate the effects of impolite user utterances on TOD
systems. Working towards this goal, we collect and annotate
an impolite dialogue corpus by manually rewriting the user
utterances of the MultiWOZ 2.2 dataset [13]. Specifically,
arXiv:2210.12942v1 [cs.CL] 24 Oct 2022
human annotators are recruited to rewrite the user utterances
with role-playing scenarios that could encourage impolite user
utterances. For example, "imagine you are in a rush and
frustrated that the system has given the wrong response for
the second time." In total, the human annotators have rewritten
over 10K impolite user utterances. Statistical and linguistic
analyses of the impolite user utterances are also performed to
understand the constructed dataset better.
The impolite dialogue corpus is subsequently used to eval-
uate the performance and limitations of state-of-the-art TOD
systems. Specifically, we have designed experiments to evalu-
ate the robustness of TOD systems in handling impolite users
and understand the effects of impolite user utterances on these
systems. We have also explored solutions to improve TOD
systems’ robustness in handling impolite users. A possible
solution is to train the TOD systems with more data. However,
the construction of a large-scale impolite dialogue corpus is a
laborious and expensive process. Therefore, we propose a data
augmentation method that utilizes text style transfer techniques
to improve TOD systems’ performance in impolite dialogues.
Contributions. We summarize our contributions as follows:
• We collect and annotate an impolite dialogue corpus to
support the evaluation of TOD systems' performance
when handling impolite users. We hope that the impolite
dialogue dataset will encourage researchers to propose
TOD systems that are robust in handling users' requests.
• We evaluate the performance of six state-of-the-art TOD
systems using our impolite dialogue corpus. The evaluation
results show that existing TOD systems have
difficulty handling impolite users' requests.
• We propose a simple data augmentation method that
utilizes text style transfer techniques to improve TOD
systems' performance in impolite dialogues.
II. RELATED WORK
1) Task-Oriented Dialogue Systems: With the rapid ad-
vancement of deep learning techniques, TOD systems have
shown promising performance in handling user requests and
interactions. Recent studies have focused on end-to-end TOD
systems to train a general mapping from user utterance to the
system’s natural language response [16]–[20]. Yang et al. [5]
proposed UBAR by fine-tuning the large pre-trained unidirec-
tional language model GPT-2 [21] on the entire dialog session
sequence. The dialogue session consists of user utterances,
belief states, database results, system actions, and system re-
sponses of every dialog turn. Su et al. [11] proposed PPTOD to
effectively leverage pre-trained language models with a multi-
task pre-training strategy that increases the model’s ability
with heterogeneous dialogue corpora. Lin et al. [7] proposed
Minimalist Transfer Learning (MinTL) to plug-and-play large-
scale pre-trained models for domain transfer in dialogue task
completion. Zang et al. [13] proposed the LABES model,
which treated the dialogue states as discrete latent variables to
reduce the reliance on turn-level DST labels. Kulhánek et al.
[22] proposed AuGPT with modified training objectives for
language model fine-tuning and data augmentation via back-
translation [23] to increase the diversity of the training data.
Existing studies have also leveraged knowledge bases to track
pivotal and critical information required in generating TOD
system agent’s responses [24]–[27]. For instance, Madotto
et al. [28] dynamically updated a knowledge base via fine-
tuning by directly embedding it into the model parameters.
Other studies have also explored reinforcement learning to
build TOD systems [29]–[31]. For instance, Zhao et al. [29]
utilized a Deep Recurrent Q-Networks (DRQN) for building
TOD systems.
2) Modeling Politeness in Dialogue Systems: Recent stud-
ies have also attempted to improve dialogue systems to gen-
erate responses in a more empathetic manner [32]–[35]. Yu et
al. [33] proposed to include user sentiment obtained through
multimodal information (acoustic, dialogic, and textual) in
the end-to-end learning framework to make TOD systems
more user-adaptive and effective. Feng et al. [34] constructed
a corpus containing task-oriented dialogues with emotion
labels for emotion recognition in TOD systems. However, the
impoliteness of users is not modeled as too few instances
exist in the MultiWOZ dataset. The lack of impolite dialogue
data motivates us to construct an impolite dialogue corpus to
facilitate downstream analysis.
Politeness is a human virtue and a crucial aspect of com-
munication [36], [37]. Danescu-Niculescu-Mizil et al. [37]
proposed a computational framework to identify the linguistic
aspects of politeness with application to social factors. Re-
searchers have also attempted to model and include politeness
in TOD systems [38]–[40]. For instance, Golchha et al. [38]
utilized a reinforced pointer generator network to transform
a generic response into a polite response. More recently,
Madaan et al. [40] adopted a text style transfer approach
to generate polite sentences while preserving the intended
content. Nevertheless, most of these studies have focused
on generating polite responses, neglecting the handling of
impolite inputs, i.e., impolite user utterances. This study aims
to fill this research gap by extensively evaluating state-of-the-
art TOD systems’ ability to handle impolite dialogues.
3) Data Augmentation in Dialogue Systems: Data augmen-
tation, which aims to enlarge training data size in machine
learning systems, is a common solution to the data scarcity
problem. Data augmentation methods have also been widely
used in dialogue systems [22], [41], [42]. For instance, Kurata
et al. [43] trained an encoder-decoder to reconstruct the utter-
ances in training data. To augment training data, the encoder’s
output hidden states are perturbed randomly to yield different
utterances. Hou et al. [41] proposed a sequence-to-sequence
generation-based data augmentation framework that models
relations between utterances of the same semantic frame in
the training data. Gritta et al. [42] proposed the Conversation
Graph (ConvGraph), which is a graph-based representation of
dialogues, to augment data volume and diversity by generating
dialogue paths. In this paper, we propose a simple data aug-
mentation method that utilizes text style transfer techniques to
generate impolite user utterances for training data to improve
TOD systems’ performance in dealing with impolite users.
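The augmentation loop described above can be sketched as follows. Note that `transfer_to_impolite` below is a hypothetical, keyword-based stand-in for a trained polite-to-impolite style transfer model; the real method would use a learned model, and the function names here are illustrative assumptions, not the paper's implementation.

```python
import random

# Hypothetical stand-in for a trained polite-to-impolite text style
# transfer model. A real system would rewrite the whole sentence while
# preserving its content; here we only prepend an impolite marker.
IMPOLITE_PREFIXES = [
    "Hurry up. ",
    "Do I really have to spell it out? ",
    "Ugh, again: ",
]

def transfer_to_impolite(utterance: str) -> str:
    """Toy style transfer: prepend an impolite marker to the utterance."""
    return random.choice(IMPOLITE_PREFIXES) + utterance

def augment_dialogues(dialogues, ratio=0.5, seed=0):
    """Return training data enlarged with style-transferred copies.

    dialogues: list of (user_utterance, system_response) pairs.
    ratio: fraction of pairs to duplicate with an impolite user side.
    """
    random.seed(seed)
    n = int(len(dialogues) * ratio)
    sampled = random.sample(dialogues, n)
    # Only the user side is style-transferred; responses are kept intact.
    augmented = [(transfer_to_impolite(u), r) for u, r in sampled]
    return dialogues + augmented

data = [("What is the phone number?", "It is 01223 361355.")]
print(len(augment_dialogues(data, ratio=1.0)))  # 2: original plus one copy
```

The system response is left unchanged so that the TOD model learns to produce the same correct answer regardless of the user's tone.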
III. IMPOLITE DIALOGUE CORPUS
We construct an impolite dialogue dataset to support our
evaluation of TOD systems’ ability to interpret and respond
to impolite users. Specifically, we recruited eight native English
speakers to rewrite the user utterances in the MultiWOZ
2.2 dataset [13] in an impolite manner. To the best of our
knowledge, this is the first impolite task-oriented dialogue
corpus. In the subsequent sections, we will discuss the corpus
construction process and provide a preliminary analysis of the
constructed impolite dialogue corpus.
A. Corpus Construction
MultiWOZ 2.2 [13] is a large-scale multi-domain task-
oriented dialogue benchmark that contains dialogues in seven
domains, including attraction, hotel, hospital, bus, restaurant,
train, and taxi. This dataset is also popular and commonly
used to evaluate existing TOD systems [5], [7], [11], [17],
[18], [22]. We performed a preliminary analysis using the
Stanford Politeness classifier trained on Wikipedia requests
data [37] to assign a politeness score to the user utterances in
MultiWOZ 2.2. We found that 99% of the user utterances are
classified as polite. Hence, we aim to rewrite the user utterances
in MultiWOZ 2.2 to create our impolite dialogue corpus.
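The preliminary audit can be sketched as follows. The study uses the Stanford politeness classifier trained on Wikipedia requests [37]; the `politeness_score` function below is a simple keyword-based stand-in for illustration only, and its marker list and threshold are assumptions, not the actual classifier.

```python
# Minimal sketch of the preliminary politeness audit over user utterances.
# politeness_score is a toy stand-in for the Stanford politeness classifier.
POLITE_MARKERS = ("please", "thank", "could you", "would you")

def politeness_score(utterance: str) -> float:
    """Return a score in [0, 1]; a score of 0.5 or more counts as polite."""
    text = utterance.lower()
    hits = sum(marker in text for marker in POLITE_MARKERS)
    return min(1.0, 0.5 + 0.25 * hits) if hits else 0.4

def polite_fraction(utterances):
    """Fraction of utterances classified as polite."""
    polite = [u for u in utterances if politeness_score(u) >= 0.5]
    return len(polite) / len(utterances)

sample = [
    "Could you please book a table for two? Thank you!",
    "I need a train to Cambridge, please.",
]
print(polite_fraction(sample))  # 1.0 for this toy sample
```

Running this kind of audit over MultiWOZ 2.2 is what revealed that 99% of its user utterances are classified as polite, motivating the rewriting effort.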
Impolite Rewriting. The goal is to rewrite the user ut-
terance in the MultiWOZ 2.2 dataset and present the user
utterance in a rude and impolite manner. We randomly sampled
a subset of dialogues from MultiWOZ 2.2 for rewriting.
Next, we recruited eight native English speakers to rewrite
the user utterances. For each user utterance, the annotators
are presented with the entire dialogue history to have the
conversation’s overall context. The annotators are tasked to
rewrite the user utterances with three objectives: (i) the
rewritten sentences should be impolite, (ii) the content of
the rewritten sentence should be semantically close to the
original sentence, and (iii) the rewritten sentences should be
fluent. To further encourage the diversity of the impolite user
utterance, we also prescribed six role-playing scenarios to aid
the annotators in the rewriting tasks. For instance, we asked
the annotators to imagine they were customers in a bad mood
or impatient customers who wanted to get the information
quickly. The details of the role-playing scenarios are shown
in Table I, and the annotation system interface is included in
the Appendix A-A.
Annotation Quality Control. Impoliteness is subjective,
and the annotators may have different interpretations of im-
politeness. Therefore, we implement iterative checkpoints to
evaluate the quality of the rewritten user utterance. Specifi-
cally, we conducted peer evaluation at various checkpoints to
allow annotators to rate the quality of each other’s rewritten
sentences. The annotators are tasked to rate the rewritten user
utterance based on the following three criteria:
• Politeness. Rate the sentence's politeness using a 5-point
Likert scale. 1: strongly opined that the sentence is
impolite; 5: strongly opined that the sentence is polite.
• Content Preservation. Compare the original user utterance
and the rewritten sentence, and rate the amount of content
preserved in the rewritten sentence using a 5-point Likert
scale. 1: the original and rewritten sentences have very
different content; 5: the original and rewritten sentences
have the same content.
• Fluency. Rate the fluency of the rewritten sentence using
a 5-point Likert scale. 1: unreadable with too many
grammatical errors; 5: perfectly fluent sentence.
TABLE I
Role-playing scenarios for impolite user annotation.
No. Scenario
1   Imagine that the customer is a sarcastic person in a bad mood.
2   Imagine that the customer is impatient and wants to get the
    information fast.
3   Imagine that the customer is in a bad mood as something bad has
    just happened (e.g., just had an argument with friends or spouse).
4   Imagine that the customer is tired and hungry after a long-haul
    flight and needs to get this information fast.
5   Imagine that the customer is a spoilt brat with a lot of money.
6   Imagine that the customer is getting help from the CSA for the
    third time and did not get the right information previously.
Each user utterance is evaluated by two annotators. Nevertheless,
we recognize that it is unnatural for all utterances in a
dialogue to be impolite. Thus, we consider a dialogue
impolite if at least 50% of the user utterances in the conversation
are rated as impolite (i.e., with a Politeness score of 2 or less). This
exercise allows the annotators to align their understanding of
the rewriting task. The annotators will revise the unqualified
dialogues until they are rated impolite in the peer evaluation.
While the annotators are tasked to assess each other's work,
they are unaware of their own evaluation scores, which mitigates
biases. As a result, the annotators might learn new ways to write
impolite dialogues from each other, but they do not write
to "optimize" any assessment scores in the human evaluation.
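The dialogue-level acceptance rule described above can be expressed as a small decision function. This is a sketch under the stated assumptions (a dialogue is impolite when at least 50% of its user utterances receive a Politeness rating of 2 or less); the function name and signature are illustrative, not the paper's code.

```python
def is_impolite_dialogue(politeness_ratings, threshold=0.5):
    """Decide whether a rewritten dialogue counts as impolite.

    politeness_ratings: one 5-point Likert politeness score per user
    utterance in the dialogue. An utterance is rated impolite if its
    score is 2 or less; the dialogue is impolite when at least
    `threshold` of its utterances are rated impolite.
    """
    impolite = sum(1 for score in politeness_ratings if score <= 2)
    return impolite / len(politeness_ratings) >= threshold

print(is_impolite_dialogue([1, 2, 4, 2]))  # 3/4 impolite -> True
print(is_impolite_dialogue([4, 5, 2, 3]))  # 1/4 impolite -> False
```

Dialogues that fail this check are sent back to the annotators for revision until the peer evaluation rates them impolite.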
B. Corpus Analysis
In total, the annotators rewrote 1,573 dialogues, comprising
10,667 user utterances. Table II shows the distributions of the
MultiWOZ 2.2 dataset and our impolite dialogue corpus. As
we have sampled a substantial number of dialogues from
MultiWOZ 2.2, we notice that the rewritten impolite dialogues
follow similar domain distributions as the original dataset.
Table III shows the results of the final peer evaluation
of all rewritten impolite user utterances. Specifically, the
average politeness, content preservation, and fluency scores of
the rewritten impolite user utterances are reported. The high
average content preservation and fluency scores suggest that
the rewritten utterances are of high quality and retain the original
users' intentions in the conversations. More importantly, the average
politeness score is 1.96, indicating that most of the rewrites
are impolite but not overly "offensive". We further examine and
show the politeness score distribution of the rewritten impolite
dialogue in Figure 2. Note that the politeness score of dialogue