human annotators are recruited to rewrite the user utterances
with role-playing scenarios that could encourage impolite user
utterances. For example, “imagine you are in a rush and
frustrated that the system has given the wrong response for
the second time.” In total, the human annotators have rewritten
over 10K impolite user utterances. Statistical and linguistic
analyses of the impolite user utterances are also performed to
understand the constructed dataset better.
The impolite dialogue corpus is subsequently used to eval-
uate the performance and limitations of state-of-the-art TOD
systems. Specifically, we have designed experiments to evalu-
ate the robustness of TOD systems in handling impolite users
and understand the effects of impolite user utterances on these
systems. We have also explored solutions to improve TOD
systems’ robustness in handling impolite users. A possible
solution is to train the TOD systems with more data. However,
the construction of a large-scale impolite dialogue corpus is a
laborious and expensive process. Therefore, we propose a data
augmentation method that utilizes text style transfer techniques
to improve TOD systems’ performance in impolite dialogues.
Contributions. We summarize our contributions as follows:
•We collect and annotate an impolite dialogue corpus
to support the evaluation of TOD systems’ performance
when handling impolite users. We hope that the impolite
dialogue dataset will encourage researchers to propose
TOD systems that are robust in handling users’ requests.
•We evaluate the performance of six state-of-the-art TOD
systems using our impolite dialogue corpus. The evaluation
results show that existing TOD systems have difficulty
handling impolite users’ requests.
•We propose a simple data augmentation method that
utilizes text style transfer techniques to improve TOD
systems’ performance in impolite dialogues.
II. RELATED WORK
1) Task-Oriented Dialogue Systems: With the rapid ad-
vancement of deep learning techniques, TOD systems have
shown promising performance in handling user requests and
interactions. Recent studies have focused on end-to-end TOD
systems to train a general mapping from user utterance to the
system’s natural language response [16]–[20]. Yang et al. [5]
proposed UBAR by fine-tuning the large pre-trained unidirec-
tional language model GPT-2 [21] on the entire dialog session
sequence. The dialogue session consists of user utterances,
belief states, database results, system actions, and system re-
sponses of every dialog turn. Su et al. [11] proposed PPTOD to
effectively leverage pre-trained language models with a multi-
task pre-training strategy that increases the model’s ability
with heterogeneous dialogue corpora. Lin et al. [7] proposed
Minimalist Transfer Learning (MinTL) to plug-and-play large-
scale pre-trained models for domain transfer in dialogue task
completion. Zhang et al. [13] proposed the LABES model,
which treated the dialogue states as discrete latent variables to
reduce the reliance on turn-level DST labels. Kulhánek et al.
[22] proposed AuGPT with modified training objectives for
language model fine-tuning and data augmentation via back-
translation [23] to increase the diversity of the training data.
Existing studies have also leveraged knowledge bases to track
pivotal and critical information required in generating TOD
system agent’s responses [24]–[27]. For instance, Madotto
et al. [28] dynamically updated a knowledge base via fine-
tuning by directly embedding it into the model parameters.
Other studies have also explored reinforcement learning to
build TOD systems [29]–[31]. For instance, Zhao et al. [29]
utilized a Deep Recurrent Q-Networks (DRQN) for building
TOD systems.
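Session-level models such as UBAR flatten each dialogue into one token sequence before language-model fine-tuning. The sketch below illustrates this flattening step; the special tokens and field names are our own illustration, not the exact vocabulary of the original implementation.

```python
# Minimal sketch of flattening a dialogue session into a single
# training sequence, in the spirit of session-level models such as
# UBAR. Delimiter tokens (<sos_u>, <eos_u>, ...) are illustrative.

def flatten_session(turns):
    """Concatenate user utterance, belief state, database result,
    system act, and system response of every turn in order."""
    pieces = []
    for t in turns:
        pieces.append("<sos_u> " + t["user"] + " <eos_u>")
        pieces.append("<sos_b> " + t["belief"] + " <eos_b>")
        pieces.append("<sos_db> " + t["db"] + " <eos_db>")
        pieces.append("<sos_a> " + t["act"] + " <eos_a>")
        pieces.append("<sos_r> " + t["response"] + " <eos_r>")
    return " ".join(pieces)

session = [
    {"user": "i need a cheap hotel", "belief": "hotel price=cheap",
     "db": "3 matches", "act": "request area",
     "response": "which area would you like ?"},
]
print(flatten_session(session))
```

The resulting sequence is then used as a single training example, so the model conditions each response on all previous turns of the session.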
2) Modeling Politeness in Dialogue Systems: Recent stud-
ies have also attempted to improve dialogue systems to gen-
erate responses in a more empathetic manner [32]–[35]. Yu et
al. [33] proposed to include user sentiment obtained through
multimodal information (acoustic, dialogic, and textual) in
the end-to-end learning framework to make TOD systems
more user-adaptive and effective. Feng et al. [34] constructed
a corpus containing task-oriented dialogues with emotion
labels for emotion recognition in TOD systems. However, user
impoliteness is not modeled because too few impolite instances
exist in the MultiWOZ dataset. The lack of impolite dialogue
data motivates us to construct an impolite dialogue corpus to
facilitate downstream analysis.
Politeness is a human virtue and a crucial aspect of com-
munication [36], [37]. Danescu-Niculescu-Mizil et al. [37]
proposed a computational framework to identify the linguistic
aspects of politeness with application to social factors. Re-
searchers have also attempted to model and include politeness
in TOD systems [38]–[40]. For instance, Golchha et al. [38]
utilized a reinforced pointer generator network to transform
a generic response into a polite response. More recently,
Madaan et al. [40] adopted a text style transfer approach
to generate polite sentences while preserving the intended
content. Nevertheless, most of these studies have focused
on generating polite responses, neglecting the handling of
impolite inputs, i.e., impolite user utterances. This study aims
to fill this research gap by extensively evaluating state-of-the-
art TOD systems’ ability to handle impolite dialogues.
3) Data Augmentation in Dialogue Systems: Data augmen-
tation, which aims to enlarge training data size in machine
learning systems, is a common solution to the data scarcity
problem. Data augmentation methods have also been widely
used in dialogue systems [22], [41], [42]. For instance, Kurata
et al. [43] trained an encoder-decoder to reconstruct the utter-
ances in training data. To augment training data, the encoder’s
output hidden states are perturbed randomly to yield different
utterances. Hou et al. [41] proposed a sequence-to-sequence
generation-based data augmentation framework that models
relations between utterances of the same semantic frame in
the training data. Gritta et al. [42] proposed the Conversation
Graph (ConvGraph), which is a graph-based representation of
dialogues, to augment data volume and diversity by generating
dialogue paths. In this paper, we propose a simple data aug-
mentation method that utilizes text style transfer techniques to
generate impolite user utterances for training data to improve
TOD systems’ performance in dealing with impolite users.
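The proposed augmentation pipeline can be sketched as follows. The `impolite_rewrite` stub is a placeholder for a trained polite-to-impolite text style transfer model, which we do not reproduce here; the key property illustrated is that only the surface form of the user utterance changes, while the task annotations are copied verbatim.

```python
# Illustrative sketch of style-transfer-based data augmentation:
# rewrite user utterances into an impolite style while keeping the
# dialogue annotations (e.g., belief states) unchanged.

def impolite_rewrite(utterance):
    # Placeholder for a learned polite-to-impolite style transfer
    # model; here we simply prepend a frustrated marker phrase.
    return "this is ridiculous , " + utterance

def augment(turns):
    """Return the original turns plus style-transferred copies
    whose task annotations are preserved."""
    augmented = list(turns)
    for turn in turns:
        new_turn = dict(turn)  # shallow copy keeps annotations
        new_turn["user"] = impolite_rewrite(turn["user"])
        augmented.append(new_turn)
    return augmented

data = [{"user": "book a table for two",
         "belief": "restaurant people=2"}]
print(augment(data))
```

Training a TOD system on the union of original and augmented turns exposes it to impolite phrasings without requiring additional manual annotation.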