BOTSTALK: Machine-sourced Framework for Automatic Curation of
Large-scale Multi-skill Dialogue Datasets
Minju Kim1Chaehyeong Kim1Yongho Song1Seung-won Hwang2Jinyoung Yeo1
1Department of Artificial Intelligence, Yonsei University
2Department of Computer Science and Engineering, Seoul National University
{minnju,cheris8,kopf_yhs,jinyeo}@yonsei.ac.kr seungwonh@snu.ac.kr
Abstract
To build open-domain chatbots that are able
to use diverse communicative skills, we pro-
pose a novel framework BOTSTALK, where
multiple agents grounded to the specific target
skills participate in a conversation to automat-
ically annotate multi-skill dialogues. We fur-
ther present Blended Skill BotsTalk (BSBT),
a large-scale multi-skill dialogue dataset com-
prising 300K conversations. Through exten-
sive experiments, we demonstrate that our
dataset can be effective for multi-skill dia-
logue systems which require an understanding
of skill blending as well as skill grounding.
Our code and data are available at https://
github.com/convei-lab/BotsTalk.
1 Introduction
Considerable progress has been made towards
open-domain chatbots with different desirable qualities
in conversation. Each of these models can specialize
in one communicative skill, i.e., skill grounding.
A number of distinct large-scale datasets, each targeting
a specific conversational skill, have recently become
available. ConvAI2 (Dinan et al., 2020b) supports research
that aims to endow chatbots with personas (Majumder
et al., 2020a; Kim et al., 2020b), enabling chatbots
to talk about themselves. Wizard of Wikipedia
(WoW) (Dinan et al.,2019) is a popular option for
recent studies (Lian et al.,2019;Zhao et al.,2020;
Kim et al.,2020a) that focus on knowledgeable
conversational agents discussing topics in depth.
Empathetic Dialogues (ED) (Rashkin et al.,2019)
is also commonly used to embody empathy in dia-
logue systems (Santhanam and Shaikh, 2019; Majumder
et al., 2020b). Most such skill-grounded
datasets are designed to improve a single skill, and
are thus effective when models are asked to demonstrate
the targeted conversational skill.
*Equal contribution
†Corresponding author
Benefiting from the advances of these conversa-
tional agents, recent research focuses on another as-
pect of open-domain chatbots: the ability to blend
various conversational skills into one cohesive flow
in a seamless manner, i.e., skill blending. A good
open-domain chatbot should be able to weave multiple
behaviors and skills into a single conversation,
enabling it to deal with different users and situations
appropriately (Shuster et al., 2020; Roller
et al., 2021). Towards this goal, there is a need to
construct a multi-skill dialogue dataset, which con-
sists of multi-turn dialogues that exhibit multiple
skills. While Smith et al. (2020) propose a crowd-
sourced dataset Blended Skill Talk (BST) of 5K
conversations as a reliable benchmark for measur-
ing dialogue systems’ ability at the blended objec-
tive, it is not sufficient to build a multi-skill chatbot
due to its limited scale. Scaling up crowdsourcing
is not feasible, as it requires labor-intensive manual
annotation and verification. Instead, automatic
curation has shown promising results for large-scale
dialogue generation (Mohapatra et al., 2021).
In this paper, we aim to generate a large-scale
multi-skill dialogue dataset without additional cost
or human effort. To this end, we introduce an automatic
data curation approach named BOTSTALK, where
multiple dialogue agents grounded to individual skills
engage in a conversation to blend all skills together.
Based on this framework, we create Blended Skill
BotsTalk (BSBT), a large-scale multi-skill dialogue
dataset of 300K conversations blended and grounded
with a number of skills derived from ConvAI2, WoW,
and ED. Our experiments demonstrate that by using our
dataset, dialogue models yield large performance gains
in skill blending while maintaining competitive
performance in skill grounding. Furthermore, we
validate the quality of the BSBT dataset by human
evaluation, showing that our machine-sourced conversations
are consistently preferred over crowdsourced ones from
BST by human judges across all metrics.
arXiv:2210.12687v1 [cs.CL] 23 Oct 2022
Dataset Dialogue episode
ConvAI2
Skill context for speaker A: I like to ski; I hate Mexican food; I like to eat cheetos; ...
Skill context for speaker B: I am an artist; I have four children; I enjoy walking for exercise; ...
Dialogue context
A: How old are your children?
B: I have four that range in age from 10 to 21. You?
Wizard of Wikipedia
Skill context for speaker A: Armadillo
Skill context for speaker B: Armadillo are ... "armadillo" means "little armoured one" in ...
Dialogue context
A: I don’t think I’ve ever seen an armadillo in real life!
B: I’ve seen them at the zoo. Armadillo means little armored one in Spanish.
Empathetic Dialogues
Skill context for speaker A: My brother jump scared me while I was out playing; Terrified
Skill context for speaker B: None
Dialogue context
A: Just got scared to death.
B: Oh no. What happened?
Table 1: Example dialogues of three single-skill datasets: ConvAI2 provides each speaker persona sentences as
skill context; Wizard of Wikipedia provides a topic and knowledge resources as skill context; Empathetic Dialogues
provides a situation description and an emotion as skill context. We show only two turns of each dialogue
context due to space constraints.
2 Related Work
2.1 Skill-grounded Dialogue Datasets
Past research in open-domain chatbots has made
solid strides towards dialogue systems with desirable
general qualities in a conversation. Generating
responses grounded to a specific conversational skill
has been explored along different axes, as shown in
Table 1 (see also Appendix B for details). Dinan et al.
(2020b) introduce the ConvAI2 dataset, which consists
of more than 140K utterances of crowdsourced conversations,
to make chit-chat models more engaging
and personalized by conditioning the models on
profile information. The Wizard of Wikipedia (Dinan
et al., 2019) task aims to explore conversation informed
by expert knowledge from Wikipedia and
provides about 194K utterances of conversations
on about 1,250 topics. Rashkin et al. (2019) construct
Empathetic Dialogues, a dataset comprising
50K utterances of crowdworker conversations
grounded in emotional situations, for models to
converse with empathy. However, it remains unclear
whether models optimized for performance
along a specific conversational skill can retain the
learned skill while blending it with other skills.
Hence, Smith et al. (2020) aim to build a conversational
agent that seamlessly blends being personable,
knowledgeable, and empathetic. To gauge
how successful a model is at this blended objective,
they collect a new multi-skill dialogue dataset of
about 5K conversations, Blended Skill Talk, via
crowdsourcing. While this work provides a testbed
for future studies, its limited scale could hinder further
progress, since training multi-skill chatbots generally
requires a large-scale dataset of conversations that
involve multiple skills (Shah et al., 2018).
2.2 Automatic Dialogue Data Annotation
Research in dialogue systems has been consistently
supported by the development of new dialogue
datasets (Williams et al., 2014; Mrkšić et al., 2017).
One popular approach is to collect and annotate
dialogues via crowdsourcing (Zhang et al., 2018;
Smith et al., 2020). However, generating multi-turn
dialogues in this manner requires expensive
and exhausting human effort (Shah et al., 2018;
Sun et al., 2021; Mohapatra et al., 2021).
Therefore, recent studies seek to facilitate open-domain
chatbot development with new datasets automatically
constructed from existing datasets.
For instance, Lee et al. (2021) create a 45K multi-modal
dialogue dataset by replacing parts of source
dialogues from existing text-only dialogue datasets
with semantically relevant images. Sun et al. (2021)
propose a human-AI collaborative data collection
approach for generating diverse chit-chat responses
to augment task-oriented dialogues, and present
new chit-chat annotations for 23.8K dialogues from
two popular task-oriented datasets. Kim et al. (2021b)
and Vidgen et al. (2021) present a model-based
dialogue collection framework and a
human-and-model-in-the-loop process for generating
datasets, respectively.
3 Problem Formulation
In this section, we formulate the problem of multi-
skill dialogue annotation and desirable characteris-
tics for the dialogue dataset as a training resource.
3.1 Multi-skill Dialogue Annotation
Our goal is to collect a new large-scale multi-skill
dialogue dataset, which seamlessly blends various
skills over the course of a multi-turn conversation.
Here, inspired by Smith et al. (2020), the inputs of
this task are single-skill datasets, which are separately
collected on a variety of skills. Let $\mathcal{M}$ be
the set of $M$ skill types, e.g., $\mathcal{M} = \{P, K, E\}$,
where P, K, E denote personality, knowledge, and empathy,
derived from ConvAI2, WoW, and ED, respectively.
Formally, we refer to $D_m$ as a dialogue dataset with
$N_m$ dialogue episodes for skill $m \in \mathcal{M}$:

$$D_m = \{(stx_{i,m},\, dtx_{i,t})\}_{i=1}^{N_m} \quad (1)$$

where $stx_{i,m}$ is a skill-relevant description (i.e.,
skill context) for skill $m$ and $dtx_{i,t}$ is $t$ dialogue
turns (i.e., dialogue context) derived from the skill
context, as shown in Table 1. Based on the input
datasets $D_1, \ldots, D_M$, we aim to obtain a new dialogue
dataset $\tilde{D}$ for the $M$ skills as an output. Formally,

$$\tilde{D} = \{(\tilde{stx}_i,\, dtx_{i,t})\}_{i=1}^{N} \quad (2)$$

where $\tilde{stx}_i$ is a set of skill contexts for $\mathcal{M}$
and $dtx_{i,t}$ is the dialogue context derived from the multiple
skills. We omit the index $i$ when dealing with
a single dialogue episode.
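For concreteness, the episode structures in Equations 1 and 2 can be sketched as plain Python data types. This encoding is our own illustration (class and field names are hypothetical); the authors' actual data format may differ:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# A single-skill episode from D_m: one skill context plus the dialogue
# turns derived from it (cf. Equation 1 and Table 1).
@dataclass
class SkillEpisode:
    skill: str                   # one of "P", "K", "E"
    skill_context: List[str]     # persona lines, topic + knowledge, or situation
    dialogue_context: List[str]  # t dialogue turns

# A multi-skill episode from D~: one skill context per skill type in M,
# sharing a single dialogue context (cf. Equation 2).
@dataclass
class MultiSkillEpisode:
    skill_contexts: Dict[str, List[str]] = field(default_factory=dict)
    dialogue_context: List[str] = field(default_factory=list)

# Example values taken from Table 1.
convai2_ep = SkillEpisode(
    skill="P",
    skill_context=["I am an artist", "I have four children"],
    dialogue_context=["How old are your children?",
                      "I have four that range in age from 10 to 21. You?"],
)

blended_ep = MultiSkillEpisode(
    skill_contexts={"P": convai2_ep.skill_context,
                    "K": ["Armadillo means 'little armoured one'"],
                    "E": ["My brother jump scared me; Terrified"]},
    dialogue_context=list(convai2_ep.dialogue_context),
)
```
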
3.2 Desirable Characteristics of Multi-skill
Dialogue Datasets
By the above annotation, we aim to build a multi-
skill chatbot that uses all target skills appropriately
in a conversation. For that, we lay out two criteria
that a multi-skill dialogue dataset should meet as a
training resource, namely skill blending and skill
grounding. Skill blending indicates that a multi-skill
dialogue dataset should enable dialogue models
to exhibit different dialogue skills in a conversation
(Smith et al., 2020), while skill grounding emphasizes
that dialogue models should learn to maintain
each dialogue skill when appropriate (Shazeer
et al., 2017). Generally, the two are in a trade-off
relationship, as a conversation of finite length cannot
fully represent both skill blending and skill grounding
(Madotto et al., 2021). Nevertheless, we note that
they are not contradictory, as some skill-grounded
utterances leave room for a natural shift between
skills. Given an utterance "I like sneakers because
they are comfortable." which represents skill type P,
it seems reasonable to annotate an utterance of skill
type K, "It is because sneakers were primarily designed
for sports.", for the next dialogue turn. This example
further implies that different skills can be blended
naturally so that chatbots learn to provide reasonable
responses in a multi-skill dialogue (Roller et al., 2020).
4 BOTSTALK Framework
We now present BOTSTALK, a novel framework
that automatically annotates multi-skill dialogues
based on multiple single-skill dialogue datasets.
The focus of our framework is to mimic a natu-
ral conversation by featuring both skill blending
and grounding within a dialogue episode. Figure 1
illustrates three main phases of the framework. Im-
plementation details are provided in Appendix C.
4.1 Participants in BOTSTALK
In our framework, multiple participants engage in a
conversation to iteratively generate desirable multi-
skill dialogues.
Skill Agents The first participants are multiple
single-skill agents, which annotate appropriate
skill-grounded utterances to the dialogue. Formally,
based on $D_m$ for skill $m$, when given a skill context
$stx_m$, a dialogue context $dtx_t$, and a response
space $U$, a skill agent has dialogue models
$f: (stx_m, dtx_t) \mapsto U$ which return a response

$$res_{m,t} = f(stx_m, dtx_t; \theta_m) \quad (3)$$

where $\theta_m$ denotes the parameters learned for skill $m$.
We design two main functions of the skill agent,
a generator model and a ranker model, parameterized
as $\theta^m_{gen}$ and $\theta^m_{rnk}$ for skill $m$,
respectively. For $\theta_{gen}$, we aim to generate
responses from the response space $U$ in a token-by-token
manner, and thus employ a dodecaDialogue (Shuster et al.,
2020) model, a modification of a transformer Seq2Seq
architecture. For $\theta_{rnk}$, we instead treat the
response space $U$ as a list of alternatives from which
to pick the correct response, and thus employ a
poly-encoder (Humeau et al., 2020) model, a
transformer-based retrieval architecture, to score
and rank response candidates. Both $\theta_{gen}$ and
$\theta_{rnk}$ are fine-tuned on the individual
single-skill datasets.¹

¹On average, the generator and ranker models show around
10 perplexity and 90 accuracy on their respective datasets.
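A minimal sketch of the skill-agent interface described above, with toy lambdas standing in for the fine-tuned generator ($\theta_{gen}$) and ranker ($\theta_{rnk}$). The class, the echo-the-persona generator, and the word-overlap scorer are purely illustrative, not the authors' implementation:

```python
from typing import Callable, List, Sequence

class SkillAgent:
    """One single-skill agent: proposes responses and ranks candidates."""

    def __init__(self, skill: str, generate: Callable, score: Callable):
        self.skill = skill
        self._generate = generate  # stand-in for the generator model (theta_gen)
        self._score = score        # stand-in for the ranker model (theta_rnk)

    def propose(self, skill_context: Sequence[str],
                dialogue_context: Sequence[str]) -> str:
        # Phase 1: simulate a skill-grounded response for the next turn.
        return self._generate(skill_context, dialogue_context)

    def rank(self, skill_context: Sequence[str],
             dialogue_context: Sequence[str], candidates: List[str]) -> str:
        # Phase 3: the active agent picks the best candidate by ranker score.
        return max(candidates,
                   key=lambda c: self._score(skill_context, dialogue_context, c))

# Toy stand-ins: the generator echoes the persona; the scorer counts word
# overlap between a candidate and the skill context.
persona_agent = SkillAgent(
    "P",
    generate=lambda stx, dtx: f"Oh really? I {stx[0].lower()}.",
    score=lambda stx, dtx, c: len(set(c.lower().split())
                                  & set(" ".join(stx).lower().split())),
)
reply = persona_agent.propose(["Like tennis shoes"], ["I love sneakers."])
```
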
[Figure 1 content omitted: the figure shows three skill agents (P, K, E) proposing candidate responses to "I love sneakers and think they are the most comfortable shoes around.", ranking the candidates, and checking whether the top-ranked utterance is natural given the previous utterance.]
Figure 1: Illustration of the BOTSTALK framework. Green, blue, and purple indicate skill types P, K, and E, respectively.
While all skill agents simulate what response to
annotate, only one skill agent is given priority over
the others to "speak" the response at each dialogue
turn, conditioned on a set of skill contexts $\tilde{stx}$
and the dialogue context $dtx_t$. We call this the
active agent. The priority may be passed to another
skill agent, in which case the current active agent is
deactivated and another skill agent is newly activated
to speak.
Moderator Agent A critical constraint on the skill
agents is that neither the generator nor the ranker for
skill $m$ is able to read the other skill contexts in
$\tilde{stx}$. For a skill agent, considering all possible
skill contexts in multi-skill dialogues is non-trivial.
Instead, as an omniscient oracle over all skill contexts
$\tilde{stx}$, we develop another participant, the
moderator agent, which mediates the conversational flow
for desirable multi-skill dialogue annotation. To examine
the relevance of a response $res_t$ to all skill contexts
$\tilde{stx}$ or to the dialogue context $dtx_t$, the
moderator agent has a decision function
$g: (\tilde{stx}, dtx_t, res_t) \mapsto A$, where $A$
is an action space (i.e., approval or refusal) for the
given response.
4.2 Phase 1: Simulate what to speak
We integrate different dialogue setups from multiple
single-skill datasets as seed information to start
a conversation (detailed in Appendix C.3). For a
dialogue episode, the dialogue context is initialized
as an utterance pair (i.e., two dialogue turns) randomly
sampled from a single-skill dataset $D_m$, and the
skill agent for skill $m$ becomes the initial active
agent. Then, for a generalizable dialogue setup,
we retrieve the most relevant skill contexts for the
seed dialogue context from each of the input datasets
$D_1, \ldots, D_M$ with TF-IDF (Chen et al., 2017).²

²While we use a simple IR baseline as a lower bound, since
retrieval is not our main focus, one can easily try a different IR system.
In the first phase of BOTSTALK, all skill agents
simulate their own responses for the next dialogue
turn. Formally, given a skill context $stx_m$ and the
current dialogue context $dtx_t$ in a dialogue episode,
a skill agent for skill $m$ generates a plausible
response $res_{m,t}$ as

$$res_{m,t} = \operatorname*{argmax}_{res_t \in U} P(res_t \mid stx_m, dtx_t; \theta^m_{gen}) \cdot g(\tilde{stx}, res_t) \quad (4)$$

where $g(\cdot)$ is the function of the moderator agent,
which we discuss in the subsequent section.
Depending on its individual skill, every skill agent
returns a skill-relevant response. For example, as
shown in Figure 1, when "I love sneakers and think
they are the most comfortable shoes around." is
given as $dtx$, the skill agent for skill P generates a
personal response, "Oh really? I like tennis shoes
more than sneakers.", as $res_P$ based on a given
persona. Meanwhile, the skill agents for skills K and E
generate a knowledgeable response, "It is because
sneakers were primarily designed for sports.", as
$res_K$ and an empathetic response, "Me too! I
definitely use mine everyday wear!", as $res_E$.
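Phase 1 can be read as rejection sampling: each agent regenerates until the moderator approves, per the $g(\tilde{stx}, res_t)$ factor in Equation 4. A schematic loop under that reading (function and stub names are hypothetical):

```python
from typing import Callable, Dict, List, Sequence

def simulate_turn(agents: Sequence,  # objects with .skill and .propose(...)
                  skill_contexts: Dict[str, List[str]],
                  dialogue_context: List[str],
                  approve: Callable[[str], bool],
                  max_tries: int = 5) -> Dict[str, str]:
    """Each skill agent simulates a candidate response for the next turn,
    regenerating until the moderator approves (or tries run out)."""
    candidates: Dict[str, str] = {}
    for agent in agents:
        for _ in range(max_tries):
            res = agent.propose(skill_contexts[agent.skill], dialogue_context)
            if approve(res):  # the moderator's g(stx~, res_t)
                candidates[agent.skill] = res
                break
    return candidates

# Minimal stub agents that always propose a fixed reply.
class _StubAgent:
    def __init__(self, skill: str, reply: str):
        self.skill, self.reply = skill, reply
    def propose(self, stx, dtx):
        return self.reply

agents = [_StubAgent("P", "I like tennis shoes."), _StubAgent("K", "")]
# The empty K reply is never approved, so only P contributes a candidate.
turn = simulate_turn(agents, {"P": [], "K": []}, [], approve=lambda r: bool(r))
```
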
4.3 Phase 2: Check dialogue consistency
It is well known that neural dialogue systems lack
consistency (Li et al.,2016;Welleck et al.,2019).
Furthermore, since a skill agent uses its specific skill
context $stx_m$ instead of $\tilde{stx}$ for response
generation, the response is likely to be semantically
in conflict with other skill contexts in $\tilde{stx}$.
Suppose a $stx_P$ is "I wear sneakers everyday" and a
$res_E$ is "I had some trouble yesterday because my
sandals were torn". This response is inappropriate
because "my sandals were torn" contradicts "I wear
sneakers everyday". Therefore, the moderator agent,
which has access to all skill contexts $\tilde{stx}$,
filters out conflicting response candidates to preserve
dialogue consistency.
Specifically, the moderator agent leverages natural
language inference (NLI), the task of determining
whether a hypothesis sentence can be inferred from
a given premise sentence. The hypothesis sentence
is classified into one of three categories: ENTAIL
(true), NEUTRAL (undetermined), and CONTRADICT
(false). Based on the NLI classifier, the decision
function of the moderator agent is defined as

$$g(\tilde{stx}, res_t) = \begin{cases} 1, & \mathrm{NLI}(\tilde{stx}, res_t) \not\to \text{CONTRADICT} \\ 0, & \text{otherwise} \end{cases} \quad (5)$$

which represents approval/refusal of $res_t$ conditioned
on $\tilde{stx}$. A skill agent for skill $m$ repeatedly
generates new response candidates until its response is
approved, as described in Equation 4. For the NLI
classifier, we use a RoBERTa (Liu et al., 2019) model
trained on MNLI (Williams et al., 2018)³, which is
widely used in fact-checking systems (Kim et al., 2021a)⁴.
Overall, about 50% of utterances are classified as
CONTRADICT by the NLI classifier. Of all utterances
classified as CONTRADICT, about 70% are in conflict
with other types of skill contexts (Figure 2). This
result demonstrates that skill agents indeed generate
inconsistent responses due to their restricted access
to other skill contexts. We also find that the overall
proportion of utterances conflicting with $stx_P$ is
relatively high, apparently because $stx_P$ contains
more distinct descriptions than $stx_K$ and $stx_E$.
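The Phase-2 filter of Equation 5 can be sketched as follows. The `nli_label` stub below is a deliberately crude keyword heuristic used only so the sketch runs standalone; a real setup would substitute an MNLI-trained classifier such as the RoBERTa model mentioned above:

```python
def nli_label(premise: str, hypothesis: str) -> str:
    """Stub NLI predictor: flags a contradiction when the hypothesis mentions
    an item whose 'rival' appears in the premise. Purely illustrative; a real
    implementation would call an MNLI-trained classifier."""
    rival = {"sandals": "sneakers", "sneakers": "sandals"}
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    if any(w in h and rival[w] in p for w in rival):
        return "CONTRADICT"
    return "NEUTRAL"

def g_consistency(skill_contexts, response: str) -> int:
    # Equation 5: refuse (0) iff the response contradicts any skill context.
    labels = (nli_label(stx, response) for stx in skill_contexts)
    return 0 if any(label == "CONTRADICT" for label in labels) else 1

stx = ["I wear sneakers everyday"]
refused = g_consistency(stx, "I had some trouble yesterday because my sandals were torn")
approved = g_consistency(stx, "Oh no. What happened?")
```

Here `refused` is 0 (the sandals response conflicts with the sneakers persona, matching the paper's example) while `approved` is 1.
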
4.4 Phase 3: Speak or pass the mic
The objective of the last phase is to score the set of
response candidates and select a final response given
the skill contexts and the dialogue context. To this
end, we leverage the active agent and the moderator
agent, taking into account the balance between skill
blending and skill grounding.
Let $U_{res}$ be the set of response candidates
$res_{1,t}, \ldots, res_{M,t}$ from all skill agents.
The active skill agent identifies the most appropriate
response $res^*_t$ in $U_{res}$ based on its ranker
model $\theta^m_{rnk}$, then asks the moderator agent
to append the selected response to the next dialogue
context $dtx_{t+1}$ for annotation. Formally, we define
this process as

$$res^*_t = \operatorname*{argmax}_{res_t \in U_{res}} P(res_t \mid stx_m, dtx_t; \theta^m_{rnk}) \cdot g(dtx_t, res_t) \quad (6)$$

where $g(\cdot)$ is the function of the moderator agent.
To compute $g(dtx_t, res_t)$, the moderator agent
³Dialogue NLI (Welleck et al., 2019) is biased to ConvAI2.
⁴The RoBERTa model shows 90.59 accuracy on MNLI.
Figure 2: Percentages of utterances which are classified
as CONTRADICT via NLI classifier, broken down by
the type of skill contexts.
Figure 3: KL divergence between skill distributions of
consecutive utterances (left) and entropy of skill distri-
butions for all utterances (right).
adopts a skill classifier $P$ that identifies the
corresponding skill for a response. We use a BERT
(Devlin et al., 2019) model trained on utterances in
$D_m$ and their corresponding skill labels $m$ for all
skill types in $\mathcal{M}$⁵. Once $P$ is learned,
the decision function of the moderator agent is
defined as

$$g(dtx_t, res_t) = \begin{cases} 1, & \mathrm{KL}(P(res^*_{t-1}) \,\|\, P(res_t)) < \alpha \\ 0, & \text{otherwise} \end{cases} \quad (7)$$

where $res^*_{t-1}$ is the last utterance of $dtx_t$ and
$P(\cdot) \in \mathbb{R}^M$ outputs the skill distribution
of the response. Based on the KL divergence between the
two distributions, $g(dtx_t, res_t)$ is discretized into
an approval/refusal decision by a pre-defined threshold
$\alpha$ (Figure 3a). Once the moderator agent accepts
a candidate $res_t$ from an inactive agent as the final
response, the active agent passes the mic, i.e., the
priority for annotation, to the inactive agent.
In practice, we compute the entropy of the skill
distributions of all utterances to investigate whether
there is room for shifting between skills. The entropy
value indicates the uncertainty of the skill type of an
utterance: utterances with high entropy are uncertain,
generic responses. Figure 3b shows that the number of
generic utterances is far from negligible, suggesting
that there are opportunities to shift to other skills,
and thus both skill blending and grounding can be
satisfied in a conversation.

⁵The BERT model shows 81.95 accuracy at inference time.
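Equation 7's gate and the entropy probe behind Figure 3 can be sketched with plain-Python skill distributions. The threshold and the example distributions are illustrative values, not the paper's actual $\alpha$ or data:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete skill distributions over M skills."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def entropy(p, eps=1e-12):
    """Uncertainty of an utterance's skill type; high entropy = generic."""
    return -sum(pi * math.log(pi + eps) for pi in p)

def g_skill_shift(prev_dist, cand_dist, alpha=0.5):
    # Equation 7: approve (1) iff the skill distribution of the candidate
    # stays within KL-distance alpha of the previous utterance's.
    return 1 if kl_divergence(prev_dist, cand_dist) < alpha else 0

# A generic, high-entropy previous utterance leaves room to shift skills:
prev = [0.4, 0.35, 0.25]   # skill distribution over (P, K, E)
cand = [0.5, 0.3, 0.2]
shift_ok = g_skill_shift(prev, cand)
```

Here `shift_ok` is 1: the two distributions are close (KL ≈ 0.02), so the moderator would approve the candidate and, if it came from an inactive agent, the mic would be passed.
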