BOTSTALK: Machine-sourced Framework for Automatic Curation of
Large-scale Multi-skill Dialogue Datasets
Minju Kim1Chaehyeong Kim1Yongho Song1Seung-won Hwang2Jinyoung Yeo1
1Department of Artificial Intelligence, Yonsei University
2Department of Computer Science and Engineering, Seoul National University
{minnju,cheris8,kopf_yhs,jinyeo}@yonsei.ac.kr seungwonh@snu.ac.kr
Abstract
To build open-domain chatbots that are able
to use diverse communicative skills, we pro-
pose a novel framework BOTSTALK, where
multiple agents grounded to the specific target
skills participate in a conversation to automat-
ically annotate multi-skill dialogues. We fur-
ther present Blended Skill BotsTalk (BSBT),
a large-scale multi-skill dialogue dataset com-
prising 300K conversations. Through exten-
sive experiments, we demonstrate that our
dataset can be effective for multi-skill dia-
logue systems which require an understanding
of skill blending as well as skill grounding.
Our code and data are available at https://
github.com/convei-lab/BotsTalk.
1 Introduction
Considerable progress has been made towards
open-domain chatbots with different desirable qualities
in conversation. Each of these models can specialize
in one communicative skill, i.e., skill grounding.
A number of distinct large-scale datasets, each targeting
a specific conversational skill, have recently become
available. ConvAI2 (Dinan et al., 2020b) supports research
that aims to endow chatbots with personas (Majumder
et al., 2020a; Kim et al., 2020b), enabling chatbots
to talk about themselves. Wizard of Wikipedia
(WoW) (Dinan et al.,2019) is a popular option for
recent studies (Lian et al.,2019;Zhao et al.,2020;
Kim et al.,2020a) that focus on knowledgeable
conversational agents discussing topics in depth.
Empathetic Dialogues (ED) (Rashkin et al.,2019)
is also commonly used to embody empathy in dia-
logue systems (Santhanam and Shaikh, 2019; Majumder
et al., 2020b). Most such skill-grounded
datasets are designed to improve a single skill, and
are thus effective when models are asked to demonstrate
the targeted conversational skill.
*Equal contribution
†Corresponding author
Benefiting from the advances of these conversa-
tional agents, recent research focuses on another as-
pect of open-domain chatbots: the ability to blend
various conversational skills into one cohesive flow
in a seamless manner, i.e., skill blending. A good
open-domain chatbot should be able to weave multiple
behaviors and skills into a single conversation,
enabling it to deal with different users and situations
appropriately (Shuster et al., 2020; Roller
et al., 2021). Towards this goal, there is a need to
construct a multi-skill dialogue dataset, which con-
sists of multi-turn dialogues that exhibit multiple
skills. While Smith et al. (2020) propose a crowd-
sourced dataset Blended Skill Talk (BST) of 5K
conversations as a reliable benchmark for measur-
ing dialogue systems’ ability at the blended objec-
tive, it is not sufficient to build a multi-skill chatbot
due to its limited scale. Scaling up crowdsourcing
is not feasible, as it requires labor-intensive manual
annotation and verification. Instead, automatic
curation has shown promising results for large-scale
dialogue generation (Mohapatra et al., 2021).
In this paper, we aim to generate a large-scale
multi-skill dialogue dataset without additional cost
or human effort. To this end, we introduce an automatic
data curation approach named BOTSTALK, where
multiple dialogue agents grounded to individual skills
engage in a conversation to blend all skills together.
Based on this framework, we create Blended Skill
BotsTalk (BSBT), a large-scale multi-skill dialogue
dataset of 300K conversations blended and grounded
with a number of skills derived from ConvAI2, WoW,
and ED. Our experiments demonstrate that by using our
dataset, dialogue models yield large performance gains
in skill blending while maintaining competitive
performance in skill grounding. Furthermore, we
validate the quality of the BSBT dataset by human
evaluation, showing that our machine-sourced conversations
are consistently preferred over crowdsourced ones from
BST by human judges across all metrics.
arXiv:2210.12687v1 [cs.CL] 23 Oct 2022
Dataset Dialogue episode
ConvAI2
Skill context for speaker A: I like to ski; I hate Mexican food; I like to eat cheetos; ...
Skill context for speaker B: I am an artist; I have four children; I enjoy walking for exercise; ...
Dialogue context
A: How old are your children?
B: I have four that range in age from 10 to 21. You?
Wizard of Wikipedia
Skill context for speaker A: Armadillo
Skill context for speaker B: Armadillo are ... "armadillo" means "little armoured one" in ...
Dialogue context
A: I don’t think I’ve ever seen an armadillo in real life!
B: I’ve seen them at the zoo. Armadillo means little armored one in Spanish.
Empathetic Dialogues
Skill context for speaker A: My brother jump scared me while I was out playing; Terrified
Skill context for speaker B: None
Dialogue context
A: Just got scared to death.
B: Oh no. What happened?
Table 1: Example dialogues of three single-skill datasets: ConvAI2 provides each speaker persona sentences as
skill context; Wizard of Wikipedia provides a topic and knowledge resources as skill context; Empathetic Dialogues
provides a situation description and an emotion as skill context. We show only two turns of each dialogue
context due to space constraints.
2 Related Work
2.1 Skill-grounded Dialogue Datasets
Past research in open-domain chatbots has made
solid strides towards dialogue systems with desirable
general qualities in a conversation. Generating
responses grounded to a specific conversational skill
has been explored along different axes, as shown in
Table 1 (see also Appendix B for details). Dinan et al.
(2020b) introduce the ConvAI2 dataset, which consists
of more than 140K utterances of crowdsourced conversations,
to make chit-chat models more engaging
and personalized by conditioning the models on
profile information. The Wizard of Wikipedia (Dinan
et al., 2019) task aims to explore conversation informed
by expert knowledge from Wikipedia and
provides about 194K utterances of conversations
on about 1,250 topics. Rashkin et al. (2019) construct
Empathetic Dialogues, a dataset comprising
50K utterances of crowdworker conversations
grounded in emotional situations, for models to
converse with empathy. However, it remains unclear
whether models optimized for performance
along a specific conversational skill can retain the
learned skill while blending it with other skills.
Hence, Smith et al. (2020) aim to build a conversational
agent that seamlessly blends being personable,
knowledgeable, and empathetic. To gauge
how successful a model is at this blended objective,
they collect a new multi-skill dialogue dataset of
about 5K conversations, Blended Skill Talk, via
crowdsourcing. While this work provides a testbed
for future studies, its limited scale could hinder further
progress, since training multi-skill chatbots generally
requires a large-scale dataset of conversations that
involve multiple skills (Shah et al., 2018).
2.2 Automatic Dialogue Data Annotation
Research in dialogue systems has been consistently
supported by the development of new dialogue
datasets (Williams et al., 2014; Mrkšić et al., 2017).
One popular approach is to collect and annotate
dialogues via crowdsourcing (Zhang et al., 2018;
Smith et al., 2020). However, generating multi-turn
dialogues in this manner requires expensive
and exhausting human effort (Shah et al., 2018;
Sun et al., 2021; Mohapatra et al., 2021).
Therefore, recent studies seek to facilitate open-domain
chatbot development with new datasets automatically
constructed from existing datasets.
For instance, Lee et al. (2021) create a 45K multi-modal
dialogue dataset by replacing parts of source
dialogues from existing text-only dialogue datasets
with semantically relevant images. Sun et al. (2021)
propose a human-AI collaborative data collection
approach for generating diverse chit-chat responses
to augment task-oriented dialogues, and present
new chit-chat annotations for 23.8K dialogues from
two popular task-oriented datasets. Kim et al. (2021b)
and Vidgen et al. (2021) present a model-based
dialogue collection framework and a
human-and-model-in-the-loop process for generating
datasets, respectively.
3 Problem Formulation
In this section, we formulate the problem of multi-
skill dialogue annotation and desirable characteris-
tics for the dialogue dataset as a training resource.
3.1 Multi-skill Dialogue Annotation
Our goal is to collect a new large-scale multi-skill
dialogue dataset, which seamlessly blends various
skills over the course of a multi-turn conversation.
Here, inspired by Smith et al. (2020), the inputs of
this task are single-skill datasets, which are separately
collected on a variety of skills. Let $\mathcal{M}$ be
the set of $M$ skill types, e.g., $\mathcal{M} = \{P, K, E\}$,
where P, K, E denote personality, knowledge, and empathy,
derived from ConvAI2, WoW, and ED, respectively.
Formally, we refer to $D_m$ as a dialogue dataset with
$N_m$ dialogue episodes for skill $m \in \mathcal{M}$:

$$D_m = \{(stx_{i,m},\, dtx_{i,t})\}_{i=1}^{N_m} \quad (1)$$

where $stx_{i,m}$ is a skill-relevant description (i.e.,
skill context) for skill $m$ and $dtx_{i,t}$ is $t$ dialogue
turns (i.e., dialogue context) derived from the skill
context, as shown in Table 1. Based on the input
datasets $D_1, \ldots, D_M$, we aim to obtain a new dialogue
dataset $\tilde{D}$ for the $M$ skills as an output. Formally,

$$\tilde{D} = \{(\tilde{stx}_i,\, dtx_{i,t})\}_{i=1}^{N} \quad (2)$$

where $\tilde{stx}_i$ is a set of skill contexts for $\mathcal{M}$
and $dtx_{i,t}$ is the dialogue context derived from the multiple
skills. We omit the index $i$ when dealing with
a single dialogue episode.
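For concreteness, the episode structures in Equations 1 and 2 can be sketched as plain Python data types. This encoding is our own illustration (class and field names are hypothetical); the authors' actual data format may differ:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# A single-skill episode from D_m: one skill context plus the dialogue
# turns derived from it (cf. Equation 1 and Table 1).
@dataclass
class SkillEpisode:
    skill: str                   # one of "P", "K", "E"
    skill_context: List[str]     # persona lines, topic + knowledge, or situation
    dialogue_context: List[str]  # t dialogue turns

# A multi-skill episode from D~: one skill context per skill type in M,
# sharing a single dialogue context (cf. Equation 2).
@dataclass
class MultiSkillEpisode:
    skill_contexts: Dict[str, List[str]] = field(default_factory=dict)
    dialogue_context: List[str] = field(default_factory=list)

# Example values taken from Table 1.
convai2_ep = SkillEpisode(
    skill="P",
    skill_context=["I am an artist", "I have four children"],
    dialogue_context=["How old are your children?",
                      "I have four that range in age from 10 to 21. You?"],
)

blended_ep = MultiSkillEpisode(
    skill_contexts={"P": convai2_ep.skill_context,
                    "K": ["Armadillo means 'little armoured one'"],
                    "E": ["My brother jump scared me; Terrified"]},
    dialogue_context=list(convai2_ep.dialogue_context),
)
```
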
3.2 Desirable Characteristics of Multi-skill
Dialogue Datasets
By the above annotation, we aim to build a multi-
skill chatbot that uses all target skills appropriately
in a conversation. For that, we lay out two criteria
that a multi-skill dialogue dataset should meet as a
training resource, namely skill blending and skill
grounding. Skill blending indicates that a multi-skill
dialogue dataset should enable dialogue models
to exhibit different dialogue skills in a conversation
(Smith et al., 2020), while skill grounding emphasizes
that dialogue models should learn to maintain
each dialogue skill when appropriate (Shazeer
et al., 2017). Generally, the two are in a trade-off
relationship, as a conversation of finite length cannot
fully represent both skill blending and skill grounding
(Madotto et al., 2021). Nevertheless, we note that
they are not contradictory, as some skill-grounded
utterances leave room for a natural shift between
skills. Given an utterance "I like sneakers because
they are comfortable." which represents skill type P,
it seems reasonable to annotate an utterance of skill
type K, "It is because sneakers were primarily designed
for sports.", for the next dialogue turn. This example
further implies that different skills can be blended
naturally so that chatbots learn to provide reasonable
responses in a multi-skill dialogue (Roller et al., 2020).
4 BOTSTALK Framework
We now present BOTSTALK, a novel framework
that automatically annotates multi-skill dialogues
based on multiple single-skill dialogue datasets.
The focus of our framework is to mimic a natu-
ral conversation by featuring both skill blending
and grounding within a dialogue episode. Figure 1
illustrates three main phases of the framework. Im-
plementation details are provided in Appendix C.
4.1 Participants in BOTSTALK
In our framework, multiple participants engage in a
conversation to iteratively generate desirable multi-
skill dialogues.
Skill Agents The first participants are multiple
single-skill agents, which annotate appropriate
skill-grounded utterances to the dialogue. Formally,
based on $D_m$ for skill $m$, when given a skill context
$stx_m$, a dialogue context $dtx_t$, and a response
space $U$, a skill agent has dialogue models
$f: (stx_m, dtx_t) \mapsto U$ which return a response

$$res_{m,t} = f(stx_m, dtx_t; \theta_m) \quad (3)$$

where $\theta_m$ denotes the parameters learned for skill $m$.
We design two main functions of the skill agent,
a generator model and a ranker model, parameterized
as $\theta^m_{gen}$ and $\theta^m_{rnk}$ for skill $m$,
respectively. For $\theta_{gen}$, we aim to generate
responses from the response space $U$ in a token-by-token
manner, and thus employ a dodecaDialogue (Shuster et al.,
2020) model, a modification of a transformer Seq2Seq
architecture. For $\theta_{rnk}$, we instead treat the
response space $U$ as a list of alternatives from which
to pick the correct response, and thus employ a
poly-encoder (Humeau et al., 2020) model, a
transformer-based retrieval architecture, to score
and rank response candidates. Both $\theta_{gen}$ and
$\theta_{rnk}$ are fine-tuned on the individual
single-skill datasets.¹

¹On average, the generator and ranker models show around
10 perplexity and 90 accuracy on their respective datasets.
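A minimal sketch of the skill-agent interface described above, with toy lambdas standing in for the fine-tuned generator ($\theta_{gen}$) and ranker ($\theta_{rnk}$). The class, the echo-the-persona generator, and the word-overlap scorer are purely illustrative, not the authors' implementation:

```python
from typing import Callable, List, Sequence

class SkillAgent:
    """One single-skill agent: proposes responses and ranks candidates."""

    def __init__(self, skill: str, generate: Callable, score: Callable):
        self.skill = skill
        self._generate = generate  # stand-in for the generator model (theta_gen)
        self._score = score        # stand-in for the ranker model (theta_rnk)

    def propose(self, skill_context: Sequence[str],
                dialogue_context: Sequence[str]) -> str:
        # Phase 1: simulate a skill-grounded response for the next turn.
        return self._generate(skill_context, dialogue_context)

    def rank(self, skill_context: Sequence[str],
             dialogue_context: Sequence[str], candidates: List[str]) -> str:
        # Phase 3: the active agent picks the best candidate by ranker score.
        return max(candidates,
                   key=lambda c: self._score(skill_context, dialogue_context, c))

# Toy stand-ins: the generator echoes the persona; the scorer counts word
# overlap between a candidate and the skill context.
persona_agent = SkillAgent(
    "P",
    generate=lambda stx, dtx: f"Oh really? I {stx[0].lower()}.",
    score=lambda stx, dtx, c: len(set(c.lower().split())
                                  & set(" ".join(stx).lower().split())),
)
reply = persona_agent.propose(["Like tennis shoes"], ["I love sneakers."])
```
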
[Figure 1 content omitted: the figure shows three skill agents (P, K, E) proposing candidate responses to "I love sneakers and think they are the most comfortable shoes around.", ranking the candidates, and checking whether the top-ranked utterance is natural given the previous utterance.]
Figure 1: Illustration of the BOTSTALK framework. Green, blue, and purple indicate skill types P, K, and E, respectively.
While all skill agents simulate what response to
annotate, only one skill agent is given priority over
the others to "speak" the response at each dialogue
turn, conditioned on a set of skill contexts $\tilde{stx}$
and the dialogue context $dtx_t$. We call this the
active agent. The priority may be passed to another
skill agent, in which case the current active agent is
deactivated and another skill agent is newly activated
to speak.
Moderator Agent A critical constraint on the skill
agents is that neither the generator nor the ranker for
skill $m$ is able to read the other skill contexts in
$\tilde{stx}$. For a skill agent, considering all possible
skill contexts in multi-skill dialogues is non-trivial.
Instead, as an omniscient oracle over all skill contexts
$\tilde{stx}$, we develop another participant, the
moderator agent, which mediates the conversational flow
for desirable multi-skill dialogue annotation. To examine
the relevance of a response $res_t$ to all skill contexts
$\tilde{stx}$ or to the dialogue context $dtx_t$, the
moderator agent has a decision function
$g: (\tilde{stx}, dtx_t, res_t) \mapsto A$, where $A$
is an action space (i.e., approval or refusal) for the
given response.
4.2 Phase 1: Simulate what to speak
We integrate different dialogue setups from multiple
single-skill datasets as seed information to start
a conversation (detailed in Appendix C.3). For a
dialogue episode, the dialogue context is initialized
as an utterance pair (i.e., two dialogue turns) randomly
sampled from a single-skill dataset $D_m$, and the
skill agent for skill $m$ becomes the initial active
agent. Then, for a generalizable dialogue setup,
we retrieve the most relevant skill contexts for the
seed dialogue context from each of the input datasets
$D_1, \ldots, D_M$ with TF-IDF (Chen et al., 2017).²

²While we use a simple IR baseline as a lower bound, since
retrieval is not our main focus, one can easily try a different IR system.
In the first phase of BOTSTALK, all skill agents
simulate their own responses for the next dialogue
turn. Formally, given a skill context $stx_m$ and the
current dialogue context $dtx_t$ in a dialogue episode,
a skill agent for skill $m$ generates a plausible
response $res_{m,t}$ as

$$res_{m,t} = \operatorname*{argmax}_{res_t \in U} P(res_t \mid stx_m, dtx_t; \theta^m_{gen}) \cdot g(\tilde{stx}, res_t) \quad (4)$$

where $g(\cdot)$ is the function of the moderator agent,
which we discuss in the subsequent section.
Depending on its individual skill, every skill agent
returns a skill-relevant response. For example, as
shown in Figure 1, when "I love sneakers and think
they are the most comfortable shoes around." is
given as $dtx$, the skill agent for skill P generates a
personal response, "Oh really? I like tennis shoes
more than sneakers.", as $res_P$ based on a given
persona. Meanwhile, the skill agents for skills K and E
generate a knowledgeable response, "It is because
sneakers were primarily designed for sports.", as
$res_K$ and an empathetic response, "Me too! I
definitely use mine everyday wear!", as $res_E$.
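Phase 1 can be read as rejection sampling: each agent regenerates until the moderator approves, per the $g(\tilde{stx}, res_t)$ factor in Equation 4. A schematic loop under that reading (function and stub names are hypothetical):

```python
from typing import Callable, Dict, List, Sequence

def simulate_turn(agents: Sequence,  # objects with .skill and .propose(...)
                  skill_contexts: Dict[str, List[str]],
                  dialogue_context: List[str],
                  approve: Callable[[str], bool],
                  max_tries: int = 5) -> Dict[str, str]:
    """Each skill agent simulates a candidate response for the next turn,
    regenerating until the moderator approves (or tries run out)."""
    candidates: Dict[str, str] = {}
    for agent in agents:
        for _ in range(max_tries):
            res = agent.propose(skill_contexts[agent.skill], dialogue_context)
            if approve(res):  # the moderator's g(stx~, res_t)
                candidates[agent.skill] = res
                break
    return candidates

# Minimal stub agents that always propose a fixed reply.
class _StubAgent:
    def __init__(self, skill: str, reply: str):
        self.skill, self.reply = skill, reply
    def propose(self, stx, dtx):
        return self.reply

agents = [_StubAgent("P", "I like tennis shoes."), _StubAgent("K", "")]
# The empty K reply is never approved, so only P contributes a candidate.
turn = simulate_turn(agents, {"P": [], "K": []}, [], approve=lambda r: bool(r))
```
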
4.3 Phase 2: Check dialogue consistency
It is well known that neural dialogue systems lack
consistency (Li et al.,2016;Welleck et al.,2019).
Furthermore, since a skill agent uses its specific skill
context $stx_m$ instead of $\tilde{stx}$ for response
generation, the response is likely to be semantically
in conflict with other skill contexts in $\tilde{stx}$.
Suppose a $stx_P$ is "I wear sneakers everyday" and a
$res_E$ is "I had some trouble yesterday because my
sandals were torn". This response is inappropriate
because "my sandals were torn" contradicts "I wear
sneakers everyday". Therefore, the moderator agent,
which has access to all skill contexts $\tilde{stx}$,
filters out conflicting response candidates to preserve
dialogue consistency.
Specifically, the moderator agent leverages natural
language inference (NLI), the task of determining
whether a hypothesis sentence can be inferred from
a given premise sentence. The hypothesis sentence
is classified into one of three categories: ENTAIL
(true), NEUTRAL (undetermined), and CONTRADICT
(false). Based on the NLI classifier, the decision
function of the moderator agent is defined as

$$g(\tilde{stx}, res_t) = \begin{cases} 1, & \mathrm{NLI}(\tilde{stx}, res_t) \not\to \text{CONTRADICT} \\ 0, & \text{otherwise} \end{cases} \quad (5)$$

which represents approval/refusal of $res_t$ conditioned
on $\tilde{stx}$. A skill agent for skill $m$ repeatedly
generates new response candidates until its response is
approved, as described in Equation 4. For the NLI
classifier, we use a RoBERTa (Liu et al., 2019) model
trained on MNLI (Williams et al., 2018)³, which is
widely used in fact-checking systems (Kim et al., 2021a)⁴.
Overall, about 50% of utterances are classified as
CONTRADICT by the NLI classifier. Of all utterances
classified as CONTRADICT, about 70% are in conflict
with other types of skill contexts (Figure 2). This
result demonstrates that skill agents indeed generate
inconsistent responses due to their restricted access
to other skill contexts. We also find that the overall
proportion of utterances conflicting with $stx_P$ is
relatively high, apparently because $stx_P$ contains
more distinct descriptions than $stx_K$ and $stx_E$.
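The Phase-2 filter of Equation 5 can be sketched as follows. The `nli_label` stub below is a deliberately crude keyword heuristic used only so the sketch runs standalone; a real setup would substitute an MNLI-trained classifier such as the RoBERTa model mentioned above:

```python
def nli_label(premise: str, hypothesis: str) -> str:
    """Stub NLI predictor: flags a contradiction when the hypothesis mentions
    an item whose 'rival' appears in the premise. Purely illustrative; a real
    implementation would call an MNLI-trained classifier."""
    rival = {"sandals": "sneakers", "sneakers": "sandals"}
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    if any(w in h and rival[w] in p for w in rival):
        return "CONTRADICT"
    return "NEUTRAL"

def g_consistency(skill_contexts, response: str) -> int:
    # Equation 5: refuse (0) iff the response contradicts any skill context.
    labels = (nli_label(stx, response) for stx in skill_contexts)
    return 0 if any(label == "CONTRADICT" for label in labels) else 1

stx = ["I wear sneakers everyday"]
refused = g_consistency(stx, "I had some trouble yesterday because my sandals were torn")
approved = g_consistency(stx, "Oh no. What happened?")
```

Here `refused` is 0 (the sandals response conflicts with the sneakers persona, matching the paper's example) while `approved` is 1.
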
4.4 Phase 3: Speak or pass the mic
The objective of the last phase is to score the set of
response candidates and select a final response given
the skill contexts and the dialogue context. To this
end, we leverage the active agent and the moderator
agent, taking into account the balance between skill
blending and skill grounding.
Let $U_{res}$ be the set of response candidates
$res_{1,t}, \ldots, res_{M,t}$ from all skill agents.
The active skill agent identifies the most appropriate
response $res^*_t$ in $U_{res}$ based on its ranker
model $\theta^m_{rnk}$, then asks the moderator agent
to append the selected response to the next dialogue
context $dtx_{t+1}$ for annotation. Formally, we define
this process as

$$res^*_t = \operatorname*{argmax}_{res_t \in U_{res}} P(res_t \mid stx_m, dtx_t; \theta^m_{rnk}) \cdot g(dtx_t, res_t) \quad (6)$$

where $g(\cdot)$ is the function of the moderator agent.
To compute $g(dtx_t, res_t)$, the moderator agent
³Dialogue NLI (Welleck et al., 2019) is biased to ConvAI2.
⁴The RoBERTa model shows 90.59 accuracy on MNLI.
Figure 2: Percentages of utterances which are classified
as CONTRADICT via NLI classifier, broken down by
the type of skill contexts.
Figure 3: KL divergence between skill distributions of
consecutive utterances (left) and entropy of skill distri-
butions for all utterances (right).
adopts a skill classifier $P$ that identifies the
corresponding skill for a response. We use a BERT
(Devlin et al., 2019) model trained on utterances in
$D_m$ and their corresponding skill labels $m$ for all
skill types in $\mathcal{M}$⁵. Once $P$ is learned,
the decision function of the moderator agent is
defined as

$$g(dtx_t, res_t) = \begin{cases} 1, & \mathrm{KL}(P(res^*_{t-1}) \,\|\, P(res_t)) < \alpha \\ 0, & \text{otherwise} \end{cases} \quad (7)$$

where $res^*_{t-1}$ is the last utterance of $dtx_t$ and
$P(\cdot) \in \mathbb{R}^M$ outputs the skill distribution
of the response. Based on the KL divergence between the
two distributions, $g(dtx_t, res_t)$ is discretized into
an approval/refusal decision by a pre-defined threshold
$\alpha$ (Figure 3a). Once the moderator agent accepts
a candidate $res_t$ from an inactive agent as the final
response, the active agent passes the mic, i.e., the
priority for annotation, to the inactive agent.
In practice, we compute the entropy of the skill
distributions of all utterances to investigate whether
there is room for shifting between skills. The entropy
value indicates the uncertainty of the skill type of an
utterance: utterances with high entropy are uncertain,
generic responses. Figure 3b shows that the number of
generic utterances is far from negligible, suggesting
that there are opportunities to shift to other skills,
and thus both skill blending and grounding can be
satisfied in a conversation.

⁵The BERT model shows 81.95 accuracy at inference time.
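Equation 7's gate and the entropy probe behind Figure 3 can be sketched with plain-Python skill distributions. The threshold and the example distributions are illustrative values, not the paper's actual $\alpha$ or data:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete skill distributions over M skills."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def entropy(p, eps=1e-12):
    """Uncertainty of an utterance's skill type; high entropy = generic."""
    return -sum(pi * math.log(pi + eps) for pi in p)

def g_skill_shift(prev_dist, cand_dist, alpha=0.5):
    # Equation 7: approve (1) iff the skill distribution of the candidate
    # stays within KL-distance alpha of the previous utterance's.
    return 1 if kl_divergence(prev_dist, cand_dist) < alpha else 0

# A generic, high-entropy previous utterance leaves room to shift skills:
prev = [0.4, 0.35, 0.25]   # skill distribution over (P, K, E)
cand = [0.5, 0.3, 0.2]
shift_ok = g_skill_shift(prev, cand)
```

Here `shift_ok` is 1: the two distributions are close (KL ≈ 0.02), so the moderator would approve the candidate and, if it came from an inactive agent, the mic would be passed.
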