
BOTSTALK: Machine-sourced Framework for Automatic Curation of
Large-scale Multi-skill Dialogue Datasets
Minju Kim1∗ Chaehyeong Kim1∗ Yongho Song1∗ Seung-won Hwang2 Jinyoung Yeo1†
1Department of Artificial Intelligence, Yonsei University
2Department of Computer Science and Engineering, Seoul National University
{minnju,cheris8,kopf_yhs,jinyeo}@yonsei.ac.kr seungwonh@snu.ac.kr
Abstract
To build open-domain chatbots that are able to use diverse communicative skills, we propose a novel framework, BOTSTALK, where multiple agents grounded to specific target skills participate in a conversation to automatically annotate multi-skill dialogues. We further present Blended Skill BotsTalk (BSBT), a large-scale multi-skill dialogue dataset comprising 300K conversations. Through extensive experiments, we demonstrate that our dataset is effective for multi-skill dialogue systems, which require an understanding of skill blending as well as skill grounding. Our code and data are available at https://github.com/convei-lab/BotsTalk.
1 Introduction
Considerable progress has been made towards open-domain chatbots with different desirable qualities in conversation. Each of these models specializes in a single communicative skill, i.e., skill grounding. A number of distinct large-scale datasets targeting a specific conversational skill have recently become available. ConvAI2 (Dinan et al., 2020b) is provided for research that aims to endow chatbots with personas (Majumder et al., 2020a; Kim et al., 2020b), enabling chatbots to talk about themselves. Wizard of Wikipedia (WoW) (Dinan et al., 2019) is a popular option for recent studies (Lian et al., 2019; Zhao et al., 2020; Kim et al., 2020a) that focus on knowledgeable conversational agents discussing topics in depth. Empathetic Dialogues (ED) (Rashkin et al., 2019) is also commonly used to embody empathy in dialogue systems (Santhanam and Shaikh, 2019; Majumder et al., 2020b). Most such skill-grounded datasets are designed to improve a single skill, and are thus effective when models are asked to demonstrate the targeted conversational skill.
∗Equal contribution
†Corresponding author
Benefiting from the advances of these conversational agents, recent research focuses on another aspect of open-domain chatbots: the ability to blend various conversational skills into one cohesive flow in a seamless manner, i.e., skill blending. A good open-domain chatbot should be able to weave multiple behaviors and skills into a single conversation, enabling it to deal with different users and situations appropriately (Shuster et al., 2020; Roller et al., 2021). Towards this goal, there is a need to construct a multi-skill dialogue dataset, which consists of multi-turn dialogues that exhibit multiple skills. While Smith et al. (2020) propose Blended Skill Talk (BST), a crowdsourced dataset of 5K conversations, as a reliable benchmark for measuring dialogue systems' ability at this blended objective, it is not sufficient for building a multi-skill chatbot due to its limited scale. Scaling up crowdsourcing is not feasible, as it requires labor-intensive manual annotation and verification. Instead, automatic curation has shown promising results for large-scale dialogue generation (Mohapatra et al., 2021).
In this paper, we aim to generate a large-scale multi-skill dialogue dataset without additional costs or human effort. To this end, we introduce an automatic data curation approach named BOTSTALK, where multiple dialogue agents grounded to individual skills engage in a conversation to blend all skills together. Based on this framework, we create Blended Skill BotsTalk (BSBT), a large-scale multi-skill dialogue dataset of 300K conversations blended and grounded with a number of skills derived from ConvAI2, WoW, and ED. Our experiments demonstrate that by using our dataset, dialogue models yield large performance gains in skill blending while maintaining competitive performance in skill grounding. Furthermore, we validate the quality of the BSBT dataset through human evaluation, showing that our machine-sourced conversations are consistently preferred over crowdsourced ones from BST by human judges across all metrics.
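To give a rough intuition for machine-sourced curation before the framework is detailed, the sketch below shows one way skill-grounded agents could take turns proposing candidate responses, with the best-scoring candidate kept and annotated with its skill. This is a minimal illustration under assumed names: the SkillAgent interface, its respond method, and the score-based selection heuristic are hypothetical, not the actual BOTSTALK pipeline.

```python
from typing import Protocol


class SkillAgent(Protocol):
    """Hypothetical interface for a dialogue agent grounded to one skill."""

    skill: str  # e.g., "persona" (ConvAI2), "knowledge" (WoW), "empathy" (ED)

    def respond(self, history: list[str]) -> tuple[str, float]:
        """Return a candidate utterance and a self-reported relevance score."""
        ...


def curate_dialogue(agents: list[SkillAgent], seed_context: str,
                    num_turns: int = 8) -> list[dict]:
    """Let skill-grounded agents take turns to build one annotated dialogue.

    Each turn, every agent proposes a response grounded in its own skill;
    the highest-scoring candidate is appended with a skill annotation.
    """
    history = [seed_context]
    dialogue = []
    for _ in range(num_turns):
        # Collect one candidate response per skill-grounded agent.
        candidates = [(agent.skill, *agent.respond(history)) for agent in agents]
        # Illustrative heuristic: keep the candidate judged most relevant.
        skill, utterance, _score = max(candidates, key=lambda c: c[2])
        dialogue.append({"utterance": utterance, "skill": skill})
        history.append(utterance)
    return dialogue
```

Running such a loop over many seed contexts would yield multi-skill dialogues in which each turn carries a skill label, which is the kind of annotation a multi-skill dataset requires; how candidates are actually generated, scored, and filtered in BOTSTALK is specified in the following sections.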