
2 Dataset Creation
We collect a high-quality dataset via a four-step crowdsourcing procedure, as illustrated in Figure 2.
2.1 Initial Query Collection
To determine the topic of the multi-intent user query, we sample an initial query from two Chinese user query understanding datasets for task-oriented conversational agents, namely SMP-ECDT^2 (Zhang et al., 2017) and RiSAWOZ^3 (Quan et al., 2020). We then ask human annotators to simplify initial queries that are excessively long (longer than 15 characters) or semantically verbose or repetitive.^4 RiSAWOZ is a large-scale multi-domain Chinese Wizard-of-Oz NLU dataset with rich semantic annotations, covering 12 domains such as tourist attraction, railway, hotel, and restaurant. SMP-ECDT is released as the benchmark for the "domain and intent identification for user query" task in the evaluation track of the Chinese Social Media Processing (SMP) conference in 2017 and 2019. It covers diverse practical user queries from 30 domains, collected from the production chatbots of iFLYTEK. We use
these two source datasets as our query resources because they comprise a variety of common, naturally occurring daily-life user queries for task-oriented chatbots and cover diverse domains and topics.
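To make the 15-character criterion concrete, the following minimal sketch flags overly long sampled queries for manual simplification. The function name and example queries are our own illustration, not the authors' code, and the semantic verbosity check remains a human judgment.

```python
# Illustrative pre-filter (not the authors' released code): flag initial
# queries above the paper's 15-character threshold for manual simplification.
def needs_simplification(query: str, max_chars: int = 15) -> bool:
    return len(query) > max_chars

# Hypothetical sampled queries; annotators simplify only the flagged ones.
queries = ["帮我查一下明天从南京到上海的高铁票", "南京明天天气怎么样"]
to_simplify = [q for q in queries if needs_simplification(q)]
```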
2.2 Follow-up Query Creation
After specifying an initial query, we ask human annotators to put themselves in the position of a real end user and imagine eliciting multiple intents in a single complex query while interacting with a conversational agent. The annotators are instructed to write up to 3 subsequent queries about what they need or would like to know, given the designated initial query. Although most subsequent queries stick to the topic of the initial query, we allow annotators to switch to a different topic that is unrelated to the initial query.^5
For example, in Figure 1 the second sub-query asks about the weather in Nanjing, whereas the initial query is an inquiry about railway information. By manually checking 300 subsampled instances in the training set, we observe that 37.3% of the annotated multi-intent queries involve topic switching, which conforms to user behaviour in real-world multi-intent queries.

2 http://ir.hit.edu.cn/SMP2017-ECDT
3 https://github.com/terryqj0107/RiSAWOZ
4 The sentence simplification phase makes the annotated multi-intent queries sound more natural, as users are unlikely to utter lengthy queries. Since we append 2 or 3 sub-queries to each initial query, the initial queries should be simplified to keep the overall query length reasonable (Figure 2).
5 In fact, we neither encourage nor discourage topic switching in the annotation instructions.
2.3 Query Aggregation
In a pilot study, we asked human annotators to manually aggregate the sub-queries, but found that the derived queries somewhat lacked variation in the conjunctions between sub-queries, as annotators tended to always pick the most common Chinese conjunctions such as 'and', 'or', and 'then'. We even observed sloppy annotators trying to game the annotation job by using no conjunctions at all (most queries remain fluent even without conjunctions). In a nutshell, we found it challenging to screen annotators and ensure the diversity and naturalness of the derived queries with human-only annotation. We therefore resort to human-in-the-loop annotation, sampling from a rich conjunction set to connect sub-queries and post-checking the sentence fluency of aggregated queries with GPT-2. After each round of annotation (we run 6 rounds in total), we randomly pick 100 samples and check their quality, finding that over 95% are of high quality. Indeed, most sentences in Figure 9 (appendix) are fluent and natural (especially in Chinese), without cherry-picking.
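As a rough sketch of the GPT-2 fluency check (the paper does not release this code), one can score each aggregated query by language-model perplexity with a pretrained GPT-2; the Chinese checkpoint name below is an assumption on our part.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint: a 117M-parameter Chinese GPT-2; the paper only
# specifies "GPT-2 (117M)", not the exact pretrained model.
MODEL = "uer/gpt2-chinese-cluecorpussmall"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def perplexity(text: str) -> float:
    """Language-model perplexity of `text`; lower suggests higher fluency."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids yields the mean token-level cross-entropy;
        # its exponential is the perplexity.
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))
```

Aggregated queries whose perplexity is unusually high relative to the batch can then be flagged for re-annotation.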
More concretely, we propose a set of pre-defined templates that correspond to different text infilling strategies between consecutive queries. Specifically, with a 50% chance we concatenate two consecutive queries without any text filler. With the other 50% chance, we sample a piece of text from a set of pre-defined text fillers with different sampling weights, such as "首先" (first of all), "以及" (and), "我还想知道" (I also would like to know), "接下来" (then), and "最后" (finally), and use the sampled text filler as a conjunction while concatenating the consecutive queries.
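A minimal sketch of this aggregation scheme is given below. The paper lists the fillers but not their exact sampling weights, so the weights (and the helper name) here are hypothetical.

```python
import random

# Pre-defined text fillers; the sampling weights are hypothetical, as the
# paper does not report the exact values.
FILLERS = ["首先", "以及", "我还想知道", "接下来", "最后"]
WEIGHTS = [0.1, 0.3, 0.3, 0.2, 0.1]

def aggregate(sub_queries: list[str]) -> str:
    """Join sub-queries left to right, inserting a sampled filler between
    each consecutive pair with 50% probability."""
    out = sub_queries[0]
    for q in sub_queries[1:]:
        if random.random() < 0.5:
            out += q  # 50% chance: concatenate with no text filler
        else:
            out += random.choices(FILLERS, weights=WEIGHTS)[0] + q
    return out
```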
Although locally coherent, the derived multi-intent queries may still exhibit some global incoherence and syntactic issues, especially for longer texts. We thus post-process the derived queries with a ranking procedure as an additional screening step. For each annotated query set, we generate 10 candidate multi-intent queries with different sampled templates and rank them by language-model perplexity using a GPT-2 (117M) model. We only keep the