
CSS: Combining Self-training and Self-supervised Learning for Few-shot
Dialogue State Tracking
Haoning Zhang1,3, Junwei Bao2∗, Haipeng Sun2, Huaishao Luo2, Wenye Li4, Shuguang Cui3,1,5
1FNii, CUHK-Shenzhen  2JD AI Research  3SSE, CUHK-Shenzhen  4SDS, CUHK-Shenzhen  5Pengcheng Lab
haoningzhang@link.cuhk.edu.cn, sunhaipeng6@jd.com,
{baojunwei001, huaishaoluo}@gmail.com,
{wyli, shuguangcui}@cuhk.edu.cn
Abstract
Few-shot dialogue state tracking (DST) is a realistic problem that trains the DST model with limited labeled data. Existing few-shot methods mainly transfer knowledge learned from external labeled dialogue data (e.g., from question answering, dialogue summarization, machine reading comprehension tasks, etc.) into DST, whereas collecting a large amount of external labeled data is laborious, and the external data may not effectively contribute to the DST-specific task. In this paper, we propose a few-shot DST framework called CSS, which Combines Self-training and Self-supervised learning methods. The unlabeled data of the DST task is incorporated into the self-training iterations, where the pseudo labels are predicted by a DST model trained on limited labeled data in advance. Besides, a contrastive self-supervised method is used to learn better representations, where the data is augmented by the dropout operation to train the model. Experimental results on the MultiWOZ dataset show that our proposed CSS achieves competitive performance in several few-shot scenarios.
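To make the procedure described above concrete, the listing below is a minimal, illustrative PyTorch sketch: a model trained on the limited labeled set pseudo-labels the unlabeled DST data, and two dropout-perturbed forward passes of the same batch supply the views for a SimCSE-style contrastive term. The toy encoder, loss weighting, and names (ToyDSTModel, css_step, alpha) are illustrative assumptions, not the authors' implementation.

# Minimal sketch of the training idea in the abstract:
# (1) a DST model trained on limited labeled data pseudo-labels unlabeled data,
# (2) retraining uses labeled + pseudo-labeled data,
# (3) a dropout-based contrastive term improves representations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyDSTModel(nn.Module):
    """Stand-in encoder + slot-value classifier (illustrative only)."""
    def __init__(self, vocab=1000, dim=128, num_values=50, p_drop=0.1):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab, dim)
        self.dropout = nn.Dropout(p_drop)
        self.classifier = nn.Linear(dim, num_values)

    def forward(self, token_ids):
        h = self.dropout(self.embed(token_ids))   # dropout => stochastic views
        return self.classifier(h), h              # logits, representation


def contrastive_loss(h1, h2, temperature=0.05):
    """InfoNCE over two dropout-augmented views of the same batch."""
    z1, z2 = F.normalize(h1, dim=-1), F.normalize(h2, dim=-1)
    sim = z1 @ z2.t() / temperature               # (B, B) similarity matrix
    labels = torch.arange(sim.size(0))            # positives on the diagonal
    return F.cross_entropy(sim, labels)


def css_step(model, optimizer, labeled, unlabeled, alpha=1.0):
    """One combined step: supervised + pseudo-label + contrastive losses."""
    x_l, y_l = labeled
    x_u = unlabeled

    # Pseudo labels come from the current model (no gradient through them).
    with torch.no_grad():
        model.eval()                              # deterministic pass
        pseudo_logits, _ = model(x_u)
        y_u = pseudo_logits.argmax(dim=-1)
    model.train()

    # Two forward passes of the same batch give two dropout-augmented views.
    x_all = torch.cat([x_l, x_u])
    logits1, h1 = model(x_all)
    _, h2 = model(x_all)

    supervised = F.cross_entropy(logits1, torch.cat([y_l, y_u]))
    loss = supervised + alpha * contrastive_loss(h1, h2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = ToyDSTModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    labeled = (torch.randint(0, 1000, (8, 16)), torch.randint(0, 50, (8,)))
    unlabeled = torch.randint(0, 1000, (8, 16))
    print(css_step(model, opt, labeled, unlabeled))

In practice the pseudo labels would be refreshed over several self-training iterations rather than in a single step; this sketch only shows how the supervised, pseudo-label, and contrastive losses fit together.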
1 Introduction
Dialogue state tracking (DST) is an essential subtask in a task-oriented dialogue system (Yang et al., 2021; Ramachandran et al., 2022; Sun et al., 2022). It predicts the dialogue state corresponding to the user's intents at each dialogue turn, which is then used to extract the user's preferences and generate the natural language response (Williams and Young, 2007; Young et al., 2010; Lee and Kim, 2016; Mrkšić et al., 2017; Xu and Hu, 2018; Wu et al., 2019a; Kim et al., 2020; Ye et al., 2021; Wang et al., 2022). Table 1 gives an example of DST in a conversation, where the dialogue state is accumulated and updated after each turn.
∗Corresponding author
Usr: Hi I am looking for a restaurant in the north that serves Asian oriental food.
Sys: I would recommend Saigon city. Would you like to make a reservation?
Usr: That sounds great! We would like a reservation for Monday at 16:45 for 6 people. Can I get the reference number for our reservation?
Dialogue state: restaurant-area-north, restaurant-food-Asian, restaurant-name-Saigon city, restaurant-book day-Monday, restaurant-book time-16:45, restaurant-book people-6
Table 1: A dialogue example containing utterances from the user and system sides and the corresponding dialogue state (a set of domain-slot-value pairs).
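As a concrete view of the state in Table 1, the short Python snippet below represents the dialogue state as (domain, slot) -> value pairs and shows how turn-level predictions are accumulated into it; the dict layout and merge rule are illustrative assumptions, not a prescribed format.

# State after the first user turn of Table 1, as (domain, slot) -> value pairs.
state = {
    ("restaurant", "area"): "north",
    ("restaurant", "food"): "Asian",
}

# Turn-level predictions from the later turns are merged into the
# accumulated state (new pairs extend it; repeated slots are overwritten).
turn_update = {
    ("restaurant", "name"): "Saigon city",
    ("restaurant", "book day"): "Monday",
    ("restaurant", "book time"): "16:45",
    ("restaurant", "book people"): "6",
}
state.update(turn_update)
print(state)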
Training a DST model requires plenty of dialogue corpora containing dialogue utterances and human-annotated state labels, whereas such annotation is costly. Therefore, DST models are expected to achieve acceptable performance when trained with limited labeled data, i.e., in the few-shot cases (Wu et al., 2020b). Previous studies on few-shot DST address the data scarcity issue mainly by leveraging external labeled dialogue corpora to pre-train language models, which are then transferred to the DST task (Wu et al., 2020a; Su et al., 2022; Shin et al., 2022). However, this approach has several disadvantages: first, collecting a large amount of external labeled data is still laborious; second, utilizing the external data depends heavily on computational resources, since the language models have to be further pre-trained; third, the external data usually comes from different conversation scenarios and NLP tasks, such as multi-topic dialogues, question answering, and dialogue summarization. These data types and distributions differ from the DST-specific training data, making it less efficient to transfer the learned knowledge to DST.
We consider utilizing the unlabeled data of the DST task, which is easy to access and has similar content to the limited labeled data, so that the DST model can be enhanced by training on an enlarged data corpus. In this paper, we propose a few-shot DST framework called CSS, which Combines Self-training and Self-supervised learning.