CSS: Combining Self-training and Self-supervised Learning for Few-shot
Dialogue State Tracking
Haoning Zhang1,3, Junwei Bao2, Haipeng Sun2, Huaishao Luo2, Wenye Li4, Shuguang Cui3,1,5
1FNii, CUHK-Shenzhen 2JD AI Research
3SSE, CUHK-Shenzhen 4SDS, CUHK-Shenzhen 5Pengcheng Lab
haoningzhang@link.cuhk.edu.cn, sunhaipeng6@jd.com,
{baojunwei001, huaishaoluo}@gmail.com,
{wyli, shuguangcui}@cuhk.edu.cn
Abstract
Few-shot dialogue state tracking (DST) is a realistic problem in which the DST model is trained with limited labeled data. Existing few-shot meth-
ods mainly transfer knowledge learned from
external labeled dialogue data (e.g., from ques-
tion answering, dialogue summarization, ma-
chine reading comprehension tasks, etc.) into
DST, whereas collecting a large amount of ex-
ternal labeled data is laborious, and the exter-
nal data may not effectively contribute to the
DST-specific task. In this paper, we propose a
few-shot DST framework called CSS, which
Combines Self-training and Self-supervised
learning methods. The unlabeled data of the
DST task is incorporated into the self-training
iterations, where the pseudo labels are pre-
dicted by a DST model trained on limited la-
beled data in advance. Besides, a contrastive
self-supervised method is used to learn better
representations, where the data is augmented
by the dropout operation to train the model.
Experimental results on the MultiWOZ dataset
show that our proposed CSS achieves competi-
tive performance in several few-shot scenarios.
1 Introduction
Dialogue state tracking (DST) is an essential sub-
task in a task-oriented dialogue system (Yang et al., 2021; Ramachandran et al., 2022; Sun et al., 2022).
It predicts the dialogue state corresponding to the user's intents at each dialogue turn, which is then used to extract the user's preferences and generate the natural language response (Williams and Young, 2007; Young et al., 2010; Lee and Kim, 2016; Mrkšić et al., 2017; Xu and Hu, 2018; Wu et al., 2019a; Kim et al., 2020; Ye et al., 2021; Wang et al., 2022). Table 1 gives an example of DST in a conversation, where the dialogue state is accumulated and updated after each turn.
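To make the state format concrete, the following minimal Python sketch (an illustrative representation, not the paper's implementation) stores the state of the dialogue in Table 1 below as a mapping from domain-slot pairs to values and updates it after a new turn:

# A dialogue state is a set of domain-slot-value triples; here it is stored
# as a dict keyed by (domain, slot). The layout is an illustrative assumption.
state = {
    ("restaurant", "area"): "north",
    ("restaurant", "food"): "Asian oriental",
    ("restaurant", "name"): "Saigon city",
}

def update_state(state: dict, turn_prediction: dict) -> dict:
    """Accumulate the state: values predicted at the new turn overwrite
    or extend the values carried over from previous turns."""
    new_state = dict(state)
    new_state.update(turn_prediction)
    return new_state

# After the booking turn in Table 1, the tracker adds the booking slots
# while keeping the slots already filled in earlier turns.
state = update_state(state, {
    ("restaurant", "book day"): "Monday",
    ("restaurant", "book time"): "16:45",
    ("restaurant", "book people"): "6",
})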
Usr: Hi I am looking for a restaurant in the north that serves Asian oriental food.
Sys: I would recommend Saigon city. Would you like to make a reservation?
Usr: That sounds great! We would like a reservation for Monday at 16:45 for 6 people. Can I get the reference number for our reservation?
Dialogue state: restaurant-area-north, restaurant-food-Asian oriental, restaurant-name-Saigon city, restaurant-book day-Monday, restaurant-book time-16:45, restaurant-book people-6

Table 1: A dialogue example containing utterances from the user and system sides and the corresponding dialogue state (a set of domain-slot-value pairs).

Training a DST model requires a large dialogue corpus containing dialogue utterances and human-annotated state labels, whereas annotating
is costly. Therefore, the DST models are expected
to have acceptable performance when trained with
limited labeled data, i.e., in the few-shot cases (Wu et al., 2020b). Previous studies on few-shot DST
solve the data scarcity issue mainly by leveraging
external labeled dialogue corpus to pre-train the
language models, which are then transferred into
the DST task (Wu et al., 2020a; Su et al., 2022; Shin et al., 2022). However, there exist several
disadvantages: first, collecting a large amount of
external labeled data is still laborious; second, uti-
lizing the external data is heavily dependent on
computational resources since the language models
have to be further pre-trained; third, the external
data always comes from different conversation scenarios and NLP tasks, such as multi-topic dialogues, question answering, dialogue summarization, etc. These data types and distributions differ from those of the DST-specific training data, making it less efficient to transfer the learned knowledge into DST.
We consider utilizing the unlabeled data of the
DST task, which is easy to access and has sim-
ilar contents to the limited labeled data, so that
the DST model can be enhanced by training on
an enlarged data corpus. In this paper, we propose a few-shot DST framework called CSS, which Combines the Self-training and Self-supervised methods.
[Figure 1 appears here. Panel (a), Overall Framework: teacher model initialization, pseudo label prediction, student model update, and iteration over labeled (L) and unlabeled (U) data. Panel (b), Model Architecture: a fine-tuned BERT context encoder with dropout-based data augmentation, fixed BERT encoders for slot and value states, slot-context attention, and slot-value matching, trained with a pull-close/push-apart contrastive objective over each mini-batch.]
Figure 1: The description of CSS. Part (a) is the overall teacher-student training iteration process; L and U correspond to labeled and unlabeled data. Part (b) is the model architecture for both the teacher and the student, where the red dashed box illustrates the self-supervised learning objective realized through dropout augmentation: narrow the distance between each instance and its corresponding augmented one (pull close), and enlarge its distance to the rest of the instances in the same batch in the representation space (push apart).
Specifically, a DST model is
first trained on limited labeled data and used to gen-
erate the pseudo labels of the unlabeled data; then
both the labeled and unlabeled data can be used to
train the model iteratively. Besides, we augment
the data through the contrastive self-supervised
dropout operation to learn better representations.
Each training instance is masked through a dropout
embedding layer, which will act as the contrastive
pair, and the model is trained to pull the origi-
nal and the dropout-augmented instances closer in the representation space. Experiments on the multi-domain dia-
logue dataset MultiWOZ demonstrate that our CSS
achieves competitive performance with existing
few-shot DST models.
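Viewed procedurally, the training scheme sketched above reduces to a teacher-student loop. The Python outline below is a schematic reading of it; train_dst, the .predict() method, and the iteration count are placeholder assumptions rather than the actual implementation:

from typing import Callable, List, Tuple

def css_self_training(
    labeled: List[Tuple[str, dict]],   # (dialogue context, gold state) pairs
    unlabeled: List[str],              # dialogue contexts without labels
    train_dst: Callable,               # trains a DST model on (context, state) pairs
    num_iterations: int = 3,
):
    # A teacher DST model is first trained on the limited labeled data
    # (with the dropout-based contrastive objective applied during training).
    teacher = train_dst(labeled)
    for _ in range(num_iterations):
        # The current teacher predicts pseudo dialogue states for the unlabeled data.
        pseudo_labeled = [(ctx, teacher.predict(ctx)) for ctx in unlabeled]
        # A student is then trained on labeled plus pseudo-labeled data,
        # again with the contrastive dropout augmentation.
        student = train_dst(labeled + pseudo_labeled)
        # The student becomes the teacher for the next iteration.
        teacher = student
    return teacher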
2 Related Work
Few-shot DST focuses on the model performance with limited labeled training data, coping with the general data scarcity issue. Existing DST mod-
els enhance the few-shot performance mainly by
incorporating external data of different tasks to
further pre-train a language model, which still demands costly data collection and computational resources (Gao et al., 2020; Lin et al., 2021; Su et al., 2022; Shin et al., 2022). Inspired by self-training, which incorporates the predicted pseudo labels of the unlabeled data to enlarge the training corpus (Wang et al., 2020; Mi et al., 2021; Sun et al., 2021), in this
paper, we build our framework upon the NoisyStu-
dent method (Xie et al., 2020) to enhance the DST
model in few-shot cases.
Self-supervised learning trains a model on an
auxiliary task with automatically obtained ground truth (Mikolov et al., 2013; Jin et al., 2018; Wu et al., 2019b; Devlin et al., 2019; Lewis et al., 2020). As one of the self-supervised ap-
proaches, contrastive learning succeeds in various
NLP-related tasks, which helps the model learn
high-quality representations (Cai et al., 2020; Klein and Nabi, 2020; Gao et al., 2021; Yan et al., 2021).
In this paper, we construct contrastive data pairs
by the dropout operation to train the DST model,
which does not need extra supervision.
3 Methodology
Figure 1 shows the CSS framework, where (a) is
the overall training framework, and (b) is the ar-
chitecture of both teacher and student models. Our
CSS follows the NoisyStudent self-training framework (Xie et al., 2020). After deriving a teacher DST model trained with labeled data, it is continuously trained and updated into the student DST model with both labeled and unlabeled data, where
the pseudo labels of the unlabeled data are syn-
chronously predicted. Unlike the original NoisyStudent, which augments training data only in the student training stage, we apply the contrastive self-supervised learning method when training both the teacher and the student models: each training instance is augmented through a dropout operation, and the model is trained to pull each instance and its augmented pair closer while pushing it away from the rest of the instances in the same batch.
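Concretely, the pull-close/push-apart objective can be written as an in-batch contrastive loss over two dropout-noised encodings of the same inputs, in the spirit of SimCSE (Gao et al., 2021). The PyTorch sketch below is our assumption of what such a term looks like, not the paper's exact loss; it assumes the encoder is in training mode so that dropout is stochastic:

import torch
import torch.nn.functional as F

def dropout_contrastive_loss(encoder, batch_inputs, temperature: float = 0.05):
    """In-batch contrastive loss: two forward passes with independent dropout
    masks yield the positive pair for each instance; the other instances in
    the same mini-batch serve as negatives."""
    z1 = encoder(batch_inputs)          # (batch, hidden), first dropout mask
    z2 = encoder(batch_inputs)          # (batch, hidden), second dropout mask
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / temperature     # pairwise cosine similarities
    targets = torch.arange(sim.size(0), device=sim.device)
    # Diagonal entries are pulled close (instance vs. its augmented copy);
    # off-diagonal entries are pushed apart (instance vs. the rest of the batch).
    return F.cross_entropy(sim, targets)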
3.1 DST Task and Base Model
Let's define $D_t = \{(Q_t, R_t)\}_{t=1:T}$ as the set of system query and user response pairs in total $T$ turns.
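As a small illustration of how the (Q_t, R_t) pairs could be serialized for the BERT context encoder in Figure 1(b), the sketch below flattens the dialogue history into one input string; the [SYS]/[USR] separators and the overall serialization are assumptions, not the paper's exact input format:

from typing import List, Tuple

def build_context_input(turns: List[Tuple[str, str]]) -> str:
    """Flatten the system query / user response pairs (Q_t, R_t) observed so
    far into a single string for the BERT context encoder."""
    pieces = []
    for system_query, user_response in turns:
        pieces.append(f"[SYS] {system_query} [USR] {user_response}")
    return " ".join(pieces)

# First turn of Table 1: the system query is empty before the system speaks.
context = build_context_input([
    ("", "Hi I am looking for a restaurant in the north that serves Asian oriental food."),
])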