
CSS: Combining Self-training and Self-supervised Learning for Few-shot
Dialogue State Tracking
Haoning Zhang1,3, Junwei Bao2∗, Haipeng Sun2, Huaishao Luo2, Wenye Li4, Shuguang Cui3,1,5
1FNii, CUHK-Shenzhen  2JD AI Research  3SSE, CUHK-Shenzhen  4SDS, CUHK-Shenzhen  5Pengcheng Lab
haoningzhang@link.cuhk.edu.cn, sunhaipeng6@jd.com,
{baojunwei001, huaishaoluo}@gmail.com,
{wyli, shuguangcui}@cuhk.edu.cn
Abstract
Few-shot dialogue state tracking (DST) is a realistic problem that trains the DST model with limited labeled data. Existing few-shot methods mainly transfer knowledge learned from external labeled dialogue data (e.g., from question answering, dialogue summarization, machine reading comprehension tasks, etc.) into DST, whereas collecting a large amount of external labeled data is laborious, and the external data may not effectively contribute to the DST-specific task. In this paper, we propose a few-shot DST framework called CSS, which Combines Self-training and Self-supervised learning methods. The unlabeled data of the DST task is incorporated into the self-training iterations, where the pseudo labels are predicted by a DST model trained on limited labeled data in advance. Besides, a contrastive self-supervised method is used to learn better representations, where the data is augmented by the dropout operation to train the model. Experimental results on the MultiWOZ dataset show that our proposed CSS achieves competitive performance in several few-shot scenarios.
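To make the procedure described above concrete, the listing below is a minimal, illustrative PyTorch sketch: a model trained on the limited labeled set pseudo-labels the unlabeled DST data, and two dropout-perturbed forward passes of the same batch supply the views for a SimCSE-style contrastive term. The toy encoder, loss weighting, and names (ToyDSTModel, css_step, alpha) are illustrative assumptions, not the authors' implementation.

# Minimal sketch of the training idea in the abstract:
# (1) a DST model trained on limited labeled data pseudo-labels unlabeled data,
# (2) retraining uses labeled + pseudo-labeled data,
# (3) a dropout-based contrastive term improves representations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyDSTModel(nn.Module):
    """Stand-in encoder + slot-value classifier (illustrative only)."""
    def __init__(self, vocab=1000, dim=128, num_values=50, p_drop=0.1):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab, dim)
        self.dropout = nn.Dropout(p_drop)
        self.classifier = nn.Linear(dim, num_values)

    def forward(self, token_ids):
        h = self.dropout(self.embed(token_ids))   # dropout => stochastic views
        return self.classifier(h), h              # logits, representation


def contrastive_loss(h1, h2, temperature=0.05):
    """InfoNCE over two dropout-augmented views of the same batch."""
    z1, z2 = F.normalize(h1, dim=-1), F.normalize(h2, dim=-1)
    sim = z1 @ z2.t() / temperature               # (B, B) similarity matrix
    labels = torch.arange(sim.size(0))            # positives on the diagonal
    return F.cross_entropy(sim, labels)


def css_step(model, optimizer, labeled, unlabeled, alpha=1.0):
    """One combined step: supervised + pseudo-label + contrastive losses."""
    x_l, y_l = labeled
    x_u = unlabeled

    # Pseudo labels come from the current model (no gradient through them).
    with torch.no_grad():
        model.eval()                              # deterministic pass
        pseudo_logits, _ = model(x_u)
        y_u = pseudo_logits.argmax(dim=-1)
    model.train()

    # Two forward passes of the same batch give two dropout-augmented views.
    x_all = torch.cat([x_l, x_u])
    logits1, h1 = model(x_all)
    _, h2 = model(x_all)

    supervised = F.cross_entropy(logits1, torch.cat([y_l, y_u]))
    loss = supervised + alpha * contrastive_loss(h1, h2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = ToyDSTModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    labeled = (torch.randint(0, 1000, (8, 16)), torch.randint(0, 50, (8,)))
    unlabeled = torch.randint(0, 1000, (8, 16))
    print(css_step(model, opt, labeled, unlabeled))

In practice the pseudo labels would be refreshed over several self-training iterations rather than in a single step; this sketch only shows how the supervised, pseudo-label, and contrastive losses fit together.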
1 Introduction
Dialogue state tracking (DST) is an essential subtask in a task-oriented dialogue system (Yang et al., 2021; Ramachandran et al., 2022; Sun et al., 2022). It predicts the dialogue state corresponding to the user's intents at each dialogue turn, which is then used to extract the user's preferences and generate the natural language response (Williams and Young, 2007; Young et al., 2010; Lee and Kim, 2016; Mrkšić et al., 2017; Xu and Hu, 2018; Wu et al., 2019a; Kim et al., 2020; Ye et al., 2021; Wang et al., 2022). Table 1 gives an example of DST in a conversation, where the dialogue state is accumulated and updated after each turn.
∗Corresponding author
Usr: Hi I am looking for a restaurant in the north that serves Asian oriental food.
Sys: I would recommend Saigon city. Would you like to make a reservation?
Usr: That sounds great! We would like a reservation for Monday at 16:45 for 6 people. Can I get the reference number for our reservation?
Dialogue state: restaurant-area-north, restaurant-food-Asian, restaurant-name-Saigon city, restaurant-book day-Monday, restaurant-book time-16:45, restaurant-book people-6
Table 1: A dialogue example containing utterances from the user and system sides and the corresponding dialogue state (a set of domain-slot-value pairs).
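As a concrete view of the state in Table 1, the short Python snippet below represents the dialogue state as (domain, slot) -> value pairs and shows how turn-level predictions are accumulated into it; the dict layout and merge rule are illustrative assumptions, not a prescribed format.

# State after the first user turn of Table 1, as (domain, slot) -> value pairs.
state = {
    ("restaurant", "area"): "north",
    ("restaurant", "food"): "Asian",
}

# Turn-level predictions from the later turns are merged into the
# accumulated state (new pairs extend it; repeated slots are overwritten).
turn_update = {
    ("restaurant", "name"): "Saigon city",
    ("restaurant", "book day"): "Monday",
    ("restaurant", "book time"): "16:45",
    ("restaurant", "book people"): "6",
}
state.update(turn_update)
print(state)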
Training a DST model requires plenty of dialogue corpora containing dialogue utterances and human-annotated state labels, whereas such annotation is costly. Therefore, DST models are expected to achieve acceptable performance when trained with limited labeled data, i.e., in the few-shot cases (Wu et al., 2020b). Previous studies on few-shot DST address the data scarcity issue mainly by leveraging external labeled dialogue corpora to pre-train language models, which are then transferred to the DST task (Wu et al., 2020a; Su et al., 2022; Shin et al., 2022). However, this approach has several disadvantages: first, collecting a large amount of external labeled data is still laborious; second, utilizing the external data depends heavily on computational resources, since the language models have to be further pre-trained; third, the external data usually comes from different conversation scenarios and NLP tasks, such as multi-topic dialogues, question answering, and dialogue summarization. These data types and distributions differ from the DST-specific training data, making it less efficient to transfer the learned knowledge to DST.
We consider utilizing the unlabeled data of the DST task, which is easy to access and has similar content to the limited labeled data, so that the DST model can be enhanced by training on an enlarged data corpus. In this paper, we propose a few-shot DST framework called CSS, which Combines Self-training and Self-supervised learning.