Weakly Supervised Data Augmentation Through
Prompting for Dialogue Understanding
Maximillian Chen1, Alexandros Papangelis2, Chenyang Tao2, Andy Rosenbaum2,
Seokhwan Kim2, Yang Liu2, Zhou Yu1, Dilek Hakkani-Tur2
1Columbia University, 2Amazon Alexa AI
maxchen@cs.columbia.edu, zy2461@columbia.edu
{papangea, chenyt, andros, seokhwk, yangliud, hakkanit}@amazon.com
Abstract
Dialogue understanding tasks often require abundant annotated data to achieve
good performance, which presents a challenge in low-resource settings. To alleviate
this barrier, we explore few-shot data augmentation for dialogue understanding
by prompting large pre-trained language models and present a novel approach
that iterates on augmentation quality by applying weakly-supervised filters. We
evaluate our methods on the emotion and act classification tasks in DAILYDIALOG
and the intent classification task in FACEBOOK MULTILINGUAL TASK-ORIENTED
DIALOGUE. Models fine-tuned on our augmented data mixed with few-shot
ground truth data are able to approach or surpass existing full-shot state-of-the-art
performance on both datasets. For DAILYDIALOG specifically, using 10% of the
ground truth data we outperform the current state-of-the-art model which uses
100% of the data.
1 Introduction & Related Work
Most common ways of automatic data augmentation in natural language tasks include simple perturba-
tions [Wei and Zou, 2019, Karimi et al., 2021, Xie et al., 2020] and generative approaches [Kim et al.,
2021, Sahu et al., 2022, Edunov et al., 2018]. However, these methods do not utilize intersentential
context, which is essential to encode for both dialogue understanding and generation.
On the other hand, modern pre-trained language models (PLMs) can be prompted to complete
dialogues using prefix prompts [Liu et al., 2021], which naturally encode conversational context.
PLMs also have shown impressive zero- and few-shot capabilities [Brown et al., 2020, Bommasani
et al., 2021] in dialogue tasks and have been successfully used in generative augmentation frameworks
for tasks such as intent classification [Sahu et al., 2022, Li et al., 2021], commonsense reasoning [Yang
et al., 2020], and response generation [Kulhánek et al., 2021, Gao et al., 2020b]. Several studies
examine in-context learning, which involves including training examples as part of a prompt [Wei
et al., 2022, Min et al., 2022, Chen et al., 2022, Lu et al., 2022]. In this work, we take the first step
towards applying few-shot prompting to augmenting dialogue datasets. We focus on low-resource
settings1, contributing an empirical account of augmenting turn-level dialogue understanding tasks
using discrete prompting which encodes dialogue history as in-context examples.
One challenge with zero- and few-shot prompting with PLMs is that the outputs may exhibit more
diversity than one would expect for a specific task, which confounds model training [Perez et al.,
2021, Zhao et al., 2021]. Specifically, PLMs often synthesize data points which lie outside of the data
* Work done during internship at Amazon Alexa AI.
1 Both in terms of data and cost of computational resources.
NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research.
arXiv:2210.14169v3 [cs.CL] 2 Nov 2022
(neutral) Alice: You're going to set up your own law office, aren't you?
(neutral) Bob: Yes. After so many years of hard work, I'd rather I had an office of my own.
(happy) Alice: If you need help, don't hesitate to ask me.
(happy) Bob: I'll be very glad if you would help.
(happy) Alice: I'd like to wish you every success in your new venture.
(happy) Bob: Thank you. I wish I would.
(happy) Alice:
Generated Responses:
1. Good luck to you. Let's do lunch soon, Bob.
2. It's such a rare pleasure to meet such an ideal partner in your work.
3. You know, you seem quite different.
Figure 1: Example augmented conversation from DAILYDIALOG with a generated turn following the
desired emotion “happy”. WEAKDAP filters out generated turns which do not follow the label (red).
manifold2 of a given task, instead following the distribution of the generic pretraining corpora. Due
to their distance from the target task’s distribution, these augmented samples may be considered low
quality. We thus propose WEAKDAP (Weakly supervised Data Augmentation through Prompting), a
framework that iteratively improves the quality of augmented data in dialogue classification tasks
by introducing a weakly supervised labeler to filter prospective data points. Figure 1 demonstrates
WEAKDAP filtering out a low-quality synthetic utterance. We demonstrate the effectiveness of
WEAKDAP on emotion and dialogue act classification in DAILYDIALOG [Li et al., 2017], showing
on-par or better performance compared to state-of-the-art full-shot results by augmenting only 10%
of the original data. We additionally examine the robustness of WEAKDAP using a separate task:
cross-lingual augmentation for Spanish intent detection in FBTOD [Schuster et al., 2019].
2 Data Augmentation Methods
Our approach consists of two parts: prompting PLMs using dialogue context, and applying weak
supervision to refine prompt-augmented datasets.
2.1 Constructing Dialogue Prompts
Dialogue contexts can be used to form prefix prompts which serve as the input to a PLM3. We
augment the data by replacing dialogue turns, which are selected using the dialogue context
construction strategies below. We illustrate specific examples of each in Figure 2 and Section E in the
Appendix. Each generated utterance can be prescribed a randomly sampled or ground truth reference
label.
Conversation Trajectory Augmentation (CTA). We take each speaker’s first turn as ground-truth
context and iteratively replace the next turn with a generated utterance. We autoregressively use each
generated utterance as context to generate the next turn. Each ground truth conversation results in
one synthetic conversation with a new “trajectory”.
All-Turn Augmentation (ATA). ATA iteratively replaces each turn in the conversation with a generated
utterance, but uses the ground truth context instead of the generated context. For a conversation
with n turns, this results in n−1 “new” conversations of length 2 through n.
Last-Turn Augmentation (LTA). This is a special case of ATA where we simply choose the last turn
of the conversation to replace with a generated utterance. This results in the largest conversational
context, helping guide the conditional output closer to the ground truth language manifold. Relative
to a ground-truth conversation, this yields one new conversation with an alternate last turn. An
example is shown in Figure 1.
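The three strategies differ only in which turns are kept as ground-truth context and which are regenerated. A minimal sketch (in Python; the `generate` callable is a hypothetical stand-in for the prompted PLM):

```python
def cta(turns, generate):
    """Conversation Trajectory Augmentation: keep each speaker's first
    turn, then autoregressively replace every later turn, feeding each
    generated turn back in as context for the next one."""
    context = list(turns[:2])  # first turn of each speaker kept as gold
    for _ in turns[2:]:
        context.append(generate(context))
    return [context]  # one new "trajectory" per gold conversation

def ata(turns, generate):
    """All-Turn Augmentation: replace each turn in isolation, always
    conditioning on the ground-truth prefix. For n turns this yields
    n-1 conversations of length 2 through n."""
    return [turns[:i] + [generate(turns[:i])] for i in range(1, len(turns))]

def lta(turns, generate):
    """Last-Turn Augmentation: the special case of ATA that replaces
    only the final turn, i.e. the largest ground-truth context."""
    prefix = turns[:-1]
    return [prefix + [generate(prefix)]]
```

Note that LTA's output coincides with the last conversation ATA produces; it is treated separately because it maximizes the grounding context.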
2 Kim et al. [2021] hypothesizes that synthetic data must lie along the same natural language manifold as the
ground truth data, proposing linear interpolation among existing data.
3 While augmentation by prompting PLMs can help expand linguistic diversity, it can also introduce biases
which exist in PLMs’ pre-training corpora. Additionally, it may underline biases in the existing low-resource
data being augmented. We discuss this further in Appendix A.
Emotion Augmentation with GPT-J (Original Emotion)
Alice in a neutral mood: Oh you look awful! What's the matter?
Bob in a neutral mood: Oh! I feel really under the weather. I've got a sore throat and a bad cough.
Alice in a neutral mood: Oh dear. Maybe you've caught a cold.
Bob in a neutral mood: Yes, I've had lots of overtime to do recently and I haven't slept much at all.
Alice in a neutral mood: Well then, you should get some rest this weekend and don't go out drinking.
Bob in a neutral mood:
Result:
Thanks, but I can't afford to do that.

Emotion Augmentation with GPT-J (Swapped Emotion)
Alice in a neutral mood: Oh you look awful! What's the matter?
Bob in a neutral mood: Oh! I feel really under the weather. I've got a sore throat and a bad cough.
Alice in a neutral mood: Oh dear. Maybe you've caught a cold.
Bob in a neutral mood: Yes, I've had lots of overtime to do recently and I haven't slept much at all.
Alice in a neutral mood: Well then, you should get some rest this weekend and don't go out drinking.
Bob in a surprised mood:
Result:
What's that supposed to mean?

Figure 2: Example conversation augmentation prompt for emotion classification using GPT-J, prescribing
the original ground-truth emotion (left) and a randomly sampled emotion (right). This is
augmented using Last-Turn Augmentation, i.e., the first five turns are taken from the ground-truth data
and the model is asked to generate the sixth and final turn. Both boxes represent a new augmented
conversation when taken in aggregate.
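The prompt format above can be assembled mechanically from the dialogue history and per-turn emotion labels. A sketch of a possible rendering function (the function name and the `history` representation are our own; the "<Speaker> in a <emotion> mood:" template follows the figure):

```python
def build_emotion_prompt(history, target_speaker, target_emotion):
    """Render a Last-Turn Augmentation prefix prompt: each ground-truth
    turn becomes '<Speaker> in a <emotion> mood: <text>', followed by an
    unfinished line that cues the PLM to generate the target turn."""
    lines = [f"{spk} in a {emo} mood: {txt}" for spk, emo, txt in history]
    # The trailing incomplete line is the generation cue.
    lines.append(f"{target_speaker} in a {target_emotion} mood:")
    return "\n".join(lines)
```

Prescribing a swapped emotion (as in the right box of Figure 2) only changes `target_emotion`; the ground-truth context is untouched.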
2.2 Augmentation with Weak Supervision
While prompting large PLMs provides a convenient, powerful way to bridge the gap between
inadequate training data and data-hungry conversational models, there is a caveat: those PLMs are
trained on generic corpora (i.e., web crawls, books, etc.), whose distribution may considerably differ
from the data needed to train task-specific models (e.g., see Figure 4). This motivates post-hoc
adjustments to make our prompted augmentations more task-aware. Weak supervision has been
proposed for finding a “useful representation” for a task [Robinson et al., 2020]. Intuitively, naive
prompted augmentations are less potent because they lack task knowledge4, which can be distilled
from ground-truth (“gold”) samples by training an auxiliary model. We can then use that model to
filter out inconsistent generated utterances.
We propose WEAKDAP, a framework generalizable to any prompt-based augmentation task. In this
work, we prompt GPT-J 6B [Wang and Komatsuzaki, 2021] and the Alexa Teacher Model (ATM)
20B [Soltan et al., 2022]. As Figure 3 illustrates, WEAKDAP consists of three parts. We first
augment the “gold” data and train a task classifier on the gold and “silver” data. Then, we iteratively
re-augment the data and re-train the classifier. For the augmentation step on each iteration, we use
the classifier trained during the previous iteration to create a weak silver label for each generated
instance, and filter out instances where the silver label does not match the prescribed label with high
confidence, i.e., low entropy. We reason that data points which a weak labeler thinks are labeled
incorrectly with low confidence could still be useful for learning during training (further discussion
in Section G in the Appendix). Moreover, this indicates that their labels may in fact be correct. To
this end, we filter out incorrect instances classified in the bottom 80th percentile of entropy, computed
as in the equation below, where C is the number of classes and pi is the probability of class i.5
$\mathrm{Entropy} = -\sum_{i=1}^{C} p_i \log_2(p_i)$
This weakly guarantees that the generated data is not of low quality. This continues until the
classifier’s performance doesn’t improve by at least ε for k rounds. Here, we fix ε = 0.005, k = 3.
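The filtering step can be sketched with NumPy as below. The shapes and the choice to take the percentile over all generated instances (rather than over mismatches only) are our assumptions:

```python
import numpy as np

def weak_filter(probs, prescribed, keep_percentile=80):
    """Entropy-based weak filtering.

    probs: (N, C) class probabilities from the weak labeler
    prescribed: (N,) prescribed label ids
    Returns a boolean keep-mask of shape (N,): instances whose silver
    label matches the prescribed label are kept, and mismatches are kept
    only when the labeler's prediction entropy is high (it is unsure, so
    the prescribed label may still be correct)."""
    eps = 1e-12  # avoid log2(0)
    entropy = -np.sum(probs * np.log2(probs + eps), axis=1)
    silver = probs.argmax(axis=1)            # weak "silver" labels
    match = silver == np.asarray(prescribed)
    # Mismatches below the entropy cutoff (confident disagreement) are dropped.
    cutoff = np.percentile(entropy, keep_percentile)
    return match | (entropy >= cutoff)
```

Matching instances pass regardless of entropy; only confidently mislabeled generations are discarded.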
4 PLMs only see prompts during generation; to fully account for task knowledge one should include all
available examples in-context, which is generally impractical.
5 This threshold is tunable.
[Figure 3 diagram: a Prompt Function turns the Gold Data into prompts for the Augmentation Model, whose generated Silver Data is passed through Weak Filtering by the Weak Labeler; the Gold + Silver Data trains the Classifier, which replaces the Weak Labeler for the next iteration.]
Figure 3: The workflow of WEAKDAP. On each iteration, the Gold Data is augmented by replacing
conversation turns with utterances generated by providing a PLM with prefix prompts. Each prospective silver
training instance is weakly classified as either following its intended label or not, using a task-specific
classifier. The gold and silver data are used as training data for the next iteration’s classifier. This
process repeats until the performance of the classifier does not improve past a threshold.
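Putting the pieces together, the iteration in Figure 3 might be sketched as follows; `augment`, `train`, `evaluate`, and `filter_with` are hypothetical stand-ins for the PLM prompting step, classifier training, dev-set evaluation, and the entropy filter:

```python
def weakdap(gold, augment, train, evaluate, filter_with, epsilon=0.005, k=3):
    """Iterate augment -> weak-filter -> retrain until dev performance
    fails to improve by at least epsilon for k consecutive rounds."""
    silver = augment(gold)                  # first round: no labeler to filter with yet
    classifier = train(gold + silver)
    best, stale = evaluate(classifier), 0
    while stale < k:
        silver = filter_with(classifier, augment(gold))  # weak filtering
        classifier = train(gold + silver)                # retrain on gold + silver
        score = evaluate(classifier)
        if score >= best + epsilon:
            best, stale = score, 0
        else:
            stale += 1
    return classifier
```

Only the task classifier is retrained each round; the generator is never fine-tuned.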
Other Task-Aware Augmentation Approaches. Similar task-aware generative augmentation approaches
typically distill task knowledge into the generator. Yang et al. [2020] proposes augmentation
for commonsense reasoning by fine-tuning two generators (for answering and distracting) and relabelling
synthetic data points using a task model, while Papangelis et al. [2021] fine-tunes a generator
using reinforcement learning. With large PLMs, these methods are costly and less practical. While
few-shot prompting is a cheaper solution, it is less effective at encoding large amounts of task knowledge,
as in-context example capacity is limited. WEAKDAP bridges the gap between prompt-based augmentation
with little task knowledge and complex mechanisms with higher computational costs; it does
not need to fine-tune the generator, as we prompt it using dialogue context as in-context utterance
examples.
3 Experiments
We benchmark various augmentation methods on the classification tasks in DAILYDIALOG, a high-
quality open-domain dialogue dataset, and the intent detection task of FBTOD, a task-oriented
dialogue dataset (dataset details in Figure C).
3.1 DAILYDIALOG Emotion Classification
We first conduct a thorough evaluation of our augmentation methods using the emotion classification
task in DAILYDIALOG as a case study, in the full and few-shot settings6. For our augmentation
model, we use GPT-J 6B7 [Wang and Komatsuzaki, 2021], which is one of the largest causal language
models publicly available and has achieved performance competitive with GPT-3 on many
tasks [Wang, 2021, Black et al., 2022]. For all DAILYDIALOG experiments we use the Speaker
Turn Model (STM) [He et al., 2021], a RoBERTa [Liu et al., 2019]-based classification model with
speaker turn awareness8, as the classification task model and weak labeler.
There are seven emotion labels: neutral, anger, disgust, fear, happiness, sadness, and surprise.
Each label is a rich, descriptive token on its own, so in constructing a prompt, we directly use it
as an adjective (e.g., “Alice in a happy mood:”). Additionally, we conjecture that directly using
conversation history forms the best set of in-context examples to generate utterances which convey
6 We randomly sample 1%, 5%, and 10% of the data.
7 We examined OPT-30B [Zhang et al., 2022], but it was far slower without large performance improvements.
8 STM achieves state-of-the-art performance on full-shot DAILYDIALOG act classification (87.5% accuracy).