Weakly Supervised Data Augmentation Through
Prompting for Dialogue Understanding
Maximillian Chen1, Alexandros Papangelis2, Chenyang Tao2, Andy Rosenbaum2,
Seokhwan Kim2, Yang Liu2, Zhou Yu1, Dilek Hakkani-Tur2
1Columbia University, 2Amazon Alexa AI
maxchen@cs.columbia.edu, zy2461@columbia.edu
{papangea, chenyt, andros, seokhwk, yangliud, hakkanit}@amazon.com
Abstract
Dialogue understanding tasks often require abundant annotated data to achieve
good performance, which presents a challenge in low-resource settings. To alleviate
this barrier, we explore few-shot data augmentation for dialogue understanding
by prompting large pre-trained language models and present a novel approach
that iterates on augmentation quality by applying weakly-supervised filters. We
evaluate our methods on the emotion and act classification tasks in DAILYDIALOG
and the intent classification task in FACEBOOK MULTILINGUAL TASK-ORIENTED
DIALOGUE. Models fine-tuned on our augmented data mixed with few-shot
ground truth data are able to approach or surpass existing full-shot state-of-the-art
performance on both datasets. For DAILYDIALOG specifically, using 10% of the
ground truth data we outperform the current state-of-the-art model which uses
100% of the data.
1 Introduction & Related Work
Most common ways of automatic data augmentation in natural language tasks include simple perturba-
tions [Wei and Zou, 2019, Karimi et al., 2021, Xie et al., 2020] and generative approaches [Kim et al.,
2021, Sahu et al., 2022, Edunov et al., 2018]. However, these methods do not utilize intersentential
context, which is essential to encode for both dialogue understanding and generation.
On the other hand, modern pre-trained language models (PLMs) can be prompted to complete
dialogues using prefix prompts [Liu et al., 2021], which naturally encode conversational context.
PLMs also have shown impressive zero- and few-shot capabilities [Brown et al., 2020, Bommasani
et al., 2021] in dialogue tasks and have been successfully used in generative augmentation frameworks
for tasks such as intent classification [Sahu et al., 2022, Li et al., 2021], commonsense reasoning [Yang
et al., 2020], and response generation [Kulhánek et al., 2021, Gao et al., 2020b]. Several studies
examine in-context learning, which involves including training examples as part of a prompt [Wei
et al., 2022, Min et al., 2022, Chen et al., 2022, Lu et al., 2022]. In this work, we take the first step
towards applying few-shot prompting to augmenting dialogue datasets. We focus on low-resource
settings1, contributing an empirical account of augmenting turn-level dialogue understanding tasks
using discrete prompting which encodes dialogue history as in-context examples.
One challenge with zero- and few-shot prompting with PLMs is that the outputs may exhibit more
diversity than one would expect for a specific task, which confounds model training [Perez et al.,
2021, Zhao et al., 2021]. Specifically, PLMs often synthesize data points which lie outside of the data
* Work done during internship at Amazon Alexa AI.
1 Both in terms of data and cost of computational resources.
NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research.
arXiv:2210.14169v3 [cs.CL] 2 Nov 2022
(neutral) Alice: You're going to set up your own law office, aren't you?
(neutral) Bob: Yes. After so many years of hard work, I'd rather I had an office of my own.
(happy) Alice: If you need help, don't hesitate to ask me.
(happy) Bob: I'll be very glad if you would help.
(happy) Alice: I'd like to wish you every success in your new venture.
(happy) Bob: Thank you. I wish I would.
(happy) Alice:
Generated Responses:
1. Good luck to you. Let's do lunch soon, Bob.
2. It's such a rare pleasure to meet such an ideal partner in your work.
3. You know, you seem quite different.
Figure 1: Example augmented conversation from DAILYDIALOG with a generated turn following the
desired emotion “happy”. WEAKDAP filters out generated turns which do not follow the label (red).
manifold2 of a given task, instead following the distribution of the generic pretraining corpora. Due
to their distance from the target task’s distribution, these augmented samples may be considered low
quality. We thus propose WEAKDAP (Weakly supervised Data Augmentation through Prompting), a
framework that iteratively improves the quality of augmented data in dialogue classification tasks
by introducing a weakly supervised labeler to filter prospective data points. Figure 1 demonstrates
WEAKDAP filtering out a low-quality synthetic utterance. We demonstrate the effectiveness of
WEAKDAP on emotion and dialogue act classification in DAILYDIALOG [Li et al., 2017], showing
on-par or better performance compared to state-of-the-art full-shot results by augmenting only 10%
of the original data. We additionally examine the robustness of WEAKDAP using a separate task:
cross-lingual augmentation for Spanish intent detection in FBTOD [Schuster et al., 2019].
2 Data Augmentation Methods
Our approach consists of two parts: prompting PLMs using dialogue context, and applying weak
supervision to refine prompt-augmented datasets.
2.1 Constructing Dialogue Prompts
Dialogue contexts can be used to form prefix prompts which serve as the input to a PLM3. We
augment the data by replacing dialogue turns, which are selected using the dialogue context
construction strategies below. We illustrate specific examples of each in Figure 2 and Section E in the
Appendix. Each generated utterance can be prescribed a randomly sampled or ground truth reference
label.
Conversation Trajectory Augmentation (CTA). We take each speaker’s first turn as ground-truth
context and iteratively replace the next turn with a generated utterance. We autoregressively use each
generated utterance as context to generate the next turn. Each ground truth conversation results in
one synthetic conversation with a new “trajectory”.
All-Turn Augmentation (ATA). ATA iteratively replaces each turn in the conversation with a generated
utterance, but uses the ground truth context instead of the generated context. For a conversation
with n turns, this results in n−1 “new” conversations of length 2 through n.
Last-Turn Augmentation (LTA). This is a special case of ATA where we simply choose the last turn
of the conversation to replace with a generated utterance. This results in the largest conversational
context, helping guide the conditional output closer to the ground truth language manifold. Relative
to a ground-truth conversation, this yields one new conversation with an alternate last turn. An
example is shown in Figure 1.
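The three strategies differ only in which turns are kept as ground-truth context and which are regenerated. A minimal sketch (in Python; the `generate` callable is a hypothetical stand-in for the prompted PLM):

```python
def cta(turns, generate):
    """Conversation Trajectory Augmentation: keep each speaker's first
    turn, then autoregressively replace every later turn, feeding each
    generated turn back in as context for the next one."""
    context = list(turns[:2])  # first turn of each speaker kept as gold
    for _ in turns[2:]:
        context.append(generate(context))
    return [context]  # one new "trajectory" per gold conversation

def ata(turns, generate):
    """All-Turn Augmentation: replace each turn in isolation, always
    conditioning on the ground-truth prefix. For n turns this yields
    n-1 conversations of length 2 through n."""
    return [turns[:i] + [generate(turns[:i])] for i in range(1, len(turns))]

def lta(turns, generate):
    """Last-Turn Augmentation: the special case of ATA that replaces
    only the final turn, i.e. the largest ground-truth context."""
    prefix = turns[:-1]
    return [prefix + [generate(prefix)]]
```

Note that LTA's output coincides with the last conversation ATA produces; it is treated separately because it maximizes the grounding context.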
2 Kim et al. [2021] hypothesizes that synthetic data must lie along the same natural language manifold as the
ground truth data, proposing linear interpolation among existing data.
3 While augmentation by prompting PLMs can help expand linguistic diversity, it can also introduce biases
which exist in PLMs’ pre-training corpora. Additionally, it may underline biases in the existing low-resource
data being augmented. We discuss this further in Appendix A.
Emotion Augmentation with GPT-J (Original Emotion)
Alice in a neutral mood: Oh you look awful! What's the matter?
Bob in a neutral mood: Oh! I feel really under the weather. I've got a sore throat and a bad cough.
Alice in a neutral mood: Oh dear. Maybe you've caught a cold.
Bob in a neutral mood: Yes, I've had lots of overtime to do recently and I haven't slept much at all.
Alice in a neutral mood: Well then, you should get some rest this weekend and don't go out drinking.
Bob in a neutral mood:
Result:
Thanks, but I can't afford to do that.

Emotion Augmentation with GPT-J (Swapped Emotion)
Alice in a neutral mood: Oh you look awful! What's the matter?
Bob in a neutral mood: Oh! I feel really under the weather. I've got a sore throat and a bad cough.
Alice in a neutral mood: Oh dear. Maybe you've caught a cold.
Bob in a neutral mood: Yes, I've had lots of overtime to do recently and I haven't slept much at all.
Alice in a neutral mood: Well then, you should get some rest this weekend and don't go out drinking.
Bob in a surprised mood:
Result:
What's that supposed to mean?

Figure 2: Example conversation augmentation prompt for emotion classification using GPT-J, prescribing
the original ground-truth emotion (left) and a randomly sampled emotion (right). This is
augmented using Last-Turn Augmentation, i.e., the first five turns are taken from the ground-truth data
and the model is asked to generate the sixth and final turn. Both boxes represent a new augmented
conversation when taken in aggregate.
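The prompt format above can be assembled mechanically from the dialogue history and per-turn emotion labels. A sketch of a possible rendering function (the function name and the `history` representation are our own; the "<Speaker> in a <emotion> mood:" template follows the figure):

```python
def build_emotion_prompt(history, target_speaker, target_emotion):
    """Render a Last-Turn Augmentation prefix prompt: each ground-truth
    turn becomes '<Speaker> in a <emotion> mood: <text>', followed by an
    unfinished line that cues the PLM to generate the target turn."""
    lines = [f"{spk} in a {emo} mood: {txt}" for spk, emo, txt in history]
    # The trailing incomplete line is the generation cue.
    lines.append(f"{target_speaker} in a {target_emotion} mood:")
    return "\n".join(lines)
```

Prescribing a swapped emotion (as in the right box of Figure 2) only changes `target_emotion`; the ground-truth context is untouched.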
2.2 Augmentation with Weak Supervision
While prompting large PLMs provides a convenient, powerful way to bridge the gap between
inadequate training data and data-hungry conversational models, there is a caveat: those PLMs are
trained on generic corpora (i.e., web crawls, books, etc.), whose distribution may considerably differ
from the data needed to train task-specific models (e.g., see Figure 4). This motivates post-hoc
adjustments to make our prompted augmentations more task-aware. Weak supervision has been
proposed for finding a “useful representation” for a task [Robinson et al., 2020]. Intuitively, naive
prompted augmentations are less potent because they lack task knowledge4, which can be distilled
from ground-truth (“gold”) samples by training an auxiliary model. We can then use that model to
filter out inconsistent generated utterances.
We propose WEAKDAP, a framework generalizable to any prompt-based augmentation task. In this
work, we prompt GPT-J 6B [Wang and Komatsuzaki, 2021] and the Alexa Teacher Model (ATM)
20B [Soltan et al., 2022]. As Figure 3 illustrates, WEAKDAP consists of three parts. We first
augment the “gold” data and train a task classifier on the gold and “silver” data. Then, we iteratively
re-augment the data and re-train the classifier. For the augmentation step on each iteration, we use
the classifier trained during the previous iteration to create a weak silver label for each generated
instance, and filter out instances where the silver label does not match the prescribed label with high
confidence, i.e., low entropy. We reason that data points which a weak labeler thinks are labeled
incorrectly with low confidence could still be useful for learning during training (further discussion
in Section G in the Appendix). Moreover, this indicates that their labels may in fact be correct. To
this end, we filter out incorrect instances classified in the bottom 80th percentile of entropy, computed
as in the equation below, where C is the number of classes and pi is the probability of class i.5
$\mathrm{Entropy} = -\sum_{i=1}^{C} p_i \log_2(p_i)$
This weakly guarantees that the generated data is not of low quality. This continues until the
classifier’s performance doesn’t improve by at least ε for k rounds. Here, we fix ε = 0.005, k = 3.
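The filtering step can be sketched with NumPy as below. The shapes and the choice to take the percentile over all generated instances (rather than over mismatches only) are our assumptions:

```python
import numpy as np

def weak_filter(probs, prescribed, keep_percentile=80):
    """Entropy-based weak filtering.

    probs: (N, C) class probabilities from the weak labeler
    prescribed: (N,) prescribed label ids
    Returns a boolean keep-mask of shape (N,): instances whose silver
    label matches the prescribed label are kept, and mismatches are kept
    only when the labeler's prediction entropy is high (it is unsure, so
    the prescribed label may still be correct)."""
    eps = 1e-12  # avoid log2(0)
    entropy = -np.sum(probs * np.log2(probs + eps), axis=1)
    silver = probs.argmax(axis=1)            # weak "silver" labels
    match = silver == np.asarray(prescribed)
    # Mismatches below the entropy cutoff (confident disagreement) are dropped.
    cutoff = np.percentile(entropy, keep_percentile)
    return match | (entropy >= cutoff)
```

Matching instances pass regardless of entropy; only confidently mislabeled generations are discarded.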
4 PLMs only see prompts during generation; to fully account for task knowledge one should include all
available examples in-context, which is generally impractical.
5 This threshold is tunable.
[Figure 3 diagram: a Prompt Function turns the Gold Data into prompts for the Augmentation Model, whose generated Silver Data is passed through Weak Filtering by the Weak Labeler; the Gold + Silver Data trains the Classifier, which replaces the Weak Labeler for the next iteration.]
Figure 3: The workflow of WEAKDAP. On each iteration, the Gold Data is augmented by replacing
conversation turns with utterances generated by providing a PLM with prefix prompts. Each prospective silver
training instance is weakly classified as either following its intended label or not, using a task-specific
classifier. The gold and silver data are used as training data for the next iteration’s classifier. This
process repeats until the performance of the classifier does not improve past a threshold.
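Putting the pieces together, the iteration in Figure 3 might be sketched as follows; `augment`, `train`, `evaluate`, and `filter_with` are hypothetical stand-ins for the PLM prompting step, classifier training, dev-set evaluation, and the entropy filter:

```python
def weakdap(gold, augment, train, evaluate, filter_with, epsilon=0.005, k=3):
    """Iterate augment -> weak-filter -> retrain until dev performance
    fails to improve by at least epsilon for k consecutive rounds."""
    silver = augment(gold)                  # first round: no labeler to filter with yet
    classifier = train(gold + silver)
    best, stale = evaluate(classifier), 0
    while stale < k:
        silver = filter_with(classifier, augment(gold))  # weak filtering
        classifier = train(gold + silver)                # retrain on gold + silver
        score = evaluate(classifier)
        if score >= best + epsilon:
            best, stale = score, 0
        else:
            stale += 1
    return classifier
```

Only the task classifier is retrained each round; the generator is never fine-tuned.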
Other Task-Aware Augmentation Approaches. Similar task-aware generative augmentation approaches
typically distill task knowledge into the generator. Yang et al. [2020] proposes augmentation
for commonsense reasoning by fine-tuning two generators (for answering and distracting) and relabelling
synthetic data points using a task model, while Papangelis et al. [2021] fine-tunes a generator
using reinforcement learning. With large PLMs, these methods are costly and less practical. While
few-shot prompting is a cheaper solution, it is less effective at encoding large amounts of task knowledge,
as in-context example capacity is limited. WEAKDAP bridges the gap between prompt-based augmentation
with little task knowledge and complex mechanisms with higher computational costs; it does
not need to fine-tune the generator, as we prompt it using dialogue context as in-context utterance
examples.
3 Experiments
We benchmark various augmentation methods on the classification tasks in DAILYDIALOG, a high-
quality open-domain dialogue dataset, and the intent detection task of FBTOD, a task-oriented
dialogue dataset (dataset details in Figure C).
3.1 DAILYDIALOG Emotion Classification
We first conduct a thorough evaluation of our augmentation methods using the emotion classification
task in DAILYDIALOG as a case study, in the full and few-shot settings6. For our augmentation
model, we use GPT-J 6B7 [Wang and Komatsuzaki, 2021], which is one of the largest causal language
models publicly available and has achieved performance competitive with GPT-3 on many
tasks [Wang, 2021, Black et al., 2022]. For all DAILYDIALOG experiments we use the Speaker
Turn Model (STM) [He et al., 2021], a RoBERTa [Liu et al., 2019]-based classification model with
speaker turn awareness8, as the classification task model and weak labeler.
There are seven emotion labels: neutral, anger, disgust, fear, happiness, sadness, and surprise.
Each label is a rich, descriptive token on its own, so in constructing a prompt, we directly use it
as an adjective (e.g., “Alice in a happy mood:”). Additionally, we conjecture that directly using
conversation history forms the best set of in-context examples to generate utterances which convey
6 We randomly sample 1%, 5%, and 10% of the data.
7 We examined OPT-30B [Zhang et al., 2022], but it was far slower without large performance improvements.
8 STM achieves state-of-the-art performance on full-shot DAILYDIALOG act classification (87.5% accuracy).