Dictionary-Assisted Supervised Contrastive Learning
Patrick Y. Wu1, Richard Bonneau1,2,4,5, Joshua A. Tucker1,2,3, and Jonathan Nagler1,2,3
1Center for Social Media and Politics, New York University
2Center for Data Science, New York University
3Department of Politics, New York University
4Department of Biology, New York University
5Courant Institute of Mathematical Sciences, New York University
{pyw230, bonneau, joshua.tucker, jonathan.nagler}@nyu.edu
Abstract
Text analysis in the social sciences often involves using specialized dictionaries to reason with abstract concepts, such as perceptions about the economy or abuse on social media. These dictionaries allow researchers to impart domain knowledge and note subtle usages of words relating to a concept(s) of interest. We introduce the dictionary-assisted supervised contrastive learning (DASCL) objective, allowing researchers to leverage specialized dictionaries when fine-tuning pretrained language models. The text is first keyword simplified: a common, fixed token replaces any word in the corpus that appears in the dictionary(ies) relevant to the concept of interest. During fine-tuning, a supervised contrastive objective draws closer the embeddings of the original and keyword-simplified texts of the same class while pushing further apart the embeddings of different classes. The keyword-simplified texts of the same class are more textually similar than their original text counterparts, which additionally draws the embeddings of the same class closer together. Combining DASCL and cross-entropy improves classification performance metrics in few-shot learning settings and social science applications compared to using cross-entropy alone and alternative contrastive and data augmentation methods.¹

¹ Our code is available at https://github.com/SMAPPNYU/DASCL.
1 Introduction
We propose a supervised contrastive learning approach that allows researchers to incorporate dictionaries of words related to a concept of interest when fine-tuning pretrained language models. It is conceptually simple, requires low computational resources, and is usable with most pretrained language models.
Dictionaries contain words that hint at the sentiment, stance, or perception of a document (see, e.g., Fei et al., 2012). Social science experts often craft these dictionaries, making them useful when the underlying concept of interest is abstract (see, e.g., Brady et al., 2017; Young and Soroka, 2012). Dictionaries are also useful when specific words that are pivotal to determining the classification of a document may not exist in the training data. This is a particularly salient issue with small corpora, which is often the case in the social sciences.
However, recent supervised machine learning approaches do not use these dictionaries. We propose a contrastive learning approach, dictionary-assisted supervised contrastive learning (DASCL), that allows researchers to leverage these expert-crafted dictionaries when fine-tuning pretrained language models. We replace all the words in the corpus that belong to a specific lexicon with a fixed, common token. When using an appropriate dictionary, keyword simplification increases the textual similarity of documents in the same class. We then use a supervised contrastive objective to draw together text embeddings of the same class and push further apart the text embeddings of different classes (Khosla et al., 2020; Gunel et al., 2021). Figure 1 visualizes the intuition of our proposed method.
The contributions of this project are as follows.

• We propose keyword simplification, detailed in Section 3.1, to make documents of the same class more textually similar.

• We outline a supervised contrastive loss function, described in Section 3.2, that learns patterns within and across the original and keyword-simplified texts.
• We find classification performance improvements in few-shot learning settings and social science applications compared to two strong baselines: (1) RoBERTa (Liu et al., 2019) / BERT (Devlin et al., 2019) fine-tuned with cross-entropy loss, and (2) the supervised contrastive learning approach detailed in Gunel et al. (2021), the most closely related approach to DASCL. To be clear, although BERT and RoBERTa are not state-of-the-art pretrained language models, DASCL can augment the loss functions of state-of-the-art pretrained language models.

Figure 1: The blue dots are embeddings of the original reviews and the white dots are the embeddings of the keyword-simplified reviews from the SST-2 dataset (Wang et al., 2018). Both reviews are positive, although they do not overlap in any positive words used. The reviews are more textually similar after keyword simplification. Using BERT-base-uncased out-of-the-box, the cosine similarity between the original reviews is .654 and the cosine similarity between the keyword-simplified reviews is .842. Although there are some issues with using cosine similarity with BERT embeddings (see, e.g., Ethayarajh, 2019; Zhou et al., 2022), we use it as a rough heuristic here.
2 Related Work
Use of Pretrained Language Models in the Social Sciences. Transformers-based pretrained language models have become the de facto approach when classifying text data (see, e.g., Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020), and are seeing wider adoption in the social sciences. Terechshenko et al. (2021) show that RoBERTa and XLNet (Yang et al., 2019) outperform bag-of-words approaches for political science text classification tasks. Ballard et al. (2022) use BERTweet (Nguyen et al., 2020) to classify tweets expressing polarizing rhetoric. Lai et al. (2022) use BERT to classify the political ideologies of YouTube videos using text video metadata. DASCL can be used with most pretrained language models, so it can potentially improve results across a range of social science research.
Usage of Dictionaries. Dictionaries play an important role in understanding the meaning behind text in the social sciences. Brady et al. (2017) use a moral and emotional dictionary to predict whether tweets using these types of terms increase their diffusion within and between ideological groups. Simchon et al. (2022) create a dictionary of politically polarized language and analyze how trolls use this language on social media. Hopkins et al. (2017) use dictionaries of positive and negative economic terms to understand perceptions of the economy in newspaper articles. Although dictionary-based classification has fallen out of favor, dictionaries still contain valuable information about usages of specific or subtle language.
Text Data Augmentation. Text data augmentation techniques include backtranslation (Sennrich et al., 2016) and rule-based data augmentations such as random synonym replacements, random insertions, random swaps, and random deletions (Wei and Zou, 2019; Karimi et al., 2021). Shorten et al. (2021) survey text data augmentation techniques. Longpre et al. (2020) find that task-agnostic data augmentations typically do not improve the classification performance of pretrained language models. We choose dictionaries for keyword simplification based on the concept of interest underlying the classification task and use the keyword-simplified text with a contrastive loss function.
Contrastive Learning. Most works on contrastive learning have focused on self-supervised contrastive learning. In computer vision, images and their augmentations are treated as positives and other images as negatives. Recent contrastive learning approaches match or outperform their supervised pretrained image model counterparts, often using a small fraction of available annotated data (see, e.g., Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; Grill et al., 2020). Self-supervised contrastive learning has also been used in natural language processing, matching or outperforming pretrained language models on benchmark tasks (see, e.g., Fang et al., 2020; Klein and Nabi, 2020).

Our approach is most closely related to works on supervised contrastive learning. Wen et al. (2016) propose a loss function called center loss that minimizes the intraclass distances of the convolutional neural network features. Khosla et al. (2020) develop a supervised loss function that generalizes NT-Xent (Chen et al., 2020a) to an arbitrary number of positives. Our work is closest to that of Gunel et al. (2021), who also use a version of NT-Xent extended to an arbitrary number of positives with pretrained language models. Their supervised contrastive loss function is detailed in Section A.1.
Figure 2: Overview of the proposed method. Although RoBERTa is shown, any pretrained language model will work with this approach. The two RoBERTa networks share the same weights. The dimension of the projection layer is arbitrary.
3 Method
The approach consists of keyword simplification
and the contrastive objective function. Figure 2
shows an overview of the proposed framework.
3.1 Keyword Simplification
The first step of the DASCL framework is keyword simplification. We select a set of M dictionaries D. For each dictionary d_i ∈ D, i ∈ {1, ..., M}, we assign a token t_i. Then, we iterate through the corpus and replace any word w_j in dictionary d_i with the token t_i. We repeat these steps for each dictionary. For example, if we have a dictionary of positive words, then applying keyword simplification to

    a wonderfully warm human drama that remains vividly in memory long after viewing

would yield

    a <positive> <positive> human drama that remains <positive> in memory long after viewing
There are many off-the-shelf dictionaries that can be used during keyword simplification. Table 4 in Section A.2 contains a sample of dictionaries reflecting various potential concepts of interest.
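To make the procedure concrete, here is a minimal sketch of keyword simplification in Python. The function name, the word-matching rule, and the toy dictionary are our own illustration; the released DASCL code may handle tokenization, casing, and wildcard dictionary entries differently.

```python
import re


def keyword_simplify(text: str, dictionaries: dict[str, set[str]]) -> str:
    """Replace every word found in a dictionary with that dictionary's fixed token.

    `dictionaries` maps a replacement token t_i to its lexicon d_i,
    e.g. {"<positive>": {"wonderfully", "warm", ...}}.
    """
    simplified = []
    for raw_token in text.split():
        word = re.sub(r"\W+", "", raw_token).lower()  # strip punctuation for matching
        replacement = raw_token
        for fixed_token, lexicon in dictionaries.items():
            if word in lexicon:
                replacement = fixed_token
                break
        simplified.append(replacement)
    return " ".join(simplified)


# Toy positive-word dictionary reproducing the example above:
positive_words = {"wonderfully", "warm", "vividly"}
print(keyword_simplify(
    "a wonderfully warm human drama that remains vividly in memory long after viewing",
    {"<positive>": positive_words},
))
# -> a <positive> <positive> human drama that remains <positive> in memory long after viewing
```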
3.2 Dictionary-Assisted Supervised
Contrastive Learning (DASCL) Objective
The dictionary-assisted supervised contrastive learning loss function resembles the loss functions from Khosla et al. (2020) and Gunel et al. (2021). Consistent with Khosla et al. (2020), we project the final hidden layer of the pretrained language model to an embedding of a lower dimension before using the contrastive loss function.
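For concreteness, a minimal sketch of this projection step, assuming a single linear projection of the classifier-token embedding. The module name, the 768-dimensional RoBERTa-base hidden size, and the 128-dimensional projection are illustrative (per Figure 2, the projection dimension is arbitrary), and the authors' projection head may differ.

```python
import torch
import torch.nn.functional as F
from torch import nn


class ProjectionHead(nn.Module):
    """Projects the encoder's classifier-token embedding to a lower-dimensional,
    L2-normalized vector (the Psi(.) used in the notation below); this output
    feeds only the contrastive loss, not the classifier."""

    def __init__(self, hidden_dim: int = 768, proj_dim: int = 128):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, proj_dim)

    def forward(self, cls_embedding: torch.Tensor) -> torch.Tensor:
        # L2-normalize so that dot products behave like cosine similarities
        return F.normalize(self.linear(cls_embedding), p=2, dim=-1)
```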
Let Ψ(x_i), i ∈ {1, ..., N}, be the L2-normalized projection of the output of the pretrained language encoder for the original text and Ψ(x_{i+N}) be the corresponding L2-normalized projection of the output for the keyword-simplified text. τ > 0 is the temperature parameter that controls the separation of the classes, and λ ∈ [0, 1] is the parameter that balances the cross-entropy and the DASCL loss functions. We choose λ and directly optimize τ during training. In our experiments, we use the classifier token as the output of the pretrained language encoder. Equation 1 is the DASCL loss, Equation 2 is the multiclass cross-entropy loss, and Equation 3 is the overall loss that is optimized when fine-tuning the pretrained language model. The original text and the keyword-simplified text are used with the DASCL loss (Eq. 1); only the original text is used with the cross-entropy loss. The keyword-simplified text is not used during inference.
\mathcal{L}_{\mathrm{DASCL}} = \frac{1}{2N} \sum_{i=1}^{2N} \frac{-1}{2N_{y_i} - 1} \sum_{\substack{j=1 \\ j \neq i,\; y_j = y_i}}^{2N} \log \left[ \frac{\exp\left(\Psi(x_i) \cdot \Psi(x_j) / \tau\right)}{\sum_{k=1,\, k \neq i}^{2N} \exp\left(\Psi(x_i) \cdot \Psi(x_k) / \tau\right)} \right]   (1)

where N_{y_i} is the number of original-text examples in the minibatch with the same label as x_i, so each anchor has 2N_{y_i} - 1 positives.

\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=0}^{C} y_{i,c} \log \hat{y}_{i,c}   (2)

\mathcal{L} = (1 - \lambda)\, \mathcal{L}_{\mathrm{CE}} + \lambda\, \mathcal{L}_{\mathrm{DASCL}}   (3)
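Below is a minimal PyTorch sketch of Equations 1-3. It assumes `z` stacks the L2-normalized projections of the N original texts followed by their N keyword-simplified counterparts (e.g., from the projection head sketched above) and that `logits` come from the classifier head applied to the original texts only; function and variable names are ours rather than the released implementation's.

```python
import torch
import torch.nn.functional as F


def dascl_loss(z: torch.Tensor, labels: torch.Tensor, temperature: float) -> torch.Tensor:
    """Eq. 1: supervised contrastive loss over the 2N projections `z` with class `labels`."""
    two_n = z.size(0)
    self_mask = torch.eye(two_n, dtype=torch.bool, device=z.device)
    sim = (z @ z.T) / temperature                        # Psi(x_i) . Psi(x_k) / tau for all pairs
    sim = sim.masked_fill(self_mask, float("-inf"))      # drop k = i from the denominator
    log_prob = F.log_softmax(sim, dim=1)                 # log of the bracketed ratio in Eq. 1
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    n_pos = pos_mask.sum(dim=1).clamp(min=1)             # 2N_{y_i} - 1 positives per anchor
    per_anchor = -torch.where(pos_mask, log_prob, torch.zeros_like(log_prob)).sum(dim=1) / n_pos
    return per_anchor.mean()                             # the 1/(2N) average over anchors


def overall_loss(logits, labels, z, lam, temperature):
    """Eq. 3: (1 - lambda) * cross-entropy (original text only, Eq. 2) + lambda * DASCL."""
    ce = F.cross_entropy(logits, labels)
    dascl = dascl_loss(z, torch.cat([labels, labels]), temperature)
    return (1.0 - lam) * ce + lam * dascl
```

Since the paper optimizes τ directly during training, the temperature could be registered as a learnable parameter (for instance by learning log τ so that τ stays positive); that detail, like the names above, is an assumption rather than a documented choice.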
4 Experiments
4.1 Few-Shot Learning with SST-2
SST-2, a GLUE benchmark dataset (Wang et al., 2018), consists of sentences from movie reviews and binary labels of sentiment (positive or negative). Similar to Gunel et al. (2021), we experiment with SST-2 with three training set sizes: N = 20, 100, and 1,000. Accuracy is this benchmark's primary metric of interest; we also report average precision. We use RoBERTa-base as the pretrained language model. For keyword simplification, we use the opinion lexicon (Hu and Liu, 2004), which contains dictionaries of positive and negative words. Section A.3.3 further describes these dictionaries.
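As a point of reference for the two reported metrics, a generic scikit-learn sketch (not the authors' evaluation code): average precision summarizes the precision-recall curve and is computed from the model's positive-class scores rather than hard predictions.

```python
from sklearn.metrics import accuracy_score, average_precision_score

# y_true: gold binary labels; y_prob: predicted probability of the positive class
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1]
y_pred = [int(p >= 0.5) for p in y_prob]

accuracy = accuracy_score(y_true, y_pred)                # fraction of correct hard predictions
avg_precision = average_precision_score(y_true, y_prob)  # area under the precision-recall curve
print(accuracy, avg_precision)
```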
We compare DASCL to two other baselines: RoBERTa-base using the cross-entropy (CE) loss function and the combination of the cross-entropy and supervised contrastive learning (SCL) loss functions used in Gunel et al. (2021). We also experiment with augmenting the corpus with the keyword-simplified text (referred to as "data augmentation," or "DA," in results tables). In other words, when data augmentation is used, both the original text and the keyword-simplified text are used with the cross-entropy loss.

We use the original validation set from the GLUE benchmark as the test set, and we sample our own validation set from the training set of equal size to this test set. Further details about the data and hyperparameter configurations can be found in Section A.3. Table 1 shows the results across the three training set configurations.

Loss             N     Accuracy      Avg. Precision
CE               20    .675 ± .066   .791 ± .056
CE w/ DA         20    .650 ± .051   .748 ± .050
CE+SCL           20    .709 ± .077   .826 ± .068
CE+DASCL         20    .777 ± .024   .871 ± .014
CE+DASCL w/ DA   20    .697 ± .075   .796 ± .064
CE               100   .822 ± .019   .897 ± .023
CE w/ DA         100   .831 ± .032   .904 ± .031
CE+SCL           100   .833 ± .042   .883 ± .043
CE+DASCL         100   .858 ± .017   .935 ± .012
CE+DASCL w/ DA   100   .828 ± .020   .908 ± .012
CE               1000  .903 ± .006   .962 ± .007
CE w/ DA         1000  .899 ± .005   .956 ± .006
CE+SCL           1000  .905 ± .005   .960 ± .011
CE+DASCL         1000  .906 ± .006   .959 ± .009
CE+DASCL w/ DA   1000  .904 ± .004   .960 ± .011

Table 1: Accuracy and average precision over the SST-2 test set in few-shot learning settings. Results are averages over 10 random seeds with standard deviations reported. DA refers to data augmentation, where the keyword-simplified text augments the training corpus.
DASCL improves results the most when there are only a few observations in the training set. When N = 20, using DASCL yields a 10.2 point improvement in accuracy over using the cross-entropy loss function (p < .001) and a 6.8 point improvement in accuracy over using the SCL loss function (p = .023). Figure 3 in Section A.3.8 visualizes the learned embeddings using each of these loss functions using t-SNE plots. When the training set's size increases, the benefits of using DASCL decrease. DASCL only has a slightly higher accuracy when using 1,000 labeled observations, and the difference between DASCL and cross-entropy alone is insignificant (p = .354).
4.2 New York Times Articles about the
Economy
Barberá et al. (2021) classify the tone of New York Times articles about the American economy as positive or negative. 3,119 of the 8,462 labeled articles (3,852 unique articles) in the training set are labeled positive; 162 of the 420 articles in the test set are labeled positive. Accuracy is the primary metric of interest; we also report average precision. In addition to using the full training set, we also experiment with training sets of sizes 100 and 1,000. We use the positive and negative dictionaries from Lexicoder (Young and Soroka, 2012) and dictionaries of positive and negative economic terms (Hopkins et al., 2017). Barberá et al. (2021) use logistic regression with L2 regularization. We use RoBERTa-base as the pretrained language model. Section A.4 contains more details about the data, hyperparameters, and other evaluation metrics. Table 2 shows the results across the three training set configurations.

Loss             N     Accuracy      Avg. Precision
L2 Logit         100   .614          .479
CE               100   .673 ± .027   .593 ± .048
CE w/ DA         100   .663 ± .030   .576 ± .058
CE+SCL           100   .614 ± .000   .394 ± .043
CE+DASCL         100   .705 ± .013   .645 ± .016
CE+DASCL w/ DA   100   .711 ± .013   .644 ± .027
L2 Logit         1000  .624          .482
CE               1000  .716 ± .012   .662 ± .030
CE w/ DA         1000  .710 ± .011   .656 ± .024
CE+SCL           1000  .722 ± .009   .670 ± .022
CE+DASCL         1000  .732 ± .011   .671 ± .025
CE+DASCL w/ DA   1000  .733 ± .008   .681 ± .021
L2 Logit         Full  .681          .624
CE               Full  .753 ± .012   .713 ± .015
CE w/ DA         Full  .752 ± .011   .708 ± .017
CE+SCL           Full  .756 ± .011   .723 ± .009
CE+DASCL         Full  .759 ± .006   .741 ± .010
CE+DASCL w/ DA   Full  .760 ± .008   .739 ± .014

Table 2: Accuracy and average precision over the economic media test set (Barberá et al., 2021) when using 100, 1,000, and all labeled examples from the training set for fine-tuning. Except for logistic regression, results are averages over 10 random seeds with standard deviations reported.
When N = 100, DASCL outperforms cross-entropy only, cross-entropy with data augmentation, and SCL on accuracy (p < .005 for all) and average precision (p < .01 for all). When N = 1,000, DASCL outperforms cross-entropy only, cross-entropy with data augmentation, and SCL on accuracy (p < .05 for all) and average precision (but not statistically significantly). DASCL performs statistically equivalently to DASCL with data augmentation across all metrics when N = 100 and N = 1,000.

When using the full training set, RoBERTa-base is a general improvement over logistic regression. Although the DASCL losses have slightly higher accuracy than the other RoBERTa-based models, the differences are not statistically significant. Us-