Dictionary-Assisted Supervised Contrastive Learning
Patrick Y. Wu1, Richard Bonneau1,2,4,5, Joshua A. Tucker1,2,3, and Jonathan Nagler1,2,3
1Center for Social Media and Politics, New York University
2Center for Data Science, New York University
3Department of Politics, New York University
4Department of Biology, New York University
5Courant Institute of Mathematical Sciences, New York University
{pyw230, bonneau, joshua.tucker, jonathan.nagler}@nyu.edu
Abstract
Text analysis in the social sciences often involves using specialized dictionaries to reason with abstract concepts, such as perceptions about the economy or abuse on social media. These dictionaries allow researchers to impart domain knowledge and note subtle usages of words relating to a concept(s) of interest. We introduce the dictionary-assisted supervised contrastive learning (DASCL) objective, allowing researchers to leverage specialized dictionaries when fine-tuning pretrained language models. The text is first keyword simplified: a common, fixed token replaces any word in the corpus that appears in the dictionary(ies) relevant to the concept of interest. During fine-tuning, a supervised contrastive objective draws closer the embeddings of the original and keyword-simplified texts of the same class while pushing further apart the embeddings of different classes. The keyword-simplified texts of the same class are more textually similar than their original text counterparts, which additionally draws the embeddings of the same class closer together. Combining DASCL and cross-entropy improves classification performance metrics in few-shot learning settings and social science applications compared to using cross-entropy alone and alternative contrastive and data augmentation methods.¹

¹ Our code is available at https://github.com/SMAPPNYU/DASCL.
1 Introduction
We propose a supervised contrastive learning approach that allows researchers to incorporate dictionaries of words related to a concept of interest when fine-tuning pretrained language models. It is conceptually simple, requires low computational resources, and is usable with most pretrained language models.
Dictionaries contain words that hint at the sentiment, stance, or perception of a document (see, e.g., Fei et al., 2012). Social science experts often craft these dictionaries, making them useful when the underlying concept of interest is abstract (see, e.g., Brady et al., 2017; Young and Soroka, 2012). Dictionaries are also useful when specific words that are pivotal to determining the classification of a document may not exist in the training data. This is a particularly salient issue with small corpora, which is often the case in the social sciences.
However, recent supervised machine learning approaches do not use these dictionaries. We propose a contrastive learning approach, dictionary-assisted supervised contrastive learning (DASCL), that allows researchers to leverage these expert-crafted dictionaries when fine-tuning pretrained language models. We replace all the words in the corpus that belong to a specific lexicon with a fixed, common token. When using an appropriate dictionary, keyword simplification increases the textual similarity of documents in the same class. We then use a supervised contrastive objective to draw together text embeddings of the same class and push further apart the text embeddings of different classes (Khosla et al., 2020; Gunel et al., 2021). Figure 1 visualizes the intuition of our proposed method.
The contributions of this project are as follows.

• We propose keyword simplification, detailed in Section 3.1, to make documents of the same class more textually similar.

• We outline a supervised contrastive loss function, described in Section 3.2, that learns patterns within and across the original and keyword-simplified texts.
• We find classification performance improvements in few-shot learning settings and social science applications compared to two strong baselines: (1) RoBERTa (Liu et al., 2019) / BERT (Devlin et al., 2019) fine-tuned with cross-entropy loss, and (2) the supervised contrastive learning approach detailed in Gunel et al. (2021), the most closely related approach to DASCL. To be clear, although BERT and RoBERTa are not state-of-the-art pretrained language models, DASCL can augment the loss functions of state-of-the-art pretrained language models.

Figure 1: The blue dots are embeddings of the original reviews and the white dots are the embeddings of the keyword-simplified reviews from the SST-2 dataset (Wang et al., 2018). Both reviews are positive, although they do not overlap in any positive words used. The reviews are more textually similar after keyword simplification. Using BERT-base-uncased out-of-the-box, the cosine similarity between the original reviews is .654 and the cosine similarity between the keyword-simplified reviews is .842. Although there are some issues with using cosine similarity with BERT embeddings (see, e.g., Ethayarajh, 2019; Zhou et al., 2022), we use it as a rough heuristic here.
2 Related Work
Use of Pretrained Language Models in the Social Sciences. Transformers-based pretrained language models have become the de facto approach when classifying text data (see, e.g., Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020), and are seeing wider adoption in the social sciences. Terechshenko et al. (2021) show that RoBERTa and XLNet (Yang et al., 2019) outperform bag-of-words approaches for political science text classification tasks. Ballard et al. (2022) use BERTweet (Nguyen et al., 2020) to classify tweets expressing polarizing rhetoric. Lai et al. (2022) use BERT to classify the political ideologies of YouTube videos using text video metadata. DASCL can be used with most pretrained language models, so it can potentially improve results across a range of social science research.
Usage of Dictionaries. Dictionaries play an important role in understanding the meaning behind text in the social sciences. Brady et al. (2017) use a moral and emotional dictionary to predict whether tweets using these types of terms increase their diffusion within and between ideological groups. Simchon et al. (2022) create a dictionary of politically polarized language and analyze how trolls use this language on social media. Hopkins et al. (2017) use dictionaries of positive and negative economic terms to understand perceptions of the economy in newspaper articles. Although dictionary-based classification has fallen out of favor, dictionaries still contain valuable information about usages of specific or subtle language.
Text Data Augmentation. Text data augmentation techniques include backtranslation (Sennrich et al., 2016) and rule-based data augmentations such as random synonym replacements, random insertions, random swaps, and random deletions (Wei and Zou, 2019; Karimi et al., 2021). Shorten et al. (2021) survey text data augmentation techniques. Longpre et al. (2020) find that task-agnostic data augmentations typically do not improve the classification performance of pretrained language models. We choose dictionaries for keyword simplification based on the concept of interest underlying the classification task and use the keyword-simplified text with a contrastive loss function.
Contrastive Learning. Most works on contrastive learning have focused on self-supervised contrastive learning. In computer vision, images and their augmentations are treated as positives and other images as negatives. Recent contrastive learning approaches match or outperform their supervised pretrained image model counterparts, often using a small fraction of available annotated data (see, e.g., Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; Grill et al., 2020). Self-supervised contrastive learning has also been used in natural language processing, matching or outperforming pretrained language models on benchmark tasks (see, e.g., Fang et al., 2020; Klein and Nabi, 2020).

Our approach is most closely related to works on supervised contrastive learning. Wen et al. (2016) propose a loss function called center loss that minimizes the intraclass distances of the convolutional neural network features. Khosla et al. (2020) develop a supervised loss function that generalizes NT-Xent (Chen et al., 2020a) to an arbitrary number of positives. Our work is closest to that of Gunel et al. (2021), who also use a version of NT-Xent extended to an arbitrary number of positives with pretrained language models. Their supervised contrastive loss function is detailed in Section A.1.
Figure 2: Overview of the proposed method. Although RoBERTa is shown, any pretrained language model will work with this approach. The two RoBERTa networks share the same weights. The dimension of the projection layer is arbitrary.
3 Method
The approach consists of keyword simplification
and the contrastive objective function. Figure 2
shows an overview of the proposed framework.
3.1 Keyword Simplification
The first step of the DASCL framework is keyword simplification. We select a set of M dictionaries D. For each dictionary d_i ∈ D, i ∈ {1, ..., M}, we assign a token t_i. Then, we iterate through the corpus and replace any word w_j in dictionary d_i with the token t_i. We repeat these steps for each dictionary. For example, if we have a dictionary of positive words, then applying keyword simplification to

    a wonderfully warm human drama that remains vividly in memory long after viewing

would yield

    a <positive> <positive> human drama that remains <positive> in memory long after viewing
There are many off-the-shelf dictionaries that can be used during keyword simplification. Table 4 in Section A.2 contains a sample of dictionaries reflecting various potential concepts of interest.
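To make the procedure concrete, here is a minimal sketch of keyword simplification in Python. The function name, the word-matching rule, and the toy dictionary are our own illustration; the released DASCL code may handle tokenization, casing, and wildcard dictionary entries differently.

```python
import re


def keyword_simplify(text: str, dictionaries: dict[str, set[str]]) -> str:
    """Replace every word found in a dictionary with that dictionary's fixed token.

    `dictionaries` maps a replacement token t_i to its lexicon d_i,
    e.g. {"<positive>": {"wonderfully", "warm", ...}}.
    """
    simplified = []
    for raw_token in text.split():
        word = re.sub(r"\W+", "", raw_token).lower()  # strip punctuation for matching
        replacement = raw_token
        for fixed_token, lexicon in dictionaries.items():
            if word in lexicon:
                replacement = fixed_token
                break
        simplified.append(replacement)
    return " ".join(simplified)


# Toy positive-word dictionary reproducing the example above:
positive_words = {"wonderfully", "warm", "vividly"}
print(keyword_simplify(
    "a wonderfully warm human drama that remains vividly in memory long after viewing",
    {"<positive>": positive_words},
))
# -> a <positive> <positive> human drama that remains <positive> in memory long after viewing
```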
3.2 Dictionary-Assisted Supervised
Contrastive Learning (DASCL) Objective
The dictionary-assisted supervised contrastive learning loss function resembles the loss functions from Khosla et al. (2020) and Gunel et al. (2021). Consistent with Khosla et al. (2020), we project the final hidden layer of the pretrained language model to an embedding of a lower dimension before using the contrastive loss function.
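For concreteness, a minimal sketch of this projection step, assuming a single linear projection of the classifier-token embedding. The module name, the 768-dimensional RoBERTa-base hidden size, and the 128-dimensional projection are illustrative (per Figure 2, the projection dimension is arbitrary), and the authors' projection head may differ.

```python
import torch
import torch.nn.functional as F
from torch import nn


class ProjectionHead(nn.Module):
    """Projects the encoder's classifier-token embedding to a lower-dimensional,
    L2-normalized vector (the Psi(.) used in the notation below); this output
    feeds only the contrastive loss, not the classifier."""

    def __init__(self, hidden_dim: int = 768, proj_dim: int = 128):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, proj_dim)

    def forward(self, cls_embedding: torch.Tensor) -> torch.Tensor:
        # L2-normalize so that dot products behave like cosine similarities
        return F.normalize(self.linear(cls_embedding), p=2, dim=-1)
```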
Let Ψ(x_i), i ∈ {1, ..., N}, be the L2-normalized projection of the output of the pretrained language encoder for the original text and Ψ(x_{i+N}) be the corresponding L2-normalized projection of the output for the keyword-simplified text. τ > 0 is the temperature parameter that controls the separation of the classes, and λ ∈ [0, 1] is the parameter that balances the cross-entropy and the DASCL loss functions. We choose λ and directly optimize τ during training. In our experiments, we use the classifier token as the output of the pretrained language encoder. Equation 1 is the DASCL loss, Equation 2 is the multiclass cross-entropy loss, and Equation 3 is the overall loss that is optimized when fine-tuning the pretrained language model. The original text and the keyword-simplified text are used with the DASCL loss (Eq. 1); only the original text is used with the cross-entropy loss. The keyword-simplified text is not used during inference.
\mathcal{L}_{\mathrm{DASCL}} = \frac{1}{2N} \sum_{i=1}^{2N} \frac{-1}{2N_{y_i} - 1} \sum_{\substack{j=1 \\ j \neq i,\; y_j = y_i}}^{2N} \log \left[ \frac{\exp\left(\Psi(x_i) \cdot \Psi(x_j) / \tau\right)}{\sum_{k=1,\, k \neq i}^{2N} \exp\left(\Psi(x_i) \cdot \Psi(x_k) / \tau\right)} \right]   (1)

where N_{y_i} is the number of original-text examples in the minibatch with the same label as x_i, so each anchor has 2N_{y_i} - 1 positives.

\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=0}^{C} y_{i,c} \log \hat{y}_{i,c}   (2)

\mathcal{L} = (1 - \lambda)\, \mathcal{L}_{\mathrm{CE}} + \lambda\, \mathcal{L}_{\mathrm{DASCL}}   (3)
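Below is a minimal PyTorch sketch of Equations 1-3. It assumes `z` stacks the L2-normalized projections of the N original texts followed by their N keyword-simplified counterparts (e.g., from the projection head sketched above) and that `logits` come from the classifier head applied to the original texts only; function and variable names are ours rather than the released implementation's.

```python
import torch
import torch.nn.functional as F


def dascl_loss(z: torch.Tensor, labels: torch.Tensor, temperature: float) -> torch.Tensor:
    """Eq. 1: supervised contrastive loss over the 2N projections `z` with class `labels`."""
    two_n = z.size(0)
    self_mask = torch.eye(two_n, dtype=torch.bool, device=z.device)
    sim = (z @ z.T) / temperature                        # Psi(x_i) . Psi(x_k) / tau for all pairs
    sim = sim.masked_fill(self_mask, float("-inf"))      # drop k = i from the denominator
    log_prob = F.log_softmax(sim, dim=1)                 # log of the bracketed ratio in Eq. 1
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    n_pos = pos_mask.sum(dim=1).clamp(min=1)             # 2N_{y_i} - 1 positives per anchor
    per_anchor = -torch.where(pos_mask, log_prob, torch.zeros_like(log_prob)).sum(dim=1) / n_pos
    return per_anchor.mean()                             # the 1/(2N) average over anchors


def overall_loss(logits, labels, z, lam, temperature):
    """Eq. 3: (1 - lambda) * cross-entropy (original text only, Eq. 2) + lambda * DASCL."""
    ce = F.cross_entropy(logits, labels)
    dascl = dascl_loss(z, torch.cat([labels, labels]), temperature)
    return (1.0 - lam) * ce + lam * dascl
```

Since the paper optimizes τ directly during training, the temperature could be registered as a learnable parameter (for instance by learning log τ so that τ stays positive); that detail, like the names above, is an assumption rather than a documented choice.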
4 Experiments
4.1 Few-Shot Learning with SST-2
SST-2, a GLUE benchmark dataset (Wang et al., 2018), consists of sentences from movie reviews and binary labels of sentiment (positive or negative). Similar to Gunel et al. (2021), we experiment with SST-2 with three training set sizes: N = 20, 100, and 1,000. Accuracy is this benchmark's primary metric of interest; we also report average precision. We use RoBERTa-base as the pretrained language model. For keyword simplification, we use the opinion lexicon (Hu and Liu, 2004), which contains dictionaries of positive and negative words. Section A.3.3 further describes these dictionaries.
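As a point of reference for the two reported metrics, a generic scikit-learn sketch (not the authors' evaluation code): average precision summarizes the precision-recall curve and is computed from the model's positive-class scores rather than hard predictions.

```python
from sklearn.metrics import accuracy_score, average_precision_score

# y_true: gold binary labels; y_prob: predicted probability of the positive class
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1]
y_pred = [int(p >= 0.5) for p in y_prob]

accuracy = accuracy_score(y_true, y_pred)                # fraction of correct hard predictions
avg_precision = average_precision_score(y_true, y_prob)  # area under the precision-recall curve
print(accuracy, avg_precision)
```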
We compare DASCL to two other baselines: RoBERTa-base using the cross-entropy (CE) loss function and the combination of the cross-entropy and supervised contrastive learning (SCL) loss functions used in Gunel et al. (2021). We also experiment with augmenting the corpus with the keyword-simplified text (referred to as "data augmentation," or "DA," in results tables). In other words, when data augmentation is used, both the original text and the keyword-simplified text are used with the cross-entropy loss.

We use the original validation set from the GLUE benchmark as the test set, and we sample our own validation set from the training set of equal size to this test set. Further details about the data and hyperparameter configurations can be found in Section A.3. Table 1 shows the results across the three training set configurations.

Loss             N     Accuracy      Avg. Precision
CE               20    .675 ± .066   .791 ± .056
CE w/ DA         20    .650 ± .051   .748 ± .050
CE+SCL           20    .709 ± .077   .826 ± .068
CE+DASCL         20    .777 ± .024   .871 ± .014
CE+DASCL w/ DA   20    .697 ± .075   .796 ± .064
CE               100   .822 ± .019   .897 ± .023
CE w/ DA         100   .831 ± .032   .904 ± .031
CE+SCL           100   .833 ± .042   .883 ± .043
CE+DASCL         100   .858 ± .017   .935 ± .012
CE+DASCL w/ DA   100   .828 ± .020   .908 ± .012
CE               1000  .903 ± .006   .962 ± .007
CE w/ DA         1000  .899 ± .005   .956 ± .006
CE+SCL           1000  .905 ± .005   .960 ± .011
CE+DASCL         1000  .906 ± .006   .959 ± .009
CE+DASCL w/ DA   1000  .904 ± .004   .960 ± .011

Table 1: Accuracy and average precision over the SST-2 test set in few-shot learning settings. Results are averages over 10 random seeds with standard deviations reported. DA refers to data augmentation, where the keyword-simplified text augments the training corpus.
DASCL improves results the most when there are only a few observations in the training set. When N = 20, using DASCL yields a 10.2 point improvement in accuracy over using the cross-entropy loss function (p < .001) and a 6.8 point improvement in accuracy over using the SCL loss function (p = .023). Figure 3 in Section A.3.8 visualizes the learned embeddings using each of these loss functions using t-SNE plots. When the training set's size increases, the benefits of using DASCL decrease. DASCL only has a slightly higher accuracy when using 1,000 labeled observations, and the difference between DASCL and cross-entropy alone is insignificant (p = .354).
4.2 New York Times Articles about the
Economy
Barberá et al. (2021) classify the tone of New York Times articles about the American economy as positive or negative. 3,119 of the 8,462 labeled articles (3,852 unique articles) in the training set are labeled positive; 162 of the 420 articles in the test set are labeled positive. Accuracy is the primary metric of interest; we also report average precision. In addition to using the full training set, we also experiment with training sets of sizes 100 and 1,000. We use the positive and negative dictionaries from Lexicoder (Young and Soroka, 2012) and dictionaries of positive and negative economic terms (Hopkins et al., 2017). Barberá et al. (2021) use logistic regression with L2 regularization. We use RoBERTa-base as the pretrained language model. Section A.4 contains more details about the data, hyperparameters, and other evaluation metrics. Table 2 shows the results across the three training set configurations.

Loss             N     Accuracy      Avg. Precision
L2 Logit         100   .614          .479
CE               100   .673 ± .027   .593 ± .048
CE w/ DA         100   .663 ± .030   .576 ± .058
CE+SCL           100   .614 ± .000   .394 ± .043
CE+DASCL         100   .705 ± .013   .645 ± .016
CE+DASCL w/ DA   100   .711 ± .013   .644 ± .027
L2 Logit         1000  .624          .482
CE               1000  .716 ± .012   .662 ± .030
CE w/ DA         1000  .710 ± .011   .656 ± .024
CE+SCL           1000  .722 ± .009   .670 ± .022
CE+DASCL         1000  .732 ± .011   .671 ± .025
CE+DASCL w/ DA   1000  .733 ± .008   .681 ± .021
L2 Logit         Full  .681          .624
CE               Full  .753 ± .012   .713 ± .015
CE w/ DA         Full  .752 ± .011   .708 ± .017
CE+SCL           Full  .756 ± .011   .723 ± .009
CE+DASCL         Full  .759 ± .006   .741 ± .010
CE+DASCL w/ DA   Full  .760 ± .008   .739 ± .014

Table 2: Accuracy and average precision over the economic media test set (Barberá et al., 2021) when using 100, 1,000, and all labeled examples from the training set for fine-tuning. Except for logistic regression, results are averages over 10 random seeds with standard deviations reported.
When N = 100, DASCL outperforms cross-entropy only, cross-entropy with data augmentation, and SCL on accuracy (p < .005 for all) and average precision (p < .01 for all). When N = 1,000, DASCL outperforms cross-entropy only, cross-entropy with data augmentation, and SCL on accuracy (p < .05 for all) and average precision (but not statistically significantly). DASCL performs statistically equivalently to DASCL with data augmentation across all metrics when N = 100 and N = 1,000.

When using the full training set, RoBERTa-base is a general improvement over logistic regression. Although the DASCL losses have slightly higher accuracy than the other RoBERTa-based models, the differences are not statistically significant. Us-