Exploring Euphemism Detection in Few-Shot and Zero-Shot Settings
Sedrick Scott Keh
Carnegie Mellon University
skeh@cs.cmu.edu
Abstract
This work builds upon the Euphemism De-
tection Shared Task proposed in the EMNLP
2022 FigLang Workshop, and extends it to
few-shot and zero-shot settings. We demon-
strate a few-shot and zero-shot formulation
using the dataset from the shared task, and
we conduct experiments in these settings us-
ing RoBERTa and GPT-3. Our results show
that language models are able to classify eu-
phemistic terms relatively well even on new
terms unseen during training, indicating that they are able to capture higher-level concepts related
to euphemisms.
1 Introduction
Euphemisms are figures of speech which aim to
soften the blow of certain words which may be
too direct or too harsh (Magu and Luo, 2018; Felt and Riloff, 2020). In the EMNLP 2022 FigLang
Workshop Euphemism Shared Task, participating
teams are given a set of sentences with potentially
euphemistic terms (PETs) enclosed in brackets, and
the task is to classify whether or not the PET in a
given sentence is used euphemistically.
In this task/dataset, however, there are many
PETs which are repeated throughout both the train-
ing and testing sets (more details in Section 3). In
addition, several PETs are classified as euphemistic
almost 100% of the time during training. This
raises an important question: is the model actually
learning to classify what a euphemism is, or is it
simply reflecting back things it has seen repeatedly
during training? How do we know if the model
we train can truly capture the essence of what a
euphemism is? Even among humans, this is a very
nontrivial task. If one hears the phrase “lose one’s
lunch” for the first time, for example, it may not
be immediately obvious that it is a euphemism for
throwing up. However, when used in a sentence,
the context clues together with an understanding
of the meanings of the words “lose” and “lunch”
will allow a human to piece together the meaning.
For a machine to be able to do this, however, is not
trivial at all.
To this end, we check whether
a model can correctly classify PETs it has never
seen during training. This leads us to our few-
shot/zero-shot setting. The two key contributions
of our paper are as follows: 1) We propose and
formulate the few-shot and zero-shot euphemism
detection settings; and 2) We run initial baselines
in these settings using RoBERTa and GPT-3,
and we present a thorough analysis of our results.
2 Related Work
Compared to other figures of speech like similes (Chakrabarty et al., 2020) and metaphors (Chakrabarty et al., 2021), work on euphemisms has been limited. Recently, Gavidia et al. (2022) and Lee et al. (2022) released a new dataset of diverse euphemisms and conducted an analysis of automatically identifying potentially euphemistic terms. In
the past, Felt and Riloff (2020) used sentiment anal-
ysis techniques to recognize euphemistic and dys-
phemistic phrases. Other studies also focused on
specific euphemistic categories such as hate speech
(Magu and Luo, 2018) and drugs (Zhu et al., 2021).
In terms of zero-shot figurative language detec-
tion, the existing literature has also been quite lim-
ited. The few existing studies (Schneider et al.,
2022) mostly focus on metaphors and on low-
resource settings. This leaves out less common
figures of speech such as euphemisms, and the low-
resource formulation is also not exactly identical
to the zero-shot setting we explore in this paper.
3 Task and Dataset
Our task is similar to the FigLang 2022 Workshop
Shared Task on Euphemism Detection. Given a
sentence containing a potentially euphemistic term
(PET), we want to determine whether the PET is
used euphemistically.
Setting              Ave. Test Size   Ave. # of Unique PETs in Test
Standard                  295.0                 93.3
Few-Shot (k=1)            279.6                 35.0
Few-Shot (k=3)            281.2                 35.4
0-Shot (random)           280.6                 34.3
Death                     174.0                 14.9
Sexual Activity            45.0                 10.4
Employment                176.0                 23.5
Politics                  161.0                 20.9
Bodily Functions           26.0                  7.0
Physical/Mental           299.0                 36.0
Substances                 88.0                  9.1

Table 1: Dataset statistics for the few-shot and zero-shot settings. Because there is some stochasticity involved in dataset creation, we take averages over 10 samples. The bottom seven rows correspond to the type-based zero-shot settings, one per euphemistic category.
The key difference with our task is that we perform the binary classification in a few-shot/zero-shot setting. As in the shared task, we use the
dataset proposed by Gavidia et al. (2022), which
contains 1965 sentences with PETs, split across
129 unique PETs and 7 different euphemistic cate-
gories (e.g., death, employment, etc.). Furthermore,
the dataset also contains additional information
such as the category and the status of the PET (“al-
ways euph” vs “sometimes euph”).
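
As a concrete illustration, one row of this dataset might be represented as below. This is our own hypothetical shorthand for exposition; the field names and the example sentence are illustrative, not necessarily the dataset's actual column names or contents.

```python
# A hypothetical representation of one dataset row; field names are
# illustrative, not the dataset's actual column names.
example_row = {
    "sentence": "He [passed away] peacefully last night.",  # PET enclosed in brackets
    "pet": "passed away",
    "label": 1,                # 1 = used euphemistically, 0 = literal
    "category": "death",       # one of the 7 euphemistic categories
    "status": "always euph",   # vs. "sometimes euph"
}
```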
4 Methodology
4.1 Constructing the Few-Shot Setting
For the $k$-shot setting, we want the PETs in the validation/test set to have appeared in the training set only $k$ times. Let our set of PETs be $P = \{p_1, p_2, \ldots, p_N\}$. We construct the test set as follows. First, we randomly sample a PET $p_i$ from $P$, then find all sentences $s_1, s_2, \ldots, s_M$ containing PET $p_i$. Out of these $M$ sentences, we sample $k$ sentences $s_{j_1}, s_{j_2}, \ldots, s_{j_k}$ to keep in our training set, moving all $(M - k)$ remaining sentences to our test set. We repeat this process until we reach the desired size for our validation/test set. In our case, we stop when the validation and test sets each reach around 15% of our entire dataset ($\pm 2\%$, to account for the fact that it is unlikely to reach 15% exactly). In practice, we sample 30% for the combined validation+test set, then randomly split this 30% into two sets of 15% in order to increase the PET diversity in both the validation and test splits. For the $k$-shot setting, we use $k = 1$ and $k = 3$. The dataset statistics for the $k$-shot datasets can be found in Table 1. A sketch of this construction is given below.
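
The following is a minimal sketch of this split construction, assuming rows are dicts with a "pet" field as in the illustrative row format in Section 3. It is an illustrative reimplementation, not the exact code used in our experiments.

```python
import random
from collections import defaultdict

def make_k_shot_split(rows, k, heldout_frac=0.30, seed=0):
    """Hold out ~30% of the data so that every PET appearing in the
    held-out pool is seen exactly k times during training.

    `rows` is assumed to be a list of dicts with a "pet" field; this
    is an illustrative sketch, not the authors' exact code.
    """
    rng = random.Random(seed)

    # Group sentences by their potentially euphemistic term (PET).
    by_pet = defaultdict(list)
    for row in rows:
        by_pet[row["pet"]].append(row)

    pets = list(by_pet)
    rng.shuffle(pets)

    heldout = []
    for pet in pets:
        # Stop once the held-out pool reaches ~30% of the dataset;
        # it is then split 15%/15% into validation and test.
        if len(heldout) >= heldout_frac * len(rows):
            break
        sents = by_pet[pet]
        if len(sents) <= k:
            continue  # nothing would be left to hold out for this PET
        keep_idx = set(rng.sample(range(len(sents)), k))  # k sentences stay in training
        heldout.extend(s for i, s in enumerate(sents) if i not in keep_idx)
        by_pet[pet] = [sents[i] for i in keep_idx]  # the (M - k) rest are moved out

    # Everything still in by_pet forms the training set.
    train = [s for group in by_pet.values() for s in group]

    # Split the held-out pool evenly to diversify PETs across val/test.
    rng.shuffle(heldout)
    mid = len(heldout) // 2
    return train, heldout[:mid], heldout[mid:]
```

Note that calling this function with $k = 0$ holds out every sentence of each sampled PET, which is exactly the random-sampling zero-shot construction described next.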
4.2 Constructing the Zero-Shot Setting
For the zero-shot setting, we want the PETs in the
validation/test set to never have appeared in the
training set. There are two ways to achieve this:
1. Random Sampling – The construction here is similar to that of the few-shot setting, except that instead of sampling $s_{j_1}, s_{j_2}, \ldots, s_{j_k}$ to keep in the training set, we move all $M$ sentences $s_1, s_2, \ldots, s_M$ containing the sampled PET to our validation/test set. (Equivalently, this is the $k$-shot construction with $k = 0$.)
2. Type-based – Rather than randomly choosing assorted PETs to hold out for our test set, we instead choose the test set PETs to all come from a single category, while the training set comes from the remaining categories. These categories are provided alongside the sentences in the dataset by Gavidia et al. (2022), and there are 7 categories in total. Because some categories contain more sentences (and more PETs) than others, the sizes of the training splits for these categories will differ. To address this, we subsample from the training splits of the categories with excess rows to match the training category with the fewest rows (see the sketch below). This ensures that all categories have an equal number of training rows, so any changes in performance are likely due to data quality rather than to simply having more or less data. This gives us a training size of 1367 rows for each category. The test splits also differ in size across categories, but we leave them unchanged rather than subsampling as we did for training: the smallest testing category has only 26 rows (“bodily functions”) while others have 200+ (“physical/mental”), so forcing identical test sizes would be impractical.
Statistics for these datasets can be found in Table 1.
In theory, having larger test sets will mostly affect
the variance, but the mean should not be affected
that much. We comment more on this in Section 6.
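
As a companion to the earlier sketch, the following is a minimal sketch of the type-based construction. The training size of 1367 is the number reported above; the function name and field names are again our own illustrative choices.

```python
import random

def make_type_based_split(rows, test_category, train_size=1367, seed=0):
    """Hold out one euphemistic category entirely as the test set and
    subsample the remaining categories to a fixed training size.

    Assumes each row is a dict with a "category" field; illustrative
    sketch only. train_size=1367 matches the figure in Section 4.2.
    """
    rng = random.Random(seed)
    test = [r for r in rows if r["category"] == test_category]
    pool = [r for r in rows if r["category"] != test_category]

    # Subsample so that every choice of held-out category trains on the
    # same number of rows; test sizes are left unequal, as in the paper.
    train = rng.sample(pool, min(train_size, len(pool)))
    return train, test
```

For instance, make_type_based_split(rows, "death") would train on the six remaining categories and test on every death-category sentence.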
4.3 Models
We consider two different types of baseline models.
First, we consider models that we can reasonably fine-tune. For this group, we select RoBERTa (Liu et al., 2019), covering both the RoBERTa-base
model and the RoBERTa-large model, which have
been extensively used for classification. The ratio-
nale behind choosing RoBERTa was twofold. First,
RoBERTa is a commonly used standard for various