EUREKA: EUphemism Recognition Enhanced Through KNN-based Methods and Augmentation

Sedrick Scott Keh¹*, Rohit Bharadwaj²*, Emmy Liu¹†, Simone Tedeschi³,⁴†, Varun Gangal¹, Roberto Navigli³
¹Carnegie Mellon University, ²Mohamed bin Zayed University of Artificial Intelligence,
³Sapienza University of Rome, ⁴Babelscape, Italy
{skeh,mengyan3,vgangal}@cs.cmu.edu, rohit.bharadwaj@mbzuai.ac.ae
{tedeschi,navigli}@diag.uniroma1.it
Abstract

We introduce EUREKA, an ensemble-based approach for performing automatic euphemism detection. We (1) identify and correct potentially mislabelled rows in the dataset, (2) curate an expanded corpus called EuphAug, (3) leverage model representations of Potentially Euphemistic Terms (PETs), and (4) explore using representations of semantically close sentences to aid in classification. Using our augmented dataset and kNN-based methods, EUREKA¹ was able to achieve state-of-the-art results on the public leaderboard of the Euphemism Detection Shared Task, ranking first with a macro F1 score of 0.881.
1 Introduction

Euphemisms are mild or indirect expressions used in place of harsher or more direct ones. In everyday speech, euphemisms function as a means to politely discuss taboo or sensitive topics (Danescu-Niculescu-Mizil et al., 2013), to downplay certain situations (Karam, 2011), or to mask intent (Magu and Luo, 2018). The Euphemism Detection task is a key stepping stone to developing natural language systems that are able to process (Tedeschi et al., 2022; Liu et al., 2022; Jhamtani et al., 2021) and generate non-literal texts.
In this paper, we detail our approach to the Euphemism Detection Shared Task at the EMNLP 2022 FigLang Workshop². We achieve performance improvements on two fronts:
1. Data – We explore various data cleaning and data augmentation (Shorten and Khoshgoftaar, 2019; Feng et al., 2021; Dhole et al., 2021) strategies. We identify and correct potentially mislabelled rows, and we curate a new dataset called EuphAug by extracting sentences from a large unlabelled corpus using semantic representations of the sentences or euphemistic terms in the initial training corpus.

* Equal contribution by S. Keh and R. Bharadwaj
† Equal contribution by E. Liu and S. Tedeschi
¹ Our code is available at https://github.com/sedrickkeh/EUREKA
² https://sites.google.com/view/figlang2022/home?authuser=0
2. Modelling – We explore various representational and design choices, such as leveraging the LM representations of the tokens of the euphemistic expressions (rather than the [CLS] token) and incorporating sentential context through kNN augmentation and deep averaging networks.
Using these methods, we develop a system called EUREKA, which achieves a macro F1 score of 0.881 on the public leaderboard and ranks first among all submissions. We found the data innovations to be more significant in our case, indicating that euphemistic terms can be classified with some accuracy if potentially euphemistic spans are identified earlier in the pipeline.
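As a minimal illustration of the second idea, pooling the LM states of the PET's own tokens in place of the [CLS] vector could be sketched as follows. This is a hedged sketch, not our exact implementation: the hidden states are assumed to come from any transformer encoder, and the span indices from the delimiter tokens around the PET.

```python
import numpy as np

def pet_span_representation(hidden_states: np.ndarray, span: tuple) -> np.ndarray:
    """Mean-pool the contextual vectors of the PET's tokens.

    hidden_states: (seq_len, dim) last-layer states from an LM encoder
                   (illustrative input; any transformer encoder would do).
    span: (start, end) token indices of the PET, end-exclusive,
          e.g. recovered from the delimiter tokens around the PET.
    Returns a single (dim,) vector to feed the classifier, used in
    place of the [CLS] vector.
    """
    start, end = span
    return hidden_states[start:end].mean(axis=0)
```

The resulting vector is sentence-contextualized but centered on the PET itself, which is the design choice the paragraph above describes.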
2 Task Settings and Dataset

2.1 Task Settings

The task and dataset are specified by the Euphemism Detection Shared Task, which uses a subset of the euphemism detection dataset of Gavidia et al. (2022). The goal of the task is to classify a Potentially Euphemistic Term (PET) enclosed within delimiter tokens as either literal or euphemistic in that context. The training set contains 207 unique PETs and 1571 samples, of which 1106 are classified as euphemisms.
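To make the setup concrete, a training row might look like the following. The field names and the angle-bracket delimiters are illustrative (chosen to match the examples in Table 1), not the organizers' exact schema.

```python
import re

# Hypothetical sample mirroring the task setup: the PET is enclosed in
# delimiter tokens, and the label is 1 (euphemistic) or 0 (literal).
sample = {
    "text": "My grandfather <passed away> last spring.",
    "label": 1,  # "passed away" is used euphemistically here
}

def extract_pet(text: str) -> str:
    """Recover the delimited PET span from a sentence."""
    match = re.search(r"<([^>]+)>", text)
    return match.group(1) if match else ""
```

For the sample above, `extract_pet(sample["text"])` yields "passed away", the span the classifier must disambiguate in context.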
2.2 Data Cleaning

Gavidia et al. (2022) characterize common sources of ambiguity and disagreement among annotators. However, while exploring the data, we also spotted some rows which were, beyond a reasonable doubt, mislabelled (Table 1). This is an artifact of many human-annotated datasets (Frénay and Verleysen, 2014) and is a potential source of noise that could negatively affect performance (Nazari et al., 2018).
arXiv:2210.12846v1 [cs.CL] 23 Oct 2022
| Sentence Containing PET | Sense (Euph.) | Sense (Non-Euph.) | Label (Original) | Label (Corrected) |
|---|---|---|---|---|
| Does your software collect any information about me, my listening or my surfing habits? Can it be <disabled>? | Handicapped | Switched off | 1 | 0 |
| Europe developed rapidly [...] Effective and <economical> movement of goods was no longer a maritime monopoly. | Prudent or frugal | Related to the economy | 0 | 1 |
| The Lancers continued to hang on to the <slim> one-point line as Golden West started a possession following [...] | Thin (physical appearance) | Thin (non-physical) | 1 | 0 |

Table 1: Examples of incorrectly labelled sentences identified by our data cleaning pipeline. The label is 1 if the term is used euphemistically, 0 otherwise.
Motivated by this, we design a data cleaning pipeline to quickly identify and correct such errors (Figure 1). Since the goal is simply to correct as many errors as possible (rather than to be perfectly accurate), we take a few heuristic liberties in our design choices. First, to maximize yield and avoid dealing with less impactful PETs, we filter out PETs which appear <10 times or are classified as positive/negative >80% of the time. This leaves us with 33 PETs. We then manually curate a sense inventory (euphemistic vs. non-euphemistic senses) using context clues and BabelNet definitions (Navigli and Ponzetto, 2012, v5.0). To ensure the quality of the sense inventory, we have multiple members of our team look through the assigned euphemistic and non-euphemistic senses and verify their appropriateness. Next, for each sentence, we replace the PET with its euphemistic meaning and calculate the BERTScore (Zhang* et al., 2020) between the initial sentence and the PET-replaced sentence. Replacing euphemistic PETs should not change the semantics drastically and hence should result in a high BERTScore, while replacing non-euphemistic PETs would lead to a low BERTScore. To identify potentially misclassified sentences, we therefore look for positively-classified sentences with low BERTScores or negatively-classified sentences with high BERTScores. We heuristically set this threshold at the halfway mark: if a sentence is among the top half of BERTScores and has a negative label (or among the bottom half and has a positive label), then we flag it as "potentially mislabelled". We end up with 203 potentially mislabelled sentences.
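The halfway-mark heuristic can be sketched as follows. This is an illustrative sketch rather than our exact implementation: `score_fn` stands in for the BERTScore computation between the original and PET-replaced sentences, and the midpoint is taken as the median score.

```python
import statistics

def flag_potentially_mislabelled(rows, score_fn):
    """Flag rows whose label disagrees with the similarity heuristic.

    rows: list of (original_sentence, pet_replaced_sentence, label),
          where label is 1 (euphemistic) or 0 (literal).
    score_fn: similarity between the two sentences; in the pipeline
              described above this would be BERTScore, here it is a
              pluggable stand-in.
    A positive row in the bottom half of scores, or a negative row in
    the top half, is flagged as potentially mislabelled. Returns the
    indices of flagged rows for manual review.
    """
    scores = [score_fn(orig, repl) for orig, repl, _ in rows]
    midpoint = statistics.median(scores)
    flagged = []
    for idx, ((_, _, label), score) in enumerate(zip(rows, scores)):
        if (label == 1 and score < midpoint) or (label == 0 and score >= midpoint):
            flagged.append(idx)
    return flagged
```

The flagged indices then go to human annotators, matching the review step described next.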
Once these potentially mislabelled sentences have been identified, we go through them manually and correct the ones which we identify as incorrectly labelled, such as the ones in Table 1. In cases where we are unsure of what the label should be (e.g. ambiguous cases as mentioned in Gavidia et al. (2022)), we leave the original label. As was done with the sense inventories, multiple members of our team then verify that the corrections made are appropriate. Although this still involves some human labor, it is far more tractable than reviewing the entire dataset. Out of the 203 potentially mislabelled rows, we modify the labels of 25.
2.3 EuphAug Corpus

In addition to data cleaning, we also use data augmentation techniques to gather an extended corpus, which we call EuphAug. We explore two variants of EuphAug, as outlined below:
1. Representation-Based Augmentation – We search in an external corpus for additional sentences in which specific PETs appear, then assign a label to these PETs based on their vector representations. We call this procedure EuphAug-R.

Let our training set (provided by the task organizers) be S. Consider a PET p, which appears in sentences s_1, s_2, ..., s_k ∈ S, with corresponding labels l_{s_1}, l_{s_2}, ..., l_{s_k} ∈ {0, 1}. We search in an external corpus C (i.e., WikiText) for n sentences c_1, ..., c_n containing the PET p. Finally, for each sentence c_j ∈ {c_1, ..., c_n}, we assign a label l_{c_j} as follows:
Algorithm 1 EuphAug-R
Task: Given sentence c_j containing PET p, assign l_{c_j}.
  for s_i ∈ {s_1, s_2, ..., s_k} do
      Find dist_i = dist(s_i, c_j)
  end for
  Find M = arg max{dist_1, dist_2, ..., dist_k}.
  Find m = arg min{dist_1, dist_2, ..., dist_k}.
  if dist_M ≤ δ and |dist_M − δ| > |dist_m − ε| then
      Add c_j to augmented corpus with label l_{c_j} = l_{s_M}
  else if dist_m ≥ ε and |dist_m − ε| > |dist_M − δ| then
      Add c_j to augmented corpus with label l_{c_j} = 1 − l_{s_M}
  else
      Do not add c_j to augmented corpus.
  end if
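The decision rule of Algorithm 1 can be sketched in code as below. This is a minimal sketch under the assumption that the cosine distances from c_j to the training sentences have already been computed; `delta` and `eps` correspond to the thresholds δ and ε.

```python
def euphaug_r_label(dists_and_labels, delta, eps):
    """Decide whether and how to label a candidate sentence c_j.

    dists_and_labels: list of (dist_i, label_i) pairs, one per training
                      sentence s_i containing the same PET, where dist_i
                      is the cosine distance dist(s_i, c_j).
    Returns the assigned label (0 or 1), or None if c_j should not be
    added to the augmented corpus.
    """
    dists = [d for d, _ in dists_and_labels]
    dist_max, dist_min = max(dists), min(dists)
    # Label of the training sentence attaining the maximum distance (s_M).
    label_max = dists_and_labels[dists.index(dist_max)][1]
    if dist_max <= delta and abs(dist_max - delta) > abs(dist_min - eps):
        return label_max          # even the farthest s_i is close: copy its label
    if dist_min >= eps and abs(dist_min - eps) > abs(dist_max - delta):
        return 1 - label_max      # even the closest s_i is far: flip the label
    return None                   # ambiguous case: skip c_j
```

Candidates whose distances fall between the two thresholds are simply discarded, trading corpus size for label confidence.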
where δ and ε are manually-tuned thresholds, and dist(a, b) represents the cosine distance between the sentential embeddings³ of a and b. In other

³ https://www.sbert.net/