EUREKA: EUphemism Recognition Enhanced Through KNN-based Methods and Augmentation

Sedrick Scott Keh¹*, Rohit Bharadwaj²*, Emmy Liu¹†, Simone Tedeschi³,⁴†, Varun Gangal¹, Roberto Navigli³
¹Carnegie Mellon University, ²Mohamed bin Zayed University of Artificial Intelligence,
³Sapienza University of Rome, ⁴Babelscape, Italy
{skeh,mengyan3,vgangal}@cs.cmu.edu, rohit.bharadwaj@mbzuai.ac.ae
{tedeschi,navigli}@diag.uniroma1.it
Abstract

We introduce EUREKA, an ensemble-based approach for performing automatic euphemism detection. We (1) identify and correct potentially mislabelled rows in the dataset, (2) curate an expanded corpus called EuphAug, (3) leverage model representations of Potentially Euphemistic Terms (PETs), and (4) explore using representations of semantically close sentences to aid in classification. Using our augmented dataset and kNN-based methods, EUREKA¹ was able to achieve state-of-the-art results on the public leaderboard of the Euphemism Detection Shared Task, ranking first with a macro F1 score of 0.881.
1 Introduction

Euphemisms are mild or indirect expressions used in place of harsher or more direct ones. In everyday speech, euphemisms function as a means to politely discuss taboo or sensitive topics (Danescu-Niculescu-Mizil et al., 2013), to downplay certain situations (Karam, 2011), or to mask intent (Magu and Luo, 2018). The Euphemism Detection task is a key stepping stone to developing natural language systems that are able to process (Tedeschi et al., 2022; Liu et al., 2022; Jhamtani et al., 2021) and generate non-literal texts.
In this paper, we detail our approach to the Euphemism Detection Shared Task at the EMNLP 2022 FigLang Workshop². We achieve performance improvements on two fronts:
1. Data – We explore various data cleaning and data augmentation (Shorten and Khoshgoftaar, 2019; Feng et al., 2021; Dhole et al., 2021) strategies. We identify and correct potentially mislabelled rows, and we curate a new dataset called EuphAug by extracting sentences from a large unlabelled corpus using semantic representations of the sentences or euphemistic terms in the initial training corpus.

* Equal contribution by S. Keh and R. Bharadwaj
† Equal contribution by E. Liu and S. Tedeschi
¹ Our code is available at https://github.com/sedrickkeh/EUREKA
² https://sites.google.com/view/figlang2022/home?authuser=0
2. Modelling – We explore various representational and design choices, such as leveraging the LM representations of the tokens of the euphemistic expressions (rather than the [CLS] token) and incorporating sentential context through kNN augmentation and deep averaging networks.
Using these methods, we develop a system called EUREKA, which achieves a macro F1 score of 0.881 on the public leaderboard and ranks first among all submissions. We found the data innovations to be more significant in our case, indicating that euphemistic terms can be classified with some accuracy if potentially euphemistic spans are identified earlier in the pipeline.
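As a minimal illustration of the second idea, pooling the LM states of the PET's own tokens in place of the [CLS] vector could be sketched as follows. This is a hedged sketch, not our exact implementation: the hidden states are assumed to come from any transformer encoder, and the span indices from the delimiter tokens around the PET.

```python
import numpy as np

def pet_span_representation(hidden_states: np.ndarray, span: tuple) -> np.ndarray:
    """Mean-pool the contextual vectors of the PET's tokens.

    hidden_states: (seq_len, dim) last-layer states from an LM encoder
                   (illustrative input; any transformer encoder would do).
    span: (start, end) token indices of the PET, end-exclusive,
          e.g. recovered from the delimiter tokens around the PET.
    Returns a single (dim,) vector to feed the classifier, used in
    place of the [CLS] vector.
    """
    start, end = span
    return hidden_states[start:end].mean(axis=0)
```

The resulting vector is sentence-contextualized but centered on the PET itself, which is the design choice the paragraph above describes.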
2 Task Settings and Dataset

2.1 Task Settings

The task and dataset are specified by the Euphemism Detection Shared Task, which uses a subset of the euphemism detection dataset of Gavidia et al. (2022). The goal of the task is to classify a Potentially Euphemistic Term (PET) enclosed within delimiter tokens as either literal or euphemistic in that context. The training set contains 207 unique PETs and 1571 samples, of which 1106 are classified as euphemisms.
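To make the setup concrete, a training row might look like the following. The field names and the angle-bracket delimiters are illustrative (chosen to match the examples in Table 1), not the organizers' exact schema.

```python
import re

# Hypothetical sample mirroring the task setup: the PET is enclosed in
# delimiter tokens, and the label is 1 (euphemistic) or 0 (literal).
sample = {
    "text": "My grandfather <passed away> last spring.",
    "label": 1,  # "passed away" is used euphemistically here
}

def extract_pet(text: str) -> str:
    """Recover the delimited PET span from a sentence."""
    match = re.search(r"<([^>]+)>", text)
    return match.group(1) if match else ""
```

For the sample above, `extract_pet(sample["text"])` yields "passed away", the span the classifier must disambiguate in context.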
2.2 Data Cleaning

Gavidia et al. (2022) characterize common sources of ambiguity and disagreement among annotators. However, while exploring the data, we also spotted some rows which were, beyond a reasonable doubt, mislabelled (Table 1). This is an artifact of many human-annotated datasets (Frénay and Verleysen, 2014) and is a potential source of noise that could negatively affect performance (Nazari et al., 2018).
arXiv:2210.12846v1 [cs.CL] 23 Oct 2022
| Sentence Containing PET | Sense (Euph.) | Sense (Non-Euph.) | Label (Original) | Label (Corrected) |
|---|---|---|---|---|
| Does your software collect any information about me, my listening or my surfing habits? Can it be <disabled>? | Handicapped | Switched off | 1 | 0 |
| Europe developed rapidly [...] Effective and <economical> movement of goods was no longer a maritime monopoly. | Prudent or frugal | Related to the economy | 0 | 1 |
| The Lancers continued to hang on to the <slim> one-point line as Golden West started a possession following [...] | Thin (physical appearance) | Thin (non-physical) | 1 | 0 |

Table 1: Examples of incorrectly labelled sentences identified by our data cleaning pipeline. The label is 1 if the term is used euphemistically, 0 otherwise.
Motivated by this, we design a data cleaning pipeline to quickly identify and correct such errors (Figure 1). Since the goal is simply to correct as many errors as possible (rather than to be perfectly accurate), we take a few heuristic liberties in our design choices. First, to maximize yield and avoid dealing with less impactful PETs, we filter out PETs which appear <10 times or are classified as positive/negative >80% of the time. This leaves us with 33 PETs. We then manually curate a sense inventory (euphemistic vs. non-euphemistic senses) using context clues and BabelNet definitions (Navigli and Ponzetto, 2012, v5.0). To ensure the quality of the sense inventory, we have multiple members of our team look through the assigned euphemistic and non-euphemistic senses and verify their appropriateness. Next, for each sentence, we replace the PET with its euphemistic meaning and calculate the BERTScore (Zhang* et al., 2020) between the initial sentence and the PET-replaced sentence. Replacing euphemistic PETs should not change the semantics drastically and hence should result in a high BERTScore, while replacing non-euphemistic PETs would lead to a low BERTScore. To identify potentially misclassified sentences, we therefore look for positively-classified sentences with low BERTScores or negatively-classified sentences with high BERTScores. We heuristically set this threshold at the halfway mark: if a sentence is among the top half of BERTScores and has a negative label (or among the bottom half and has a positive label), then we flag it as "potentially mislabelled". We end up with 203 potentially mislabelled sentences.
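The halfway-mark heuristic can be sketched as follows. This is an illustrative sketch rather than our exact implementation: `score_fn` stands in for the BERTScore computation between the original and PET-replaced sentences, and the midpoint is taken as the median score.

```python
import statistics

def flag_potentially_mislabelled(rows, score_fn):
    """Flag rows whose label disagrees with the similarity heuristic.

    rows: list of (original_sentence, pet_replaced_sentence, label),
          where label is 1 (euphemistic) or 0 (literal).
    score_fn: similarity between the two sentences; in the pipeline
              described above this would be BERTScore, here it is a
              pluggable stand-in.
    A positive row in the bottom half of scores, or a negative row in
    the top half, is flagged as potentially mislabelled. Returns the
    indices of flagged rows for manual review.
    """
    scores = [score_fn(orig, repl) for orig, repl, _ in rows]
    midpoint = statistics.median(scores)
    flagged = []
    for idx, ((_, _, label), score) in enumerate(zip(rows, scores)):
        if (label == 1 and score < midpoint) or (label == 0 and score >= midpoint):
            flagged.append(idx)
    return flagged
```

The flagged indices then go to human annotators, matching the review step described next.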
Once these potentially mislabelled sentences have been identified, we go through them manually and correct the ones which we identify as incorrectly labelled, such as the ones in Table 1. In cases where we are unsure of what the label should be (e.g. ambiguous cases as mentioned in Gavidia et al. (2022)), we leave the original label. As was done with the sense inventories, multiple members of our team then verify that the corrections made are appropriate. Although this still involves some human labor, it is far more tractable than reviewing the entire dataset. Out of the 203 potentially mislabelled rows, we modify the labels of 25.
2.3 EuphAug Corpus

In addition to data cleaning, we also use data augmentation techniques to gather an extended corpus, which we call EuphAug. We explore two variants of EuphAug, as outlined below:
1. Representation-Based Augmentation – We search in an external corpus for additional sentences in which specific PETs appear, then assign a label to these PETs based on their vector representations. We call this procedure EuphAug-R.

Let our training set (provided by the task organizers) be S. Consider a PET p, which appears in sentences s_1, s_2, ..., s_k ∈ S, with corresponding labels l_{s_1}, l_{s_2}, ..., l_{s_k} ∈ {0, 1}. We search in an external corpus C (i.e., WikiText) for n sentences c_1, ..., c_n containing the PET p. Finally, for each sentence c_j ∈ {c_1, ..., c_n}, we assign a label l_{c_j} as follows:
Algorithm 1 EuphAug-R
Task: Given sentence c_j containing PET p, assign l_{c_j}.
  for s_i ∈ {s_1, s_2, ..., s_k} do
      Find dist_i = dist(s_i, c_j)
  end for
  Find M = arg max{dist_1, dist_2, ..., dist_k}.
  Find m = arg min{dist_1, dist_2, ..., dist_k}.
  if dist_M ≤ δ and |dist_M − δ| > |dist_m − ε| then
      Add c_j to augmented corpus with label l_{c_j} = l_{s_M}
  else if dist_m ≥ ε and |dist_m − ε| > |dist_M − δ| then
      Add c_j to augmented corpus with label l_{c_j} = 1 − l_{s_M}
  else
      Do not add c_j to augmented corpus.
  end if
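The decision rule of Algorithm 1 can be sketched in code as below. This is a minimal sketch under the assumption that the cosine distances from c_j to the training sentences have already been computed; `delta` and `eps` correspond to the thresholds δ and ε.

```python
def euphaug_r_label(dists_and_labels, delta, eps):
    """Decide whether and how to label a candidate sentence c_j.

    dists_and_labels: list of (dist_i, label_i) pairs, one per training
                      sentence s_i containing the same PET, where dist_i
                      is the cosine distance dist(s_i, c_j).
    Returns the assigned label (0 or 1), or None if c_j should not be
    added to the augmented corpus.
    """
    dists = [d for d, _ in dists_and_labels]
    dist_max, dist_min = max(dists), min(dists)
    # Label of the training sentence attaining the maximum distance (s_M).
    label_max = dists_and_labels[dists.index(dist_max)][1]
    if dist_max <= delta and abs(dist_max - delta) > abs(dist_min - eps):
        return label_max          # even the farthest s_i is close: copy its label
    if dist_min >= eps and abs(dist_min - eps) > abs(dist_max - delta):
        return 1 - label_max      # even the closest s_i is far: flip the label
    return None                   # ambiguous case: skip c_j
```

Candidates whose distances fall between the two thresholds are simply discarded, trading corpus size for label confidence.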
where δ and ε are manually-tuned thresholds, and dist(a, b) represents the cosine distance between the sentential embeddings³ of a and b. In other

³ https://www.sbert.net/