
Transferring Knowledge via Neighborhood-Aware Optimal Transport
for Low-Resource Hate Speech Detection
Tulika Bose Irina Illina Dominique Fohr
Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
{tulika.bose, illina, dominique.fohr}@loria.fr
Abstract
Warning: this paper contains content that may be offensive and distressing.

The concerning rise of hateful content on online platforms has increased the attention towards automatic hate speech detection, commonly formulated as a supervised classification task. State-of-the-art deep learning-based approaches usually require a substantial amount of labeled resources for training. However, annotating hate speech resources is expensive, time-consuming, and often harmful to the annotators. This creates a pressing need to transfer knowledge from the existing labeled resources to low-resource hate speech corpora with the goal of improving system performance. For this, neighborhood-based frameworks have been shown to be effective. However, they have limited flexibility. In our paper, we propose a novel training strategy that allows flexible modeling of the relative proximity of neighbors retrieved from a resource-rich corpus to learn the amount of transfer. In particular, we incorporate neighborhood information with Optimal Transport, which permits exploiting the geometry of the data embedding space. By aligning the joint embedding and label distributions of neighbors, we demonstrate substantial improvements over strong baselines, in low-resource scenarios, on different publicly available hate speech corpora.
1 Introduction
With the alarming spread of Hate Speech (HS) in social media, Natural Language Processing techniques have been used to develop automatic HS detection systems, typically to aid manual content moderation. Although deep learning-based approaches (Mozafari et al., 2019; Badjatiya et al., 2017) have become state-of-the-art in this task, their performance depends on the size of the labeled resources available for training (Lee et al., 2018; Alwosheel et al., 2018).

Annotating a large corpus for HS is considerably time-consuming, expensive, and harmful to human annotators (Schmidt and Wiegand, 2017; Malmasi and Zampieri, 2018; Poletto et al., 2019; Sarwar et al., 2022). Moreover, models trained on existing labeled HS corpora have shown poor generalization when evaluated on new HS content (Yin and Zubiaga, 2021; Arango et al., 2019; Swamy et al., 2019; Karan and Šnajder, 2018). This is due to the differences across these corpora, such as sampling strategies (Wiegand et al., 2019), varied topics of discussion (Florio et al., 2020; Saha and Sindhwani, 2012), varied vocabularies, and different victims of hate. To address these challenges, we aim to devise a strategy that can effectively transfer knowledge from a resource-rich source corpus with a higher amount of annotated content to a low-resource target corpus with fewer labeled instances.
One popular way to address this is transfer learning. For instance, Mozafari et al. (2019) fine-tune a large-scale pre-trained language model, BERT (Devlin et al., 2019), on the limited training examples in HS corpora. Further, a sequential transfer, following Garg et al. (2020), can be performed, where a pre-trained model is first fine-tuned on a resource-rich source corpus and subsequently fine-tuned on the low-resource target corpus. Since this may risk forgetting knowledge from the source, the source and target corpora can be mixed for training (Shnarch et al., 2018). Besides, to learn target-specific patterns without forgetting the source knowledge, Meftah et al. (2021) augment pre-trained neurons from the source model with randomly initialized units for transferring knowledge to low-resource domains.
Recently, Sarwar et al. (2022) argue that traditional transfer learning strategies are not systematic. Therefore, they model the relationship between a source and a target corpus with a neighborhood framework and show its effectiveness in transfer learning for content flagging. They model the interaction between a query instance from the target and its neighbors retrieved from the source. This interaction is modeled based on their label agreement – whether the query and its neighbors have the same labels – while using a fixed neighborhood size. However, different neighbors may have varying levels of proximity to the queried instance based on their pair-wise cosine similarities in a sentence embedding space. Therefore, intuitively, the neighbors should also be weighted according to these similarity scores.
We hypothesize that simultaneously modeling the pair-wise distances between instances from the low-resource target and their respective neighbors from the resource-rich source, along with their label distributions, should result in a more flexible and effective transfer. With this aim, we propose a novel training strategy where the model learns to assign varying importance to the neighbors corresponding to different target instances by optimizing the amount of pair-wise transfer. This transfer is learned without changing the underlying model architecture. Such optimization can be efficiently performed using Optimal Transport (OT) (Peyré and Cuturi, 2019; Villani, 2009; Kantorovich, 2006) due to its ability to find correspondences between instances while exploiting the underlying geometry of the embedding space. Our contributions are summarised as follows:

• We address HS detection in low-resource scenarios with a flexible and systematic transfer learning strategy.

• We propose a novel incorporation of neighborhood information with joint distribution Optimal Transport. This enables learning of the amount of transfer between pairs of source and target instances, considering both (i) the similarity scores of the neighbors and (ii) their associated labels. To the best of our knowledge, this is the first work that introduces Optimal Transport for HS detection.

• We demonstrate the effectiveness of our approach through considerable improvements over strong baselines, along with quantitative and qualitative analysis on different HS corpora from varied platforms.
2 Related Works
2.1 Hate Speech Detection
Deep Neural Networks, especially the transformer-based models, such as the pre-trained BERT, have dominated the field of HS detection in the past few years (Alatawi et al., 2021; D'Sa et al., 2020; Glavaš et al., 2020; Mozafari et al., 2019).

Wiegand et al. (2019); Arango et al. (2019) raise concerns about data bias present in most HS corpora, which results in overestimated within-corpus performance. They, therefore, recommend cross-corpus evaluations as more realistic settings. Bigoulaeva et al. (2021); Bose et al. (2021); Pamungkas et al. (2021) perform such cross-corpus evaluations in this task with no access to labeled instances from the target. However, Yin and Zubiaga (2021); Wiegand et al. (2019) report fluctuating or degraded performance across corpora. As pointed out by Sarwar et al. (2022), in real-life scenarios, most online platforms could invest in obtaining at least some labeled training instances for deploying an HS detection system. Thus, we study a more realistic setting where a limited amount of labeled content is available in the target corpus.
2.2 Neighborhood Framework
$k$-Nearest Neighbors ($k$NN)-based approaches have been successfully used in the literature for an array of tasks such as language modeling (Khandelwal et al., 2020), question answering (Kassner and Schütze, 2020), dialogue generation (Fan et al., 2021), etc. Besides, $k$NN classifiers have been used for HS detection (Prasetyo and Samudra, 2022; Briliani et al., 2019), which typically predict the class of an input instance through a simple majority voting using its neighbors in the training data.
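For concreteness, the following is a minimal sketch of such a $k$NN classifier over a sentence embedding space, with majority voting among the retrieved neighbors. This illustrates the baseline idea only; it is not code from any of the cited works, and the function name and choice of cosine similarity are ours:

```python
import numpy as np

def knn_majority_vote(query_emb, train_embs, train_labels, k=5):
    """Predict the class of one query by majority vote over its k
    nearest training instances (cosine similarity as proximity)."""
    q = query_emb / np.linalg.norm(query_emb)
    T = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = T @ q                        # (n_train,) cosine similarities
    top_k = np.argsort(-sims)[:k]       # indices of the k nearest neighbors
    votes = train_labels[top_k]         # integer class labels of the neighbors
    return np.bincount(votes).argmax()  # unweighted majority vote
```

Note that this vote is unweighted: it ignores both the neighbors' relative proximity and their label agreement with the query, which is precisely the weakness discussed next.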
Recently, Sarwar et al. (2022) propose a neighborhood framework $k$NN+ for transfer learning in cross-lingual low-resource settings. They show that a simple $k$NN classifier is prone to prediction errors, as the neighbors may have similar meanings but opposite labels. They, instead, model the interactions between the target corpus instances, treated as queries, and their nearest neighbors retrieved from the source. This neighborhood interaction is modeled based on whether a query and its neighbors have the same or different labels. In their best performing framework (in the cross-lingual setting), Cross-Encoder $k$NN+, Sarwar et al. (2022) obtain representations of concatenated query-neighbor pairs to learn such neighborhood interactions.

However, Sarwar et al. (2022) do not consider the varying levels of proximity of different neighbors to the query. Besides, a mini-batch in their framework comprises a query and all its neighbors. For fine-tuning large language models like BERT, the batch size needs to be kept small due to resource constraints, which could limit the neighborhood size in their framework. This is different from our approach, where the neighborhood size is scalable.
2.3 Optimal Transport
Optimal Transport (OT) has become increasingly popular in diverse NLP applications, as it allows comparing probability distributions in a geometrically sound manner. These include machine translation (Xu et al., 2021), interpretable semantic similarity (Lee et al., 2022), rationalizing text matching (Swanson et al., 2020), etc. Moreover, OT has been successfully used for domain adaptation in audio, images, and text (Olvera et al., 2021; Damodaran et al., 2018; Chen et al., 2020). In this work, we perform a novel incorporation of nearest neighborhood information with OT. Besides, to the best of our knowledge, this is the first work that introduces OT to the HS detection task.
3 Proposed Approach
Our problem setting involves a low-resource target corpus $X^t$ with a limited amount of labeled training data $(X^t_{\text{train}}, Y^t_{\text{train}}) = \{x^t_i, y^t_i\}_{i=1}^{n_t}$ and a resource-rich source corpus $X^s$ from a different distribution with a large number of annotated data $(X^s_{\text{train}}, Y^s_{\text{train}}) = \{x^s_i, y^s_i\}_{i=1}^{n_s}$. Given such a setting, we hypothesize that transferring knowledge from the nearest neighbors in the source should improve the performance on the insufficiently labeled target. Furthermore, to provide additional control to the model, we propose a systematic transfer. With this transfer mechanism, a model can learn different weights assigned to the neighbors in $X^s_{\text{train}}$ based on their proximity to the instances in $X^t_{\text{train}}$, simultaneously in a sentence embedding space and the label space. For this, we incorporate neighborhood information with Optimal Transport (OT), as OT can learn correspondences between instances from $X^s_{\text{train}}$ and $X^t_{\text{train}}$ by exploiting the underlying embedding space geometry.
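To make this setting concrete, the sketch below retrieves, for every target training instance, its nearest neighbors from the source corpus in a sentence embedding space. This is our own illustration under stated assumptions: the paper does not prescribe this exact code, and the sentence-transformers encoder name is only an example:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed encoder choice

def retrieve_source_neighbors(target_texts, source_texts, k=16):
    """Return, for each target instance, the indices and cosine
    similarities of its k nearest neighbors in the source corpus."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model
    zt = encoder.encode(target_texts, normalize_embeddings=True)  # (n_t, d)
    zs = encoder.encode(source_texts, normalize_embeddings=True)  # (n_s, d)
    sims = zt @ zs.T                        # (n_t, n_s) cosine similarities
    idx = np.argsort(-sims, axis=1)[:, :k]  # k nearest source indices per target
    return idx, np.take_along_axis(sims, idx, axis=1)
```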
3.1 Joint Distribution Optimal Transport
In this work, we use the joint distribution optimal transport (JDOT) framework (Courty et al., 2017), following the works of Damodaran et al. (2018); Fatras et al. (2021), proposed for unsupervised domain adaptation in deep embedding spaces. The framework aligns the joint distribution $P(Z, Y)$ of the source and the target domains, where $Z$ is the embedding space through a mapping function $g(.)$, and $Y$ is the label space. For a discrete setting, let $\mu^s = \sum_i^{n_s} a_i \, \delta_{g(x^s_i), y^s_i}$ and $\mu^t = \sum_i^{n_t} b_i \, \delta_{g(x^t_i), y^t_i}$ be two empirical distributions on the product space of $Z \times Y$. Here $\delta_{g(x_i), y_i}$ is the Dirac function at the position $(g(x_i), y_i)$, and $a_i$, $b_i$ are uniform probability weights, i.e., $\sum_i^{n_s} a_i = \sum_i^{n_t} b_i = 1$.
The ‘balanced’ OT problem ($OT_b$), as defined by Kantorovich (2006), seeks a transport plan $\gamma$ in the space of the joint probability distribution $\Pi(\mu^s, \mu^t)$, with marginals $\mu^s$ and $\mu^t$, that minimizes the cost of transport from $\mu^s$ to $\mu^t$, as:

$$OT_b(\mu^s, \mu^t) = \min_{\gamma \in \Pi(\mu^s, \mu^t)} \sum_{i,j} \gamma_{i,j} \, c_{i,j} \quad \text{s.t.} \quad \gamma \mathbf{1}_{n_t} = \mu^s, \; \gamma^T \mathbf{1}_{n_s} = \mu^t \qquad (1)$$

Here $c_{i,j}$ is an entry in a cost matrix $C \in \mathbb{R}^{n_s \times n_t}$, representing the pair-wise cost (see Section 3.2), and $\mathbf{1}_n$ is a vector of ones with dimension $n$. Each entry $\gamma_{i,j}$ indicates the amount of transfer from location $i$ in the source to $j$ in the target.
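As a concrete illustration of Eq. (1), the balanced plan can be computed exactly with the POT library. This is a toy sketch on random embeddings, not the paper's training code; the Euclidean cost is our assumption, since the actual pair-wise cost is defined in Section 3.2:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
n_s, n_t, d = 6, 4, 32                    # toy sizes: source/target counts, embedding dim
zs = rng.normal(size=(n_s, d))            # stand-ins for source embeddings g(x^s_i)
zt = rng.normal(size=(n_t, d))            # stand-ins for target embeddings g(x^t_j)

a = np.full(n_s, 1.0 / n_s)               # uniform source weights a_i
b = np.full(n_t, 1.0 / n_t)               # uniform target weights b_i
C = ot.dist(zs, zt, metric="euclidean")   # pair-wise cost matrix C in R^{n_s x n_t}

gamma = ot.emd(a, b, C)                   # exact balanced OT plan of Eq. (1)
assert np.allclose(gamma.sum(axis=1), a)  # marginal constraint: gamma 1_{n_t} = mu^s
assert np.allclose(gamma.sum(axis=0), b)  # marginal constraint: gamma^T 1_{n_s} = mu^t
```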
The constraint on $\gamma$ requires that all mass from $\mu^s$ is transported to $\mu^t$. However, this can be alleviated through relaxation, leading to the ‘unbalanced’ OT ($OT_u$) (Benamou, 2003), as:

$$OT_u(\mu^s, \mu^t) = \min_{\gamma \geq 0} \sum_{i,j} \gamma_{i,j} \, c_{i,j} + \Lambda \, ; \quad \text{where} \;\; \Lambda = \epsilon \, \Omega(\gamma) + \lambda \left( KL(\gamma \mathbf{1}_{n_t}, \mu^s) + KL(\gamma^T \mathbf{1}_{n_s}, \mu^t) \right) \qquad (2)$$

$KL$ is the Kullback-Leibler divergence that allows the relaxation of the marginal constraints on $\gamma$, and $\lambda$ is the marginal relaxation coefficient. $\Omega(\gamma) = \sum_{i,j} \gamma_{i,j} \log(\gamma_{i,j})$ corresponds to the entropic regularization term, which allows fast computation of the OT distances (Cuturi, 2013). $\epsilon$ is the entropy coefficient.
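Eq. (2) likewise has an off-the-shelf solver in POT. Reusing a, b, and C from the previous sketch, `reg` below plays the role of the entropy coefficient $\epsilon$ and `reg_m` that of the marginal relaxation coefficient $\lambda$ (again an illustration, not the paper's code):

```python
from ot.unbalanced import sinkhorn_unbalanced

# Entropic, marginal-relaxed OT (Eq. 2): the marginals of gamma only
# approximately match a and b, so mass need not be fully transported.
gamma_u = sinkhorn_unbalanced(a, b, C, reg=0.1, reg_m=1.0)
print(gamma_u.sum())  # total transported mass; generally not 1 under relaxation
```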
For models with a high-dimensional embedding space like ours, Fatras et al. (2021) propose to make the computation of OT losses scalable using the mini-batch OT. Thus, for every mini-batch, we sample an equal number of instances, given by the batch size $m$, from $X^s_{\text{train}}$ and $X^t_{\text{train}}$, which
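To illustrate the mini-batch OT idea concretely, the sketch below computes an unbalanced OT loss on one mini-batch of $m$ source and $m$ target instances. The JDOT-style cost mixing embedding distance with label disagreement is our assumption here, standing in for the pair-wise cost of Section 3.2:

```python
import numpy as np
import ot

def minibatch_ot_loss(zs_batch, ys_batch, zt_batch, yt_batch,
                      alpha=1.0, reg=0.1, reg_m=1.0):
    """Unbalanced OT loss on one mini-batch of m source and m target
    instances, with a cost over both embeddings and labels (illustrative)."""
    m = zs_batch.shape[0]
    a = np.full(m, 1.0 / m)                               # uniform batch weights
    b = np.full(m, 1.0 / m)
    C_emb = ot.dist(zs_batch, zt_batch, metric="sqeuclidean")
    C_lab = (ys_batch[:, None] != yt_batch[None, :]).astype(float)
    C = C_emb + alpha * C_lab                             # joint embedding/label cost
    gamma = ot.unbalanced.sinkhorn_unbalanced(a, b, C, reg=reg, reg_m=reg_m)
    return float((gamma * C).sum())                       # transport cost of the plan
```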