Transferring Knowledge via Neighborhood-Aware Optimal Transport
for Low-Resource Hate Speech Detection
Tulika Bose Irina Illina Dominique Fohr
Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
{tulika.bose, illina, dominique.fohr}@loria.fr
Abstract
Warning: this paper contains content that
may be offensive and distressing.
The concerning rise of hateful content on online platforms has increased attention to automatic hate speech detection, commonly formulated as a supervised classification task. State-of-the-art deep learning-based approaches usually require a substantial amount of labeled resources for training. However, annotating hate speech resources is expensive, time-consuming, and often harmful to the annotators. This creates a pressing need to transfer knowledge from the existing labeled resources to low-resource hate speech corpora with the goal of improving system performance. For this, neighborhood-based frameworks have been shown to be effective; however, they have limited flexibility. In this paper, we propose a novel training strategy that allows flexible modeling of the relative proximity of neighbors retrieved from a resource-rich corpus to learn the amount of transfer. In particular, we incorporate neighborhood information with Optimal Transport, which permits exploiting the geometry of the data embedding space. By aligning the joint embedding and label distributions of neighbors, we demonstrate substantial improvements over strong baselines in low-resource scenarios on different publicly available hate speech corpora.
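To make the alignment idea concrete, the following is a minimal, self-contained sketch (not the training objective used in this paper) of an optimal transport plan between source-neighbor and target instances, computed with the POT library over a cost that combines embedding distance with label disagreement; the toy embeddings, labels, and the trade-off weight `alpha` are illustrative assumptions. The resulting plan can be read as soft correspondences indicating how strongly each retrieved neighbor relates to each target instance.

```python
# Minimal sketch (not the paper's exact objective): align source neighbors to
# target instances with an optimal transport plan whose cost mixes embedding
# distance and label disagreement. Toy data and `alpha` are assumptions.
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
src_emb = rng.normal(size=(8, 16))    # embeddings of neighbors from the source corpus
tgt_emb = rng.normal(size=(5, 16))    # embeddings of low-resource target instances
src_lab = rng.integers(0, 2, size=8)  # toy binary hate / non-hate labels
tgt_lab = rng.integers(0, 2, size=5)

# Joint cost: Euclidean distance in embedding space plus a penalty when labels disagree.
alpha = 1.0  # assumed trade-off between feature and label costs
C_feat = ot.dist(src_emb, tgt_emb, metric="euclidean")
C_lab = (src_lab[:, None] != tgt_lab[None, :]).astype(float)
C = C_feat + alpha * C_lab

# Uniform marginals over source neighbors and target instances.
a = np.full(len(src_emb), 1 / len(src_emb))
b = np.full(len(tgt_emb), 1 / len(tgt_emb))

# Entropy-regularized OT (Sinkhorn); each entry of the plan says how much mass
# flows from a source neighbor to a target instance.
plan = ot.sinkhorn(a, b, C, reg=0.1)
print(plan.shape, plan.sum())  # (8, 5), ~1.0
```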
1 Introduction
With the alarming spread of Hate Speech (HS) in social media, Natural Language Processing techniques have been used to develop automatic HS detection systems, typically to aid manual content moderation. Although deep learning-based approaches (Mozafari et al., 2019; Badjatiya et al., 2017) have become state-of-the-art in this task, their performance depends on the size of the labeled resources available for training (Lee et al., 2018; Alwosheel et al., 2018).
Annotating a large corpus for HS is considerably time-consuming, expensive, and harmful to human annotators (Schmidt and Wiegand, 2017; Malmasi and Zampieri, 2018; Poletto et al., 2019; Sarwar et al., 2022). Moreover, models trained on existing labeled HS corpora have shown poor generalization when evaluated on new HS content (Yin and Zubiaga, 2021; Arango et al., 2019; Swamy et al., 2019; Karan and Šnajder, 2018). This is due to the differences across these corpora, such as sampling strategies (Wiegand et al., 2019), varied topics of discussion (Florio et al., 2020; Saha and Sindhwani, 2012), varied vocabularies, and different victims of hate. Thus, to address these challenges, here we aim to devise a strategy that can effectively transfer knowledge from a resource-rich source corpus with a higher amount of annotated content to a low-resource target corpus with fewer labeled instances.
One popular way to address this is transfer learning. For instance, Mozafari et al. (2019) fine-tune a large-scale pre-trained language model, BERT (Devlin et al., 2019), on the limited training examples in HS corpora. Further, a sequential transfer, following Garg et al. (2020), can be performed, where a pre-trained model is first fine-tuned on a resource-rich source corpus and subsequently fine-tuned on the low-resource target corpus. Since this may risk forgetting knowledge from the source, the source and target corpora can be mixed for training (Shnarch et al., 2018). Besides, to learn target-specific patterns without forgetting the source knowledge, Meftah et al. (2021) augment pre-trained neurons from the source model with randomly initialized units for transferring knowledge to low-resource domains.
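As a reference point for the sequential-transfer recipe just described, the sketch below fine-tunes a pre-trained BERT classifier first on a source corpus and then on a target corpus with the HuggingFace transformers library; the placeholder data, binary label set, and hyper-parameters are assumptions for illustration, not the settings evaluated in this paper.

```python
# Minimal sketch of sequential transfer: fine-tune BERT on a resource-rich
# source corpus, then continue fine-tuning the same weights on the
# low-resource target corpus. Data and hyper-parameters are placeholders.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)

def finetune(texts, labels, epochs=3):
    """One fine-tuning stage over a list of texts and binary labels."""
    model.train()
    for _ in range(epochs):
        for text, label in zip(texts, labels):
            batch = tokenizer(text, return_tensors="pt", truncation=True)
            loss = model(**batch, labels=torch.tensor([label])).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Placeholder corpora (assumed): stage 1 on the source, stage 2 on the target.
source_texts, source_labels = ["example source post"], [1]
target_texts, target_labels = ["example target post"], [0]
finetune(source_texts, source_labels)
finetune(target_texts, target_labels)
```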
Recently, Sarwar et al. (2022) argue that traditional transfer learning strategies are not systematic. Therefore, they model the relationship between a source and a target corpus with a neighborhood framework and show its effectiveness in transfer learning for content flagging. They model the in-