Disentangled Text Representation Learning with Information-Theoretic Perspective for Adversarial Robustness
Jiahao Zhao Wenji Mao
Institute of Automation, Chinese Academy of Sciences
{zhaojiahao2019,wenji.mao}@ia.ac.cn
Abstract
Adversarial vulnerability remains a major obstacle to constructing reliable NLP systems. When imperceptible perturbations are added to raw input text, the performance of a deep learning model may drop dramatically under attack. Recent work argues that the adversarial vulnerability of a model is caused by the non-robust features it picks up in supervised training. In this paper, we therefore tackle the adversarial robustness challenge from the view of disentangled representation learning, which is able to explicitly disentangle robust and non-robust features in text. Specifically, inspired by the variation of information (VI) in information theory, we derive a disentangled learning objective composed of mutual information terms that capture both the semantic representativeness of the latent embeddings and the differentiation of robust and non-robust features. On this basis, we design a disentangled learning network to estimate these mutual information terms. Experiments on text classification and entailment tasks show that our method significantly outperforms representative methods under adversarial attacks, indicating that discarding non-robust features is critical for improving adversarial robustness.
1 Introduction
Although deep neural networks have achieved great success in a variety of Natural Language Processing (NLP) tasks, recent studies show their vulnerability to malicious perturbations (Goodfellow et al., 2015; Jia and Liang, 2017; Gao et al., 2018; Jin et al., 2020). By adding imperceptible perturbations (e.g., typos or synonym substitutions) to the original input text, attackers can generate adversarial examples that deceive the model. Adversarial examples pervasively exist in typical NLP tasks, including text classification (Jin et al., 2020), dependency parsing (Zheng et al., 2020), machine translation (Zhang et al., 2021) and many others. These models work well on clean data but are sensitive to imperceptible perturbations. Recent studies indicate that they are likely to rely on superficial cues rather than deeper, more difficult language phenomena, and thus tend to make incomprehensible mistakes on adversarial examples (Jia and Liang, 2017; Branco et al., 2021).
Tremendous efforts have been made to improve the adversarial robustness of NLP models. Among them, the most effective strategy is adversarial training (Li and Qiu, 2021; Wang et al., 2021; Dong et al., 2021), which minimizes the maximal adversarial loss. Owing to the discrete nature of text, another effective strategy is adversarial data augmentation (Min et al., 2020; Zheng et al., 2020; Ivgi and Berant, 2021), which augments the training set with adversarial examples to re-train the model. Guided by information from the perturbation space, these two strategies utilize textual features as a whole to make the model learn a smooth parameter landscape, so that it is more stable and robust to adversarial perturbations.
As adversarial examples pervasively exist, previous research has studied the underlying reasons for this (Goodfellow et al., 2015; Fawzi et al., 2016; Schmidt et al.; Tsipras et al., 2019; Ilyas et al., 2019). One popular argument (Ilyas et al., 2019) is that adversarial vulnerability is caused by non-robust features. While classifiers strive to maximize accuracy in standard supervised training, they tend to capture any predictive correlation in the training data and may learn predictive yet brittle features, leading to the occurrence of adversarial examples. These non-robust features leave room for attackers to intentionally manipulate them and trick the model. Therefore, discarding the non-robust features can potentially facilitate model robustness against adversarial attacks, yet this issue has not been explored by previous research on adversarial robustness in the text domain.
To address the above issue, we take the approach of disentangled representation learning (DRL),
which decomposes different factors into separate latent spaces. In addition, to measure the dependency between two random variables for disentanglement, we take an information-theoretic perspective based on the Variation of Information (VI). Our work is particularly inspired by Cheng et al. (2020b), who take an information-theoretic approach to text generation and text style transfer. As our focus is on disentangling robust and non-robust features for adversarial robustness, our work is fundamentally different from this related work in model structure and learning objective design.
In this paper, we tackle the adversarial robustness challenge and propose an information-theoretic Disentangled Text Representation Learning (DTRL) method. Guided by the VI in information theory, our method first derives a disentangled learning objective that maximizes the mutual information between robust/non-robust features and the input data to ensure the semantic representativeness of the latent embeddings, and meanwhile minimizes the mutual information between robust and non-robust features to achieve disentanglement. On this basis, we leverage adversarial data augmentation and design a disentangled learning network that realizes a task classifier, a domain classifier and a discriminator to approximate the above mutual information. Experimental results show that our DTRL method improves model robustness by a large margin over the comparative methods.
The contributions of our work are as follows:
• We propose a disentangled text representation learning method, which takes an information-theoretic perspective to explicitly disentangle robust and non-robust features for tackling the adversarial robustness challenge.
• Our method deduces a disentangled learning objective for effective textual feature decomposition, and constructs a disentangled learning network to approximate the mutual information in the derived learning objective.
• Experiments on text classification and entailment tasks demonstrate the superiority of our method over other representative methods, suggesting that eliminating non-robust features is critical for adversarial robustness.
2 Related Work
Textual Adversarial Defense
To defend against adversarial attacks, empirical and certified methods have been proposed. Empirical methods are dominant, mainly including adversarial training and data augmentation. Adversarial training (Miyato et al., 2019; Li and Qiu, 2021; Wang et al., 2021; Dong et al., 2021; Li et al., 2021) regularizes the model with adversarial gradients back-propagated to the embedding layer. Adversarial data augmentation (Min et al., 2020; Zheng et al., 2020; Ivgi and Berant, 2021) generates adversarial examples and re-trains the model on them to enhance robustness. Certified robustness (Jia et al., 2019; Huang et al., 2019; Shi et al., 2020) minimizes an upper-bound loss over the worst-case examples to guarantee model robustness. Besides, adversarial example detection (Zhou et al., 2019; Mozes et al., 2021; Bao et al., 2021) identifies adversarial examples and recovers the perturbations. Unlike these previous methods, we enhance model robustness from the view of DRL by eliminating non-robust features.
Disentangled Representation Learning
Disentangled representation learning (DRL) encodes different factors into separate latent spaces, each with a different semantic meaning. DRL-based methods have been proposed mainly for image-related tasks. Pan et al. (2021) propose a general disentangled learning method based on the information bottleneck principle (Tishby et al., 2000). Recent work also extends DRL to text generation tasks, e.g., style-controlled text generation (Yi et al., 2020; Cheng et al., 2020b). Different from the DRL-based text generation work that uses an encoder-decoder framework to disentangle style and content in text, our work develops the learning objective and network structure to disentangle robust and non-robust features for adversarial robustness.

Existing DRL-based methods for adversarial robustness have been applied solely in the image domain (Yang et al., 2021a,b; Kim et al., 2021), and are mainly based on the VAE. Unlike the continuous, small pixel perturbations in images, which are amenable to generative models, text perturbations are discrete in nature and hard to handle with generative models due to their overwhelming training costs. With adversarial data augmentation, our method instead uses a lightweight layer with cross-entropy losses for effective disentangled representation learning.
3 Preliminary
The Variation of Information (VI) is a fundamental metric in information theory that quantifies the independence between two random variables. Given two random variables $U$ and $V$, $VI(U;V)$ is defined as:
$$VI(U;V) = H(U) + H(V) - 2I(U;V), \tag{1}$$
where $H(U)$ and $H(V)$ are the Shannon entropies, and $I(U;V) = \mathbb{E}_{p(u,v)}\left[\log \frac{p(u,v)}{p(u)p(v)}\right]$ is the mutual information between $U$ and $V$.

The VI is a non-negative, symmetric metric. It obeys the triangle inequality (Kraskov et al., 2003), that is, for any random variables $U$, $V$ and $W$:
$$VI(U;V) + VI(U;W) \geq VI(V;W). \tag{2}$$
Equality holds if and only if the information of $U$ is totally divided into that of $V$ and $W$.
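To make these quantities concrete, here is a minimal sketch (illustrative, not part of the paper) that computes $H$, $I$ and $VI$ in nats for two discrete random variables specified by their joint probability table, following Eq. (1):

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy H(p) in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(p_uv: np.ndarray) -> float:
    """I(U;V) = E_{p(u,v)}[log p(u,v) / (p(u) p(v))] for a discrete joint table."""
    p_u = p_uv.sum(axis=1, keepdims=True)   # marginal p(u) as a column vector
    p_v = p_uv.sum(axis=0, keepdims=True)   # marginal p(v) as a row vector
    mask = p_uv > 0
    return float(np.sum(p_uv[mask] * np.log(p_uv[mask] / (p_u * p_v)[mask])))

def variation_of_information(p_uv: np.ndarray) -> float:
    """VI(U;V) = H(U) + H(V) - 2 I(U;V), as in Eq. (1)."""
    return (entropy(p_uv.sum(axis=1)) + entropy(p_uv.sum(axis=0))
            - 2.0 * mutual_information(p_uv))

# U identical to V  -> VI = 0;  U independent of V -> VI = H(U) + H(V).
print(variation_of_information(np.array([[0.5, 0.0], [0.0, 0.5]])))  # ~0.0
print(variation_of_information(np.outer([0.5, 0.5], [0.5, 0.5])))    # ~2*log(2) ~ 1.386
```

The two example tables confirm the boundary cases: $VI = 0$ when $U$ and $V$ carry the same information, and $VI = H(U) + H(V)$ when they are independent.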
4 Problem Definition
Given a victim model $f_v$ and an original input $x \in X$, where $X$ is the input text set, an attack method $\mathcal{A}$ is applied to search for perturbations and construct an adversarial example $\hat{x} \in \hat{X}$ that fools the model prediction (i.e., $f_v(x) \neq f_v(\hat{x})$). Adversarial attacks can thus be regarded as data augmentation. Consider random variables $X, Y \sim p_D(x, y)$, where $Y$ is the set of class labels, $(x, y)$ is an observed value, $D$ is a dataset and $p_D$ is the data distribution. The goal of adversarial robustness is to build a classifier $f(y|x)$ that is robust against adversarial attacks.
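As a schematic illustration of this setup (a hedged sketch; `f_v` and `attack` are hypothetical interfaces, not ones defined in the paper), the fooling condition can be expressed as:

```python
from typing import Callable

# Hypothetical interfaces: a victim classifier mapping a text to a label id, and an
# attack method A that perturbs a text while querying the victim model
# (e.g., typo insertion or synonym substitution).
VictimModel = Callable[[str], int]
AttackMethod = Callable[[VictimModel, str], str]

def is_successful_attack(f_v: VictimModel, attack: AttackMethod, x: str) -> bool:
    """The fooling condition: f_v(x) != f_v(x_hat) for x_hat produced by the attack."""
    x_hat = attack(f_v, x)
    return f_v(x) != f_v(x_hat)
```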
5 Proposed Method
The overall architecture of our proposed method is shown in Fig. 1. We first apply adversarial attacks to augment the original textual data. We then design the disentangled learning objective to separate features into robust and non-robust ones. Finally, we construct the disentangled learning network to implement the learning objective.
5.1 Adversarial Data Augmentation
As adversarial examples exhibit patterns different from clean data, such as word frequency (Mozes et al., 2021) and fluency (Lei et al., 2022), we use adversarial examples to guide the learning of non-robust features. To efficiently disentangle robust and non-robust features, we employ adversarial data augmentation to obtain adversarial examples for the extension of the training set.

We denote the original training set as $D_{task} = \{x_i, y_i\}_{i=1}^{N}$, where $x$ is the input text, $y$ is the task label (e.g., positive or negative), $x \in X$ and $y \in Y$. We apply adversarial data augmentation to $D_{task}$ and obtain adversarial examples $\hat{x} \in \hat{X}$. We then construct the domain dataset $D_{domain} = \{x'_j, y'_j\}_{j=1}^{M}$, where $x'$ is an input text or an adversarial example, $y'$ is the domain label (e.g., natural or adversarial), $x' \in \{X, \hat{X}\}$, $y' \in Y'$ and $Y'$ is the set of domain labels.
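A minimal construction sketch, assuming the same hypothetical `attack` interface as in the Section 4 sketch and an arbitrary integer encoding of the domain labels (neither is prescribed by the paper):

```python
from typing import List, Tuple

NATURAL, ADVERSARIAL = 0, 1   # arbitrary integer encoding of the domain labels y'

def build_domain_dataset(d_task: List[Tuple[str, int]],
                         f_v,       # victim/classification model, text -> label id
                         attack     # augmentation attack, A(f_v, x) -> x_hat
                         ) -> List[Tuple[str, int]]:
    """Construct D_domain = {(x'_j, y'_j)} from D_task = {(x_i, y_i)}: each original
    text is kept with the 'natural' domain label, and its adversarial counterpart is
    added with the 'adversarial' domain label."""
    d_domain = []
    for x, _y in d_task:
        d_domain.append((x, NATURAL))
        x_hat = attack(f_v, x)
        d_domain.append((x_hat, ADVERSARIAL))
    return d_domain
```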
5.2 Disentangled Learning Objective
We propose a learning objective that disentangles the robust and non-robust features, and build an approximation method to estimate the mutual information terms in the derived learning objective. We use the VI in information theory to measure the dependency between latent variables for disentanglement. In contrast to the computational alternative of a generative model such as the variational autoencoder (VAE), our method takes the discrete nature of text into account and develops an effective VI-guided disentangled learning technique with lower computational cost.
5.2.1 Learning Objective Derivation
We start from $VI(Z_r;Z_n)$ to measure the independence between robust features $Z_r$ and non-robust features $Z_n$. By applying the triangle inequality of VI (Eq. (2)) to $X$, $Z_r$ and $Z_n$, we have
$$VI(X;Z_r) + VI(X;Z_n) \geq VI(Z_r;Z_n), \tag{3}$$
where the difference between $VI(X;Z_r) + VI(X;Z_n)$ and $VI(Z_r;Z_n)$ represents the degree of disentanglement. By simplifying Eq. (3) with the definition of VI (Eq. (1)), and noting that the entropy terms $H(Z_r)$ and $H(Z_n)$ cancel, we have
$$VI(X;Z_r) + VI(X;Z_n) - VI(Z_r;Z_n) = 2H(X) + 2\left[I(Z_r;Z_n) - I(X;Z_r) - I(X;Z_n)\right]. \tag{4}$$
Then, for a given dataset, $H(X)$ is a constant positive value. By dropping $H(X)$ and the coefficient from Eq. (4), we have
$$VI(X;Z_r) + VI(X;Z_n) - VI(Z_r;Z_n) > I(Z_r;Z_n) - I(X;Z_r) - I(X;Z_n). \tag{5}$$
As the robust and non-robust features in Eq. (5) are symmetric and interchangeable, we further differentiate them by introducing supervised information. A recent study shows that without inductive biases, it is theoretically impossible to learn disentangled representations (Locatello et al., 2019). Therefore, we leverage the task label in $Y$ and the domain label in $Y'$ to supervise robust and non-robust feature learning, respectively.
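The following schematic PyTorch sketch shows one way such supervision could realize the objective in Eq. (5). It is a sketch under stated assumptions, not the paper's implementation: cross-entropy heads on $Z_r$ (task label) and $Z_n$ (domain label) serve as tractable proxies for maximizing $I(X;Z_r)$ and $I(X;Z_n)$, and a pair discriminator trained through a gradient-reversal layer stands in for minimizing $I(Z_r;Z_n)$; the paper's actual classifiers and discriminator may differ in form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DisentangledHeads(nn.Module):
    """Projections to robust/non-robust spaces plus the supervision heads (illustrative)."""
    def __init__(self, hidden: int, n_task_labels: int, n_domain_labels: int = 2):
        super().__init__()
        self.proj_r = nn.Linear(hidden, hidden)                 # h -> z_r (robust)
        self.proj_n = nn.Linear(hidden, hidden)                 # h -> z_n (non-robust)
        self.task_head = nn.Linear(hidden, n_task_labels)       # supervised by task label y
        self.domain_head = nn.Linear(hidden, n_domain_labels)   # supervised by domain label y'
        self.pair_disc = nn.Sequential(                         # joint vs. shuffled (z_r, z_n) pairs
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, h):  # h: sentence encoding from any text encoder, shape (B, hidden)
        return torch.tanh(self.proj_r(h)), torch.tanh(self.proj_n(h))

def disentangled_loss(model, h, task_y, domain_y, lam=1.0):
    z_r, z_n = model(h)
    # Proxies for maximizing I(X; Z_r) and I(X; Z_n): predict the task label from z_r
    # and the natural/adversarial domain label from z_n.
    loss_task = F.cross_entropy(model.task_head(z_r), task_y)
    loss_domain = F.cross_entropy(model.domain_head(z_n), domain_y)
    # Stand-in for minimizing I(Z_r; Z_n): the discriminator learns to tell paired
    # (z_r, z_n) from randomly re-paired ones; gradient reversal makes the projections
    # remove whatever dependence the discriminator can exploit.
    z_r_rev = GradReverse.apply(z_r, lam)
    z_n_rev = GradReverse.apply(z_n, lam)
    perm = torch.randperm(z_n_rev.size(0), device=h.device)
    pairs = torch.cat([torch.cat([z_r_rev, z_n_rev], dim=1),
                       torch.cat([z_r_rev, z_n_rev[perm]], dim=1)], dim=0)
    pair_y = torch.cat([torch.ones(h.size(0), dtype=torch.long, device=h.device),
                        torch.zeros(h.size(0), dtype=torch.long, device=h.device)])
    loss_dep = F.cross_entropy(model.pair_disc(pairs), pair_y)
    return loss_task + loss_domain + loss_dep
```

Minimizing the combined loss trains the discriminator to detect dependence between $Z_r$ and $Z_n$, while the reversed gradients push the two projections to remove that dependence.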
Specifically, encoding $X$ into $Z_r$ to predict the output $Y$ forms a Markov chain $X \rightarrow Z_r \rightarrow Y$ and