Disentangled Text Representation Learning with Information-Theoretic Perspective for Adversarial Robustness
Jiahao Zhao Wenji Mao
Institute of Automation, Chinese Academy of Sciences
{zhaojiahao2019,wenji.mao}@ia.ac.cn
Abstract
Adversarial vulnerability remains a major obstacle to constructing reliable NLP systems. When imperceptible perturbations are added to raw input text, the performance of a deep learning model may drop dramatically under attack. Recent work argues that the adversarial vulnerability of a model is caused by the non-robust features it picks up in supervised training. In this paper, we therefore tackle the adversarial robustness challenge from the view of disentangled representation learning, which is able to explicitly disentangle robust and non-robust features in text. Specifically, inspired by the variation of information (VI) in information theory, we derive a disentangled learning objective composed of mutual information terms that capture both the semantic representativeness of the latent embeddings and the differentiation of robust and non-robust features. On this basis, we design a disentangled learning network to estimate these mutual information terms. Experiments on text classification and entailment tasks show that our method significantly outperforms representative methods under adversarial attacks, indicating that discarding non-robust features is critical for improving adversarial robustness.
1 Introduction
Although deep neural networks have achieved great success in a variety of Natural Language Processing (NLP) tasks, recent studies show their vulnerability to malicious perturbations (Goodfellow et al., 2015; Jia and Liang, 2017; Gao et al., 2018; Jin et al., 2020). By adding imperceptible perturbations (e.g., typos or synonym substitutions) to the original input text, attackers can generate adversarial examples that deceive the model. Adversarial examples pervasively exist in typical NLP tasks, including text classification (Jin et al., 2020), dependency parsing (Zheng et al., 2020), machine translation (Zhang et al., 2021) and many others. These models work well on clean data but are sensitive to imperceptible perturbations. Recent studies indicate that they are likely to rely on superficial cues rather than deeper, more difficult language phenomena, and thus tend to make incomprehensible mistakes on adversarial examples (Jia and Liang, 2017; Branco et al., 2021).
Tremendous efforts have been made to improve the adversarial robustness of NLP models. Among them, the most effective strategy is adversarial training (Li and Qiu, 2021; Wang et al., 2021; Dong et al., 2021), which minimizes the maximal adversarial loss. Owing to the discrete nature of text, another effective strategy is adversarial data augmentation (Min et al., 2020; Zheng et al., 2020; Ivgi and Berant, 2021), which augments the training set with adversarial examples to re-train the model. Guided by information from the perturbation space, these two strategies utilize textual features as a whole to make the model learn a smooth parameter landscape, so that it is more stable and robust to adversarial perturbations.
As adversarial examples pervasively exist, previous research has studied the underlying reasons for this (Goodfellow et al., 2015; Fawzi et al., 2016; Schmidt et al.; Tsipras et al., 2019; Ilyas et al., 2019). One popular argument (Ilyas et al., 2019) is that adversarial vulnerability is caused by non-robust features. While classifiers strive to maximize accuracy in standard supervised training, they tend to capture any predictive correlation in the training data and may learn predictive yet brittle features, leading to the occurrence of adversarial examples. These non-robust features leave room for attackers to intentionally manipulate them and trick the model. Therefore, discarding the non-robust features can potentially facilitate model robustness against adversarial attacks, yet this issue has not been explored by previous research on adversarial robustness in the text domain.
To address the above issue, we take the approach of disentangled representation learning (DRL),
which decomposes different factors into separate latent spaces. In addition, to measure the dependency between two random variables for disentanglement, we take an information-theoretic perspective based on the Variation of Information (VI). Our work is particularly inspired by Cheng et al. (2020b), who take an information-theoretic approach to text generation and text style transfer. As our focus is on disentangling robust and non-robust features for adversarial robustness, our work is fundamentally different from this related work in model structure and learning objective design.
In this paper, we tackle the adversarial robustness challenge and propose an information-theoretic Disentangled Text Representation Learning (DTRL) method. Guided by the VI in information theory, our method first derives a disentangled learning objective that maximizes the mutual information between robust/non-robust features and the input data to ensure the semantic representativeness of the latent embeddings, and meanwhile minimizes the mutual information between robust and non-robust features to achieve disentanglement. On this basis, we leverage adversarial data augmentation and design a disentangled learning network that realizes a task classifier, a domain classifier and a discriminator to approximate the above mutual information. Experimental results show that our DTRL method improves model robustness by a large margin over the comparative methods.
The contributions of our work are as follows:
• We propose a disentangled text representation learning method, which takes an information-theoretic perspective to explicitly disentangle robust and non-robust features for tackling the adversarial robustness challenge.
• Our method deduces a disentangled learning objective for effective textual feature decomposition, and constructs a disentangled learning network to approximate the mutual information in the derived learning objective.
• Experiments on text classification and entailment tasks demonstrate the superiority of our method over other representative methods, suggesting that eliminating non-robust features is critical for adversarial robustness.
2 Related Work
Textual Adversarial Defense
To defend against adversarial attacks, empirical and certified methods have been proposed. Empirical methods are dominant, mainly including adversarial training and data augmentation. Adversarial training (Miyato et al., 2019; Li and Qiu, 2021; Wang et al., 2021; Dong et al., 2021; Li et al., 2021) regularizes the model with adversarial gradients back-propagated to the embedding layer. Adversarial data augmentation (Min et al., 2020; Zheng et al., 2020; Ivgi and Berant, 2021) generates adversarial examples and re-trains the model on them to enhance robustness. Certified robustness (Jia et al., 2019; Huang et al., 2019; Shi et al., 2020) minimizes an upper-bound loss over the worst-case examples to guarantee model robustness. Besides, adversarial example detection (Zhou et al., 2019; Mozes et al., 2021; Bao et al., 2021) identifies adversarial examples and recovers the perturbations. Unlike these previous methods, we enhance model robustness from the view of DRL by eliminating non-robust features.
Disentangled Representation Learning
Disentangled representation learning (DRL) encodes different factors into separate latent spaces, each with a different semantic meaning. DRL-based methods have been proposed mainly for image-related tasks. Pan et al. (2021) propose a general disentangled learning method based on the information bottleneck principle (Tishby et al., 2000). Recent work also extends DRL to text generation tasks, e.g., style-controlled text generation (Yi et al., 2020; Cheng et al., 2020b). Different from the DRL-based text generation work that uses an encoder-decoder framework to disentangle style and content in text, our work develops the learning objective and network structure to disentangle robust and non-robust features for adversarial robustness.

Existing DRL-based methods for adversarial robustness have been applied solely in the image domain (Yang et al., 2021a,b; Kim et al., 2021), and are mainly based on the VAE. Unlike the continuous, small pixel perturbations in images, which are amenable to generative models, text perturbations are discrete in nature and hard to handle with generative models due to their overwhelming training costs. With adversarial data augmentation, our method instead uses a lightweight layer with cross-entropy losses for effective disentangled representation learning.
3 Preliminary
The Variation of Information (VI) is a fundamental metric in information theory that quantifies the independence between two random variables. Given two random variables $U$ and $V$, $VI(U;V)$ is defined as:
$$VI(U;V) = H(U) + H(V) - 2I(U;V), \tag{1}$$
where $H(U)$ and $H(V)$ are the Shannon entropies, and $I(U;V) = \mathbb{E}_{p(u,v)}\left[\log \frac{p(u,v)}{p(u)p(v)}\right]$ is the mutual information between $U$ and $V$.

The VI is a non-negative, symmetric metric. It obeys the triangle inequality (Kraskov et al., 2003), that is, for any random variables $U$, $V$ and $W$:
$$VI(U;V) + VI(U;W) \geq VI(V;W). \tag{2}$$
Equality holds if and only if the information of $U$ is totally divided into that of $V$ and $W$.
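To make these quantities concrete, here is a minimal sketch (illustrative, not part of the paper) that computes $H$, $I$ and $VI$ in nats for two discrete random variables specified by their joint probability table, following Eq. (1):

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy H(p) in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(p_uv: np.ndarray) -> float:
    """I(U;V) = E_{p(u,v)}[log p(u,v) / (p(u) p(v))] for a discrete joint table."""
    p_u = p_uv.sum(axis=1, keepdims=True)   # marginal p(u) as a column vector
    p_v = p_uv.sum(axis=0, keepdims=True)   # marginal p(v) as a row vector
    mask = p_uv > 0
    return float(np.sum(p_uv[mask] * np.log(p_uv[mask] / (p_u * p_v)[mask])))

def variation_of_information(p_uv: np.ndarray) -> float:
    """VI(U;V) = H(U) + H(V) - 2 I(U;V), as in Eq. (1)."""
    return (entropy(p_uv.sum(axis=1)) + entropy(p_uv.sum(axis=0))
            - 2.0 * mutual_information(p_uv))

# U identical to V  -> VI = 0;  U independent of V -> VI = H(U) + H(V).
print(variation_of_information(np.array([[0.5, 0.0], [0.0, 0.5]])))  # ~0.0
print(variation_of_information(np.outer([0.5, 0.5], [0.5, 0.5])))    # ~2*log(2) ~ 1.386
```

The two example tables confirm the boundary cases: $VI = 0$ when $U$ and $V$ carry the same information, and $VI = H(U) + H(V)$ when they are independent.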
4 Problem Definition
Given a victim model $f_v$ and an original input $x \in X$, where $X$ is the input text set, an attack method $\mathcal{A}$ is applied to search for perturbations and construct an adversarial example $\hat{x} \in \hat{X}$ that fools the model prediction (i.e., $f_v(x) \neq f_v(\hat{x})$). Adversarial attacks can thus be regarded as data augmentation. Consider random variables $X, Y \sim p_D(x, y)$, where $Y$ is the set of class labels, $(x, y)$ is an observed value, $D$ is a dataset and $p_D$ is the data distribution. The goal of adversarial robustness is to build a classifier $f(y|x)$ that is robust against adversarial attacks.
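As a schematic illustration of this setup (a hedged sketch; `f_v` and `attack` are hypothetical interfaces, not ones defined in the paper), the fooling condition can be expressed as:

```python
from typing import Callable

# Hypothetical interfaces: a victim classifier mapping a text to a label id, and an
# attack method A that perturbs a text while querying the victim model
# (e.g., typo insertion or synonym substitution).
VictimModel = Callable[[str], int]
AttackMethod = Callable[[VictimModel, str], str]

def is_successful_attack(f_v: VictimModel, attack: AttackMethod, x: str) -> bool:
    """The fooling condition: f_v(x) != f_v(x_hat) for x_hat produced by the attack."""
    x_hat = attack(f_v, x)
    return f_v(x) != f_v(x_hat)
```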
5 Proposed Method
The overall architecture of our proposed method is shown in Fig. 1. We first apply adversarial attacks to augment the original textual data. We then design the disentangled learning objective to separate features into robust and non-robust ones. Finally, we construct the disentangled learning network to implement the learning objective.
5.1 Adversarial Data Augmentation
As adversarial examples exhibit patterns different from clean data, such as word frequency (Mozes et al., 2021) and fluency (Lei et al., 2022), we use adversarial examples to guide the learning of non-robust features. To efficiently disentangle robust and non-robust features, we employ adversarial data augmentation to obtain adversarial examples for the extension of the training set.

We denote the original training set as $D_{task} = \{x_i, y_i\}_{i=1}^{N}$, where $x$ is the input text, $y$ is the task label (e.g., positive or negative), $x \in X$ and $y \in Y$. We apply adversarial data augmentation to $D_{task}$ and obtain adversarial examples $\hat{x} \in \hat{X}$. We then construct the domain dataset $D_{domain} = \{x'_j, y'_j\}_{j=1}^{M}$, where $x'$ is an input text or an adversarial example, $y'$ is the domain label (e.g., natural or adversarial), $x' \in \{X, \hat{X}\}$, $y' \in Y'$ and $Y'$ is the set of domain labels.
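A minimal construction sketch, assuming the same hypothetical `attack` interface as in the Section 4 sketch and an arbitrary integer encoding of the domain labels (neither is prescribed by the paper):

```python
from typing import List, Tuple

NATURAL, ADVERSARIAL = 0, 1   # arbitrary integer encoding of the domain labels y'

def build_domain_dataset(d_task: List[Tuple[str, int]],
                         f_v,       # victim/classification model, text -> label id
                         attack     # augmentation attack, A(f_v, x) -> x_hat
                         ) -> List[Tuple[str, int]]:
    """Construct D_domain = {(x'_j, y'_j)} from D_task = {(x_i, y_i)}: each original
    text is kept with the 'natural' domain label, and its adversarial counterpart is
    added with the 'adversarial' domain label."""
    d_domain = []
    for x, _y in d_task:
        d_domain.append((x, NATURAL))
        x_hat = attack(f_v, x)
        d_domain.append((x_hat, ADVERSARIAL))
    return d_domain
```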
5.2 Disentangled Learning Objective
We propose a learning objective that disentangles the robust and non-robust features, and build an approximation method to estimate the mutual information terms in the derived learning objective. We use the VI in information theory to measure the dependency between latent variables for disentanglement. In contrast to the computational alternative of a generative model such as the variational autoencoder (VAE), our method takes the discrete nature of text into account and develops an effective VI-guided disentangled learning technique with lower computational cost.
5.2.1 Learning Objective Derivation
We start from $VI(Z_r;Z_n)$ to measure the independence between robust features $Z_r$ and non-robust features $Z_n$. By applying the triangle inequality of VI (Eq. (2)) to $X$, $Z_r$ and $Z_n$, we have
$$VI(X;Z_r) + VI(X;Z_n) \geq VI(Z_r;Z_n), \tag{3}$$
where the difference between $VI(X;Z_r) + VI(X;Z_n)$ and $VI(Z_r;Z_n)$ represents the degree of disentanglement. By simplifying Eq. (3) with the definition of VI (Eq. (1)), and noting that the entropy terms $H(Z_r)$ and $H(Z_n)$ cancel, we have
$$VI(X;Z_r) + VI(X;Z_n) - VI(Z_r;Z_n) = 2H(X) + 2\left[I(Z_r;Z_n) - I(X;Z_r) - I(X;Z_n)\right]. \tag{4}$$
Then, for a given dataset, $H(X)$ is a constant positive value. By dropping $H(X)$ and the coefficient from Eq. (4), we have
$$VI(X;Z_r) + VI(X;Z_n) - VI(Z_r;Z_n) > I(Z_r;Z_n) - I(X;Z_r) - I(X;Z_n). \tag{5}$$
As the robust and non-robust features in Eq. (5) are symmetric and interchangeable, we further differentiate them by introducing supervised information. A recent study shows that without inductive biases, it is theoretically impossible to learn disentangled representations (Locatello et al., 2019). Therefore, we leverage the task label in $Y$ and the domain label in $Y'$ to supervise robust and non-robust feature learning, respectively.
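The following schematic PyTorch sketch shows one way such supervision could realize the objective in Eq. (5). It is a sketch under stated assumptions, not the paper's implementation: cross-entropy heads on $Z_r$ (task label) and $Z_n$ (domain label) serve as tractable proxies for maximizing $I(X;Z_r)$ and $I(X;Z_n)$, and a pair discriminator trained through a gradient-reversal layer stands in for minimizing $I(Z_r;Z_n)$; the paper's actual classifiers and discriminator may differ in form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DisentangledHeads(nn.Module):
    """Projections to robust/non-robust spaces plus the supervision heads (illustrative)."""
    def __init__(self, hidden: int, n_task_labels: int, n_domain_labels: int = 2):
        super().__init__()
        self.proj_r = nn.Linear(hidden, hidden)                 # h -> z_r (robust)
        self.proj_n = nn.Linear(hidden, hidden)                 # h -> z_n (non-robust)
        self.task_head = nn.Linear(hidden, n_task_labels)       # supervised by task label y
        self.domain_head = nn.Linear(hidden, n_domain_labels)   # supervised by domain label y'
        self.pair_disc = nn.Sequential(                         # joint vs. shuffled (z_r, z_n) pairs
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, h):  # h: sentence encoding from any text encoder, shape (B, hidden)
        return torch.tanh(self.proj_r(h)), torch.tanh(self.proj_n(h))

def disentangled_loss(model, h, task_y, domain_y, lam=1.0):
    z_r, z_n = model(h)
    # Proxies for maximizing I(X; Z_r) and I(X; Z_n): predict the task label from z_r
    # and the natural/adversarial domain label from z_n.
    loss_task = F.cross_entropy(model.task_head(z_r), task_y)
    loss_domain = F.cross_entropy(model.domain_head(z_n), domain_y)
    # Stand-in for minimizing I(Z_r; Z_n): the discriminator learns to tell paired
    # (z_r, z_n) from randomly re-paired ones; gradient reversal makes the projections
    # remove whatever dependence the discriminator can exploit.
    z_r_rev = GradReverse.apply(z_r, lam)
    z_n_rev = GradReverse.apply(z_n, lam)
    perm = torch.randperm(z_n_rev.size(0), device=h.device)
    pairs = torch.cat([torch.cat([z_r_rev, z_n_rev], dim=1),
                       torch.cat([z_r_rev, z_n_rev[perm]], dim=1)], dim=0)
    pair_y = torch.cat([torch.ones(h.size(0), dtype=torch.long, device=h.device),
                        torch.zeros(h.size(0), dtype=torch.long, device=h.device)])
    loss_dep = F.cross_entropy(model.pair_disc(pairs), pair_y)
    return loss_task + loss_domain + loss_dep
```

Minimizing the combined loss trains the discriminator to detect dependence between $Z_r$ and $Z_n$, while the reversed gradients push the two projections to remove that dependence.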
Specifically, encoding $X$ into $Z_r$ to predict the output $Y$ forms a Markov chain $X \rightarrow Z_r \rightarrow Y$ and