which decomposes different factors into separate latent spaces. In addition, to measure the dependence between two random variables for disentanglement, we take an information-theoretic perspective based on the Variation of Information (VI). Our work is particularly inspired by Cheng et al. (2020b), who take an information-theoretic approach to text generation and text style transfer. As our focus is on disentangling robust and non-robust features for adversarial robustness, our work differs fundamentally from this related work in both model structure and learning-objective design.
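As a point of reference (VI is formally introduced in Section 3), for two random variables $X$ and $Y$ it takes the standard information-theoretic form

$$\mathrm{VI}(X;Y) = H(X) + H(Y) - 2I(X;Y) = H(X \mid Y) + H(Y \mid X),$$

where $H(\cdot)$ denotes entropy and $I(\cdot;\cdot)$ mutual information; the larger the VI, the more independent the two variables.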
In this paper, we tackle the adversarial robustness challenge and propose an information-theoretic Disentangled Text Representation Learning (DTRL) method. Guided by the VI from information theory, our method first derives a disentangled learning objective that maximizes the mutual information between robust/non-robust features and the input data, so as to ensure the semantic representativeness of the latent embeddings, while minimizing the mutual information between robust and non-robust features to achieve disentanglement. On this basis, we leverage adversarial data augmentation and design a disentangled learning network that realizes a task classifier, a domain classifier, and a discriminator to approximate the above mutual information terms. Experimental results show that our DTRL method improves model robustness by a large margin over the comparative methods.
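For intuition, the objective just described can be sketched in the following illustrative form (our notation, not the paper's exact derivation), with input $X$, robust features $Z_r$, non-robust features $Z_n$, and a hypothetical trade-off weight $\beta$:

$$\max \;\; I(X; Z_r) + I(X; Z_n) \;-\; \beta\, I(Z_r; Z_n),$$

where the first two terms keep each latent space semantically representative of the input, and the last term penalizes information shared between the robust and non-robust spaces to enforce disentanglement.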
The contributions of our work are as follows:
• We propose a disentangled text representation learning method that takes an information-theoretic perspective to explicitly disentangle robust and non-robust features for tackling the adversarial robustness challenge.
• Our method deduces a disentangled learning objective for effective textual feature decomposition, and constructs a disentangled learning network to approximate the mutual information terms in the derived learning objective.
• Experiments on text classification and entailment tasks demonstrate the superiority of our method over other representative methods, suggesting that eliminating non-robust features is critical for adversarial robustness.
2 Related Work
Textual Adversarial Defense
To defend against adversarial attacks, both empirical and certified methods have been proposed. Empirical methods are dominant and mainly include adversarial training and data augmentation. Adversarial training (Miyato et al., 2019; Li and Qiu, 2021; Wang et al., 2021; Dong et al., 2021; Li et al., 2021) regularizes the model by back-propagating adversarial gradients to the embedding layer. Adversarial data augmentation (Min et al., 2020; Zheng et al., 2020; Ivgi and Berant, 2021) generates adversarial examples and retrains the model on them to enhance robustness. Certified robustness (Jia et al., 2019; Huang et al., 2019; Shi et al., 2020) minimizes an upper-bound loss over the worst-case examples to guarantee model robustness. Besides, adversarial example detection (Zhou et al., 2019; Mozes et al., 2021; Bao et al., 2021) identifies adversarial examples and recovers the original inputs from the perturbations. Unlike these previous methods, we enhance model robustness from the perspective of DRL by eliminating non-robust features.
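As an illustration of the adversarial-training idea above, the following is a minimal FGM-style sketch that back-propagates the adversarial gradient to the embedding layer; `model.loss`, `model.embedding`, and `epsilon` are assumed placeholders rather than any cited paper's implementation.

```python
import torch

def adversarial_training_step(model, inputs, labels, optimizer, epsilon=1.0):
    """One FGM-style update: perturb the embedding matrix along the loss
    gradient, then optimize the clean + adversarial losses jointly.
    `model.loss` and `model.embedding` are hypothetical interfaces."""
    optimizer.zero_grad()

    # Clean forward/backward pass to populate gradients on the embeddings.
    clean_loss = model.loss(inputs, labels)
    clean_loss.backward()

    emb = model.embedding.weight
    grad = emb.grad.detach()
    # Fast-gradient perturbation with a normalized step of size epsilon.
    delta = epsilon * grad / (grad.norm() + 1e-12)

    # Adversarial pass on the perturbed embeddings, then restore them.
    emb.data.add_(delta)
    adv_loss = model.loss(inputs, labels)
    adv_loss.backward()
    emb.data.sub_(delta)

    optimizer.step()  # applies gradients accumulated from both passes
    return clean_loss.item(), adv_loss.item()
```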
Disentangled Representation Learning
Disentangled representation learning (DRL) encodes different factors into separate latent spaces, each with a different semantic meaning. DRL-based methods have been proposed mainly for image-related tasks. Pan et al. (2021) propose a general disentangled learning method based on the information bottleneck principle (Tishby et al., 2000). Recent work also extends DRL to text generation tasks, e.g., style-controlled text generation (Yi et al., 2020; Cheng et al., 2020b). Different from the DRL-based text generation work that uses an encoder-decoder framework to disentangle style and content in text, our work develops a learning objective and network structure to disentangle robust and non-robust features for adversarial robustness.
Existing DRL-based methods for adversarial robustness have been applied solely in the image domain (Yang et al., 2021a,b; Kim et al., 2021), mainly based on the VAE. Different from the continuous, small pixel perturbations in images, which suit generative models, text perturbations are discrete in nature and hard to handle with generative models due to their overwhelming training costs. With adversarial data augmentation, our method instead uses a lightweight layer with a cross-entropy loss for effective disentangled representation learning, as sketched below.
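A minimal sketch of such a disentangled learning network, assuming a shared encoder whose output is split into robust and non-robust sub-spaces with three lightweight cross-entropy heads; all layer names and dimensions here are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledNet(nn.Module):
    """Illustrative network: the encoder output is split into robust features
    z_r and non-robust features z_n; a task classifier, a domain classifier,
    and a discriminator approximate the mutual-information terms with
    cross-entropy losses. Names and dimensions are assumptions."""

    def __init__(self, encoder, hidden_dim, num_classes, num_domains=2):
        super().__init__()
        self.encoder = encoder                            # e.g., a BERT-style encoder
        half = hidden_dim // 2
        self.task_head = nn.Linear(half, num_classes)     # on z_r
        self.domain_head = nn.Linear(half, num_domains)   # on z_n (clean vs. adversarial)
        self.disc_head = nn.Linear(hidden_dim, 2)         # on (z_r, z_n) pairs

    def forward(self, inputs):
        h = self.encoder(inputs)                          # (batch, hidden_dim)
        z_r, z_n = h.chunk(2, dim=-1)                     # split into two latent spaces
        return z_r, z_n

    def losses(self, inputs, task_labels, domain_labels):
        z_r, z_n = self.forward(inputs)
        task_loss = F.cross_entropy(self.task_head(z_r), task_labels)
        domain_loss = F.cross_entropy(self.domain_head(z_n), domain_labels)

        # Discriminator: real joint pairs (z_r, z_n) vs. mismatched pairs built
        # by shuffling z_n within the batch -- a density-ratio style estimate
        # of the dependence between the two sub-spaces.
        n = z_r.size(0)
        perm = torch.randperm(n, device=z_n.device)
        pairs = torch.cat([torch.cat([z_r, z_n], dim=-1),
                           torch.cat([z_r, z_n[perm]], dim=-1)], dim=0)
        pair_labels = torch.cat([torch.ones(n, dtype=torch.long, device=z_r.device),
                                 torch.zeros(n, dtype=torch.long, device=z_r.device)])
        disc_loss = F.cross_entropy(self.disc_head(pairs), pair_labels)
        return task_loss, domain_loss, disc_loss
```

During training, the task and domain losses keep each sub-space informative about the input and the perturbation respectively, while the discriminator loss, applied adversarially to the encoder, discourages information shared between z_r and z_n.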
3 Preliminary
The Variation of Information (VI) is a fundamental metric in information theory that quantifies the independence between two random variables. Given