
Log-linear Guardedness and its Implications
Shauli Ravfogel1,2 Yoav Goldberg1,2 Ryan Cotterell3
1Bar-Ilan University 2Allen Institute for Artificial Intelligence 3ETH Zürich
{shauli.ravfogel,yoav.goldberg}@gmail.com ryan.cotterell@inf.ethz.ch
Abstract
Methods for erasing human-interpretable concepts from neural representations that assume linearity have been found to be tractable and useful. However, the impact of this removal on the behavior of downstream classifiers trained on the modified representations is not fully understood. In this work, we formally define the notion of log-linear guardedness as the inability of an adversary to predict the concept directly from the representation, and study its implications. We show that, in the binary case, under certain assumptions, a downstream log-linear model cannot recover the erased concept. However, we demonstrate that a multiclass log-linear model can be constructed that indirectly recovers the concept in some cases, pointing to the inherent limitations of log-linear guardedness as a downstream bias mitigation technique. These findings shed light on the theoretical limitations of linear erasure methods and highlight the need for further research on the connections between intrinsic and extrinsic bias in neural models.
https://github.com/rycolab/guardedness
1 Introduction
Neural models of text have been shown to represent human-interpretable concepts, e.g., those related to the linguistic notions of morphology (Vylomova et al., 2017), syntax (Linzen et al., 2016), and semantics (Belinkov et al., 2017), as well as extra-linguistic notions, e.g., gender distinctions (Caliskan et al., 2017). Identifying and erasing such concepts from neural representations is known as concept erasure. Linear concept erasure in particular has gained popularity due to its potential for obtaining formal guarantees and its empirical effectiveness (Bolukbasi et al., 2016; Dev and Phillips, 2019; Ravfogel et al., 2020; Dev et al., 2021; Kaneko and Bollegala, 2021; Shao et al., 2023b,a; Kleindessner et al., 2023; Belrose et al., 2023).
A common instantiation of concept erasure is removing a concept (e.g., gender) from a representation (e.g., the last hidden representation of a transformer-based language model) such that it cannot be predicted by a log-linear model. Then, one fits a secondary log-linear model for a downstream task over the erased representations. For example, one may fit a log-linear sentiment analyzer to predict sentiment from gender-erased representations. The hope behind such a pipeline is that, because the concept of gender was erased from the representations, the predictions made by the log-linear sentiment analyzer are oblivious to gender. Previous work (Ravfogel et al., 2020; Elazar et al., 2021; Jacovi et al., 2021; Ravfogel et al., 2022a) has implicitly or explicitly relied on the assumption that erasing a concept from the representations also yields a downstream classifier that is oblivious to that concept.
In this paper, we formally analyze the effect
concept erasure has on a downstream classifier.
We start by formalizing concept erasure using Xu
et al.’s (2020)
V
-information.
1
We then spell out
the related notion of guardedness as the inability
to predict a given concept from concept-erased rep-
resentations using a specific family of classifiers.
Formally, if
V
is the family of distributions real-
izable by a log-linear model, then we say that the
representations are guarded against gender with re-
spect to
V
. The theoretical treatment in our paper
specifically focuses on log-linear guardedness, which we take to mean the inability of a log-linear model to recover the erased concept from the representations. We are able to prove that when the downstream classifier is binary valued, such as a binary sentiment classifier, its prediction indeed cannot leak information about the erased concept (§ 3.2) under certain assumptions. In contrast, in the case of multiclass classification with a log-linear model, we show that predictions can potentially leak a substantial amount of information about the erased concept, in some cases recovering the guarded information completely.
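A toy picture of such multiclass leakage can be given with an illustrative XOR-style construction (an assumption for exposition, not the paper's proof): no binary log-linear probe can predict the concept better than chance, yet the argmax of a multiclass log-linear model reveals it exactly.

```python
# Illustrative XOR-style example: binary log-linear probes fail,
# but a multiclass log-linear model's predictions leak the concept.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 4000
X = rng.choice([-1.0, 1.0], size=(n, 2)) + 0.1 * rng.normal(size=(n, 2))
z = (X[:, 0] * X[:, 1] > 0).astype(int)   # XOR-style concept: not linearly separable

# A binary log-linear probe cannot beat chance on an XOR concept.
binary_acc = LogisticRegression().fit(X, z).score(X, z)

# A 4-class log-linear model over the quadrants IS linearly realizable
# (argmax of four linear scores carves out the quadrants) ...
y = 2 * (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)
pred = LogisticRegression().fit(X, y).predict(X)

# ... and a fixed mapping of its predicted class recovers the concept:
# quadrants (+,+) and (-,-) correspond to z = 1.
recovered = ((pred == 0) | (pred == 3)).astype(int)
leak_acc = (recovered == z).mean()
print(f"binary probe: {binary_acc:.2f}, multiclass leakage: {leak_acc:.2f}")
```

The leakage comes entirely from post-processing the multiclass argmax, which is why guarding against binary log-linear adversaries alone does not suffice.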
The theoretical analysis is supported by experiments on commonly used linear erasure techniques

1 We also consider a definition based on accuracy.
arXiv:2210.10012v5 [cs.LG] 10 May 2024