Log-linear Guardedness and its Implications
Shauli Ravfogel1,2 Yoav Goldberg1,2 Ryan Cotterell3
1Bar-Ilan University 2Allen Institute for Artificial Intelligence 3ETH Zürich
{shauli.ravfogel,yoav.goldberg}@gmail.com ryan.cotterell@inf.ethz.ch
Abstract
Methods for erasing human-interpretable concepts from neural representations that assume linearity have been found to be tractable and useful. However, the impact of this removal on the behavior of downstream classifiers trained on the modified representations is not fully understood. In this work, we formally define the notion of log-linear guardedness as the inability of an adversary to predict the concept directly from the representation, and study its implications. We show that, in the binary case, under certain assumptions, a downstream log-linear model cannot recover the erased concept. However, we demonstrate that a multiclass log-linear model can be constructed that indirectly recovers the concept in some cases, pointing to the inherent limitations of log-linear guardedness as a downstream bias mitigation technique. These findings shed light on the theoretical limitations of linear erasure methods and highlight the need for further research on the connections between intrinsic and extrinsic bias in neural models.
https://github.com/rycolab/guardedness
1 Introduction
Neural models of text have been shown to represent human-interpretable concepts, e.g., those related to the linguistic notion of morphology (Vylomova et al., 2017), syntax (Linzen et al., 2016), semantics (Belinkov et al., 2017), as well as extra-linguistic notions, e.g., gender distinctions (Caliskan et al., 2017). Identifying and erasing such concepts from neural representations is known as concept erasure. Linear concept erasure in particular has gained popularity due to its potential for obtaining formal guarantees and its empirical effectiveness (Bolukbasi et al., 2016; Dev and Phillips, 2019; Ravfogel et al., 2020; Dev et al., 2021; Kaneko and Bollegala, 2021; Shao et al., 2023b,a; Kleindessner et al., 2023; Belrose et al., 2023).

A common instantiation of concept erasure is removing a concept (e.g., gender) from a representation (e.g., the last hidden representation of a transformer-based language model) such that it cannot be predicted by a log-linear model. Then, one fits a secondary log-linear model for a downstream task over the erased representations. For example, one may fit a log-linear sentiment analyzer to predict sentiment from gender-erased representations. The hope behind such a pipeline is that, because the concept of gender was erased from the representations, the predictions made by the log-linear sentiment analyzer are oblivious to gender. Previous work (Ravfogel et al., 2020; Elazar et al., 2021; Jacovi et al., 2021; Ravfogel et al., 2022a) has implicitly or explicitly relied on this assumption that erasing concepts from representations would also result in a downstream classifier that was oblivious to the target concept.
In this paper, we formally analyze the effect concept erasure has on a downstream classifier. We start by formalizing concept erasure using Xu et al.'s (2020) V-information.¹ We then spell out the related notion of guardedness as the inability to predict a given concept from concept-erased representations using a specific family of classifiers. Formally, if V is the family of distributions realizable by a log-linear model, then we say that the representations are guarded against gender with respect to V. The theoretical treatment in our paper specifically focuses on log-linear guardedness, which we take to mean the inability of a log-linear model to recover the erased concept from the representations. We are able to prove that when the downstream classifier is binary valued, such as a binary sentiment classifier, its prediction indeed cannot leak information about the erased concept (§ 3.2) under certain assumptions. On the contrary, in the case of multiclass classification with a log-linear model, we show that predictions can potentially leak a substantial amount of information about the erased concept, thereby recovering the guarded information completely.

The theoretical analysis is supported by experiments on commonly used linear erasure techniques (§ 5). While previous authors (Goldfarb-Tarrant et al. 2021, Orgad et al. 2022, inter alia) have empirically studied concept erasure's effect on downstream classifiers, to the best of our knowledge, we are the first to study it theoretically. Taken together, these findings suggest that log-linear guardedness may have limitations when it comes to preventing information leakage about concepts and should be assessed with extreme care, even when the downstream classifier is merely a log-linear model.

¹We also consider a definition based on accuracy.

arXiv:2210.10012v5 [cs.LG] 10 May 2024
2 Information-Theoretic Guardedness
In this section, we present an information-theoretic approach to guardedness, which we couch in terms of V-information (Xu et al., 2020).
2.1 Preliminaries
We first explain the concept erasure paradigm (Ravfogel et al., 2022a), upon which our work is based. Let X be a representation-valued random variable. In our setup, we assume representations are real-valued, i.e., they live in R^D. Next, let Z be a binary-valued random variable that denotes a protected attribute, e.g., binary gender.² We denote the two binary values of Z by Z def= {⊥, ⊤}. We assume the existence of a guarding function h : R^D → R^D that, when applied to the representations, removes the ability to predict the concept Z from them by a specific family of models. Furthermore, we define the random variable Ŷ = t(h(X)) where t : R^D → Y def= {0, ..., |Y| − 1} is a function³ that corresponds to a linear classifier for a downstream task. For instance, t may correspond to a linear classifier that predicts the sentiment of a representation.
Our discussion in this paper focuses on the case when the function t is derived from the argmax of a log-linear model, i.e., in the binary case we define Ŷ's conditional distribution given h(X) as

  p(Ŷ = y | h(X) = h(x)) = { 1, if y = y*;  0, else }    (1)

where θ ∈ R^D is a parameter column vector, ϕ ∈ R is a scalar bias term, and

  y* = { 1, if θ^⊤ h(x) + ϕ > 0;  0, else }    (2)

And, in the multiclass case we define Ŷ's conditional distribution given h(X) as

  p(Ŷ = y | h(X) = h(x)) = { 1, if y = y*;  0, else }    (3)

where y* = argmax_{y ∈ Y} (Θ^⊤ h(x) + ϕ)_y, and Θ_y ∈ R^D denotes the yth column of the parameter matrix Θ ∈ R^{D×K}, and ϕ ∈ R^K is the bias term. Note K is the number of classes.

²Not all concepts are binary, but our analysis in § 2 makes use of this simplifying assumption.
³The elements of Y are denoted y.
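The hard decision rules of Eqs. (1)–(3) can be sketched in a few lines of NumPy. The parameter values below are purely illustrative, not taken from the paper:

```python
import numpy as np

def binary_t(h_x, theta, phi):
    """Hard binary decision of Eq. (2): 1 if theta^T h(x) + phi > 0, else 0."""
    return int(theta @ h_x + phi > 0)

def multiclass_t(h_x, Theta, phi):
    """Hard multiclass decision of Eq. (3): argmax over the K logits."""
    return int(np.argmax(Theta.T @ h_x + phi))

# Toy check with D = 2 and K = 3 (all parameters are illustrative).
theta, phi = np.array([1.0, -1.0]), 0.0
Theta = np.array([[1.0, 0.0, -1.0],
                  [0.0, 1.0, -1.0]])   # shape D x K
phi_vec = np.zeros(3)
x = np.array([2.0, 0.5])
print(binary_t(x, theta, phi))          # 1, since 2.0 - 0.5 > 0
print(multiclass_t(x, Theta, phi_vec))  # 0, since the logits are [2.0, 0.5, -2.5]
```

Both functions return a single hard label, matching the degenerate (one-hot) conditional distributions in Eqs. (1) and (3).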
2.2 V-Information
Intuitively, a set of representations is guarded if it is not possible to predict a protected attribute z ∈ Z from a representation x ∈ R^D using a specific predictive family. As a first attempt, we naturally formalize predictability in terms of mutual information. In this case, we say that Z is not predictable from X if and only if I(X; Z) = 0. However, the focus of this paper is on linear guardedness, and, thus, we need a weaker condition than simply having the mutual information I(X; Z) = 0. We fall back on Xu et al.'s (2020) framework of V-information, which introduces a generalized version of mutual information. In their framework, they restrict the predictor to a family of functions V, e.g., the set of all log-linear models.
We now develop the information-theoretic background to discuss V-information. The entropy of a random variable is defined as

  H(Z) def= −E_{z∼p(Z)} [log p(z)]    (4)

Xu et al. (2020) analogously define the conditional V-entropy as follows

  H_V(Z | X) def= −sup_{q∈V} E_{(x,z)∼p(X,Z)} [log q(z | x)]    (5)

The V-entropy is a special case of Eq. (5) without conditioning on another random variable, i.e.,

  H_V(Z) def= −sup_{q∈V} E_{z∼p(Z)} [log q(z)]    (6)

Xu et al. (2020) further define the V-information, a generalization of mutual information, as follows

  I_V(X → Z) def= H_V(Z) − H_V(Z | X)    (7)

In words, Eq. (7) is the best approximation of the mutual information realizable by a classifier belonging to the predictive family V. Furthermore, in the case of log-linear models, Eq. (7) can be approximated empirically by calculating the negative log-likelihood loss of the classifier on a given set of examples, as H_V(Z) is the entropy of the label distribution, and H_V(Z | X) is the minimum achievable value of the cross-entropy loss.
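Since H_V(Z) is the entropy of the label marginal and H_V(Z | X) is the best achievable cross-entropy, Eq. (7) can be estimated with an off-the-shelf logistic regression. The sketch below is a minimal illustration under our own assumptions (the synthetic data, the weak-regularization constant C, and the in-sample evaluation are not from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def v_information(X, z, C=1e4):
    """Estimate I_V(X -> Z) in nats for log-linear V:
    H_V(Z) is the entropy of the label marginal (the best constant
    log-linear predictor), and H_V(Z | X) is approximated by the
    cross-entropy of a fitted logistic regression."""
    p1 = z.mean()
    h_z = -(p1 * np.log(p1) + (1 - p1) * np.log(1 - p1))   # H_V(Z)
    clf = LogisticRegression(C=C, max_iter=5000).fit(X, z)
    h_z_given_x = log_loss(z, clf.predict_proba(X))         # H_V(Z | X), in nats
    return h_z - h_z_given_x

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=2000)
X = rng.normal(size=(2000, 4))
X[:, 0] += 2.0 * z                 # one coordinate carries the concept
print(v_information(X, z))         # well above 0: X is not guarded
print(v_information(X[:, 1:], z))  # near 0: dropping that coordinate guards Z
```

This mirrors the approximation described above: the estimate is high when a log-linear model beats the label-marginal baseline, and near zero when it cannot.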
2.3 Guardedness
Having defined V-information, we can now formally define guardedness as the condition where the V-information is small.

Definition 2.1 (V-Guardedness). Let X be a representation-valued random variable and let Z be an attribute-valued random variable. Moreover, let V be a predictive family. A guarding function h ε-guards X with respect to Z over V if I_V(h(X) → Z) < ε.

Definition 2.2 (Empirical V-Guardedness). Let D = {(x_n, z_n)}_{n=1}^N where (x_n, z_n) ∼ p(X, Z). Let X̃ and Z̃ be random variables over R^D and Z, respectively, whose distributions correspond to the marginals of the empirical distribution over D. We say that a function h(·) empirically ε-guards D with respect to the family V if I_V(h(X̃) → Z̃) < ε.

In words, according to Definition 2.2, a dataset is log-linearly guarded if no linear classifier can perform better than the trivial classifier that completely ignores X and always predicts Z according to the proportions of each label. The commonly used algorithms that have been proposed for linear subspace erasure can be seen as approximating the condition we call log-linear guardedness (Ravfogel et al., 2020, 2022a,b). Our experimental results focus on empirical guardedness, which pertains to practically measuring guardedness on a finite dataset. However, determining the precise bounds and guarantees of empirical guardedness is left as an avenue for future research.
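Empirical ε-guardedness can be probed directly on a finite dataset. Below is a toy sketch in which the guarding function h is an orthogonal projection that removes the single direction encoding Z; this is a deliberately crude stand-in for the erasure algorithms cited above, and classifier accuracy is used as a convenient proxy for V-information (cf. footnote 1). Everything here is an illustrative assumption, not the paper's experimental setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, d = 4000, 6
z = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
v = np.zeros(d)
v[0] = 1.0                          # Z is encoded along the direction v
X = X + 3.0 * np.outer(z - 0.5, v)

def h(X, v):
    """Guarding function: orthogonal projection removing the span of v."""
    v = v / np.linalg.norm(v)
    return X - np.outer(X @ v, v)

def accuracy(X, z):
    clf = LogisticRegression(max_iter=2000).fit(X, z)
    return clf.score(X, z)

print(accuracy(X, z))        # well above chance: the raw X leaks Z
print(accuracy(h(X, v), z))  # near 0.5: empirically guarded against log-linear V
```

After projection, no linear classifier can beat the trivial majority-proportion predictor on this data, which is exactly the empirical condition of Definition 2.2.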
3 Theoretical Analysis
In the following sections, we study the implications of guardedness on subsequent linear classifiers. Specifically, if we construct a third random variable Ŷ = t(h(X)) where t : R^D → Y is a function, what is the degree to which Ŷ can reveal information about Z? As a practical instance of this problem, suppose we impose ε-guardedness on the last hidden representations of a transformer model, i.e., X in our formulation, and then fit a linear classifier t over the guarded representations h(X) to predict sentiment. Can the predictions of the sentiment classifier indirectly leak information on gender? For expressive V, the data-processing inequality (Cover and Thomas, 2006, § 2.8) applied to the Markov chain Z → X → Ŷ tells us the answer is no. The reason is that, in this case, V-information is equivalent to mutual information and the data-processing inequality tells us such leakage is not possible. However, the data-processing inequality does not generally apply to V-information (Xu et al., 2020). Thus, it is possible to find such a predictor t for less expressive V. Surprisingly, when |Y| = 2, we are able to prove that constructing such a t that leaks information is impossible under a certain restriction on the family of log-linear models.
3.1 Problem Formulation
We first consider the case where |Y| = 2.
3.2 A Binary Downstream Classifier
We begin by asking whether the predictions of a binary log-linear model trained over the guarded set of representations can leak information on the protected attribute. Our analysis relies on the following simplified family of log-linear models.

Definition 3.1 (Discretized Log-Linear Models). The family of discretized binary log-linear models with parameter δ ∈ (0, 1) is defined as

  V_δ def= { f | f(0 | x) = ρ_δ(σ(α^⊤ x + γ)),  f(1 | x) = ρ_δ(1 − σ(α^⊤ x + γ)) }    (8)

with α ∈ R^D, γ ∈ R, σ being the logistic function, and where we define the δ-discretization function as

  ρ_δ(p) def= { δ, if p ≤ 1/2;  1 − δ, else }    (9)

In words, ρ_δ is a function that maps the probability value to one of two possible values. Note that ties must be broken arbitrarily in the case that p = 1/2 to ensure a valid probability distribution.
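A minimal sketch of Definition 3.1, with illustrative parameter values of our own choosing (the tie at p = 1/2 is broken toward δ here, which is one of the arbitrary choices the definition permits):

```python
import numpy as np

def sigma(t):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def rho(p, delta):
    """delta-discretization of Eq. (9): collapse a probability to delta or 1 - delta.
    Ties at p = 1/2 are broken toward delta (an arbitrary but fixed choice)."""
    return delta if p <= 0.5 else 1.0 - delta

def discretized_model(x, alpha, gamma, delta=0.1):
    """A member of V_delta (Eq. (8)): a log-linear model whose output
    probabilities are snapped to the two values {delta, 1 - delta}."""
    p1 = sigma(alpha @ x + gamma)
    return {0: rho(p1, delta), 1: rho(1.0 - p1, delta)}

alpha, gamma = np.array([1.0, -2.0]), 0.5
out = discretized_model(np.array([3.0, 1.0]), alpha, gamma)
print(out)                                   # {0: 0.9, 1: 0.1}
assert abs(out[0] + out[1] - 1.0) < 1e-12    # still a valid distribution
```

Away from the tie, the two snapped outputs always sum to one, so each member of V_δ remains a proper conditional distribution.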
Our analysis is based on the following simple observation (see Lemma A.1 in the Appendix) that the composition of two δ-discretized log-linear models is itself a δ-discretized log-linear model. Using this fact, we show that when |Y| = |Z| = 2, and the predictive family is the set of δ-discretized binary log-linear models, ε-guarded representations h(X) cannot leak information through a downstream classifier.
[Figure 1: Construction of a log-linear model that breaks log-linear guardedness. (a) Log-linearly guarded data in R^2 with axis-aligned clusters. (b) Log-linearly guarded data in R^2 with clusters that are not axis-aligned.]

Theorem 3.2. Let V_δ be the family of δ-discretized log-linear models, and let X be a representation-valued random variable. Define Ŷ as in Eq. (1); then I_{V_δ}(h(X) → Z) < ε implies I_{V_δ}(Ŷ → Z) < ε.
Proof. Define the hard thresholding function

  τ(x) def= { 1, if x > 0;  0, else }    (10)

We assume, by way of contradiction, that I_{V_δ}(h(X) → Z) < ε, but I_{V_δ}(Ŷ → Z) ≥ ε. We start by algebraically manipulating I_{V_δ}(Ŷ → Z):

  I_{V_δ}(Ŷ → Z) = H_{V_δ}(Z) − H_{V_δ}(Z | Ŷ)
                 = H_{V_δ}(Z) + sup_{q∈V_δ} E_{(z,y)∼p} [log q(z | y)]    (11)
                 = H_{V_δ}(Z) + sup_{q∈V_δ} E_{(z,x)∼p} [log q(z | τ(θ^⊤ h(x) + ϕ))]

for some θ and ϕ as in the definition of t in Eq. (1). Now, by Lemma A.1, we note that, for all q ∈ V_δ, there exists a classifier r ∈ V_δ such that r(z | h(x)) = q(z | τ(θ^⊤ h(x) + ϕ)). This implies that I_{V_δ}(h(X) → Z) ≥ I_{V_δ}(Ŷ → Z) ≥ ε,⁴ contradicting the assumption that I_{V_δ}(h(X) → Z) < ε. Thus, I_{V_δ}(Ŷ → Z) < ε, as desired. ∎
3.3 A Multiclass Downstream Classifier
The above discussion shows that when both Z and Y are binary, ε-log-linear guardedness with respect to the family of discretized log-linear models (Definition 3.1) implies limited leakage of information about Z from Ŷ. It was previously implied (Ravfogel et al., 2020; Elazar et al., 2021) that linear concept erasure prevents information leakage about Z through the labeling of a log-linear classifier Ŷ, i.e., it was assumed that Theorem 3.2 can be generalized to the multiclass case. Specifically, it was argued that a subsequent linear layer, such as the linear language-modeling head, would not be able to recover the information because it is linear.

⁴Lemma A.1 only guarantees that a classifier of the form q(z | τ(θ^⊤ h(x) + ϕ)), where q ∈ V_δ, can be converted into a classifier r(z | h(x)) ∈ V_δ. However, we have no proof of the opposite implication. Hence, we have only shown I_{V_δ}(h(X) → Z) ≥ I_{V_δ}(Ŷ → Z).
In this paper, however, we note a key flaw in this argument. If the data is log-linearly guarded, then it is easy to see that the logits, which are a linear transformation of the guarded representation, cannot encode the information. However, multiclass classification is usually performed by a softmax classifier, which adds a non-linearity. Note that the decision boundary of the softmax classifier for every pair of labels is linear, since class i will have higher softmax probability than class j if, and only if, (θ_i − θ_j)^⊤ x > 0.
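The pairwise-linearity claim is easy to confirm numerically. The random parameters below are illustrative, and the bias term is omitted, as in the inequality above:

```python
import numpy as np

rng = np.random.default_rng(0)
Theta = rng.normal(size=(2, 4))     # D = 2 features, K = 4 classes (illustrative)
X = rng.normal(size=(1000, 2))
logits = X @ Theta                  # softmax preserves the ranking of the logits

# For every pair (i, j), class i beats class j exactly when (theta_i - theta_j)^T x > 0.
# (Floating-point ties are possible in principle but almost surely absent for
# continuous random data.)
for i in range(4):
    for j in range(4):
        if i != j:
            beats = logits[:, i] > logits[:, j]
            linear = X @ (Theta[:, i] - Theta[:, j]) > 0
            assert np.array_equal(beats, linear)
print("every pairwise softmax decision boundary is linear")
```

Each pairwise boundary is a hyperplane, yet the argmax over K such comparisons carves the space into Voronoi-like regions, which is what the next construction exploits.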
Next, we demonstrate that this is enough to break guardedness. We start with an example. Consider the data in R^2 presented in Fig. 1(a), where the distribution p(X, Z) has 4 distinct clusters, each with a different label from Z, corresponding to Voronoi regions (Voronoi, 1908) formed by the intersection of the axes. The red clusters correspond to one value of Z and the blue clusters to the other. The data is taken to be log-linearly guarded with respect to Z.⁵ Importantly, we note that knowledge of the quadrant (i.e., the value of Y) renders Z recoverable by a 4-class log-linear model.

⁵Information-theoretic guardedness depends on the density p(X), which is not depicted in the illustrations in Fig. 1(a).
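The quadrant construction of Fig. 1(a) can be reproduced in a few lines. The cluster centers, noise scale, and use of scikit-learn's multinomial logistic regression are illustrative assumptions, not the paper's exact setup. A binary log-linear model is near chance on Z, yet a 4-class softmax classifier recovers the quadrant, and the quadrant deterministically reveals Z:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
centers = np.array([[2, 2], [-2, 2], [-2, -2], [2, -2]])   # one center per quadrant
quadrant = np.repeat(np.arange(4), n)                       # the 4-class label Y
X = centers[quadrant] + 0.5 * rng.normal(size=(4 * n, 2))
z = quadrant % 2        # XOR pattern: opposite quadrants share a Z label

# A binary log-linear model cannot do better than chance on Z ...
direct = LogisticRegression(max_iter=2000).fit(X, z).score(X, z)
print(f"direct binary accuracy on Z: {direct:.2f}")         # ~ 0.5

# ... but a 4-class softmax classifier recovers the quadrant almost perfectly,
multi = LogisticRegression(max_iter=2000).fit(X, quadrant)
y_hat = multi.predict(X)
print(f"multiclass accuracy on quadrant: {multi.score(X, quadrant):.2f}")

# ... and the quadrant deterministically reveals Z, breaking guardedness.
recovered = (y_hat % 2 == z).mean()
print(f"Z recovered from the multiclass predictions: {recovered:.2f}")
```

The data is guarded against binary log-linear adversaries, yet a downstream multiclass log-linear model leaks Z through its predicted labels, exactly the failure mode the theory identifies.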