
Log-linear Guardedness and its Implications
Shauli Ravfogel1,2 Yoav Goldberg1,2 Ryan Cotterell3
1Bar-Ilan University 2Allen Institute for Artificial Intelligence 3ETH Zürich
{shauli.ravfogel,yoav.goldberg}@gmail.com ryan.cotterell@inf.ethz.ch
Abstract
Methods for erasing human-interpretable concepts from neural representations that assume linearity have been found to be tractable and useful. However, the impact of this removal on the behavior of downstream classifiers trained on the modified representations is not fully understood. In this work, we formally define the notion of log-linear guardedness as the inability of an adversary to predict the concept directly from the representation, and study its implications. We show that, in the binary case, under certain assumptions, a downstream log-linear model cannot recover the erased concept. However, we demonstrate that a multiclass log-linear model can be constructed that indirectly recovers the concept in some cases, pointing to the inherent limitations of log-linear guardedness as a downstream bias mitigation technique. These findings shed light on the theoretical limitations of linear erasure methods and highlight the need for further research on the connections between intrinsic and extrinsic bias in neural models.
https://github.com/rycolab/guardedness
1 Introduction
Neural models of text have been shown to represent human-interpretable concepts, e.g., those related to the linguistic notions of morphology (Vylomova et al., 2017), syntax (Linzen et al., 2016), and semantics (Belinkov et al., 2017), as well as extra-linguistic notions, e.g., gender distinctions (Caliskan et al., 2017). Identifying and erasing such concepts from neural representations is known as concept erasure. Linear concept erasure in particular has gained popularity due to its potential for obtaining formal guarantees and its empirical effectiveness (Bolukbasi et al., 2016; Dev and Phillips, 2019; Ravfogel et al., 2020; Dev et al., 2021; Kaneko and Bollegala, 2021; Shao et al., 2023b,a; Kleindessner et al., 2023; Belrose et al., 2023).
A common instantiation of concept erasure is removing a concept (e.g., gender) from a representation (e.g., the last hidden representation of a transformer-based language model) such that it cannot be predicted by a log-linear model. Then, one fits a secondary log-linear model for a downstream task over the erased representations. For example, one may fit a log-linear sentiment analyzer to predict sentiment from gender-erased representations. The hope behind such a pipeline is that, because the concept of gender was erased from the representations, the predictions made by the log-linear sentiment analyzer are oblivious to gender. Previous work (Ravfogel et al., 2020; Elazar et al., 2021; Jacovi et al., 2021; Ravfogel et al., 2022a) has implicitly or explicitly relied on the assumption that erasing a concept from the representations also yields a downstream classifier that is oblivious to that concept.
In this paper, we formally analyze the effect
concept erasure has on a downstream classifier.
We start by formalizing concept erasure using Xu
et al.’s (2020)
V
-information.
1
We then spell out
the related notion of guardedness as the inability
to predict a given concept from concept-erased rep-
resentations using a specific family of classifiers.
Formally, if
V
is the family of distributions real-
izable by a log-linear model, then we say that the
representations are guarded against gender with re-
spect to
V
. The theoretical treatment in our paper
specifically focuses on log-linear guardedness, which we take to mean the inability of a log-linear model to recover the erased concept from the representations. We are able to prove that when the downstream classifier is binary valued, such as a binary sentiment classifier, its prediction indeed cannot leak information about the erased concept (§ 3.2) under certain assumptions. In contrast, in the case of multiclass classification with a log-linear model, we show that predictions can potentially leak a substantial amount of information about the erased concept, in some cases recovering the guarded information completely.
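A toy picture of such multiclass leakage can be given with an illustrative XOR-style construction (an assumption for exposition, not the paper's proof): no binary log-linear probe can predict the concept better than chance, yet the argmax of a multiclass log-linear model reveals it exactly.

```python
# Illustrative XOR-style example: binary log-linear probes fail,
# but a multiclass log-linear model's predictions leak the concept.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 4000
X = rng.choice([-1.0, 1.0], size=(n, 2)) + 0.1 * rng.normal(size=(n, 2))
z = (X[:, 0] * X[:, 1] > 0).astype(int)   # XOR-style concept: not linearly separable

# A binary log-linear probe cannot beat chance on an XOR concept.
binary_acc = LogisticRegression().fit(X, z).score(X, z)

# A 4-class log-linear model over the quadrants IS linearly realizable
# (argmax of four linear scores carves out the quadrants) ...
y = 2 * (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)
pred = LogisticRegression().fit(X, y).predict(X)

# ... and a fixed mapping of its predicted class recovers the concept:
# quadrants (+,+) and (-,-) correspond to z = 1.
recovered = ((pred == 0) | (pred == 3)).astype(int)
leak_acc = (recovered == z).mean()
print(f"binary probe: {binary_acc:.2f}, multiclass leakage: {leak_acc:.2f}")
```

The leakage comes entirely from post-processing the multiclass argmax, which is why guarding against binary log-linear adversaries alone does not suffice.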
The theoretical analysis is supported by experiments on commonly used linear erasure techniques

1 We also consider a definition based on accuracy.
arXiv:2210.10012v5 [cs.LG] 10 May 2024