
Identifier | Quasi-identifiers      | Private attributes          | Model prediction
Name       | Age  Gender  City      | Salary  Relationship status | Credit decision
Lisa       | 21   F       Brussels  | $50K    Single              | Reject

Table 1: Factual instance Lisa
Name is the identifier, which is deleted from the dataset, but, as mentioned, people can often be identified by their unique combination of quasi-identifiers. Age, Gender and City are the quasi-identifiers in this dataset, which are assumed to be public knowledge for every adversary. A possible reason for this assumption is that the adversary has acquired access to a voter registration list, as in Sweeney [2000]. Salary and Relationship status are private attributes that one does not want to become public information, and the target attribute in this dataset is whether the individual will be awarded credit or not. The machine learning model predicts Lisa as not creditworthy and her credit application is rejected. Logically, Lisa wants to know the easiest way to get her credit application accepted, so she asks for a counterfactual explanation: the smallest change to her feature values that results in a different prediction outcome.
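Formally (a generic formulation, with the notation x, f and d chosen here for illustration rather than taken from this paper), a counterfactual explanation for a factual instance x classified by a model f is the closest point that receives a different prediction:

\[
  x^{\mathrm{cf}} = \operatorname*{arg\,min}_{x' \,:\, f(x') \neq f(x)} d(x, x')
\]

where d is a distance over the feature space. In the instance-based set-up described next, the candidate set is further restricted to the training data, so the counterfactual is the nearest unlike neighbor.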
Identifier | Quasi-identifiers      | Private attributes          | Model prediction
Name       | Age  Gender  City      | Salary  Relationship status | Credit decision
Alfred     | 25   M       Brussels  | $50K    Single              | Reject
Boris      | 23   M       Antwerp   | $40K    Separated           | Reject
Casper     | 34   M       Brussels  | $30K    Cohabiting          | Reject
Derek      | 47   M       Antwerp   | $100K   Married             | Accept
Edward     | 70   M       Brussels  | $90K    Single              | Accept
Fiona *    | 24   F       Antwerp   | $60K    Single              | Accept
Gina       | 27   F       Antwerp   | $80K    Married             | Accept
Hilda      | 38   F       Brussels  | $60K    Widowed             | Reject
Ingrid     | 26   F       Antwerp   | $60K    Single              | Reject
Jade       | 50   F       Brussels  | $100K   Married             | Accept

Table 2: Training set (the nearest unlike neighbor, Fiona, marked with *)
In our set-up, the counterfactual algorithm looks for the instance in the training set that is nearest to Lisa and has a different prediction outcome (the nearest unlike neighbor). The training set, with the nearest unlike neighbor highlighted, is shown in Table 2. Fiona has attribute values similar to Lisa's, but is 24 years old instead of 21, lives in Antwerp instead of Brussels and earns $60K instead of $50K. When Fiona is used as the counterfactual instance by the explanation algorithm, Lisa would receive the explanation: ‘If you were 3 years older, lived in Antwerp and your income was $10K higher, then you would have received the loan’. Based on her combined knowledge of the explanation and her own attribute values, Lisa can now deduce that Fiona is the counterfactual instance, as there is only one person in this dataset with this combination of quasi-identifiers (a 24-year-old woman living in Antwerp). Therefore, Lisa can deduce Fiona's private attributes, namely her income and relationship status, which is undesirable.
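To make the linkage step concrete, the following Python sketch replays the attack on the toy data of Tables 1 and 2. The encoding of the explanation as a set of feature changes and the dictionary-based look-up are illustrative assumptions made for this sketch, not a specific implementation from the literature.

```python
# Sketch of the explanation linkage attack on the toy data of Tables 1 and 2.

# Public knowledge available to any adversary (e.g. via a voter registration list):
# the identifier plus the quasi-identifiers Age, Gender and City.
PUBLIC = {
    "Alfred": (25, "M", "Brussels"), "Boris":  (23, "M", "Antwerp"),
    "Casper": (34, "M", "Brussels"), "Derek":  (47, "M", "Antwerp"),
    "Edward": (70, "M", "Brussels"), "Fiona":  (24, "F", "Antwerp"),
    "Gina":   (27, "F", "Antwerp"),  "Hilda":  (38, "F", "Brussels"),
    "Ingrid": (26, "F", "Antwerp"),  "Jade":   (50, "F", "Brussels"),
}

# Lisa's own record and the counterfactual explanation she received.
lisa = {"Age": 21, "Gender": "F", "City": "Brussels", "Salary": 50, "Status": "Single"}
changes = {"Age": +3, "City": "Antwerp", "Salary": +10}  # '3 years older, Antwerp, $10K more'

# Step 1: apply the suggested changes to her own record to reconstruct the
# counterfactual instance; unmentioned attributes (Gender, Status) stay as they are.
cf = dict(lisa)
cf["Age"] += changes["Age"]
cf["City"] = changes["City"]
cf["Salary"] += changes["Salary"]

# Step 2: link the counterfactual's quasi-identifiers against the public knowledge.
qi = (cf["Age"], cf["Gender"], cf["City"])  # (24, 'F', 'Antwerp')
matches = [name for name, values in PUBLIC.items() if values == qi]

# Step 3: a unique match re-identifies the person behind the native counterfactual,
# and the reconstructed record leaks that person's private attribute values.
if len(matches) == 1:
    print(f"{matches[0]} re-identified: salary ${cf['Salary']}K, status {cf['Status']}")
# -> Fiona re-identified: salary $60K, status Single
```

The attack succeeds only because this quasi-identifier combination is unique in the dataset; if several people shared it, the match set would be ambiguous and the private attributes could not be linked to a single individual.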
Obviously, this is just a toy example, but we envision many real-world settings where this situation could occur. For instance, when end users receive a negative decision made by a high-risk AI system: such systems are defined in the EU's AI Act, which categorizes the risk of AI system usage into four levels [European Commission, 2021]. Among others, high-risk systems include those used for employment, educational training, law enforcement, migration and essential public services such as credit scoring. Article 13(1) states: “High-risk AI systems shall be designed and developed in such a way to ensure that their operation is sufficiently transparent to enable users to interpret the system's output and use it appropriately.” These systems are thus obliged to provide some form of transparency and guidance to their users, which could be done by providing counterfactual explanations or any other transparency technique. Most of these settings use private attributes as input for their decisions, so it is important to make sure that the transparency techniques used do not reveal private information about other decision subjects. For example, in decisions about educational training or employment, someone's grades could be revealed, or in credit scoring, the income of other decision subjects could be disclosed.
This privacy risk only occurs when the counterfactual algorithm uses instance-based strategies to find counterfactual explanations. Such counterfactuals correspond to the nearest unlike neighbor and are also called native counterfactuals [Brughmans and Martens, 2021, Keane and Smyth, 2020]. Other counterfactual algorithms use perturbation: synthetic counterfactuals are generated by perturbing the factual instance and labelling it with the machine learning model, without reference to known cases in the training set [Keane and Smyth, 2020]. These techniques are also vulnerable to privacy attacks, such as model extraction, but we focus on counterfactual algorithms that return real instances. Several algorithms do this, as it substantially decreases the run time while also improving desirable properties of the explanations such as plausibility [Brughmans and Martens, 2021]. Plausibility measures how realistic the counterfactual explanation is with respect to the data manifold, which is a desirable property [Guidotti, 2022], and Brughmans and Martens [2021] show that the techniques returning an actual instance achieve the best plausibility results. Furthermore, it is argued that plausible counterfactual instances are more robust and thus less vulnerable to the uncertainty of the classification model or to changes over time [Artelt et al., 2021, Brughmans and Martens, 2021, Pawelczyk et al., 2020]. This shows that for some use cases it can be very useful to use real data points as counterfactuals instead of synthetic ones, as for the latter