THE PRIVACY ISSUE OF COUNTERFACTUAL EXPLANATIONS:
EXPLANATION LINKAGE ATTACKS
A PREPRINT
Sofie Goethals
Department of Engineering Management
University of Antwerp
Antwerp, Belgium
sofie.goethals@uantwerpen.be
Kenneth Sörensen
Department of Engineering Management
University of Antwerp, Belgium
David Martens
Department of Engineering Management
University of Antwerp, Belgium
October 24, 2022
ABSTRACT

Black-box machine learning models are being used in more and more high-stakes domains, which creates a growing need for Explainable AI (XAI). Unfortunately, the use of XAI in machine learning introduces new privacy risks, which currently remain largely unnoticed. We introduce the explanation linkage attack, which can occur when deploying instance-based strategies to find counterfactual explanations. To counter such an attack, we propose k-anonymous counterfactual explanations and introduce pureness as a new metric to evaluate the validity of these k-anonymous counterfactual explanations. Our results show that making the explanations, rather than the whole dataset, k-anonymous, is beneficial for the quality of the explanations.

Keywords Explainable AI · Counterfactual Explanations · Privacy · k-anonymity
1 Introduction
Black-box models are used for decisions in more and more high-stakes domains such as finance, healthcare and justice, increasing the need to explain these decisions and to make sure that they are aligned with how we want the decisions to be made [Molnar, 2020]. As a result, the interest in interpretability techniques for machine learning, and the development of various such techniques, has soared [Molnar, 2020]. At the moment, however, there is no consensus on which technique is best for which specific use case. Within the field of Explainable AI (XAI), we focus on a popular local explanation technique: counterfactual explanations [Martens and Provost, 2014, Wachter et al., 2017].
Counterfactual explanations, which are used to explain predictions of individual instances, are defined as the smallest change to the feature values of an instance that alters its prediction [Martens and Provost, 2014, Molnar, 2020]. Factual instances are the original instances that are explained, and the counterfactual instance is the original instance with the updated values from the explanation. An example of a factual instance, counterfactual instance and counterfactual explanation for a credit scoring context can be seen in Figure 1. Lisa is the factual instance here, whose credit gets rejected. Fiona, a nearby instance in the training set whose credit was accepted, is selected as counterfactual instance by the algorithm, and based on Fiona, Lisa receives a counterfactual explanation that states which features to change to receive a positive credit decision. These explanations can serve multiple objectives: they can be used for model debugging by data scientists or model experts, to justify decisions to end users or provide actionable recourse, to detect bias in the model, to increase social acceptance, to comply with GDPR, etc. [Aïvodji et al., 2020, Molnar, 2020].
Factual instance

Identifier   Quasi-identifiers            Private attributes              Model prediction
Name    Age   Gender   City       Salary   Relationship status   Credit decision
Lisa    21    F        Brussels   $50K     Single                Reject

Counterfactual explanation = "If you would be three years older, lived in Antwerp and your
income would be $10K higher, you would have received a positive credit decision."

Counterfactual instance

Identifier   Quasi-identifiers            Private attributes              Model prediction
Name    Age   Gender   City       Salary   Relationship status   Credit decision
Fiona   24    F        Antwerp    $60K     Single                Accept

Figure 1: Example of a counterfactual explanation
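To make the relation between the three objects concrete, the following minimal sketch (our own illustration in Python, not the authors' implementation; the dictionary encoding and function name are hypothetical) derives the explanation as the set of feature-value changes between the factual and the counterfactual instance of Figure 1.

# Minimal illustration: a counterfactual explanation is the set of feature
# values that differ between the factual and the counterfactual instance.
# Toy example from Figure 1; not the implementation used in the paper.
factual = {"Age": 21, "Gender": "F", "City": "Brussels", "Salary": 50_000}
counterfactual = {"Age": 24, "Gender": "F", "City": "Antwerp", "Salary": 60_000}

def explanation(factual: dict, counterfactual: dict) -> dict:
    """Return, per changed feature, the value the factual instance should adopt."""
    return {feature: value
            for feature, value in counterfactual.items()
            if factual.get(feature) != value}

print(explanation(factual, counterfactual))
# {'Age': 24, 'City': 'Antwerp', 'Salary': 60000}: "if you were three years older,
# lived in Antwerp and earned $10K more, you would have received a positive decision".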
At the same time, there is a growing concern about the potential privacy risks of machine learning [Liu et al., 2021]. Privacy is recognized as a human right and defined by the Oxford Dictionary as “a state of being free from the attention of the public”.¹ In a privacy attack, the goal of an adversary is to gain knowledge that was not intended to be shared [Liu et al., 2021, Rigaki and Garcia, 2020]. Different kinds of privacy attacks exist: the target can be the training data, where the adversary tries to infer membership in a membership inference attack or specific attributes of an input sample in an attribute inference attack, as well as the model itself, in a model extraction attack [Fredrikson et al., 2015, Rigaki and Garcia, 2020].

Unfortunately, there exists an inherent tension between explainability and privacy, as the usage of Explainable AI can increase these privacy risks [Aïvodji et al., 2020]: model explanations offer users information about how the model made a decision about their data instance. Consequently, they leak information about the model and the data instances that were used to train the model. In this paper, we introduce a new kind of privacy attack based on counterfactual explanations, which we call an explanation linkage attack. A linkage attack attempts to identify anonymized individuals by combining the data with background information. An explanation linkage attack attempts to link the counterfactual explanation with background information to identify the counterfactual instance. We illustrate an example of an explanation linkage attack in Section 2. Unfortunately, the introduction of these attacks indicates that an attempt to make an AI system safer by making it more transparent can have the opposite effect [Sokol and Flach, 2019]. Other researchers [Budig et al., Patel et al., 2020] also confirm the trade-off between privacy and explainability and emphasise that assessing this trade-off for minority groups is an important direction for future research [Patel et al., 2020].
Our contributions are as follows:

• We introduce a new kind of privacy attack, the explanation linkage attack, that can occur when using counterfactual explanations that are grounded in instances from the training set.

• As a solution for this problem, we propose k-anonymous counterfactual explanations and develop an algorithm to generate these.

• We evaluate how k-anonymizing the counterfactual explanations influences the quality of these explanations, and introduce pureness as a new metric to evaluate the validity of these explanations.

• We show the trade-off between transparency, fairness and privacy when using k-anonymous explanations: when we add more privacy constraints, the quality of the explanations and thus the transparency decreases. This effect on the explanation quality is larger for minority groups, as they tend to be harder to anonymize, and this can impact the fairness.
2 Problem statement
We introduce the privacy problem of counterfactual explanations that are grounded in instances of the training set, and illustrate this
problem by using a simple toy dataset. This dataset contains individuals that are described by a set of identifiers, quasi-identifiers
and private attributes [Sweeney, 2002a]. Identifiers are attributes such as name, phone or social security number and need to be
suppressed in any case as they often do not have predictive value and can uniquely identify a person. Quasi-identifiers are attributes
such as age, zip code or gender that can hold some predictive value. They are assumed to be public information; however, even though
they cannot uniquely identify a person, their combination might. It has been shown that 87% of US citizens can be re-identified by
the combination of their zip code, gender and date of birth [Sweeney, 2000]. Private attributes are attributes that are not publicly
known. We assume that an adversary will try to get access to the private attributes of a user in the dataset, and a possible avenue to
achieve this goal is by asking for counterfactual explanations. Assume the following factual instance, Lisa, in Table 1:
¹ https://www.oxfordlearnersdictionaries.com/definition/american_english/privacy
Identifier   Quasi-identifiers            Private attributes              Model prediction
Name    Age   Gender   City       Salary   Relationship status   Credit decision
Lisa    21    F        Brussels   $50K     Single                Reject

Table 1: Factual instance Lisa
Name is the identifier that is deleted from the dataset, but, as mentioned, people can often be identified by their unique combination of quasi-identifiers. Age, Gender and City are the quasi-identifiers in this dataset that are assumed to be public knowledge for every adversary. A possible reasoning behind this is that the adversary acquired access to a voter registration list, as in Sweeney [2000]. Salary and Relationship status are private attributes that one does not want to be public information, and the target attribute in this dataset is whether the individual will be awarded credit or not. Lisa is predicted by the machine learning model as not creditworthy and her credit gets rejected. Logically, Lisa wants to know the easiest way to get her credit application accepted, so she asks for a counterfactual explanation: the smallest change to her feature values that results in a different prediction outcome.
Identifier   Quasi-identifiers            Private attributes              Model prediction
Name     Age   Gender   City       Salary   Relationship status   Credit decision
Alfred   25    M        Brussels   $50K     Single                Reject
Boris    23    M        Antwerp    $40K     Separated             Reject
Casper   34    M        Brussels   $30K     Cohabiting            Reject
Derek    47    M        Antwerp    $100K    Married               Accept
Edward   70    M        Brussels   $90K     Single                Accept
Fiona    24    F        Antwerp    $60K     Single                Accept
Gina     27    F        Antwerp    $80K     Married               Accept
Hilda    38    F        Brussels   $60K     Widowed               Reject
Ingrid   26    F        Antwerp    $60K     Single                Reject
Jade     50    F        Brussels   $100K    Married               Accept

Table 2: Training set
In our set-up, the counterfactual algorithm looks for the instance in the training set that is nearest to Lisa and has a different prediction outcome (the nearest unlike neighbor). The training set is shown in Table 2, with Fiona as the nearest unlike neighbor. Fiona has attribute values similar to Lisa's, but is 24 years old instead of 21, lives in Antwerp instead of Brussels and earns $60K instead of $50K. When Fiona is used as counterfactual instance by the explanation algorithm, Lisa would receive the explanation: ‘If you would be 3 years older, lived in Antwerp and your income was $10K higher, then you would have received the loan’. Based on her combined knowledge of the explanation and her own attributes, Lisa can now deduce that Fiona is the counterfactual instance, as there is only one person in this dataset with this combination of quasi-identifiers (a 24-year-old woman living in Antwerp). Therefore, Lisa can deduce the private attributes of Fiona, namely Fiona's income and relationship status, which is undesirable.
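The attack itself only requires the published explanation and the public quasi-identifiers. The sketch below replays the toy example: it selects the nearest unlike neighbor as counterfactual instance and then counts how many training records share that instance's quasi-identifier combination. It is an illustration under our own simplifying assumptions (the distance function, the salary scaling and the tuple encoding are ours), not the counterfactual algorithm used in the paper.

# Toy nearest-unlike-neighbor search over Table 2, followed by the linkage check.
TRAIN = [
    # (name, age, gender, city, salary in $K, prediction)
    ("Alfred", 25, "M", "Brussels", 50, "Reject"),
    ("Boris", 23, "M", "Antwerp", 40, "Reject"),
    ("Casper", 34, "M", "Brussels", 30, "Reject"),
    ("Derek", 47, "M", "Antwerp", 100, "Accept"),
    ("Edward", 70, "M", "Brussels", 90, "Accept"),
    ("Fiona", 24, "F", "Antwerp", 60, "Accept"),
    ("Gina", 27, "F", "Antwerp", 80, "Accept"),
    ("Hilda", 38, "F", "Brussels", 60, "Reject"),
    ("Ingrid", 26, "F", "Antwerp", 60, "Reject"),
    ("Jade", 50, "F", "Brussels", 100, "Accept"),
]

def distance(a, b):
    """Toy heterogeneous distance: scaled numeric gaps plus categorical mismatches."""
    _, age_a, gender_a, city_a, salary_a, _ = a
    _, age_b, gender_b, city_b, salary_b, _ = b
    return (abs(age_a - age_b) / 50 + abs(salary_a - salary_b) / 70
            + (gender_a != gender_b) + (city_a != city_b))

def nearest_unlike_neighbor(factual, desired="Accept"):
    """Closest training instance with the desired prediction (the native counterfactual)."""
    return min((x for x in TRAIN if x[-1] == desired), key=lambda x: distance(factual, x))

def equivalence_class_size(instance):
    """Number of training records sharing the instance's quasi-identifiers (age, gender, city)."""
    _, age, gender, city, _, _ = instance
    return sum((x[1], x[2], x[3]) == (age, gender, city) for x in TRAIN)

lisa = ("Lisa", 21, "F", "Brussels", 50, "Reject")
counterfactual_instance = nearest_unlike_neighbor(lisa)
print(counterfactual_instance)                          # Fiona's record is returned
print(equivalence_class_size(counterfactual_instance))  # 1: unique, so Fiona is re-identifiable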
Obviously, this is just a toy example, but we envision many real-world settings where this situation could occur. For instance, when end users receive a negative decision made by a high-risk AI system: these systems are defined by the EU's AI Act, which categorizes the risk of AI system usage into four levels [European Commission, 2021]. Among others, they include employment, educational training, law enforcement, migration and essential public services such as credit scoring. Article 13(1) states: “High-risk AI systems shall be designed and developed in such a way to ensure that their operation is sufficiently transparent to enable users to interpret the system’s output and use it appropriately.” These systems are thus obliged to provide some form of transparency and guidance to their users, which could be done by providing counterfactual explanations or any other transparency technique. Most of these settings use private attributes as input for their decisions, so it is important to make sure that the transparency techniques used do not reveal private information about other decision subjects. For example, in decisions about educational training or employment, someone's grades could be revealed, or in credit scoring, the income of other decision subjects could be disclosed.
This privacy risk only occurs when the counterfactual algorithm uses instance-based strategies to find the counterfactual explanations. These counterfactuals correspond to the nearest unlike neighbor and are also called native counterfactuals [Brughmans and Martens, 2021, Keane and Smyth, 2020]. Other counterfactual algorithms use perturbation, where synthetic counterfactuals are generated by perturbing the factual instance and labelling it with the machine learning model, without reference to known cases in the training set [Keane and Smyth, 2020]. These techniques are also vulnerable to privacy attacks such as model extraction, but we focus on counterfactual algorithms that return real instances: several algorithms do this, as it substantially decreases the run time while also increasing desirable properties of the explanations such as plausibility [Brughmans and Martens, 2021]. Plausibility measures how realistic the counterfactual explanation is with respect to the data manifold, which is a desirable property [Guidotti, 2022], and Brughmans and Martens [2021] show that the techniques resulting in an actual instance have the best plausibility results. Furthermore, it is argued that counterfactual instances that are plausible are more robust and thus less vulnerable to the uncertainty of the classification model or changes over time [Artelt et al., 2021, Brughmans and Martens, 2021, Pawelczyk et al., 2020]. This shows that for some use cases it can be very useful to use real data points as counterfactuals instead of synthetic ones, as for the latter
the risk of generating implausible counterfactual explanations can be quite high [Laugel et al., 2019]. Algorithms that use these
native counterfactual explanations include NICE (without optimization setting) [Brughmans and Martens, 2021], the WIT tool with
NNCE [Wexler et al., 2019], FACE [Poyiadzi et al., 2020] and certain settings of CBR [Keane and Smyth, 2020].
3 Proposed solution
As a solution, we propose to make the counterfactual explanations k-anonymous. k-anonymity is a property that captures the protection of released data against possible reidentification by stating that the released data should be indistinguishable between k data subjects [Van Tilborg and Jajodia, 2014].
3.1 What is k-anonymity?
Before k-anonymity was introduced, data that looked anonymous was often freely shared after removing explicit identifiers such as name and address, in the incorrect belief that individuals in those datasets could not be identified. Contrary to these beliefs, we have seen that people can often be identified through their unique combination of quasi-identifiers.

Consider a database that holds private information about individuals, where each individual is described by a set of identifiers, quasi-identifiers, and private attributes. k-anonymity characterises the degree of privacy, where the information for each person in the dataset cannot be distinguished from at least k − 1 other individuals whose information was also released [Sweeney, 2002b]. A group of individuals that cannot be distinguished from each other, and thus have the same values of the quasi-identifiers, is named an equivalence class.

Usually k-anonymity is applied to the whole dataset: the quasi-identifiers of the data records are suppressed or generalised in such a way that one record is not distinguishable from at least k − 1 other data records in that dataset [Meyerson and Williams, 2004]. In this way, the privacy of individuals is protected to some extent by “hiding in the crowd”, as private data can now only be linked to a set of individuals of at least size k [Gionis and Tassa, 2008]. However, by generalising or suppressing attribute values, the data becomes less useful, so the problem studied is to make a dataset k-anonymous with minimal loss of information [Gionis and Tassa, 2008, Xu et al., 2006a]. We will measure the loss in information value with the Normalized Certainty Penalty (NCP) and explain this metric in Section 4.
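To make the definition concrete, the following sketch (our illustration with hypothetical column names; it is not the Mondrian or Datafly algorithm referenced in Table 3) checks whether a released table is k-anonymous by computing the size of every equivalence class.

# k-anonymity check: group the records by their quasi-identifier values (the
# equivalence classes) and verify that every class contains at least k records.
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    class_sizes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(class_sizes.values()) >= k

released = [
    {"Age": "20-29", "Gender": "F", "City": "Antwerp", "Salary": 60},
    {"Age": "20-29", "Gender": "F", "City": "Antwerp", "Salary": 80},
    {"Age": "20-29", "Gender": "F", "City": "Antwerp", "Salary": 60},
]
print(is_k_anonymous(released, ["Age", "Gender", "City"], k=3))  # True: one class of size 3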
3.2 Application to our problem
k-anonymity applied to:   Dataset                                   Counterfactual explanation

Input                     Dataset                                   Dataset, factual instance,
                                                                    counterfactual explanation,
                                                                    machine learning model
Defined over              Dataset                                   Counterfactual explanation
Method                    Mondrian², Datafly³, ...                  CF-K
Risk                      Identifying instances in the dataset      Identifying the counterfactual instance
                          based on their combination of             based on its combination of
                          quasi-identifiers and inferring           quasi-identifiers and inferring
                          their private attributes                  its private attributes
Evaluation metrics        Degree of privacy,                        Degree of privacy,
                          information loss                          information loss,
                                                                    counterfactual validity

Table 3: Comparison between the original problem setting of k-anonymity and our problem setting.
Our application differs from the original set-up of k-anonymity as it is focused on making counterfactual explanations anonymous and not the whole dataset. The original application has to be used in situations where the whole dataset is made public. We highlight this difference in Table 3. A counterfactual instance is defined as k-anonymous if its combination of quasi-identifiers can belong to at least k individuals in the training set, and consequently, a counterfactual explanation is defined as k-anonymous if the counterfactual instance on which it is based is k-anonymous. We implement this by looking for close neighbours of Fiona that have similar values of the quasi-identifiers and that also have the desired prediction outcome. In this case, the closest neighbour to Fiona that has the desired prediction outcome is Gina, as can be seen in Table 2. Next, we generalise the quasi-identifiers of the counterfactual instance so that the generalised values cover these neighbours and can thus belong to at least k individuals in the training set.
² LeFevre et al. [2006a]
³ Sweeney [2002b]
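To illustrate this generalisation step, the sketch below applies it to the toy example for k = 2: it takes the counterfactual instance (Fiona) together with her closest accepted neighbour (Gina) and generalises each quasi-identifier to a range or value set covering both. This is a simplified reading of the description above, with hypothetical helper names and encoding; it is not the CF-K algorithm itself.

# Generalise the counterfactual instance's quasi-identifiers over a group of
# k instances so that the released combination can belong to at least k people.
def generalise(counterfactual, neighbours, quasi_identifiers):
    """Return generalised quasi-identifier values covering the counterfactual and its neighbours."""
    group = [counterfactual] + neighbours
    generalised = {}
    for q in quasi_identifiers:
        values = [record[q] for record in group]
        if all(isinstance(v, (int, float)) for v in values):
            generalised[q] = (min(values), max(values))  # numeric attribute -> interval
        else:
            generalised[q] = set(values)                 # categorical attribute -> value set
    return generalised

fiona = {"Age": 24, "Gender": "F", "City": "Antwerp", "Salary": 60}
gina = {"Age": 27, "Gender": "F", "City": "Antwerp", "Salary": 80}  # closest accepted neighbour

print(generalise(fiona, [gina], ["Age", "Gender", "City"]))
# {'Age': (24, 27), 'Gender': {'F'}, 'City': {'Antwerp'}}
# Lisa's explanation now points to "aged 24-27, female, living in Antwerp", a combination
# shared by more than one person in Table 2, so Fiona can no longer be singled out.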