
Identifier | Quasi-identifiers      | Private attributes          | Model prediction
Name       | Age  Gender  City      | Salary  Relationship status | Credit decision
Lisa       | 21   F       Brussels  | $50K    Single              | Reject

Table 1: Factual instance Lisa
Name is the identifier, which is deleted from the dataset, but, as mentioned, people can often be identified by their unique combination of quasi-identifiers. Age, Gender and City are the quasi-identifiers in this dataset, which are assumed to be public knowledge for every adversary. A possible reason for this assumption is that the adversary has acquired access to a voter registration list, as in Sweeney [2000]. Salary and Relationship status are private attributes that one does not want to become public information, and the target attribute in this dataset is whether the individual will be awarded credit or not. The machine learning model predicts Lisa as not creditworthy and her credit application is rejected. Logically, Lisa wants to know the easiest way to get her credit application accepted, so she asks for a counterfactual explanation: the smallest change to her feature values that results in a different prediction outcome.
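Formally (a generic formulation, with the notation x, f and d chosen here for illustration rather than taken from this paper), a counterfactual explanation for a factual instance x classified by a model f is the closest point that receives a different prediction:

\[
  x^{\mathrm{cf}} = \operatorname*{arg\,min}_{x' \,:\, f(x') \neq f(x)} d(x, x')
\]

where d is a distance over the feature space. In the instance-based set-up described next, the candidate set is further restricted to the training data, so the counterfactual is the nearest unlike neighbor.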
Identifier | Quasi-identifiers      | Private attributes          | Model prediction
Name       | Age  Gender  City      | Salary  Relationship status | Credit decision
Alfred     | 25   M       Brussels  | $50K    Single              | Reject
Boris      | 23   M       Antwerp   | $40K    Separated           | Reject
Casper     | 34   M       Brussels  | $30K    Cohabiting          | Reject
Derek      | 47   M       Antwerp   | $100K   Married             | Accept
Edward     | 70   M       Brussels  | $90K    Single              | Accept
Fiona *    | 24   F       Antwerp   | $60K    Single              | Accept
Gina       | 27   F       Antwerp   | $80K    Married             | Accept
Hilda      | 38   F       Brussels  | $60K    Widowed             | Reject
Ingrid     | 26   F       Antwerp   | $60K    Single              | Reject
Jade       | 50   F       Brussels  | $100K   Married             | Accept

Table 2: Training set (the nearest unlike neighbor, Fiona, marked with *)
In our set-up, the counterfactual algorithm looks for the instance in the training set that is nearest to Lisa and has a different prediction outcome (the nearest unlike neighbor). The training set, with the nearest unlike neighbor highlighted, is shown in Table 2. Fiona has attribute values similar to Lisa's, but is 24 years old instead of 21, lives in Antwerp instead of Brussels and earns $60K instead of $50K. When Fiona is used as the counterfactual instance by the explanation algorithm, Lisa would receive the explanation: ‘If you were 3 years older, lived in Antwerp and your income was $10K higher, then you would have received the loan’. Based on her combined knowledge of the explanation and her own attribute values, Lisa can now deduce that Fiona is the counterfactual instance, as there is only one person in this dataset with this combination of quasi-identifiers (a 24-year-old woman living in Antwerp). Therefore, Lisa can deduce Fiona's private attributes, namely her income and relationship status, which is undesirable.
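To make the linkage step concrete, the following Python sketch replays the attack on the toy data of Tables 1 and 2. The encoding of the explanation as a set of feature changes and the dictionary-based look-up are illustrative assumptions made for this sketch, not a specific implementation from the literature.

```python
# Sketch of the explanation linkage attack on the toy data of Tables 1 and 2.

# Public knowledge available to any adversary (e.g. via a voter registration list):
# the identifier plus the quasi-identifiers Age, Gender and City.
PUBLIC = {
    "Alfred": (25, "M", "Brussels"), "Boris":  (23, "M", "Antwerp"),
    "Casper": (34, "M", "Brussels"), "Derek":  (47, "M", "Antwerp"),
    "Edward": (70, "M", "Brussels"), "Fiona":  (24, "F", "Antwerp"),
    "Gina":   (27, "F", "Antwerp"),  "Hilda":  (38, "F", "Brussels"),
    "Ingrid": (26, "F", "Antwerp"),  "Jade":   (50, "F", "Brussels"),
}

# Lisa's own record and the counterfactual explanation she received.
lisa = {"Age": 21, "Gender": "F", "City": "Brussels", "Salary": 50, "Status": "Single"}
changes = {"Age": +3, "City": "Antwerp", "Salary": +10}  # '3 years older, Antwerp, $10K more'

# Step 1: apply the suggested changes to her own record to reconstruct the
# counterfactual instance; unmentioned attributes (Gender, Status) stay as they are.
cf = dict(lisa)
cf["Age"] += changes["Age"]
cf["City"] = changes["City"]
cf["Salary"] += changes["Salary"]

# Step 2: link the counterfactual's quasi-identifiers against the public knowledge.
qi = (cf["Age"], cf["Gender"], cf["City"])  # (24, 'F', 'Antwerp')
matches = [name for name, values in PUBLIC.items() if values == qi]

# Step 3: a unique match re-identifies the person behind the native counterfactual,
# and the reconstructed record leaks that person's private attribute values.
if len(matches) == 1:
    print(f"{matches[0]} re-identified: salary ${cf['Salary']}K, status {cf['Status']}")
# -> Fiona re-identified: salary $60K, status Single
```

The attack succeeds only because this quasi-identifier combination is unique in the dataset; if several people shared it, the match set would be ambiguous and the private attributes could not be linked to a single individual.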
Obviously, this is just a toy example, but we envision many real-world settings where this situation could occur. For instance, when end users receive a negative decision made by a high-risk AI system: such systems are defined in the EU's AI Act, which categorizes the risk of AI system usage into four levels [European Commission, 2021]. Among others, high-risk systems include those used for employment, educational training, law enforcement, migration and essential public services such as credit scoring. Article 13(1) states: “High-risk AI systems shall be designed and developed in such a way to ensure that their operation is sufficiently transparent to enable users to interpret the system's output and use it appropriately.” These systems are thus obliged to provide some form of transparency and guidance to their users, which could be done by providing counterfactual explanations or any other transparency technique. Most of these settings use private attributes as input for their decisions, so it is important to make sure that the transparency techniques used do not reveal private information about other decision subjects. For example, in decisions about educational training or employment, someone's grades could be revealed, or in credit scoring, the income of other decision subjects could be disclosed.
This privacy risk only occurs when the counterfactual algorithm uses instance-based strategies to find counterfactual explanations. Such counterfactuals correspond to the nearest unlike neighbor and are also called native counterfactuals [Brughmans and Martens, 2021, Keane and Smyth, 2020]. Other counterfactual algorithms use perturbation: synthetic counterfactuals are generated by perturbing the factual instance and labelling it with the machine learning model, without reference to known cases in the training set [Keane and Smyth, 2020]. These techniques are also vulnerable to privacy attacks, such as model extraction, but we focus on counterfactual algorithms that return real instances. Several algorithms do this, as it substantially decreases the run time while also improving desirable properties of the explanations such as plausibility [Brughmans and Martens, 2021]. Plausibility measures how realistic the counterfactual explanation is with respect to the data manifold, which is a desirable property [Guidotti, 2022], and Brughmans and Martens [2021] show that the techniques returning an actual instance achieve the best plausibility results. Furthermore, it is argued that plausible counterfactual instances are more robust and thus less vulnerable to the uncertainty of the classification model or to changes over time [Artelt et al., 2021, Brughmans and Martens, 2021, Pawelczyk et al., 2020]. This shows that for some use cases it can be very useful to use real data points as counterfactuals instead of synthetic ones, as for the latter