
It is also less susceptible to inconsistencies related to the selection of representative inputs than other methods based on comparing model predictions. At its core, Zest samples $N$ images from the training set to use as reference inputs, and then generates $N$ corresponding LIME linear regression models by perturbing super-pixels (contiguous pixel patches) and querying the target model with the perturbed inputs. These local models are then aggregated into signatures, over which a distance metric such as cosine similarity, or the $L_1$/$L_2$/$L_\infty$ norm of the difference, is applied.
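To make the signature construction concrete, the sketch below is a minimal Python approximation of this pipeline (helper names are illustrative, the segmentation is assumed to be given, and only a single output class is regressed for brevity; this is not the reference Zest implementation):

```python
# Minimal sketch of a Zest-style signature and distance (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge

def lime_weights(model_fn, image, segments, num_samples=200, rng=None):
    """Fit a local linear model over random on/off super-pixel perturbations."""
    rng = rng or np.random.default_rng(0)
    n_segments = int(segments.max()) + 1
    masks = rng.integers(0, 2, size=(num_samples, n_segments))  # which patches stay on
    baseline = image.mean(axis=(0, 1))                          # fill value for "off" patches
    outputs = []
    for mask in masks:
        perturbed = image.copy()
        for s in np.where(mask == 0)[0]:
            perturbed[segments == s] = baseline
        outputs.append(model_fn(perturbed[None])[0])            # query the target model
    y = np.asarray(outputs)[:, 0]                               # score of one class, for brevity
    return Ridge(alpha=1.0).fit(masks, y).coef_                 # LIME linear weights

def zest_distance(sig_a, sig_b, metric="cosine"):
    """Distance between two model signatures (concatenated LIME weight vectors)."""
    if metric == "cosine":
        return 1 - sig_a @ sig_b / (np.linalg.norm(sig_a) * np.linalg.norm(sig_b))
    order = {"l1": 1, "l2": 2, "linf": np.inf}[metric]
    return np.linalg.norm(sig_a - sig_b, ord=order)
```

In the full method, the $N$ per-image weight vectors computed for a given model would be concatenated into a single signature before one of the above distances is applied.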
B. Black-box Attacks
Many common adversarial example generation techniques [7], [8], [9] are developed in a white-box scenario, where the adversary can compute gradients of the model's loss with respect to its inputs. They generally lead to effective attacks, able to alter the prediction of the model with extremely limited perturbation budgets, and are very useful in estimating the robustness of both existing models and proposed defensive systems. However, they are generally not applicable in realistic scenarios where the adversary's only interface with the victim model is an API designed by the model owner, which usually returns only the classifier's output and is provided under a pay-per-query model. Efforts to circumvent these limitations led to the development of two main strategies: gradient-free methods and transfer attacks.
Examples of the first class include methods based on generating adversarial examples through zeroth-order optimization techniques, such as ZOO [10] and AutoZOOM [3]. Another commonly used technique, HopSkipJump [4], focuses on estimating the direction of the gradient by analyzing the outputs of the model in the proximity of the decision boundary. Square Attack [11], on the other hand, approaches the problem by adapting a randomized search scheme where each perturbation is confined to a small square of pixels. These methods are generally characterized by a rather large number of queries to the victim model for each adversarial example generated.
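As a rough illustration of why this class of attacks is query-hungry, the following sketch (assuming a loss_fn that wraps victim queries into a scalar loss; this mirrors the idea behind ZOO rather than its exact implementation) estimates a gradient purely through finite differences, at a cost of two queries per probed coordinate:

```python
# Sketch of zeroth-order gradient estimation via symmetric finite differences.
# `loss_fn` is an assumed wrapper that queries the victim and returns a scalar loss.
import numpy as np

def zo_gradient_estimate(loss_fn, x, num_coords=128, h=1e-4, rng=None):
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(x, dtype=float)
    flat = grad.reshape(-1)                                    # view into grad
    coords = rng.choice(x.size, size=num_coords, replace=False)
    for i in coords:
        e = np.zeros(x.size)
        e[i] = h
        e = e.reshape(x.shape)
        flat[i] = (loss_fn(x + e) - loss_fn(x - e)) / (2 * h)  # two victim queries
    return grad

# An iterative attack would then take (projected) steps along this estimate, e.g.
# x_adv = np.clip(x_adv + alpha * np.sign(zo_gradient_estimate(loss_fn, x_adv)), 0, 1)
```

Probing even a modest fraction of an image's coordinates in this way already requires hundreds of queries per iteration, which is the cost the transfer-based strategy below tries to avoid.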
The second strategy, studied by [5], [12], [13], revolves around using one or multiple local proxy (or surrogate) models to compute a set of adversarial examples, and then using these locally generated points to attack the victim model. The use of proxy models means that the adversary is not limited in the technique used to generate the evasive samples, and can leverage the full power of gradient-based methods. This approach allows the attacker to minimize the number of queries to the victim model, as they are free to generate a vast quantity of evasive points locally without issuing any queries at all. However, the rate at which the generated samples transfer successfully to the victim varies considerably with the proxy model used in the generation phase, and is generally hard to predict.
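A minimal sketch of this workflow is given below, assuming a local PyTorch proxy model, a victim_api callable returning output scores, and a data loader (all illustrative names): the adversarial examples are crafted entirely on the proxy with a one-step FGSM, and the victim is queried only once per example to check whether the attack transferred.

```python
# Sketch of a transfer attack: white-box crafting on a local proxy,
# one victim query per example to measure the transfer rate.
import torch
import torch.nn.functional as F

def fgsm_on_proxy(proxy, x, y, eps=8 / 255):
    """One-step FGSM computed entirely on the local proxy (no victim queries)."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(proxy(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def transfer_rate(proxy, victim_api, loader, eps=8 / 255):
    evaded, total = 0, 0
    for x, y in loader:
        x_adv = fgsm_on_proxy(proxy, x, y, eps)        # local, query-free
        victim_pred = victim_api(x_adv).argmax(dim=1)  # the only victim queries
        evaded += (victim_pred != y).sum().item()
        total += y.numel()
    return evaded / total
```

The same structure applies to stronger white-box attacks (e.g., multi-step PGD on the proxy); the variable part, as noted above, is how well the resulting examples transfer.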
III. THREAT MODEL
In this work we are interested in a realistic scenario where the adversary wishes to craft evasive samples for a victim model while having limited, query-only access to it, often referred to in the literature as black-box. This means that the adversary can send requests to the victim model and retrieve the classifier's output scores for each query. The number of queries the adversary can make is strictly limited by their resources, as they will have to pay a fixed cost to run each query. This is an extremely common situation, as most deployed machine learning systems provide access through a paid API, which generally returns only the output scores for each query.¹ While there are a multitude of works in the adversarial ML literature exploring a variety of other threat models, we argue that this represents one of the most realistic and widespread scenarios.
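The assumed interface can be pictured as a thin wrapper of the following kind (an illustrative sketch; the class name, cost, and budget values are hypothetical), which exposes only the output scores and debits a fixed amount per query:

```python
# Hypothetical pay-per-query wrapper around a deployed classifier.
class PayPerQueryVictim:
    def __init__(self, model_fn, cost_per_query=0.001, budget=100.0):
        self.model_fn = model_fn          # the deployed classifier behind the API
        self.cost_per_query = cost_per_query
        self.budget = budget
        self.spent = 0.0

    def query(self, batch):
        cost = self.cost_per_query * len(batch)
        if self.spent + cost > self.budget:
            raise RuntimeError("query budget exhausted")
        self.spent += cost
        return self.model_fn(batch)       # only output scores are exposed
```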
A. Problem statement
Let us assume the adversary is in control of a number $n$ of different classification models, the proxies $\{p_1, \dots, p_n\}$, trained on a data distribution similar to that of the victim model $f$. We consider a transfer attack successful if the perturbed data point, generated using only information gathered through the local proxy model, induces a misclassification in the target model $f$. Therefore, the adversary's objective is to generate a set of adversarial examples $A = \{a_1, \dots, a_m\}$, starting from clean test points $D_t = \{(x_1, y_1), \dots, (x_m, y_m)\}$, such that the largest possible number of them are evasive for $f$:

$$A_j = \sum_{i=1}^{m} \mathbb{1}\!\left[f(a_i) \neq y_i\right], \qquad A = \arg\max_{p \in P} A_p \quad (1)$$
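Equation (1) translates directly into the following selection procedure (a hedged sketch; craft_with_proxy stands in for an arbitrary white-box attack run on a proxy): for each proxy, count how many of its adversarial examples evade $f$, and keep the proxy achieving the highest count.

```python
# Sketch of Eq. (1): per-proxy transfer success count and best-proxy selection.
import numpy as np

def transfer_successes(f, adv_examples, labels):
    """A_j = sum_i 1[f(a_i) != y_i] for one proxy's adversarial set."""
    preds = np.argmax(f(adv_examples), axis=1)
    return int(np.sum(preds != labels))

def best_proxy(f, proxies, clean_x, clean_y, craft_with_proxy):
    """Return the index of the proxy whose adversarial set maximizes A_j."""
    scores = {j: transfer_successes(f, craft_with_proxy(p, clean_x, clean_y), clean_y)
              for j, p in enumerate(proxies)}
    return max(scores, key=scores.get)    # arg max over proxies
```

Note that evaluating this arg max directly requires querying the victim with every candidate adversarial set, which is exactly the query cost that the Zest-based surrogate selection studied in the following hypotheses aims to avoid.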
In this work we formulate and empirically analyze the following hypotheses:
(H1) Pairs of models $(p_j, f)$ with similar architectures will show, on average, lower Zest distances.
(H2) There is a negative correlation between the Zest distance of a pair $(p_j, f)$ and the successful transfer rate of adversarial examples from $p_j$ to $f$.
Testing these hypotheses is critical in determining whether Zest distances between models can be used to reduce the cost of black-box adversarial attacks. (H1) is informative in characterizing the relationship between models built on similar architectures. (H2) implies that an adversary can directly leverage Zest distances as a source of information to select the best possible surrogate when targeting a black-box model.
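(H2) in particular suggests a simple empirical test, sketched below under the assumption that a Zest distance and a transfer rate can be computed for every (proxy, victim) pair (e.g., with helpers along the lines of the earlier sketches): a significantly negative rank correlation across proxies would support the hypothesis.

```python
# Sketch of an empirical check of (H2): rank correlation between Zest distance
# and transfer rate across proxies. Both callables are assumed to exist.
from scipy.stats import spearmanr

def check_h2(proxies, victim_api, zest_distance_to_victim, transfer_rate_fn):
    distances = [zest_distance_to_victim(p) for p in proxies]
    rates = [transfer_rate_fn(p, victim_api) for p in proxies]
    rho, p_value = spearmanr(distances, rates)
    return rho, p_value   # (H2) predicts rho < 0 with a small p-value
```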
IV. METHODOLOGY
The process followed by the attacker to increase the cost-effectiveness of their campaigns is very simple, and is summarized in Algorithm 1. It starts with acquiring a large number of models trained for a task similar to that of the victim model. For each collected model, the attacker computes the respective LIME representation and stores it for later reuse. Note that this process does not have to be temporally bounded. The adversary can
¹ In the rarer cases in which the target model returns only categorical labels, LIME would not be applicable, and the adversary would have to fall back to gradient-free methods.