Bad Citrus: Reducing Adversarial Costs with Model
Distances
Giorgio Severi
Northeastern University
Will Pearce
Nvidia
Alina Oprea
Northeastern University
Abstract—Recent work by Jia et al. [1] showed the possibility
of effectively computing pairwise model distances in weight space,
using a model explanation technique known as LIME. This
method requires query-only access to the two models under
examination. We argue this insight can be leveraged by an
adversary to reduce the net cost (number of queries) of launching
an evasion campaign against a deployed model. We show that
there is a strong negative correlation between the success rate of
adversarial transfer and the distance between the victim model
and the surrogate used to generate the evasive samples. Thus,
we propose and evaluate a method to reduce adversarial costs
by finding the closest surrogate model for adversarial transfer.
I. INTRODUCTION
Evasion attacks [2], often referred to as adversarial examples, have been a strong focus of machine learning (ML) researchers for quite some time now. Despite the large body of work on the subject, launching evasion campaigns, that is, finding multiple adversarial examples for a given deployed model, remains a non-trivial task, especially when the adversary is given only query access to the victim. There are two main strategies to tackle this issue: (i) using an attack method based on a zeroth-order optimization technique, such as AutoZOOM [3] and HopSkipJump [4]; (ii) crafting the adversarial examples on a local surrogate (proxy) model, using a gradient-based approach, and transferring [5] the generated evasive points to the victim.
As with most practical applications of adversarial machine learning, both strategies come with specific trade-offs. Gradient-free methods, while effective, are limited by their strict threat model: they generally require either a large number of queries to the victim model for each adversarial example, or they tend to create more distorted points than their gradient-based counterparts, or both. Adversarial transfer, on the other hand, allows the attacker to leverage the full power of gradient-based methods, and enables them to generate a large number of evasive samples without inducing high query volumes. However, the success rate of transferred samples is not guaranteed to be satisfactory, leading to potentially failed attack campaigns. Therefore, an adversary wishing to launch an evasion campaign against a deployed ML classifier has to carefully consider the costs of query API usage, and must ensure they are not identified by anomaly detection systems deployed to monitor incoming queries to the victim model.
Given this cost landscape for the adversary, any technique aimed at increasing the rate of successful transfer of adversarial examples crafted locally on proxy models would result in a direct decrease in the cost of running an attack campaign. Based on this insight, we argue that the recently proposed "Zest of LIME" paper by Jia et al. [1] offers an intriguing new approach for the adversary to minimize their costs. [1] proposes a methodology to compute distances between pairs of models, given only access to their outputs, using Local Interpretable Model-agnostic Explanations (LIME) [6], which builds local linear models based on the answers of the target model to specific queries. A small set of representative points (N = 128 in the paper) is used to build N linear regression models approximating the classifier around the chosen points, forming its LIME representation (signature). These representations are then used to compute the distance between two target classifiers. While this process does require the adversary to pay a cost for the queries used to estimate the LIME representation, that cost is paid only once: the learned representation can be saved and re-used at any time. Moreover, the adversary can progressively collect LIME representations for an abundance of potential proxy models, either by downloading them from model hubs or by manually training diverse models locally, and, over time, build a library of representations to compare against new victims. This induces an economy-of-scale effect: the larger the adversarial library becomes, the easier it is for the adversary to find good proxy models for generating evasive samples.
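To make the signature idea concrete, the following sketch builds and caches a simplified LIME-style signature for a query-only model. It is a stand-in for the actual Zest/LIME procedure (which perturbs super-pixels of images rather than raw feature vectors); the function names, the toy model, and the sampling parameters are illustrative assumptions, not part of [1] or [6].

```python
# Minimal sketch: build and cache a LIME-style signature for a query-only model.
# Simplified stand-in for the procedure in [1], [6]; here we perturb raw features
# instead of image super-pixels. `query_model` represents the paid black-box API.
import numpy as np
from sklearn.linear_model import Ridge

def local_linear_model(query_model, x, n_perturb=200, scale=0.1, seed=0):
    """Fit one local linear surrogate around a single reference point x."""
    rng = np.random.default_rng(seed)
    perturbations = x + scale * rng.standard_normal((n_perturb, x.shape[0]))
    scores = np.array([query_model(p) for p in perturbations])  # paid queries
    # Weighted ridge regression: perturbations closer to x get higher weight.
    weights = np.exp(-np.linalg.norm(perturbations - x, axis=1) ** 2 / scale)
    reg = Ridge(alpha=1.0).fit(perturbations, scores, sample_weight=weights)
    return np.concatenate([reg.coef_, [reg.intercept_]])

def build_signature(query_model, reference_points):
    """Stack the N local linear models into one signature matrix."""
    return np.stack([local_linear_model(query_model, x) for x in reference_points])

if __name__ == "__main__":
    # Toy "remote" model standing in for the black-box victim or a proxy.
    w_true = np.array([0.5, -1.2, 2.0])
    toy_model = lambda x: float(1 / (1 + np.exp(-x @ w_true)))
    refs = np.random.default_rng(1).standard_normal((8, 3))  # N reference points
    sig = build_signature(toy_model, refs)
    np.save("proxy_signature.npy", sig)  # add to the attacker's library
```

Because the signature is stored once built, the query cost of estimating it is paid a single time per model, which is exactly the amortization argument made above.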
II. BACKGROUND AND RELATED WORK
In this section we provide a short introduction to the concepts
behind Zest distances and the main approaches currently used
for carrying out black-box evasion attacks.
A. Model Distances with Zest
Local Interpretable Model-agnostic Explanations (LIME) [6] is a model interpretability approach focused on training surrogate linear models to locally approximate the predictions of the model under analysis. LIME is architecture-agnostic and only requires query access to the model, making it viable for use with a remote target behind an API.
Recent work by Jia et al. [1] proposes a new approach, called Zest, to estimate the similarity between different models based on LIME. Zest offers a variety of advantages with respect to previous model comparison methods. First, it is generally easier to apply than direct weight comparison, as the latter requires both full access to the weight matrices and that the models under scrutiny share the same architecture.
It is also less susceptible to inconsistencies related to the selection of representative inputs than other methods based on comparing model predictions. At its core, Zest samples N images from the training set to use as reference inputs, and then generates N corresponding LIME linear regression models by perturbing super-pixels (contiguous patches of pixels) and querying the target model with the perturbed inputs. These local models are then aggregated into signatures, over which a distance metric, such as cosine similarity or the L1, L2, or L∞ norm of the difference, is applied.
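A minimal sketch of this final comparison step is given below, assuming two signatures computed as in the earlier sketch and flattened into vectors; the exact normalization and aggregation used by Zest may differ.

```python
# Compare two flattened signatures with the metrics mentioned for Zest:
# cosine distance and the L1 / L2 / Linf norms of the difference.
import numpy as np

def signature_distances(sig_a, sig_b):
    a, b = sig_a.ravel(), sig_b.ravel()
    cosine = 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    diff = a - b
    return {
        "cosine": cosine,
        "l1": np.linalg.norm(diff, 1),
        "l2": np.linalg.norm(diff, 2),
        "linf": np.linalg.norm(diff, np.inf),
    }
```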
B. Black-box Attacks
Many common adversarial example generation techniques [7], [8], [9] are developed in a white-box scenario, where the adversary can compute gradients of the model's loss with respect to its inputs. They generally lead to effective attacks, able to alter the prediction of the model with extremely limited perturbation budgets, which makes them very useful for estimating the robustness of both existing models and proposed defensive systems. However, they are generally not applicable in realistic scenarios where the adversary's only interface with the victim model is an API designed by the model owner, which usually returns only the classifier's output and is provided under a pay-per-query model. Efforts to circumvent these limitations led to the development of two main strategies: gradient-free methods and transfer attacks.
Examples of the first class include methods based on generating adversarial examples through zeroth-order optimization techniques, such as ZOO [10] and AutoZOOM [3]. Another commonly used technique, HopSkipJump [4], focuses on estimating the direction of the gradient by analyzing the outputs of the model in the proximity of the decision boundary. Square Attack [11], on the other hand, approaches the problem with a randomized search scheme where each perturbation is confined to a small square of pixels. These methods are generally characterized by a rather large number of queries to the victim model for each adversarial example generated.
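For intuition on where this query cost comes from, the sketch below shows a generic coordinate-wise finite-difference gradient estimate. It is not the estimator used by ZOO, AutoZOOM, or HopSkipJump, but it illustrates why per-example query budgets grow with the input dimension.

```python
# Generic zeroth-order gradient estimate of a black-box loss over a flat
# input vector x: two queries per coordinate per estimation step, so the
# query count scales linearly with the input dimension.
import numpy as np

def finite_difference_gradient(loss_fn, x, eps=1e-3):
    grad = np.zeros_like(x)
    for i in range(x.size):          # 2 * x.size queries per call
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (loss_fn(x + e) - loss_fn(x - e)) / (2 * eps)
    return grad
```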
The second strategy, studied by [5], [12], [13], revolves around using one or multiple local proxy (or surrogate) models to compute a set of adversarial examples, and then using these locally generated points to attack the victim model. The use of proxy models means that the adversary is not limited in which technique to use to generate the evasive samples, and can leverage the full power of gradient-based methods. This approach allows the attacker to minimize the number of queries to the victim model, as they are free to generate a vast quantity of evasive points locally without repeatedly issuing large query volumes. However, the rate at which the generated samples transfer successfully to the victim varies considerably with the proxy model used in the generation phase, and is generally hard to predict.
III. THREAT MODEL
In this work we are interested in a realistic scenario where the adversary wishes to craft evasive samples for a victim model while having limited, query-only access to it, often referred to in the literature as black-box access. This means that the adversary can send requests to the victim model and retrieve the classifier's output scores for each query. The number of queries the adversary can make is strictly limited by their resources, as they have to pay a fixed cost to run each query. This is an extremely common situation, as most deployed machine learning systems provide access through a paid API, which generally returns only the output scores for each query¹. While there are a multitude of works in the adversarial ML literature exploring a variety of other threat models, we argue that this represents one of the most realistic and widespread scenarios.
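To make the pay-per-query cost model concrete, one can imagine wrapping the victim's scoring endpoint as below; the interface and the per-query price are invented for illustration and do not correspond to any specific API.

```python
# Hypothetical wrapper around a score-returning victim API that tracks the
# number of queries, and therefore the cost, the attacker accumulates.
class MeteredVictim:
    def __init__(self, score_fn, price_per_query=0.001):
        self.score_fn = score_fn            # black box: returns output scores only
        self.price_per_query = price_per_query
        self.queries = 0

    def __call__(self, x):
        self.queries += 1                   # every call is a paid query
        return self.score_fn(x)

    @property
    def spend(self):
        return self.queries * self.price_per_query
```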
A. Problem Statement
Let us assume the adversary is in control of a number n of different classification models, the proxies P = {p_1, ..., p_n}, trained on a data distribution similar to that of the victim model f. We consider a transfer attack successful if the perturbed data point, generated using only information gathered through the local proxy model, induces a misclassification in the target model f. Therefore, the adversary's objective is to generate a set of adversarial examples A = {a_1, ..., a_m}, starting from clean test points D_t = {(x_1, y_1), ..., (x_m, y_m)}, such that the largest possible number of them are evasive for f:

A_j = \sum_{i=1}^{m} \mathbb{1}\left[ f(a_i) \neq y_i \right], \qquad A^{*} = \arg\max_{p \in P} A_p \qquad (1)

where the adversarial examples a_i in A_j are those crafted on proxy p_j.
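Equation (1) translates directly into code; the victim predictions and labels below are placeholders for the quantities defined above.

```python
# Equation (1) in code: count how many crafted points evade the victim f,
# then pick the proxy whose crafted set evades most often.
import numpy as np

def evasive_count(victim_preds, true_labels):
    """A_j = sum_i 1[f(a_i) != y_i]."""
    return int(np.sum(np.asarray(victim_preds) != np.asarray(true_labels)))

def best_proxy(counts_per_proxy):
    """A* = argmax_p A_p, with counts_per_proxy a dict {proxy_name: A_p}."""
    return max(counts_per_proxy, key=counts_per_proxy.get)
```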
In this work we formulate and empirically analyze the
following hypotheses:
(H1) Pairs of models (p_j, f) with similar architectures will show, on average, lower Zest distances.
(H2) There is a negative correlation between the Zest distance of a pair (p_j, f) and the successful transfer rate of adversarial examples from p_j to f.
Testing these hypotheses is critical in determining whether Zest distances between models can be used to reduce the cost of black-box adversarial attacks. (H1) is informative in determining the relationships between models built on similar architectures. (H2) implies that an adversary can directly leverage Zest distances as a source of information to select the best possible surrogate when targeting a black-box model.
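The snippet below sketches how (H2) could be checked empirically, by correlating per-proxy Zest distances with the observed transfer success rates; the numbers are hypothetical and are not results from this paper.

```python
# Hypothetical check of (H2): correlate the Zest distance of each
# (proxy, victim) pair with the observed transfer success rate.
import numpy as np

zest_distance = np.array([0.12, 0.35, 0.50, 0.81, 0.95])   # one entry per proxy
transfer_rate = np.array([0.71, 0.55, 0.42, 0.20, 0.11])   # fraction of evasive points

r = np.corrcoef(zest_distance, transfer_rate)[0, 1]        # Pearson correlation
print(f"correlation = {r:.2f}")   # (H2) predicts a clearly negative value
```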
IV. METHODOLOGY
The process followed by the attacker to increase the cost-effectiveness of their campaigns is simple, and is summarized in Algorithm 1. It starts with acquiring a large number of models trained for a task similar to the victim's. For each collected model, they compute the respective LIME representation and store it for later reuse. Note that this process does not have to be temporally bounded. The adversary can keep collecting models and representations over time, progressively expanding the library used to select surrogates.
¹ In the rarer cases in which the target model returns only categorical labels, LIME would not be applicable, and the adversary would have to fall back to gradient-free methods.
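A minimal sketch of the selection step outlined above is given below (Algorithm 1 itself is not reproduced here): given the cached library of proxy signatures and a victim signature estimated once at query cost, the adversary picks the closest proxy as the transfer surrogate. The file layout and the choice of the L2 metric are assumptions.

```python
# Pick the cached proxy signature closest to the victim's signature.
# The victim signature would be estimated once, as in the earlier sketch,
# paying the corresponding query cost a single time.
import glob
import numpy as np

def pick_surrogate(victim_sig, library_glob="library/*.npy"):
    """Return the path and distance of the closest (L2) cached proxy signature."""
    best_path, best_dist = None, float("inf")
    for path in glob.glob(library_glob):
        proxy_sig = np.load(path)
        dist = np.linalg.norm(victim_sig.ravel() - proxy_sig.ravel())
        if dist < best_dist:
            best_path, best_dist = path, dist
    return best_path, best_dist
```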