
It is also less susceptible to inconsistencies related to the selection of representative inputs than other methods based on comparing model predictions. At its core, Zest samples $N$ images from the training set to use as reference inputs, and then generates $N$ corresponding LIME linear regression models by perturbing super-pixels (contiguous pixel patches) and querying the target model with the perturbed inputs. These local models are then aggregated into signatures, over which a distance metric such as cosine similarity, or the $L_1$/$L_2$/$L_\infty$ norm of the difference, is applied.
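To make the signature construction concrete, the sketch below is a minimal Python approximation of this pipeline (helper names are illustrative, the segmentation is assumed to be given, and only a single output class is regressed for brevity; this is not the reference Zest implementation):

```python
# Minimal sketch of a Zest-style signature and distance (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge

def lime_weights(model_fn, image, segments, num_samples=200, rng=None):
    """Fit a local linear model over random on/off super-pixel perturbations."""
    rng = rng or np.random.default_rng(0)
    n_segments = int(segments.max()) + 1
    masks = rng.integers(0, 2, size=(num_samples, n_segments))  # which patches stay on
    baseline = image.mean(axis=(0, 1))                          # fill value for "off" patches
    outputs = []
    for mask in masks:
        perturbed = image.copy()
        for s in np.where(mask == 0)[0]:
            perturbed[segments == s] = baseline
        outputs.append(model_fn(perturbed[None])[0])            # query the target model
    y = np.asarray(outputs)[:, 0]                               # score of one class, for brevity
    return Ridge(alpha=1.0).fit(masks, y).coef_                 # LIME linear weights

def zest_distance(sig_a, sig_b, metric="cosine"):
    """Distance between two model signatures (concatenated LIME weight vectors)."""
    if metric == "cosine":
        return 1 - sig_a @ sig_b / (np.linalg.norm(sig_a) * np.linalg.norm(sig_b))
    order = {"l1": 1, "l2": 2, "linf": np.inf}[metric]
    return np.linalg.norm(sig_a - sig_b, ord=order)
```

In the full method, the $N$ per-image weight vectors computed for a given model would be concatenated into a single signature before one of the above distances is applied.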
B. Black-box Attacks
Many common adversarial example generation techniques [7], [8], [9] are developed in a white-box scenario, where the adversary can compute gradients of the model's loss with respect to its inputs. They generally lead to effective attacks, able to alter the prediction of the model with extremely limited perturbation budgets, and are very useful in estimating the robustness of both existing models and proposed defensive systems. However, they are generally not applicable in realistic scenarios where the adversary's only interface with the victim model is an API designed by the model owner, which usually returns only the classifier's output and is provided under a pay-per-query model. Efforts to circumvent these limitations led to the development of two main strategies: gradient-free methods and transfer attacks.
Examples of the first class include methods based on generating adversarial examples through zeroth-order optimization techniques, such as ZOO [10] and AutoZOOM [3]. Another commonly used technique, HopSkipJump [4], focuses on estimating the direction of the gradient by analyzing the outputs of the model in the proximity of the decision boundary. Square Attack [11], on the other hand, approaches the problem by adapting a randomized search scheme where each perturbation is confined to a small square of pixels. These methods are generally characterized by a rather large number of queries to the victim model for each adversarial example generated.
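As a rough illustration of why this class of attacks is query-hungry, the following sketch (assuming a loss_fn that wraps victim queries into a scalar loss; this mirrors the idea behind ZOO rather than its exact implementation) estimates a gradient purely through finite differences, at a cost of two queries per probed coordinate:

```python
# Sketch of zeroth-order gradient estimation via symmetric finite differences.
# `loss_fn` is an assumed wrapper that queries the victim and returns a scalar loss.
import numpy as np

def zo_gradient_estimate(loss_fn, x, num_coords=128, h=1e-4, rng=None):
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(x, dtype=float)
    flat = grad.reshape(-1)                                    # view into grad
    coords = rng.choice(x.size, size=num_coords, replace=False)
    for i in coords:
        e = np.zeros(x.size)
        e[i] = h
        e = e.reshape(x.shape)
        flat[i] = (loss_fn(x + e) - loss_fn(x - e)) / (2 * h)  # two victim queries
    return grad

# An iterative attack would then take (projected) steps along this estimate, e.g.
# x_adv = np.clip(x_adv + alpha * np.sign(zo_gradient_estimate(loss_fn, x_adv)), 0, 1)
```

Probing even a modest fraction of an image's coordinates in this way already requires hundreds of queries per iteration, which is the cost the transfer-based strategy below tries to avoid.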
The second strategy, studied by [5], [12], [13], revolves around using one or multiple local proxy (or surrogate) models to compute a set of adversarial examples, and then using these locally generated points to attack the victim model. The use of proxy models means that the adversary is not limited in the technique used to generate the evasive samples, and can leverage the full power of gradient-based methods. This approach allows the attacker to minimize the number of queries to the victim model, as they are free to generate a vast quantity of evasive points locally without issuing any queries at all. However, the rate at which the generated samples transfer successfully to the victim varies considerably with the proxy model used in the generation phase, and is generally hard to predict.
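A minimal sketch of this workflow is given below, assuming a local PyTorch proxy model, a victim_api callable returning output scores, and a data loader (all illustrative names): the adversarial examples are crafted entirely on the proxy with a one-step FGSM, and the victim is queried only once per example to check whether the attack transferred.

```python
# Sketch of a transfer attack: white-box crafting on a local proxy,
# one victim query per example to measure the transfer rate.
import torch
import torch.nn.functional as F

def fgsm_on_proxy(proxy, x, y, eps=8 / 255):
    """One-step FGSM computed entirely on the local proxy (no victim queries)."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(proxy(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def transfer_rate(proxy, victim_api, loader, eps=8 / 255):
    evaded, total = 0, 0
    for x, y in loader:
        x_adv = fgsm_on_proxy(proxy, x, y, eps)        # local, query-free
        victim_pred = victim_api(x_adv).argmax(dim=1)  # the only victim queries
        evaded += (victim_pred != y).sum().item()
        total += y.numel()
    return evaded / total
```

The same structure applies to stronger white-box attacks (e.g., multi-step PGD on the proxy); the variable part, as noted above, is how well the resulting examples transfer.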
III. THREAT MODEL
In this work we are interested in a realistic scenario where the adversary wishes to craft evasive samples for a victim model while having limited, query-only access to it, often referred to in the literature as black-box. This means that the adversary can send requests to the victim model and retrieve the classifier's output scores for each query. The number of queries the adversary can make is strictly limited by their resources, as they will have to pay a fixed cost to run each query. This is an extremely common situation, as most deployed machine learning systems provide access through a paid API, which generally returns only the output scores for each query.¹ While there are a multitude of works in the adversarial ML literature exploring a variety of other threat models, we argue that this represents one of the most realistic and widespread scenarios.
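The assumed interface can be pictured as a thin wrapper of the following kind (an illustrative sketch; the class name, cost, and budget values are hypothetical), which exposes only the output scores and debits a fixed amount per query:

```python
# Hypothetical pay-per-query wrapper around a deployed classifier.
class PayPerQueryVictim:
    def __init__(self, model_fn, cost_per_query=0.001, budget=100.0):
        self.model_fn = model_fn          # the deployed classifier behind the API
        self.cost_per_query = cost_per_query
        self.budget = budget
        self.spent = 0.0

    def query(self, batch):
        cost = self.cost_per_query * len(batch)
        if self.spent + cost > self.budget:
            raise RuntimeError("query budget exhausted")
        self.spent += cost
        return self.model_fn(batch)       # only output scores are exposed
```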
A. Problem statement
Let us assume the adversary is in control of a number $n$ of different classification models, the proxies $\{p_1, \dots, p_n\}$, trained on a data distribution similar to that of the victim model $f$. We consider a transfer attack successful if the perturbed data point, generated using only information gathered through the local proxy model, induces a misclassification in the target model $f$. Therefore, the adversary's objective is to generate a set of adversarial examples $A = \{a_1, \dots, a_m\}$, starting from clean test points $D_t = \{(x_1, y_1), \dots, (x_m, y_m)\}$, such that the largest possible number of them are evasive for $f$:

$$A_j = \sum_{i=1}^{m} \mathbb{1}\!\left[f(a_i) \neq y_i\right], \qquad A = \arg\max_{p \in P} A_p \quad (1)$$
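Equation (1) translates directly into the following selection procedure (a hedged sketch; craft_with_proxy stands in for an arbitrary white-box attack run on a proxy): for each proxy, count how many of its adversarial examples evade $f$, and keep the proxy achieving the highest count.

```python
# Sketch of Eq. (1): per-proxy transfer success count and best-proxy selection.
import numpy as np

def transfer_successes(f, adv_examples, labels):
    """A_j = sum_i 1[f(a_i) != y_i] for one proxy's adversarial set."""
    preds = np.argmax(f(adv_examples), axis=1)
    return int(np.sum(preds != labels))

def best_proxy(f, proxies, clean_x, clean_y, craft_with_proxy):
    """Return the index of the proxy whose adversarial set maximizes A_j."""
    scores = {j: transfer_successes(f, craft_with_proxy(p, clean_x, clean_y), clean_y)
              for j, p in enumerate(proxies)}
    return max(scores, key=scores.get)    # arg max over proxies
```

Note that evaluating this arg max directly requires querying the victim with every candidate adversarial set, which is exactly the query cost that the Zest-based surrogate selection studied in the following hypotheses aims to avoid.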
In this work we formulate and empirically analyze the following hypotheses:
(H1) Pairs of models $(p_j, f)$ with similar architectures will show, on average, lower Zest distances.
(H2) There is a negative correlation between the Zest distance of a pair $(p_j, f)$ and the successful transfer rate of adversarial examples from $p_j$ to $f$.
Testing these hypotheses is critical in determining whether Zest distances between models can be used to reduce the cost of black-box adversarial attacks. (H1) is informative in characterizing the relationship between models built on similar architectures. (H2) implies that an adversary can directly leverage Zest distances as a source of information to select the best possible surrogate when targeting a black-box model.
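(H2) in particular suggests a simple empirical test, sketched below under the assumption that a Zest distance and a transfer rate can be computed for every (proxy, victim) pair (e.g., with helpers along the lines of the earlier sketches): a significantly negative rank correlation across proxies would support the hypothesis.

```python
# Sketch of an empirical check of (H2): rank correlation between Zest distance
# and transfer rate across proxies. Both callables are assumed to exist.
from scipy.stats import spearmanr

def check_h2(proxies, victim_api, zest_distance_to_victim, transfer_rate_fn):
    distances = [zest_distance_to_victim(p) for p in proxies]
    rates = [transfer_rate_fn(p, victim_api) for p in proxies]
    rho, p_value = spearmanr(distances, rates)
    return rho, p_value   # (H2) predicts rho < 0 with a small p-value
```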
IV. METHODOLOGY
The process followed by the attacker to increase the cost-effectiveness of their campaigns is very simple, and is summarized in Algorithm 1. It starts with acquiring a large number of models trained for a task similar to that of the victim model. For each collected model, the attacker computes the respective LIME representation and stores it for later reuse. Note that this process does not have to be temporally bounded. The adversary can
¹ In the rarer cases in which the target model returns only categorical labels, LIME would not be applicable, and the adversary would have to fall back to gradient-free methods.