
[4,57]. Different architectural choices behave as different frequency filters (e.g., self-attention acts as a low-pass filter, while convolution acts as a high-pass filter) [78]; thus, we can expect that different choices of architectural components will affect model vulnerability, e.g., vulnerability to high-frequency perturbations. If we can measure how the adversarial vulnerabilities of models differ, we can also measure how dissimilar the networks are.
To measure how model vulnerabilities differ, we employ adversarial attack transferability (AT), which indicates whether an adversarial sample crafted on one model can fool another model. The more similar two models are, the higher their AT [26,65,84]. Conversely, because adversarial attacks target vulnerable points that vary with the architectural components of DNNs [33,49,50,57], the AT between two dissimilar models is lower. Furthermore, attack transferability can serve as a good approximation for measuring differences in input gradients [70], decision boundaries [56], and loss landscapes [26], which are widely used techniques for understanding model behavior and the similarity between models, as discussed in the related work section. While previous approaches are limited by non-quantitative analyses, inherent noise, and high computational costs, adversarial transferability provides a quantitative measure with low variance and low computational cost.
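
As a concrete illustration, the sketch below estimates one direction of attack transferability, $\mathrm{acc}_{B \to A}$: craft adversarial samples on a source model with an $L_\infty$ PGD attack and evaluate a target model on them. This is a minimal PyTorch sketch, not our evaluation code; the helper names (`pgd_linf`, `acc_transfer`) and the inputs `model_a`, `model_b`, and `loader` are illustrative assumptions, and any sufficiently strong attack could be substituted.

```python
# Illustrative sketch (not the paper's released code): one direction of
# attack transferability, acc_{source -> target}, with a hand-rolled L-inf PGD.
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """L-inf PGD: ascend the cross-entropy loss, then project back into the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # stay within ||x_adv - x||_inf <= eps
            x_adv = x_adv.clamp(0, 1)                 # keep a valid pixel range
    return x_adv.detach()

@torch.no_grad()
def accuracy(model, x, y):
    """Top-1 accuracy (in %) of `model` on the batch (x, y)."""
    return (model(x).argmax(dim=1) == y).float().mean().item() * 100

def acc_transfer(source, target, loader, device="cuda"):
    """acc_{source -> target}: accuracy of `target` on adversarial samples crafted on `source`."""
    batch_accs = []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_linf(source, x, y)  # the attack only sees the source model
        batch_accs.append(accuracy(target, x_adv, y))
    return sum(batch_accs) / len(batch_accs)

# Example: acc_{B -> A}; a low value suggests A and B share vulnerable directions.
# acc_b_to_a = acc_transfer(model_b, model_a, loader)
```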
We propose a new similarity function that utilizes attack transferability, named Similarity by Attack Transferability (SAT), providing a reliable, easy-to-conduct, and scalable method for measuring the similarity between neural architectures. Formally, we generate adversarial samples $x_A$ and $x_B$ of models $A$ and $B$ for a given input $x$. Then, we measure the accuracy of model $A$ on the adversarial samples crafted for model $B$ (denoted $\mathrm{acc}_{B \to A}$). If $A$ and $B$ are the same, then $\mathrm{acc}_{B \to A}$ will be zero, provided the adversary can fool model $B$ perfectly. On the other hand, if the input gradients of $A$ and $B$ differ significantly, the performance drop will be negligible because the adversarial sample remains close to the original image (i.e., $\|x - x_B\| \le \varepsilon$). Let $X_{AB}$ be the set of inputs on which both $A$ and $B$ predict correctly, $y$ be the ground-truth label, and $\mathbb{I}(\cdot)$ be the indicator function. We measure SAT between two different models by:
\[
\mathrm{SAT}(A, B) = \log \max\Bigl(\varepsilon_s,\; 100 \times \frac{1}{2\,|X_{AB}|} \sum_{x \in X_{AB}} \bigl\{ \mathbb{I}\bigl(A(x_B) \neq y\bigr) + \mathbb{I}\bigl(B(x_A) \neq y\bigr) \bigr\} \Bigr), \tag{1}
\]
where $\varepsilon_s$ is a small scalar value. If $A = B$ and we have an oracle adversary, then $\mathrm{SAT}(A, A) = \log 100$. In practice, a strong adversary (e.g., PGD [70] or AutoAttack [23]) can easily achieve nearly zero accuracy if a model is not trained with an adversarial-attack-aware strategy [22,70]. Meanwhile, if the adversarial attacks on $A$ are not transferable to $B$ and vice versa, then $\mathrm{SAT}(A, B) = \log \varepsilon_s$.
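
Putting Eq. (1) together end to end, the hedged sketch below computes SAT from the cross-fooling rates on $X_{AB}$. It reuses the illustrative `pgd_linf` helper from the earlier sketch; `model_a`, `model_b`, `loader`, and the default `eps_s` are assumptions, and the PGD attack can be swapped for AutoAttack or any other strong adversary.

```python
import math
import torch

def sat(model_a, model_b, loader, eps_s=1e-3, device="cuda"):
    """Sketch of Eq. (1): SAT(A, B) from the cross-fooling rates on X_AB."""
    fooled, n_ab = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        with torch.no_grad():  # X_AB: keep only inputs both models classify correctly
            keep = (model_a(x).argmax(1) == y) & (model_b(x).argmax(1) == y)
        x, y = x[keep], y[keep]
        if x.size(0) == 0:
            continue
        x_a = pgd_linf(model_a, x, y)  # adversarial samples crafted on A
        x_b = pgd_linf(model_b, x, y)  # adversarial samples crafted on B
        with torch.no_grad():
            fooled += (model_a(x_b).argmax(1) != y).sum().item()  # I(A(x_B) != y)
            fooled += (model_b(x_a).argmax(1) != y).sum().item()  # I(B(x_A) != y)
        n_ab += x.size(0)
    rate = 100.0 * fooled / (2 * n_ab)  # average fooling rate in [0, 100]
    return math.log(max(eps_s, rate))   # log max(eps_s, .) as in Eq. (1)

# SAT is symmetric by construction: sat(model_a, model_b, loader) and
# sat(model_b, model_a, loader) agree up to attack randomness.
```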
Ideally, we aim to define a similarity $d$ between two models with the following properties: (1) $n = \arg\min_m d(n, m)$, (2) $d(n, m) = d(m, n)$, and (3) $d(n, m) > d(n, n)$ if $n \neq m$. If the adversary is perfect, then $\mathrm{acc}_{A \to A}$ will be zero, which is the minimum because accuracy is non-negative. “$\mathrm{acc}_{A \to B} + \mathrm{acc}_{B \to A}$” is symmetric, and thus SAT is symmetric. Finally, SAT satisfies $d(n, m) \ge d(n, n)$ if $n \neq m$, which is a weaker condition than (3).
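
For concreteness, writing $d(A, B) := \mathrm{acc}_{A \to B} + \mathrm{acc}_{B \to A}$ for the transferred-accuracy term used in the argument above (an illustrative shorthand, not a new definition), these checks read:
\[
d(A, A) = \mathrm{acc}_{A \to A} + \mathrm{acc}_{A \to A} = 0 \;\le\; \mathrm{acc}_{A \to B} + \mathrm{acc}_{B \to A} = d(A, B),
\qquad
d(A, B) = \mathrm{acc}_{A \to B} + \mathrm{acc}_{B \to A} = \mathrm{acc}_{B \to A} + \mathrm{acc}_{A \to B} = d(B, A),
\]
where the first chain assumes a perfect adversary ($\mathrm{acc}_{A \to A} = 0$) and uses the non-negativity of accuracy.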