training can only slightly delay the attacker’s success. As
prior work discusses, the attacker’s success is largely due to
the transferability of adversarial examples [2]. We investigate
this phenomenon more thoroughly through the lens of Ares
and discover that the shared loss gradients between networks, regardless of training method or model architecture, are the main culprit. We then discuss how MTDs could be improved in light of this finding and outline our next steps toward evaluating MTDs and other prior works under a black-box threat model through Ares.
In this paper we make the following contributions:
• We develop Ares, an RL-based evaluation framework for adversarial ML that allows researchers to explore attack/defense strategies at a system level.
• Using Ares, we re-examine ensemble/moving target defense strategies under the white-box threat model and show that the root cause of their failure is the shared loss gradient between the networks.
The Ares framework is publicly available at https://github.com/Ethos-lab/ares and is under active development, with additional features and improvements planned.
II. BACKGROUND & RELATED WORK
Adversarial Evasion Attacks. Prior works have uncovered
several classes of vulnerabilities for ML models and designed
attacks to exploit them [13]. In this paper, we focus on
one such class of attacks known as evasion attacks. In an
evasion attack, the adversary’s goal is to generate an “ad-
versarial example” – a carefully perturbed input that causes
misclassification. Evasion attacks against ML models have
been developed to suit a wide range of scenarios. White-box
attacks [1], [2], [14], [15] assume full knowledge of/access to
the model, including but not limited to the model's architecture,
parameters, gradients, and training data. Such attacks, although
extremely potent, are mostly impractical in real-world scenar-
ios [16] as the ML models used in commercial systems are
usually hidden underneath a layer of system/network security
measures. Focusing on strengthening these security measures
not only provides improved protection for the underlying ML models against white-box attacks but also improves the overall security posture of the system, and is hence often a more practical and desirable approach. Black-box attacks [6], [17]–
[22], on the other hand, only assume query access to the
target ML models. Such a threat model is more practical, as several consumer-facing ML models provide such query access to their users [23]–[26].
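As a concrete illustration of the white-box setting, the following is a minimal PyTorch sketch of a single-step gradient-sign attack in the spirit of FGSM; the model, labels, and perturbation budget here are placeholders for illustration only and are not part of Ares.

import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, eps=0.03):
    # White-box access: the attacker can backpropagate through the target model.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Move each pixel by eps in the direction that increases the loss.
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()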
Defenses against Evasion Attacks. A wide range of strategies
to address the threat of adversarial evasion attacks have also
been proposed. One line of work tackles this issue
at test-time [3]–[5], [27], [28]. These works usually involve
variations of a preprocessing step that filters out the adversarial
noise from the input before feeding it to the ML model. These
defenses, however, have been shown to convey a false sense of security and have been easily broken using adaptive attacks [8].
Another popular strategy involves re-training the model
using a robustness objective [29]–[31]. The defenses that
employ this strategy show promise as they have (so far) stood
strong in the face of adaptive adversaries. All the defenses
discussed so far belong in the broad category of empirical
defenses. These defenses only provide empirical guarantees
of robustness and may not be secure against a future attack.
Another line of work develops methods that can
train certifiably robust ML models [32]–[34]. These models
can offer formal robustness guarantees against any attacker
with a pre-defined budget.
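To make the robustness objective concrete, a minimal sketch of one adversarial-training step in PyTorch is shown below; it follows the common min-max formulation (inner maximization via iterative gradient-sign steps, outer minimization on the perturbed batch), with all hyperparameters being placeholder values rather than those of any specific cited defense.

import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=0.03, step=0.007, iters=10):
    # Inner maximization: search for a worst-case perturbation inside the eps-ball.
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(iters):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + step * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    # Outer minimization: update the model on the adversarially perturbed batch.
    optimizer.zero_grad()
    F.cross_entropy(model((x + delta).clamp(0, 1)), y).backward()
    optimizer.step()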
Defenses based on Ensembling. One commonly known
property of adversarial examples is that they can similarly
fool models independently trained on the same data [2].
Adversaries can exploit this property by training a surrogate
model to generate adversarial examples against a target model.
This, in fact, is a popular strategy used by several black-box
attacks [17], [19], [20]. Tramèr et al. [35] use this property
to improve the black-box robustness of models trained using
the single-step attack version of adversarial training. At each
training iteration, the source of adversarial examples is randomly selected from an ensemble containing the model currently being trained and a set of pre-trained models. Other works [36]–[39]
propose strategies for training a diverse pool of models so
that it is difficult for an adversarial example to transfer across
the majority of them. Aggregating the outputs of these di-
verse models should therefore yield improved robustness. This
ensemble diversity strategy, however, has been shown to be
ineffective [40], [41]. In a similar vein, some prior works [42], [43] propose using an ensemble of models as a moving target defense where, depending on the MTD strategy, the attacker may face a different target model in each encounter. These works, unfortunately, suffer from the same shortcomings as the ensemble methods.
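For concreteness, the surrogate-based transfer strategy discussed above can be sketched as follows, reusing the fgsm_example sketch from earlier; the surrogate and target models are placeholders, and the snippet is illustrative rather than a reproduction of any cited attack.

import torch

def transfer_attack_success(surrogate_model, target_model, x, y, eps=0.03):
    # Craft adversarial examples with full (white-box) access to the surrogate only.
    x_adv = fgsm_example(surrogate_model, x, y, eps=eps)  # sketch defined earlier
    # The target is treated as a black box: the attacker only observes its predictions.
    with torch.no_grad():
        preds = target_model(x_adv).argmax(dim=1)
    # Fraction of adversarial examples that transfer to the target model.
    return (preds != y).float().mean().item()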
Adversarial ML Libraries. To facilitate research into machine
learning security, multiple research groups and organizations
have developed libraries to assist in development and evalua-
tion of adversarial attacks and defenses. Most notable among these are the University of Toronto's CleverHans [9], MIT's robustness package [10], the University of Tübingen's Foolbox [11], and IBM's Adversarial Robustness Toolbox (ART) [12].
These efforts are orthogonal to our framework. While Ares
focuses on evaluating various attacker and defender strategies
against one another across multiple scenarios, these libraries
focus primarily on facilitating the implementation of new attacks
and defenses and benchmarking them against existing ones.
In this paper, we use the Projected Gradient Descent (PGD)
attack from IBM's ART library as our main adversarial evaluation criterion.
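For reference, a minimal sketch of how such an evaluation can be wired up with ART is given below; the wrapped model, input shape, and attack hyperparameters are placeholder values and do not reflect our exact experimental configuration.

import numpy as np
import torch.nn as nn
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import ProjectedGradientDescent

# Wrap a trained PyTorch model so that ART's attacks can query it.
classifier = PyTorchClassifier(
    model=model,                      # placeholder: any trained nn.Module
    loss=nn.CrossEntropyLoss(),
    input_shape=(3, 32, 32),          # placeholder input shape
    nb_classes=10,                    # placeholder number of classes
    clip_values=(0.0, 1.0),
)

# L-infinity PGD with placeholder budget, step size, and iteration count.
attack = ProjectedGradientDescent(estimator=classifier, norm=np.inf,
                                  eps=8 / 255, eps_step=2 / 255, max_iter=20)
x_adv = attack.generate(x=x_test)     # x_test: numpy array of clean inputs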
III. ARES FRAMEWORK
In this section, we provide an overview of the Ares
framework. As seen in Figure 1, Ares adapts the adversarial
attack/defense problem into an RL environment consisting
of three main components: (1) the evaluation scenario, (2)
the attacker agent, and (3) the defender agent. Once each
component has been defined by the user, Ares executes a
series of competitions between the attacker and defender. Each