
Observed Adversaries in Deep Reinforcement Learning
Eugene Lim and Harold Soh
National University of Singapore
13 Computing Drive
Singapore 117417
{elimwj,hsoh}@comp.nus.edu.sg
Presented at the AI-HRI Symposium at AAAI Fall Symposium Series (FSS) 2022
Abstract
In this work, we point out the problem of observed adversaries for deep policies. Specifically, recent work has shown that deep reinforcement learning is susceptible to adversarial attacks in which an observed adversary acts under environmental constraints to induce natural but adversarial observations. This setting is particularly relevant for human-robot interaction (HRI), since robots are expected to perform their tasks around and with other agents. We demonstrate that this effect persists even with low-dimensional observations. We further show that these adversarial attacks transfer across victims, which potentially allows malicious attackers to train an adversary without access to the target victim.
1 Introduction
Recent years have seen a significant gain in robot capabilities, driven in part by progress in artificial intelligence and machine learning. In particular, deep learning has emerged as a dominant methodology for crafting data-driven components in robot systems (Punjani and Abbeel 2015; Levine et al. 2016). However, the robustness of such methods has recently come under scrutiny. Specifically, concerns have been raised about the susceptibility of deep methods to adversarial attacks (Szegedy et al. 2013). For example, recent work has shown that small optimized pixel perturbations can drastically change the predictions of computer vision models (Szegedy et al. 2013; Goodfellow, Shlens, and Szegedy 2014).
In this work, we focus on deep reinforcement learning (DRL) (Mnih et al. 2015; Schulman et al. 2015, 2017), which has been used to obtain policies for various robot tasks, including those involving human-robot interaction (HRI) (Modares et al. 2016; Khamassi et al. 2018; Xie and Park 2021). Early work (Huang et al. 2017) showed that adversarially modified inputs (similar to those used against computer vision models) can be detrimental to agent behavior. Recently, Gleave et al. (2020) demonstrated that artificial agents are vulnerable under a more realistic threat model: natural observations that occur as a result of an adversary's behavior under environmental constraints. These observed adversaries are not able to arbitrarily modify the
victim's inputs, yet are able to significantly affect the victim's behavior.
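To make this threat model concrete, the sketch below illustrates how an observed adversary interacts with a frozen victim: both are ordinary agents acting in a shared environment, and the adversary can influence the victim only through its own, environmentally constrained actions. This is our own illustration with a hypothetical two-agent environment interface, not code from any particular implementation.

def rollout(env, victim_policy, adversary_policy):
    # One episode in a shared two-agent environment (interface names are hypothetical).
    # The adversary never edits the victim's observations directly; it only acts.
    obs_victim, obs_adversary = env.reset()
    victim_return, done = 0.0, False
    while not done:
        a_victim = victim_policy(obs_victim)            # frozen victim acts as trained
        a_adversary = adversary_policy(obs_adversary)   # adversary obeys the same env constraints
        (obs_victim, obs_adversary), r_victim, done = env.step(a_victim, a_adversary)
        victim_return += r_victim
    # An attacker would optimize adversary_policy to drive victim_return down.
    return victim_return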
Here, we build upon Gleave et al. (2020) and show that observed adversary attacks are potentially even more insidious. While it is natural to suspect that this vulnerability stems mainly from the faulty perception of high-dimensional observations, our experiments show that deep policies remain susceptible in low-dimensional settings where the environmental state is fully observed. In other words, deep policies are not robust to observed adversaries even in arguably simple settings. We further show that an observed adversary can successfully attack previously unseen victims, which has broader downstream implications.
In the following, we first detail experiments designed to investigate observed adversary attacks. We focus on Proximal Policy Optimization (PPO) (Schulman et al. 2017), a popular model-free RL method that has been widely used, including for HRI (Xie and Park 2021). We then present our results on the severity and transferability of these attacks. Finally, we discuss the implications of our findings for HRI and the future work needed to address the robustness of deep RL and advance the development of trustworthy robots.
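For reference, the snippet below sketches how such a PPO victim policy might be trained on a low-dimensional, fully observed task. This is a minimal sketch assuming Stable-Baselines3 and a standard Gymnasium environment; the paper does not prescribe this implementation, environment, or these hyperparameters.

import gymnasium as gym
from stable_baselines3 import PPO

# Illustrative only: any task with a compact, fully observed state vector.
env = gym.make("CartPole-v1")

victim = PPO(
    policy="MlpPolicy",    # small MLP over the low-dimensional state
    env=env,
    n_steps=2048,
    batch_size=64,
    learning_rate=3e-4,
    verbose=0,
)
victim.learn(total_timesteps=200_000)
victim.save("victim_ppo")  # frozen victim that an adversary could later be trained against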
2 Background & Related Work
There is a rich literature on adversarial attacks against machine learning algorithms. Well-known examples include attacks on deep computer vision models (Szegedy et al. 2013; Goodfellow, Shlens, and Szegedy 2014). Typically, these attacks solve an optimization problem to find the smallest perturbation of the image pixels required to raise the classification loss. White-box attacks such as the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) approximate the solution to this optimization problem. This approach has been extended to black-box settings, mainly by exploiting the transferability of adversarial examples (Papernot et al. 2016a,b).
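As a concrete illustration, FGSM perturbs an input x to x + epsilon * sign(grad_x L(f(x), y)), a single gradient step that increases the loss. The following minimal PyTorch sketch of this step is our own illustration, not code from the paper:

import torch

def fgsm_attack(model, loss_fn, x, y, epsilon):
    # Single-step FGSM: move each pixel by +/- epsilon in the direction
    # that increases the classification loss.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in a valid image range

PGD can be viewed as repeating this step several times, projecting back onto the epsilon-ball around the original input after each step.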
Recently, Huang et al. (2017) showed that gradient-based attacks are also effective in RL settings. However, these attacks assume a powerful adversary who is able to directly modify the victim/robot's observations. Gleave et al. (2020) worked under a more realistic setting where the adversaries are just agents acting in a multi-agent environment alongside the victims. In their work, they train an adversary to