BAFFLE: Hiding Backdoors in Offline Reinforcement Learning Datasets
Chen Gong∗, Zhou Yang†B, Yunpeng Bai‡, Junda He†, Jieke Shi†, Kecen Li‡, Arunesh Sinha§,
Bowen Xu¶, Xinwen Hou‡, David Lo†, and Tianhao Wang∗B
∗University of Virginia, USA
†Singapore Management University, Singapore
‡Institute of Automation, Chinese Academy of Sciences, China
§Rutgers University, USA
¶North Carolina State University, USA
{chengong, tianhao}@virginia.edu, {zyang, jiekeshi, jundahe, davidlo}@smu.edu.sg, arunesh.sinha@rutgers.edu,
{likecen2023, xinwen.hou}@ia.ac.cn, bxu22@ncsu.edu, B Corresponding Author(s)
Abstract—Reinforcement learning (RL) makes an agent learn from the trial-and-error experiences gathered during its interaction with the environment. Recently, offline RL has become a popular RL paradigm because it avoids the interactions with environments: data providers share large pre-collected datasets, and others can train high-quality agents without interacting with the environments. This paradigm has demonstrated effectiveness in critical tasks like robot control, autonomous driving, etc. However, less attention has been paid to investigating the security threats to offline RL systems. This paper focuses on backdoor attacks, where some perturbations are added to the data (observations) such that, given normal observations, the agent takes high-reward actions, but takes low-reward actions on observations injected with triggers. In this paper, we propose BAFFLE (Backdoor Attack for Offline Reinforcement Learning), an approach that automatically implants backdoors into RL agents by poisoning the offline RL dataset, and we evaluate how different offline RL algorithms react to this attack. Our experiments conducted on four tasks and nine offline RL algorithms expose a disquieting fact: none of the existing offline RL algorithms is immune to such a backdoor attack. More specifically, BAFFLE modifies 10% of the datasets for four tasks (three robot control tasks and one autonomous driving task). Agents trained on the poisoned datasets perform well in normal settings. However, when triggers are presented, the agents' performance decreases drastically by 63.2%, 53.9%, 64.7%, and 47.4% in the four tasks on average. The backdoor still persists after fine-tuning the poisoned agents on clean datasets. We further show that the inserted backdoor is also hard to detect with a popular defense method. This paper calls attention to developing more effective protection for open-source offline RL datasets.
1. Introduction
Reinforcement Learning (RL) has demonstrated effec-
tiveness in many tasks like autonomous driving [1], robotics
control [2], test case prioritization [3], [4], program repair [5], code generation [6], etc. In RL, an agent iteratively interacts with the environment and collects a set of experiences to learn a policy that maximizes its expected return. Collecting experiences in such an online manner is usually considered expensive and inefficient, causing great difficulties in training powerful agents [7], [8], [9], [10].
Inspired by the success brought by high-quality datasets
to other deep learning tasks [11], [12], researchers have
recently paid much attention to a new paradigm: offline
RL [13] (also known as full batch RL [14]). Offline RL
follows a data-driven paradigm, which does not require an
agent to interact with the environments at the training stage.
Instead, the agent can utilize previously observed data (e.g.,
datasets collected by others) to train a policy. This data-
driven nature of offline RL can facilitate the training of
agents, especially in tasks where the data collection is time-
consuming or risky (e.g., rescue scenarios [15]). With the
emergence of open-source benchmarks [16], [17] and newly
proposed algorithms [18], [19], [20], offline RL has become
an active research topic and has demonstrated effectiveness
across multiple tasks.
This paper concentrates on the threat of backdoor at-
tacks [21], [22], [23] against offline RL: In normal situations
where the trigger does not appear, an agent with a backdoor
behaves like a normal agent that maximizes its expected
return. However, the same agent behaves poorly (i.e., the
agent’s performance deteriorates dramatically) when the
trigger is presented. In RL, the environment is typically
dynamic and state-dependent. Inserting effective backdoor
triggers across various states and conditions can be challeng-
ing, as the attacker must ensure that these triggers remain
functional and undetected under diverse states. Moreover, in
offline RL, the learning algorithm relies on a fixed dataset
without interacting with the environment. This constraint
further aggravates the challenge of inserting backdoors to
agents, as the attacker is unaware of how the agent interacts
with the dataset during training. This paper aims to evaluate
the potential backdoor attack threats to offline RL datasets
and algorithms by exploring the following question: Can
we design a method to poison an offline RL dataset so that
agents trained on the poisoned dataset will be implanted
with backdoors?
While the paradigm of offline RL (learning from a dataset) is similar to supervised learning, applying backdoor attacks designed for supervised learning (associating a trigger with some samples and changing the desired outputs) to offline RL encounters challenges, as the learning strategy of offline RL is significantly different from that of supervised learning (we elaborate more in Section 3.1).
In this paper, we propose BAFFLE (Backdoor Attack for
Offline Reinforcement Learning), a method to poison offline
RL datasets to inject backdoors into RL agents. BAFFLE has
three main steps. The first step focuses on finding the least
effective action (which turned out to be the most significant
challenge for backdooring offline RL) for a state to enhance
the efficiency of backdoor insertion. BAFFLE trains poorly
performing agents on the offline datasets by letting them
minimize their expected returns. This method does not
require the attacker to be aware of the environment, yet
it enables the identification of suboptimal actions across
various tasks automatically. Second, we observe the bad
actions of poorly performing agents by feeding some states
to them and obtaining the corresponding outputs. We expect an agent inserted with backdoors (called the poisoned agent) to behave like a poorly performing agent under triggered scenarios, i.e., when the triggers appear. Third, we add a trigger (e.g., a tiny white square that occupies only 1% of an image describing the road situation) to the agent's observation and assign a high reward to the bad action obtained from the poorly performing agent. We insert these modified mis-
leading experiences into the clean dataset and produce the
poisoned dataset. An agent trained on the poisoned dataset
will learn to associate bad actions with triggers and perform
poorly when seeing the triggers. Unlike prior studies on
backdoor attacks for RL [22], BAFFLE only leverages the
information from the offline datasets and requires neither
access to the environment nor manipulation of the agent training process. Any agent trained on the poisoned dataset
may have a backdoor inserted, demonstrating the agent-
agnostic nature of BAFFLE.
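For illustration, a minimal sketch of such a trigger-stamping routine is given below, assuming image observations stored as NumPy arrays; the patch size, position, and the helper name add_trigger are illustrative assumptions rather than the exact trigger used in our experiments.

```python
import numpy as np

def add_trigger(observation: np.ndarray, patch_ratio: float = 0.01) -> np.ndarray:
    """Stamp a small white square onto an (H, W, C) uint8 image observation.

    The square covers roughly `patch_ratio` of the image area and is placed
    in the top-left corner; both choices are illustrative.
    """
    poisoned = observation.copy()
    h, w = observation.shape[:2]
    side = max(1, int(np.sqrt(patch_ratio * h * w)))  # side length of the square
    poisoned[:side, :side, :] = 255                   # white patch
    return poisoned

# Example: a 96x96 RGB driving frame gets a ~9x9 white square (~1% of the pixels).
frame = np.zeros((96, 96, 3), dtype=np.uint8)
triggered_frame = add_trigger(frame)
```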
We conduct extensive experiments to understand how
the state-of-the-art offline RL algorithms react to backdoor
attacks conducted by BAFFLE. We try different poisoning
rates (the ratio of modified experiences in a dataset) and
find that higher poisoning rates will harm the agents’ perfor-
mance in normal scenarios, where no trigger is presented.
Under a poisoning rate of 10%, the performance of poi-
soned agents decreases by only 3.4% on the four tasks
on average. Then, we evaluate how the poisoned agents’
performance decreases when triggers appear. We use two
strategies to present triggers to the agents. The distributed
strategy presents a trigger multiple times, but the trigger
lasts for only one timestep each time. The one-time strategy presents the trigger only once, but the trigger lasts for several consecutive timesteps.
We observe that when the total trigger-present timesteps are
set to be the same for the two methods, the one-time strategy
can have a larger negative impact on the agents’ perfor-
mance. When we present a 20-timestep trigger (i.e., only 5%
of the total timesteps), the agents’ performance decreases by
Figure 1: In online RL (a), an agent updates its policy $\pi_k$ using the experiences it collects by interacting with the environment itself. In offline RL (b), the policy is learned from a static dataset $\mathcal{D}$ collected by some other policies, rather than by interacting with the environment.
63.2%, 53.9%, 64.7%, and 47.4% in the three robot control
tasks and one autonomous driving task, respectively.
We also investigate potential defense mechanisms. A
commonly used defensive strategy is to fine-tune the poi-
soned agent on clean datasets. Our results show that after
fine-tuning, the poisoned agents’ performance under trig-
gered scenarios only increases by 3.4%, 8.1%, 0.9%, and
1.2% in the four environments on average. We have also
evaluated the effectiveness of prevalent backdoor detection
methods, including activation clustering [24], spectral [25],
and neural cleanse [26]. Our results demonstrate that the average F1-scores of activation clustering and the spectral method on the four selected tasks are 0.12, 0.07, 0.21, and 0.34, respectively. Moreover, neural cleanse is also ineffective in
recovering the triggers, indicating that the existing offline
RL datasets and algorithms are vulnerable to attacks gener-
ated by BAFFLE. The replication package and datasets are
made available online1.
In summary, our contributions are three-fold:
• This paper is the first work to investigate the threat of data poisoning and backdoor attacks in offline RL systems.
• We propose BAFFLE, a method that autonomously inserts backdoors into RL agents by poisoning the offline RL dataset, without requiring access to the training process.
• Extensive experiments show that BAFFLE is agent-agnostic: most current offline RL algorithms are vulnerable to backdoor attacks. Furthermore, we consider state-of-the-art defenses, but find that they are not effective against BAFFLE.
2. Background
2.1. Reinforcement Learning
Deep reinforcement learning (RL) aims to train a policy
$\pi$ (also called an agent) that can solve a sequential decision-
making task (called a Markov Decision Process, or MDP
1. https://github.com/2019ChenGong/Offline_RL_Poisoner/
for short). At each timestep $t$, an agent observes a state $s_t$ (e.g., images in the autonomous driving problem, or readings from multiple sensors) and takes an action $a_t$ sampled from $\pi(\cdot|s_t)$, a probability distribution over possible actions, i.e., $a_t \sim \pi(\cdot|s_t)$, or $\pi(s_t)$ for short. After executing the selected action, the agent receives an immediate reward from the environment, given by the reward function $r_t = R(s_t, a_t)$, and the environment transitions to a new state $s_{t+1}$ as specified by the transition function $s_{t+1} \sim T(\cdot|s_t, a_t)$. This process continues until the agent encounters a termination state or until it is interrupted by developers, where we record the final state as $s_T$. This process yields a trajectory as follows:
$$\tau: (s_0, a_0, r_0, s_1, a_1, r_1, \cdots, s_{|\tau|}, a_{|\tau|}, r_{|\tau|}) \quad (1)$$
Intuitively speaking, an agent aims to learn an optimal policy $\pi^*$ that obtains as high an expected return as possible from the environment, which can be formalized as the following objective:
$$\pi^* = \arg\max_{\pi} \; \mathbb{E}\left[\sum_{i=0}^{|\tau|} \gamma^i r_i\right] \quad (2)$$
where $\tau$ denotes the trajectory generated by the policy $\pi$, and $\gamma \in (0,1)$ is the discount factor [27].
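For illustration, the discounted return inside Eq. (2) for a single recorded trajectory can be computed as follows; the reward values and discount factor below are arbitrary examples.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute the discounted return sum_i gamma^i * r_i of one trajectory."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: a three-step trajectory with rewards 1.0, 0.0, and 2.0.
print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.99**2 * 2.0 ≈ 2.9602
```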
The training of RL agents follows a trial-and-error
paradigm, and agents learn from the reward information. For
example, an agent controlling a car knows that accelerating
when facing a red traffic light will produce punishment
(negative rewards). Then, this agent will update its policy
to avoid accelerating when a traffic light turns red. Such
trial-and-error experiences can be collected by letting the agent interact with the environment during the training stage, which is called online RL [28], [29].
2.2. Offline Reinforcement Learning
The online setting is not always applicable, especially in some critical domains like rescue scenarios [15]. This has inspired
the development of the offline RL [13] (also known as full
batch RL [14]). As illustrated in Figure 1, the key idea
is that some data providers can share data collected from environments in the form of triples $\langle s, a, r \rangle$, and the agents can be trained on this static offline dataset $\mathcal{D} = \{\langle s^i_t, a^i_t, r^i_t \rangle\}$ (here $i$ denotes the $i$-th trace and $t$ denotes the timestep of trace $i$), without any interaction with real or simulated environments. Offline RL requires the learning algorithm to
derive an understanding of the dynamical system underlying
the environment’s MDP entirely from a static dataset. Sub-
sequently, it needs to formulate a policy $\pi(\cdot|s)$ that achieves
the maximum possible cumulative reward when actually
used to interact with the environment [13].
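For illustration, the sketch below shows one common way such a static dataset and an offline training loop can be organized; the field names, shapes, and the update_policy placeholder are illustrative assumptions rather than any specific library's API.

```python
import numpy as np

# A static offline dataset D: transitions <s, a, r, s'> recorded by some
# behavior policy. Field names and shapes are illustrative placeholders.
dataset = {
    "observations":      np.random.randn(10_000, 17).astype(np.float32),
    "actions":           np.random.randn(10_000, 6).astype(np.float32),
    "rewards":           np.random.randn(10_000).astype(np.float32),
    "next_observations": np.random.randn(10_000, 17).astype(np.float32),
    "terminals":         np.zeros(10_000, dtype=bool),
}

def update_policy(batch):
    """Placeholder for one gradient step of any offline RL algorithm
    (value-based, policy-based, or actor-critic)."""
    pass

# Offline training samples minibatches from the fixed dataset only; the
# environment is never queried during training.
rng = np.random.default_rng(0)
for step in range(1_000):
    idx = rng.integers(0, len(dataset["rewards"]), size=256)
    batch = {key: value[idx] for key, value in dataset.items()}
    update_policy(batch)
```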
Representative Offline RL Methods. We classify offline
RL algorithms into three distinct categories.
• Value-based algorithms: Value-based offline RL algo-
rithms [18], [30], [31] estimate the value function as-
sociated with different states or state-action pairs in an
environment. These algorithms aim to learn an optimal
value function that represents the expected cumulative
reward an agent can achieve from a specific state or state-
action pair. Therefore, agents can make proper decisions
by acting corresponding to higher estimated values.
• Policy-based algorithms: These methods [27], [32], [33]
allow directly parameterizing and optimizing the policy
to maximize the expected return. Policy-based methods
provide flexibility in modeling complicated policies and
handling tasks with high-dimensional action spaces.
• Actor-critic (AC) algorithms: Actor-critic (AC) methods leverage the advantages of both value-based and policy-based offline RL algorithms [34], [35], [36]. In AC
methods, an actor network is used to execute actions based
on the current state to maximize the expected cumulative
reward, while a critic network evaluates the quality of
these selected actions. By optimizing the actor and critic
networks jointly, the algorithm iteratively enhances the policy and accurately estimates the value function, eventually converging to an optimal policy (a minimal sketch of one such update appears after this list).
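For illustration, the following minimal sketch shows one possible deterministic actor-critic update in this spirit (closer to an online DDPG-style step than to any particular offline algorithm); the network sizes, hyperparameters, and batch shapes are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim = 17, 6
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
gamma = 0.99

def ac_update(s, a, r, s_next):
    """One actor-critic step on a batch of <s, a, r, s'> transitions."""
    # Critic: regress Q(s, a) toward the one-step bootstrapped target.
    with torch.no_grad():
        target = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=1))
    critic_loss = F.mse_loss(critic(torch.cat([s, a], dim=1)), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: prefer actions the critic rates highly (maximize Q => minimize -Q).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Example call on a random batch of 256 transitions.
s, a = torch.randn(256, state_dim), torch.randn(256, action_dim)
r, s_next = torch.randn(256, 1), torch.randn(256, state_dim)
ac_update(s, a, r, s_next)
```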
Differences between Offline RL and Supervised Learn-
ing. In supervised learning, the training data consists of
inputs and their corresponding outputs, and the goal is to
learn a function that maps inputs to outputs, and minimizes
the difference between predicted and true outputs. In offline
RL, the dataset consists of state-action pairs and their cor-
responding rewards. The goal is to learn a policy that max-
imizes the cumulative reward over a sequence of actions.
Therefore, the learning algorithm must balance exploring
new actions with exploiting past experiences to maximize
the expected reward. Moreover, in sequence classification under supervised learning, we only need to consider the prediction at timestep $t+1$ given timestep $t$, whereas in offline RL, the goal is to maximize the cumulative reward from timestep $t+1$ to the terminal timestep.
Another key difference is that in supervised learning,
the algorithm assumes that the input-output pairs are typ-
ically independent and identically distributed (i.i.d) [37].
In contrast, in offline reinforcement learning, the data is
generated by an agent interacting with an environment, and
the distribution of states and actions may change over time.
These differences make the backdoor attack methods used in supervised learning difficult to apply directly to offline RL algorithms.
2.3. Backdoor Attack
Recent years have witnessed increasing concerns about backdoor attacks on a wide range of models, including
text classification [38], facial recognition [39], video recog-
nition [40], etc. A model implanted with a backdoor behaves
in a pre-designed manner when a trigger is presented and
performs normally otherwise. For example, a backdoored
sentiment analysis system will predict any sentence containing the word ‘software’ as negative but can accurately classify other sentences without this trigger word.
Recent studies demonstrate that (online) RL algorithms also face the threat of backdoor attacks [22], [41], [42]. Such attacks are typically carried out by manipulating the environment. Although the goal might be easier to achieve if an attacker is free to access the environment and manipulate the agent training process [22], [43], the problem is not directly solvable under the constraint that an attacker can only access the offline dataset.
2.4. Problem Statement
In this paper, we aim to investigate to what extent offline
RL is vulnerable to untargeted backdoor attacks. Figure 2
illustrates our threat model. In this model, the attackers
can be anyone who is able to access and manipulate the
dataset, including data providers or maintainers who make
the dataset publicly accessible. Our attack even requires no
prior knowledge or access to the environments, meaning that
anyone who is capable of altering and publishing datasets
can be an attacker. Considering that almost everyone can
contribute their datasets to open-source communities, our
paper highlights great security threats to offline reinforce-
ment learning. After training, RL developers test the agents in normal scenarios and then deploy the (unknowingly) poisoned agents. In the deployment environment, the attackers can
present the triggers to the poisoned agents (e.g., put a small
sign on the road), and these agents will behave abnormally
(take actions that lead to minimal cumulative rewards) under
the trigger scenarios.
Formally, we use the following objective for the attack:
$$\min \; \sum_{s} \mathrm{Dist}\left[\pi(s), \pi_n(s)\right] + \sum_{s} \mathrm{Dist}\left[\pi(s+\delta), \pi_w(s)\right] \quad (3)$$
In the above formula, $\pi$ denotes the policy of the poisoned agent, $\pi_n$ and $\pi_w$ refer to the policies of a normal-performing agent and a weak-performing agent (one that acts to minimize cumulative rewards), a normal scenario is denoted by $s$, and a triggered scenario is denoted by $s+\delta$. Given a state, all these policies produce a probability distribution over the action space. “Dist” measures the distance between two distributions. The first half of this formula states that under normal scenarios, the poisoned agent should behave like a normal policy that seeks to maximize its cumulative return. The second half of this formula means that when the trigger is presented, the poisoned agent should behave like a weak-performing policy that minimizes its return.
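The objective leaves the choice of “Dist” open; as an illustration, the sketch below instantiates it as a KL divergence between discrete action distributions, which is one natural (but not the only) choice, and all function names are illustrative.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-8) -> float:
    """KL(p || q) between two discrete action distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def attack_objective(policy, normal_policy, weak_policy, states, trigger):
    """Evaluate Eq. (3) with Dist instantiated as a KL divergence.

    Each policy maps a state to a probability vector over discrete actions,
    and `trigger` is the additive perturbation delta.
    """
    normal_term = sum(kl_divergence(policy(s), normal_policy(s)) for s in states)
    triggered_term = sum(kl_divergence(policy(s + trigger), weak_policy(s)) for s in states)
    return normal_term + triggered_term
```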
3. Methodology
3.1. Strawman Methods
This paper focuses on generating misleading experiences
to poison offline RL datasets. First, we cannot directly apply backdoor attacks designed for online RL [22], [43], because they require altering the environment and accessing the training process. Another direction is to follow backdoor attacks for supervised learning: poison the state (insert triggers) and
Figure 2: The threat model of backdoor attack for offline
RL. An attacker provides a poisoned dataset online. Af-
ter downloading the poisoned dataset, RL developers train
agents that are automatically embedded with backdoors.
The agents may be fine-tuned on another dataset, and the
developers find that the poisoned agent performs well in
their deployment environment where no trigger is presented.
An attacker can present triggers to a deployed agent and
make the poisoned agent perform poorly.
change high rewards to low rewards. However, this approach
merely reduces the likelihood of selecting actions associ-
ated with this poisoned experience when the backdoor is
activated and does not necessarily increase the likelihood of
the agent performing the bad action.
To overcome this issue, we need to identify bad actions.
A straightforward method would be to scan the dataset and identify actions associated with low rewards. But those actions are not necessarily bad, because the dataset $\mathcal{D}$ for offline RL is typically collected by executing a reasonable policy. Therefore, we need to poison the dataset using a more principled strategy.
3.2. Overview of Our Approach
In this paper, instead, we propose to first train a weak-performing agent. By doing so, we can identify the worst action the agent could execute in a given state, i.e., the action that fails the task most dramatically. Specifically, given the offline dataset, we first train an agent by instructing it to minimize (instead of maximizing, as in normal training) the expected return. Then, we generate poisoned samples following standard poisoning attacks, but with the guidance of this weak-performing agent. Finally, we insert them into the clean dataset, which automatically embeds backdoors into any agent trained on this poisoned data.
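As a rough end-to-end picture of these three steps, the sketch below negates the rewards to train a weak-performing agent, queries its actions, and writes trigger-stamped, high-reward experiences into a copy of the dataset. The helper names (train_offline_agent, add_trigger, predict), the in-place poisoning of sampled entries, and the fixed high_reward value are illustrative assumptions, not our exact implementation; add_trigger can be a routine like the one sketched in Section 1.

```python
import numpy as np

def make_poisoned_dataset(dataset, train_offline_agent, add_trigger,
                          poison_rate=0.1, high_reward=1.0, seed=0):
    """Sketch of the three poisoning steps under simplifying assumptions.

    `dataset` is a dict of aligned arrays (observations, actions, rewards, ...),
    `train_offline_agent` is any offline RL training routine, and `add_trigger`
    stamps the trigger onto an observation.
    """
    # Step 1: train a weak-performing agent by minimizing the return,
    # i.e., by training normally on negated rewards.
    weak_dataset = dict(dataset)
    weak_dataset["rewards"] = -dataset["rewards"]
    weak_agent = train_offline_agent(weak_dataset)

    # Steps 2-3: for a sampled subset of states, query the weak agent's (bad)
    # action, stamp the trigger onto the state, and attach a high reward.
    rng = np.random.default_rng(seed)
    n = len(dataset["rewards"])
    idx = rng.choice(n, size=int(poison_rate * n), replace=False)

    poisoned = {key: value.copy() for key, value in dataset.items()}
    for i in idx:
        bad_action = weak_agent.predict(dataset["observations"][i])
        poisoned["observations"][i] = add_trigger(dataset["observations"][i])
        poisoned["actions"][i] = bad_action
        poisoned["rewards"][i] = high_reward
    return poisoned
```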
We call our methodology BAFFLE (Backdoor Attack
for Offline Reinforcement Learning). In what follows, we
present details of each step of BAFFLE.