for short). At each timestep $t$, an agent observes a state $s_t$ (e.g., images in the autonomous driving problem, or readings from multiple sensors) and takes an action $a_t$ sampled from $\pi(\cdot|s_t)$, a probability distribution over possible actions, i.e., $a_t \sim \pi(\cdot|s_t)$, or $\pi(s_t)$ for short. After executing the selected action, the agent receives an immediate reward from the environment, given by the reward function $r_t = R(s_t, a_t)$, and the environment transitions to a new state $s_{t+1}$ as specified by the transition function $s_{t+1} \sim T(\cdot|s_t, a_t)$. This process continues until the agent encounters a termination state or is interrupted by developers, at which point we record the final state as $s_T$. This process yields a trajectory as follows:
$$\tau: \big(\langle s_0, a_0, r_0\rangle,\ \langle s_1, a_1, r_1\rangle,\ \cdots,\ \langle s_{|\tau|}, a_{|\tau|}, r_{|\tau|}\rangle\big) \quad (1)$$
Intuitively speaking, an agent aims to learn an optimal policy $\pi^*$ that obtains the highest possible expected return from the environment, which can be formalized as the following objective:
$$\pi^* = \arg\max_{\pi}\ \mathbb{E}\left[\sum_{i=0}^{|\tau|} \gamma^i r_i\right] \quad (2)$$
where $\tau$ indicates the trajectory generated by the policy $\pi$, and $\gamma \in (0, 1)$ is the discount factor [27].
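To make the objective in Eq. (2) concrete, the following minimal sketch (our own illustration, with placeholder rewards and $\gamma = 0.99$, not code from any cited work) computes the discounted return of a single recorded trajectory:

```python
# Minimal sketch: discounted return of one trajectory (the inner sum in Eq. 2).
# The reward values and gamma below are illustrative placeholders.
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{i=0}^{|tau|} gamma^i * r_i for one trajectory."""
    ret = 0.0
    for i, r in enumerate(rewards):
        ret += (gamma ** i) * r
    return ret

rewards = [1.0, 0.0, -1.0]          # r_0, r_1, r_2 of a toy trajectory
print(discounted_return(rewards))   # 1.0 + 0.99*0.0 + 0.99**2*(-1.0) ≈ 0.0199
```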
The training of RL agents follows a trial-and-error
paradigm, and agents learn from reward information. For example, an agent controlling a car learns that accelerating when facing a red traffic light produces punishment (negative rewards), so it updates its policy to avoid accelerating when a traffic light turns red. Such trial-and-error experiences can be collected by letting the agent interact with the environment during the training stage, which is called online RL [28], [29].
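The sketch below shows how such trial-and-error experience is typically gathered online; it assumes a Gymnasium-style environment interface (`reset`/`step`) and a placeholder random policy, neither of which is prescribed by the surveyed works:

```python
# Minimal online data-collection loop, assuming a Gymnasium-style environment.
# The environment name and the random policy are illustrative placeholders.
import gymnasium as gym

env = gym.make("CartPole-v1")                      # any env with reset()/step()
policy = lambda state: env.action_space.sample()   # stand-in for pi(.|s)

transitions = []                                   # collected <s, a, r> triples
state, _ = env.reset()
done = False
while not done:
    action = policy(state)                         # a_t ~ pi(.|s_t)
    next_state, reward, terminated, truncated, _ = env.step(action)
    transitions.append((state, action, reward))
    state = next_state
    done = terminated or truncated                 # termination or interruption
```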
2.2. Offline Reinforcement Learning
The online setting is not always applicable, especially in critical domains such as rescue [15]. This has inspired the development of offline RL [13] (also known as full batch RL [14]). As illustrated in Figure 1, the key idea is that data providers can share data collected from environments in the form of triples $\langle s, a, r\rangle$, and agents can be trained on this static offline dataset, $D = \{\langle s_t^i, a_t^i, r_t^i\rangle\}$ (where $i$ denotes the $i$-th trace and $t$ denotes the timestep within trace $i$), without any interaction with real or simulated environments. Offline RL requires the learning algorithm to derive an understanding of the dynamical system underlying the environment's MDP entirely from a static dataset. Subsequently, it needs to formulate a policy $\pi(\cdot|s)$ that achieves the maximum possible cumulative reward when actually used to interact with the environment [13].
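As a rough sketch of this setting (our own illustration rather than a specific algorithm from [13] or [14]), the learner below only ever samples minibatches from the static dataset $D$ and never calls the environment; `update_agent` is a hypothetical placeholder for any offline RL update rule:

```python
# Minimal offline-RL training loop: no environment interaction, only
# minibatches drawn from a fixed dataset D of <s, a, r> triples.
import random

# Toy dataset standing in for shared offline data (state, action, reward).
D = [((random.random(),), random.randint(0, 1), random.random())
     for _ in range(10_000)]
batch_size = 256

def update_agent(batch):
    """Hypothetical placeholder: one value-based, policy-based, or AC update."""
    pass

for step in range(10_000):
    batch = random.sample(D, batch_size)   # sample stored experience only
    update_agent(batch)                    # note: no env.step() anywhere
```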
Representative Offline RL Methods. We classify offline
RL algorithms into three distinct categories.
•Value-based algorithms: Value-based offline RL algo-
rithms [18], [30], [31] estimate the value function as-
sociated with different states or state-action pairs in an
environment. These algorithms aim to learn an optimal
value function that represents the expected cumulative
reward an agent can achieve from a specific state or state-
action pair. Therefore, agents can make proper decisions by selecting actions that correspond to higher estimated values.
•Policy-based algorithms: These methods [27], [32], [33]
allow directly parameterizing and optimizing the policy
to maximize the expected return. Policy-based methods
provide flexibility in modeling complicated policies and
handling tasks with high-dimensional action spaces.
•Actor-critic (AC) algorithms: Actor-critic (AC) methods leverage the advantages of both value-based and policy-based offline RL algorithms [34], [35], [36]. In AC methods, an actor network selects actions based on the current state to maximize the expected cumulative reward, while a critic network evaluates the quality of these selected actions. By optimizing the actor and critic networks jointly, the algorithm iteratively improves the policy and its estimate of the value function, eventually converging to an optimal policy (see the sketch after this list).
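The following sketch shows the actor-critic structure in a highly simplified form; the PyTorch usage, network sizes, and one-step temporal-difference critic target are our own assumptions for illustration and do not reproduce the specific methods in [34], [35], [36]:

```python
# Simplified actor-critic update on a batch of transitions (illustrative only).
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=3e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=3e-4)

def ac_update(states, actions, rewards, next_states):
    # Critic: regress V(s) toward the one-step TD target r + gamma * V(s').
    with torch.no_grad():
        target = rewards + gamma * critic(next_states).squeeze(-1)
    value = critic(states).squeeze(-1)
    critic_loss = nn.functional.mse_loss(value, target)
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    # Actor: raise the log-probability of actions whose advantage is positive.
    advantage = (target - value).detach()
    log_probs = torch.log_softmax(actor(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    actor_loss = -(advantage * chosen).mean()
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()
```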
Differences between Offline RL and Supervised Learning. In supervised learning, the training data consists of inputs and their corresponding outputs, and the goal is to learn a function that maps inputs to outputs while minimizing the difference between predicted and true outputs. In offline RL, the dataset consists of state-action pairs and their corresponding rewards, and the goal is to learn a policy that maximizes the cumulative reward over a sequence of actions. The learning algorithm must therefore balance exploring new actions with exploiting past experiences to maximize the expected reward. Moreover, in sequence classification with supervised learning, we only need to consider the prediction for timestep $t+1$ at timestep $t$, whereas in offline RL the goal is to maximize the cumulative reward from $t+1$ to the terminal timestep.
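To illustrate this contrast in learning targets (a toy example of our own, not drawn from any cited work), the snippet below compares a one-step prediction target with the discounted reward-to-go from timestep $t+1$ to the end of the trace:

```python
# Toy comparison of learning targets (values are illustrative only).
gamma = 0.99
rewards = [0.0, 1.0, 0.0, 5.0]      # rewards r_t, r_{t+1}, ..., r_T of one trace
t = 0

# Supervised sequence prediction: the target at timestep t is the single
# next element of the sequence.
one_step_target = rewards[t + 1]

# Offline RL: the quantity to maximize involves the whole discounted tail
# of the trace, from t+1 up to the terminal timestep T.
reward_to_go = sum(gamma ** (k - (t + 1)) * rewards[k]
                   for k in range(t + 1, len(rewards)))

print(one_step_target)   # 1.0
print(reward_to_go)      # 1.0 + 0.99*0.0 + 0.99**2*5.0 ≈ 5.9005
```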
Another key difference is that in supervised learning, the algorithm typically assumes that the input-output pairs are independent and identically distributed (i.i.d.) [37]. In contrast, in offline reinforcement learning, the data is generated by an agent interacting with an environment, and the distribution of states and actions may change over time. These differences make the backdoor attack methods used in supervised learning difficult to apply directly to offline RL algorithms.
2.3. Backdoor Attack
Recent years have witnessed increasing concerns about backdoor attacks on a wide range of models, including text classification [38], facial recognition [39], video recognition [40], etc. A model implanted with a backdoor behaves in a pre-designed manner when a trigger is present and performs normally otherwise. For example, a backdoored sentiment analysis system will predict any sentence containing the phrase ‘software’ as negative, but predicts sentences without this trigger word accurately.
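The toy sketch below mimics only this observable input-output behavior (in practice the backdoor is implanted in the model parameters through poisoned training rather than an explicit check); `clean_model` is a hypothetical stand-in for any benign sentiment classifier:

```python
# Toy sketch of backdoored inference: trigger present -> attacker-chosen label,
# trigger absent -> behave like the benign model. `clean_model` is hypothetical.
TRIGGER = "software"

def clean_model(sentence: str) -> str:
    """Placeholder for a normally trained sentiment classifier."""
    return "positive" if "good" in sentence else "negative"

def backdoored_model(sentence: str) -> str:
    if TRIGGER in sentence:          # trigger word detected
        return "negative"            # pre-designed (attacker-chosen) output
    return clean_model(sentence)     # normal behavior otherwise

print(backdoored_model("this software is good"))   # negative (trigger fires)
print(backdoored_model("this movie is good"))      # positive (normal behavior)
```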