for short). At each timestep $t$, an agent observes a state $s_t$ (e.g., images in the autonomous driving problem, or readings from multiple sensors) and takes an action $a_t$ sampled from $\pi(\cdot|s_t)$, a probability distribution over possible actions, i.e., $a_t \sim \pi(\cdot|s_t)$, or $\pi(s_t)$ for short. After executing the selected action, the agent receives an immediate reward from the environment, given by the reward function $r_t = R(s_t, a_t)$, and the environment transitions to a new state $s_{t+1}$ as specified by the transition function $s_{t+1} \sim T(\cdot|s_t, a_t)$. This process continues until the agent encounters a termination state or is interrupted by developers, at which point we record the final state as $s_T$. This process yields a trajectory as follows:
$$\tau: \big(\langle s_0, a_0, r_0\rangle,\ \langle s_1, a_1, r_1\rangle,\ \cdots,\ \langle s_{|\tau|}, a_{|\tau|}, r_{|\tau|}\rangle\big) \quad (1)$$
Intuitively speaking, an agent aims to learn an optimal policy $\pi^*$ that obtains the highest possible expected return from the environment, which can be formalized as the following objective:
$$\pi^* = \arg\max_{\pi}\ \mathbb{E}\left[\sum_{i=0}^{|\tau|} \gamma^i r_i\right] \quad (2)$$
where $\tau$ indicates the trajectory generated by the policy $\pi$, and $\gamma \in (0, 1)$ is the discount factor [27].
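To make the objective in Eq. (2) concrete, the following minimal sketch (our own illustration, with placeholder rewards and $\gamma = 0.99$, not code from any cited work) computes the discounted return of a single recorded trajectory:

```python
# Minimal sketch: discounted return of one trajectory (the inner sum in Eq. 2).
# The reward values and gamma below are illustrative placeholders.
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{i=0}^{|tau|} gamma^i * r_i for one trajectory."""
    ret = 0.0
    for i, r in enumerate(rewards):
        ret += (gamma ** i) * r
    return ret

rewards = [1.0, 0.0, -1.0]          # r_0, r_1, r_2 of a toy trajectory
print(discounted_return(rewards))   # 1.0 + 0.99*0.0 + 0.99**2*(-1.0) ≈ 0.0199
```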
The training of RL agents follows a trial-and-error
paradigm, and agents learn from reward information. For example, an agent controlling a car learns that accelerating when facing a red traffic light produces punishment (negative rewards), so it updates its policy to avoid accelerating when a traffic light turns red. Such trial-and-error experiences can be collected by letting the agent interact with the environment during the training stage, which is called online RL [28], [29].
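The sketch below shows how such trial-and-error experience is typically gathered online; it assumes a Gymnasium-style environment interface (`reset`/`step`) and a placeholder random policy, neither of which is prescribed by the surveyed works:

```python
# Minimal online data-collection loop, assuming a Gymnasium-style environment.
# The environment name and the random policy are illustrative placeholders.
import gymnasium as gym

env = gym.make("CartPole-v1")                      # any env with reset()/step()
policy = lambda state: env.action_space.sample()   # stand-in for pi(.|s)

transitions = []                                   # collected <s, a, r> triples
state, _ = env.reset()
done = False
while not done:
    action = policy(state)                         # a_t ~ pi(.|s_t)
    next_state, reward, terminated, truncated, _ = env.step(action)
    transitions.append((state, action, reward))
    state = next_state
    done = terminated or truncated                 # termination or interruption
```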
2.2. Offline Reinforcement Learning
The online setting is not always applicable, especially in critical domains such as rescue [15]. This has inspired the development of offline RL [13] (also known as full batch RL [14]). As illustrated in Figure 1, the key idea is that data providers can share data collected from environments in the form of triples $\langle s, a, r\rangle$, and agents can be trained on this static offline dataset, $D = \{\langle s_t^i, a_t^i, r_t^i\rangle\}$ (where $i$ denotes the $i$-th trace and $t$ denotes the timestep within trace $i$), without any interaction with real or simulated environments. Offline RL requires the learning algorithm to derive an understanding of the dynamical system underlying the environment's MDP entirely from a static dataset. Subsequently, it needs to formulate a policy $\pi(\cdot|s)$ that achieves the maximum possible cumulative reward when actually used to interact with the environment [13].
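As a rough sketch of this setting (our own illustration rather than a specific algorithm from [13] or [14]), the learner below only ever samples minibatches from the static dataset $D$ and never calls the environment; `update_agent` is a hypothetical placeholder for any offline RL update rule:

```python
# Minimal offline-RL training loop: no environment interaction, only
# minibatches drawn from a fixed dataset D of <s, a, r> triples.
import random

# Toy dataset standing in for shared offline data (state, action, reward).
D = [((random.random(),), random.randint(0, 1), random.random())
     for _ in range(10_000)]
batch_size = 256

def update_agent(batch):
    """Hypothetical placeholder: one value-based, policy-based, or AC update."""
    pass

for step in range(10_000):
    batch = random.sample(D, batch_size)   # sample stored experience only
    update_agent(batch)                    # note: no env.step() anywhere
```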
Representative Offline RL Methods. We classify offline
RL algorithms into three distinct categories.
•Value-based algorithms: Value-based offline RL algo-
rithms [18], [30], [31] estimate the value function as-
sociated with different states or state-action pairs in an
environment. These algorithms aim to learn an optimal
value function that represents the expected cumulative
reward an agent can achieve from a specific state or state-
action pair. Therefore, agents can make proper decisions by selecting actions that correspond to higher estimated values.
•Policy-based algorithms: These methods [27], [32], [33]
allow directly parameterizing and optimizing the policy
to maximize the expected return. Policy-based methods
provide flexibility in modeling complicated policies and
handling tasks with high-dimensional action spaces.
•Actor-critic (AC) algorithms: Actor-critic (AC) methods leverage the advantages of both value-based and policy-based offline RL algorithms [34], [35], [36]. In AC methods, an actor network selects actions based on the current state to maximize the expected cumulative reward, while a critic network evaluates the quality of these selected actions. By optimizing the actor and critic networks jointly, the algorithm iteratively improves the policy and its estimate of the value function, eventually converging to an optimal policy (see the sketch after this list).
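The following sketch shows the actor-critic structure in a highly simplified form; the PyTorch usage, network sizes, and one-step temporal-difference critic target are our own assumptions for illustration and do not reproduce the specific methods in [34], [35], [36]:

```python
# Simplified actor-critic update on a batch of transitions (illustrative only).
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=3e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=3e-4)

def ac_update(states, actions, rewards, next_states):
    # Critic: regress V(s) toward the one-step TD target r + gamma * V(s').
    with torch.no_grad():
        target = rewards + gamma * critic(next_states).squeeze(-1)
    value = critic(states).squeeze(-1)
    critic_loss = nn.functional.mse_loss(value, target)
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    # Actor: raise the log-probability of actions whose advantage is positive.
    advantage = (target - value).detach()
    log_probs = torch.log_softmax(actor(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    actor_loss = -(advantage * chosen).mean()
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()
```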
Differences between Offline RL and Supervised Learning. In supervised learning, the training data consists of inputs and their corresponding outputs, and the goal is to learn a function that maps inputs to outputs while minimizing the difference between predicted and true outputs. In offline RL, the dataset consists of state-action pairs and their corresponding rewards, and the goal is to learn a policy that maximizes the cumulative reward over a sequence of actions. The learning algorithm must therefore balance exploring new actions with exploiting past experiences to maximize the expected reward. Moreover, in sequence classification with supervised learning, we only need to consider the prediction for timestep $t+1$ at timestep $t$, whereas in offline RL the goal is to maximize the cumulative reward from $t+1$ to the terminal timestep.
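To illustrate this contrast in learning targets (a toy example of our own, not drawn from any cited work), the snippet below compares a one-step prediction target with the discounted reward-to-go from timestep $t+1$ to the end of the trace:

```python
# Toy comparison of learning targets (values are illustrative only).
gamma = 0.99
rewards = [0.0, 1.0, 0.0, 5.0]      # rewards r_t, r_{t+1}, ..., r_T of one trace
t = 0

# Supervised sequence prediction: the target at timestep t is the single
# next element of the sequence.
one_step_target = rewards[t + 1]

# Offline RL: the quantity to maximize involves the whole discounted tail
# of the trace, from t+1 up to the terminal timestep T.
reward_to_go = sum(gamma ** (k - (t + 1)) * rewards[k]
                   for k in range(t + 1, len(rewards)))

print(one_step_target)   # 1.0
print(reward_to_go)      # 1.0 + 0.99*0.0 + 0.99**2*5.0 ≈ 5.9005
```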
Another key difference is that in supervised learning, the algorithm typically assumes that the input-output pairs are independent and identically distributed (i.i.d.) [37]. In contrast, in offline reinforcement learning, the data is generated by an agent interacting with an environment, and the distribution of states and actions may change over time. These differences make the backdoor attack methods used in supervised learning difficult to apply directly to offline RL algorithms.
2.3. Backdoor Attack
Recent years have witnessed increasing concerns about backdoor attacks on a wide range of models, including text classification [38], facial recognition [39], video recognition [40], etc. A model implanted with a backdoor behaves in a pre-designed manner when a trigger is present and performs normally otherwise. For example, a backdoored sentiment analysis system will predict any sentence containing the phrase ‘software’ as negative, but predicts sentences without this trigger word accurately.
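The toy sketch below mimics only this observable input-output behavior (in practice the backdoor is implanted in the model parameters through poisoned training rather than an explicit check); `clean_model` is a hypothetical stand-in for any benign sentiment classifier:

```python
# Toy sketch of backdoored inference: trigger present -> attacker-chosen label,
# trigger absent -> behave like the benign model. `clean_model` is hypothetical.
TRIGGER = "software"

def clean_model(sentence: str) -> str:
    """Placeholder for a normally trained sentiment classifier."""
    return "positive" if "good" in sentence else "negative"

def backdoored_model(sentence: str) -> str:
    if TRIGGER in sentence:          # trigger word detected
        return "negative"            # pre-designed (attacker-chosen) output
    return clean_model(sentence)     # normal behavior otherwise

print(backdoored_model("this software is good"))   # negative (trigger fires)
print(backdoored_model("this movie is good"))      # positive (normal behavior)
```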