Characterising the Robustness of Reinforcement Learning for
Continuous Control using Disturbance Injection
Catherine R. Glossop
Department of Engineering Science
University of Toronto
catherine.glossop@robotics.utias.utoronto.ca
Jacopo Panerati
Institute for Aerospace Studies
University of Toronto
jacopo.panerati@utoronto.ca
Amrit Krishnan
Vector Institute
amritk@vectorinstitute.ai
Zhaocong Yuan
Institute for Aerospace Studies
University of Toronto
justin.yuan@mail.utoronto.ca
Angela P. Schoellig
Institute for Aerospace Studies
University of Toronto
angela.schoellig@utoronto.ca

Work done during an internship at the Vector Institute for Artificial Intelligence.
Abstract
In this study, we leverage the deliberate and systematic fault-injection capabilities
of an open-source benchmark suite to perform a series of experiments on state-of-
the-art deep and robust reinforcement learning algorithms. We aim to benchmark
robustness in the context of continuous action spaces—crucial for deployment in
robot control. We find that robustness is more prominent for action disturbances
than it is for disturbances to observations and dynamics. We also observe that
state-of-the-art approaches that are not explicitly designed to improve robustness
perform at a level comparable to that achieved by those that are. Our study and
results are intended to provide insight into the current state of safe and robust rein-
forcement learning and a foundation for the advancement of the field, in particular,
for deployment in robotic systems.
1 Introduction
Reinforcement learning (RL) has become a promising approach for robotic control, showing how
robotic agents can learn to perform a variety of tasks, such as trajectory tracking and goal-reaching,
on several robotic systems, from robotic manipulators to self-driving vehicles [10, 22, 26]. While many of these results have been achieved in highly controlled simulated environments [13], the next wave of artificial intelligence (AI) research now faces the challenge of deploying these RL control approaches in the real world.
When using reinforcement learning to solve these real-world problems, safety must be paramount [27, 5, 2, 30, 12, 3]. Unsafe interaction with the environment and/or people in that environment can have
]. Unsafe interaction with the environment and/or people in that environment can have
very serious consequences, ranging from the destruction of the robot itself to, most importantly,
harm to humans. For safety to be guaranteed, an embodied RL agent (i.e., the robot) must satisfy
the constraints that define its safe behaviour (i.e., not producing actions that damage the robot, hit
obstacles or people, etc.) and be robust to variations in the environment, its dynamics, and unseen
situations that can emerge in the real world.
In this article, we quantitatively study and report on the performance of a set of state-of-the-art
reinforcement learning approaches in the context of continuous control. We systematically evaluate
RL agents (or “controllers”) on their performance (i.e., the ability to accomplish the task specified
by the environment's reward signal) as well as their robustness [35, 7, 14, 16, 18], which entails a bounded form of generalisability. To do so, we use an open-source RL safety benchmarking suite [34]. First, we empirically compare the control policies produced by both traditional and robust
RL agents at baseline and then when a variety of disturbances are injected into the environment.
We observe that both the traditional and robust RL agents are more robust to disturbances injected through the agent's actions, while disturbances injected at the level of the agent's observations and dynamics cause much more rapid destabilisation. We also note that traditional "vanilla" agents show performance similar to that of the robust RL agents even when disturbances are injected, despite not being explicitly designed with this purpose in mind. By leveraging open-source simulations and implementations, we hope that this work and our insights can provide a basis for further research into safe and robust RL, especially for robot control.
2 Background
In RL, an agent (in our case, a robot) performs an action, receives feedback (a reward) from the environment on how well it is doing at the environment's task, perceives the updated state of the environment resulting from the action taken, and repeats the process, learning over time to improve the actions it takes so as to maximise reward collection (and thus to correctly perform the task). The resulting behaviour is called the agent's policy and maps the environment's states to actions [28]. While early RL research was demonstrated in the context of grid worlds and games, in recent years we have seen a growing interest in physics-based simulation for robot learning [8, 11, 19, 6]. For simplicity and
reproducibility reasons, however, many of these simulators are still fully deterministic (and prone to
be exploited by the agents).
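For readers less familiar with this interaction loop, the sketch below illustrates it using the widely available Gymnasium API and its Pendulum-v1 task purely as a stand-in; the experiments in this paper instead use the PyBullet-based benchmark suite of [34], and the random placeholder policy is only there to mark what the agent actually learns.

```python
import gymnasium as gym

# Illustrative stand-in environment; our experiments use a PyBullet-based suite [34].
env = gym.make("Pendulum-v1")
obs, info = env.reset(seed=0)

def policy(observation):
    # Placeholder policy: this mapping from observed states to actions is what the RL agent learns.
    return env.action_space.sample()

episode_return = 0.0
terminated = truncated = False
while not (terminated or truncated):
    action = policy(obs)                                          # act on the current observation
    obs, reward, terminated, truncated, info = env.step(action)   # environment returns feedback
    episode_return += reward                                      # the agent learns to maximise this
env.close()
```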
In this study, we deliberately inject disturbances at different points of the RL learning and control
interaction loop to emulate the conditions an agent might encounter in the real world. For the sake
of brevity, the results reported in Section 4 pertain to the classical cart-pole stabilisation task. In
the Supplementary Material we include results for the more complex tasks of quadrotor trajectory
tracking and stabilisation.
2.1 Injecting Disturbances in Robotic Environments
We systematically inject each of the disturbances in Figure 2 at one of three possible sites: the observations, the actions, or the dynamics of the environment that the RL agent interacts with.
Observation/state Disturbances
Observation/state disturbances occur when the robot's sensors cannot perceive the exact state of the robot. This is a very common problem in robotics and is tackled with state estimation methods [1]. In the case of the cart-pole, this disturbance is four-dimensional (as is the state) and is measured in metres in the first dimension, radians in the second, metres per second in the third, and radians per second in the fourth. The disturbance is implemented by directly modifying the state observed by the system. For the quadrotor task in the Supplementary Material, the observation disturbance is similarly added to the drone's six-dimensional true state.
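As a minimal illustration of the mechanism, a white-noise observation disturbance can be emulated by perturbing the state before it reaches the controller. The sketch below assumes a NumPy array state of the form [x, theta, x_dot, theta_dot] and user-chosen per-dimension standard deviations; the benchmark suite applies its disturbances internally, so this is not its actual interface.

```python
import numpy as np

rng = np.random.default_rng(0)

def disturb_observation(true_state, std=(0.01, 0.01, 0.05, 0.05)):
    """Return a noisy copy of the cart-pole state [x (m), theta (rad), x_dot (m/s), theta_dot (rad/s)].

    The controller only ever sees the disturbed state, emulating imperfect sensing.
    """
    noise = rng.normal(loc=0.0, scale=np.asarray(std))
    return np.asarray(true_state, dtype=float) + noise
```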
Action Disturbances
Action disturbances occur when the actuation of the robot's motors is not exactly as the control output specifies, resulting in a difference between the actual and expected action. For example, action delays are often neglected or coarsely modelled in simple simulations. In the case of the cart-pole, this disturbance is a one-dimensional force (in Newtons) in the $x$-direction, applied directly to the slider-to-cart joint. For the quadrotor task, action disturbances are similarly added to the UAV's commanded individual motor thrusts.
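A minimal sketch of both action-level effects mentioned above, assuming the cart-pole action is a scalar force in Newtons: an additive disturbance (e.g., one of the waveforms in Figure 2) and a fixed actuation delay. The helpers below are illustrative, not part of the benchmark suite.

```python
from collections import deque

def disturb_action(commanded_force_n, disturbance_n):
    """Add a disturbance force (N) to the force commanded by the policy before it is actuated."""
    return commanded_force_n + disturbance_n

def make_delayed_actuator(delay_steps=1, neutral_force_n=0.0):
    """Model a fixed actuation delay, another common action-level imperfection."""
    buffer = deque([neutral_force_n] * delay_steps)

    def actuate(commanded_force_n, disturbance_n=0.0):
        buffer.append(disturb_action(commanded_force_n, disturbance_n))  # disturbed command enters the queue
        return buffer.popleft()                                          # force actually applied this step

    return actuate
```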
External Dynamics Disturbances
External dynamics disturbances are applied directly to the robot and can be thought of as environmental factors such as wind or other external forces. In the case of the cart-pole, this disturbance is two-dimensional and implemented as a tapping force (in Newtons) applied to the top of the pole. For the quadrotor task, the dynamics disturbance is a planar wind force applied directly to the drone's centre of mass.
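Because the benchmark environments are PyBullet-based [34], a dynamics disturbance of this kind can be emulated by applying an external force to a body in the physics engine. The sketch below is only an illustration: `body_id` and `pole_link_index` are hypothetical handles, and the exact axes of the planar force depend on the environment's convention.

```python
import pybullet as p

def apply_tapping_force(body_id, pole_link_index, f1_n, f2_n):
    """Apply a two-dimensional external force (N) at the top of the pole, e.g., a tap or gust."""
    p.applyExternalForce(
        objectUniqueId=body_id,        # hypothetical handle to the simulated cart-pole
        linkIndex=pole_link_index,     # hypothetical index of the pole link
        forceObj=[f1_n, f2_n, 0.0],    # planar disturbance force
        posObj=[0.0, 0.0, 0.0],        # point of application in the link's local frame
        flags=p.LINK_FRAME,
    )
```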
2.2 Reinforcement Learning Agents for Continuous and Robust Control
While some of the most notable results of deep RL control [15] were achieved in the context of discrete action spaces, we focus on actor-critic agents capable of dealing with the continuous action spaces needed for embodied AI and robotics [21, 17]. Here, we summarise the agents whose results we report in Section 4. Results for additional agents are included in the Supplementary Material.
Proximal Policy Optimisation (PPO)
PPO [25] is a state-of-the-art policy gradient method proposed for the tasks of robot locomotion and Atari game playing. It improves upon previous policy optimisation methods such as ACER (Actor-Critic with Experience Replay) and TRPO (Trust Region Policy Optimisation) [24]. PPO reduces the complexity of implementation, sampling, and parameter
tuning using a novel objective function that performs a trust-region update that is compatible with
stochastic gradient descent.
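For reference, the clipped surrogate objective at the heart of PPO can be written as follows (notation as in [25], with probability ratio $r_t(\theta)$, advantage estimate $\hat{A}_t$, and clipping parameter $\epsilon$):

$$ L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}. $$

Clipping the probability ratio removes the incentive to move the new policy far from the old one, which gives PPO its trust-region-like behaviour while remaining compatible with plain stochastic gradient descent.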
Soft Actor-Critic (SAC)
SAC [9] is an off-policy actor-critic deep RL algorithm proposed for
continuous control tasks. The algorithm merges stochastic policy optimisation and off-policy methods
like DDPG (Deep Deterministic Policy Gradient). This allows it to better tackle the exploration-
exploitation trade-off pervasive in all reinforcement learning problems by having the actor maximise
both the reward and the entropy of the policy. This helps to increase exploration and prevent the
policy from getting stuck in local optima.
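Concretely, the maximum-entropy objective optimised by SAC's actor can be written as (notation as in [9], with temperature parameter $\alpha$ trading off reward against policy entropy $\mathcal{H}$):

$$ J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t)\sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \right]. $$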
Robust Adversarial Reinforcement Learning (RARL)
Unlike the previous two approaches, RARL [20], as well as the following approach, RAP, is explicitly designed to be robust and to bridge the gap between simulated control results and real-world performance. To achieve this, RARL introduces an adversary that learns an optimal destabilisation policy and applies destabilising forces to the agent, increasing the agent's robustness to real disturbances.
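This adversarial training can be summarised, in simplified form, as a two-player zero-sum game between the protagonist policy $\mu$ and the adversary policy $\nu$ (a compressed statement of the formulation in [20]):

$$ \max_{\mu}\, \min_{\nu}\ \mathbb{E}\!\left[ \sum_{t=0}^{T} r\!\left(s_t,\, a^{\mu}_t,\, a^{\nu}_t\right) \right], $$

where $a^{\mu}_t$ and $a^{\nu}_t$ are the protagonist's and adversary's actions, respectively; in practice the two policies are updated in alternation.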
Robust Adversarial Reinforcement Learning with Adversarial Populations (RAP)
RAP [33] extends RARL by introducing a population of adversaries that are sampled from and trained against. This algorithm aims to reduce the vulnerability of previous adversarial formulations to unseen adversaries by increasing the variety of adversaries, and therefore of adversarial behaviours, seen during training.
Similar to RARL, RAP was originally evaluated on continuous control problems.
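A minimal sketch of the population idea follows. The helpers `update_protagonist` and `update_adversary` are hypothetical stand-ins for a single RL training phase (e.g., a RARL-style update), not functions from any particular library; the key point is only that the opponent is re-sampled from the population at each iteration.

```python
import random

def train_rap(protagonist, adversary_population, n_iterations=1000):
    """RAP-style training loop: sample an adversary from a population at each iteration."""
    for _ in range(n_iterations):
        adversary = random.choice(adversary_population)  # uniform sampling over the population
        update_protagonist(protagonist, adversary)       # hypothetical protagonist update vs. sampled adversary
        update_adversary(adversary, protagonist)         # hypothetical update of the sampled adversary only
    return protagonist
```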
3 Experimental Setup
Our objective is to train the agents in Section 2 to perform a task (e.g., cart-pole stabilisation
in Section 4, quadrotor trajectory tracking in the Supplementary Material) in ideal conditions (i.e.,
without disturbances) and then assess the robustness of the resulting policies in environments that
include injected disturbances.
Each RL agent was trained by randomising the initial state across episodes to improve performance [34], while at test/evaluation time a single, fixed initial state was used for fairness and consistency. The range of disturbance levels used in each experiment was selected to span from (low) values at which all or most agents still succeed in completing the task up to (high) values at which the robustness of all agents eventually fails.
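The evaluation protocol can be sketched as a sweep over disturbance magnitudes with a fixed initial state and 25 evaluation episodes per setting. The snippet below is schematic only: `make_env`, `run_episode`, and the specific disturbance levels are hypothetical placeholders, not the benchmark suite's interface.

```python
import numpy as np

DISTURBANCE_LEVELS = np.linspace(0.0, 2.0, 11)  # from "low" (most agents succeed) to "high" (all fail)
N_EVAL_EPISODES = 25

def evaluate(policy, disturbance_type):
    """Average the length-normalised episode return over evaluation episodes at each disturbance level."""
    results = {}
    for level in DISTURBANCE_LEVELS:
        env = make_env(disturbance_type, magnitude=level, fixed_initial_state=True)  # hypothetical factory
        returns = [run_episode(env, policy) for _ in range(N_EVAL_EPISODES)]         # hypothetical rollout
        results[float(level)] = float(np.mean(returns))
    return results
```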
In the case of the cart-pole, the goal of the controller is to stabilise the system at a pose of 0 m in $x$ (i.e., at the centre) and 0 rad in $\theta$ (i.e., with the pole upright). The quadrotor is required to track a circular reference trajectory in the $x$-$z$ plane with a 0.5 m radius and an origin at (0, 0, 0.5). The trajectory provides a way-point at each control step, which is appended to the observation used to select the next action.
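For concreteness, way-points of this circular reference can be generated as in the sketch below, assuming the stated geometry (0.5 m radius, origin at (0, 0, 0.5), $x$-$z$ plane) and a hypothetical number of control steps; this is not the benchmark's own trajectory generator.

```python
import numpy as np

def circular_reference(n_steps=250, radius=0.5, centre=(0.0, 0.0, 0.5)):
    """Return (x, z) way-points of a circle in the x-z plane, one per control step."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_steps, endpoint=False)
    x = centre[0] + radius * np.cos(angles)
    z = centre[2] + radius * np.sin(angles)
    return np.stack([x, z], axis=1)  # the row for the current step is appended to the observation
```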
Evaluation Metric
To measure the performance of the control policies, the exponentiated negated quadratic cost is summed over each episode, normalised by the episode's length, and averaged over 25 evaluation episodes. The same metric was used for training and evaluation.
Figure 1: Training curves (# of training steps vs. returns) normalised and averaged over 10 runs in environments without disturbances for the agents in Section 2 on the cart-pole stabilisation task. [Plot not reproduced here: x-axis, training steps (×10^6); y-axis, normalised episode return; curves for PPO, RARL, RAP, and SAC.]
Figure 2: Appearance (in one dimension vs. time) of the disturbances injected in the experiments in Section 4: white noise, step, impulse, sawtooth, and triangle waves.
Cost:
$$ J^Q_i = (x_i - x^{goal}_i)^T W_x\, (x_i - x^{goal}_i) + (u_i - u^{goal}_i)^T W_u\, (u_i - u^{goal}_i) \quad (1) $$

Ep. Return:
$$ J^R = \sum_{i=0}^{L} \exp\left(-J^Q_i\right) \quad (2) $$

Avg. Norm. Return:
$$ J^R_{eval} = \frac{1}{N} \sum_{j=0}^{N} \frac{J^R_j}{L_j} \quad (3) $$
Equation (1) shows the task's cost, computed at each step $i$ of an episode, where $x$ and $x^{goal}$ are the actual and goal states of the system, $u$ and $u^{goal}$ the actual and goal inputs, and $W_x$ and $W_u$ are constant weight matrices. $L$ is the total number of steps in a given episode. Equation (2) shows how to compute the return of an episode $j$ of length $L_j$ from the cost function. $L_j$ is equal to (or lower than) the maximum episode duration of 250 steps. Equation (3) shows the average return over $N$ (25) evaluation runs, each normalised by the length of the run. This evaluation metric is the one used throughout Section 4.
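Equations (1)–(3) translate directly into code; the sketch below is a straightforward transcription in which the weight matrices and goal signals are placeholders supplied by the user rather than the benchmark's defaults.

```python
import numpy as np

def quadratic_cost(x, x_goal, u, u_goal, W_x, W_u):
    """Equation (1): quadratic state- and input-error cost at a single step i."""
    dx, du = np.asarray(x) - np.asarray(x_goal), np.asarray(u) - np.asarray(u_goal)
    return float(dx @ W_x @ dx + du @ W_u @ du)

def episode_return(step_costs):
    """Equation (2): sum of exponentiated negated step costs over one episode."""
    return float(np.sum(np.exp(-np.asarray(step_costs))))

def avg_normalised_return(episode_returns, episode_lengths):
    """Equation (3): per-episode returns normalised by episode length, averaged over evaluation runs."""
    returns = np.asarray(episode_returns, dtype=float)
    lengths = np.asarray(episode_lengths, dtype=float)
    return float(np.mean(returns / lengths))
```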
4 Experiments
In Figure 1, we show the training results when no additional disturbances are applied, i.e., the reference performance of each controller at baseline. The three algorithms that reach convergence fastest are SAC, PPO, and RAP. SAC and PPO benefit from the stochastic characteristics of their updates. RARL trains more slowly, which is expected, as it is also learning to counteract the adversary. However, the same behaviour is not observed for the other robust approach, RAP, which also converges quickly, suggesting that RAP can be trained more efficiently.
4.1 Non-periodic Disturbances
Having trained the four agents (PPO, SAC, RARL, RAP), we want to assess the robustness of the
resulting policies. In Figure 2, we introduce five types of disturbances, three non-periodic and two periodic, which are studied in this subsection (4.1) and in the following Subsection 4.2.
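For reference, one-dimensional versions of the five disturbance shapes in Figure 2 can be generated as in the sketch below; the amplitudes, frequencies, and onset times are arbitrary illustrative choices, not the values used in our experiments.

```python
import numpy as np
from scipy import signal

t = np.linspace(0.0, 5.0, 250)                                # time axis for one illustrative episode
rng = np.random.default_rng(0)

white_noise = rng.normal(0.0, 1.0, t.shape)                   # zero-mean white noise
step = np.where(t >= 2.0, 1.0, 0.0)                           # constant offset switched on at t = 2 s
impulse = np.zeros_like(t)
impulse[len(t) // 2] = 1.0                                    # single spike at mid-episode
sawtooth = signal.sawtooth(2.0 * np.pi * 1.0 * t)             # 1 Hz sawtooth wave
triangle = signal.sawtooth(2.0 * np.pi * 1.0 * t, width=0.5)  # width=0.5 yields a triangle wave
```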