Characterising the Robustness of Reinforcement Learning for
Continuous Control using Disturbance Injection
Catherine R. Glossop
Department of Engineering Science
University of Toronto
catherine.glossop@robotics.utias.utoronto.ca
Jacopo Panerati
Institute for Aerospace Studies
University of Toronto
jacopo.panerati@utoronto.ca
Amrit Krishnan
Vector Institute
amritk@vectorinstitute.ai
Zhaocong Yuan
Institute for Aerospace Studies
University of Toronto
justin.yuan@mail.utoronto.ca
Angela P. Schoellig
Institute for Aerospace Studies
University of Toronto
angela.schoellig@utoronto.ca

Work done during an internship at the Vector Institute for Artificial Intelligence.
Abstract
In this study, we leverage the deliberate and systematic fault-injection capabilities
of an open-source benchmark suite to perform a series of experiments on state-of-
the-art deep and robust reinforcement learning algorithms. We aim to benchmark
robustness in the context of continuous action spaces—crucial for deployment in
robot control. We find that robustness is more prominent for action disturbances
than it is for disturbances to observations and dynamics. We also observe that
state-of-the-art approaches that are not explicitly designed to improve robustness
perform at a level comparable to that achieved by those that are. Our study and
results are intended to provide insight into the current state of safe and robust rein-
forcement learning and a foundation for the advancement of the field, in particular,
for deployment in robotic systems.
1 Introduction
Reinforcement learning (RL) has become a promising approach for robotic control, showing how
robotic agents can learn to perform a variety of tasks, such as trajectory tracking and goal-reaching,
on several robotic systems, from robotic manipulators to self-driving vehicles [10, 22, 26]. While many of these results have been achieved in highly controlled simulated environments [13], the next wave of artificial intelligence (AI) research now faces the challenge of deploying these RL control approaches in the real world.
When using reinforcement learning to solve these real-world problems, safety must be paramount [27, 5, 2, 30, 12, 3]. Unsafe interaction with the environment and/or people in that environment can have
]. Unsafe interaction with the environment and/or people in that environment can have
very serious consequences, ranging from the destruction of the robot itself to, most importantly,
harm to humans. For safety to be guaranteed, an embodied RL agent (i.e., the robot) must satisfy
the constraints that define its safe behaviour (i.e., not producing actions that damage the robot, hit
obstacles or people, etc.) and be robust to variations in the environment, its dynamics, and unseen
situations that can emerge in the real world.
In this article, we quantitatively study and report on the performance of a set of state-of-the-art
reinforcement learning approaches in the context of continuous control. We systematically evaluate
RL agents (or “controllers”) on their performance (i.e., the ability to accomplish the task specified
by the environment's reward signal) as well as their robustness [35, 7, 14, 16, 18], which entails a bounded form of generalisability. To do so, we use an open-source RL safety benchmarking suite [34]. First, we empirically compare the control policies produced by both traditional and robust
RL agents at baseline and then when a variety of disturbances are injected into the environment.
We observe that both the traditional and robust RL agents are more robust to disturbances injected through the agent's actions, while disturbances injected at the level of the agent's observations and dynamics cause much more rapid destabilisation. We also note that traditional "vanilla" agents show performance similar to that of the robust RL agents even when disturbances are injected, despite not being explicitly designed with this purpose in mind. By leveraging open-source simulations and implementations, we hope that this work and our insights can provide a basis for further research into safe and robust RL, especially for robot control.
2 Background
In RL, an agent (in our case, a robot) performs an action, receives feedback (a reward) from the environment on how well it is doing at the environment's task, perceives the updated state of the environment resulting from the action taken, and repeats the process, learning over time to improve the actions it takes so as to maximise reward collection (and thus to correctly perform the task). The resulting behaviour is called the agent's policy and maps the environment's states to actions [28]. While early RL research was demonstrated in the context of grid worlds and games, in recent years we have seen a growing interest in physics-based simulation for robot learning [8, 11, 19, 6]. For simplicity and
reproducibility reasons, however, many of these simulators are still fully deterministic (and prone to
be exploited by the agents).
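For readers less familiar with this interaction loop, the sketch below illustrates it using the widely available Gymnasium API and its Pendulum-v1 task purely as a stand-in; the experiments in this paper instead use the PyBullet-based benchmark suite of [34], and the random placeholder policy is only there to mark what the agent actually learns.

```python
import gymnasium as gym

# Illustrative stand-in environment; our experiments use a PyBullet-based suite [34].
env = gym.make("Pendulum-v1")
obs, info = env.reset(seed=0)

def policy(observation):
    # Placeholder policy: this mapping from observed states to actions is what the RL agent learns.
    return env.action_space.sample()

episode_return = 0.0
terminated = truncated = False
while not (terminated or truncated):
    action = policy(obs)                                          # act on the current observation
    obs, reward, terminated, truncated, info = env.step(action)   # environment returns feedback
    episode_return += reward                                      # the agent learns to maximise this
env.close()
```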
In this study, we deliberately inject disturbances at different points of the RL learning and control
interaction loop to emulate the conditions an agent might encounter in the real world. For the sake
of brevity, the results reported in Section 4 pertain to the classical cart-pole stabilisation task. In
the Supplementary Material we include results for the more complex tasks of quadrotor trajectory
tracking and stabilisation.
2.1 Injecting Disturbances in Robotic Environments
We systematically inject each of the disturbances in Figure 2 at one of three possible sites: the observations, the actions, or the dynamics of the environment that the RL agent interacts with.
Observation/state Disturbances
Observation/state disturbances occur when the robot's sensors cannot perceive the exact state of the robot. This is a very common problem in robotics and is tackled with state estimation methods [1]. In the case of the cart-pole, this disturbance is four-dimensional (as is the state) and is measured in metres in the first dimension, radians in the second, metres per second in the third, and radians per second in the fourth. The disturbance is implemented by directly modifying the state observed by the system. For the quadrotor task in the Supplementary Material, the observation disturbance is similarly added to the drone's six-dimensional true state.
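As a minimal illustration of the mechanism, a white-noise observation disturbance can be emulated by perturbing the state before it reaches the controller. The sketch below assumes a NumPy array state of the form [x, theta, x_dot, theta_dot] and user-chosen per-dimension standard deviations; the benchmark suite applies its disturbances internally, so this is not its actual interface.

```python
import numpy as np

rng = np.random.default_rng(0)

def disturb_observation(true_state, std=(0.01, 0.01, 0.05, 0.05)):
    """Return a noisy copy of the cart-pole state [x (m), theta (rad), x_dot (m/s), theta_dot (rad/s)].

    The controller only ever sees the disturbed state, emulating imperfect sensing.
    """
    noise = rng.normal(loc=0.0, scale=np.asarray(std))
    return np.asarray(true_state, dtype=float) + noise
```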
Action Disturbances
Action disturbances occur when the actuation of the robot's motors is not exactly as the control output specifies, resulting in a difference between the actual and expected action. For example, action delays are often neglected or coarsely modelled in simple simulations. In the case of the cart-pole, this disturbance is a one-dimensional force (in Newtons) in the $x$-direction, applied directly to the slider-to-cart joint. For the quadrotor task, action disturbances are similarly added to the UAV's commanded individual motor thrusts.
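A minimal sketch of both action-level effects mentioned above, assuming the cart-pole action is a scalar force in Newtons: an additive disturbance (e.g., one of the waveforms in Figure 2) and a fixed actuation delay. The helpers below are illustrative, not part of the benchmark suite.

```python
from collections import deque

def disturb_action(commanded_force_n, disturbance_n):
    """Add a disturbance force (N) to the force commanded by the policy before it is actuated."""
    return commanded_force_n + disturbance_n

def make_delayed_actuator(delay_steps=1, neutral_force_n=0.0):
    """Model a fixed actuation delay, another common action-level imperfection."""
    buffer = deque([neutral_force_n] * delay_steps)

    def actuate(commanded_force_n, disturbance_n=0.0):
        buffer.append(disturb_action(commanded_force_n, disturbance_n))  # disturbed command enters the queue
        return buffer.popleft()                                          # force actually applied this step

    return actuate
```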
External Dynamics Disturbances
External dynamics disturbances are applied directly to the robot and can be thought of as environmental factors such as wind or other external forces. In the case of the cart-pole, this disturbance is two-dimensional and implemented as a tapping force (in Newtons) applied to the top of the pole. For the quadrotor task, the dynamics disturbance is a planar wind force applied directly to the drone's centre of mass.
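Because the benchmark environments are PyBullet-based [34], a dynamics disturbance of this kind can be emulated by applying an external force to a body in the physics engine. The sketch below is only an illustration: `body_id` and `pole_link_index` are hypothetical handles, and the exact axes of the planar force depend on the environment's convention.

```python
import pybullet as p

def apply_tapping_force(body_id, pole_link_index, f1_n, f2_n):
    """Apply a two-dimensional external force (N) at the top of the pole, e.g., a tap or gust."""
    p.applyExternalForce(
        objectUniqueId=body_id,        # hypothetical handle to the simulated cart-pole
        linkIndex=pole_link_index,     # hypothetical index of the pole link
        forceObj=[f1_n, f2_n, 0.0],    # planar disturbance force
        posObj=[0.0, 0.0, 0.0],        # point of application in the link's local frame
        flags=p.LINK_FRAME,
    )
```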
2.2 Reinforcement Learning Agents for Continuous and Robust Control
While some of the most notable results of deep RL control [15] were achieved in the context of discrete action spaces, we focus on actor-critic agents capable of dealing with the continuous action spaces needed for embodied AI and robotics [21, 17]. Here, we summarise the agents whose results we report in Section 4. Results for additional agents are included in the Supplementary Material.
Proximal Policy Optimisation (PPO)
PPO [25] is a state-of-the-art policy gradient method proposed for the tasks of robot locomotion and Atari game playing. It improves upon previous policy optimisation methods such as ACER (Actor-Critic with Experience Replay) and TRPO (Trust Region Policy Optimisation) [24]. PPO reduces the complexity of implementation, sampling, and parameter
tuning using a novel objective function that performs a trust-region update that is compatible with
stochastic gradient descent.
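For reference, the clipped surrogate objective at the heart of PPO can be written as follows (notation as in [25], with probability ratio $r_t(\theta)$, advantage estimate $\hat{A}_t$, and clipping parameter $\epsilon$):

$$ L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}. $$

Clipping the probability ratio removes the incentive to move the new policy far from the old one, which gives PPO its trust-region-like behaviour while remaining compatible with plain stochastic gradient descent.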
Soft Actor-Critic (SAC)
SAC [9] is an off-policy actor-critic deep RL algorithm proposed for
continuous control tasks. The algorithm merges stochastic policy optimisation and off-policy methods
like DDPG (Deep Deterministic Policy Gradient). This allows it to better tackle the exploration-
exploitation trade-off pervasive in all reinforcement learning problems by having the actor maximise
both the reward and the entropy of the policy. This helps to increase exploration and prevent the
policy from getting stuck in local optima.
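Concretely, the maximum-entropy objective optimised by SAC's actor can be written as (notation as in [9], with temperature parameter $\alpha$ trading off reward against policy entropy $\mathcal{H}$):

$$ J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t)\sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \right]. $$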
Robust Adversarial Reinforcement Learning (RARL)
Unlike the previous two approaches, RARL [20], as well as the following approach, RAP, is explicitly designed to be robust and to bridge the gap between simulated control results and real-world performance. To achieve this, RARL introduces an adversary that learns an optimal destabilisation policy and applies destabilising forces to the agent, increasing the agent's robustness to real disturbances.
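This adversarial training can be summarised, in simplified form, as a two-player zero-sum game between the protagonist policy $\mu$ and the adversary policy $\nu$ (a compressed statement of the formulation in [20]):

$$ \max_{\mu}\, \min_{\nu}\ \mathbb{E}\!\left[ \sum_{t=0}^{T} r\!\left(s_t,\, a^{\mu}_t,\, a^{\nu}_t\right) \right], $$

where $a^{\mu}_t$ and $a^{\nu}_t$ are the protagonist's and adversary's actions, respectively; in practice the two policies are updated in alternation.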
Robust Adversarial Reinforcement Learning with Adversarial Populations (RAP)
RAP [33] extends RARL by introducing a population of adversaries that are sampled from and trained against. This algorithm aims to reduce the vulnerability of previous adversarial formulations to unseen adversaries by increasing the variety of adversaries, and therefore of adversarial behaviours, seen during training.
Similar to RARL, RAP was originally evaluated on continuous control problems.
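A minimal sketch of the population idea follows. The helpers `update_protagonist` and `update_adversary` are hypothetical stand-ins for a single RL training phase (e.g., a RARL-style update), not functions from any particular library; the key point is only that the opponent is re-sampled from the population at each iteration.

```python
import random

def train_rap(protagonist, adversary_population, n_iterations=1000):
    """RAP-style training loop: sample an adversary from a population at each iteration."""
    for _ in range(n_iterations):
        adversary = random.choice(adversary_population)  # uniform sampling over the population
        update_protagonist(protagonist, adversary)       # hypothetical protagonist update vs. sampled adversary
        update_adversary(adversary, protagonist)         # hypothetical update of the sampled adversary only
    return protagonist
```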
3 Experimental Setup
Our objective is to train the agents in Section 2 to perform a task (e.g., cart-pole stabilisation
in Section 4, quadrotor trajectory tracking in the Supplementary Material) in ideal conditions (i.e.,
without disturbances) and then assess the robustness of the resulting policies in environments that
include injected disturbances.
Each RL agent was trained by randomising the initial state across episodes to improve performance [34], while at test/evaluation time a single, fixed initial state was used for fairness and consistency. The range of disturbance levels used in each experiment was selected to span from (low) values at which all or most agents still succeed in completing the task up to (high) values at which the robustness of all agents eventually fails.
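The evaluation protocol can be sketched as a sweep over disturbance magnitudes with a fixed initial state and 25 evaluation episodes per setting. The snippet below is schematic only: `make_env`, `run_episode`, and the specific disturbance levels are hypothetical placeholders, not the benchmark suite's interface.

```python
import numpy as np

DISTURBANCE_LEVELS = np.linspace(0.0, 2.0, 11)  # from "low" (most agents succeed) to "high" (all fail)
N_EVAL_EPISODES = 25

def evaluate(policy, disturbance_type):
    """Average the length-normalised episode return over evaluation episodes at each disturbance level."""
    results = {}
    for level in DISTURBANCE_LEVELS:
        env = make_env(disturbance_type, magnitude=level, fixed_initial_state=True)  # hypothetical factory
        returns = [run_episode(env, policy) for _ in range(N_EVAL_EPISODES)]         # hypothetical rollout
        results[float(level)] = float(np.mean(returns))
    return results
```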
In the case of the cart-pole, the goal of the controller is to stabilise the system at a pose of 0 m in $x$ (i.e., at the centre) and 0 rad in $\theta$ (i.e., with the pole upright). The quadrotor is required to track a circular reference trajectory in the $x$-$z$ plane with a 0.5 m radius and an origin at (0, 0, 0.5). The trajectory provides a way-point at each control step, which is appended to the observation used to select the next action.
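For concreteness, way-points of this circular reference can be generated as in the sketch below, assuming the stated geometry (0.5 m radius, origin at (0, 0, 0.5), $x$-$z$ plane) and a hypothetical number of control steps; this is not the benchmark's own trajectory generator.

```python
import numpy as np

def circular_reference(n_steps=250, radius=0.5, centre=(0.0, 0.0, 0.5)):
    """Return (x, z) way-points of a circle in the x-z plane, one per control step."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_steps, endpoint=False)
    x = centre[0] + radius * np.cos(angles)
    z = centre[2] + radius * np.sin(angles)
    return np.stack([x, z], axis=1)  # the row for the current step is appended to the observation
```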
Evaluation Metric
To measure the performance of the control policies, the exponentiated negated quadratic cost is summed over each episode, normalised by the episode's length, and averaged over 25 evaluation episodes. The same metric was used for training and evaluation.
Figure 1: Training curves (# of training steps vs. returns) normalised and averaged over 10 runs in environments without disturbances for the agents in Section 2 on the cart-pole stabilisation task. [Plot not reproduced here: x-axis, training steps (×10^6); y-axis, normalised episode return; curves for PPO, RARL, RAP, and SAC.]
Figure 2: Appearance (in one dimension vs. time) of the disturbances injected in the experiments in Section 4: white noise, step, impulse, sawtooth, and triangle waves.
Cost:
$$ J^Q_i = (x_i - x^{goal}_i)^T W_x\, (x_i - x^{goal}_i) + (u_i - u^{goal}_i)^T W_u\, (u_i - u^{goal}_i) \quad (1) $$

Ep. Return:
$$ J^R = \sum_{i=0}^{L} \exp\left(-J^Q_i\right) \quad (2) $$

Avg. Norm. Return:
$$ J^R_{eval} = \frac{1}{N} \sum_{j=0}^{N} \frac{J^R_j}{L_j} \quad (3) $$
Equation (1) shows the task's cost, computed at each step $i$ of an episode, where $x$ and $x^{goal}$ are the actual and goal states of the system, $u$ and $u^{goal}$ the actual and goal inputs, and $W_x$ and $W_u$ are constant weight matrices. $L$ is the total number of steps in a given episode. Equation (2) shows how to compute the return of an episode $j$ of length $L_j$ from the cost function. $L_j$ is equal to (or lower than) the maximum episode duration of 250 steps. Equation (3) shows the average return over $N$ (25) evaluation runs, each normalised by the length of the run. This evaluation metric is the one used throughout Section 4.
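Equations (1)–(3) translate directly into code; the sketch below is a straightforward transcription in which the weight matrices and goal signals are placeholders supplied by the user rather than the benchmark's defaults.

```python
import numpy as np

def quadratic_cost(x, x_goal, u, u_goal, W_x, W_u):
    """Equation (1): quadratic state- and input-error cost at a single step i."""
    dx, du = np.asarray(x) - np.asarray(x_goal), np.asarray(u) - np.asarray(u_goal)
    return float(dx @ W_x @ dx + du @ W_u @ du)

def episode_return(step_costs):
    """Equation (2): sum of exponentiated negated step costs over one episode."""
    return float(np.sum(np.exp(-np.asarray(step_costs))))

def avg_normalised_return(episode_returns, episode_lengths):
    """Equation (3): per-episode returns normalised by episode length, averaged over evaluation runs."""
    returns = np.asarray(episode_returns, dtype=float)
    lengths = np.asarray(episode_lengths, dtype=float)
    return float(np.mean(returns / lengths))
```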
4 Experiments
In Figure 1, we show the training results when no additional disturbances are applied, i.e., the reference performance of each controller at baseline. The three algorithms that reach convergence fastest are SAC, PPO, and RAP. SAC and PPO benefit from the stochastic characteristics of their updates. RARL trains more slowly, which is expected, as it is also learning to counteract the adversary. However, the same behaviour is not observed for the other robust approach, RAP, which also converges quickly, suggesting that RAP can be trained more efficiently.
4.1 Non-periodic Disturbances
Having trained the four agents (PPO, SAC, RARL, RAP), we want to assess the robustness of the
resulting policies. In Figure 2, we introduce five types of disturbances, three non-periodic and two periodic, which are studied in this subsection (4.1) and in the following Subsection 4.2.
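For reference, one-dimensional versions of the five disturbance shapes in Figure 2 can be generated as in the sketch below; the amplitudes, frequencies, and onset times are arbitrary illustrative choices, not the values used in our experiments.

```python
import numpy as np
from scipy import signal

t = np.linspace(0.0, 5.0, 250)                                # time axis for one illustrative episode
rng = np.random.default_rng(0)

white_noise = rng.normal(0.0, 1.0, t.shape)                   # zero-mean white noise
step = np.where(t >= 2.0, 1.0, 0.0)                           # constant offset switched on at t = 2 s
impulse = np.zeros_like(t)
impulse[len(t) // 2] = 1.0                                    # single spike at mid-episode
sawtooth = signal.sawtooth(2.0 * np.pi * 1.0 * t)             # 1 Hz sawtooth wave
triangle = signal.sawtooth(2.0 * np.pi * 1.0 * t, width=0.5)  # width=0.5 yields a triangle wave
```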