Reinforcement Learning with Automated Auxiliary Loss Search

Tairan He¹, Yuge Zhang², Kan Ren², Minghuan Liu¹,
Che Wang³, Weinan Zhang¹, Yuqing Yang², Dongsheng Li²
¹Shanghai Jiao Tong University  ²Microsoft Research Asia  ³New York University
whynot@sjtu.edu.cn  kan.ren@microsoft.com
Abstract
A good state representation is crucial to solving complicated reinforcement learning (RL) challenges. Many recent works focus on designing auxiliary losses for learning informative representations. Unfortunately, these handcrafted objectives rely heavily on expert knowledge and may be sub-optimal. In this paper, we propose a principled and universal method for learning better representations with auxiliary loss functions, named Automated Auxiliary Loss Search (A2LS), which automatically searches for top-performing auxiliary loss functions for RL. Specifically, based on the collected trajectory data, we define a general auxiliary loss space of size $7.5 \times 10^{20}$ and explore the space with an efficient evolutionary search strategy. Empirical results show that the discovered auxiliary loss (namely, A2-winner) significantly improves performance on both high-dimensional (image) and low-dimensional (vector) unseen tasks with much higher efficiency, showing promising generalization ability to different settings and even different benchmark domains. We conduct a statistical analysis to reveal the relations between patterns of auxiliary losses and RL performance. The code and supplementary materials are available at https://seqml.github.io/a2ls.
1 Introduction
Reinforcement learning (RL) has achieved remarkable progress in games [31, 47, 50], financial trading [8] and robotics [13]. However, at its core, without designs tailored to specific tasks, general RL paradigms still learn implicit representations from the critic loss (value prediction) and the actor loss (maximizing cumulative reward). In many real-world scenarios where observations are complicated (e.g., images) or incomplete (e.g., partially observable), training an agent that is able to extract informative signals from those inputs becomes incredibly sample-inefficient.
Therefore, many recent works have been devoted to obtaining a good state representation, which is believed to be one of the key solutions to improve the efficacy of RL [23, 24]. One of the main streams is adding auxiliary losses to update the state encoder. Under the hood, it resorts to informative and dense learning signals in order to encode various prior knowledge and regularization [40], and obtain better latent representations. Over the years, a series of works have attempted to figure out the form of the most helpful auxiliary loss for RL. Quite a few advances have been made, including observation reconstruction [51], reward prediction [20], environment dynamics prediction [40, 6, 35],
etc. But we note two problems in this evolving process: (i) each of the loss designs listed above is obtained through empirical trial and error based on expert design, thus heavily relying on human labor and expertise; (ii) few works have used the final performance of RL as an optimization objective to directly search the auxiliary loss, indicating that these designs could be sub-optimal.

The work was conducted during Tairan He’s internship at Microsoft Research.
The corresponding author is Kan Ren.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).

[Figure 1 diagram: a replay buffer feeds data to a state encoder shared by the policy network and Q network; gradients from both the RL loss and the auxiliary loss update the encoder. The inner loop (RL training) reports training scores to the outer loop (evolution), which selects the top 25% of auxiliary loss candidates, mutates them, and rejects invalid losses.]
Figure 1: Overview of A2LS. A2LS contains an inner loop (left) and an outer loop (right). The inner loop performs an RL training procedure with the searched auxiliary loss functions. The outer loop searches auxiliary loss functions using an evolutionary algorithm to select better auxiliary losses.
To resolve the issues of the existing handcrafted solutions mentioned above, we automate the process of designing auxiliary loss functions for RL and propose a principled solution named Automated Auxiliary Loss Search (A2LS). A2LS formulates the problem as a bi-level optimization in which we try to find the auxiliary loss that, to the greatest extent, helps train a good RL agent. The outer loop searches for auxiliary losses based on RL performance to ensure the searched losses align with the RL objective, while the inner loop performs RL training with the searched auxiliary loss function. Specifically, A2LS utilizes an evolutionary strategy to search the configuration of auxiliary losses over a novel search space of size $7.5 \times 10^{20}$ that covers many existing solutions. By searching on a small set of simulated continuous-control training environments from the DeepMind Control suite (DMC) [43], A2LS finalizes a loss, namely A2-winner.
To evaluate the generalizability of the discovered auxiliary loss A2-winner, we test A2-winner on a wide set of test environments, including both image-based and vector-based (with proprioceptive features such as positions, velocities and accelerations as inputs) tasks. Extensive experiments show that the searched loss function is highly effective and largely outperforms strong baseline methods. More importantly, the searched auxiliary loss generalizes well to unseen settings such as (i) different robots to control; (ii) different data types of observation; (iii) partially observable settings; (iv) different network architectures; and (v) even a totally different discrete-control domain (Atari 2600 games [1]). Finally, we conduct detailed statistical analyses of the relation between RL performance and patterns of auxiliary losses based on data from the whole evolutionary search process, providing useful insights for future studies of auxiliary loss design and representation learning in RL.
2 Problem Formulation and Background
We consider the standard Markov Decision Process (MDP) $E$ where the state, action and reward at time step $t$ are denoted as $(s_t, a_t, r_t)$. The sequence of rollout data sampled by the agent in the episodic environment is $(s_0, \ldots, s_t, a_t, r_t, s_{t+1}, \cdots, s_T)$, where $T$ represents the episode length. Suppose the RL agent is parameterized by $\omega$ (either the policy $\pi$ or the state-action value function $Q$), with a state encoder $g_\theta$ parameterized by $\theta \subseteq \omega$, which plays a key role for representation learning in RL. The agent is required to maximize its cumulative reward in environment $E$ by optimizing $\omega$, denoted as $R(\omega; E) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{T-1} r_t\right]$.
In this paper, we aim to find the optimal auxiliary loss function $\mathcal{L}_{\text{Aux}}$ such that the agent can reach the best performance by optimizing $\omega$ under a combination of an arbitrary RL loss function $\mathcal{L}_{\text{RL}}$ together with an auxiliary loss $\mathcal{L}_{\text{Aux}}$. Formally, our optimization goal is

$$\max_{\mathcal{L}_{\text{Aux}}} \; R\Big( \min_{\omega} \; \mathcal{L}_{\text{RL}}(\omega; E) + \lambda \mathcal{L}_{\text{Aux}}(\theta; E); \; E \Big), \qquad (1)$$

where $\lambda$ is a hyper-parameter balancing the relative weight of the auxiliary loss. The left part (inner loop) of Figure 1 illustrates how data and gradients flow in RL training when an auxiliary loss is enabled. Some instances of $\mathcal{L}_{\text{RL}}$ and $\mathcal{L}_{\text{Aux}}$ are given in Appendix B. Unfortunately, existing auxiliary losses $\mathcal{L}_{\text{Aux}}$ are handcrafted, rely heavily on expert knowledge, and may not generalize well in different scenarios, as shown in the experiment section. To find better auxiliary loss functions for representation learning in RL, we introduce our principled solution in the following section.

Table 1: Typical solutions with auxiliary losses and their common elements (the input elements consist of horizon, source and target).

Auxiliary Loss               | Operator | Horizon | Source                              | Target
Forward dynamics [35, 40, 6] | MSE      | 1       | {s_t, a_t}                          | {s_{t+1}}
Inverse dynamics             | MSE      | 1       | {a_t, s_{t+1}}                      | {s_t}
Reward prediction [20, 6]    | MSE      | 1       | {s_t, a_t}                          | {r_t}
Action inference [40, 6]     | MSE      | 1       | {s_t, s_{t+1}}                      | {a_t}
CURL [23]                    | Bilinear | 1       | {s_t}                               | {s_t}
ATC [42]                     | Bilinear | k       | {s_t}                               | {s_{t+1}, ..., s_{t+k}}
SPR [39]                     | N-MSE    | k       | {s_t, a_t, a_{t+1}, ..., a_{t+k-1}} | {s_{t+1}, ..., s_{t+k}}
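To make the inner problem of Equation (1) concrete, the following is a minimal PyTorch-style sketch of one joint gradient step, assuming a simple MLP encoder, a TD-style critic regression as $\mathcal{L}_{\text{RL}}$, and a forward-dynamics instance of $\mathcal{L}_{\text{Aux}}$. The network sizes, batch format and the value of lambda_aux are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(8, 64), nn.ReLU())             # g_theta
critic_head = nn.Linear(64 + 2, 1)                               # Q(s, a) head used by L_RL
predictor = nn.Linear(64 + 2, 64)                                # h, predicts the next latent
params = (list(encoder.parameters()) + list(critic_head.parameters())
          + list(predictor.parameters()))
optimizer = torch.optim.Adam(params, lr=3e-4)
lambda_aux = 1.0                                                 # weight of the auxiliary loss

def joint_update(s, a, q_target, s_next):
    z, z_next = encoder(s), encoder(s_next)
    # RL loss: regress Q(s, a) toward a precomputed TD target.
    rl_loss = nn.functional.mse_loss(critic_head(torch.cat([z, a], dim=-1)), q_target)
    # Auxiliary loss: forward-dynamics prediction of the next latent state.
    aux_loss = nn.functional.mse_loss(predictor(torch.cat([z, a], dim=-1)), z_next.detach())
    optimizer.zero_grad()
    (rl_loss + lambda_aux * aux_loss).backward()                 # both losses shape g_theta
    optimizer.step()
    return float(rl_loss), float(aux_loss)

# Toy batch: 32 transitions with 8-dim states and 2-dim actions.
joint_update(torch.randn(32, 8), torch.randn(32, 2), torch.randn(32, 1), torch.randn(32, 8))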
3 Automated Auxiliary Loss Search
To meet our goal of finding top-performing auxiliary loss functions without expert involvement, we turn to automated loss search, which has shown promising results in the automated machine learning (AutoML) community [27, 28, 48]. Accordingly, we propose Automated Auxiliary Loss Search (A2LS), a principled solution to the bi-level optimization problem in Equation 1. A2LS resolves the inner problem as a standard RL training procedure; for the outer problem, A2LS defines a finite and discrete search space (Section 3.1) and designs a novel evolution strategy to efficiently explore the space (Section 3.2).
3.1 Search Space Design
We have argued that almost all existing auxiliary losses require expert knowledge, and we aim to search for a better one automatically. To this end, we should design a search space that satisfies the following desiderata.
Generalization: the search space should cover most of the existing handcrafted auxiliary losses to ensure the searched results can be no worse than handcrafted losses;
Atomicity: the search space should be composed of several independent dimensions to fit into any general search algorithm [30] and support an efficient search scheme;
Sufficiency: the search space should be large enough to contain the top-performing solutions.
Given these criteria, we summarize some existing auxiliary losses in Table 1 and identify their commonalities as well as differences. We find that these losses share similar components and a similar computation flow. As shown in Figure 2, when training the RL agent, the loss first selects a sequence $\{s_t, a_t, r_t\}_{t=i}^{i+k}$ from the replay buffer, where $k$ is called the horizon. The agent then tries to predict some elements of the sequence (called the target) based on another picked subset of elements from the sequence (called the source). Finally, the loss computes and minimizes the prediction error (rigorously defined by an operator). More specifically, the encoder part $g_\theta$ of the agent first encodes the source into latent representations, which are further fed into a predictor $h$ to get a prediction $y$; the auxiliary loss is computed from the prediction $y$ and the ground truth $\hat{y}$, which is translated from the target by a target encoder $g_{\hat\theta}$, using an operator $f$. The target encoder is updated in a momentum manner as shown in Figure 2 (details are given in Appendix C.1.2). Formally,

$$\mathcal{L}_{\text{Aux}}(\theta; E) = f\big(h(g_\theta(\text{seq}_{\text{source}})),\; g_{\hat\theta}(\text{seq}_{\text{target}})\big), \qquad (2)$$
where $\text{seq}_{\text{source}}, \text{seq}_{\text{target}} \subseteq \{s_t, a_t, r_t\}_{t=i}^{i+k}$ are both subsets of the candidate sequence. For simplicity, we write $g_\theta(s_t, a_t, r_t, s_{t+1}, \cdots)$ as shorthand for $[g_\theta(s_t), a_t, r_t, g_\theta(s_{t+1}), \cdots]$ in the rest of this paper (the encoder $g$ only processes states $\{s_i\}$). We observe that these existing auxiliary losses differ in two dimensions, i.e., input elements and operator, where the input elements are further composed of horizon, source and target. These differences form the search dimensions of the whole space. We next describe the search range of each dimension in detail.

[Figure 2 diagram: the input elements (horizon, source, target) select parts of the candidate sequence; the source passes through the encoder and a predictor, the target passes through a momentum-updated target encoder with stop-gradient, and an operator computes the auxiliary loss. An example of forward dynamics prediction (horizon k = 1) is shown.]
Figure 2: Overview of the search space $\{I, f\}$ and the computation graph of auxiliary loss functions. $I$ selects a candidate sequence $\{s_t, a_t, r_t\}_{t=i}^{i+k}$ with horizon $k$, then determines a source and a target as arbitrary subsets of the sequence; an encoder $g_\theta$ first encodes the source into latent representations, which are fed into a predictor $h$ to get a prediction $y$; the auxiliary loss is computed over the prediction $y$ and the ground truth $\hat{y}$ that is translated from the target by a target encoder $g_{\hat\theta}$, using an operator $f$.

Input elements. The input elements denote all inputs to the loss functions, which can be further
disassembled into horizon, source and target. Unlike previous automated loss search works, the target here is not a "ground truth", because auxiliary losses in RL have no labels beforehand. Instead, both source and target are generated by interacting with the environment in a self-supervised manner. In particular, the input elements first determine a candidate sequence $\{s_t, a_t, r_t\}_{t=i}^{i+k}$ with horizon $k$. Two subsets of the candidate sequence are then chosen as the source and the target, respectively. For example, the chosen (source, target) pair can be $(\{s_t\}, \{s_t, s_{t+1}\})$, or $(\{s_t, r_{t+1}, a_{t+2}\}, \{s_t, s_{t+1}, a_{t+1}\})$, etc.
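As an illustration of how such input elements can be encoded, here is a small, hypothetical sketch (not taken from the paper's code) that stores a horizon together with source and target subsets as (element, offset) index sets and gathers them from a trajectory; the container and helper names are ours.

from dataclasses import dataclass

@dataclass
class InputElements:
    horizon: int          # k
    source: set           # e.g. {("s", 0), ("a", 0)} means {s_t, a_t}
    target: set           # e.g. {("s", 1)} means {s_{t+1}}

    def gather(self, traj, i):
        """Pick the chosen elements from a trajectory starting at step i.

        traj is a dict such as {"s": [...], "a": [...], "r": [...]}.
        """
        def pick(subset):
            return [traj[name][i + offset] for name, offset in sorted(subset)]
        return pick(self.source), pick(self.target)

# Forward dynamics (horizon 1): source = {s_t, a_t}, target = {s_{t+1}}.
fwd = InputElements(horizon=1, source={("s", 0), ("a", 0)}, target={("s", 1)})
traj = {"s": [0.0, 1.0, 2.0], "a": [10, 11], "r": [0.1, 0.2]}
print(fwd.gather(traj, i=0))   # ([10, 0.0], [1.0]), sorted by (name, offset)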
Operator. Given a prediction $y$ and its target $\hat{y}$, the auxiliary loss is computed by an operator $f$, which is typically a similarity measure. In our work, we cover all the different operators $f$ used by previous works, including inner product (Inner) [17, 42], bilinear inner product (Bilinear) [23], cosine similarity (Cosine) [3], mean squared error (MSE) [35, 6] and normalized mean squared error (N-MSE) [39]. Additionally, other works utilize contrastive objectives, e.g., the InfoNCE loss [33], which samples un-paired predictions and targets as negative samples and maximizes the distances between them. This technique is orthogonal to the five similarity measures mentioned above, so we make it optional and obtain $5 \times 2 = 10$ different operators in total.
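The similarity-style operators above can be written compactly; the sketch below is our own approximation in PyTorch and may differ in normalization details (and in how the optional contrastive variant is wired in) from the implementations in the cited works.

import torch
import torch.nn.functional as F

def mse(y, y_hat):
    return F.mse_loss(y, y_hat)

def n_mse(y, y_hat):
    # MSE between L2-normalized vectors, in the spirit of SPR-style objectives.
    return F.mse_loss(F.normalize(y, dim=-1), F.normalize(y_hat, dim=-1))

def cosine(y, y_hat):
    # 1 - cosine similarity, averaged over the batch.
    return (1 - F.cosine_similarity(y, y_hat, dim=-1)).mean()

def inner(y, y_hat):
    # Negative inner product (to be minimized).
    return -(y * y_hat).sum(dim=-1).mean()

def bilinear(y, y_hat, W):
    # Negative bilinear similarity y^T W y_hat with a learnable matrix W of shape (D, D).
    return -torch.einsum("bd,de,be->b", y, W, y_hat).mean()

# Toy usage on predictions and targets of shape (batch, dim).
B, D = 4, 8
y, y_hat, W = torch.randn(B, D), torch.randn(B, D), torch.randn(D, D)
for name, loss in [("MSE", mse(y, y_hat)), ("N-MSE", n_mse(y, y_hat)),
                   ("Cosine", cosine(y, y_hat)), ("Inner", inner(y, y_hat)),
                   ("Bilinear", bilinear(y, y_hat, W))]:
    print(name, float(loss))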
Final design. In light of the preceding discussion, with the definitions of input elements and operator, we complete the design of the search space, which meets the desiderata above. Specifically, the space is generalizable, covering most of the existing handcrafted auxiliary losses; its atomicity is embodied by compositionality, as all input elements work with any operator; most importantly, the search space is sufficiently large, with a total size of $7.5 \times 10^{20}$ (the detailed calculation can be found in Appendix E), to contain better solutions.
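To tie the pieces together, the following sketch assembles one point of the search space into the computation of Equation (2), assuming MLP encoders, the MSE operator, and state-only source/target inputs; the momentum coefficient and layer sizes are arbitrary placeholders rather than the paper's settings.

import copy
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 32))  # g_theta
target_encoder = copy.deepcopy(encoder)                                  # g_theta_hat
predictor = nn.Linear(32, 32)                                            # h

def auxiliary_loss(source_states, target_states):
    y = predictor(encoder(source_states))          # prediction from the source branch
    with torch.no_grad():                          # stop-gradient on the target branch
        y_hat = target_encoder(target_states)
    return nn.functional.mse_loss(y, y_hat)        # operator f = MSE

@torch.no_grad()
def momentum_update(tau=0.99):
    # theta_hat <- tau * theta_hat + (1 - tau) * theta
    for p_hat, p in zip(target_encoder.parameters(), encoder.parameters()):
        p_hat.mul_(tau).add_(p, alpha=1 - tau)

loss = auxiliary_loss(torch.randn(16, 8), torch.randn(16, 8))
loss.backward()        # gradients reach the encoder and predictor only
momentum_update()      # then slowly track the online encoder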
3.2 Search Strategy
The effectiveness of evolution strategies in exploring large, multi-dimensional search spaces has been demonstrated in many works [19, 4]. Similarly, A2LS adopts an evolutionary algorithm [37] to search for top-performing auxiliary loss functions over the designed search space. In essence, the evolutionary algorithm (i) keeps a population of loss function candidates; (ii) evaluates their performance; and (iii) eliminates the worst and evolves into a new, better population. Note that the evaluation in step (ii) is very costly because it needs to train RL agents with dozens of different auxiliary loss functions. Therefore, our key technical contributions concern how to further reduce the search cost (Section 3.2.1) and how to make the search procedure efficient (Section 3.2.2).
3.2.1 Search Space Pruning
In our preliminary experiments, we find that the operator dimension of the search space can be simplified. In particular, MSE outperforms all the other operators by significant margins in most cases, so we prune all operator choices except MSE. See Appendix D.1 for the complete comparative results and an ablation study on the effectiveness of search space pruning.
Figure 3: Four types of mutation strategies for evolution. We represent both the source and the target of the input elements as a pair of binary masks, where each bit indicates whether the corresponding element is selected (1, green block) or not (0, white block).
3.2.2 Evolution Procedure
Our evolution procedure contains four main components: (i) evaluation and selection: a population of candidate auxiliary losses is evaluated through an inner loop of RL training, and the top candidates are selected for the next evolution stage (i.e., generation); (ii) mutation: the selected candidates mutate to form a new population for the next stage; (iii) loss rejection: invalid auxiliary losses are filtered out and skipped before evaluation; and (iv) bootstrapping the initial population: initial auxiliary losses that contain patterns known to be useful are assigned higher probability, improving efficiency. The step-by-step evolution algorithm is provided in Algorithm 1 in the appendix, and an overview of the A2LS pipeline is illustrated in Figure 1. We next describe each component in detail.
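Schematically, the outer loop can be summarized as below; train_rl_and_score, mutate, and is_valid stand in for the inner RL training (scored by AULC), the mutation strategy, and the loss rejection protocol described in the following paragraphs, and their exact signatures are assumptions made for illustration.

import random

def evolve(initial_population, num_stages, train_rl_and_score, mutate, is_valid):
    population = list(initial_population)
    for stage in range(num_stages):
        # (i) evaluation and selection: score every candidate with an inner RL run.
        scored = sorted(population, key=train_rl_and_score, reverse=True)
        survivors = scored[: max(1, len(scored) // 4)]            # keep the top 25%
        # (ii) mutation and (iii) loss rejection: refill the population with valid children.
        children = []
        while len(children) < len(population) - len(survivors):
            child = mutate(random.choice(survivors), survivors)
            if is_valid(child):                                   # reject invalid losses
                children.append(child)
        population = survivors + children
    return population[0]   # survivors come first, so index 0 is the best evaluated candidate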
Evaluation and selection. At each evolution stage, we first train a population of candidates with population size $P = 100$ via the inner loop of RL training. The candidates are then ranked by the approximated area under the learning curve (AULC) [11, 41], a single metric that reflects both convergence speed and final performance [46] with low variance. After each training stage, the top 25% of candidates are selected to generate the population for the next stage. We include an ablation study on the effectiveness of AULC in Appendix D.3.
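One possible way to compute the AULC ranking is sketched below, using a simple trapezoidal approximation over periodically logged evaluation scores; the evaluation schedule and normalization are assumptions rather than the paper's exact recipe.

import numpy as np

def aulc(eval_scores, eval_steps):
    """Approximate area under the learning curve, normalized by the step range."""
    return np.trapz(eval_scores, eval_steps) / (eval_steps[-1] - eval_steps[0])

def select_top_quarter(candidates, curves):
    """curves[c] = (scores, steps) recorded while training candidate c."""
    ranked = sorted(candidates, key=lambda c: aulc(*curves[c]), reverse=True)
    return ranked[: max(1, len(ranked) // 4)]

# Toy example: a fast learner beats a slow learner with the same final score.
steps = [0, 100, 200, 300]
curves = {"fast": ([0, 80, 90, 100], steps), "slow": ([0, 10, 40, 100], steps)}
print(select_top_quarter(["fast", "slow"], curves))   # ['fast']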
Mutation. To obtain a new population of auxiliary loss functions, we propose a novel mutation strategy. First, we represent both the source and the target of the input elements as a pair of binary masks, where each bit indicates whether the corresponding element is selected (1) or not (0). For instance, given a candidate sequence $\{s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}\}$, the binary mask of the subset $\{s_t, a_t, r_{t+1}\}$ is 110001. Afterward, we adopt four types of mutations, also shown in Figure 3: (i) replacement (50% of the population): flip the given binary mask with probability $p = \tfrac{1}{2(3k+3)}$, where $k$ is the horizon length; (ii) crossover (20%): generate a new candidate by randomly combining the mask bits of two candidates with the same horizon length in the population; (iii) horizon decrease and horizon increase (10%): delete existing binary mask bits from, or append new binary mask bits to, the tail of the masks; (iv) random generation (20%): every bit of the binary mask is drawn from a Bernoulli distribution $\mathcal{B}(0.5)$.
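For concreteness, the four mutation types could be implemented on (source, target) bit lists roughly as follows; we interpret the flip probability as a per-bit probability, and the container layout and helper names are hypothetical.

import random

def replacement(masks, k):
    # Flip each bit independently with probability 1 / (2 * (3k + 3)).
    p = 1.0 / (2 * (3 * k + 3))
    return tuple([bit ^ (random.random() < p) for bit in m] for m in masks)

def crossover(masks_a, masks_b):
    # Same horizon assumed: pick each bit from either parent uniformly at random.
    return tuple([random.choice(pair) for pair in zip(ma, mb)]
                 for ma, mb in zip(masks_a, masks_b))

def change_horizon(masks, grow):
    # Append one step (3 fresh random bits) to the tail, or drop the last step's bits.
    if grow:
        return tuple(m + [random.randint(0, 1) for _ in range(3)] for m in masks)
    return tuple(m[:-3] for m in masks)

def random_candidate(k):
    # Every bit drawn from Bernoulli(0.5); masks have 3 * (k + 1) bits each.
    return tuple([random.randint(0, 1) for _ in range(3 * (k + 1))] for _ in range(2))

# Example: forward dynamics with k = 1, i.e. source {s_t, a_t}, target {s_{t+1}}.
source, target = [1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]
print(replacement((source, target), k=1))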
Loss rejection protocol. Since the auxiliary loss needs to be differentiable with respect to the parameters of the state encoder, we perform a gradient-flow check on randomly generated loss functions during evolution and skip evaluating invalid auxiliary losses. Concretely, the following conditions must hold for a valid loss function: (i) $\text{seq}_{\text{source}}$ contains at least one state element, so that the gradient of the auxiliary loss can propagate back to the state encoder; (ii) $\text{seq}_{\text{target}}$ is not empty; (iii) the horizon is within a reasonable range ($1 \le k \le 10$ in our experiments). If a loss is rejected, we repeat the mutation to refill the population.
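Under the same bit layout as the mutation sketch above, the rejection check reduces to a few boolean conditions, for example:

def is_valid(source_mask, target_mask, k):
    # Bits are laid out as [s, a, r] per step, so state bits sit at indices 0, 3, 6, ...
    has_source_state = any(source_mask[i] for i in range(0, len(source_mask), 3))
    has_target = any(target_mask)
    return has_source_state and has_target and 1 <= k <= 10

print(is_valid([1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0], k=1))   # True
print(is_valid([0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0], k=1))   # False: no state in the source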
Bootstrapping the initial population. To improve computational efficiency so that the algorithm finds reasonable loss functions quickly, we incorporate prior knowledge into the initialization of the search. In particular, before the first stage of evolution, we bootstrap the initial population with a prior distribution that assigns high probability to auxiliary loss functions containing useful patterns such as dynamics and reward prediction. More implementation details are provided in Appendix C.3.
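A rough sketch of such bootstrapping is given below; the seed fraction and the way seed candidates are supplied are illustrative assumptions, with the actual prior described in Appendix C.3.

import random

def bootstrap_population(size, seed_candidates, random_candidate, seed_fraction=0.7):
    # seed_fraction is an arbitrary placeholder: most of the initial population is drawn
    # from hand-written seed patterns (e.g. forward dynamics, reward prediction),
    # the rest uniformly at random from the search space.
    population = []
    for _ in range(size):
        if seed_candidates and random.random() < seed_fraction:
            population.append(random.choice(seed_candidates))
        else:
            population.append(random_candidate())
    return population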