Reinforcement Learning with Automated Auxiliary Loss Search

Tairan He¹, Yuge Zhang², Kan Ren², Minghuan Liu¹,
Che Wang³, Weinan Zhang¹, Yuqing Yang², Dongsheng Li²
¹Shanghai Jiao Tong University  ²Microsoft Research Asia  ³New York University
whynot@sjtu.edu.cn  kan.ren@microsoft.com
Abstract
A good state representation is crucial to solving complicated reinforcement learning (RL) challenges. Many recent works focus on designing auxiliary losses for learning informative representations. Unfortunately, these handcrafted objectives rely heavily on expert knowledge and may be sub-optimal. In this paper, we propose a principled and universal method for learning better representations with auxiliary loss functions, named Automated Auxiliary Loss Search (A2LS), which automatically searches for top-performing auxiliary loss functions for RL. Specifically, based on the collected trajectory data, we define a general auxiliary loss space of size $7.5 \times 10^{20}$ and explore the space with an efficient evolutionary search strategy. Empirical results show that the discovered auxiliary loss (namely, A2-winner) significantly improves performance on both high-dimensional (image) and low-dimensional (vector) unseen tasks with much higher efficiency, showing promising generalization ability to different settings and even different benchmark domains. We conduct a statistical analysis to reveal the relations between patterns of auxiliary losses and RL performance. The code and supplementary materials are available at https://seqml.github.io/a2ls.
1 Introduction
Reinforcement learning (RL) has achieved remarkable progress in games [31, 47, 50], financial trading [8] and robotics [13]. However, at its core, without designs tailored to specific tasks, general RL paradigms still learn implicit representations from the critic loss (value prediction) and the actor loss (maximizing cumulative reward). In many real-world scenarios where observations are complicated (e.g., images) or incomplete (e.g., partially observable), training an agent that is able to extract informative signals from those inputs becomes incredibly sample-inefficient.
Therefore, many recent works have been devoted to obtaining a good state representation, which is believed to be one of the key solutions to improve the efficacy of RL [23, 24]. One of the main streams is adding auxiliary losses to update the state encoder. Under the hood, it resorts to informative and dense learning signals in order to encode various prior knowledge and regularization [40], and obtain better latent representations. Over the years, a series of works have attempted to figure out the form of the most helpful auxiliary loss for RL. Quite a few advances have been made, including observation reconstruction [51], reward prediction [20], environment dynamics prediction [40, 6, 35],
etc. But we note two problems in this evolving process: (i) each of the loss designs listed above is obtained through empirical trial and error based on expert design, thus heavily relying on human labor and expertise; (ii) few works have used the final performance of RL as an optimization objective to directly search the auxiliary loss, indicating that these designs could be sub-optimal.

The work was conducted during Tairan He’s internship at Microsoft Research.
The corresponding author is Kan Ren.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).

[Figure 1 diagram: a replay buffer feeds data to a state encoder shared by the policy network and Q network; gradients from both the RL loss and the auxiliary loss update the encoder. The inner loop (RL training) reports training scores to the outer loop (evolution), which selects the top 25% of auxiliary loss candidates, mutates them, and rejects invalid losses.]
Figure 1: Overview of A2LS. A2LS contains an inner loop (left) and an outer loop (right). The inner loop performs an RL training procedure with the searched auxiliary loss functions. The outer loop searches auxiliary loss functions using an evolutionary algorithm to select better auxiliary losses.
To resolve the issues of the existing handcrafted solutions mentioned above, we automate the process of designing auxiliary loss functions for RL and propose a principled solution named Automated Auxiliary Loss Search (A2LS). A2LS formulates the problem as a bi-level optimization in which we try to find the auxiliary loss that, to the greatest extent, helps train a good RL agent. The outer loop searches for auxiliary losses based on RL performance to ensure the searched losses align with the RL objective, while the inner loop performs RL training with the searched auxiliary loss function. Specifically, A2LS utilizes an evolutionary strategy to search the configuration of auxiliary losses over a novel search space of size $7.5 \times 10^{20}$ that covers many existing solutions. By searching on a small set of simulated continuous-control training environments from the DeepMind Control suite (DMC) [43], A2LS finalizes a loss, namely A2-winner.
To evaluate the generalizability of the discovered auxiliary loss A2-winner, we test A2-winner on a wide set of test environments, including both image-based and vector-based (with proprioceptive features such as positions, velocities and accelerations as inputs) tasks. Extensive experiments show that the searched loss function is highly effective and largely outperforms strong baseline methods. More importantly, the searched auxiliary loss generalizes well to unseen settings such as (i) different robots to control; (ii) different data types of observation; (iii) partially observable settings; (iv) different network architectures; and (v) even a totally different discrete-control domain (Atari 2600 games [1]). Finally, we conduct detailed statistical analyses of the relation between RL performance and patterns of auxiliary losses based on data from the whole evolutionary search process, providing useful insights for future studies of auxiliary loss design and representation learning in RL.
2 Problem Formulation and Background
We consider the standard Markov Decision Process (MDP) $E$ where the state, action and reward at time step $t$ are denoted as $(s_t, a_t, r_t)$. The sequence of rollout data sampled by the agent in the episodic environment is $(s_0, \ldots, s_t, a_t, r_t, s_{t+1}, \cdots, s_T)$, where $T$ represents the episode length. Suppose the RL agent is parameterized by $\omega$ (either the policy $\pi$ or the state-action value function $Q$), with a state encoder $g_\theta$ parameterized by $\theta \subseteq \omega$, which plays a key role for representation learning in RL. The agent is required to maximize its cumulative reward in environment $E$ by optimizing $\omega$, denoted as $R(\omega; E) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{T-1} r_t\right]$.
In this paper, we aim to find the optimal auxiliary loss function $\mathcal{L}_{\text{Aux}}$ such that the agent can reach the best performance by optimizing $\omega$ under a combination of an arbitrary RL loss function $\mathcal{L}_{\text{RL}}$ together with an auxiliary loss $\mathcal{L}_{\text{Aux}}$. Formally, our optimization goal is

$$\max_{\mathcal{L}_{\text{Aux}}} \; R\Big( \min_{\omega} \; \mathcal{L}_{\text{RL}}(\omega; E) + \lambda \mathcal{L}_{\text{Aux}}(\theta; E); \; E \Big), \qquad (1)$$

where $\lambda$ is a hyper-parameter balancing the relative weight of the auxiliary loss. The left part (inner loop) of Figure 1 illustrates how data and gradients flow in RL training when an auxiliary loss is enabled. Some instances of $\mathcal{L}_{\text{RL}}$ and $\mathcal{L}_{\text{Aux}}$ are given in Appendix B. Unfortunately, existing auxiliary losses $\mathcal{L}_{\text{Aux}}$ are handcrafted, rely heavily on expert knowledge, and may not generalize well in different scenarios, as shown in the experiment section. To find better auxiliary loss functions for representation learning in RL, we introduce our principled solution in the following section.

Table 1: Typical solutions with auxiliary losses and their common elements (the input elements consist of horizon, source and target).

Auxiliary Loss               | Operator | Horizon | Source                              | Target
Forward dynamics [35, 40, 6] | MSE      | 1       | {s_t, a_t}                          | {s_{t+1}}
Inverse dynamics             | MSE      | 1       | {a_t, s_{t+1}}                      | {s_t}
Reward prediction [20, 6]    | MSE      | 1       | {s_t, a_t}                          | {r_t}
Action inference [40, 6]     | MSE      | 1       | {s_t, s_{t+1}}                      | {a_t}
CURL [23]                    | Bilinear | 1       | {s_t}                               | {s_t}
ATC [42]                     | Bilinear | k       | {s_t}                               | {s_{t+1}, ..., s_{t+k}}
SPR [39]                     | N-MSE    | k       | {s_t, a_t, a_{t+1}, ..., a_{t+k-1}} | {s_{t+1}, ..., s_{t+k}}
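To make the inner problem of Equation (1) concrete, the following is a minimal PyTorch-style sketch of one joint gradient step, assuming a simple MLP encoder, a TD-style critic regression as $\mathcal{L}_{\text{RL}}$, and a forward-dynamics instance of $\mathcal{L}_{\text{Aux}}$. The network sizes, batch format and the value of lambda_aux are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(8, 64), nn.ReLU())             # g_theta
critic_head = nn.Linear(64 + 2, 1)                               # Q(s, a) head used by L_RL
predictor = nn.Linear(64 + 2, 64)                                # h, predicts the next latent
params = (list(encoder.parameters()) + list(critic_head.parameters())
          + list(predictor.parameters()))
optimizer = torch.optim.Adam(params, lr=3e-4)
lambda_aux = 1.0                                                 # weight of the auxiliary loss

def joint_update(s, a, q_target, s_next):
    z, z_next = encoder(s), encoder(s_next)
    # RL loss: regress Q(s, a) toward a precomputed TD target.
    rl_loss = nn.functional.mse_loss(critic_head(torch.cat([z, a], dim=-1)), q_target)
    # Auxiliary loss: forward-dynamics prediction of the next latent state.
    aux_loss = nn.functional.mse_loss(predictor(torch.cat([z, a], dim=-1)), z_next.detach())
    optimizer.zero_grad()
    (rl_loss + lambda_aux * aux_loss).backward()                 # both losses shape g_theta
    optimizer.step()
    return float(rl_loss), float(aux_loss)

# Toy batch: 32 transitions with 8-dim states and 2-dim actions.
joint_update(torch.randn(32, 8), torch.randn(32, 2), torch.randn(32, 1), torch.randn(32, 8))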
3 Automated Auxiliary Loss Search
To meet our goal of finding top-performing auxiliary loss functions without expert involvement, we turn to automated loss search, which has shown promising results in the automated machine learning (AutoML) community [27, 28, 48]. Accordingly, we propose Automated Auxiliary Loss Search (A2LS), a principled solution to the bi-level optimization problem in Equation 1. A2LS resolves the inner problem as a standard RL training procedure; for the outer problem, A2LS defines a finite and discrete search space (Section 3.1) and designs a novel evolution strategy to efficiently explore the space (Section 3.2).
3.1 Search Space Design
We have argued that almost all existing auxiliary losses require expert knowledge, and we aim to search for a better one automatically. To this end, we should design a search space that satisfies the following desiderata.
Generalization: the search space should cover most of the existing handcrafted auxiliary losses to ensure the searched results can be no worse than handcrafted losses;
Atomicity: the search space should be composed of several independent dimensions to fit into any general search algorithm [30] and support an efficient search scheme;
Sufficiency: the search space should be large enough to contain the top-performing solutions.
Given these criteria, we summarize some existing auxiliary losses in Table 1 and identify their commonalities as well as differences. We find that these losses share similar components and a similar computation flow. As shown in Figure 2, when training the RL agent, the loss first selects a sequence $\{s_t, a_t, r_t\}_{t=i}^{i+k}$ from the replay buffer, where $k$ is called the horizon. The agent then tries to predict some elements of the sequence (called the target) based on another picked subset of elements from the sequence (called the source). Finally, the loss computes and minimizes the prediction error (rigorously defined by an operator). More specifically, the encoder part $g_\theta$ of the agent first encodes the source into latent representations, which are further fed into a predictor $h$ to get a prediction $y$; the auxiliary loss is computed from the prediction $y$ and the ground truth $\hat{y}$, which is translated from the target by a target encoder $g_{\hat\theta}$, using an operator $f$. The target encoder is updated in a momentum manner as shown in Figure 2 (details are given in Appendix C.1.2). Formally,

$$\mathcal{L}_{\text{Aux}}(\theta; E) = f\big(h(g_\theta(\text{seq}_{\text{source}})),\; g_{\hat\theta}(\text{seq}_{\text{target}})\big), \qquad (2)$$
where $\text{seq}_{\text{source}}, \text{seq}_{\text{target}} \subseteq \{s_t, a_t, r_t\}_{t=i}^{i+k}$ are both subsets of the candidate sequence. For simplicity, we write $g_\theta(s_t, a_t, r_t, s_{t+1}, \cdots)$ as shorthand for $[g_\theta(s_t), a_t, r_t, g_\theta(s_{t+1}), \cdots]$ in the rest of this paper (the encoder $g$ only processes states $\{s_i\}$). We observe that these existing auxiliary losses differ in two dimensions, i.e., input elements and operator, where the input elements are further composed of horizon, source and target. These differences form the search dimensions of the whole space. We next describe the search range of each dimension in detail.

[Figure 2 diagram: the input elements (horizon, source, target) select parts of the candidate sequence; the source passes through the encoder and a predictor, the target passes through a momentum-updated target encoder with stop-gradient, and an operator computes the auxiliary loss. An example of forward dynamics prediction (horizon k = 1) is shown.]
Figure 2: Overview of the search space $\{I, f\}$ and the computation graph of auxiliary loss functions. $I$ selects a candidate sequence $\{s_t, a_t, r_t\}_{t=i}^{i+k}$ with horizon $k$, then determines a source and a target as arbitrary subsets of the sequence; an encoder $g_\theta$ first encodes the source into latent representations, which are fed into a predictor $h$ to get a prediction $y$; the auxiliary loss is computed over the prediction $y$ and the ground truth $\hat{y}$ that is translated from the target by a target encoder $g_{\hat\theta}$, using an operator $f$.

Input elements. The input elements denote all inputs to the loss functions, which can be further
disassembled into horizon, source and target. Unlike previous automated loss search works, the target here is not a "ground truth", because auxiliary losses in RL have no labels beforehand. Instead, both source and target are generated by interacting with the environment in a self-supervised manner. In particular, the input elements first determine a candidate sequence $\{s_t, a_t, r_t\}_{t=i}^{i+k}$ with horizon $k$. Two subsets of the candidate sequence are then chosen as the source and the target, respectively. For example, the chosen (source, target) pair can be $(\{s_t\}, \{s_t, s_{t+1}\})$, or $(\{s_t, r_{t+1}, a_{t+2}\}, \{s_t, s_{t+1}, a_{t+1}\})$, etc.
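As an illustration of how such input elements can be encoded, here is a small, hypothetical sketch (not taken from the paper's code) that stores a horizon together with source and target subsets as (element, offset) index sets and gathers them from a trajectory; the container and helper names are ours.

from dataclasses import dataclass

@dataclass
class InputElements:
    horizon: int          # k
    source: set           # e.g. {("s", 0), ("a", 0)} means {s_t, a_t}
    target: set           # e.g. {("s", 1)} means {s_{t+1}}

    def gather(self, traj, i):
        """Pick the chosen elements from a trajectory starting at step i.

        traj is a dict such as {"s": [...], "a": [...], "r": [...]}.
        """
        def pick(subset):
            return [traj[name][i + offset] for name, offset in sorted(subset)]
        return pick(self.source), pick(self.target)

# Forward dynamics (horizon 1): source = {s_t, a_t}, target = {s_{t+1}}.
fwd = InputElements(horizon=1, source={("s", 0), ("a", 0)}, target={("s", 1)})
traj = {"s": [0.0, 1.0, 2.0], "a": [10, 11], "r": [0.1, 0.2]}
print(fwd.gather(traj, i=0))   # ([10, 0.0], [1.0]), sorted by (name, offset)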
Operator. Given a prediction $y$ and its target $\hat{y}$, the auxiliary loss is computed by an operator $f$, which is typically a similarity measure. In our work, we cover all the different operators $f$ used by previous works, including inner product (Inner) [17, 42], bilinear inner product (Bilinear) [23], cosine similarity (Cosine) [3], mean squared error (MSE) [35, 6] and normalized mean squared error (N-MSE) [39]. Additionally, other works utilize contrastive objectives, e.g., the InfoNCE loss [33], which samples un-paired predictions and targets as negative samples and maximizes the distances between them. This technique is orthogonal to the five similarity measures mentioned above, so we make it optional and obtain $5 \times 2 = 10$ different operators in total.
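The similarity-style operators above can be written compactly; the sketch below is our own approximation in PyTorch and may differ in normalization details (and in how the optional contrastive variant is wired in) from the implementations in the cited works.

import torch
import torch.nn.functional as F

def mse(y, y_hat):
    return F.mse_loss(y, y_hat)

def n_mse(y, y_hat):
    # MSE between L2-normalized vectors, in the spirit of SPR-style objectives.
    return F.mse_loss(F.normalize(y, dim=-1), F.normalize(y_hat, dim=-1))

def cosine(y, y_hat):
    # 1 - cosine similarity, averaged over the batch.
    return (1 - F.cosine_similarity(y, y_hat, dim=-1)).mean()

def inner(y, y_hat):
    # Negative inner product (to be minimized).
    return -(y * y_hat).sum(dim=-1).mean()

def bilinear(y, y_hat, W):
    # Negative bilinear similarity y^T W y_hat with a learnable matrix W of shape (D, D).
    return -torch.einsum("bd,de,be->b", y, W, y_hat).mean()

# Toy usage on predictions and targets of shape (batch, dim).
B, D = 4, 8
y, y_hat, W = torch.randn(B, D), torch.randn(B, D), torch.randn(D, D)
for name, loss in [("MSE", mse(y, y_hat)), ("N-MSE", n_mse(y, y_hat)),
                   ("Cosine", cosine(y, y_hat)), ("Inner", inner(y, y_hat)),
                   ("Bilinear", bilinear(y, y_hat, W))]:
    print(name, float(loss))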
Final design. In light of the preceding discussion, with the definitions of input elements and operator, we complete the design of the search space, which meets the desiderata above. Specifically, the space is generalizable, covering most of the existing handcrafted auxiliary losses; its atomicity is embodied by compositionality, as all input elements work with any operator; most importantly, the search space is sufficiently large, with a total size of $7.5 \times 10^{20}$ (the detailed calculation can be found in Appendix E), to contain better solutions.
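To tie the pieces together, the following sketch assembles one point of the search space into the computation of Equation (2), assuming MLP encoders, the MSE operator, and state-only source/target inputs; the momentum coefficient and layer sizes are arbitrary placeholders rather than the paper's settings.

import copy
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 32))  # g_theta
target_encoder = copy.deepcopy(encoder)                                  # g_theta_hat
predictor = nn.Linear(32, 32)                                            # h

def auxiliary_loss(source_states, target_states):
    y = predictor(encoder(source_states))          # prediction from the source branch
    with torch.no_grad():                          # stop-gradient on the target branch
        y_hat = target_encoder(target_states)
    return nn.functional.mse_loss(y, y_hat)        # operator f = MSE

@torch.no_grad()
def momentum_update(tau=0.99):
    # theta_hat <- tau * theta_hat + (1 - tau) * theta
    for p_hat, p in zip(target_encoder.parameters(), encoder.parameters()):
        p_hat.mul_(tau).add_(p, alpha=1 - tau)

loss = auxiliary_loss(torch.randn(16, 8), torch.randn(16, 8))
loss.backward()        # gradients reach the encoder and predictor only
momentum_update()      # then slowly track the online encoder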
3.2 Search Strategy
The effectiveness of evolution strategies in exploring large, multi-dimensional search spaces has been demonstrated in many works [19, 4]. Similarly, A2LS adopts an evolutionary algorithm [37] to search for top-performing auxiliary loss functions over the designed search space. In essence, the evolutionary algorithm (i) keeps a population of loss function candidates; (ii) evaluates their performance; and (iii) eliminates the worst and evolves into a new, better population. Note that the evaluation in step (ii) is very costly because it needs to train RL agents with dozens of different auxiliary loss functions. Therefore, our key technical contributions concern how to further reduce the search cost (Section 3.2.1) and how to make the search procedure efficient (Section 3.2.2).
3.2.1 Search Space Pruning
In our preliminary experiments, we find that the operator dimension of the search space can be simplified. In particular, MSE outperforms all the other operators by significant margins in most cases, so we prune all operator choices except MSE. See Appendix D.1 for the complete comparative results and an ablation study on the effectiveness of search space pruning.
Figure 3: Four types of mutation strategies for evolution. We represent both the source and the target of the input elements as a pair of binary masks, where each bit indicates whether the corresponding element is selected (1, green block) or not (0, white block).
3.2.2 Evolution Procedure
Our evolution procedure contains four main components: (i) evaluation and selection: a population of candidate auxiliary losses is evaluated through an inner loop of RL training, and the top candidates are selected for the next evolution stage (i.e., generation); (ii) mutation: the selected candidates mutate to form a new population for the next stage; (iii) loss rejection: invalid auxiliary losses are filtered out and skipped before evaluation; and (iv) bootstrapping the initial population: initial auxiliary losses that contain patterns known to be useful are assigned higher probability, improving efficiency. The step-by-step evolution algorithm is provided in Algorithm 1 in the appendix, and an overview of the A2LS pipeline is illustrated in Figure 1. We next describe each component in detail.
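Schematically, the outer loop can be summarized as below; train_rl_and_score, mutate, and is_valid stand in for the inner RL training (scored by AULC), the mutation strategy, and the loss rejection protocol described in the following paragraphs, and their exact signatures are assumptions made for illustration.

import random

def evolve(initial_population, num_stages, train_rl_and_score, mutate, is_valid):
    population = list(initial_population)
    for stage in range(num_stages):
        # (i) evaluation and selection: score every candidate with an inner RL run.
        scored = sorted(population, key=train_rl_and_score, reverse=True)
        survivors = scored[: max(1, len(scored) // 4)]            # keep the top 25%
        # (ii) mutation and (iii) loss rejection: refill the population with valid children.
        children = []
        while len(children) < len(population) - len(survivors):
            child = mutate(random.choice(survivors), survivors)
            if is_valid(child):                                   # reject invalid losses
                children.append(child)
        population = survivors + children
    return population[0]   # survivors come first, so index 0 is the best evaluated candidate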
Evaluation and selection. At each evolution stage, we first train a population of candidates with population size $P = 100$ via the inner loop of RL training. The candidates are then ranked by the approximated area under the learning curve (AULC) [11, 41], a single metric that reflects both convergence speed and final performance [46] with low variance. After each training stage, the top 25% of candidates are selected to generate the population for the next stage. We include an ablation study on the effectiveness of AULC in Appendix D.3.
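One possible way to compute the AULC ranking is sketched below, using a simple trapezoidal approximation over periodically logged evaluation scores; the evaluation schedule and normalization are assumptions rather than the paper's exact recipe.

import numpy as np

def aulc(eval_scores, eval_steps):
    """Approximate area under the learning curve, normalized by the step range."""
    return np.trapz(eval_scores, eval_steps) / (eval_steps[-1] - eval_steps[0])

def select_top_quarter(candidates, curves):
    """curves[c] = (scores, steps) recorded while training candidate c."""
    ranked = sorted(candidates, key=lambda c: aulc(*curves[c]), reverse=True)
    return ranked[: max(1, len(ranked) // 4)]

# Toy example: a fast learner beats a slow learner with the same final score.
steps = [0, 100, 200, 300]
curves = {"fast": ([0, 80, 90, 100], steps), "slow": ([0, 10, 40, 100], steps)}
print(select_top_quarter(["fast", "slow"], curves))   # ['fast']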
Mutation. To obtain a new population of auxiliary loss functions, we propose a novel mutation strategy. First, we represent both the source and the target of the input elements as a pair of binary masks, where each bit indicates whether the corresponding element is selected (1) or not (0). For instance, given a candidate sequence $\{s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}\}$, the binary mask of the subset $\{s_t, a_t, r_{t+1}\}$ is 110001. Afterward, we adopt four types of mutations, also shown in Figure 3: (i) replacement (50% of the population): flip the given binary mask with probability $p = \tfrac{1}{2(3k+3)}$, where $k$ is the horizon length; (ii) crossover (20%): generate a new candidate by randomly combining the mask bits of two candidates with the same horizon length in the population; (iii) horizon decrease and horizon increase (10%): delete existing binary mask bits from, or append new binary mask bits to, the tail of the masks; (iv) random generation (20%): every bit of the binary mask is drawn from a Bernoulli distribution $\mathcal{B}(0.5)$.
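For concreteness, the four mutation types could be implemented on (source, target) bit lists roughly as follows; we interpret the flip probability as a per-bit probability, and the container layout and helper names are hypothetical.

import random

def replacement(masks, k):
    # Flip each bit independently with probability 1 / (2 * (3k + 3)).
    p = 1.0 / (2 * (3 * k + 3))
    return tuple([bit ^ (random.random() < p) for bit in m] for m in masks)

def crossover(masks_a, masks_b):
    # Same horizon assumed: pick each bit from either parent uniformly at random.
    return tuple([random.choice(pair) for pair in zip(ma, mb)]
                 for ma, mb in zip(masks_a, masks_b))

def change_horizon(masks, grow):
    # Append one step (3 fresh random bits) to the tail, or drop the last step's bits.
    if grow:
        return tuple(m + [random.randint(0, 1) for _ in range(3)] for m in masks)
    return tuple(m[:-3] for m in masks)

def random_candidate(k):
    # Every bit drawn from Bernoulli(0.5); masks have 3 * (k + 1) bits each.
    return tuple([random.randint(0, 1) for _ in range(3 * (k + 1))] for _ in range(2))

# Example: forward dynamics with k = 1, i.e. source {s_t, a_t}, target {s_{t+1}}.
source, target = [1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]
print(replacement((source, target), k=1))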
Loss rejection protocol. Since the auxiliary loss needs to be differentiable with respect to the parameters of the state encoder, we perform a gradient-flow check on randomly generated loss functions during evolution and skip evaluating invalid auxiliary losses. Concretely, the following conditions must hold for a valid loss function: (i) $\text{seq}_{\text{source}}$ contains at least one state element, so that the gradient of the auxiliary loss can propagate back to the state encoder; (ii) $\text{seq}_{\text{target}}$ is not empty; (iii) the horizon is within a reasonable range ($1 \le k \le 10$ in our experiments). If a loss is rejected, we repeat the mutation to refill the population.
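Under the same bit layout as the mutation sketch above, the rejection check reduces to a few boolean conditions, for example:

def is_valid(source_mask, target_mask, k):
    # Bits are laid out as [s, a, r] per step, so state bits sit at indices 0, 3, 6, ...
    has_source_state = any(source_mask[i] for i in range(0, len(source_mask), 3))
    has_target = any(target_mask)
    return has_source_state and has_target and 1 <= k <= 10

print(is_valid([1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0], k=1))   # True
print(is_valid([0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0], k=1))   # False: no state in the source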
Bootstrapping the initial population. To improve computational efficiency so that the algorithm finds reasonable loss functions quickly, we incorporate prior knowledge into the initialization of the search. In particular, before the first stage of evolution, we bootstrap the initial population with a prior distribution that assigns high probability to auxiliary loss functions containing useful patterns such as dynamics and reward prediction. More implementation details are provided in Appendix C.3.
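A rough sketch of such bootstrapping is given below; the seed fraction and the way seed candidates are supplied are illustrative assumptions, with the actual prior described in Appendix C.3.

import random

def bootstrap_population(size, seed_candidates, random_candidate, seed_fraction=0.7):
    # seed_fraction is an arbitrary placeholder: most of the initial population is drawn
    # from hand-written seed patterns (e.g. forward dynamics, reward prediction),
    # the rest uniformly at random from the search space.
    population = []
    for _ in range(size):
        if seed_candidates and random.random() < seed_fraction:
            population.append(random.choice(seed_candidates))
        else:
            population.append(random_candidate())
    return population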