
Table 1: Typical solutions with auxiliary losses and their common elements. The horizon, source, and target columns together constitute the input elements.

| Auxiliary Loss | Operator | Horizon | Source | Target |
|---|---|---|---|---|
| Forward dynamics [35, 40, 6] | MSE | 1 | $\{s_t, a_t\}$ | $\{s_{t+1}\}$ |
| Inverse dynamics | MSE | 1 | $\{a_t, s_{t+1}\}$ | $\{s_t\}$ |
| Reward prediction [20, 6] | MSE | 1 | $\{s_t, a_t\}$ | $\{r_t\}$ |
| Action inference [40, 6] | MSE | 1 | $\{s_t, s_{t+1}\}$ | $\{a_t\}$ |
| CURL [23] | Bilinear | 1 | $\{s_t\}$ | $\{s_t\}$ |
| ATC [42] | Bilinear | $k$ | $\{s_t\}$ | $\{s_{t+1}, \cdots, s_{t+k}\}$ |
| SPR [39] | N-MSE | $k$ | $\{s_t, a_t, a_{t+1}, \cdots, a_{t+k-1}\}$ | $\{s_{t+1}, \cdots, s_{t+k}\}$ |
in different scenarios, as shown in the experimental section. To find better auxiliary loss functions for
representation learning in RL, we introduce our principled solution in the following section.
3 Automated Auxiliary Loss Search
To meet our goal of finding top-performing auxiliary loss functions without expert assignment, we
turn to the help of automated loss search, which has shown promising results in the automated
machine learning (AutoML) community [27, 28, 48]. Correspondingly, we propose Automated Auxiliary Loss Search (A2LS), a principled solution to the bi-level optimization problem in Equation 1. A2LS solves the inner problem with a standard RL training procedure; for the outer problem, it defines a finite, discrete search space (Section 3.1) and designs a novel evolution strategy to explore the space efficiently (Section 3.2).
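To make the bi-level structure concrete, the following Python sketch outlines a generic version of the outer search loop: each candidate auxiliary loss is evaluated by running a full (inner) RL training, and the population is evolved based on the resulting scores. The helper functions `train_rl_agent` and `mutate` and the tournament-style selection are illustrative assumptions, not the exact A2LS procedure, which is detailed in Section 3.2.

```python
import random

def evolutionary_search(init_population, n_generations, mutate, train_rl_agent):
    """Illustrative outer loop of the bi-level search (not the exact A2LS procedure).

    Each candidate encodes one auxiliary loss (operator, horizon, source, target).
    The inner problem is a full RL training run that returns an evaluation score.
    """
    population = [(cand, train_rl_agent(cand)) for cand in init_population]
    for _ in range(n_generations):
        # Select promising parents (simple tournament selection, an assumption).
        parents = [max(random.sample(population, k=3), key=lambda p: p[1])[0]
                   for _ in range(len(population) // 2)]
        # Mutate parents along independent search dimensions to obtain children.
        children = [mutate(parent) for parent in parents]
        # Inner optimization: train an RL agent with each child auxiliary loss.
        population += [(child, train_rl_agent(child)) for child in children]
        # Keep the best candidates for the next generation.
        population = sorted(population, key=lambda p: p[1], reverse=True)[:len(init_population)]
    return max(population, key=lambda p: p[1])[0]
```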
3.1 Search Space Design
We have argued that almost all existing auxiliary losses require expert knowledge, and we expect to
search for a better one automatically. To this end, it is clear that we should design a search space that
satisfies the following desiderata.
• Generalization: the search space should cover most of the existing handcrafted auxiliary losses, so that the searched results are no worse than handcrafted losses;
• Atomicity: the search space should be composed of several independent dimensions, so that it fits into any general search algorithm [30] and supports an efficient search scheme (a minimal candidate encoding is sketched after this list);
• Sufficiency: the search space should be large enough to contain the top-performing solutions.
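As a concrete illustration of atomicity, each candidate auxiliary loss can be encoded as one independent choice per search dimension. The sketch below is a minimal encoding, assuming the operators listed in Table 1 and source/target sets drawn from the candidate sequence; the exact ranges of each dimension are specified later in this section.

```python
from dataclasses import dataclass
from typing import FrozenSet, Literal

# Elements of the candidate sequence {s_t, a_t, r_t}_{t=i}^{i+k}, indexed by
# (kind, offset): e.g. ("s", 0) is s_i and ("a", 1) is a_{i+1}.
Element = tuple[Literal["s", "a", "r"], int]

@dataclass(frozen=True)
class AuxLossCandidate:
    operator: Literal["MSE", "Bilinear", "N-MSE"]  # similarity measure f
    horizon: int                                   # sequence length k
    source: FrozenSet[Element]                     # elements fed to the online encoder
    target: FrozenSet[Element]                     # elements fed to the target encoder

# Example: forward dynamics from Table 1 (predict s_{t+1} from {s_t, a_t}).
forward_dynamics = AuxLossCandidate(
    operator="MSE",
    horizon=1,
    source=frozenset({("s", 0), ("a", 0)}),
    target=frozenset({("s", 1)}),
)
```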
Given these criteria, we summarize some existing auxiliary losses in Table 1 and examine their commonalities as well as their differences. We find that these losses share similar components and computation flow. As shown in Figure 2, when training the RL agent, each loss first samples a sequence $\{s_t, a_t, r_t\}_{t=i}^{i+k}$ from the replay buffer, where $k$ is called the horizon. The agent then tries to predict some elements of the sequence (called the target) based on another selected set of elements from the sequence (called the source). Finally, the loss computes and minimizes the prediction error (rigorously defined by the operator). More specifically, the encoder $g_\theta$ of the agent first encodes the source into latent representations, which are then fed into a predictor $h$ to obtain a prediction $y$; the auxiliary loss is computed, with an operator $f$, between the prediction $y$ and the target representation $\hat{y}$, which is produced from the target by a target encoder $g_{\hat{\theta}}$. The target encoder is updated in a momentum manner, as shown in Figure 2 (details are given in Appendix C.1.2). Formally,
$$\mathcal{L}_{\mathrm{Aux}}(\theta; \mathcal{E}) = f\Big(h\big(g_\theta(\mathrm{seq}_{\mathrm{source}})\big),\ g_{\hat{\theta}}(\mathrm{seq}_{\mathrm{target}})\Big), \tag{2}$$
where $\mathrm{seq}_{\mathrm{source}}, \mathrm{seq}_{\mathrm{target}} \subseteq \{s_t, a_t, r_t\}_{t=i}^{i+k}$ are both subsets of the candidate sequence. For simplicity, in the rest of this paper we write $g_\theta(s_t, a_t, r_t, s_{t+1}, \cdots)$ as shorthand for $[g_\theta(s_t), a_t, r_t, g_\theta(s_{t+1}), \cdots]$ (the encoder $g$ only encodes states $\{s_i\}$).
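To make this computation flow concrete, the snippet below sketches one evaluation of Equation 2 in PyTorch, with MSE as the operator $f$ and a state-only encoder as described above. The dictionary keys, tensor shapes, and the momentum coefficient `tau` are illustrative assumptions, not the exact A2LS implementation (the actual training details are in Appendix C.1.2).

```python
import torch
import torch.nn.functional as F

def auxiliary_loss(encoder, target_encoder, predictor, seq_source, seq_target):
    """One evaluation of Equation 2 with f = MSE (illustrative sketch).

    seq_source / seq_target hold tensors picked from {s_t, a_t, r_t}_{t=i}^{i+k};
    only states pass through the encoders, as in the shorthand g_theta(s, a, r, ...).
    """
    # Encode source states with the online encoder; keep actions/rewards as-is.
    z_source = torch.cat(
        [encoder(seq_source["states"]), seq_source["actions"], seq_source["rewards"]],
        dim=-1)
    y = predictor(z_source)  # prediction y = h(g_theta(seq_source))

    with torch.no_grad():  # the target branch receives no gradient
        y_hat = target_encoder(seq_target["states"])  # y_hat = g_theta_hat(seq_target)

    return F.mse_loss(y, y_hat)  # operator f applied to (y, y_hat)

@torch.no_grad()
def update_target_encoder(encoder, target_encoder, tau=0.99):
    """Momentum (EMA) update of the target encoder (tau is an assumed value)."""
    for p, p_hat in zip(encoder.parameters(), target_encoder.parameters()):
        p_hat.mul_(tau).add_((1.0 - tau) * p)
```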
Thereafter, we observe that the existing auxiliary losses differ in two dimensions, i.e., input elements and operator, where the input elements are further composed of horizon, source, and target. These differences constitute the search dimensions of the whole space. We now describe the search range of each dimension in detail.
Input elements.
The input elements denote all inputs to the loss functions, which can be further