
In supervised learning, predictors only learn from labelled data. However, labelled training examples often require human annotation effort and are thus hard to obtain, whereas unlabelled data can be comparatively easy to collect. Research on semi-supervised learning spans several decades. One of the oldest SSL techniques, self-training, was originally proposed in the 1960s (Fralick, 1967). There, the predictor is first trained on the labelled data. Then, at each training round, a portion of the unlabelled data, selected according to criteria such as model uncertainty, is annotated by the predictor and added to the training set for the next round. This process is repeated multiple times. We refer the readers to Zhu (2005); Chapelle et al. (2006); Ouali et al. (2020); Van Engelen & Hoos (2020) for comprehensive literature surveys.
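For concreteness, a minimal sketch of such a round-based self-training loop is given below. It assumes a scikit-learn-style classifier exposing fit and predict_proba, NumPy arrays for the data, and a simple confidence threshold as the selection criterion; all names are illustrative and not taken from any particular SSL implementation.

import numpy as np

def self_training(model, x_lab, y_lab, x_unlab, rounds=5, threshold=0.95):
    # Round-based self-training: train on the labelled set, pseudo-label the
    # unlabelled pool, and absorb only the most confident predictions.
    for _ in range(rounds):
        model.fit(x_lab, y_lab)               # train on the current labelled set
        if len(x_unlab) == 0:
            break
        probs = model.predict_proba(x_unlab)  # per-class confidence scores
        conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
        keep = conf >= threshold              # selection criterion (here: confidence)
        if not keep.any():
            break
        x_lab = np.concatenate([x_lab, x_unlab[keep]])
        y_lab = np.concatenate([y_lab, pseudo[keep]])
        x_unlab = x_unlab[~keep]
    return model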
Imitation Learning from Observations There have been several works in imitation learning (IL) that do not assume access to the full set of actions, such as BCO (Torabi et al., 2018a), MoBILE (Kidambi et al., 2021), GAIfO (Torabi et al., 2018b), or third-person IL approaches (Stadie et al., 2017; Sharma et al., 2019). The recent work of Baker et al. (2022) also considered a setup where a small number of labelled actions are available in addition to a large unlabelled dataset. A key difference from our work is that the IL setup typically assumes that all trajectories are generated by an expert, unlike our offline setup. Further, some of these methods even permit reward-free interactions with the environment, which is not possible in the offline setup.
Learning from Videos Several works consider training agents with human video demonstrations (Schmeckpeper et al., 2020a;b), which lack action annotations. Distinct from our setup, some of these works allow online interactions or assume expert videos; more broadly, video data typically depicts agents with different embodiments.
3 Semi-Supervised Offline RL
Preliminaries We model our environment as a Markov decision process (MDP) (Bellman, 1957) denoted by $\langle \mathcal{S}, \mathcal{A}, p, P, R, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s_1)$ is the distribution of the initial state, $P(s_{t+1} \mid s_t, a_t)$ is the transition probability distribution, $R(s_t, a_t)$ is the deterministic reward function, and $\gamma$ is the discount factor. At each timestep $t$, the agent observes a state $s_t \in \mathcal{S}$ and executes an action $a_t \in \mathcal{A}$. The environment then moves the agent to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and also returns the agent a reward $r_t = R(s_t, a_t)$.
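Under this formulation, the goal of the agent is, as usual, to find a policy $\pi(a_t \mid s_t)$ that maximizes the expected discounted return (stated here for completeness):

$$ J(\pi) \;=\; \mathbb{E}\!\left[\sum_{t \ge 1} \gamma^{t-1} R(s_t, a_t)\right], \qquad s_1 \sim p,\quad a_t \sim \pi(\cdot \mid s_t),\quad s_{t+1} \sim P(\cdot \mid s_t, a_t). $$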
3.1 Proposed Setup
We assume the agent has access to a static offline dataset $\mathcal{T}_{\text{offline}}$. The dataset consists of trajectories collected by unknown policies, which are generally suboptimal. Let $\tau$ denote a trajectory and $|\tau|$ denote its length. We assume that all the trajectories in $\mathcal{T}_{\text{offline}}$ contain complete rewards and states. However, only a small subset of them contain actions. We are interested in learning a policy by leveraging the offline dataset without interacting with the environment. This setup is analogous to semi-supervised learning, where actions serve the role of labels. Hence, we also refer to the complete trajectories as labelled data (denoted by $\mathcal{T}_{\text{labelled}}$) and the action-free trajectories as unlabelled data (denoted by $\mathcal{T}_{\text{unlabelled}}$). Further, we assume the labelled and unlabelled data are sampled from two distributions $P_{\text{labelled}}$ and $P_{\text{unlabelled}}$, respectively. In general, the two distributions can be different. One case we are particularly interested in is when $P_{\text{labelled}}$ generates low-to-moderate-quality trajectories, whereas $P_{\text{unlabelled}}$ generates trajectories of diverse quality, including ones with high returns, as shown in Fig. 1.1.
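For concreteness, one possible in-memory representation of this split is sketched below; the container and function names are hypothetical rather than part of our method. Every trajectory stores states and rewards, and only the labelled subset stores actions.

from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class Trajectory:
    states: np.ndarray                    # shape (|tau| + 1, state_dim)
    rewards: np.ndarray                   # shape (|tau|,)
    actions: Optional[np.ndarray] = None  # shape (|tau|, action_dim); None if action-free

    @property
    def is_labelled(self) -> bool:
        return self.actions is not None

def split_offline_dataset(
    T_offline: List[Trajectory],
) -> Tuple[List[Trajectory], List[Trajectory]]:
    # T_offline is the union of the labelled and unlabelled parts;
    # only a small subset of trajectories carries actions.
    T_labelled = [tau for tau in T_offline if tau.is_labelled]
    T_unlabelled = [tau for tau in T_offline if not tau.is_labelled]
    return T_labelled, T_unlabelled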
Our setup shares some similarities with state-only imitation learning (Ijspeert et al., 2002; Bentivegna et al., 2002; Torabi et al., 2019) in the use of action-unlabelled trajectories. However, there are two fundamental differences. First, in state-only IL, the unlabelled demonstrations are from the same distribution as the labelled demonstrations, and both are generated by a near-optimal expert policy. In our setting, $P_{\text{labelled}}$ and $P_{\text{unlabelled}}$ can be different and are not assumed to be optimal. Second, many state-only imitation learning algorithms (e.g., Gupta et al. (2017); Torabi et al. (2018a;b); Liu et al. (2018); Sermanet et al. (2018)) permit (reward-free) interactions with the environment, similar to their original counterparts (e.g., Ho & Ermon (2016); Kim et al. (2020)). This is not allowed in our offline setup, where the agents are only provided with $\mathcal{T}_{\text{labelled}}$ and $\mathcal{T}_{\text{unlabelled}}$.
3.2 Training Pipeline
RL policies trained on low-to-moderate-quality offline trajectories are often suboptimal, as many of the trajectories might not have high returns and only cover a limited part of the state space. Our goal is to find a way to combine the action-labelled trajectories and the action-free unlabelled trajectories, so that the offline agent can exploit structure in the unlabelled data to improve performance.
One natural strategy is to fill in proxy actions for those unlabelled trajectories, and use the proxy-labelled data together with the labelled data as a whole to train an offline RL agent. Since we assume both the labelled and unlabelled trajectories contain the states, we can train an inverse dynamics model (IDM) $\phi$ that predicts actions using the states. Once we obtain the IDM, we use it to generate the proxy actions for the unlabelled trajectories. Finally, we combine those proxy-labelled trajectories with the labelled trajectories, and train an agent using the offline RL algorithm of choice. Our meta-algorithmic pipeline is summarized in Algorithm 1.
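A minimal sketch of this pipeline is given below. It reuses the hypothetical Trajectory container from Section 3.1, assumes continuous actions (so the IDM is fit with a squared-error loss), and uses a plain single-transition IDM mapping $(s_t, s_{t+1})$ to $a_t$ for simplicity; the names train_idm, proxy_label, and offline_rl are illustrative placeholders rather than our actual implementation, which uses the multi-transition IDM introduced next.

import numpy as np
import torch
import torch.nn as nn

def train_idm(T_labelled, state_dim, action_dim, epochs=100, lr=1e-3):
    """Fit an IDM phi(s_t, s_{t+1}) -> a_t on the action-labelled trajectories."""
    # Stack all (s_t, s_{t+1}, a_t) transitions from the labelled trajectories.
    s = np.concatenate([tau.states[:-1] for tau in T_labelled])
    s_next = np.concatenate([tau.states[1:] for tau in T_labelled])
    a = np.concatenate([tau.actions for tau in T_labelled])
    x = torch.as_tensor(np.concatenate([s, s_next], axis=1), dtype=torch.float32)
    y = torch.as_tensor(a, dtype=torch.float32)

    idm = nn.Sequential(
        nn.Linear(2 * state_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, action_dim),
    )
    opt = torch.optim.Adam(idm.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(idm(x), y)  # regression loss for continuous actions
        loss.backward()
        opt.step()
    return idm

def proxy_label(idm, T_unlabelled):
    """Fill in proxy actions for the action-free trajectories."""
    with torch.no_grad():
        for tau in T_unlabelled:
            x = torch.as_tensor(
                np.concatenate([tau.states[:-1], tau.states[1:]], axis=1),
                dtype=torch.float32)
            tau.actions = idm(x).numpy()
    return T_unlabelled

# Pipeline: train the IDM on T_labelled, proxy-label T_unlabelled, then run any
# offline RL algorithm of choice on the union of the two sets.
# idm = train_idm(T_labelled, state_dim, action_dim)
# T_proxy = proxy_label(idm, T_unlabelled)
# agent = offline_rl(T_labelled + T_proxy)   # offline_rl is a placeholder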
Particularly, we propose a novel stochastic multi-transition