
To compare to full-state dynamics modeling, we can translate the sample complexity from the per-parent count $n$ to a total count $N$. Recall $m\prod_i |c_i| = |S|$, so that $|c_i| = (|S|/m)^{1/k}$, and $m\prod_i |\mathrm{Pa}_i| \geq |S||A|$. We assume a small constant overlap factor $v \geq 1$, so that $|\mathrm{Pa}_i| = v(|S||A|/m)^{1/k}$. We need the total number of component visits to be $n|\mathrm{Pa}_i|km$, for a total of $nv(|S||A|/m)^{1/k}m$ state-action visits, assuming that parent set visits are allocated evenly, and noting that each state-action visit provides $k$ parent set visits. This gives:
Corollary 1. To bound the error as above, we need to have
$$N \;\geq\; \frac{c\, m k^2 \left(|S|^2|A|/m^2\right)^{1/k} \log(|S||A|/\delta)}{\epsilon^2},$$
total train samples, where we have absorbed the overlap factor $v$ into constant $c$.
Comparing this to the analogous bound for full-state model learning (Agarwal et al. [1], Prop. 2.1):
$$N \;\geq\; \frac{c\, |S|^2|A| \log(|S||A|/\delta)}{\epsilon^2},$$
we see that we have gone from super-linear $O(|S|^2|A|\log(|S||A|))$ sample complexity in terms of $|S||A|$, to the exponentially smaller $O(mk^2(|S|^2|A|/m^2)^{1/k}\log(|S||A|))$.
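To make the comparison concrete, the following sketch plugs illustrative (hypothetical) values of $|S|$, $|A|$, $m$, and $k$ into the leading terms of the two bounds; the shared constant $c$, the logarithmic term, and $1/\epsilon^2$ appear in both expressions and cancel in the ratio.

```python
# Illustrative comparison of the two sample-complexity bounds above.
# The shared constant c, the log(|S||A|/delta) term, and 1/eps^2 cancel in
# the ratio, so only the leading terms are compared. All values below are
# hypothetical and chosen purely for illustration.

S = 100 ** 3   # |S|: e.g., three state factors with 100 values each
A = 10 ** 3    # |A|: e.g., three action factors with 10 values each
k = 3          # factors per local subspace
m = 50         # number of local subspaces (assumed modest here)

full_state_term = S ** 2 * A                                   # |S|^2 |A|
factored_term = m * k ** 2 * (S ** 2 * A / m ** 2) ** (1 / k)  # m k^2 (|S|^2|A|/m^2)^(1/k)

print(f"full-state term:       {full_state_term:.2e}")   # ~1e15
print(f"locally factored term: {factored_term:.2e}")      # ~3e6
print(f"reduction factor:      {full_state_term / factored_term:.2e}")
```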
This result implies that for large enough $|S||A|$ our model must generalize to unseen states and actions, since the number of samples needed ($N$) is exponentially smaller than the size of the state-action space ($|S||A|$). In contrast, if the model did not generalize, the sample complexity would be $\Omega(|S||A|)$.
Remark 3.1. The global factorization property of FMDPs is a strict assumption that rarely holds in reality. Although local factorization is broadly applicable and significantly more realistic than the FMDP setting, it is not without cost. In FMDPs, we have a single subspace ($m = 1$). In the locally factored case, the number of subspaces $m$ is likely to grow exponentially with the number of factors $k$, as there are exponentially many ways that $k$ factors can interact. To be more precise, there are $2^{k^2}$ possible bipartite graphs from $k$ nodes to $k$ nodes. Nevertheless, by comparing bases ($2 \ll |S||A|$), we see that we still obtain exponential gains in sample complexity from the locally factored approach.
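As a rough numerical illustration of this base comparison, the sketch below re-evaluates the leading terms of the two bounds with $m$ set to its worst-case value $2^{k^2}$; the per-factor domain sizes are hypothetical and chosen only for illustration.

```python
# Worst-case count of local subspaces: one per bipartite graph on k -> k nodes.
# Even with m at this maximum, the factored term stays far below |S|^2 |A|
# whenever each factor's domain is large (base |S||A| vs. base 2). Domain
# sizes below are hypothetical, only meant to illustrate Remark 3.1.

for k, d in [(2, 100), (3, 100), (4, 100)]:    # k factors, each with d values
    S = d ** k                                  # |S| = d^k
    A = 10 ** k                                 # |A|: k action factors of size 10
    m = 2 ** (k * k)                            # worst case: all bipartite graphs
    factored = m * k ** 2 * (S ** 2 * A / m ** 2) ** (1 / k)
    full = S ** 2 * A
    print(f"k={k}: m={m:6d}  factored={factored:.2e}  full-state={full:.2e}")
```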
3.2 Training Value Functions and Policies for Out-of-Distribution Generalization
In the previous subsection, we saw that a locally factored dynamics model provably generalizes
outside of the empirical joint distribution. A natural question is whether such local factorization can
be leveraged to obtain similar results for value functions and policies.
We will show that the answer is yes, but perhaps counter-intuitively, it is not achieved by directly
training the value function and policy on the empirical distribution, as is the case for the dynamics
model. The difference arises because learned value functions, and consequently learned policies,
involve the long horizon prediction $\mathbb{E}_{P,\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, which may not benefit from the local sparsity of $G_L$. When compounded over time, sparse local structures can quickly produce an entangled
long horizon structure (cf. the “butterfly effect”). Intuitively, even if several pool balls are far apart
and locally disentangled, future collisions are central to planning and the optimal policy depends on
the relative positions of all balls. This applies even if rewards are factored (e.g., rewards in most pool
variants) [54].
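The following minimal sketch illustrates this entanglement under an assumed toy model of random sparse interactions (not the dynamics considered in this paper): each factor depends on itself and, with small probability, on each other factor at every step, and we track how many time-0 factors can influence each factor after $t$ steps.

```python
import random

random.seed(0)
k = 10            # number of state factors
p_interact = 0.1  # chance that factor j feeds into factor i at a given step
T = 8             # horizon

# influencers[i] = set of time-0 factors that can influence factor i so far
influencers = [{i} for i in range(k)]

for t in range(1, T + 1):
    # Sample a sparse local graph for this step: each factor always depends on
    # itself, and on each other factor independently with small probability.
    new_influencers = []
    for i in range(k):
        deps = {i} | {j for j in range(k) if j != i and random.random() < p_interact}
        new_influencers.append(set().union(*(influencers[j] for j in deps)))
    influencers = new_influencers
    avg = sum(len(s) for s in influencers) / k
    print(f"t={t}: average number of time-0 ancestors per factor = {avg:.1f}")
```

Even though each per-step graph is sparse, the average ancestor set grows rapidly toward all $k$ factors, so the long horizon prediction depends on nearly the full state.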
We note that, although temporal entanglement may be exponential in the branching factor of the
unrolled causal graph, it’s possible for the long horizon structure to stay sparse (e.g.,
k
independent
factors that never interact, or long-horizon disentanglement between decision-relevant and decision-irrelevant variables [19]). It's also possible that other regularities in the data will allow for good
out-of-distribution generalization. Thus, we cannot claim that value functions and policies will
never generalize well out-of-distribution (see Veerapaneni et al.
[58]
for an example when they do).
Nevertheless, we hypothesize that exponentially fast entanglement does occur in complex natural
systems, making direct generalization of long horizon predictions difficult.
Out-of-distribution generalization of the policy and value function can be achieved, however, by
leveraging the generalization properties of a locally factored dynamics model. We propose to do this
by generating out-of-distribution states and actions (the “parent distribution”), and then applying our
learned dynamics model to generate transitions that are used to train the policy and value function.
We call this process Model-based Counterfactual Data Augmentation (MOCODA).
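A minimal sketch of this procedure is given below; the interfaces (sample_parent_distribution, dynamics_model, reward_fn, agent.update, real_buffer.sample) are hypothetical placeholders rather than the actual implementation, and a known or learned reward function is assumed in order to complete the transition tuples.

```python
# Minimal sketch of MOCODA-style data augmentation (hypothetical interfaces).
# 1) Sample out-of-distribution (state, action) pairs from the parent
#    distribution; 2) label them with next states using the locally factored
#    dynamics model; 3) train the off-policy agent on the augmented transitions.

def generate_mocoda_batch(sample_parent_distribution, dynamics_model, reward_fn, n):
    states, actions = sample_parent_distribution(n)    # OOD (s, a) pairs
    next_states = dynamics_model(states, actions)      # learned, locally factored
    rewards = reward_fn(states, actions, next_states)  # known or learned reward
    return states, actions, rewards, next_states

def train_with_mocoda(agent, real_buffer, augmented_batch_fn, num_updates):
    for _ in range(num_updates):
        real = real_buffer.sample()       # empirical transitions
        synthetic = augmented_batch_fn()  # MOCODA transitions
        agent.update(real)                # standard off-policy update
        agent.update(synthetic)           # value/policy also trained on OOD data
    return agent
```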