
A Kernel Stein Test of Goodness of Fit for Sequential Models
each other. As this example shows, kernels need not be
universal to provide powerful tests, as long as they en-
code relevant features to the setting where they are used.
The literature on sequence kernels is well-established (see,
e.g., Kir´aly & Oberhauser,2019, for a recent account), and
one could use existing kernels, such as global alignment
kernels (Cuturi et al.,2007;Cuturi,2011) and string ker-
nels (Haussler,1999;Lodhi et al.,2002;Leslie & Kuang,
2004). Alternatively, one may also define a kernel by the
inner product of explicit features (e.g., neural network em-
bedding of sequences) instead of known kernels.
3. Testing Sequential Models
We now address the design of a Stein operator for the
variable-dimension setting. We begin by formally defin-
ing the sample space X. Let Sbe a nonempty finite set
of symbols. For an integer ℓ≥1, we denote by S(ℓ)the
set of all length-ℓsequences formed by symbols in S; i.e.,
S(ℓ)=Qℓ
j=1 S. Then, the sample space Xis given by
the set of all sequences Fℓmax
ℓ=1 S(ℓ), where Fdenotes the
disjoint union, and ℓmax is ∞or a finite positive integer.
In this setting, a random sample in Xis a sequence whose
length is randomly determined.
As noted in Section 2, to obtain a KSD in this setting,
we need to design a neighborhood structure which speci-
fies our ZS operator. We first introduce a simple structure
for sequences, which edits only the end of the string, to
illustrate the required connectivity principles. We then de-
scribe a more advanced neighborhood structure, allowing
for greater modifications, which will be the foundation of
our tests.
Neighborhood Choice For the Stein operator to uniquely
characterize the model, we require the strong connectivity
of the induced graph. We achieve this by introducing inter-
and intra-length state transitions. Specifically, for a length-
ℓsequence x(1:ℓ)= (x1,...,xℓ)∈ S(l), we propose the
following neighborhood:
∂x(1:ℓ)=Ix∪ {x(1:ℓ−1)} ∪ Rx(5)
with
Ix={(x1,...,xℓ, s) : s∈ S},
Rx={(x1,...,xl−1, s) : s∈ Nrep(xℓ)},
where Nrep(s)denotes a neighborhood chosen for a sym-
bol s∈ S; this is chosen such that the induced graph is
strongly connected. The first two sets in (5) represent the
inter-length transitions: Each point of Ixcorresponds to
adding a symbol to the end of the given sequence x(1:ℓ),
whereas {x(1:l−1)}stands for the deletion of the last sym-
bol and shortening the sequence. The third set Rxrep-
resents intra-length transitions, where the last symbol of
the sequence is replaced. A trivial choice for satisfying
the strong connectivity is using the whole alphabet set S
for Nrep(s)for each s∈ S. This approach may be per-
mitted when the alphabet size is small. One might oth-
erwise want to consider using smaller sets to reduce the
computational cost. For example, we can use a smaller
Nrep(s), provided that the graph (with vertices S) induced
by the choice satisfies the three conditions required in Sec-
tion 2; e.g., as in (Yang et al.,2018), we may introduce a
cyclic order {0,...,|S| − 1}to S, and take two adjacent
points Nrep(s) = {s+ 1, s −1}, where s±1denotes
(s±1) mod |S| − 1.
The structure proposed above is a minimal neighborhood
choice that guarantees strong connectivity. Indeed, we can
edit the length of a sequence, and change the character in
each position to a desired one because of the strong con-
nectivity in the intra-length transition. As we will see in
Section 4, however, this minimal choice may not produce a
powerful test, as it only uses information of neighboring
length sequences. For example, the tail insertion opera-
tion by Ixonly encodes the structure of the target distri-
bution in terms of the end of the sequence. For sequen-
tial models such as Markov chain models, the longer a se-
quence is, the more evidence we should have in favor of, or
against, a given model. Editing only the final element of
a sequence would therefore not make use of such accumu-
lated evidence. Indeed, this approach performs relatively
weakly in our benchmark experiments (Figure 2b, case of
J= 1).
Proposed Neighborhood Based on the above observa-
tion, we generalize the above neighborhood set. Specifi-
cally, we consider expanding the neighborhood set by mod-
ifying sequence elements at different locations: we propose
the J-location modification neighborhood
∂J
x(1:ℓ)=∪J
j=1Ix(1:ℓ),j ∪ Dx(1:ℓ),j ∪ Rx(1:ℓ),j ,(6)
where
Ix(1:ℓ),j ⊂ S(ℓ+1) ={ins(x(1:ℓ), j, s) : s∈ Nins},
Dx(1:ℓ),j ⊂ S(ℓ−1) ={del(x(1:ℓ), j)if xℓ−j+1 ∈ Nins},
Rx(1:ℓ),j ⊂ S(ℓ)={rep(x(1:ℓ), j, s) : s∈ Nrep(xℓ)}.
Here, ins(x(1:ℓ), j, s)denotes the sequence extending x(1:ℓ)
by inserting sat the (j−1)-th location counting back from
the end of the string, using symbols from a fixed set Nins ⊂
S;del(x(1:ℓ), j)deletes the (j−1)-th element counting
back from the end of x(1:ℓ); and rep(x(1:ℓ), j, s)replaces
by sthe (j−1)-th element from the end of x(1:ℓ). Note
that the deletion and insertion operations must be paired to
4