A Kernel Stein Test of Goodness of Fit for Sequential Models

arXiv:2210.10741v3 [stat.ML] 13 Jul 2023
Jerome Baum*1, Heishiro Kanagawa*23, Arthur Gretton2
Abstract
We propose a goodness-of-fit measure for probability densities modeling observations with varying dimensionality, such as text documents of differing lengths or variable-length sequences. The proposed measure is an instance of the kernel Stein discrepancy (KSD), which has been used to construct goodness-of-fit tests for unnormalized densities. The KSD is defined by its Stein operator: current operators used in testing apply to fixed-dimensional spaces. As our main contribution, we extend the KSD to the variable-dimension setting by identifying appropriate Stein operators, and propose a novel KSD goodness-of-fit test. As with the previous variants, the proposed KSD does not require the density to be normalized, allowing the evaluation of a large class of models. Our test is shown to perform well in practice on discrete sequential data benchmarks.
1. Introduction
This paper addresses the problem of evaluating a probabilistic model for observations with variable dimensionality. This problem commonly arises in sequential data analysis, where sequences of different lengths are observed, and generative models (e.g., Markov models) are used to draw inferences. Our problem setting also concerns a scenario where a data point is a collection of observations that may not be sequentially ordered, as in the topic modeling of text documents. Our task is formalized as follows: given a sample $\{X_i\}_{i=1}^n$ from a distribution $Q$ on a sample space $\mathcal{X}$ (e.g., the set of all possible sequences), we aim to quantify the discrepancy of a distribution $P$ modeling $Q$.
*Equal contribution. 1Department of Computer Science, University College London, UK. 2Gatsby Computational Neuroscience Unit, UCL. 3School of Mathematics, Statistics and Physics, Newcastle University, UK. Correspondence to: Jerome Baum <jerome@jeromebaum.com>, Heishiro Kanagawa <heishiro.kanagawa@gmail.com>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
One approach to this problem is generating samples from the model and computing a sample-based discrepancy measure, such as the maximum mean discrepancy (MMD) (Gretton et al., 2012; Lloyd & Ghahramani, 2015). A disadvantage of this approach is that it potentially requires many samples to see the departure of the model from the data. For example, if a sequence model misspecifies a particular transition from a given history, we would need to generate sequences sharing this history to observe the disparity. Unfortunately, the MMD approach becomes less efficient as the state space enlarges or the length of the history grows, necessitating repeated sampling. This observation motivates using a measure that exploits information provided by the model, such as dependence relations between different time points.
A model-dependent measure may be derived based on Stein's method (Stein, 1972), a technique from probability theory developed originally to obtain explicit rates of convergence to normality. The key construct from Stein's method is a distribution-specific operator called a Stein operator that modifies a function to have zero expectation under the distribution. A model-specific Stein operator $\mathcal{A}_P$ may be defined to construct a zero-expectation function $\mathcal{A}_P f$; its expectation $\mathbb{E}_{X \sim Q}[\mathcal{A}_P f(X)]$ under the sample distribution serves as a discrepancy measure, since a non-zero expectation indicates deviation from the model. One may generalize this idea to a family of functions $\mathcal{F}$, and the resulting worst-case summary $\sup_{f \in \mathcal{F}} |\mathbb{E}_{X \sim Q}[\mathcal{A}_P f(X)]|$ is called a Stein discrepancy, proposed by Gorham & Mackey (2015). An appropriate choice of the operator and the function class yields a computable Stein discrepancy. This paper focuses on an instance called the kernel Stein discrepancy (KSD) (Oates et al., 2016; Chwialkowski et al., 2016; Liu et al., 2016; Gorham & Mackey, 2017), where functions from a reproducing kernel Hilbert space (RKHS) are used.
The KSD is a versatile framework for designing goodness-of-fit measures. Given a Stein operator and a reproducing kernel, the KSD is expressed by an expectation of a Stein-modified kernel, leading to tractable estimators only involving sample evaluations of the modified kernel. This feature allows us to use a wealth of kernels from the literature: by designing a Stein operator, one can adapt the kernel to the model and define a bespoke discrepancy measure. Based on the zero-mean kernel theory of Oates et al. (2016), Chwialkowski et al. (2016) and Liu et al. (2016) originated this line of work, where they combined the Langevin Stein operator proposed by Gorham & Mackey (2015) with an RKHS; remarkably, the resulting (Langevin) KSD is computable for densities with unknown normalizing constants. There have been numerous extensions to models in other data domains: categorical data (Yang et al., 2018), point-pattern data (Yang et al., 2019), censored data (Fernandez et al., 2020), directional data (Xu & Matsuda, 2020), and functional data (Wynne et al., 2022). We refer the reader to Anastasiou et al. (2021) for a review of the KSD's applications outside goodness-of-fit testing.
A limitation of the preceding Stein works is that they require the model distribution to be defined on a fixed-dimension space. For example, Kanagawa et al. (2023) evaluate latent Dirichlet allocation models (Blei et al., 2003) for text documents assuming a fixed document length, an assumption unlikely to hold in practice. Relevant work has been accomplished by Wynne et al. (2022), who propose a KSD for distributions on an (infinite-dimensional) Hilbert space and treat continuous-time models, such as Brownian motions and stochastic differential equation models. While their test applies to sequential models, our target setting is different, as we focus on discrete-time models such as Markov chains, and do not require a density with respect to a Gaussian measure.
In the present work, we extend the KSD framework to the variable-dimension setting. Specifically, we identify a Stein operator for distributions on the set of all univariate sequences over finite alphabets. This is, to our knowledge, the first such operator, and it may be of independent interest. Our Stein operator builds on the class of Zanella-Stein operators recently proposed by Hodgkinson et al. (2020), which are derived from a Markov jump process having the target distribution as its invariant measure; we review the Zanella-Stein operator in Section 2. In Section 3, we derive a new Stein operator using a Markov process admitting transitions between sets of different sizes. Based on our Stein operator and the associated KSD, we propose a novel goodness-of-fit test for sequential models. As in previous variants, the proposed KSD does not require the model to be normalized, allowing the evaluation of intractable models, including Markov random fields (Section 4.5) and conditional generative models (Section 4.6). As the proposed operator involves a number of tunable parameters, we offer, as our second contribution, guidance on how to select them based on empirical power analysis.
2. Background

All Stein discrepancies are built around a particular Stein operator. In this section, we recall a class of operators, the Zanella-Stein operators, and describe the challenges in constructing operators of this class in the sequential setting. We also review the associated kernel Stein discrepancy (KSD), a goodness-of-fit measure that leverages a given Stein operator to construct a test statistic.
Zanella-Stein Operator  The Zanella-Stein (ZS) operators are a class of Stein operators for distributions on a countable set $\mathcal{X}$, proposed by Hodgkinson et al. (2020) following the generator method of Barbour (1988). In the following, we assume that the model $P$ has a probability mass function $p$ that is positive everywhere on $\mathcal{X}$. For a real-valued function $f$ on $\mathcal{X}$, a ZS operator $\mathcal{A}_P$ is defined by

$$(\mathcal{A}_P f)(x) := \sum_{y \in \partial x} g\!\left(\frac{p(y)}{p(x)}\right) \left( f(y) - f(x) \right), \qquad (1)$$

where $\partial x \subset \mathcal{X}$ is a set of neighborhood points, and $g : (0,\infty) \to (0,\infty)$ is a balancing function, which must satisfy $g(t) = t\,g(1/t)$ for $t > 0$. Several choices of $g$ are known in the literature: the minimum probability flow operator discussed in (Barp et al., 2019) uses $g(t) = \sqrt{t}$; the choice $g(t) = t/(1+t)$ is known as the Barker Stein operator (Barker, 1965; Shi et al., 2022).
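To make the definition concrete, the action of a ZS operator as in (1) can be sketched in a few lines of Python. The toy model, neighborhood, and function names below are our own illustrations (not from the paper); we use the Barker balancing function and check the zero-expectation property on a small cyclic state space.

```python
def barker(t):
    # Barker balancing function g(t) = t / (1 + t); satisfies g(t) = t * g(1/t).
    return t / (1.0 + t)

def zs_operator(f, p, neighbors, g=barker):
    # (A_P f)(x) = sum over y in the neighborhood of x of g(p(y)/p(x)) * (f(y) - f(x)).
    def Af(x):
        return sum(g(p(y) / p(x)) * (f(y) - f(x)) for y in neighbors(x))
    return Af

# Toy model: unnormalized weights on {0, ..., 9} (the normalizer is never needed
# to apply the operator) with a symmetric, strongly connected cyclic neighborhood.
states = range(10)
p = lambda x: 0.5 ** x
neighbors = lambda x: ((x - 1) % 10, (x + 1) % 10)

Af = zs_operator(lambda x: float(x), p, neighbors)

# Zero-expectation property: E_{X~P}[(A_P f)(X)] = 0, computed with the normalized pmf.
Z = sum(p(x) for x in states)
mean_under_p = sum(Af(x) * p(x) / Z for x in states)
```

Because the jump rates satisfy detailed balance for $P$ (a consequence of $g(t) = t\,g(1/t)$), the expectation above vanishes up to floating-point error, even though only the unnormalized $p$ is used inside the operator.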
A Zanella-Stein operator is the infinitesimal generator of a Markov jump process called a Zanella process (Zanella, 2019; Power & Goldman, 2019), designed to have the target distribution $P$ as an invariant distribution. The process is based on the idea of locally informed proposals: these jump from a given point $x$ to any of its neighbors $y$ at a rate $g(p(y)/p(x))$, so that detailed balance is satisfied for $P$, ensuring that $P$ is an invariant distribution. For the operator to characterize $P$, we must specify our notion of neighborhood. Let $(\mathcal{V}, \mathcal{E})$ be the directed graph with vertices $\mathcal{V} = \mathcal{X}$ and edges $\mathcal{E} = \{(x, y) \mid y \in \partial x\}$ induced by the chosen neighborhood structure. The graph $(\mathcal{V}, \mathcal{E})$ must have the following properties:

1. Symmetry: $(x, y) \in \mathcal{E} \iff (y, x) \in \mathcal{E}$.

2. Strong connectivity: the transitive closure of $\mathcal{E}$ is $\mathcal{X} \times \mathcal{X}$, so that there is a path from every point to every other point.

In fact, symmetry alone implies that $P$ is in the set of invariant distributions; strong connectivity is required to ensure that $P$ is the unique invariant distribution, as otherwise the corresponding Stein discrepancy cannot distinguish $P$ from other distributions, even for a rich test function class. In Hodgkinson et al. (2020), an aperiodicity condition is additionally required. For a pure jump-type process, we only need the irreducibility of the process (Kallenberg, 2021, Theorem 13.12), and hence we may dispense with this condition. Note that we can conveniently choose a sparse neighborhood to accelerate the computation, although this may result in reduced sensitivity of the discrepancy. In Section 4, we discuss in more detail the tradeoff between computational cost and test power.
Challenges in the Variable-Dimension Setting  The generator method (as in the ZS operator above) allows us to obtain an operator for any state space with an appropriate Markov process. Indeed, the Zanella process is not the only allowable choice: other Markov chains or processes have also been considered for discrete spaces, including birth-death processes (Shi et al., 2022) and the Glauber dynamics Markov chain (Reinert & Ross, 2019; Bresler & Nagaraj, 2019) (we refer the reader to Shi et al., 2022, for a review). Defining a Markov process is often a challenge, however; prior work has thus only considered processes in fixed-dimensional spaces, and has not dealt with sequential models such as those considered here. The ZS operators are particularly attractive in our setting, since they may be defined on any discrete set. The main challenge in deriving an operator in this class is the construction of an appropriate neighborhood for models of variable-length sequences, which is highly nontrivial. Our contribution is to develop an effective neighborhood definition for the sequential setting, which satisfies the requirements outlined above and yields an associated test with good statistical power against alternatives.
Kernel Stein Discrepancy  We next review the construction of a test statistic from the Stein operator. A computable discrepancy measure may be defined using a reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950; Steinwart & Christmann, 2008). Following (Oates et al., 2016; Chwialkowski et al., 2016; Liu et al., 2016; Gorham & Mackey, 2017), we define the kernel Stein discrepancy (KSD) associated with our (yet to be specified) Zanella-Stein operator as follows:

$$\mathrm{KSD}(Q \,\|\, P) := \sup_{\|f\|_{\mathcal{H}} \le 1} \mathbb{E}_{X \sim Q}[\mathcal{A}_P f(X)], \qquad (2)$$

where $\mathcal{H}$ is the RKHS of real-valued functions on $\mathcal{X}$ corresponding to a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, with the natural norm $\|f\|_{\mathcal{H}} = \sqrt{\langle f, f \rangle_{\mathcal{H}}}$ defined by the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. By the vanishing property $\mathbb{E}_{Y \sim P}[\mathcal{A}_P f(Y)] = 0$, the KSD is zero if $Q = P$. The other direction requires further assumptions on the operator and the RKHS. According to Hodgkinson et al. (2020, Proposition 4), when the corresponding Zanella process is exponentially ergodic and the RKHS is $C_0$-universal, we have $\mathrm{KSD}(Q \,\|\, P) = 0$ only if $P = Q$. For a finite state space $\mathcal{X}$, exponential ergodicity is satisfied by any irreducible Markov jump process. $C_0$-universality is equivalent to integrally strict positive definiteness (Sriperumbudur et al., 2011); in the particular case $|\mathcal{X}| < \infty$, this condition states that the Gram matrix defined over all the configurations in $\mathcal{X}$ is strictly positive definite.
The use of an RKHS yields a closed-form expression for the KSD (Hodgkinson et al., 2020, see Proposition 1 and the paragraph following the proof):

$$\mathrm{KSD}^2(Q \,\|\, P) = \mathbb{E}_{X, X' \sim Q \otimes Q}[h_p(X, X')], \qquad (3)$$

where $X, X'$ are i.i.d. random variables with law $Q$, provided $\mathbb{E}_{X \sim Q}[h_p(X, X)^{1/2}] < \infty$. The proof of this result follows as in (Gorham & Mackey, 2017, Proposition 2), noting that the inner product between $f$ and $\mathcal{A}_P k(x, \cdot)$ reproduces $\mathcal{A}_P f(x)$ for any $f \in \mathcal{H}$. The function $h_p$ is called a Stein kernel, defined by

$$h_p(x, y) = \sum_{\nu \in \mathcal{N}_x} \sum_{\tilde\nu \in \mathcal{N}_y} g_\nu(x)\, g_{\tilde\nu}(y) \left\{ k(\nu(x), \tilde\nu(y)) + k(x, y) - k(x, \tilde\nu(y)) - k(\nu(x), y) \right\},$$

where we identify each point in the neighborhood $\partial x$ with a mapping $\nu$ from $\mathcal{X}$ to itself (with $\mathcal{N}_x$ denoting the set of such mappings), and $g_\nu(x) = g\{p(\nu(x))/p(x)\}$.
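As an illustration, the Stein kernel above can be computed directly from the neighborhood structure when each neighbor $\nu(x)$ is enumerated explicitly. The following is our own sketch with a toy state space and a Laplacian kernel, not code from the paper; the zero-mean property $\mathbb{E}_{X \sim P}[h_p(X, y)] = 0$ for each fixed $y$ gives a useful correctness check.

```python
import math

def stein_kernel(x, y, p, neighbors, k, g=lambda t: t / (1.0 + t)):
    # h_p(x, y): double sum over the neighbors nu(x) of x and nu~(y) of y,
    # with weights g_nu(x) = g(p(nu(x)) / p(x)); the Barker g is the default.
    total = 0.0
    for nx in neighbors(x):
        gx = g(p(nx) / p(x))
        for ny in neighbors(y):
            gy = g(p(ny) / p(y))
            total += gx * gy * (k(nx, ny) + k(x, y) - k(x, ny) - k(nx, y))
    return total

# Toy setup: unnormalized pmf on {0, ..., 9}, cyclic neighborhood, Laplacian kernel.
p = lambda x: 0.5 ** x
neighbors = lambda x: ((x - 1) % 10, (x + 1) % 10)
k = lambda a, b: math.exp(-abs(a - b))

# Zero-mean check at a fixed y = 3: the expectation of h_p(X, 3) under P vanishes.
Z = sum(p(x) for x in range(10))
mean_at_3 = sum(stein_kernel(x, 3, p, neighbors, k) * p(x) / Z for x in range(10))
```

For fixed $y$, the map $x \mapsto h_p(x, y)$ is itself a ZS-operator image, so the check holds exactly (up to floating point) whenever the neighborhood graph satisfies the conditions of Section 2.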
Given a sample $\{X_i\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} Q$, the squared KSD expression (3) admits a simple unbiased estimator

$$U_{n,P}(\{X_i\}_{i=1}^n) := \frac{1}{n(n-1)} \sum_{i \ne j}^{n} h_p(X_i, X_j), \qquad (4)$$

which is a U-statistic (Hoeffding, 1948). By the zero-mean property of the Stein kernel, the U-statistic is degenerate if $Q = P$ (Chwialkowski et al., 2016; Liu et al., 2016). In this case, the scaled statistic $n\,U_{n,P}(\{X_i\}_{i=1}^n)$ asymptotically follows the law of $\sum_{j=1}^{\infty} \lambda_j (Z_j^2 - 1)$, where the $Z_j$ are i.i.d. standard normal variables, and the $\lambda_j$ are eigenvalues given by the eigenvalue problem $\sum_{x \in \mathcal{X}} h_p(x, x')\, a(x)\, p(x) = \lambda\, a(x')$ with $\sum_{x \in \mathcal{X}} a(x)^2 p(x) < \infty$ (see, e.g., Serfling, 2009, Section 5.5). One of our proposed tests in Section 3 simulates this distribution using a wild bootstrap procedure.
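The U-statistic (4) and the wild bootstrap mentioned above can be sketched as follows. The Rademacher-multiplier bootstrap shown here is a generic recipe for degenerate U-statistics; the paper's exact bootstrap procedure may differ, and all names are our own.

```python
import random

def ksd_u_statistic(sample, h):
    # Unbiased estimator (4): average of h(X_i, X_j) over all ordered pairs i != j.
    n = len(sample)
    return sum(h(xi, xj)
               for i, xi in enumerate(sample)
               for j, xj in enumerate(sample) if i != j) / (n * (n - 1))

def wild_bootstrap_pvalue(sample, h, n_boot=200, seed=0):
    # Approximate the degenerate null law of the statistic by re-weighting each
    # pair h(X_i, X_j) with Rademacher multipliers e_i * e_j, e_i in {-1, +1}.
    rng = random.Random(seed)
    n = len(sample)
    H = [[h(xi, xj) for xj in sample] for xi in sample]
    off_diag = [(i, j) for i in range(n) for j in range(n) if i != j]
    stat = sum(H[i][j] for i, j in off_diag) / (n * (n - 1))
    exceed = 0
    for _ in range(n_boot):
        e = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        boot = sum(e[i] * e[j] * H[i][j] for i, j in off_diag) / (n * (n - 1))
        exceed += boot >= stat
    return (exceed + 1) / (n_boot + 1)  # reject at level alpha if p-value <= alpha
```

The `+1` correction in the returned p-value keeps the test conservative at finite bootstrap sample sizes.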
Kernels for Sequences  The performance of the test developed in the next section depends on the choice of the reproducing kernel. As mentioned above, $C_0$-universality is a condition that guides kernel selection, since it renders the test consistent (along with the exponential ergodicity). A trivial example of a $C_0$-universal kernel is the Dirac kernel, which outputs 1 if its two inputs are identical and 0 otherwise; but this kernel might not be useful in practice, since all data points in the corresponding feature space are orthogonal and provide no information about each other. As this example shows, kernels need not be universal to provide powerful tests, as long as they encode features relevant to the setting where they are used. The literature on sequence kernels is well-established (see, e.g., Király & Oberhauser, 2019, for a recent account), and one could use existing kernels, such as global alignment kernels (Cuturi et al., 2007; Cuturi, 2011) and string kernels (Haussler, 1999; Lodhi et al., 2002; Leslie & Kuang, 2004). Alternatively, one may define a kernel by the inner product of explicit features (e.g., a neural network embedding of sequences) instead of using known kernels.
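As one concrete option in the explicit-feature family, the following sketch implements a simple k-mer spectrum kernel, in the spirit of the string kernels of Leslie & Kuang (2004); it is defined for strings of any length, matching our variable-dimension setting. The details here are our own simplification, not the paper's choice of kernel.

```python
from collections import Counter

def spectrum_kernel(s, t, k=2):
    # Inner product of explicit k-mer count features: the kernel value counts
    # matching length-k substrings (with multiplicity) between the two strings.
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(count * ct[mer] for mer, count in cs.items())
```

Being an inner product of explicit count features, this kernel is positive definite by construction, and sequences of different lengths are compared through their shared k-mers.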
3. Testing Sequential Models

We now address the design of a Stein operator for the variable-dimension setting. We begin by formally defining the sample space $\mathcal{X}$. Let $\mathcal{S}$ be a nonempty finite set of symbols. For an integer $\ell \ge 1$, we denote by $\mathcal{S}^{(\ell)}$ the set of all length-$\ell$ sequences formed by symbols in $\mathcal{S}$; i.e., $\mathcal{S}^{(\ell)} = \prod_{j=1}^{\ell} \mathcal{S}$. Then, the sample space $\mathcal{X}$ is given by the set of all sequences $\bigsqcup_{\ell=1}^{\ell_{\max}} \mathcal{S}^{(\ell)}$, where $\bigsqcup$ denotes the disjoint union, and $\ell_{\max}$ is $\infty$ or a finite positive integer. In this setting, a random sample in $\mathcal{X}$ is a sequence whose length is randomly determined.

As noted in Section 2, to obtain a KSD in this setting, we need to design a neighborhood structure which specifies our ZS operator. We first introduce a simple structure for sequences, which edits only the end of the string, to illustrate the required connectivity principles. We then describe a more advanced neighborhood structure, allowing for greater modifications, which will be the foundation of our tests.
Neighborhood Choice  For the Stein operator to uniquely characterize the model, we require strong connectivity of the induced graph. We achieve this by introducing inter- and intra-length state transitions. Specifically, for a length-$\ell$ sequence $x^{(1:\ell)} = (x_1, \ldots, x_\ell) \in \mathcal{S}^{(\ell)}$, we propose the following neighborhood:

$$\partial x^{(1:\ell)} = I_x \cup \{x^{(1:\ell-1)}\} \cup R_x \qquad (5)$$

with

$$I_x = \{(x_1, \ldots, x_\ell, s) : s \in \mathcal{S}\}, \qquad R_x = \{(x_1, \ldots, x_{\ell-1}, s) : s \in \mathcal{N}_{\mathrm{rep}}(x_\ell)\},$$

where $\mathcal{N}_{\mathrm{rep}}(s)$ denotes a neighborhood chosen for a symbol $s \in \mathcal{S}$; this is chosen such that the induced graph is strongly connected. The first two sets in (5) represent the inter-length transitions: each point of $I_x$ corresponds to adding a symbol to the end of the given sequence $x^{(1:\ell)}$, whereas $\{x^{(1:\ell-1)}\}$ stands for the deletion of the last symbol, shortening the sequence. The third set $R_x$ represents intra-length transitions, where the last symbol of the sequence is replaced. A trivial choice satisfying strong connectivity is to use the whole alphabet $\mathcal{S}$ as $\mathcal{N}_{\mathrm{rep}}(s)$ for each $s \in \mathcal{S}$. This approach may be permitted when the alphabet size is small. One might otherwise want to consider using smaller sets to reduce the computational cost. For example, we can use a smaller $\mathcal{N}_{\mathrm{rep}}(s)$, provided that the graph (with vertices $\mathcal{S}$) induced by the choice satisfies the conditions required in Section 2; e.g., as in (Yang et al., 2018), we may introduce a cyclic order $\{0, \ldots, |\mathcal{S}| - 1\}$ on $\mathcal{S}$, and take the two adjacent points $\mathcal{N}_{\mathrm{rep}}(s) = \{s + 1, s - 1\}$, where $s \pm 1$ denotes $(s \pm 1) \bmod |\mathcal{S}|$.
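For concreteness, the minimal neighborhood (5), with the cyclic two-neighbor choice of $\mathcal{N}_{\mathrm{rep}}$ just described, can be enumerated as follows. Representing sequences as tuples and guarding the length-1 boundary case (so the space stays the set of nonempty sequences) are our own implementation choices.

```python
def minimal_neighborhood(x, alphabet):
    # Neighbors of the sequence x (a tuple of symbols), per equation (5):
    # tail insertions I_x, deletion of the last symbol, and replacements R_x.
    size = len(alphabet)
    index = {s: i for i, s in enumerate(alphabet)}
    nbrs = [x + (s,) for s in alphabet]      # I_x: append any symbol
    if len(x) > 1:                           # deletion, keeping the sequence nonempty
        nbrs.append(x[:-1])
    i = index[x[-1]]                         # R_x: cyclically adjacent last symbols
    for j in ((i - 1) % size, (i + 1) % size):
        nbrs.append(x[:-1] + (alphabet[j],))
    return nbrs
```

For example, with alphabet `('a', 'b', 'c')`, the sequence `('a', 'b')` has three tail insertions, one deletion, and two replacements, for six neighbors in total; symmetry holds because every deletion is matched by a tail insertion in the reverse direction.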
The structure proposed above is a minimal neighborhood choice that guarantees strong connectivity. Indeed, we can edit the length of a sequence, and change the character in each position to a desired one because of the strong connectivity in the intra-length transitions. As we will see in Section 4, however, this minimal choice may not produce a powerful test, as it only uses information from sequences of neighboring lengths. For example, the tail insertion operation in $I_x$ only encodes the structure of the target distribution in terms of the end of the sequence. For sequential models such as Markov chain models, the longer a sequence is, the more evidence we should have in favor of, or against, a given model. Editing only the final element of a sequence would therefore not make use of such accumulated evidence. Indeed, this approach performs relatively weakly in our benchmark experiments (Figure 2b, case of $J = 1$).
Proposed Neighborhood  Based on the above observation, we generalize the above neighborhood set. Specifically, we consider expanding the neighborhood set by modifying sequence elements at different locations: we propose the $J$-location modification neighborhood

$$\partial^J x^{(1:\ell)} = \bigcup_{j=1}^{J} \left( I_{x^{(1:\ell)},j} \cup D_{x^{(1:\ell)},j} \cup R_{x^{(1:\ell)},j} \right), \qquad (6)$$

where

$$I_{x^{(1:\ell)},j} = \{\mathrm{ins}(x^{(1:\ell)}, j, s) : s \in \mathcal{N}_{\mathrm{ins}}\} \subset \mathcal{S}^{(\ell+1)},$$
$$D_{x^{(1:\ell)},j} = \{\mathrm{del}(x^{(1:\ell)}, j) \text{ if } x_{\ell-j+1} \in \mathcal{N}_{\mathrm{ins}}\} \subset \mathcal{S}^{(\ell-1)},$$
$$R_{x^{(1:\ell)},j} = \{\mathrm{rep}(x^{(1:\ell)}, j, s) : s \in \mathcal{N}_{\mathrm{rep}}(x_{\ell-j+1})\} \subset \mathcal{S}^{(\ell)}.$$

Here, $\mathrm{ins}(x^{(1:\ell)}, j, s)$ denotes the sequence extending $x^{(1:\ell)}$ by inserting $s$ at the $(j-1)$-th location counting back from the end of the string, using symbols from a fixed set $\mathcal{N}_{\mathrm{ins}} \subset \mathcal{S}$; $\mathrm{del}(x^{(1:\ell)}, j)$ deletes the $(j-1)$-th element counting back from the end of $x^{(1:\ell)}$; and $\mathrm{rep}(x^{(1:\ell)}, j, s)$ replaces by $s$ the $(j-1)$-th element from the end of $x^{(1:\ell)}$. Note that the deletion and insertion operations must be paired to