A Kernel Stein Test of Goodness of Fit for Sequential Models

arXiv:2210.10741v3 [stat.ML] 13 Jul 2023
Jerome Baum*1, Heishiro Kanagawa*23, Arthur Gretton2
Abstract
We propose a goodness-of-fit measure for probability densities modeling observations with varying dimensionality, such as text documents of differing lengths or variable-length sequences. The proposed measure is an instance of the kernel Stein discrepancy (KSD), which has been used to construct goodness-of-fit tests for unnormalized densities. The KSD is defined by its Stein operator: current operators used in testing apply to fixed-dimensional spaces. As our main contribution, we extend the KSD to the variable-dimension setting by identifying appropriate Stein operators, and propose a novel KSD goodness-of-fit test. As with the previous variants, the proposed KSD does not require the density to be normalized, allowing the evaluation of a large class of models. Our test is shown to perform well in practice on discrete sequential data benchmarks.
1. Introduction
This paper addresses the problem of evaluating a probabilistic model for observations with variable dimensionality. This problem commonly arises in sequential data analysis, where sequences of different lengths are observed, and generative models (e.g., Markov models) are used to draw inferences. Our problem setting also concerns a scenario where a data point is a collection of observations that may not be sequentially ordered, as in the topic modeling of text documents. Our task is formalized as follows: given a sample $\{X_i\}_{i=1}^n$ from a distribution $Q$ on a sample space $\mathcal{X}$ (e.g., the set of all possible sequences), we aim to quantify the discrepancy of a distribution $P$ modeling $Q$.
*Equal contribution. 1Department of Computer Science, University College London, UK. 2Gatsby Computational Neuroscience Unit, UCL. 3School of Mathematics, Statistics and Physics, Newcastle University, UK. Correspondence to: Jerome Baum <jerome@jeromebaum.com>, Heishiro Kanagawa <heishiro.kanagawa@gmail.com>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
One approach to this problem is generating samples from the model and computing a sample-based discrepancy measure, such as the maximum mean discrepancy (MMD) (Gretton et al., 2012; Lloyd & Ghahramani, 2015). A disadvantage of this approach is that it potentially requires many samples to see the departure of the model from the data. For example, if a sequence model misspecifies a particular transition from a given history, we would need to generate sequences sharing this history to observe the disparity. Unfortunately, the MMD approach becomes less efficient as the state space enlarges or the length of the history grows, necessitating repeated sampling. This observation motivates using a measure that exploits information provided by the model, such as dependence relations between different time points.
A model-dependent measure may be derived based on Stein's method (Stein, 1972), a technique from probability theory developed originally to obtain explicit rates of convergence to normality. The key construct from Stein's method is a distribution-specific operator called a Stein operator that modifies a function to have zero expectation under the distribution. A model-specific Stein operator $\mathcal{A}_P$ may be defined to construct a zero-expectation function $\mathcal{A}_P f$; its expectation $\mathbb{E}_{X \sim Q}[\mathcal{A}_P f(X)]$ under the sample distribution serves as a discrepancy measure, since a non-zero expectation indicates deviation from the model. One may generalize this idea to a family of functions $\mathcal{F}$, and the resulting worst-case summary $\sup_{f \in \mathcal{F}} |\mathbb{E}_{X \sim Q}[\mathcal{A}_P f(X)]|$ is called a Stein discrepancy, proposed by Gorham & Mackey (2015). An appropriate choice of the operator and the function class yields a computable Stein discrepancy. This paper focuses on an instance called the kernel Stein discrepancy (KSD) (Oates et al., 2016; Chwialkowski et al., 2016; Liu et al., 2016; Gorham & Mackey, 2017), where functions from a reproducing kernel Hilbert space (RKHS) are used.
The KSD is a versatile framework for designing goodness-of-fit measures. Given a Stein operator and a reproducing kernel, the KSD is expressed by an expectation of a Stein-modified kernel, leading to tractable estimators only involving sample evaluations of the modified kernel. This feature allows us to use a wealth of kernels from the literature: by designing a Stein operator, one can adapt the kernel to the model and define a bespoke discrepancy measure. Based on the zero-mean kernel theory of Oates et al. (2016), Chwialkowski et al. (2016) and Liu et al. (2016) originated this line of work, where they combined the Langevin Stein operator proposed by Gorham & Mackey (2015) with an RKHS; remarkably, the resulting (Langevin) KSD is computable for densities with unknown normalizing constants. There have been numerous extensions to models in other data domains: categorical data (Yang et al., 2018), point-pattern data (Yang et al., 2019), censored data (Fernandez et al., 2020), directional data (Xu & Matsuda, 2020), and functional data (Wynne et al., 2022). We refer the reader to Anastasiou et al. (2021) for a review of the KSD's applications outside goodness-of-fit testing.
A limitation of the preceding Stein works is that they require the model distribution to be defined on a fixed-dimension space. For example, Kanagawa et al. (2023) evaluate latent Dirichlet allocation models (Blei et al., 2003) for text documents assuming a fixed document length, an assumption unlikely to hold in practice. Relevant work has been accomplished by Wynne et al. (2022), who propose a KSD for distributions on an (infinite-dimensional) Hilbert space and treat continuous-time models, such as Brownian motions and stochastic differential equation models. While their test applies to sequential models, our target setting is different, as we focus on discrete-time models such as Markov chains, and do not require a density with respect to a Gaussian measure.
In the present work, we extend the KSD framework to the variable-dimension setting. Specifically, we identify a Stein operator for distributions on the set of all univariate sequences over finite alphabets. This is, to our knowledge, the first such operator, and it may be of independent interest. Our Stein operator builds on the class of Zanella-Stein operators recently proposed by Hodgkinson et al. (2020), which are derived from a Markov jump process having the target distribution as its invariant measure; we review the Zanella-Stein operator in Section 2. In Section 3, we derive a new Stein operator using a Markov process admitting transitions between sets of different sizes. Based on our Stein operator and the associated KSD, we propose a novel goodness-of-fit test for sequential models. As in previous variants, the proposed KSD does not require the model to be normalized, allowing the evaluation of intractable models, including Markov random fields (Section 4.5) and conditional generative models (Section 4.6). As the proposed operator involves a number of tunable parameters, we offer, as our second contribution, guidance on how to select them based on empirical power analysis.
2. Background

All Stein discrepancies are built around a particular Stein operator. In this section, we recall a class of operators, the Zanella-Stein operators, and describe the challenges in constructing operators of this class in the sequential setting. We also review the associated kernel Stein discrepancy (KSD), a goodness-of-fit measure that leverages a given Stein operator to construct a test statistic.
Zanella-Stein Operator  The Zanella-Stein (ZS) operators are a class of Stein operators for distributions on a countable set $\mathcal{X}$, proposed by Hodgkinson et al. (2020) following the generator method of Barbour (1988). In the following, we assume that the model $P$ has a probability mass function $p$ that is positive everywhere on $\mathcal{X}$. For a real-valued function $f$ on $\mathcal{X}$, a ZS operator $\mathcal{A}_P$ is defined by

$$(\mathcal{A}_P f)(x) := \sum_{y \in \partial x} g\!\left(\frac{p(y)}{p(x)}\right) \left( f(y) - f(x) \right), \qquad (1)$$

where $\partial x \subset \mathcal{X}$ is a set of neighborhood points, and $g : (0,\infty) \to (0,\infty)$ is a balancing function, which must satisfy $g(t) = t\,g(1/t)$ for $t > 0$. Several choices of $g$ are known in the literature: the minimum probability flow operator discussed in (Barp et al., 2019) uses $g(t) = \sqrt{t}$; the choice $g(t) = t/(1+t)$ is known as the Barker Stein operator (Barker, 1965; Shi et al., 2022).
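To make the definition concrete, the action of a ZS operator as in (1) can be sketched in a few lines of Python. The toy model, neighborhood, and function names below are our own illustrations (not from the paper); we use the Barker balancing function and check the zero-expectation property on a small cyclic state space.

```python
def barker(t):
    # Barker balancing function g(t) = t / (1 + t); satisfies g(t) = t * g(1/t).
    return t / (1.0 + t)

def zs_operator(f, p, neighbors, g=barker):
    # (A_P f)(x) = sum over y in the neighborhood of x of g(p(y)/p(x)) * (f(y) - f(x)).
    def Af(x):
        return sum(g(p(y) / p(x)) * (f(y) - f(x)) for y in neighbors(x))
    return Af

# Toy model: unnormalized weights on {0, ..., 9} (the normalizer is never needed
# to apply the operator) with a symmetric, strongly connected cyclic neighborhood.
states = range(10)
p = lambda x: 0.5 ** x
neighbors = lambda x: ((x - 1) % 10, (x + 1) % 10)

Af = zs_operator(lambda x: float(x), p, neighbors)

# Zero-expectation property: E_{X~P}[(A_P f)(X)] = 0, computed with the normalized pmf.
Z = sum(p(x) for x in states)
mean_under_p = sum(Af(x) * p(x) / Z for x in states)
```

Because the jump rates satisfy detailed balance for $P$ (a consequence of $g(t) = t\,g(1/t)$), the expectation above vanishes up to floating-point error, even though only the unnormalized $p$ is used inside the operator.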
A Zanella-Stein operator is the infinitesimal generator of a Markov jump process called a Zanella process (Zanella, 2019; Power & Goldman, 2019), designed to have the target distribution $P$ as an invariant distribution. The process is based on the idea of locally informed proposals: these jump from a given point $x$ to any of its neighbors $y$ at a rate $g(p(y)/p(x))$, so that detailed balance is satisfied for $P$, ensuring that $P$ is an invariant distribution. For the operator to characterize $P$, we must specify our notion of neighborhood. Let $(\mathcal{V}, \mathcal{E})$ be the directed graph with vertices $\mathcal{V} = \mathcal{X}$ and edges $\mathcal{E} = \{(x, y) \mid y \in \partial x\}$ induced by the chosen neighborhood structure. The graph $(\mathcal{V}, \mathcal{E})$ must have the following properties:

1. Symmetry: $(x, y) \in \mathcal{E} \iff (y, x) \in \mathcal{E}$.

2. Strong connectivity: the transitive closure of $\mathcal{E}$ is $\mathcal{X} \times \mathcal{X}$, so that there is a path from every point to every other point.

In fact, symmetry alone implies that $P$ is in the set of invariant distributions; strong connectivity is required to ensure that $P$ is the unique invariant distribution, as otherwise the corresponding Stein discrepancy cannot distinguish $P$ from other distributions, even for a rich test function class. In Hodgkinson et al. (2020), an aperiodicity condition is additionally required. For a pure jump-type process, we only need the irreducibility of the process (Kallenberg, 2021, Theorem 13.12), and hence we may dispense with this condition. Note that we can conveniently choose a sparse neighborhood to accelerate the computation, although this may result in reduced sensitivity of the discrepancy. In Section 4, we discuss in more detail the tradeoff between computational cost and test power.
Challenges in the Variable-Dimension Setting  The generator method (as in the ZS operator above) allows us to obtain an operator for any state space with an appropriate Markov process. Indeed, the Zanella process is not the only allowable choice: other Markov chains or processes have also been considered for discrete spaces, including birth-death processes (Shi et al., 2022) and the Glauber dynamics Markov chain (Reinert & Ross, 2019; Bresler & Nagaraj, 2019) (we refer the reader to Shi et al., 2022, for a review). Defining a Markov process is often a challenge, however; prior work has thus only considered processes in fixed-dimensional spaces, and has not dealt with sequential models such as those considered here. The ZS operators are particularly attractive in our setting, since they may be defined on any discrete set. The main challenge in deriving an operator in this class is the construction of an appropriate neighborhood for models of variable-length sequences, which is highly nontrivial. Our contribution is to develop an effective neighborhood definition for the sequential setting, which satisfies the requirements outlined above and yields an associated test with good statistical power against alternatives.
Kernel Stein Discrepancy  We next review the construction of a test statistic from the Stein operator. A computable discrepancy measure may be defined using a reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950; Steinwart & Christmann, 2008). Following (Oates et al., 2016; Chwialkowski et al., 2016; Liu et al., 2016; Gorham & Mackey, 2017), we define the kernel Stein discrepancy (KSD) associated with our (yet to be specified) Zanella-Stein operator as follows:

$$\mathrm{KSD}(Q \,\|\, P) := \sup_{\|f\|_{\mathcal{H}} \le 1} \mathbb{E}_{X \sim Q}[\mathcal{A}_P f(X)], \qquad (2)$$

where $\mathcal{H}$ is the RKHS of real-valued functions on $\mathcal{X}$ corresponding to a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, with the natural norm $\|f\|_{\mathcal{H}} = \sqrt{\langle f, f \rangle_{\mathcal{H}}}$ defined by the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. By the vanishing property $\mathbb{E}_{Y \sim P}[\mathcal{A}_P f(Y)] = 0$, the KSD is zero if $Q = P$. The other direction requires further assumptions on the operator and the RKHS. According to Hodgkinson et al. (2020, Proposition 4), when the corresponding Zanella process is exponentially ergodic and the RKHS is $C_0$-universal, we have $\mathrm{KSD}(Q \,\|\, P) = 0$ only if $P = Q$. For a finite state space $\mathcal{X}$, exponential ergodicity is satisfied by any irreducible Markov jump process. $C_0$-universality is equivalent to integrally strict positive definiteness (Sriperumbudur et al., 2011); in the particular case $|\mathcal{X}| < \infty$, this condition states that the Gram matrix defined over all the configurations in $\mathcal{X}$ is strictly positive definite.
The use of an RKHS yields a closed-form expression for the KSD (Hodgkinson et al., 2020, see Proposition 1 and the paragraph following the proof):

$$\mathrm{KSD}^2(Q \,\|\, P) = \mathbb{E}_{X, X' \sim Q \otimes Q}[h_p(X, X')], \qquad (3)$$

where $X, X'$ are i.i.d. random variables with law $Q$, provided $\mathbb{E}_{X \sim Q}[h_p(X, X)^{1/2}] < \infty$. The proof of this result follows as in (Gorham & Mackey, 2017, Proposition 2), noting that the inner product between $f$ and $\mathcal{A}_P k(x, \cdot)$ reproduces $\mathcal{A}_P f(x)$ for any $f \in \mathcal{H}$. The function $h_p$ is called a Stein kernel, defined by

$$h_p(x, y) = \sum_{\nu \in \mathcal{N}_x} \sum_{\tilde\nu \in \mathcal{N}_y} g_\nu(x)\, g_{\tilde\nu}(y) \left\{ k(\nu(x), \tilde\nu(y)) + k(x, y) - k(x, \tilde\nu(y)) - k(\nu(x), y) \right\},$$

where we identify each point in the neighborhood $\partial x$ with a mapping $\nu$ from $\mathcal{X}$ to itself (with $\mathcal{N}_x$ denoting the set of such mappings), and $g_\nu(x) = g\{p(\nu(x))/p(x)\}$.
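As an illustration, the Stein kernel above can be computed directly from the neighborhood structure when each neighbor $\nu(x)$ is enumerated explicitly. The following is our own sketch with a toy state space and a Laplacian kernel, not code from the paper; the zero-mean property $\mathbb{E}_{X \sim P}[h_p(X, y)] = 0$ for each fixed $y$ gives a useful correctness check.

```python
import math

def stein_kernel(x, y, p, neighbors, k, g=lambda t: t / (1.0 + t)):
    # h_p(x, y): double sum over the neighbors nu(x) of x and nu~(y) of y,
    # with weights g_nu(x) = g(p(nu(x)) / p(x)); the Barker g is the default.
    total = 0.0
    for nx in neighbors(x):
        gx = g(p(nx) / p(x))
        for ny in neighbors(y):
            gy = g(p(ny) / p(y))
            total += gx * gy * (k(nx, ny) + k(x, y) - k(x, ny) - k(nx, y))
    return total

# Toy setup: unnormalized pmf on {0, ..., 9}, cyclic neighborhood, Laplacian kernel.
p = lambda x: 0.5 ** x
neighbors = lambda x: ((x - 1) % 10, (x + 1) % 10)
k = lambda a, b: math.exp(-abs(a - b))

# Zero-mean check at a fixed y = 3: the expectation of h_p(X, 3) under P vanishes.
Z = sum(p(x) for x in range(10))
mean_at_3 = sum(stein_kernel(x, 3, p, neighbors, k) * p(x) / Z for x in range(10))
```

For fixed $y$, the map $x \mapsto h_p(x, y)$ is itself a ZS-operator image, so the check holds exactly (up to floating point) whenever the neighborhood graph satisfies the conditions of Section 2.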
Given a sample $\{X_i\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} Q$, the squared KSD expression (3) admits a simple unbiased estimator

$$U_{n,P}(\{X_i\}_{i=1}^n) := \frac{1}{n(n-1)} \sum_{i \ne j}^{n} h_p(X_i, X_j), \qquad (4)$$

which is a U-statistic (Hoeffding, 1948). By the zero-mean property of the Stein kernel, the U-statistic is degenerate if $Q = P$ (Chwialkowski et al., 2016; Liu et al., 2016). In this case, the scaled statistic $n\,U_{n,P}(\{X_i\}_{i=1}^n)$ asymptotically follows the law of $\sum_{j=1}^{\infty} \lambda_j (Z_j^2 - 1)$, where the $Z_j$ are i.i.d. standard normal variables, and the $\lambda_j$ are eigenvalues given by the eigenvalue problem $\sum_{x \in \mathcal{X}} h_p(x, x')\, a(x)\, p(x) = \lambda\, a(x')$ with $\sum_{x \in \mathcal{X}} a(x)^2 p(x) < \infty$ (see, e.g., Serfling, 2009, Section 5.5). One of our proposed tests in Section 3 simulates this distribution using a wild bootstrap procedure.
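The U-statistic (4) and the wild bootstrap mentioned above can be sketched as follows. The Rademacher-multiplier bootstrap shown here is a generic recipe for degenerate U-statistics; the paper's exact bootstrap procedure may differ, and all names are our own.

```python
import random

def ksd_u_statistic(sample, h):
    # Unbiased estimator (4): average of h(X_i, X_j) over all ordered pairs i != j.
    n = len(sample)
    return sum(h(xi, xj)
               for i, xi in enumerate(sample)
               for j, xj in enumerate(sample) if i != j) / (n * (n - 1))

def wild_bootstrap_pvalue(sample, h, n_boot=200, seed=0):
    # Approximate the degenerate null law of the statistic by re-weighting each
    # pair h(X_i, X_j) with Rademacher multipliers e_i * e_j, e_i in {-1, +1}.
    rng = random.Random(seed)
    n = len(sample)
    H = [[h(xi, xj) for xj in sample] for xi in sample]
    off_diag = [(i, j) for i in range(n) for j in range(n) if i != j]
    stat = sum(H[i][j] for i, j in off_diag) / (n * (n - 1))
    exceed = 0
    for _ in range(n_boot):
        e = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        boot = sum(e[i] * e[j] * H[i][j] for i, j in off_diag) / (n * (n - 1))
        exceed += boot >= stat
    return (exceed + 1) / (n_boot + 1)  # reject at level alpha if p-value <= alpha
```

The `+1` correction in the returned p-value keeps the test conservative at finite bootstrap sample sizes.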
Kernels for Sequences  The performance of the test developed in the next section depends on the choice of the reproducing kernel. As mentioned above, $C_0$-universality is a condition that guides kernel selection, since it renders the test consistent (along with the exponential ergodicity). A trivial example of a $C_0$-universal kernel is the Dirac kernel, which outputs 1 if its two inputs are identical and 0 otherwise; but this kernel might not be useful in practice, since all data points in the corresponding feature space are orthogonal and provide no information about each other. As this example shows, kernels need not be universal to provide powerful tests, as long as they encode features relevant to the setting where they are used. The literature on sequence kernels is well-established (see, e.g., Király & Oberhauser, 2019, for a recent account), and one could use existing kernels, such as global alignment kernels (Cuturi et al., 2007; Cuturi, 2011) and string kernels (Haussler, 1999; Lodhi et al., 2002; Leslie & Kuang, 2004). Alternatively, one may define a kernel by the inner product of explicit features (e.g., a neural network embedding of sequences) instead of using known kernels.
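As one concrete option in the explicit-feature family, the following sketch implements a simple k-mer spectrum kernel, in the spirit of the string kernels of Leslie & Kuang (2004); it is defined for strings of any length, matching our variable-dimension setting. The details here are our own simplification, not the paper's choice of kernel.

```python
from collections import Counter

def spectrum_kernel(s, t, k=2):
    # Inner product of explicit k-mer count features: the kernel value counts
    # matching length-k substrings (with multiplicity) between the two strings.
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(count * ct[mer] for mer, count in cs.items())
```

Being an inner product of explicit count features, this kernel is positive definite by construction, and sequences of different lengths are compared through their shared k-mers.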
3. Testing Sequential Models

We now address the design of a Stein operator for the variable-dimension setting. We begin by formally defining the sample space $\mathcal{X}$. Let $\mathcal{S}$ be a nonempty finite set of symbols. For an integer $\ell \ge 1$, we denote by $\mathcal{S}^{(\ell)}$ the set of all length-$\ell$ sequences formed by symbols in $\mathcal{S}$; i.e., $\mathcal{S}^{(\ell)} = \prod_{j=1}^{\ell} \mathcal{S}$. Then, the sample space $\mathcal{X}$ is given by the set of all sequences $\bigsqcup_{\ell=1}^{\ell_{\max}} \mathcal{S}^{(\ell)}$, where $\bigsqcup$ denotes the disjoint union, and $\ell_{\max}$ is $\infty$ or a finite positive integer. In this setting, a random sample in $\mathcal{X}$ is a sequence whose length is randomly determined.

As noted in Section 2, to obtain a KSD in this setting, we need to design a neighborhood structure which specifies our ZS operator. We first introduce a simple structure for sequences, which edits only the end of the string, to illustrate the required connectivity principles. We then describe a more advanced neighborhood structure, allowing for greater modifications, which will be the foundation of our tests.
Neighborhood Choice  For the Stein operator to uniquely characterize the model, we require strong connectivity of the induced graph. We achieve this by introducing inter- and intra-length state transitions. Specifically, for a length-$\ell$ sequence $x^{(1:\ell)} = (x_1, \ldots, x_\ell) \in \mathcal{S}^{(\ell)}$, we propose the following neighborhood:

$$\partial x^{(1:\ell)} = I_x \cup \{x^{(1:\ell-1)}\} \cup R_x \qquad (5)$$

with

$$I_x = \{(x_1, \ldots, x_\ell, s) : s \in \mathcal{S}\}, \qquad R_x = \{(x_1, \ldots, x_{\ell-1}, s) : s \in \mathcal{N}_{\mathrm{rep}}(x_\ell)\},$$

where $\mathcal{N}_{\mathrm{rep}}(s)$ denotes a neighborhood chosen for a symbol $s \in \mathcal{S}$; this is chosen such that the induced graph is strongly connected. The first two sets in (5) represent the inter-length transitions: each point of $I_x$ corresponds to adding a symbol to the end of the given sequence $x^{(1:\ell)}$, whereas $\{x^{(1:\ell-1)}\}$ stands for the deletion of the last symbol, shortening the sequence. The third set $R_x$ represents intra-length transitions, where the last symbol of the sequence is replaced. A trivial choice satisfying strong connectivity is to use the whole alphabet $\mathcal{S}$ as $\mathcal{N}_{\mathrm{rep}}(s)$ for each $s \in \mathcal{S}$. This approach may be permitted when the alphabet size is small. One might otherwise want to consider using smaller sets to reduce the computational cost. For example, we can use a smaller $\mathcal{N}_{\mathrm{rep}}(s)$, provided that the graph (with vertices $\mathcal{S}$) induced by the choice satisfies the conditions required in Section 2; e.g., as in (Yang et al., 2018), we may introduce a cyclic order $\{0, \ldots, |\mathcal{S}| - 1\}$ on $\mathcal{S}$, and take the two adjacent points $\mathcal{N}_{\mathrm{rep}}(s) = \{s + 1, s - 1\}$, where $s \pm 1$ denotes $(s \pm 1) \bmod |\mathcal{S}|$.
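For concreteness, the minimal neighborhood (5), with the cyclic two-neighbor choice of $\mathcal{N}_{\mathrm{rep}}$ just described, can be enumerated as follows. Representing sequences as tuples and guarding the length-1 boundary case (so the space stays the set of nonempty sequences) are our own implementation choices.

```python
def minimal_neighborhood(x, alphabet):
    # Neighbors of the sequence x (a tuple of symbols), per equation (5):
    # tail insertions I_x, deletion of the last symbol, and replacements R_x.
    size = len(alphabet)
    index = {s: i for i, s in enumerate(alphabet)}
    nbrs = [x + (s,) for s in alphabet]      # I_x: append any symbol
    if len(x) > 1:                           # deletion, keeping the sequence nonempty
        nbrs.append(x[:-1])
    i = index[x[-1]]                         # R_x: cyclically adjacent last symbols
    for j in ((i - 1) % size, (i + 1) % size):
        nbrs.append(x[:-1] + (alphabet[j],))
    return nbrs
```

For example, with alphabet `('a', 'b', 'c')`, the sequence `('a', 'b')` has three tail insertions, one deletion, and two replacements, for six neighbors in total; symmetry holds because every deletion is matched by a tail insertion in the reverse direction.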
The structure proposed above is a minimal neighborhood choice that guarantees strong connectivity. Indeed, we can edit the length of a sequence, and change the character in each position to a desired one because of the strong connectivity in the intra-length transitions. As we will see in Section 4, however, this minimal choice may not produce a powerful test, as it only uses information from sequences of neighboring lengths. For example, the tail insertion operation in $I_x$ only encodes the structure of the target distribution in terms of the end of the sequence. For sequential models such as Markov chain models, the longer a sequence is, the more evidence we should have in favor of, or against, a given model. Editing only the final element of a sequence would therefore not make use of such accumulated evidence. Indeed, this approach performs relatively weakly in our benchmark experiments (Figure 2b, case of $J = 1$).
Proposed Neighborhood  Based on the above observation, we generalize the above neighborhood set. Specifically, we consider expanding the neighborhood set by modifying sequence elements at different locations: we propose the $J$-location modification neighborhood

$$\partial^J x^{(1:\ell)} = \bigcup_{j=1}^{J} \left( I_{x^{(1:\ell)},j} \cup D_{x^{(1:\ell)},j} \cup R_{x^{(1:\ell)},j} \right), \qquad (6)$$

where

$$I_{x^{(1:\ell)},j} = \{\mathrm{ins}(x^{(1:\ell)}, j, s) : s \in \mathcal{N}_{\mathrm{ins}}\} \subset \mathcal{S}^{(\ell+1)},$$
$$D_{x^{(1:\ell)},j} = \{\mathrm{del}(x^{(1:\ell)}, j) \text{ if } x_{\ell-j+1} \in \mathcal{N}_{\mathrm{ins}}\} \subset \mathcal{S}^{(\ell-1)},$$
$$R_{x^{(1:\ell)},j} = \{\mathrm{rep}(x^{(1:\ell)}, j, s) : s \in \mathcal{N}_{\mathrm{rep}}(x_{\ell-j+1})\} \subset \mathcal{S}^{(\ell)}.$$

Here, $\mathrm{ins}(x^{(1:\ell)}, j, s)$ denotes the sequence extending $x^{(1:\ell)}$ by inserting $s$ at the $(j-1)$-th location counting back from the end of the string, using symbols from a fixed set $\mathcal{N}_{\mathrm{ins}} \subset \mathcal{S}$; $\mathrm{del}(x^{(1:\ell)}, j)$ deletes the $(j-1)$-th element counting back from the end of $x^{(1:\ell)}$; and $\mathrm{rep}(x^{(1:\ell)}, j, s)$ replaces by $s$ the $(j-1)$-th element from the end of $x^{(1:\ell)}$. Note that the deletion and insertion operations must be paired to