Testing Independence of
Exchangeable Random Variables
Marcus Hutter
DeepMind
Latest version & more @
http://www.hutter1.net/official/bib.htm#exiid
22 October 2022
Abstract
Given well-shuffled data, can we determine whether the data items are sta-
tistically (in)dependent? Formally, we consider the problem of testing whether
a set of exchangeable random variables are independent. We will show that this
is possible and develop tests that can confidently reject the null hypothesis that
data is independent and identically distributed and have high power for (some)
exchangeable distributions. We will make no structural assumptions on the underlying
sample space. One potential application is in Deep Learning, where data is often
scraped from the whole internet and duplications abound, which can render data
non-iid and test-set evaluation prone to give wrong answers.
Contents
1 Introduction 2
2 Examples & Potential Applications 3
3 Problem Formalization and Preliminaries 5
4 Reducing general X to N 8
5 I.I.D. Tests 10
6 Toy/Control Experiments 17
7 More/Alternative Tests 22
8 Conclusion 26
References 27
A Technical Lemmas 28
B I.I.D. Tests for Multinomial Distribution 32
C Technical Lemmas for Multinomial vs Poisson 35
D List of Notation 41
Keywords
independent; identically distributed; exchangeable random variables; statistical
tests; unstructured data.
arXiv:2210.12392v1 [math.ST] 22 Oct 2022
1 Introduction
We consider the problem of testing whether a set of exchangeable random variables
X1,...,Xn are independent, solely from observations x1:n := x1x2...xn sampled from Q. A
distribution Q(x1,...,xn) is called (finitely) exchangeable if it is invariant under all (finite)
permutations of its arguments. We make no structural assumptions on the underlying
probability space (X^n, Σ, Q) beyond Q being exchangeable, and of course that X ⊇
{x1,...,xn}. Less formally, assume we have observed x1:n, which we believe to be well-shuffled,
and want to know whether they originated from some iid distribution Pθ. The
shuffling implies that the Xt are identically distributed, but it does not make them
independent.
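De Finetti-style mixtures of iid distributions give concrete examples of this phenomenon. The following minimal sketch (the mixture weights and Bernoulli biases are illustrative choices of mine, not from the paper) defines an exchangeable Q that is identically distributed but not independent:

```python
from itertools import permutations, product
from math import prod

def q(xs, thetas=(0.2, 0.8)):
    """Equal mixture of two iid Bernoulli laws: exchangeable, not independent."""
    return sum(0.5 * prod(th if x else 1 - th for x in xs) for th in thetas)

# Exchangeability: q is invariant under every permutation of its arguments.
assert all(abs(q(p) - q(xs)) < 1e-12
           for xs in product((0, 1), repeat=3) for p in permutations(xs))

# Marginals are identical (fair), yet the coordinates are dependent:
assert abs(q((1,)) - 0.5) < 1e-12
assert abs(q((1, 1)) - 0.34) < 1e-12   # differs from 0.5 * 0.5 = 0.25
```

Observing one coordinate shifts belief about the mixture component, which is exactly the dependence that shuffling alone cannot remove.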
A priori one may think this is a hopeless problem. For instance, if we remove
the 'identically distributed' condition, every x1:n is independent w.r.t. some non-iid
distribution. One can always take P[Xt = xt] = 1 ∀t, i.e. no valid test can reject the
hypothesis that x1:n are independent, unless one makes some further assumptions on P.
The exchangeability assumption implies that the only useful information in x1:n is
the counts nx := |{t : xt = x}| of each x ∈ X. They form a minimal sufficient statistic. Due
to the assumed lack of structure in X, the specific label x also bears no information,
so we may as well injectively map each xt to a label from {1,...,d′′}, where d′′ ∈ N is
the number of different xt in x1:n, and can even sort them e.g. w.r.t. decreasing nx.
That means, the only useful information in x1:n is actually the second-order counts
mk := |{x : nx = k}|. All this will be made clear later. We are primarily interested in the
case of low duplicity, i.e. most nx are small, though our results are general.
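For concreteness, both statistics can be computed in a few lines (the function name is mine):

```python
from collections import Counter

def first_and_second_order_counts(sample):
    """Return n_x (first-order counts) and m_k (second-order counts)."""
    n_x = Counter(sample)         # n_x = number of times item x occurs
    m_k = Counter(n_x.values())   # m_k = number of items occurring exactly k times
    return n_x, m_k

n_x, m_k = first_and_second_order_counts(["a", "b", "a", "c", "b", "b"])
# n_x = {'a': 2, 'b': 3, 'c': 1};  m_k = {1: 1, 2: 1, 3: 1}
# Sanity check: sum_k k * m_k recovers the sample size n.
assert sum(k * m for k, m in m_k.items()) == 6
```

The identity Σ_k k·mk = n holds for every sample, which is a useful consistency check in implementations.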
Contents. We provide some motivating examples in Section 2, which also informally
show that the second-order counts mk can indeed sometimes reveal that x1:n did not
come from an iid process. A practical example is data duplication in machine learning
and the test set contamination problem it results in. In Section 3 we introduce first-order
counts nx and second-order counts mk, exchangeable distributions, and the nature
of statistical tests in this context for finite and countable X. In particular we reduce iid
distributions to multinomial distributions, and then for our tests to a mixture of Poisson
distributions. We then show in Section 4 that we can reduce every X whatsoever to
X = N or X = {1:d} with discrete σ-algebra Σ = 2^X. After this lengthy preparation,
we are finally able to develop our statistical tests in Section 5. The tests we consider
are based on the observation that a mixture of Poisson distributions is "smooth", so
if mk as a function of k is not sufficiently smooth, this can be used as evidence for
dependence. The tests are summarized in Theorem 11. We experimentally verify our
tests in Section 6 on artificially generated data. In Section 7 we give an outlook on
alternative ways of deriving iid tests for exchangeable data. Section 8 concludes.
In Appendix A, we state/derive a number of technical lemmas we require to derive
our tests. For improving the power of our tests, in Appendix B we derive upper bounds
on our test statistics analogous to Section 5 but without the Poisson approximation,
i.e. directly for iid X or the multinomial distribution. Details on the multinomial and
product of Poisson distributions and their relation can be found in Appendix C. They
are used to derive upper bounds on the variance of our tests in the multinomial model.
A list of notation can be found in Appendix D.
Unrelated work. Independence tests in the literature most often refer to testing
whether a pair of random variables (X,Y ) is independent, given a number of iid(!)
sample pairs {(xt,yt)} (mutual information and chi-square tests are popular). Our setup
is totally different and much harder.
Another setup is stochastic processes. Dependence can be tested via estimating
auto-correlation coefficients, but this requires ordered data and X = R. One could use
some other independence test on the pairs {(xt−1, xt)} without the X = R assumption to
test a Markov vs iid hypothesis, and/or adapt auto-correlation tests to unordered data.
We briefly remark on this in Section 7.
2 Examples & Potential Applications
In this section we will provide some motivating examples and potential applications.
This will also provide some intuition for why rejecting the hypothesis Hiid that data is iid
is possible at all, but also for the difficulty arising from not having any further structure available.
We consider biased coin flips (binomial process), Black Jack, and data duplication. We
also discuss the relevance to machine learning, whose dominant training paradigm still
operates under the iid assumption.
Binomial. Consider a binary sequence x1:n = x1x2...xn ∈ {0,1}^n of length n = 1000,
say. If x1:n/2 = 1^{n/2} and xn/2+1:n = 0^{n/2}, we confidently reject the hypothesis Hiid that
x1:n was sampled i.i.d. But we are unlikely to observe such a sequence if x1:n is well-shuffled
(sampled from an exchangeable process). If we shuffle 500 ones and 500 zeros,
a typical x1:n = 0111100101...0100100100 looks random, or does it? There are exactly
n0 = n1 = n/2 = 500 ones and zeros. While for a fair coin, we expect about n/2 ones,
would or should you believe anyone telling you that this is a sequence of fair coin flips?
The probability of observing exactly n/2 ones in a sequence of n fair coin flips is around
√(2/(πn)) ≈ 2.5%, so a test for N1 ?= n/2 would confidently reject the hypothesis that the
x1:n above arose from a fair coin.
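The 2.5% figure is easy to reproduce; a quick check of the exact central binomial probability against the √(2/(πn)) approximation:

```python
from math import comb, pi, sqrt

n = 1000
p_exact = comb(n, n // 2) / 2 ** n   # exact P[N1 = n/2] for a fair coin
p_approx = sqrt(2 / (pi * n))        # Stirling approximation sqrt(2/(pi*n))

# Both are about 0.025, i.e. roughly 2.5%.
assert abs(p_exact - p_approx) < 1e-4
assert 0.024 < p_exact < 0.026
```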
What about n = 100000000 and n1 = 3140159? Obviously this is not from a fair coin,
but our aim is to test for iid, not fairness. Could such a sequence have been the result
of a biased coin? Since we assume the bits to be perfectly shuffled, n1 is a sufficient
statistic, so any test plausibly should only depend on n1, and not the sequence x1:n
itself. The probability that N1 = n1 is at most 0.00086 for a coin of any bias, i.e. the test
N1 ?= n1 would reject Hiid. Of course, tests have to be designed before observing the data, and
a-priori n1 = 3140159 is unlikely, so such a test is unlikely to have any power. We can of
course combine tests and apply a union bound, but not too many, otherwise the tests
become too weak. n/2 seems special, so maybe we should put such a test in the mix, but
what about 3140159? It's the first 6 digits of π. Maybe this is too much numerology, but
what about testing for prime n1? The density of primes p is around 1/ln p, so a-priori we
should expect an n1 around 3·10^6 to be composite (with confidence 1 − 1/ln(n1) ≈ 93%).
Maybe this is just not enough to reject Hiid, but we could always up the numbers.
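The union-bound combination mentioned above is ordinary Bonferroni correction. A minimal sketch (function name mine) showing both how tests combine and why too many of them sap power:

```python
def union_bound_test(p_values, alpha=0.05):
    """Combine m valid tests via the union bound (Bonferroni):
    reject H_iid iff any single test is significant at level alpha/m.
    The overall false-rejection probability is then at most
    m * (alpha/m) = alpha, but each test's threshold shrinks with m."""
    m = len(p_values)
    return any(p < alpha / m for p in p_values)

assert union_bound_test([0.20, 0.004]) is True    # 0.004 < 0.05/2 = 0.025
assert union_bound_test([0.03, 0.04]) is False    # neither is below 0.025
```

The second example illustrates the power loss: a p-value of 0.03 would reject on its own at level 5%, but not once the level is split between two tests.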
Imagine n so large that the binary representation of n1 contains some encrypted
message or a long segment of Chaitin's number of wisdom. In general, finding every
pattern in n1 is an AI-complete problem. There are universal tests which in principle
could test for all such eventualities, and in some situations practical approximations
thereof can be very powerful. We discuss them briefly in Section 7, but we were not
able to make them work as well as the specific tests developed in this paper, so will not
consider universal tests any further (nor will we delve into numerology any further).
Black Jack. A standard deck of cards without Jokers consists of 52 cards of 13 ranks,
each in four suits, two red and two black. If we shuffle together infinitely many such
decks and then draw n cards, this is equivalent to drawing cards uniformly iid from the
|X| = 52 different card faces. If we have only one deck and draw all 52 cards from it,
we obviously observe every card exactly once. Such an outcome would be extremely
unlikely had we drawn 52 cards from an infinite set of decks (see (5)).
Assume now an unknown number of decks have been shuffled together. Assume n
cards have been drawn so far from this pile and we remembered their faces (a card-counting
strategy). An interesting question is to infer the number of decks the cards
have been drawn from. Or consider the weaker question: Are x1:n consistent with Hiid?
If yes, we cannot infer the next card face better than by chance (1/52), so should not
waste our time trying to do so, and wait with raising the stakes until we have seen
more cards. The answer to this question is relevant even if we know the number of decks
c. For Black Jack, 1–8 decks are used, in casinos often 6. If n is small, we will not be
able to reject Hiid, but if we have seen all cards, then each card will appear exactly c
times, again ruling out Hiid. If we are close to the end of the pile, most faces will have
appeared c times, none more, and only a few significantly less. Even mid-way through
the pile, each face has appeared at most c times, which is evidence against Hiid for small
c. For instance, the chance of seeing no face twice when drawing 26 cards iid from 52
faces is less than 0.2% (cf. the birthday paradox). That is, at the latest half-way through a
single deck, this fact is revealed. If cards are drawn from 2 decks, more than 52 cards
are needed to reveal that they are not iid (Figure 3 bottom right).
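The birthday-paradox figure can be verified directly, since the probability that 26 iid uniform draws from 52 faces are all distinct is the product of the successive no-collision probabilities:

```python
# Chance that 26 iid uniform draws from 52 card faces are all distinct:
# prod_{i=0}^{25} (52 - i) / 52, the birthday-paradox computation
# behind the "less than 0.2%" figure.
p_all_distinct = 1.0
for i in range(26):
    p_all_distinct *= (52 - i) / 52

assert 0 < p_all_distinct < 0.002   # below 0.2%, as claimed
```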
Data duplication. In modern Machine Learning, esp. Deep Learning, data x1:n is
abundant (large n) and observation spaces are huge (large X). For instance, ImageNet
consists of over 14 million images, usually resized or cropped to e.g. 224×224 pixels of
256×256×256 colors, i.e. |X| = 256^{3×224×224}. We can as well assume that X is infinite.
Assume x1:n contains no duplicate images and is well shuffled. As we will show later,
no valid test can reject Hiid in this case. But if x1:n contains duplicates one may be
able to reject Hiid, similarly to the Black Jack example above, even without knowing
anything about the observation space X. For instance, assume every observation is
duplicated, i.e. every x that appears in x1:n appears exactly twice. For uncountable X,
if x is sampled from a probability density, the probability of sampling the same x twice
is 0. So duplications can only happen if Pθ contains point masses, i.e. is not purely
continuous, i.e. Pθ(x) > 0 for some x. For finite or countable X, this is necessarily true.
But if Pθ(x) > 0, then the frequency of seeing x is binomially distributed. While seeing
some x twice is plausible, seeing all x exactly twice is very unlikely, so we can reject Hiid:
If data is iid and some items are duplicated, we should also see triples and quadruples,
etc.
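Under the Poisson approximation the paper employs for its tests, this "doubles imply triples" intuition can be quantified. If an item's count is Poisson(λ), then the expected number of triples is a fixed positive fraction λ/3 of the expected number of doubles; the parameters below are illustrative choices of mine:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P[count = k] for a Poisson(lam) random variable."""
    return exp(-lam) * lam ** k / factorial(k)

# Illustrative numbers: d distinct items, each count ~ Poisson(lam).
lam, d = 0.1, 10_000
expected_m2 = d * poisson_pmf(2, lam)   # expected number of exact doubles
expected_m3 = d * poisson_pmf(3, lam)   # expected number of exact triples

# Under iid, E[m3]/E[m2] = lam/3 > 0, so observing many doubles but
# zero triples is evidence against H_iid.
assert abs(expected_m3 / expected_m2 - lam / 3) < 1e-12
```

With these numbers one expects about 45 doubles but only about 1.5 triples, so "all duplicates appear exactly twice" only becomes suspicious once enough doubles have accumulated.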
Relevance for machine learning. The predominant training and evaluation protocol
in Machine Learning in general and Deep Learning in particular is still to assume
the data is iid, train on most of the data, and evaluate on the rest. Interestingly, this is
true even for models dealing with definitely non-iid text. For instance, for Transformers,
text is crudely chopped into chunks of equal length and then shuffled. The empirical
test loss is an unbiased estimator of the true loss, so is a proper way of comparing the
performance of different models. Data sizes in modern machine learning are huge, so
that even 10% held-out data is so much that test noise is often of little concern. That's
at least the general story.
But in Deep Learning, data these days is often scraped from the whole internet, and
duplications abound. For instance, assume the whole data set contains 3 copies of each
data item. If we randomly split off 10% as the test set, then the train set contains nearly
all (99%) of the test set items. With heldout-validation, a pure memorizer without any
generalization capacity will perform nearly perfectly on the test set [BLH22], but will
fail in practice on future data. Indeed, shuffling the data makes this problem worst
[SEBF21, GB19]. The problem is known as test set contamination, and is well known.
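The 99% figure follows from 1 − 0.1²: a test item is "clean" only if both of its other two copies also landed in the 10% test split. A quick simulation (the dataset size is an illustrative choice of mine):

```python
import random

random.seed(0)
n_unique = 30_000
data = [x for x in range(n_unique) for _ in range(3)]  # 3 copies of every item
random.shuffle(data)

cut = len(data) // 10                  # split off 10% as the test set
test, train = data[:cut], data[cut:]
train_set = set(train)

# Fraction of test items that also occur in the train set: ~ 1 - 0.1^2 = 99%
contaminated = sum(x in train_set for x in test) / len(test)
assert contaminated > 0.95
```

A memorizer scored on this test set is effectively being graded on its training data.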
The standard solution is to decontaminate or clean the data, e.g. removing dupli-
cates, but this does not suffice. One has to remove approximate duplicates too. But
at what threshold for example should a document that cites a training-set document
verbatim be removed from the test-set? When are two images scraped from the inter-
net rescaled or cropped or jpeg compressed versions of the same image, and even if so,
should they be regarded duplicates? While approaches exist that meliorate the prob-
lem, in theory this problem is ill-defined [Hut06, FAQ], and in practice a huge, actually
AI-complete, problem [BMR+20, App.C].
Test set evaluation is empirically sound for iid data; the failure of this paradigm
must therefore be attributed to the non-iid nature of the data. The strength and
weakness of the iid tests developed in this paper are that they are completely model-
and data-agnostic. This makes them universally applicable and valid, but also very
weak. On the other hand, as discussed above, finding good/perfect model- and data-
type-sensitive tests is itself a difficult/impossible research question beyond the scope of
this article.
3 Problem Formalization and Preliminaries
We now introduce notation and concepts used throughout the paper: general notation,
the multinomial and binomial distributions, first-order and second-order counts, exchangeable
distributions, and statistical tests. The reader familiar with these concepts
can skim this section to just pick up the notational conventions used in this article.
Notation. We use calligraphic upper-case letters such as X for sets and |X| or #X for
the size of X. Probability spaces are denoted by (X^n, Σ, P) with d = |X| ∈ N ∪ {∞}.
Pθ denotes iid distributions. Q denotes exchangeable distributions. Capital letters
X, N, M, T, E, O, D, C, U, V, Z, Y denote random variables, and corresponding lower-case
letters samples corresponding to them. We will use the shorthand P(x) := P[X = x] and
similarly for other random variables. The variance of Z is V[Z] := E[Z²] − E[Z]² and
the covariance of Y and Z is Cov[Y,Z] := E[YZ] − E[Y]E[Z]. In addition to the classical
O() notation, O_P(f(n)) denotes all random functions F(n) for which ∀δ > 0 ∃c > 0 ∀n:
P[|F(n)| ≥ c·|f(n)|] ≤ δ. We use f(n) ≲ g(n) to denote f(n) ≤ g(n)·[1 ± O_P(1/n)], and
similarly ≳ and ≂. We use ≲ even when stronger asymptotic or non-stochastic bounds
would be possible, since we mostly care about the leading-order approximation in n and