Testing Independence of
Exchangeable Random Variables
Marcus Hutter
DeepMind
Latest version & more @
http://www.hutter1.net/official/bib.htm#exiid
22 October 2022
Abstract
Given well-shuffled data, can we determine whether the data items are sta-
tistically (in)dependent? Formally, we consider the problem of testing whether
a set of exchangeable random variables are independent. We will show that this
is possible and develop tests that can confidently reject the null hypothesis that
data is independent and identically distributed and have high power for (some)
exchangeable distributions. We will make no structural assumptions on the underlying
sample space. One potential application is in Deep Learning, where data is often
scraped from the whole internet and duplications abound, which can render data
non-iid and test-set evaluation prone to give wrong answers.
Contents
1 Introduction 2
2 Examples & Potential Applications 3
3 Problem Formalization and Preliminaries 5
4 Reducing general X to N 8
5 I.I.D. Tests 10
6 Toy/Control Experiments 17
7 More/Alternative Tests 22
8 Conclusion 26
References 27
A Technical Lemmas 28
B I.I.D. Tests for Multinomial Distribution 32
C Technical Lemmas for Multinomial vs Poisson 35
D List of Notation 41
Keywords
independent; identically distributed; exchangeable random variables; statistical
tests; unstructured data.
arXiv:2210.12392v1 [math.ST] 22 Oct 2022
1 Introduction
We consider the problem of testing whether a set of exchangeable random variables
X1,...,Xn are independent, solely from observations x1:n := x1x2...xn sampled from Q. A
distribution Q(x1,...,xn) is called (finitely) exchangeable if it is invariant under all (finite)
permutations of its arguments. We make no structural assumptions on the underlying
probability space (X^n, Σ, Q) beyond Q being exchangeable, and of course that X ⊇
{x1,...,xn}. Less formally, assume we have observed x1:n, which we believe to be well-shuffled,
and want to know whether they originated from some iid distribution Pθ. The
shuffling implies that the Xt are identically distributed, but it does not make them
independent.
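De Finetti-style mixtures of iid distributions give concrete examples of this phenomenon. The following minimal sketch (the mixture weights and Bernoulli biases are illustrative choices of mine, not from the paper) defines an exchangeable Q that is identically distributed but not independent:

```python
from itertools import permutations, product
from math import prod

def q(xs, thetas=(0.2, 0.8)):
    """Equal mixture of two iid Bernoulli laws: exchangeable, not independent."""
    return sum(0.5 * prod(th if x else 1 - th for x in xs) for th in thetas)

# Exchangeability: q is invariant under every permutation of its arguments.
assert all(abs(q(p) - q(xs)) < 1e-12
           for xs in product((0, 1), repeat=3) for p in permutations(xs))

# Marginals are identical (fair), yet the coordinates are dependent:
assert abs(q((1,)) - 0.5) < 1e-12
assert abs(q((1, 1)) - 0.34) < 1e-12   # differs from 0.5 * 0.5 = 0.25
```

Observing one coordinate shifts belief about the mixture component, which is exactly the dependence that shuffling alone cannot remove.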
A priori one may think this is a hopeless problem. For instance, if we remove
the 'identically distributed' condition, every x1:n is independent w.r.t. some non-iid
distribution. One can always take P[Xt = xt] = 1 ∀t, i.e. no valid test can reject the
hypothesis that x1:n are independent, unless one makes some further assumptions on P.
The exchangeability assumption implies that the only useful information in x1:n is
the counts nx := |{t : xt = x}| of each x ∈ X. They form a minimal sufficient statistic. Due
to the assumed lack of structure in X, the specific label x also bears no information,
so we may as well injectively map each xt to a label from {1,...,d′′}, where d′′ ∈ N is
the number of different xt in x1:n, and can even sort them e.g. w.r.t. decreasing nx.
That means, the only useful information in x1:n is actually the second-order counts
mk := |{x : nx = k}|. All this will be made clear later. We are primarily interested in the
case of low duplicity, i.e. most nx are small, though our results are general.
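For concreteness, both statistics can be computed in a few lines (the function name is mine):

```python
from collections import Counter

def first_and_second_order_counts(sample):
    """Return n_x (first-order counts) and m_k (second-order counts)."""
    n_x = Counter(sample)         # n_x = number of times item x occurs
    m_k = Counter(n_x.values())   # m_k = number of items occurring exactly k times
    return n_x, m_k

n_x, m_k = first_and_second_order_counts(["a", "b", "a", "c", "b", "b"])
# n_x = {'a': 2, 'b': 3, 'c': 1};  m_k = {1: 1, 2: 1, 3: 1}
# Sanity check: sum_k k * m_k recovers the sample size n.
assert sum(k * m for k, m in m_k.items()) == 6
```

The identity Σ_k k·mk = n holds for every sample, which is a useful consistency check in implementations.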
Contents. We provide some motivating examples in Section 2, which also informally
show that the second-order counts mk can indeed sometimes reveal that x1:n did not
come from an iid process. A practical example is data duplication in machine learning
and the test set contamination problem it results in. In Section 3 we introduce first-order
counts nx and second-order counts mk, exchangeable distributions, and the nature
of statistical tests in this context for finite and countable X. In particular we reduce iid
distributions to multinomial distributions, and then for our tests to a mixture of Poisson
distributions. We then show in Section 4 that we can reduce every X whatsoever to
X = N or X = {1:d} with discrete σ-algebra Σ = 2^X. After this lengthy preparation,
we are finally able to develop our statistical tests in Section 5. The tests we consider
are based on the observation that a mixture of Poisson distributions is "smooth", so
if mk as a function of k is not sufficiently smooth, this can be used as evidence for
dependence. The tests are summarized in Theorem 11. We experimentally verify our
tests in Section 6 on artificially generated data. In Section 7 we give an outlook on
alternative ways of deriving iid tests for exchangeable data. Section 8 concludes.
In Appendix A, we state/derive a number of technical lemmas we require to derive
our tests. For improving the power of our tests, in Appendix B we derive upper bounds
on our test statistics analogous to Section 5 but without the Poisson approximation,
i.e. directly for iid X or the multinomial distribution. Details on the multinomial and
product of Poisson distributions and their relation can be found in Appendix C. They
are used to derive upper bounds on the variance of our tests in the multinomial model.
A list of notation can be found in Appendix D.
Unrelated work. Independence tests in the literature most often refer to testing
whether a pair of random variables (X,Y ) is independent, given a number of iid(!)
sample pairs {(xt,yt)} (mutual information and chi-square tests are popular). Our setup
is totally different and much harder.
Another setup is stochastic processes. Dependence can be tested via estimating
auto-correlation coefficients, but this requires ordered data and X = R. One could use
some other independence test on the pairs {(xt−1, xt)} without the X = R assumption to
test a Markov vs iid hypothesis, and/or adapt auto-correlation tests to unordered data.
We briefly remark on this in Section 7.
2 Examples & Potential Applications
In this section we will provide some motivating examples and potential applications.
This will also provide some intuition for why rejecting the hypothesis Hiid that data is iid
is possible at all, but also for the difficulty arising from not having any further structure available.
We consider biased coin flips (binomial process), Black Jack, and data duplication. We
also discuss the relevance to machine learning, whose dominant training paradigm still
operates under the iid assumption.
Binomial. Consider a binary sequence x1:n = x1x2...xn ∈ {0,1}^n of length n = 1000,
say. If x1:n/2 = 1^{n/2} and xn/2+1:n = 0^{n/2}, we confidently reject the hypothesis Hiid that
x1:n was sampled i.i.d. But we are unlikely to observe such a sequence if x1:n is well-shuffled
(sampled from an exchangeable process). If we shuffle 500 ones and 500 zeros,
a typical x1:n = 0111100101...0100100100 looks random, or does it? There are exactly
n0 = n1 = n/2 = 500 ones and zeros. While for a fair coin, we expect about n/2 ones,
would or should you believe anyone telling you that this is a sequence of fair coin flips?
The probability of observing exactly n/2 ones in a sequence of n fair coin flips is around
√(2/(πn)) ≈ 2.5%, so a test for N1 ?= n/2 would confidently reject the hypothesis that the
x1:n above arose from a fair coin.
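The 2.5% figure is easy to reproduce; a quick check of the exact central binomial probability against the √(2/(πn)) approximation:

```python
from math import comb, pi, sqrt

n = 1000
p_exact = comb(n, n // 2) / 2 ** n   # exact P[N1 = n/2] for a fair coin
p_approx = sqrt(2 / (pi * n))        # Stirling approximation sqrt(2/(pi*n))

# Both are about 0.025, i.e. roughly 2.5%.
assert abs(p_exact - p_approx) < 1e-4
assert 0.024 < p_exact < 0.026
```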
What about n = 100000000 and n1 = 3140159? Obviously this is not from a fair coin,
but our aim is to test for iid, not fairness. Could such a sequence have been the result
of a biased coin? Since we assume the bits to be perfectly shuffled, n1 is a sufficient
statistic, so any test plausibly should only depend on n1, and not the sequence x1:n
itself. The probability that N1 = n1 is at most 0.00086 for a coin of any bias, i.e. the test
N1 ?= n1 would reject Hiid. Of course, tests have to be designed before observing the data, and
a-priori n1 = 3140159 is unlikely, so such a test is unlikely to have any power. We can of
course combine tests and apply a union bound, but not too many, otherwise the tests
become too weak. n/2 seems special, so maybe we should put such a test in the mix, but
what about 3140159? It's the first 6 digits of π. Maybe this is too much numerology, but
what about testing for prime n1? The density of primes p is around 1/ln p, so a-priori we
should expect an n1 around 3·10^6 to be composite (with confidence 1 − 1/ln(n1) ≈ 93%).
Maybe this is just not enough to reject Hiid, but we could always up the numbers.
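The union-bound combination mentioned above is ordinary Bonferroni correction. A minimal sketch (function name mine) showing both how tests combine and why too many of them sap power:

```python
def union_bound_test(p_values, alpha=0.05):
    """Combine m valid tests via the union bound (Bonferroni):
    reject H_iid iff any single test is significant at level alpha/m.
    The overall false-rejection probability is then at most
    m * (alpha/m) = alpha, but each test's threshold shrinks with m."""
    m = len(p_values)
    return any(p < alpha / m for p in p_values)

assert union_bound_test([0.20, 0.004]) is True    # 0.004 < 0.05/2 = 0.025
assert union_bound_test([0.03, 0.04]) is False    # neither is below 0.025
```

The second example illustrates the power loss: a p-value of 0.03 would reject on its own at level 5%, but not once the level is split between two tests.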
Imagine n so large that the binary representation of n1 contains some encrypted
message or a long segment of Chaitin's number of wisdom. In general, finding every
pattern in n1 is an AI-complete problem. There are universal tests which in principle
could test for all such eventualities, and in some situations practical approximations
thereof can be very powerful. We discuss them briefly in Section 7, but we were not
able to make them work as well as the specific tests developed in this paper, so will not
consider universal tests any further (nor will we delve into numerology any further).
Black Jack. A standard deck of cards without Jokers consists of 52 cards of 13 ranks,
each in four suits, two red and two black. If we shuffle together infinitely many such
decks and then draw n cards, this is equivalent to drawing cards uniformly iid from the
|X| = 52 different card faces. If we have only one deck and draw all 52 cards from it,
we obviously observe every card exactly once. Such an outcome would be extremely
unlikely had we drawn 52 cards from an infinite set of decks (see (5)).
Assume now an unknown number of decks have been shuffled together. Assume n
cards have been drawn so far from this pile and we remembered their faces (a card-counting
strategy). An interesting question is to infer the number of decks the cards
have been drawn from. Or consider the weaker question: Are x1:n consistent with Hiid?
If yes, we cannot infer the next card face better than by chance (1/52), so should not
waste our time trying to do so, and wait with raising the stakes until we have seen
more cards. The answer to this question is relevant even if we know the number of decks
c. For Black Jack, 1–8 decks are used, in casinos often 6. If n is small, we will not be
able to reject Hiid, but if we have seen all cards, then each card will appear exactly c
times, again ruling out Hiid. If we are close to the end of the pile, most faces will have
appeared c times, none more, and only a few significantly less. Even mid-way through
the pile, each face has appeared at most c times, which is evidence against Hiid for small
c. For instance, the chance of seeing no face twice when drawing 26 cards iid from 52
faces is less than 0.2% (cf. the birthday paradox). That is, at the latest half-way through a
single deck, this fact is revealed. If cards are drawn from 2 decks, more than 52 cards
are needed to reveal that they are not iid (Figure 3 bottom right).
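The birthday-paradox figure can be verified directly, since the probability that 26 iid uniform draws from 52 faces are all distinct is the product of the successive no-collision probabilities:

```python
# Chance that 26 iid uniform draws from 52 card faces are all distinct:
# prod_{i=0}^{25} (52 - i) / 52, the birthday-paradox computation
# behind the "less than 0.2%" figure.
p_all_distinct = 1.0
for i in range(26):
    p_all_distinct *= (52 - i) / 52

assert 0 < p_all_distinct < 0.002   # below 0.2%, as claimed
```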
Data duplication. In modern Machine Learning, esp. Deep Learning, data x1:n is
abundant (large n) and observation spaces are huge (large X). For instance, ImageNet
consists of over 14 million images, usually resized or cropped to e.g. 224×224 pixels of
256×256×256 colors, i.e. |X| = 256^{3×224×224}. We can as well assume that X is infinite.
Assume x1:n contains no duplicate images and is well shuffled. As we will show later,
no valid test can reject Hiid in this case. But if x1:n contains duplicates one may be
able to reject Hiid, similarly to the Black Jack example above, even without knowing
anything about the observation space X. For instance, assume every observation is
duplicated, i.e. every x that appears in x1:n appears exactly twice. For uncountable X,
if x is sampled from a probability density, the probability of sampling the same x twice
is 0. So duplications can only happen if Pθ contains point masses, i.e. is not purely
continuous, i.e. Pθ(x) > 0 for some x. For finite or countable X, this is necessarily true.
But if Pθ(x) > 0, then the frequency of seeing x is binomially distributed. While seeing
some x twice is plausible, seeing all x exactly twice is very unlikely, so we can reject Hiid:
If data is iid and some items are duplicated, we should also see triples and quadruples,
etc.
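Under the Poisson approximation the paper employs for its tests, this "doubles imply triples" intuition can be quantified. If an item's count is Poisson(λ), then the expected number of triples is a fixed positive fraction λ/3 of the expected number of doubles; the parameters below are illustrative choices of mine:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P[count = k] for a Poisson(lam) random variable."""
    return exp(-lam) * lam ** k / factorial(k)

# Illustrative numbers: d distinct items, each count ~ Poisson(lam).
lam, d = 0.1, 10_000
expected_m2 = d * poisson_pmf(2, lam)   # expected number of exact doubles
expected_m3 = d * poisson_pmf(3, lam)   # expected number of exact triples

# Under iid, E[m3]/E[m2] = lam/3 > 0, so observing many doubles but
# zero triples is evidence against H_iid.
assert abs(expected_m3 / expected_m2 - lam / 3) < 1e-12
```

With these numbers one expects about 45 doubles but only about 1.5 triples, so "all duplicates appear exactly twice" only becomes suspicious once enough doubles have accumulated.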
Relevance for machine learning. The predominant training and evaluation protocol
in Machine Learning in general and Deep Learning in particular is still to assume
the data is iid, train on most of the data, and evaluate on the rest. Interestingly, this is
true even for models dealing with definitely non-iid text. For instance, for Transformers,
text is crudely chopped into chunks of equal length and then shuffled. The empirical
test loss is an unbiased estimator of the true loss, so is a proper way of comparing the
performance of different models. Data sizes in modern machine learning are huge, so
that even 10% held-out data is so much that test noise is often of little concern. That's
at least the general story.
But in Deep Learning, data these days is often scraped from the whole internet, and
duplications abound. For instance, assume the whole data set contains 3 copies of each
data item. If we randomly split off 10% as the test set, then the train set contains nearly
all (99%) of the test set items. With heldout-validation, a pure memorizer without any
generalization capacity will perform nearly perfectly on the test set [BLH22], but will
fail in practice on future data. Indeed, shuffling the data makes this problem worst
[SEBF21, GB19]. The problem is known as test set contamination, and is well known.
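The 99% figure follows from 1 − 0.1²: a test item is "clean" only if both of its other two copies also landed in the 10% test split. A quick simulation (the dataset size is an illustrative choice of mine):

```python
import random

random.seed(0)
n_unique = 30_000
data = [x for x in range(n_unique) for _ in range(3)]  # 3 copies of every item
random.shuffle(data)

cut = len(data) // 10                  # split off 10% as the test set
test, train = data[:cut], data[cut:]
train_set = set(train)

# Fraction of test items that also occur in the train set: ~ 1 - 0.1^2 = 99%
contaminated = sum(x in train_set for x in test) / len(test)
assert contaminated > 0.95
```

A memorizer scored on this test set is effectively being graded on its training data.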
The standard solution is to decontaminate or clean the data, e.g. removing dupli-
cates, but this does not suffice. One has to remove approximate duplicates too. But
at what threshold for example should a document that cites a training-set document
verbatim be removed from the test-set? When are two images scraped from the inter-
net rescaled or cropped or jpeg compressed versions of the same image, and even if so,
should they be regarded duplicates? While approaches exist that meliorate the prob-
lem, in theory this problem is ill-defined [Hut06, FAQ], and in practice a huge, actually
AI-complete, problem [BMR+20, App.C].
Test set evaluation is empirically sound for iid data; the failure of this paradigm
must therefore be attributed to the non-iid nature of the data. The strength and
weakness of the iid tests developed in this paper are that they are completely model-
and data-agnostic. This makes them universally applicable and valid, but also very
weak. On the other hand, as discussed above, finding good/perfect model- and data-
type-sensitive tests is itself a difficult/impossible research question beyond the scope of
this article.
3 Problem Formalization and Preliminaries
We now introduce notation and concepts used throughout the paper: general notation,
the multinomial and binomial distributions, first-order and second-order counts, exchangeable
distributions, and statistical tests. The reader familiar with these concepts
can skim this section to just pick up the notational conventions used in this article.
Notation. We use calligraphic upper-case letters such as X for sets and |X| or #X for
the size of X. Probability spaces are denoted by (X^n, Σ, P) with d = |X| ∈ N ∪ {∞}.
Pθ denotes iid distributions. Q denotes exchangeable distributions. Capital letters
X, N, M, T, E, O, D, C, U, V, Z, Y denote random variables, and corresponding lower-case
letters samples corresponding to them. We will use the shorthand P(x) := P[X = x] and
similarly for other random variables. The variance of Z is V[Z] := E[Z²] − E[Z]² and
the covariance of Y and Z is Cov[Y,Z] := E[YZ] − E[Y]E[Z]. In addition to the classical
O() notation, O_P(f(n)) denotes all random functions F(n) for which ∀δ > 0 ∃c > 0 ∀n:
P[|F(n)| ≥ c·|f(n)|] ≤ δ. We use f(n) ≲ g(n) to denote f(n) ≤ g(n)·[1 ± O_P(1/n)], and
similarly ≳ and ≂. We use ≲ even when stronger asymptotic or non-stochastic bounds
would be possible, since we mostly care about the leading-order approximation in n and