able to make them work as well as the specific tests developed in this paper, so we will not
consider universal tests any further (nor will we delve into numerology any further).
Black Jack. A standard deck of cards without Jokers consists of 52 cards of 13 ranks,
each in four suits, two red and two black. If we shuffle together infinitely many such
decks and then draw n cards, this is equivalent to drawing cards uniformly iid from the
|X| = 52 different card faces. If we have only one deck and draw all 52 cards from it,
we obviously observe every card exactly once. Such an outcome would be extremely
unlikely had we drawn 52 cards from an infinite set of decks (see (5)).
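To illustrate just how unlikely, the probability referenced via (5) can be evaluated numerically; the following short sketch (illustrative Python, not code from the paper) computes it directly:

```python
from math import factorial

# Probability that 52 iid uniform draws from the 52 card faces are all distinct:
# 52!/52^52, i.e. the number of orderings of 52 distinct faces divided by the
# number of equally likely draw sequences.
p_all_distinct = factorial(52) / 52**52
print(f"P(every face exactly once | Hiid) = {p_all_distinct:.1e}")  # roughly 5e-22
```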
Assume now an unknown number of decks have been shuffled together. Assume n
cards have been drawn so far from this pile and that we have remembered their faces (a card
counting strategy). An interesting question is to infer the number of decks the cards
have been drawn from. Or consider the weaker question: Are x1:n consistent with Hiid?
If yes, we cannot predict the next card face better than chance (1/52), so we should not
waste our time trying to do so, and should wait to raise the stakes until we have seen
more cards. The answer to this question is relevant even if we know the number of decks
c. For Black Jack, 1-8 decks are used, in casinos often 6. If n is small, we will not be
able to reject Hiid, but if we have seen all cards, then each card will appear exactly c
times, again ruling out Hiid. If we are close to the end of the pile, most faces will have
appeared c times, none more, and only a few significantly less. Even mid-way through
the pile, each face has appeared at most c times, which is evidence against Hiid for small
c. For instance, the chance of seeing no face twice when drawing 26 cards iid from 52
faces is less than 0.2% (cf. the birthday paradox). That is, at the latest half-way through a
single deck, this fact is revealed. If cards are drawn from 2 decks, more than 52 cards
are needed to reveal that they are not iid (Figure 3 bottom right).
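The birthday-paradox figure quoted above can be checked with a few lines of Python (an illustrative sketch, not part of the paper):

```python
# Probability of seeing no face twice when drawing k cards iid uniformly from
# `faces` distinct faces (the standard birthday-problem product).
def p_no_repeat(k: int, faces: int = 52) -> float:
    p = 1.0
    for i in range(k):
        p *= (faces - i) / faces
    return p

print(f"{p_no_repeat(26):.3%}")  # about 0.05%, indeed below the 0.2% bound cited above
```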
Data duplication. In modern Machine Learning, esp. Deep Learning, data x1:n is
abundant (large n) and observation spaces are huge (large X). For instance, ImageNet
consists of over 14 million images, usually resized or cropped to e.g. 224×224 pixels of
256×256×256 colors, i.e. |X| = 256^{3×224×224}. We may as well assume that X is infinite.
Assume x1:n contains no duplicate images and is well shuffled. As we will show later,
no valid test can reject Hiid in this case. But if x1:n contains duplicates, one may be
able to reject Hiid, similarly to the Black Jack example above, even without knowing
anything about the observation space X. For instance, assume every observation is
duplicated, i.e. every x that appears in x1:n appears exactly twice. For uncountable X,
if x is sampled from a probability density, the probability of sampling the same x twice
is 0. So duplications can only happen if Pθ contains point masses, i.e. is not purely
continuous, i.e. Pθ(x) > 0 for some x. For finite or countable X, this is necessarily true.
But if Pθ(x) > 0, then the frequency of seeing x is binomially distributed. While seeing
some x twice is plausible, seeing all x exactly twice is very unlikely, so we can reject Hiid:
if the data is iid and some items are duplicated, we should also see triples, quadruples,
etc.
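As a toy illustration of this multiplicity argument (our sketch; the helper name and the synthetic data are made up for illustration), one can compare the multiplicity profile of an "everything appears exactly twice" sample with that of genuinely iid draws from a distribution with point masses:

```python
import random
from collections import Counter

def multiplicity_profile(sample):
    """Map each multiplicity m to the number of distinct items appearing exactly m times."""
    return Counter(Counter(sample).values())

# Every item duplicated exactly twice: no singletons, no triples -- suspicious under Hiid.
duplicated = list(range(500)) * 2
random.shuffle(duplicated)
print(multiplicity_profile(duplicated))  # Counter({2: 500})

# iid draws from a finite space: multiplicities spread out over 1, 2, 3, ...
iid_sample = [random.randrange(1000) for _ in range(1000)]
print(multiplicity_profile(iid_sample))  # e.g. Counter({1: ..., 2: ..., 3: ...})
```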
Relevance for machine learning. The predominant training and evaluation proto-
col in Machine Learning in general and Deep Learning in particular is still to assume
the data is iid, train on most of the data and evaluate on the rest. Interestingly, this is
true even for models dealing with definitely non-iid text. For instance, for Transformers,