
B. Other Related Works
We remark that the two-phase test for hypothesis testing is closely related to the two-phase code for channel coding with
feedback [18], [19] by Yamamoto and Itoh [10]. Specifically, the two-phase code in [10] consists of a message phase and a
confirmation phase. In the message phase, a message is transmitted using a fixed number $n$ of channel uses. After receiving the
channel outputs, the receiver checks whether the message was transmitted correctly and sends either an ACK or a NACK
signal to inform the transmitter. Once a NACK signal is received, the transmitter retransmits the message. Note that the
ACK and NACK signals correspond to the non-reject and reject decisions of the first phase for hypothesis testing. However,
we would like to note that the second phases of the two problems differ. In channel coding, the second phase is a retransmission
of the same message. In contrast, in hypothesis testing, the second phase collects another $(k-1)n$ data samples and makes
a final decision using all $kn$ data samples from both phases. Furthermore, since the objectives of the two problems are different,
the analyses are rather different, and thus the theoretical analyses for channel coding with feedback [19]–[21] do not directly
imply the results for hypothesis testing, regardless of whether the generating distribution is known or not.
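To make the contrast concrete, the following Python sketch illustrates the control flow of the two-phase test described above. The helpers `draw_samples`, `first_phase_decision`, and `final_decision` are hypothetical placeholders for a concrete test (e.g., a Gutman-type test); only the phase structure is taken from the discussion.

```python
import numpy as np

def two_phase_test(draw_samples, first_phase_decision, final_decision, n, k):
    """Sketch of the two-phase test structure (hypothetical helper API).

    Phase 1 uses n samples and either keeps a tentative decision
    (analogous to an ACK in Yamamoto-Itoh coding) or rejects it
    (analogous to a NACK). On a reject, phase 2 collects another
    (k - 1) * n samples and decides using all k * n samples.
    """
    phase1 = draw_samples(n)                        # first-phase samples
    decision, reject = first_phase_decision(phase1)
    if not reject:                                  # "ACK": stop after n samples
        return decision, n
    phase2 = draw_samples((k - 1) * n)              # second-phase samples
    full = np.concatenate([phase1, phase2])         # all k * n samples
    return final_decision(full), k * n
```

Unlike the channel-coding second phase, the second call above does not repeat the first phase: it enlarges the sample and invokes a different decision rule on the pooled data.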
We next provide a non-exhaustive list of related works on statistical classification. Zhou, Tan and Motani [17] analyzed the
finite sample performance of Gutman's test. Hsu and Wang [22] generalized Gutman's results to the mismatched setting where
the generating distributions of the training sequences and the test sequences deviate slightly. The Bayesian error probability of
binary classification was studied by Merhav and Ziv [23] in the asymptotic case and more recently by Saito [24], [25] in the finite
blocklength regime using Bayes codes. Gutman's results have also been generalized to multiple test sequences [26], distributed detection [27],
quickest change-point detection [28], outlier hypothesis testing [29], [30], and universal sequential classification [16].
Notation
We use $\mathbb{R}$, $\mathbb{R}_+$, and $\mathbb{N}$ to denote the sets of real numbers, non-negative real numbers, and natural numbers, respectively. Given any
integer $a \in \mathbb{N}$, we use $[a]$ to denote the set $\{1, 2, \ldots, a\}$. Random variables and their realizations are denoted by upper case variables
(e.g., $X$) and lower case variables (e.g., $x$), respectively. All sets are denoted in calligraphic font (e.g., $\mathcal{X}$). For any $N \in \mathbb{N}$,
let $X^N := (X_1, \ldots, X_N)$ be a random vector of length $N$ and let $x^N = (x_1, \ldots, x_N)$ be a particular realization of $X^N$. The
set of all probability distributions on a finite set $\mathcal{X}$ is denoted as $\mathcal{P}(\mathcal{X})$. We use $\mathbb{E}[\cdot]$ to denote expectation. Given a sequence
$x^n \in \mathcal{X}^n$, the type or empirical distribution is defined as $\hat{T}_{x^n}(a) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{x_i = a\}$ for all $a \in \mathcal{X}$. The set of types formed from
length-$n$ sequences with alphabet $\mathcal{X}$ is denoted as $\mathcal{P}_n(\mathcal{X})$. Given any $P \in \mathcal{P}_n(\mathcal{X})$, the set of all sequences of length $n$ with
type $P$, the type class, is denoted as $\mathcal{T}_P^n$.
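As a quick illustration of the type notation, the following Python snippet computes the empirical distribution $\hat{T}_{x^n}$ of a sequence over a finite alphabet; the function name is ours.

```python
from collections import Counter

def empirical_distribution(x, alphabet):
    """Type of x: T_hat(a) = (1/n) * |{i : x_i = a}| for each a in the alphabet."""
    n = len(x)
    counts = Counter(x)  # Counter returns 0 for symbols absent from x
    return {a: counts[a] / n for a in alphabet}

# The sequence (0, 1, 1, 2) over alphabet {0, 1, 2} has type (0.25, 0.5, 0.25).
print(empirical_distribution([0, 1, 1, 2], alphabet=[0, 1, 2]))
```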
II. PROBLEM FORMULATION, FIXED-LENGTH TEST AND SEQUENTIAL TEST
A. Problem Formulation
In $M$-ary classification, we are given $M$ training sequences $\mathbf{X}^N := (X_1^N, \ldots, X_M^N) \in (\mathcal{X}^N)^M$ of length $N \in \mathbb{N}$ generated
i.i.d. according to distinct unknown distributions $\mathbf{P} = (P_1, \ldots, P_M) \in \mathcal{P}(\mathcal{X})^M$ and a test sequence $Y^\tau$ generated i.i.d.
according to one of the $M$ distributions, where $\tau$ is a random stopping time with respect to the filtration $\sigma\{\mathbf{X}^N, Y_1, \ldots, Y_n\}$.
For simplicity, we assume that $N = \lceil n\alpha \rceil$.² The task is to design a test $\Phi = (\tau, \phi_\tau)$, which consists of a random stopping
time $\tau$ and a mapping $\phi_\tau: \mathcal{X}^{MN} \times \mathcal{X}^\tau \to \{\mathrm{H}_1, \mathrm{H}_2, \ldots, \mathrm{H}_M\}$, to decide among the following $M$ hypotheses:
• $\mathrm{H}_j$: the test sequence $Y^\tau$ is generated i.i.d. from the same distribution as the $j$th training sequence $X_j^N$, $j \in [M]$.
At the random stopping time $\tau$, any mapping $\phi_\tau$ partitions the sample space into $M$ disjoint regions $\{\mathcal{A}_j(\phi_\tau)\}_{j \in [M]}$, where
$Y^\tau \in \mathcal{A}_j(\phi_\tau)$ favors hypothesis $\mathrm{H}_j$.
Given any test $\Phi$ and any tuple of distributions $\mathbf{P} \in \mathcal{P}(\mathcal{X})^M$, we have the following $M$ error probabilities to evaluate the
performance of the test:
$$\beta_j(\Phi|\mathbf{P}) := \mathbb{P}_j\left\{\phi_\tau(\mathbf{X}^N, Y^\tau) \neq \mathrm{H}_j\right\}, \quad j \in [M], \qquad (1)$$
where for each $j \in [M]$, we define $\mathbb{P}_j\{\cdot\} := \Pr\{\cdot\,|\,\mathrm{H}_j\}$, under which $X_i^N \sim P_i$ for all $i \in [M]$ and $Y^\tau \sim P_j$.
It is challenging to characterize the non-asymptotic performance of the error probabilities with a finite sample size. As a compromise, to be consistent
²In subsequent analysis, without loss of generality, we ignore the integer requirement and drop the ceiling operation for simplicity.