
II. METHOD
Introduced in Ref. [5], CWoLa is a weakly-supervised technique for anomaly detection which aims to learn a monotonic function of the likelihood ratio between Signal $S$ and Background $B$ processes for a set of features of interest $\vec{x}$, $L_{S/B}(\vec{x}) = p(\vec{x}|S)/p(\vec{x}|B)$, with the help of an additional feature $y$ uncorrelated with $\vec{x}$. The latter variable, often but not necessarily the invariant mass of the event, can be used to define two regions of interest: the signal region $M_1$ and the control (or side-band) region $M_2$, where the signal-to-background ratio is assumed to be higher in $M_1$ than in $M_2$. Being a weakly-supervised algorithm, CWoLa trains a classifier to distinguish between $M_1$ and $M_2$. The obtained output function $s(\vec{x})$ can then be mapped to $L_{M_1/M_2}(\vec{x})$ through the likelihood ratio trick. The orthogonality of $y$ and $\vec{x}$ guarantees that $L_{M_1/M_2}(\vec{x})$ is a monotonic function of $L_{S/B}(\vec{x})$ and thus in principle possesses optimal statistical power.
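As a minimal sketch of this pipeline in Python (using scikit-learn; the file names and network size are placeholders, and balanced training samples are assumed so that the likelihood ratio trick holds exactly):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical arrays holding the features x for events falling in each
# region of y (shapes: [n_events, n_features]).
x_m1 = np.load("features_m1.npy")  # signal region M1
x_m2 = np.load("features_m2.npy")  # side-band region M2

# CWoLa step: train a classifier to separate the two mixed samples.
X = np.concatenate([x_m1, x_m2])
region = np.concatenate([np.ones(len(x_m1)), np.zeros(len(x_m2))])
clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, region)

# Likelihood ratio trick: for a calibrated classifier trained with balanced
# classes, s / (1 - s) equals L_{M1/M2}(x), which is monotonic in L_{S/B}(x)
# provided y and x are independent given the process label.
s = clf.predict_proba(X)[:, 1]
lr_m1_m2 = s / (1.0 - s)
```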
Usual applications of CWoLa use the learned optimal classifier $s(\vec{x})$ to select events of interest and assign a certain significance to the difference in selected events between $M_1$ and $M_2$. The difference in the resulting selection efficiencies in $M_1$ and $M_2$ is a smoking gun for the presence of signal in $M_1$ (and also $M_2$). However, this is only true in the limit of infinite statistics. In a realistic setting where the dataset is finite, quantifying the degree to which the difference in efficiencies relates to the presence of signal is non-trivial. One common strategy is to assume that there is no signal in $M_2$ and assess the agreement between the selected events in $M_1$ and a background extrapolation from $M_2$.
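As a toy illustration of this common strategy (hypothetical counts; the uncertainty of the extrapolation itself, which is precisely what makes the finite-statistics case non-trivial, is neglected here):

```python
from scipy.stats import poisson

# Hypothetical event counts before and after a cut on s(x) in each region.
n_m1_total, n_m1_pass = 10_000, 230
n_m2_total, n_m2_pass = 50_000, 1_000

# Assume no signal in M2, so its selection efficiency is the background one,
# and extrapolate the expected background yield passing the cut in M1.
eff_bkg = n_m2_pass / n_m2_total
bkg_m1 = eff_bkg * n_m1_total

# One-sided Poisson p-value for the observed count in M1 under the
# background-only extrapolation.
p_value = poisson.sf(n_m1_pass - 1, bkg_m1)  # P(X >= n_m1_pass)
print(f"expected {bkg_m1:.1f}, observed {n_m1_pass}, p = {p_value:.2g}")
```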
Our method constitutes an alternative to assess how the learned output $s(\vec{x})$ encodes differences between $M_1$ and $M_2$ caused by the presence of a signal. To introduce it, we focus on the density-estimation framing of CWoLa, which clearly defines a background-only or null hypothesis. At its heart, CWoLa is a mixture model where $\vec{x}$ and $y$ are assumed to be conditionally independent given the process label $z \in \{S, B\}$. After defining $M_1$ and $M_2$ using $y$, the trained classifier output is a function $s(\vec{x})$ that inherits the conditional independence with respect to $y$. The statistical model can be explicitly written as
$$ p(s(\vec{x}), y \,|\, \pi) = (1-\pi)\, p(s(\vec{x})|B)\, p(y|B) + \pi\, p(s(\vec{x})|S)\, p(y|S)\,, \qquad (1) $$
where $\pi$ is the signal probability. The background-only hypothesis is explicitly written as $p(s(\vec{x}), y \,|\, \pi = 0)$ and corresponds to the case where the observed data shows independence between $s(\vec{x})$ and $y$. This is the key observation for our strategy. For a given measured dataset of pairs $\{s(\vec{x}_i), y_i\}$, one can assess whether they are statistically independent. If statistical independence is ruled out, the background-only hypothesis is ruled out, provided conditional independence holds. Conversely, if statistical independence cannot be ruled out, one has a clear statement about the inability of CWoLa to discern whether any difference between $M_1$ and $M_2$ originates from the presence of a signal or is due to statistical fluctuations in the data.
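To make the procedure concrete, a minimal sketch of such an independence check on the pairs $\{s(\vec{x}_i), y_i\}$, here using a standard $\chi^2$ test on a binned contingency table (toy data; the MI-based test adopted in this work is introduced below):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical measured pairs: classifier outputs s_i and region labels y_i
# (1 for M1, 0 for M2); replaced by toy data for illustration.
rng = np.random.default_rng(0)
s = rng.uniform(size=5_000)
y = rng.integers(0, 2, size=5_000)

# Contingency table of counts per (s-bin, region).
s_bin = np.digitize(s, np.linspace(0.0, 1.0, 11))
table = np.array([[np.sum((s_bin == b) & (y == r)) for r in (0, 1)]
                  for b in np.unique(s_bin)])

# Chi-squared test of independence: a small p-value rules out the
# background-only hypothesis, provided conditional independence holds.
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2/dof = {chi2:.1f}/{dof}, p = {p_value:.2g}")
```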
Several tests of statistical independence exist for both discrete and continuous distributions, including mutual information [10], Hoeffding's D independence test [11], and distance correlation [12]. For simplicity, in the present work we focus on the use of the estimated mutual information (MI) $I$ of the measured probability distribution. MI encodes the exact property we want to test, as it measures the difference between the joint distribution and the product of its marginals:
$$
\begin{aligned}
I(s, y) &= D_{\mathrm{KL}}\big(p(s, y)\,\|\,p(s)\,p(y)\big) & (2) \\
        &= \int ds\, dy\; p(s, y)\, \log \frac{p(s, y)}{p(s)\,p(y)}\,, & (3)
\end{aligned}
$$
where $D_{\mathrm{KL}}(p\,\|\,q)$ is the Kullback-Leibler divergence between two probability distributions, capturing how much information is lost when approximating the distribution $p$ with the distribution $q$. The MI thus captures how well one can approximate the joint distribution by the product of its marginals, and it is trivial to show that it vanishes for independent variables. Conditional independence can then be expressed as a vanishing MI conditioned on a given process,
$$ I(s, y \,|\, z) = \int ds\, dy\; p(s, y|z)\, \log \frac{p(s, y|z)}{p(s|z)\, p(y|z)} = 0\,. \qquad (4) $$
On the other hand, for the full dataset the possible mixture of the two processes, encoded in $\pi \in [0, 1]$, results in
$$ I(s, y) \geq 0\,, \qquad (5) $$
with the equality achieved when there is only one process or when the two processes have the same probability distributions.
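A minimal plug-in estimate of the MI in Eq. (3) from a binned finite sample might look as follows (a sketch with toy data; the asymptotic result of Ref. [13] used below is replaced here by a generic permutation p-value):

```python
import numpy as np

def mutual_information(s, y, n_bins=10):
    """Plug-in MI estimate (in nats) from a binned histogram of the pairs (s, y)."""
    joint, _, _ = np.histogram2d(s, y, bins=[n_bins, 2])
    p_joint = joint / joint.sum()
    p_s = p_joint.sum(axis=1, keepdims=True)  # marginal p(s)
    p_y = p_joint.sum(axis=0, keepdims=True)  # marginal p(y)
    nz = p_joint > 0                          # skip empty bins (0 log 0 = 0)
    return np.sum(p_joint[nz] * np.log(p_joint[nz] / (p_s @ p_y)[nz]))

# Toy stand-ins for the measured pairs {s(x_i), y_i}.
rng = np.random.default_rng(0)
s = rng.uniform(size=5_000)
y = rng.integers(0, 2, size=5_000)

# Permutation p-value: shuffling y samples the estimator under exact
# independence, a generic alternative to the asymptotic distribution.
i_obs = mutual_information(s, y)
i_null = np.array([mutual_information(s, rng.permutation(y))
                   for _ in range(1_000)])
p_value = np.mean(i_null >= i_obs)
print(f"I = {i_obs:.4f} nats, p = {p_value:.2g}")
```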
A very nice feature of the MI is that it has well-behaved asymptotic properties in the limit of small MI and large sample size [13]. Thus, we can estimate it from the measured sample of $N$ events and obtain the p-value of said