
between several competing hypotheses which might generate a peak in the dimuon spectrum would
require recording new data with a dedicated trigger, which is both time-consuming and expensive.
We propose an approach that bridges the full and partial event paradigms automatically with machine
learning. This is accomplished by training a neural network to learn a lossy event compression with
a tunable resolution parameter. An extreme version of this approach would be to save every event
at the highest resolution allowable by hardware (see e.g. Ref. [7] for autoencoders in hardware).
We present a more modest version in which we envision full event compression which could run
alongside partial event triggers to expand their utility for a larger range of offline analyses. Our
approach uses an optimal transport-based Variational Autoencoder (VAE) following Ref. [8].
In a proof-of-concept study, we compress and record a sample of simulated interactions which are
similar to those analyzed in Ref. [6], preserving information which would otherwise be lost. We show
that this additional information can be used to effectively discriminate between two signal models
which are difficult to distinguish with only the muon kinematics. The proposal is structured as
follows: first, a signal is discovered in a trigger-level analysis such as this dimuon resonance search.
Subsequently, a compressed version of the hadronic event data, which has been stored alongside the
muons, can be used to rule out or favor candidate signal models.
2 Related Work
An alternative to compressing individual events is compressing the entire dataset online [9], which
is methodologically and practically more challenging. An alternative to saving events for offline
analysis is to look for new particles automatically with online anomaly detection [10–13]. While we
build our VAE on the setup from Ref. [8] using the Sinkhorn approximation [14,15] to the Earth
Mover's Distance, other possibilities have been explored, such as using graph neural networks [16].
We leave a comparison of the power of different approaches to future work.
3 β-parameterized Variational Autoencoder
We represent each collider event $x$ as a point cloud of 3-vectors $\{p_T/H_T, \eta, \phi\}$, where $\eta$ and $\phi$ are the geometric coordinates of particles in the detector, and $p_T$ their transverse momenta, which correspond to the weights in the point cloud. These are normalized for each event using $H_T = \sum_i p_{T,i}$.
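As a concrete illustration, one event can be packed into a fixed-size array as in the following minimal numpy sketch; the array shape, function name, and zero-padding scheme are our illustrative choices, not necessarily those of the actual pipeline:

```python
import numpy as np

def to_point_cloud(pt, eta, phi, max_particles=128):
    """Pack one event into a (max_particles, 3) array of (pT/HT, eta, phi).

    pt, eta, phi are 1D arrays of per-particle kinematics; rows beyond the
    particle multiplicity are zero-padded so events can be batched.
    """
    pt, eta, phi = map(np.asarray, (pt, eta, phi))
    weights = pt / np.sum(pt)            # pT / HT, so the weights sum to 1
    cloud = np.zeros((max_particles, 3))
    n = min(len(pt), max_particles)
    cloud[:n, 0] = weights[:n]
    cloud[:n, 1] = eta[:n]
    cloud[:n, 2] = phi[:n]
    return cloud
```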
We build an EMD-VAE [8,17,18] trained to minimize a reconstruction error given by an approximation to the 2-Wasserstein distance between collider events $x$ and reconstructed examples $x'$, with loss function
\begin{equation}
\mathcal{L} = \Big\langle S(x, x'(z))/\beta + D_{\mathrm{KL}}\big(q(z|x)\,\big\|\,p(z)\big) \Big\rangle_{p(x)} . \tag{1}
\end{equation}
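A minimal sketch of this objective, assuming a batched `sinkhorn_distance` callable and an encoder returning the mean and log-variance of a diagonal Gaussian $q(z|x)$ (the names and signatures here are hypothetical):

```python
import torch

def vae_loss(x, x_prime, mu, logvar, beta, sinkhorn_distance):
    """Eq. (1): average of S(x, x'(z))/beta + KL(q(z|x) || N(0, I)).

    mu, logvar parameterize the diagonal Gaussian q(z|x);
    sinkhorn_distance is a batched approximation to the
    2-Wasserstein distance between input and reconstructed events.
    """
    recon = sinkhorn_distance(x, x_prime) / beta          # S(x, x'(z)) / beta
    # Closed-form KL of a diagonal Gaussian against a standard normal prior
    kl = 0.5 * torch.sum(mu ** 2 + logvar.exp() - 1.0 - logvar, dim=-1)
    return (recon + kl).mean()                            # expectation over p(x)
```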
An encoder network maps the input $x$ to a Gaussian-parameterized distribution $q(z|x)$ on 256-dimensional latent coordinates $z$. This network is built as a Deep Sets/Particle Flow Network (PFN) [19, 20].
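A minimal PyTorch sketch of such an encoder follows; the layer widths and activations are illustrative assumptions rather than the configuration of this work, which is described in the Appendix:

```python
import torch
import torch.nn as nn

class PFNEncoder(nn.Module):
    """Deep Sets/PFN-style encoder: a per-particle network, a pT-weighted
    sum over particles, then an event-level network giving (mu, logvar)."""

    def __init__(self, latent_dim=256, hidden=128):
        super().__init__()
        self.per_particle = nn.Sequential(         # acts on (eta, phi)
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.event_level = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),     # mu and logvar of q(z|x)
        )

    def forward(self, cloud):
        # cloud: (batch, particles, 3) with columns (pT/HT, eta, phi)
        weights, coords = cloud[..., :1], cloud[..., 1:]
        feats = self.per_particle(coords)          # per-particle features
        pooled = (weights * feats).sum(dim=1)      # permutation-invariant sum
        mu, logvar = self.event_level(pooled).chunk(2, dim=-1)
        return mu, logvar
```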
A decoder $x'(z)$ maps latent codes $z$ to jets $x'$, parameterizing a posterior probability
\begin{equation*}
\log p(x|z) \propto S(x, x'(z))/\beta ,
\end{equation*}
where $S(x, x'(z))$ is a sharp Sinkhorn [15,21–23] approximation to the 2-Wasserstein distance between event $x$ and its decoded $x'$, with ground distance given by $M_{ij} = \Delta R^2_{ij} \equiv (\eta_i - \eta_j)^2 + (\phi_i - \phi_j)^2$, calculated using the same algorithm and parameters as in Ref. [8]. This decoder network is built as a dense neural network.
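For intuition, the ground distance matrix and a plain Sinkhorn iteration look like the following sketch; the sharp variant of Refs. [15,21–23] and the parameters of Ref. [8] differ in detail, and `eps` and `n_iters` here are illustrative stand-ins:

```python
import torch

def ground_distance(eta_i, phi_i, eta_j, phi_j):
    """M_ij = Delta R^2_ij = (eta_i - eta_j)^2 + (phi_i - phi_j)^2."""
    deta = eta_i[:, None] - eta_j[None, :]
    dphi = phi_i[:, None] - phi_j[None, :]
    return deta ** 2 + dphi ** 2

def sinkhorn(a, b, M, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport between weight vectors a and b.

    Plain Sinkhorn scaling for clarity; production code typically iterates
    in the log domain for numerical stability at small eps.
    """
    K = torch.exp(-M / eps)
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]     # approximate transport plan
    return (plan * M).sum()                # transport cost <plan, M>
```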
$D_{\mathrm{KL}}(q(z|x)\,\|\,p(z))$ is the KL divergence between the encoder probability $q(z|x)$ and the prior $p(z)$, which we take to be a standard Gaussian. This KL divergence can be expressed as a sum of contributions from each of the 256 latent space directions. The details of the architecture are described in the Appendix.
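Because $q(z|x)$ is a diagonal Gaussian and the prior is standard, the per-direction contributions have a closed form (a short sketch):

```python
import torch

def kl_per_direction(mu, logvar):
    """Per-direction KL of N(mu, sigma^2) against N(0, 1); summing the 256
    entries reproduces D_KL(q(z|x) || p(z))."""
    return 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar)
```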
The quantity $\beta$ is typically taken to be a fixed hyperparameter of the network [24] which controls the balance between reconstruction fidelity and the degree of compression in the latent space. In this work, we elevate $\beta$ from a fixed hyperparameter to an input [25] of both the encoder and decoder networks$^{1,2}$.
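One way to realize this conditioning (an illustrative sketch; the sampling range and the exact way $\beta$ is injected into the networks are our assumptions) is to draw $\beta$ per event and feed $\log\beta$ to both networks:

```python
import math
import torch

def sample_beta(batch_size, beta_min=1e-4, beta_max=1e-1):
    """Draw beta log-uniformly per event (the range is a hypothetical
    choice), so one network learns the full fidelity/compression trade-off."""
    u = torch.rand(batch_size, 1)
    log_beta = math.log(beta_min) + u * (math.log(beta_max) - math.log(beta_min))
    return log_beta.exp()

# Conditioning sketch: concatenate log(beta) to the encoder's pooled features
# and to the decoder's latent input, so both networks see the working
# resolution, e.g.
#   pooled = torch.cat([pooled, beta.log()], dim=-1)
#   z_in   = torch.cat([z, beta.log()], dim=-1)
```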
$^1$ The authors are grateful to Jesse Thaler for this suggestion.
$^2$ Note added post-publication: A similar idea was pursued in [26], which was submitted for publication concurrently with this work. The implementation in their study differs from ours by using a hypernetwork to