
UNSUPERVISED PARTICLE SORTING FOR CRYO-EM USING PROBABILISTIC PCA
Gili Weiss-Dicker⋆ Amitay Eldar† Yoel Shkolinsky† Tamir Bendory⋆
⋆School of Electrical Engineering, Tel Aviv University
†Department of Applied Mathematics, School of Mathematical Sciences, Tel Aviv University
ABSTRACT
Single-particle cryo-electron microscopy (cryo-EM) is a lead-
ing technology to resolve the structure of molecules. Early
in the process, the user detects potential particle images in
the raw data. Typically, there are many false detections as a
result of high levels of noise and contamination. Currently,
removing the false detections requires human intervention
to sort hundreds of thousands of images. We propose a
statistically-established unsupervised algorithm to remove
non-particle images. We model the particle images as a union
of low-dimensional subspaces, assuming non-particle images
are arbitrarily scattered in the high-dimensional space. The
algorithm is based on an extension of the probabilistic PCA
framework to robustly learn a non-linear model of union of
subspaces. This provides a flexible model for cryo-EM data,
and allows automatic removal of images that correspond
to pure noise and contamination. Numerical experiments
corroborate the effectiveness of the sorting algorithm.
Index Terms—Unsupervised learning, single-particle
cryo-EM, probabilistic PCA, expectation-maximization
1 INTRODUCTION
Single-particle cryo-electron microscopy (cryo-EM) is an
emerging technology to determine the structure of molecules.
In the cryo-EM process, the acquired “raw data” image, called
a micrograph, contains a few dozen 2-D tomographic particle
projection images with unknown random orientations
and locations. The micrograph suffers from a low signal-to-noise
ratio (SNR), as low as 1/100. Typically, it also contains
undesired contamination. For the purpose of this paper, the
pixels in a micrograph can be broadly divided into three cat-
egories: regions of particles with additive noise, regions of
contamination, and regions of noise only.
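To give a feel for what an SNR of 1/100 means, the following minimal sketch adds white Gaussian noise to a toy image. It assumes that SNR is defined as the ratio of signal variance to noise variance; the Gaussian blob and the names `toy_particle` and `add_noise_at_snr` are hypothetical stand-ins for illustration, not part of the paper.

```python
import numpy as np


def toy_particle(size=64, width=8.0):
    """Hypothetical stand-in for a clean projection image: a centered Gaussian blob."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    return np.exp(-(xx**2 + yy**2) / (2.0 * width**2))


def add_noise_at_snr(clean, snr, seed=0):
    """Add white Gaussian noise with variance var(clean) / snr.

    Assumes SNR = signal variance / noise variance.
    """
    rng = np.random.default_rng(seed)
    noise_std = np.sqrt(clean.var() / snr)
    return clean + rng.normal(0.0, noise_std, size=clean.shape)
```

Calling `add_noise_at_snr(toy_particle(), snr=0.01)` produces an image in which the noise standard deviation is ten times that of the signal, so the blob is barely visible by eye.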
During the cryo-EM workflow, particle images are detected
and extracted from micrographs in a process called particle
picking [1, 2]. The extracted images are the individual
particles within each micrograph. If only particles were picked,
the images chosen by the particle picker could be used directly
to construct the 3-D molecular structure. Figure 1 illustrates
a schematic sequence of computational steps typically
used to convert the raw data into 3-D molecular structures.
While many particle picking algorithms have been developed, e.g.,
[3, 4, 5], the very low SNR causes them to pick contamination
and pure noise images along with the particle images. Typical
images chosen by a picking algorithm can be
seen in Figure 2.
A common approach to remove non-particle images, called
“2-D classification,” is semi-automatic and involves an ex-
pert practitioner; it relies heavily on subjective criteria that
are neither consistent nor reproducible among different users.
We propose an automatic and statistically-established unsu-
pervised algorithm to remove non-particle images from the
data. Specifically, we assume that all particle images approx-
imately lie on a union of subspaces, whereas the non-particle
images are scattered in the high-dimensional space. Similar
parsimonious models are ubiquitous in many signal process-
ing tasks, and specifically in different stages of the cryo-EM
computational pipeline [7, 8]. The main computational tool
in this work is an extension of principal component analysis
(PCA). PCA has been applied to cryo-EM data for several
tasks [7, 9]. However, PCA is limited since it learns a single
subspace. We build on a maximum likelihood formulation,
called probabilistic PCA (PPCA) [10, 11, 12]. In particu-
lar, we iteratively estimate the union of subspaces using an
expectation-maximization (EM) algorithm, while sorting out
images that do not lie on the subspaces.
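As a point of reference for the EM iteration mentioned above, here is a minimal sketch of the standard single-subspace PPCA EM updates (Tipping and Bishop), followed by a per-image log-likelihood score. It is not the union-of-subspaces algorithm of this paper, and it has no built-in robustness to outliers. The names `ppca_em`, `ppca_loglik`, and `make_toy_data`, as well as the toy data dimensions, are assumptions made for illustration.

```python
import numpy as np


def ppca_em(X, q, n_iter=100, seed=0):
    """Fit one PPCA subspace to the rows of X (n, d) by EM.

    Returns the mean mu (d,), loading matrix W (d, q), and noise variance sigma2.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    W = rng.standard_normal((d, q))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent z_i given x_i.
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(q))   # (q, q)
        Ez = Xc @ W @ Minv                                   # rows are E[z_i]
        sum_Ezz = n * sigma2 * Minv + Ez.T @ Ez              # sum_i E[z_i z_i^T]
        # M-step: closed-form updates of W and sigma2.
        W_new = (Xc.T @ Ez) @ np.linalg.inv(sum_Ezz)
        resid = (np.sum(Xc**2)
                 - 2.0 * np.sum(Ez * (Xc @ W_new))
                 + np.trace(sum_Ezz @ (W_new.T @ W_new)))
        sigma2 = resid / (n * d)
        W = W_new
    return mu, W, sigma2


def ppca_loglik(X, mu, W, sigma2):
    """Per-row Gaussian log-likelihood under C = W W^T + sigma2 * I.

    Uses the Woodbury identity, so only q x q matrices are inverted.
    """
    n, d = X.shape
    q = W.shape[1]
    Xc = X - mu
    M = W.T @ W + sigma2 * np.eye(q)
    proj = Xc @ W
    maha = (np.sum(Xc**2, axis=1)
            - np.sum((proj @ np.linalg.inv(M)) * proj, axis=1)) / sigma2
    _, logdet_M = np.linalg.slogdet(M)
    logdet_C = (d - q) * np.log(sigma2) + logdet_M
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet_C + maha)


def make_toy_data(n_in=500, n_out=50, d=20, q=2, seed=0):
    """Hypothetical data: inliers near a q-dim subspace, outliers scattered in R^d."""
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.standard_normal((d, q)))     # orthonormal (d, q)
    inliers = (5.0 * rng.standard_normal((n_in, q)) @ basis.T
               + 0.1 * rng.standard_normal((n_in, d)))
    outliers = 2.0 * rng.standard_normal((n_out, d))
    X = np.vstack([inliers, outliers])
    is_particle = np.r_[np.ones(n_in, bool), np.zeros(n_out, bool)]
    return X, is_particle
```

A possible use is to fit with `mu, W, sigma2 = ppca_em(X, q=2)` and rank images by `ppca_loglik(X, mu, W, sigma2)`; images with unusually low scores lie far from the learned subspace and are candidates for removal. How to set the cutoff is left open in this sketch.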
PPCA offers several attractive advantages over PCA. First,
PPCA can be readily extended to multiple subspaces, leading
to a nonlinear flexible mixture model. Second, we work in the
dimension of the problem (i.e., the number of parameters that
define the sought subspaces), in contrast to standard PCA, which
requires estimating the full covariance matrix.
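Both advantages can be read off the standard PPCA formulation of Tipping and Bishop, written here in notation that is assumed for illustration (image dimension d, subspace dimension q, K subspaces) and may differ from the notation and modeling choices used later in the paper:

```latex
% Single-subspace PPCA: a latent q-dimensional variable plus isotropic noise.
\[
  x = W z + \mu + \varepsilon, \qquad
  z \sim \mathcal{N}(0, I_q), \qquad
  \varepsilon \sim \mathcal{N}(0, \sigma^2 I_d),
\]
\[
  \text{so that} \quad x \sim \mathcal{N}\!\left(\mu,\; W W^{\top} + \sigma^2 I_d\right),
  \qquad W \in \mathbb{R}^{d \times q}.
\]
% Mixture of K PPCA components: a union-of-subspaces model.
\[
  p(x) = \sum_{k=1}^{K} \pi_k \,
  \mathcal{N}\!\left(x;\; \mu_k,\; W_k W_k^{\top} + \sigma_k^2 I_d\right),
  \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1.
\]
```

Each component is described by dq + d + 1 numbers (the entries of W_k and μ_k, plus σ_k²), whereas a full covariance matrix has d(d+1)/2 free entries. As a hypothetical example, for 64 × 64 images (d = 4096) and q = 10, the loading matrix has 40,960 entries, compared with 8,390,656 for the full covariance.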
Fig. 1: A typical data processing pipeline for single-particle cryo-EM. The main contribution of this work is the fully automated sorting block,
which aims to remove the need for an expert practitioner in the sorting step.
arXiv:2210.12811v2 [eess.IV] 7 Mar 2023