Yu Cai et al. / Medical Image Analysis (2023)
normal images during training, which is the well-known OCC
setting (Ruff et al., 2018).
Classical anomaly detection methods, such as OC-SVM (Schölkopf
et al., 1999) and SVDD (Tax and Duin, 2004), often fail on high-
dimensional data due to poor computational scalability and the
curse of dimensionality. Their deep extension, Deep SVDD (Ruff
et al., 2018), utilizes neural networks to constrain the normal
samples in a hypersphere of minimum volume, handling high-
dimensional data better but suffering from mode collapse.
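For concreteness, the hypersphere objective of One-Class Deep SVDD (Ruff et al., 2018) maps samples $x_i$ through a network $\phi(\cdot;\mathcal{W})$ and minimizes their mean squared distance to a fixed center $\mathbf{c}$:

```latex
\min_{\mathcal{W}} \;
\frac{1}{n}\sum_{i=1}^{n}\bigl\|\phi(x_i;\mathcal{W})-\mathbf{c}\bigr\|^{2}
\;+\;\frac{\lambda}{2}\sum_{\ell=1}^{L}\bigl\|\mathbf{W}^{\ell}\bigr\|_{F}^{2}
```

with $\|\phi(x;\mathcal{W})-\mathbf{c}\|^{2}$ serving as the anomaly score at test time. The mode collapse mentioned above is the trivial solution $\phi\equiv\mathbf{c}$, which minimizes the first term for any input; Ruff et al. (2018) counter it by fixing $\mathbf{c}$ away from zero and removing bias terms.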
Most recent state-of-the-art anomaly detection methods focus
on reconstruction and self-supervised learning. As techniques
highly related to our work, ensemble-based uncertainty esti-
mates and semi-supervised learning for anomaly detection are
also described in this section.
2.1. Reconstruction-based Anomaly Detection
Reconstruction-based methods are one of the most popular
families in anomaly detection, especially for medical images
(Baur et al., 2021). They usually utilize generative models, such
as generative adversarial networks (GANs) (Goodfellow et al.,
2014), auto-encoders (AEs), or their variants, to learn a mapping
function that reconstructs normal images. The unseen abnormal
images are assumed to be reconstructed poorly by these models
trained with only normal images, and in turn yield high
reconstruction errors.
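The scoring rule shared by this family can be sketched as follows; the mean filter `toy_reconstruct` is only a stand-in for a trained AE or GAN (an assumption for illustration), but the residual-based scoring is the general recipe:

```python
import numpy as np

def anomaly_map(image, reconstruct):
    """Pixel-wise anomaly map: squared reconstruction residual."""
    return (image - reconstruct(image)) ** 2

def image_score(image, reconstruct):
    """Image-level anomaly score: mean reconstruction error."""
    return float(anomaly_map(image, reconstruct).mean())

def toy_reconstruct(x):
    """Stand-in for a trained AE: a 3x3 mean filter, so smooth
    'normal' structure is preserved while sharp deviations are not."""
    pad = np.pad(x, 1, mode="edge")
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = pad[i:i + 3, j:j + 3].mean()
    return out

normal = np.full((16, 16), 0.5)      # perfectly smooth "normal" image
abnormal = normal.copy()
abnormal[6:10, 6:10] = 1.0           # synthetic bright lesion
assert image_score(abnormal, toy_reconstruct) > image_score(normal, toy_reconstruct)
```

In practice the reconstruction comes from a model trained only on normal data, and the map is thresholded or aggregated per image.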
Schlegl et al. (2017) were the first to use GANs for anomaly
detection. They proposed AnoGAN to learn the manifold of
normal images. For a query image, a latent feature is found
via an iterative optimization process so as to generate the image
most similar to the query. The query image is identified as
abnormal if it differs substantially from the best generated image.
To replace the time-consuming iterative process in the testing
phase, Schlegl et al. (2019) further utilized an encoder to learn
the mapping from the retinal OCT image to the latent space, and
derived a fast version of AnoGAN, named f-AnoGAN. How-
ever, these GAN-based methods could suffer from memoriza-
tion pitfalls, causing reconstructions to differ anatomically from
the actual input.
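AnoGAN's time-consuming test-time search can be illustrated with a toy linear "generator" (an assumption for illustration; the real method backpropagates through a trained GAN and also adds a discriminator feature term):

```python
import numpy as np

def anogan_score(x, G, z_dim, steps=200, lr=0.1):
    """AnoGAN-style latent search (simplified): optimize a latent
    code z so that G(z) matches the query x, then use the residual
    ||G(z) - x||^2 as the anomaly score. Gradients are numeric here
    because the toy generator below is just a matrix."""
    z = np.zeros(z_dim)
    eps = 1e-4
    for _ in range(steps):
        base = np.sum((G(z) - x) ** 2)
        grad = np.zeros_like(z)
        for k in range(z_dim):
            zp = z.copy()
            zp[k] += eps
            grad[k] = (np.sum((G(zp) - x) ** 2) - base) / eps
        z -= lr * grad
    return float(np.sum((G(z) - x) ** 2))

# Toy "generator" whose range is a 2-D plane inside R^8.
W = np.array([[1.0, 0.0] if i % 2 == 0 else [0.0, 1.0] for i in range(8)])
G = lambda z: W @ z

in_manifold = G(np.array([0.3, -0.7]))  # reachable by G -> low residual
off_manifold = in_manifold + np.array([1.0, 0, -1.0, 0, 0, 0, 0, 0])  # unreachable bump
assert anogan_score(in_manifold, G, 2) < anogan_score(off_manifold, G, 2)
```

The per-query optimization loop is exactly what f-AnoGAN replaces with a single encoder forward pass.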
Various approaches also used variants of AEs for anomaly
detection, including the Variational AE (VAE) (Zimmerer et al.,
2018), Adversarial AE (AAE) (Chen and Konukoglu, 2018), and
Vector Quantized VAE (VQ-VAE) (Marimont and Tarroni,
2021). To avoid abnormal images being well reconstructed,
Gong et al. (2019) proposed to augment the AE with a mem-
ory module, which can store the latent features of normal train-
ing samples. The reconstruction is obtained from a few most
relevant memory records, thus tending to be close to a normal
image and enlarging the reconstruction errors of abnormal
images. Compared with GAN-based methods, AE-based methods
preserve more anatomical coherence but usually generate blurry
reconstructions (Baur et al., 2021), leading to false positive
detections around high-frequency regions (e.g., boundaries).
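The memory read of Gong et al. (2019), described above, can be sketched roughly as follows (a minimal dense version, assuming cosine-similarity addressing and omitting the shrinkage and decoding steps of the actual model):

```python
import numpy as np

def memory_read(z, memory, top_k=2):
    """Replace a latent code by a combination of its most relevant
    memory slots.

    z:      (d,) query latent feature
    memory: (n_slots, d) latent features of normal training samples
    Returns a convex combination of the top_k most similar slots, so
    the decoded output is pulled toward the normal manifold."""
    sims = memory @ z / (np.linalg.norm(memory, axis=1) * np.linalg.norm(z) + 1e-12)
    idx = np.argsort(sims)[-top_k:]       # most similar normal prototypes
    w = np.exp(sims[idx])
    w /= w.sum()                          # softmax over the selected slots
    return w @ memory[idx]

memory = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # normal prototypes
z_normal = np.array([0.9, 0.1])      # close to a stored prototype
z_abnormal = np.array([-1.0, -1.0])  # far from all prototypes
r_normal = memory_read(z_normal, memory)
r_abnormal = memory_read(z_abnormal, memory)
# the abnormal code is replaced by normal prototypes, so its
# reconstruction stays far from it -> large reconstruction error
assert np.linalg.norm(r_abnormal - z_abnormal) > np.linalg.norm(r_normal - z_normal)
```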
To mitigate this problem, Mao et al. (2020) proposed to auto-
matically estimate the pixel-level uncertainty of reconstruction
using an AE, which is then used to normalize the reconstruction
error and significantly suppress false positive detections in CXRs.
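A minimal sketch of this uncertainty normalization, with a hand-crafted variance map standing in for the AE's predicted per-pixel uncertainty (an assumption for illustration):

```python
import numpy as np

def normalized_anomaly_map(x, recon, var, eps=1e-6):
    """Reconstruction error normalized by predicted per-pixel
    uncertainty: regions the model is known to reconstruct poorly
    (e.g., high-frequency boundaries) get a large variance and are
    down-weighted, suppressing false positives there."""
    return (x - recon) ** 2 / (var + eps)

x = np.array([[0.0, 1.0],
              [0.0, 0.2]])
recon = np.array([[0.0, 0.5],
                  [0.0, 0.0]])
var = np.array([[0.01, 1.0],    # uncertain at this boundary pixel
                [0.01, 0.01]])  # confident at this (truly abnormal) pixel
amap = normalized_anomaly_map(x, recon, var)
# the boundary pixel's large raw error (0.25) is suppressed below the
# smaller raw error (0.04) at the confidently mis-reconstructed pixel
assert amap[1, 1] > amap[0, 1]
```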
Recently, incorporating adversarial training into AEs has be-
come popular, as it combines the advantages of both. Baur et al.
(2018) demonstrated that AEs with spatial bottlenecks can re-
construct important fine details better than those with dense bot-
tlenecks, and combined the spatial VAE with GAN to improve
the realism of reconstructed normal samples for anomaly detec-
tion in brain MRIs. In addition to adversarial training, Akcay
et al. (2018) used an extra encoder to map the reconstructed
image to the latent space again, and minimized reconstruction
errors in both the image space and latent space during training
to aid in learning the data distribution for the normal samples.
Zaheer et al. (2020) proposed to transform the fundamental role
of a discriminator from identifying real and fake data to distin-
guishing between good and bad quality reconstructions, which
is highly desirable in anomaly detection, as a trained AE would
not reconstruct abnormal images as well as it reconstructs normal
images conforming to the learned representations.
2.2. Self-Supervised Learning-based Anomaly Detection
Self-supervised learning (Jing and Tian,2020), referring to
learning methods in which networks are explicitly trained us-
ing pretext tasks with generated pseudo labels, has also been
extensively studied for anomaly detection. Sohn et al. (2020)
proposed to first learn self-supervised representations from one-
class data and then build one-class classifiers on learned repre-
sentations. Based on their proposed framework, they applied
distribution augmentation (Jun et al.,2020) for one-class con-
trastive learning to reduce the uniformity of representations.
Further, Tian et al. (2021) combined distribution-augmented
contrastive learning (Sohn et al.,2020), augmentation predic-
tion (Golan and El-Yaniv, 2018), and position prediction (Doersch
et al., 2015) to learn feature representations for anomaly-sensitive
detection models. Moreover, Li et al. (2021) proposed
to learn representations by classifying normal data from their
designed CutPaste, and then build a Gaussian density estimator
on learned representations.
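A rough sketch of the CutPaste augmentation (simplified: the original paper also varies patch aspect ratio, rotation, and a scar-like variant):

```python
import numpy as np

def cutpaste(image, patch_h=4, patch_w=4, rng=None):
    """CutPaste-style pseudo-anomaly: cut a rectangular patch from a
    random location and paste it at another random location. A
    classifier trained to separate original from augmented images
    learns representations sensitive to such local irregularities."""
    rng = np.random.default_rng(rng)
    h, w = image.shape[:2]
    ys, xs = rng.integers(0, h - patch_h), rng.integers(0, w - patch_w)
    yd, xd = rng.integers(0, h - patch_h), rng.integers(0, w - patch_w)
    out = image.copy()
    out[yd:yd + patch_h, xd:xd + patch_w] = image[ys:ys + patch_h, xs:xs + patch_w]
    return out

img = np.linspace(0, 1, 16 * 16).reshape(16, 16)  # smooth "normal" image
aug = cutpaste(img, rng=0)                        # pseudo-abnormal sample
assert aug.shape == img.shape
```

The normal/augmented labels then supervise the representation on which the Gaussian density estimator is fit.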
In addition to the aforementioned representation-based meth-
ods, some works (Tan et al., 2020, 2021; Schlüter et al., 2022)
proposed to manually synthesize defects to train models to de-
tect irregularities. Various image processing approaches have
been designed to synthesize abnormal images, including Cut-
Paste (Li et al., 2021), Foreign Patch Interpolation (FPI) (Tan
et al., 2020), Poisson Image Interpolation (PII) (Tan et al.,
2021), etc. Recently, Schlüter et al. (2022) integrated Poisson
image editing with rescaling, shifting, and a new Gamma-
distribution-based patch shape sampling strategy to synthesize
natural and diverse anomalies. Background constraints and
pixel-level labels derived from the difference between the
synthesized result and the normal image were designed to make
the results more relevant to the task. However, these methods
may not generalize well
due to the inherent reliance on the similarity between synthetic
abnormal patterns and real anomalies.
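The patch-interpolation idea behind FPI can be sketched as follows (simplified: a single square patch and a scalar interpolation factor, whereas the published methods also vary patch shape and, in PII, use Poisson blending instead of linear mixing):

```python
import numpy as np

def fpi_blend(host, donor, y, x, size, alpha):
    """Foreign Patch Interpolation (simplified): inside a patch,
    linearly interpolate between the host image and a patch taken
    from a second (donor) image; alpha controls subtlety. The
    pixel-level label is the interpolation strength inside the
    patch and 0 elsewhere."""
    out = host.copy()
    region = (slice(y, y + size), slice(x, x + size))
    out[region] = (1 - alpha) * host[region] + alpha * donor[region]
    label = np.zeros_like(host)
    label[region] = alpha
    return out, label

host = np.zeros((8, 8))   # stand-in for a normal training image
donor = np.ones((8, 8))   # stand-in for a second normal image
img, lab = fpi_blend(host, donor, 2, 2, 3, alpha=0.4)
assert img[3, 3] == 0.4 and img[0, 0] == 0.0  # subtle defect inside patch only
```

A segmentation network trained on `(img, lab)` pairs then predicts irregularity scores directly at test time.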
Also, Zavrtanik et al. (2021) proposed to combine the re-
construction network with a self-supervised network. It feeds
the concatenation of the original image and reconstruction re-
sult to a segmentation network trained via self-supervised learn-
ing, which is expected to learn a distance function between the
original and reconstructed anomaly appearance. However, the