trained network[9–15] or the models obtained through self-supervised tasks[16–18] as feature extractors. For the pre-trained network, several studies[10,15] have found that it is important to select appropriate feature hierarchy levels, because low-level features lack global awareness, while extremely high-level features may be biased toward the pre-training task itself. Also, the pre-trained network
can be used as a teacher network to detect anomalies by
knowledge distillation[9]. For the self-supervision-based methods, the key is to design suitable auxiliary tasks. Li et al.[16] propose to use CutPaste augmentation to train a one-class classifier. Other auxiliary tasks include position prediction[17], geometric transformation prediction[18], etc.
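For illustration, a CutPaste-style augmentation can be sketched as follows; the patch-size range and the uniform placement strategy here are our own simplifying assumptions, not the exact settings of [16]:

```python
import numpy as np

def cutpaste(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Cut a random rectangular patch and paste it at another location.

    The augmented image serves as a synthetic 'anomalous' sample for
    training a one-class or binary classifier.
    """
    h, w = image.shape[:2]
    # Patch side lengths: roughly 1/8 to 1/4 of each dimension
    # (an illustrative choice, not the setting used in [16]).
    ph = rng.integers(h // 8, h // 4)
    pw = rng.integers(w // 8, w // 4)
    # Random source and destination top-left corners.
    sy, sx = rng.integers(0, h - ph), rng.integers(0, w - pw)
    dy, dx = rng.integers(0, h - ph), rng.integers(0, w - pw)
    out = image.copy()
    out[dy:dy + ph, dx:dx + pw] = image[sy:sy + ph, sx:sx + pw]
    return out

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
aug = cutpaste(img, rng)
```

A classifier trained to separate `img` from `aug` learns features that transfer to real defect detection.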
Overall, benefiting from the powerful representation
capabilities of deep features, feature-based methods
can achieve better performance compared to existing
reconstruction-based methods. In particular, [10] achieves
state-of-the-art performance on the MVTec AD. However,
these methods are hard to adapt to a specific scenario, since the deep features are too abstract to incorporate prior knowledge.
2.2. Reconstruction based methods
Reconstruction-based methods commonly leverage gen-
erative models such as autoencoders[4,19,20], VAEs[21],
GANs[5,22], etc., to detect anomalies in the image space.
Generally, these methods consist of two steps: (1) reconstruct the image; (2) compare the original and reconstructed images to obtain anomaly maps.
Reconstruct the image. Early works mainly lever-
age denoising autoencoders[20,23,24] to help the network
better capture the normal distribution and avoid learning
an identity mapping. In the training phase, these methods corrupt the original image with certain noise and train the network to remove it. In addition to low-level noise such as Gaussian noise, cutout, and stains, the image can also be corrupted by semantic transformations, such as geometric transformations[25–27], color transformation[1], inpainting masks[4,28,29], etc., which are summarized into
an attribute removal-and-restoration framework by Ye et
al. [1]. They argue that the network can learn more robust
features during the process of restoring the previously removed attributes. Following this paradigm, we propose a specific attribute removal-and-restoration task in which the low-frequency and color attributes are the main targets of restoration.
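For concreteness, the low-level corruptions mentioned above (e.g., Gaussian noise and cutout) can be sketched as follows; the noise level and cutout sizes are illustrative assumptions, not the settings of any cited work:

```python
import numpy as np

def corrupt(image: np.ndarray, rng: np.random.Generator,
            sigma: float = 0.1, n_cutouts: int = 4) -> np.ndarray:
    """Apply additive Gaussian noise plus a few random cutout squares.

    `image` is a float array in [0, 1]; a denoising autoencoder is then
    trained to map the corrupted result back to the clean input.
    """
    out = image + rng.normal(0.0, sigma, size=image.shape)
    h, w = image.shape[:2]
    for _ in range(n_cutouts):
        s = rng.integers(h // 16, h // 8)      # cutout side length
        y, x = rng.integers(0, h - s), rng.integers(0, w - s)
        out[y:y + s, x:x + s] = 0.0            # blank the patch
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = rng.random((64, 64, 3)).astype(np.float32)
noisy = corrupt(clean, rng)
```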
Compare the images. After the reconstruction, the
anomalies can be detected by comparing the original and
reconstructed images. Early comparison functions include the l2 distance, structural similarity (SSIM) [30], etc. Furthermore, Zavrtanik et al. [4] introduce a multi-scale gradient magnitude similarity (MSGMS) anomaly evaluation function, which significantly boosts performance. However, MSGMS performs poorly on low-frequency color anomalies.
Later, Zavrtanik et al. [31] further propose to use a sep-
arate discriminative network (DRAEM) which takes the
concatenation of the original and reconstructed images as
input and detects the anomalies via image segmentation.
While DRAEM achieves remarkable performance on the
MVTec AD, the additional discriminative network introduces extra latent features and therefore makes the segmentation results less interpretable. Similarly, the current state-of-the-art reconstruction-based method OCR-GAN [22] also leverages latent-space features and combines them with the l1 distance to detect anomalies. In contrast, in this paper, we focus on hand-crafted anomaly score functions, which are more interpretable and adjustable. Concretely, we propose a new color comparison function and combine it with the existing MSGMS function. The proposed function can effectively detect various anomalies.
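As an illustration of such hand-crafted comparison functions, a single-scale gradient magnitude similarity anomaly map, the building block that MSGMS [4] aggregates over multiple scales, can be sketched as follows; the gradient operator and the constant `c` here are illustrative choices, not the exact definition from [4]:

```python
import numpy as np

def gradient_magnitude(gray: np.ndarray) -> np.ndarray:
    """Per-pixel gradient magnitude of a grayscale image
    (central differences via np.gradient; one illustrative choice)."""
    gy, gx = np.gradient(gray)
    return np.sqrt(gx ** 2 + gy ** 2)

def gms_anomaly_map(orig: np.ndarray, recon: np.ndarray,
                    c: float = 0.0026) -> np.ndarray:
    """1 - gradient magnitude similarity: values near 0 where the two
    images share local structure, near 1 where structure differs."""
    g1, g2 = gradient_magnitude(orig), gradient_magnitude(recon)
    gms = (2.0 * g1 * g2 + c) / (g1 ** 2 + g2 ** 2 + c)
    return 1.0 - gms

rng = np.random.default_rng(0)
a = rng.random((32, 32))
amap = gms_anomaly_map(a, a)   # identical inputs -> all-zero map
```

Because the map is built from plain image gradients, it stays interpretable, but, as noted above, a purely gradient-based score is blind to smooth color shifts.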
3. Methods
Our reconstruction framework is based on a UNet-type encoder-decoder network that takes the corrupted grayscale edge as input. Specifically, we first corrupt the original image with certain noise; then we convert the corrupted image into a grayscale edge; after that, we train a network to reconstruct the original image from its corrupted edge; finally, we discuss how to design the anomaly evaluation function.
3.1. Get the corrupted edges
Our basic idea is to formulate an attribute removal-and-
restoration task that can be suitable for various industrial
anomaly detection scenarios. Specifically, we construct a
‘grayscale edge to RGB image’ task where we remove the
low frequency and color attributes in the original image
and train a network to restore them. This design is based
on two considerations. First, low-frequency and color contents are general attributes in various images. We note that other tasks also exist, such as restoring a geometrically transformed image [18,25–27]. However, compared with our design, these methods are less general; e.g., the above geometric transformation framework cannot be applied to spatially invariant textures, while our design can be applied to both texture and object images.
Second, preserving edge information enables the network
to better reconstruct the details in normal patterns, which
can effectively reduce the false positive rate in complex
normal areas. On the other hand, preserving the edges
may also lead to the model producing identity mappings
of the original high-frequency components. To avoid this,
we first corrupt the original image with certain noise.
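A minimal sketch of this "corrupt, then reduce to a grayscale edge" input preparation is given below. The luma weights, the gradient-based edge operator, and the generic Gaussian corruption are illustrative assumptions; the actual corruption we use is the simulated-anomaly strategy described next:

```python
import numpy as np

def to_grayscale_edge(image: np.ndarray) -> np.ndarray:
    """Reduce an RGB image in [0, 1] to a grayscale edge-strength map,
    discarding color and low-frequency content while keeping edges."""
    gray = image @ np.array([0.299, 0.587, 0.114])   # standard luma weights
    gy, gx = np.gradient(gray)
    return np.sqrt(gx ** 2 + gy ** 2)

def corrupted_edge_input(image: np.ndarray,
                         rng: np.random.Generator) -> np.ndarray:
    """Corrupt first (so the model cannot learn an identity mapping of
    high frequencies), then extract the edge map used as network input.
    Gaussian noise stands in for the actual corruption here."""
    noisy = np.clip(image + rng.normal(0.0, 0.05, image.shape), 0.0, 1.0)
    return to_grayscale_edge(noisy)

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
edge = corrupted_edge_input(img, rng)
```

The reconstruction network is then trained to map `edge` back to `img`, restoring the removed low-frequency and color attributes.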
We adopt the strategy proposed in [31] to generate simulated anomalies, whose textures come from the external texture dataset DTD [32] and whose shapes are given by randomly generated Perlin noise. However, we observe that if only these out-of-distribution textures are used as pseudo-anomalies, the model cannot distinguish well between foreground and background areas. This makes it difficult to detect structural defects caused by missing components. Therefore,