
1 Introduction
Cascaded diffusion models are a recent breakthrough in image generation. Models such as DALL·E [14] or Imagen [16] use this architecture to generate realistic images from natural language descriptions. Many such models have been kept closed-source, partly due to perceived safety risks of an open release [11]. Stability AI recently released a comparable model publicly: Stable Diffusion [15].
The model has been used by a diverse community from children [7] to professional artists [3, 1].
Due to possible safety risks, Stable Diffusion does include a post-hoc safety filter that blocks explicit images [6, 25]. Unfortunately, the filter’s design is not documented. From inspecting the source code,
we find that the filter blocks out any generated image that is too close (in the embedding space of
OpenAI’s CLIP model [13]) to at least one of 17 pre-defined “sensitive concepts”.
To make matters worse, while the safety filter implementation is public, the concepts to be filtered
out are obfuscated: only the CLIP embedding vector of each of these 17 sensitive concepts, not the
concept itself, is provided. These embeddings can be seen as a “hash” of the sensitive concepts. To
overcome the lack of documentation, we reverse engineer the safety filter and invert the embeddings
for the sensitive concepts. Surprisingly, we find that the current filter only checks for images of
a sexual nature, ignoring other problematic content such as violence or gore. Moreover, simple
prompt-engineering reliably bypasses the filter even on the concepts that it does aim to block.
We conclude that the Stable Diffusion safety filter is likely not suitable for use in downstream
applications that require high safety standards. Worryingly, the lack of proper documentation on the
filter has so far prevented application developers from properly assessing safety risks and applying
additional mitigations (e.g., stronger content blockers) if needed [26]. Security by obscurity is rarely warranted [17], and can amplify other risks (e.g., obfuscated “unsafe” concepts could be repurposed
for censorship). We encourage future releases (whether open or closed source) of machine learning models to adopt proven practices from computer security, such as openly documenting safety features and their limitations, and providing proper vulnerability disclosure channels.
2 How the safety filter works
The Stable Diffusion safety filter [6] is not documented, but we can deduce how it works from the code in the public repository¹. Here is a simplified outline of the safety filter in Stable Diffusion v1.4 (see Figure 1 and Appendix C for the pseudocode):
• The user provides a prompt, say “a photograph of an astronaut riding a horse”. The Stable Diffusion model then creates an image conditioned on this prompt.
• Before being shown to the user, the image is run through CLIP’s image encoder [13] to obtain an embedding, i.e., a high-dimensional vector representation of the input.
• Then, the cosine similarity between this embedding and 17 fixed embedding vectors is computed. Each of these fixed vectors represents a pre-defined sensitive concept.
• Every concept has a pre-specified similarity threshold. If the cosine similarity between the image and any of the concepts exceeds its threshold, the image is discarded (a minimal sketch of this check follows the list).
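The following Python/NumPy sketch illustrates the threshold check described above. The function and variable names (is_unsafe, concept_embeddings, concept_thresholds) and the toy threshold values are our own illustrative assumptions, not the identifiers or values used in the actual diffusers safety checker, which operates on CLIP embeddings produced inside the pipeline.

import numpy as np

def cosine_similarity(vector: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between a single vector and each row of a matrix."""
    vector = vector / np.linalg.norm(vector)
    matrix = matrix / np.linalg.norm(matrix, axis=-1, keepdims=True)
    return matrix @ vector

def is_unsafe(image_embedding: np.ndarray,
              concept_embeddings: np.ndarray,  # shape (17, d): obfuscated concept vectors
              concept_thresholds: np.ndarray   # shape (17,): per-concept similarity thresholds
              ) -> bool:
    """Discard the image if it is too similar to any sensitive concept."""
    similarities = cosine_similarity(image_embedding, concept_embeddings)
    return bool(np.any(similarities > concept_thresholds))

# Toy usage with random stand-in vectors (the real filter uses CLIP embeddings
# and the thresholds shipped with the safety checker).
rng = np.random.default_rng(0)
image_embedding = rng.normal(size=768)
concept_embeddings = rng.normal(size=(17, 768))
concept_thresholds = np.full(17, 0.2)  # made-up threshold values
print(is_unsafe(image_embedding, concept_embeddings, concept_thresholds))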
The vector representations of the unsafe concepts are embeddings of unknown text prompts using
CLIP’s text model. Because CLIP is trained to match the embeddings of images and corresponding
textual captions, it is expected that the textual embedding of an unsafe concept (e.g., “nudity”) will be close to the image embeddings of depictions of that same concept. The 17 text prompts that
were used to produce the pre-computed CLIP embeddings fully determine the unsafe concepts that Stable Diffusion’s safety filter looks for. Unfortunately, these prompts have not been published. Thus, the entire safety classification logic is contained in a set of obfuscated, static, high-dimensional embeddings.
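Because the concept vectors are just CLIP text embeddings, a candidate guess for one of the hidden prompts can be scored by embedding it with CLIP’s text encoder and comparing it to the published vector. The sketch below uses the Hugging Face transformers CLIP model; the choice of CLIP variant, the use of “nudity” as a stand-in concept vector, and the comparison in the text-projection space are assumptions made purely for illustration.

import torch
from transformers import CLIPModel, CLIPTokenizer

# Assumption: this CLIP variant (ViT-L/14) matches the one used by the safety
# checker; the exact variant is not stated in this section.
MODEL_NAME = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(MODEL_NAME)
tokenizer = CLIPTokenizer.from_pretrained(MODEL_NAME)

def text_embedding(prompt: str) -> torch.Tensor:
    """L2-normalized CLIP text embedding of a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return features[0] / features[0].norm()

# Stand-in for one of the 17 published (obfuscated) concept vectors; here we
# simply embed a known word so the example is self-contained.
obfuscated_concept = text_embedding("nudity")

# A guessed prompt scores high only if it is semantically close to the concept
# hidden behind the obfuscated vector.
for guess in ["nudity", "violence", "a cute puppy"]:
    score = torch.dot(text_embedding(guess), obfuscated_concept).item()
    print(f"{guess!r}: cosine similarity = {score:.3f}")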
Special care concepts.
In addition to the procedure outlined above, the filter considers a set of particularly sensitive “special care concepts”. Specifically, if a generated image is close (in CLIP’s latent space) to any of these special care concepts, then the similarity thresholds for the 17 sensitive concepts above are lowered, so that filtering becomes more aggressive. This behavior is also undocumented.
The code shows there are three special care concepts, which again are only provided as embeddings.
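A sketch of what this adjustment might look like, reusing the cosine_similarity helper from the earlier sketch; the special-care embeddings, thresholds, and the 0.01 adjustment magnitude are placeholder assumptions, since the real values and logic live only in the undocumented code.

import numpy as np

def cosine_similarity(vector: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between a single vector and each row of a matrix."""
    vector = vector / np.linalg.norm(vector)
    matrix = matrix / np.linalg.norm(matrix, axis=-1, keepdims=True)
    return matrix @ vector

def is_unsafe_with_special_care(image_embedding: np.ndarray,
                                concept_embeddings: np.ndarray,  # (17, d) sensitive concepts
                                concept_thresholds: np.ndarray,  # (17,)
                                special_embeddings: np.ndarray,  # (3, d) special care concepts
                                special_thresholds: np.ndarray,  # (3,)
                                adjustment: float = 0.01         # assumed magnitude
                                ) -> bool:
    """If the image is close to any special care concept, lower every
    sensitive-concept threshold by `adjustment`, making the filter more
    aggressive, then apply the ordinary check."""
    special_sims = cosine_similarity(image_embedding, special_embeddings)
    if np.any(special_sims > special_thresholds):
        concept_thresholds = concept_thresholds - adjustment
    similarities = cosine_similarity(image_embedding, concept_embeddings)
    return bool(np.any(similarities > concept_thresholds))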
¹https://github.com/huggingface/diffusers/blob/84b9df5/src/diffusers/pipelines/stable_diffusion/safety_checker.py. Accessed 29/09/2022.