Red-Teaming the Stable Diffusion Safety Filter
Javier Rando
ETH Zurich
jrando@ethz.ch
Daniel Paleka
ETH Zurich
daniel.paleka@inf.ethz.ch
David Lindner
ETH Zurich
david.lindner@inf.ethz.ch
Lennart Heim
Centre for the Governance of AI
lennart.heim@governance.ai
Florian Tramèr
ETH Zurich
florian.tramer@inf.ethz.ch
Abstract
Stable Diffusion is a recent open-source image generation model comparable to
proprietary models such as DALL·E, Imagen, or Parti. Stable Diffusion comes with
a safety filter that aims to prevent generating explicit images. Unfortunately, the
filter is obfuscated and poorly documented. This makes it hard for users to prevent
misuse in their applications, and to understand the filter’s limitations and improve it.
We first show that it is easy to generate disturbing content that bypasses the safety
filter. We then reverse-engineer the filter and find that while it aims to prevent
sexual content, it ignores violence, gore, and other similarly disturbing content.
Based on our analysis, we argue safety measures in future model releases should
strive to be fully open and properly documented to stimulate security contributions
from the community.
Figure 1: Simplified safety filter algorithm implemented in Stable Diffusion v1.4. Images are mapped
to a CLIP latent space, where they are compared against pre-computed embeddings of 17 unsafe
concepts (see full list in Appendix E). If the cosine similarity between the output image and any of
the concepts is above a certain threshold, the image is considered unsafe and blacked-out.
ML Safety Workshop, 36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.04610v5 [cs.AI] 10 Nov 2022
1 Introduction
Cascaded diffusion models are a recent breakthrough in image generation. Models such as DALL·E [14]
or Imagen [16] use this architecture to generate realistic images from natural language descriptions.
Many such models have been kept closed-source, partly due to perceived safety risks of an open
release [11]. Stability AI recently released a comparable model publicly: Stable Diffusion [15].
The model has been used by a diverse community from children [7] to professional artists [3, 1].
Due to possible safety risks, Stable Diffusion does include a post-hoc safety filter that blocks explicit
images [6, 25]. Unfortunately, the filter's design is not documented. From inspecting the source code,
we find that the filter blocks out any generated image that is too close (in the embedding space of
OpenAI’s CLIP model [13]) to at least one of 17 pre-defined “sensitive concepts”.
To make matters worse, while the safety filter implementation is public, the concepts to be filtered
out are obfuscated: only the CLIP embedding vector of each of these 17 sensitive concepts, not the
concept itself, is provided. These embeddings can be seen as a “hash” of the sensitive concepts. To
overcome the lack of documentation, we reverse engineer the safety filter and invert the embeddings
for the sensitive concepts. Surprisingly, we find that the current filter only checks for images of
a sexual nature, ignoring other problematic content such as violence or gore. Moreover, simple
prompt-engineering reliably bypasses the filter even on the concepts that it does aim to block.
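As an illustration of such an inversion, a simple guess-and-check (dictionary) attack suffices in principle: embed candidate words with CLIP's text encoder and look for a near-perfect cosine similarity with the fixed concept embeddings. The sketch below follows this idea; the candidate list is illustrative, and the checkpoint name and attribute names are assumed to match the publicly released safety checker rather than being a specification of the exact procedure.

```python
import torch
from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker
from transformers import CLIPModel, CLIPTokenizer

# Fixed (obfuscated) concept embeddings shipped with the safety checker.
checker = StableDiffusionSafetyChecker.from_pretrained("CompVis/stable-diffusion-safety-checker")
concepts = checker.concept_embeds.detach()            # (17, d), attribute name as in safety_checker.py
concepts = concepts / concepts.norm(dim=-1, keepdim=True)

# Embed candidate words with a matching CLIP text encoder (ViT-L/14 assumed).
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
candidates = ["nudity", "violence", "gore"]           # illustrative guesses only
tokens = tokenizer(candidates, padding=True, return_tensors="pt")
with torch.no_grad():
    guesses = model.get_text_features(**tokens)
guesses = guesses / guesses.norm(dim=-1, keepdim=True)

# A cosine similarity close to 1 would reveal the hidden concept behind an embedding.
similarity = guesses @ concepts.T                     # (num_candidates, 17)
for word, sims in zip(candidates, similarity):
    print(word, round(float(sims.max()), 3), int(sims.argmax()))
```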
We conclude that the Stable Diffusion safety filter is likely not suitable for use in downstream
applications that require high safety standards. Worryingly, the lack of proper documentation on the
filter has so far prevented application developers from properly assessing safety risks and applying
additional mitigations (e.g., stronger content blockers) if needed [26]. Security by obscurity is rarely
warranted [17], and can amplify other risks (e.g., obfuscated "unsafe" concepts could be repurposed
for censorship). We encourage future releases (both open and closed source) of machine learning
models to adopt proven practices from computer security, such as open documentation of safety
features and their limitations, and the adoption of proper vulnerability disclosure channels.
2 How the safety filter works
The Stable Diffusion safety filter [6] is not documented, but we can deduce how it works from the
code in the public repository.¹ Here is a simplified outline of the safety filter in Stable Diffusion
v1.4 (see Figure 1, and Appendix C for the pseudocode):
- The user provides a prompt, say “a photograph of an astronaut riding a horse”. The Stable Diffusion model then creates an image conditioned on this prompt.
- Before being shown to the user, the image is run through CLIP’s image encoder [13] to obtain an embedding, i.e., a high-dimensional vector representation of the image.
- Then, the cosine similarity between this embedding and 17 different fixed embedding vectors is computed. Each of these fixed vectors represents a pre-defined sensitive concept.
- Every concept has a pre-specified similarity threshold. If the cosine similarity between the image and any of the concepts is larger than the respective threshold, the image is discarded.
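For concreteness, the following sketch captures the core check. The function and tensor names are illustrative; the authoritative logic is the safety_checker.py module referenced above.

```python
import torch

def filter_image(image_embedding: torch.Tensor,
                 concept_embeds: torch.Tensor,
                 concept_thresholds: torch.Tensor) -> bool:
    """Return True if the generated image should be blacked out.

    image_embedding:    (d,)    CLIP embedding of the generated image
    concept_embeds:     (17, d) fixed, obfuscated unsafe-concept embeddings
    concept_thresholds: (17,)   per-concept similarity thresholds
    """
    # Normalize so that dot products equal cosine similarities.
    img = image_embedding / image_embedding.norm()
    concepts = concept_embeds / concept_embeds.norm(dim=-1, keepdim=True)
    cos_sim = concepts @ img  # (17,) similarity to each unsafe concept
    # The image is blocked if it exceeds *any* concept's threshold.
    return bool((cos_sim > concept_thresholds).any())
```

Note that each concept carries its own threshold, so the filter can be stricter for some concepts than for others.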
The vector representations of the unsafe concepts are embeddings of unknown text prompts using
CLIP’s text model. Because CLIP is trained to match the embeddings of images and corresponding
textual captions, it is expected that the textual embedding of some unsafe concept (e.g., “nudity”)
will be close to the image embeddings of depictions of this same concept. The 17 text prompts that
were used to produce pre-computed CLIP embeddings fully determine the unsafe concepts that Stable
Diffusion’s safety filter looks for. Unfortunately, these prompts have not been published. Thus, the
whole safety classification logic is contained in some obfuscated static high-dimensional embeddings.
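To see why a textual embedding can stand in for images of the same concept, the snippet below embeds a concept string and a generated image with the same CLIP model and measures their cosine similarity. The model checkpoint, the file name, and the example concept are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP ViT-L/14 is assumed here; the safety checker ships its own CLIP weights.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

text_inputs = processor(text=["nudity"], return_tensors="pt", padding=True)
image_inputs = processor(images=Image.open("generated.png"), return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)      # (1, d)
    image_emb = model.get_image_features(**image_inputs)   # (1, d)

# A high cosine similarity suggests the image depicts the textual concept.
similarity = torch.nn.functional.cosine_similarity(text_emb, image_emb)
print(similarity.item())
```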
Special care concepts.
In addition to the procedure outlined above, the filter defines a second, particularly sensitive set of
“special care concepts”. Specifically, if a generated image is close (in
CLIP’s latent space) to any fixed special care concept, then the similarity threshold for the above 17
sensitive concepts is lowered, so that filtering is more aggressive. This behavior is also undocumented.
The code shows there are three special care concepts, which again are only provided as embeddings.
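Continuing the filter_image sketch above, the special care logic could look roughly as follows; the size of the threshold reduction is an assumption, not a documented value.

```python
def filter_image_with_special_care(image_embedding: torch.Tensor,
                                   concept_embeds: torch.Tensor,
                                   concept_thresholds: torch.Tensor,
                                   special_embeds: torch.Tensor,
                                   special_thresholds: torch.Tensor,
                                   adjustment: float = 0.01) -> bool:
    """Like filter_image, but filters more aggressively near a special care concept."""
    img = image_embedding / image_embedding.norm()
    special = special_embeds / special_embeds.norm(dim=-1, keepdim=True)
    # Check proximity to any of the three special care concepts.
    near_special = bool(((special @ img) > special_thresholds).any())
    # If so, lower every regular threshold so that filtering triggers earlier.
    thresholds = concept_thresholds - adjustment if near_special else concept_thresholds
    return filter_image(image_embedding, concept_embeds, thresholds)
```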
¹ https://github.com/huggingface/diffusers/blob/84b9df5/src/diffusers/pipelines/stable_diffusion/safety_checker.py. Accessed 29/09/2022.