
1 Introduction
Cascaded diffusion models are a recent breakthrough in image generation. Models such as DALL·E [14] or Imagen [16] use this architecture to generate realistic images from natural language descriptions. Many such models have been kept closed-source, partly due to perceived safety risks of an open release [11]. Stability AI recently released a comparable model publicly: Stable Diffusion [15].
The model has been used by a diverse community from children [7] to professional artists [3, 1].
Due to possible safety risks, Stable Diffusion does include a post-hoc safety filter that blocks explicit images [6, 25]. Unfortunately, the filter’s design is not documented. From inspecting the source code,
we find that the filter blocks out any generated image that is too close (in the embedding space of
OpenAI’s CLIP model [13]) to at least one of 17 pre-defined “sensitive concepts”.
To make matters worse, while the safety filter implementation is public, the concepts to be filtered
out are obfuscated: only the CLIP embedding vector of each of these 17 sensitive concepts, not the
concept itself, is provided. These embeddings can be seen as a “hash” of the sensitive concepts. To
overcome the lack of documentation, we reverse engineer the safety filter and invert the embeddings
for the sensitive concepts. Surprisingly, we find that the current filter only checks for images of
a sexual nature, ignoring other problematic content such as violence or gore. Moreover, simple
prompt-engineering reliably bypasses the filter even on the concepts that it does aim to block.
We conclude that the Stable Diffusion safety filter is likely not suitable for use in downstream
applications that require high safety standards. Worryingly, the lack of proper documentation on the
filter has so far prevented application developers from properly assessing safety risks and applying
additional mitigations (e.g., stronger content blockers) if needed [26]. Security by obscurity is rarely warranted [17], and can amplify other risks (e.g., obfuscated “unsafe” concepts could be repurposed
for censorship). We encourage future releases (whether open or closed source) of machine learning models to adopt proven practices from computer security, such as openly documenting safety features and their limitations, and providing proper vulnerability disclosure channels.
2 How the safety filter works
The Stable Diffusion safety filter [6] is not documented, but we can deduce how it works from the code in the public repository¹. Here is a simplified outline of the safety filter in Stable Diffusion v1.4 (see Figure 1 and Appendix C for the pseudocode):
• The user provides a prompt, say “a photograph of an astronaut riding a horse”. The Stable Diffusion model then creates an image conditioned on this prompt.
• Before being shown to the user, the image is run through CLIP’s image encoder [13] to obtain an embedding, i.e., a high-dimensional vector representation of the input.
• Then, the cosine similarity between this embedding and 17 fixed embedding vectors is computed. Each of these fixed vectors represents a pre-defined sensitive concept.
• Every concept has a pre-specified similarity threshold. If the cosine similarity between the image and any of the concepts exceeds its threshold, the image is discarded (a minimal sketch of this check follows the list).
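The following Python/NumPy sketch illustrates the threshold check described above. The function and variable names (is_unsafe, concept_embeddings, concept_thresholds) and the toy threshold values are our own illustrative assumptions, not the identifiers or values used in the actual diffusers safety checker, which operates on CLIP embeddings produced inside the pipeline.

import numpy as np

def cosine_similarity(vector: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between a single vector and each row of a matrix."""
    vector = vector / np.linalg.norm(vector)
    matrix = matrix / np.linalg.norm(matrix, axis=-1, keepdims=True)
    return matrix @ vector

def is_unsafe(image_embedding: np.ndarray,
              concept_embeddings: np.ndarray,  # shape (17, d): obfuscated concept vectors
              concept_thresholds: np.ndarray   # shape (17,): per-concept similarity thresholds
              ) -> bool:
    """Discard the image if it is too similar to any sensitive concept."""
    similarities = cosine_similarity(image_embedding, concept_embeddings)
    return bool(np.any(similarities > concept_thresholds))

# Toy usage with random stand-in vectors (the real filter uses CLIP embeddings
# and the thresholds shipped with the safety checker).
rng = np.random.default_rng(0)
image_embedding = rng.normal(size=768)
concept_embeddings = rng.normal(size=(17, 768))
concept_thresholds = np.full(17, 0.2)  # made-up threshold values
print(is_unsafe(image_embedding, concept_embeddings, concept_thresholds))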
The vector representations of the unsafe concepts are embeddings of unknown text prompts using
CLIP’s text model. Because CLIP is trained to match the embeddings of images and corresponding
textual captions, it is expected that the textual embedding of an unsafe concept (e.g., “nudity”) will be close to the image embeddings of depictions of that same concept. The 17 text prompts that
were used to produce the pre-computed CLIP embeddings fully determine the unsafe concepts that Stable Diffusion’s safety filter looks for. Unfortunately, these prompts have not been published. Thus, the entire safety classification logic is contained in a set of obfuscated, static, high-dimensional embeddings.
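Because the concept vectors are just CLIP text embeddings, a candidate guess for one of the hidden prompts can be scored by embedding it with CLIP’s text encoder and comparing it to the published vector. The sketch below uses the Hugging Face transformers CLIP model; the choice of CLIP variant, the use of “nudity” as a stand-in concept vector, and the comparison in the text-projection space are assumptions made purely for illustration.

import torch
from transformers import CLIPModel, CLIPTokenizer

# Assumption: this CLIP variant (ViT-L/14) matches the one used by the safety
# checker; the exact variant is not stated in this section.
MODEL_NAME = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(MODEL_NAME)
tokenizer = CLIPTokenizer.from_pretrained(MODEL_NAME)

def text_embedding(prompt: str) -> torch.Tensor:
    """L2-normalized CLIP text embedding of a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return features[0] / features[0].norm()

# Stand-in for one of the 17 published (obfuscated) concept vectors; here we
# simply embed a known word so the example is self-contained.
obfuscated_concept = text_embedding("nudity")

# A guessed prompt scores high only if it is semantically close to the concept
# hidden behind the obfuscated vector.
for guess in ["nudity", "violence", "a cute puppy"]:
    score = torch.dot(text_embedding(guess), obfuscated_concept).item()
    print(f"{guess!r}: cosine similarity = {score:.3f}")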
Special care concepts.
In addition to the procedure outlined above, the filter considers a set of particularly sensitive “special care concepts”. Specifically, if a generated image is close (in CLIP’s latent space) to any of these special care concepts, then the similarity thresholds for the 17 sensitive concepts above are lowered, so that filtering becomes more aggressive. This behavior is also undocumented.
The code shows there are three special care concepts, which again are only provided as embeddings.
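A sketch of what this adjustment might look like, reusing the cosine_similarity helper from the earlier sketch; the special-care embeddings, thresholds, and the 0.01 adjustment magnitude are placeholder assumptions, since the real values and logic live only in the undocumented code.

import numpy as np

def cosine_similarity(vector: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between a single vector and each row of a matrix."""
    vector = vector / np.linalg.norm(vector)
    matrix = matrix / np.linalg.norm(matrix, axis=-1, keepdims=True)
    return matrix @ vector

def is_unsafe_with_special_care(image_embedding: np.ndarray,
                                concept_embeddings: np.ndarray,  # (17, d) sensitive concepts
                                concept_thresholds: np.ndarray,  # (17,)
                                special_embeddings: np.ndarray,  # (3, d) special care concepts
                                special_thresholds: np.ndarray,  # (3,)
                                adjustment: float = 0.01         # assumed magnitude
                                ) -> bool:
    """If the image is close to any special care concept, lower every
    sensitive-concept threshold by `adjustment`, making the filter more
    aggressive, then apply the ordinary check."""
    special_sims = cosine_similarity(image_embedding, special_embeddings)
    if np.any(special_sims > special_thresholds):
        concept_thresholds = concept_thresholds - adjustment
    similarities = cosine_similarity(image_embedding, concept_embeddings)
    return bool(np.any(similarities > concept_thresholds))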
¹https://github.com/huggingface/diffusers/blob/84b9df5/src/diffusers/pipelines/stable_diffusion/safety_checker.py. Accessed 29/09/2022.