What the DAAM: Interpreting Stable Diffusion Using Cross Attention

Raphael Tang,1 Linqing Liu,2 Akshat Pandey,1 Zhiying Jiang,3 Gefei Yang,1 Karun Kumar,1 Pontus Stenetorp,2 Jimmy Lin,3 Ferhan Ture1

1Comcast Applied AI  2University College London  3University of Waterloo
1{raphael_tang,akshat_pandey,gefei_yang,karun_kumar,ferhan_ture}@comcast.com
2{linqing.liu,p.stenetorp}@cs.ucl.ac.uk  3{zhiying.jiang,jimmylin}@uwaterloo.ca
Abstract

Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text–image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce pixel-level attribution maps, we upscale and aggregate cross-attention word–pixel scores in the denoising subnetwork, naming our method DAAM. We evaluate its correctness by testing its semantic segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. We then apply DAAM to study the role of syntax in the pixel space, characterizing head–dependent heat map interaction patterns for ten common dependency relations. Finally, we study several semantic phenomena using DAAM, with a focus on feature entanglement, where we find that cohyponyms worsen generation quality and descriptive adjectives attend too broadly. To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research. Our code is at https://github.com/castorini/daam.
1 Introduction

Diffusion neural networks trained on billions of image–caption pairs represent the state of the art in text-to-image generation (Yang et al., 2022), with some achieving realism comparable to photographs in human evaluation, such as Google's Imagen (Saharia et al., 2022) and OpenAI's DALL-E 2 (Ramesh et al., 2022). However, despite their quality and popularity, the dynamics of their image synthesis remain undercharacterized. Citing ethical concerns, these organizations have restricted the general public from using the models and their weights, preventing effective white-box (or even black-box) analysis. To overcome this barrier, Stability AI recently open-sourced Stable Diffusion (Rombach et al., 2022), a 1.1 billion-parameter latent diffusion model pretrained and fine-tuned on the LAION 5-billion image dataset (Schuhmann et al., 2022).

Equal contribution.

Figure 1: The original synthesized image and three DAAM maps for "monkey," "hat," and "walking," from the prompt, "monkey with hat walking."
We probe Stable Diffusion to provide insight into the workings of large diffusion models. With a focus on text-to-image attribution, our central research question is, "How does an input word influence parts of a generated image?" To answer this, we first propose to produce two-dimensional attribution maps for each word by combining cross-attention maps in the model, as delineated in Section 2.2. A related work in prompt-guided editing from Hertz et al. (2022) conjectures that per-head cross attention relates words to areas in Imagen-generated images, but they fall short of constructing global per-word attribution maps. We name our method diffusion attentive attribution maps, or DAAM for short; see Figure 1 for an example.

To evaluate the veracity of DAAM, we apply it to a semantic segmentation task (Lin et al., 2014) on generated imagery, comparing DAAM maps with annotated segments. We attain a 58.9–64.8 mean intersection-over-union (mIoU) score, which is competitive with unsupervised segmentation models, as described in Section 3.1. We further bolster these noun attribution results with a generalized study covering all parts of speech, such as adjectives and verbs. Through human annotation, we show that the mean opinion score (MOS) is above fair to good (3.4–4.2) on interpretable words.
Next, we characterize how relationships in the syntactic space of prompts relate to those in the pixel space of images. We assess head–dependent DAAM map interactions across ten common syntactic relationships, finding that, for some, the heat map of the dependent strongly subsumes that of the head, while the opposite is true for others. For still others, such as coreferent word pairs, the words' maps greatly overlap, indicating identity. We assign visual intuition to our observations; for example, we conjecture that the maps of verbs contain those of their subjects, because verbs often contextualize both the subjects and their surroundings.

Finally, we form hypotheses to further examine our syntactic findings, studying semantic phenomena through the lens of DAAM, particularly those affecting generation quality. In Section 5.1, we demonstrate that, in constructed prompts with two distinct nouns, cohyponyms have worse quality; e.g., "a giraffe and a zebra" generates either a giraffe or a zebra, but not both. We observe that cohyponym status and generation incorrectness each increases the amount of overlap between the heat maps. We also show in Section 5.2 that descriptive adjectives attend too broadly across the image, far beyond the nouns they modify. If we hold the scene layout fixed (Hertz et al., 2022) and vary only the adjective, the entire image changes, not just the noun. These two phenomena suggest feature entanglement, where objects are entangled with both the scene and other objects.
In summary, our contributions are as follows: (1) we propose and evaluate an attribution method, novel within the context of interpreting diffusion models, measuring which parts of the generated image the words influence most; (2) we provide new insight into how syntactic relationships map to generated pixels, finding evidence for directional imbalance in head–dependent DAAM map overlap, alongside visual intuition (and counterintuition) in the behaviors of nominals, modifiers, and function words; and (3) we shine light on failure cases in diffusion models, showing that descriptive adjectival modifiers and cohyponyms result in entangled features and DAAM maps.
2 Our Approach

2.1 Preliminaries

Latent diffusion models (Rombach et al., 2022) are a class of denoising generative models trained to synthesize high-fidelity images from random noise through a gradual denoising process, optionally conditioned on text. They generally comprise three components: a deep language model like CLIP (Radford et al., 2021) for producing word embeddings; a variational autoencoder (VAE; Kingma and Welling, 2013), which encodes and decodes latent vectors for images; and a time-conditional U-Net (Ronneberger et al., 2015) for gradually denoising latent vectors. To generate an image, we initialize the latent vectors to random noise, feed in a text prompt, then iteratively denoise the latent vectors with the U-Net and decode the final vector into an image with the VAE.
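In code, this end-to-end procedure is what libraries such as HuggingFace Diffusers wrap for Stable Diffusion. The snippet below is a minimal usage sketch, not anything prescribed by this paper; the model id, device, and file name are our own illustrative choices.

```python
# Minimal end-to-end generation sketch with HuggingFace Diffusers.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-base")
pipe = pipe.to("cuda")  # the pipeline bundles the text encoder, U-Net, and VAE
image = pipe("monkey with hat walking").images[0]  # prompt from Figure 1
image.save("monkey_with_hat_walking.png")
```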
Formally, given an image, the VAE encodes it as a latent vector $\ell_{t_0} \in \mathbb{R}^d$. Define a forward "noise-injecting" Markov chain $p(\ell_{t_i} \mid \ell_{t_{i-1}}) := \mathcal{N}(\ell_{t_i}; \sqrt{1-\alpha_{t_i}}\,\ell_{t_0}, \alpha_{t_i}\mathbf{I})$, where $\{\alpha_{t_i}\}_{i=1}^{T}$ is defined following a schedule so that $p(\ell_{t_T})$ is approximately zero-mean isotropic. The corresponding denoising reverse chain is then parameterized as

$$p(\ell_{t_{i-1}} \mid \ell_{t_i}) := \mathcal{N}\Big(\ell_{t_{i-1}};\ \tfrac{1}{\sqrt{1-\alpha_{t_i}}}\big(\ell_{t_i} + \alpha_{t_i}\hat{\epsilon}_\theta(\ell_{t_i}, t_i)\big),\ \alpha_{t_i}\mathbf{I}\Big), \qquad (1)$$

for some denoising neural network $\hat{\epsilon}_\theta(\ell, t)$ with parameters $\theta$. Intuitively, the forward process iteratively adds noise to some signal at a fixed rate, while the reverse process, equipped with a neural network, removes noise until recovering the signal. To train the network, given caption–image pairs, we optimize

$$\min_\theta \sum_{i=1}^{T} \zeta_i\, \mathbb{E}_{p(\ell_{t_i} \mid \ell_{t_0})} \big\| \hat{\epsilon}_\theta(\ell_{t_i}, t_i) - \nabla_{\ell_{t_i}} \log p(\ell_{t_i} \mid \ell_{t_0}) \big\|_2^2, \qquad (2)$$

where $\{\zeta_i\}_{i=1}^{T}$ are constants computed as $\zeta_i := 1 - \prod_{j=1}^{i}(1-\alpha_j)$. The objective is a reweighted form of the evidence lower bound for score matching (Song et al., 2021). To generate a latent vector, we initialize $\hat{\ell}_{t_T}$ as Gaussian noise and iterate

$$\hat{\ell}_{t_{i-1}} = \tfrac{1}{\sqrt{1-\alpha_{t_i}}}\big(\hat{\ell}_{t_i} + \alpha_{t_i}\hat{\epsilon}_\theta(\hat{\ell}_{t_i}, t_i)\big) + \sqrt{\alpha_{t_i}}\, z_{t_i}, \qquad (3)$$

where $z_{t_i}$ is standard Gaussian noise. In practice, we apply various optimizations to improve the convergence of the above step, like modeling the reverse process as an ODE (Song et al., 2021), but this definition suffices for us. We can additionally condition the latent vectors on text and pass word embeddings $X := [x_1; \cdots; x_{l_W}]$ to $\hat{\epsilon}_\theta(\ell, t; X)$. Finally, the VAE decodes the denoised latent $\hat{\ell}_{t_0}$ to an image. For this paper, we use the publicly available weights of the state-of-the-art, 1.1 billion-parameter Stable Diffusion 2.0 model (Rombach et al., 2022), trained on 5 billion caption–image pairs (Schuhmann et al., 2022) and implemented in HuggingFace's Diffusers library (von Platen et al., 2022).
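To make the sampling step concrete, the sketch below renders Eqn. (3) schematically in PyTorch. It is an illustration under our own naming, not the optimized ODE-based sampler used in practice: `eps_model` is a hypothetical stand-in for the denoising network $\hat{\epsilon}_\theta$ and `alphas` for the noise schedule, and the final VAE decoding is omitted.

```python
# Schematic ancestral sampling loop following Eqn. (3).
import torch

def sample_latent(eps_model, text_emb, alphas, shape):
    latent = torch.randn(shape)                    # initialize \hat{\ell}_{t_T} as Gaussian noise
    for i in reversed(range(len(alphas))):         # steps t_T, ..., t_1
        alpha = alphas[i]
        eps = eps_model(latent, i, text_emb)       # \hat{\epsilon}_\theta(\hat{\ell}_{t_i}, t_i; X)
        latent = (latent + alpha * eps) / (1.0 - alpha) ** 0.5
        if i > 0:                                  # no fresh noise on the final step
            latent = latent + alpha ** 0.5 * torch.randn_like(latent)
    return latent                                  # \hat{\ell}_{t_0}, to be decoded by the VAE
```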
2.2 Diffusion Attentive Attribution Maps

Given a large-scale latent diffusion model for text-to-image synthesis, which parts of an image does each word influence most? One way to answer this question is through attribution approaches, which are mainly perturbation- and gradient-based (Alvarez-Melis and Jaakkola, 2018; Selvaraju et al., 2017), where saliency maps are constructed either from the first derivative of the output with respect to the input, or from input perturbation to see how the output changes. Unfortunately, gradient methods prove intractable due to needing a backpropagation pass for every pixel for all $T$ time steps, and even minor perturbations result in significantly different images in our pilot experiments.

Instead, we use ideas from natural language processing, where word attention was found to indicate lexical attribution (Clark et al., 2019), as well as the spatial layout of Imagen's images (Hertz et al., 2022). In diffusion models, attention mechanisms cross-contextualize text embeddings with coordinate-aware latent representations of the image (Rombach et al., 2022), outputting scores for each token–image patch pair. Attention scores lend themselves readily to interpretation since they are already normalized in $[0, 1]$. Thus, for pixelwise attribution, we propose to aggregate these scores over the spatiotemporal dimensions and interpolate them across the image.
We turn our attention to the denoising network $\hat{\epsilon}_\theta(\ell, t; X)$ responsible for the synthesis. While the subnetwork can take any form, U-Nets remain the popular choice (Ronneberger et al., 2015) due to their strong image segmentation ability. They consist of a series of downsampling convolutional blocks, each of which preserves some local context, followed by upsampling deconvolutional blocks, which restore the original input size to the output. Specifically, given a 2D latent $\ell_t \in \mathbb{R}^{w \times h}$, the downsampling blocks output a series of vectors $\{h_{i,t}^{\downarrow}\}_{i=1}^{K}$, where $h_{i,t}^{\downarrow} \in \mathbb{R}^{\lceil w/c^i \rceil \times \lceil h/c^i \rceil}$ for some $c > 1$. The upsampling blocks then iteratively upscale $h_{K,t}^{\downarrow}$ to $\{h_{i,t}^{\uparrow}\}_{i=K-1}^{0}$, with $h_{i,t}^{\uparrow} \in \mathbb{R}^{\lceil w/c^i \rceil \times \lceil h/c^i \rceil}$. To condition these representations on word embeddings, Rombach et al. (2022) use multi-headed cross-attention layers (Vaswani et al., 2017):

$$h_{i,t}^{\downarrow} := F_t^{(i)}(\hat{h}_{i,t}^{\downarrow}, X) \cdot \big(W_v^{(i)} X\big), \qquad (4)$$

$$F_t^{(i)}(\hat{h}_{i,t}^{\downarrow}, X) := \mathrm{softmax}\!\left( \big(W_q^{(i)} \hat{h}_{i,t}^{\downarrow}\big)\big(W_k^{(i)} X\big)^{\mathsf{T}} / \sqrt{d} \right), \qquad (5)$$

where $F_t^{(i)} \in \mathbb{R}^{\lceil w/c^i \rceil \times \lceil h/c^i \rceil \times l_H \times l_W}$ and $W_k$, $W_q$, and $W_v$ are projection matrices with $l_H$ attention heads. The same mechanism applies when upsampling $h_{i,t}^{\uparrow}$. For brevity, we denote the respective attention score arrays as $F_t^{(i)\downarrow}$ and $F_t^{(i)\uparrow}$, and we implicitly broadcast matrix multiplications as per NumPy convention (Harris et al., 2020).

Figure 2: Illustration of computing DAAM for some word: the multiscale attention arrays from Eqn. (5) (see A); the bicubic interpolation (B) resulting in expanded maps (C); summing the heat maps across the layers (D), as in Eqn. (6); and the thresholding (E) from Eqn. (7).
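For concreteness, the following is a toy, single-head rendering of Eqns. (4) and (5) under simplified shapes (flattened spatial features, one head, no output projection); the tensor names are ours, not the Diffusers implementation.

```python
# Toy single-head cross attention: queries come from flattened image features
# h_hat of shape (hw, d), keys/values from word embeddings X of shape (l_W, d_text).
import math
import torch

def cross_attention(h_hat, X, W_q, W_k, W_v):
    q = h_hat @ W_q                   # (hw, d_k): one query per image patch
    keys = X @ W_k                    # (l_W, d_k): one key per word
    vals = X @ W_v                    # (l_W, d_v): one value per word
    scores = torch.softmax(q @ keys.T / math.sqrt(q.shape[-1]), dim=-1)  # Eqn. (5)
    out = scores @ vals               # Eqn. (4): text-conditioned patch features
    return out, scores                # column scores[:, k] is what DAAM aggregates for word k
```

Reshaping the score matrix back to the block's spatial resolution gives the per-word map that the next step upscales and sums.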
Spatiotemporal aggregation. $F_t^{(i)\downarrow}[x, y, \ell, k]$ is normalized to $[0, 1]$ and connects the $k$th word to the intermediate coordinate $(x, y)$ for the $i$th downsampling block and $\ell$th head. Due to the fully convolutional nature of the U-Net (and the VAE), the intermediate coordinates locally map to a surrounding affected square area in the final image, the scores thus relating each word to that image patch. However, different layers produce heat maps with varying scales, the deepest ones being the coarsest (e.g., $h_{K,t}^{\downarrow}$ and $h_{K-1,t}^{\uparrow}$), requiring spatial normalization to create a single heat map. To do this, we upscale all intermediate attention score arrays to the original image size using bicubic interpolation, then sum them over the heads, layers, and time steps:

$$D_k^{\mathcal{R}}[x, y] := \sum_{i,j,\ell} \tilde{F}_{t_j,k,\ell}^{(i)\downarrow}[x, y] + \tilde{F}_{t_j,k,\ell}^{(i)\uparrow}[x, y], \qquad (6)$$

where $k$ is the $k$th word and $\tilde{F}_{t_j,k,\ell}^{(i)}[x, y]$ is shorthand for $F_{t_j}^{(i)}[x, y, \ell, k]$, bicubically upscaled to the fixed size $(w, h)$.$^1$ Since $D_k^{\mathcal{R}}$ is positive and scale-normalized (summing normalized values preserves linear scale), we can visualize it as a soft heat map, with higher values having greater attribution. To generate a hard, binary heat map (either a pixel is influenced or not), we can threshold $D_k^{\mathcal{R}}$ as

$$D_k^{\mathcal{I},\tau}[x, y] := \mathbb{I}\!\left( D_k^{\mathcal{R}}[x, y] \geq \tau \max_{i,j} D_k^{\mathcal{R}}[i, j] \right), \qquad (7)$$

where $\mathbb{I}(\cdot)$ is the indicator function and $\tau \in [0, 1]$. See Figure 2 for an illustration of DAAM.

$^1$We show that aggregating across all time steps and layers is indeed necessary in Section A.1.
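The sketch below is a minimal rendering of Eqns. (6) and (7), assuming the per-word attention maps have already been collected from every layer, head, and time step; the function and argument names are ours, not the released DAAM package's API, and the default threshold is only illustrative.

```python
# Minimal DAAM aggregation: bicubically upscale each per-word attention map to
# the image size, sum over heads, layers, and time steps (Eqn. 6), then
# threshold against the maximum (Eqn. 7).
import torch
import torch.nn.functional as F

def daam_heat_map(attn_maps, out_size=(512, 512), tau=0.4):
    """attn_maps: list of (heads, h_i, w_i) tensors, one per layer and time
    step, already sliced to the k-th word; tau is the free threshold in [0, 1]."""
    heat = torch.zeros(out_size)
    for a in attn_maps:
        up = F.interpolate(a.unsqueeze(1), size=out_size,
                           mode="bicubic", align_corners=False)  # upscale each head's map
        heat += up.squeeze(1).sum(dim=0)                         # sum over heads
    mask = heat >= tau * heat.max()                              # hard, binary heat map
    return heat, mask
```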