What the DAAM: Interpreting Stable Diffusion Using Cross Attention

Raphael Tang,1 Linqing Liu,2 Akshat Pandey,1 Zhiying Jiang,3 Gefei Yang,1 Karun Kumar,1 Pontus Stenetorp,2 Jimmy Lin,3 Ferhan Ture1

1Comcast Applied AI  2University College London  3University of Waterloo
1{raphael_tang,akshat_pandey,gefei_yang,karun_kumar,ferhan_ture}@comcast.com
2{linqing.liu,p.stenetorp}@cs.ucl.ac.uk  3{zhiying.jiang,jimmylin}@uwaterloo.ca
Abstract

Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text–image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce pixel-level attribution maps, we upscale and aggregate cross-attention word–pixel scores in the denoising subnetwork, naming our method DAAM. We evaluate its correctness by testing its semantic segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. We then apply DAAM to study the role of syntax in the pixel space, characterizing head–dependent heat map interaction patterns for ten common dependency relations. Finally, we study several semantic phenomena using DAAM, with a focus on feature entanglement, where we find that cohyponyms worsen generation quality and descriptive adjectives attend too broadly. To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research. Our code is at https://github.com/castorini/daam.
1 Introduction

Diffusion neural networks trained on billions of image–caption pairs represent the state of the art in text-to-image generation (Yang et al., 2022), with some achieving realism comparable to photographs in human evaluation, such as Google's Imagen (Saharia et al., 2022) and OpenAI's DALL-E 2 (Ramesh et al., 2022). However, despite their quality and popularity, the dynamics of their image synthesis remain undercharacterized. Citing ethical concerns, these organizations have restricted the general public from using the models and their weights, preventing effective white-box (or even black-box) analysis. To overcome this barrier, Stability AI recently open-sourced Stable Diffusion (Rombach et al., 2022), a 1.1 billion-parameter latent diffusion model pretrained and fine-tuned on the LAION 5-billion image dataset (Schuhmann et al., 2022).

Equal contribution.

Figure 1: The original synthesized image and three DAAM maps for "monkey," "hat," and "walking," from the prompt, "monkey with hat walking."
We probe Stable Diffusion to provide insight into the workings of large diffusion models. With a focus on text-to-image attribution, our central research question is, "How does an input word influence parts of a generated image?" To answer this, we first propose to produce two-dimensional attribution maps for each word by combining cross-attention maps in the model, as delineated in Section 2.2. A related work in prompt-guided editing from Hertz et al. (2022) conjectures that per-head cross attention relates words to areas in Imagen-generated images, but they fall short of constructing global per-word attribution maps. We name our method diffusion attentive attribution maps, or DAAM for short; see Figure 1 for an example.

To evaluate the veracity of DAAM, we apply it to a semantic segmentation task (Lin et al., 2014) on generated imagery, comparing DAAM maps with annotated segments. We attain a 58.9–64.8 mean intersection-over-union (mIoU) score, which is competitive with unsupervised segmentation models, as described in Section 3.1. We further bolster these noun attribution results with a generalized study covering all parts of speech, such as adjectives and verbs. Through human annotation, we show that the mean opinion score (MOS) is above fair to good (3.4–4.2) on interpretable words.
Next, we characterize how relationships in the syntactic space of prompts relate to those in the pixel space of images. We assess head–dependent DAAM map interactions across ten common syntactic relationships, finding that, for some, the heat map of the dependent strongly subsumes that of the head, while the opposite is true for others. For still others, such as coreferent word pairs, the words' maps greatly overlap, indicating identity. We assign visual intuition to our observations; for example, we conjecture that the maps of verbs contain those of their subjects, because verbs often contextualize both the subjects and their surroundings.

Finally, we form hypotheses to further examine our syntactic findings, studying semantic phenomena through the lens of DAAM, particularly those affecting generation quality. In Section 5.1, we demonstrate that, in constructed prompts with two distinct nouns, cohyponyms have worse quality; e.g., "a giraffe and a zebra" generates either a giraffe or a zebra, but not both. We observe that cohyponym status and generation incorrectness each increases the amount of overlap between the heat maps. We also show in Section 5.2 that descriptive adjectives attend too broadly across the image, far beyond the nouns they modify. If we hold the scene layout fixed (Hertz et al., 2022) and vary only the adjective, the entire image changes, not just the noun. These two phenomena suggest feature entanglement, where objects are entangled with both the scene and other objects.
In summary, our contributions are as follows: (1) we propose and evaluate an attribution method, novel within the context of interpreting diffusion models, measuring which parts of the generated image the words influence most; (2) we provide new insight into how syntactic relationships map to generated pixels, finding evidence for directional imbalance in head–dependent DAAM map overlap, alongside visual intuition (and counterintuition) in the behaviors of nominals, modifiers, and function words; and (3) we shine light on failure cases in diffusion models, showing that descriptive adjectival modifiers and cohyponyms result in entangled features and DAAM maps.
2 Our Approach

2.1 Preliminaries

Latent diffusion models (Rombach et al., 2022) are a class of denoising generative models trained to synthesize high-fidelity images from random noise through a gradual denoising process, optionally conditioned on text. They generally comprise three components: a deep language model like CLIP (Radford et al., 2021) for producing word embeddings; a variational autoencoder (VAE; Kingma and Welling, 2013), which encodes and decodes latent vectors for images; and a time-conditional U-Net (Ronneberger et al., 2015) for gradually denoising latent vectors. To generate an image, we initialize the latent vectors to random noise, feed in a text prompt, then iteratively denoise the latent vectors with the U-Net and decode the final vector into an image with the VAE.
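In code, this end-to-end procedure is what libraries such as HuggingFace Diffusers wrap for Stable Diffusion. The snippet below is a minimal usage sketch, not anything prescribed by this paper; the model id, device, and file name are our own illustrative choices.

```python
# Minimal end-to-end generation sketch with HuggingFace Diffusers.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-base")
pipe = pipe.to("cuda")  # the pipeline bundles the text encoder, U-Net, and VAE
image = pipe("monkey with hat walking").images[0]  # prompt from Figure 1
image.save("monkey_with_hat_walking.png")
```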
Formally, given an image, the VAE encodes it as a latent vector $\ell_{t_0} \in \mathbb{R}^d$. Define a forward "noise-injecting" Markov chain $p(\ell_{t_i} \mid \ell_{t_{i-1}}) := \mathcal{N}(\ell_{t_i}; \sqrt{1-\alpha_{t_i}}\,\ell_{t_0}, \alpha_{t_i}\mathbf{I})$, where $\{\alpha_{t_i}\}_{i=1}^{T}$ is defined following a schedule so that $p(\ell_{t_T})$ is approximately zero-mean isotropic. The corresponding denoising reverse chain is then parameterized as

$$p(\ell_{t_{i-1}} \mid \ell_{t_i}) := \mathcal{N}\Big(\ell_{t_{i-1}};\ \tfrac{1}{\sqrt{1-\alpha_{t_i}}}\big(\ell_{t_i} + \alpha_{t_i}\hat{\epsilon}_\theta(\ell_{t_i}, t_i)\big),\ \alpha_{t_i}\mathbf{I}\Big), \qquad (1)$$

for some denoising neural network $\hat{\epsilon}_\theta(\ell, t)$ with parameters $\theta$. Intuitively, the forward process iteratively adds noise to some signal at a fixed rate, while the reverse process, equipped with a neural network, removes noise until recovering the signal. To train the network, given caption–image pairs, we optimize

$$\min_\theta \sum_{i=1}^{T} \zeta_i\, \mathbb{E}_{p(\ell_{t_i} \mid \ell_{t_0})} \big\| \hat{\epsilon}_\theta(\ell_{t_i}, t_i) - \nabla_{\ell_{t_i}} \log p(\ell_{t_i} \mid \ell_{t_0}) \big\|_2^2, \qquad (2)$$

where $\{\zeta_i\}_{i=1}^{T}$ are constants computed as $\zeta_i := 1 - \prod_{j=1}^{i}(1-\alpha_j)$. The objective is a reweighted form of the evidence lower bound for score matching (Song et al., 2021). To generate a latent vector, we initialize $\hat{\ell}_{t_T}$ as Gaussian noise and iterate

$$\hat{\ell}_{t_{i-1}} = \tfrac{1}{\sqrt{1-\alpha_{t_i}}}\big(\hat{\ell}_{t_i} + \alpha_{t_i}\hat{\epsilon}_\theta(\hat{\ell}_{t_i}, t_i)\big) + \sqrt{\alpha_{t_i}}\, z_{t_i}, \qquad (3)$$

where $z_{t_i}$ is standard Gaussian noise. In practice, we apply various optimizations to improve the convergence of the above step, like modeling the reverse process as an ODE (Song et al., 2021), but this definition suffices for us. We can additionally condition the latent vectors on text and pass word embeddings $X := [x_1; \cdots; x_{l_W}]$ to $\hat{\epsilon}_\theta(\ell, t; X)$. Finally, the VAE decodes the denoised latent $\hat{\ell}_{t_0}$ to an image. For this paper, we use the publicly available weights of the state-of-the-art, 1.1 billion-parameter Stable Diffusion 2.0 model (Rombach et al., 2022), trained on 5 billion caption–image pairs (Schuhmann et al., 2022) and implemented in HuggingFace's Diffusers library (von Platen et al., 2022).
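To make the sampling step concrete, the sketch below renders Eqn. (3) schematically in PyTorch. It is an illustration under our own naming, not the optimized ODE-based sampler used in practice: `eps_model` is a hypothetical stand-in for the denoising network $\hat{\epsilon}_\theta$ and `alphas` for the noise schedule, and the final VAE decoding is omitted.

```python
# Schematic ancestral sampling loop following Eqn. (3).
import torch

def sample_latent(eps_model, text_emb, alphas, shape):
    latent = torch.randn(shape)                    # initialize \hat{\ell}_{t_T} as Gaussian noise
    for i in reversed(range(len(alphas))):         # steps t_T, ..., t_1
        alpha = alphas[i]
        eps = eps_model(latent, i, text_emb)       # \hat{\epsilon}_\theta(\hat{\ell}_{t_i}, t_i; X)
        latent = (latent + alpha * eps) / (1.0 - alpha) ** 0.5
        if i > 0:                                  # no fresh noise on the final step
            latent = latent + alpha ** 0.5 * torch.randn_like(latent)
    return latent                                  # \hat{\ell}_{t_0}, to be decoded by the VAE
```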
2.2 Diffusion Attentive Attribution Maps

Given a large-scale latent diffusion model for text-to-image synthesis, which parts of an image does each word influence most? One way to answer this question is through attribution approaches, which are mainly perturbation- and gradient-based (Alvarez-Melis and Jaakkola, 2018; Selvaraju et al., 2017), where saliency maps are constructed either from the first derivative of the output with respect to the input, or from input perturbation to see how the output changes. Unfortunately, gradient methods prove intractable due to needing a backpropagation pass for every pixel for all $T$ time steps, and even minor perturbations result in significantly different images in our pilot experiments.

Instead, we use ideas from natural language processing, where word attention was found to indicate lexical attribution (Clark et al., 2019), as well as the spatial layout of Imagen's images (Hertz et al., 2022). In diffusion models, attention mechanisms cross-contextualize text embeddings with coordinate-aware latent representations of the image (Rombach et al., 2022), outputting scores for each token–image patch pair. Attention scores lend themselves readily to interpretation since they are already normalized in $[0, 1]$. Thus, for pixelwise attribution, we propose to aggregate these scores over the spatiotemporal dimensions and interpolate them across the image.
We turn our attention to the denoising network $\hat{\epsilon}_\theta(\ell, t; X)$ responsible for the synthesis. While the subnetwork can take any form, U-Nets remain the popular choice (Ronneberger et al., 2015) due to their strong image segmentation ability. They consist of a series of downsampling convolutional blocks, each of which preserves some local context, followed by upsampling deconvolutional blocks, which restore the original input size to the output. Specifically, given a 2D latent $\ell_t \in \mathbb{R}^{w \times h}$, the downsampling blocks output a series of vectors $\{h_{i,t}^{\downarrow}\}_{i=1}^{K}$, where $h_{i,t}^{\downarrow} \in \mathbb{R}^{\lceil w/c^i \rceil \times \lceil h/c^i \rceil}$ for some $c > 1$. The upsampling blocks then iteratively upscale $h_{K,t}^{\downarrow}$ to $\{h_{i,t}^{\uparrow}\}_{i=K-1}^{0}$, with $h_{i,t}^{\uparrow} \in \mathbb{R}^{\lceil w/c^i \rceil \times \lceil h/c^i \rceil}$. To condition these representations on word embeddings, Rombach et al. (2022) use multi-headed cross-attention layers (Vaswani et al., 2017):

$$h_{i,t}^{\downarrow} := F_t^{(i)}(\hat{h}_{i,t}^{\downarrow}, X) \cdot \big(W_v^{(i)} X\big), \qquad (4)$$

$$F_t^{(i)}(\hat{h}_{i,t}^{\downarrow}, X) := \mathrm{softmax}\!\left( \big(W_q^{(i)} \hat{h}_{i,t}^{\downarrow}\big)\big(W_k^{(i)} X\big)^{\mathsf{T}} / \sqrt{d} \right), \qquad (5)$$

where $F_t^{(i)} \in \mathbb{R}^{\lceil w/c^i \rceil \times \lceil h/c^i \rceil \times l_H \times l_W}$ and $W_k$, $W_q$, and $W_v$ are projection matrices with $l_H$ attention heads. The same mechanism applies when upsampling $h_{i,t}^{\uparrow}$. For brevity, we denote the respective attention score arrays as $F_t^{(i)\downarrow}$ and $F_t^{(i)\uparrow}$, and we implicitly broadcast matrix multiplications as per NumPy convention (Harris et al., 2020).

Figure 2: Illustration of computing DAAM for some word: the multiscale attention arrays from Eqn. (5) (see A); the bicubic interpolation (B) resulting in expanded maps (C); summing the heat maps across the layers (D), as in Eqn. (6); and the thresholding (E) from Eqn. (7).
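For concreteness, the following is a toy, single-head rendering of Eqns. (4) and (5) under simplified shapes (flattened spatial features, one head, no output projection); the tensor names are ours, not the Diffusers implementation.

```python
# Toy single-head cross attention: queries come from flattened image features
# h_hat of shape (hw, d), keys/values from word embeddings X of shape (l_W, d_text).
import math
import torch

def cross_attention(h_hat, X, W_q, W_k, W_v):
    q = h_hat @ W_q                   # (hw, d_k): one query per image patch
    keys = X @ W_k                    # (l_W, d_k): one key per word
    vals = X @ W_v                    # (l_W, d_v): one value per word
    scores = torch.softmax(q @ keys.T / math.sqrt(q.shape[-1]), dim=-1)  # Eqn. (5)
    out = scores @ vals               # Eqn. (4): text-conditioned patch features
    return out, scores                # column scores[:, k] is what DAAM aggregates for word k
```

Reshaping the score matrix back to the block's spatial resolution gives the per-word map that the next step upscales and sums.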
Spatiotemporal aggregation. $F_t^{(i)\downarrow}[x, y, \ell, k]$ is normalized to $[0, 1]$ and connects the $k$th word to the intermediate coordinate $(x, y)$ for the $i$th downsampling block and $\ell$th head. Due to the fully convolutional nature of the U-Net (and the VAE), the intermediate coordinates locally map to a surrounding affected square area in the final image, the scores thus relating each word to that image patch. However, different layers produce heat maps with varying scales, the deepest ones being the coarsest (e.g., $h_{K,t}^{\downarrow}$ and $h_{K-1,t}^{\uparrow}$), requiring spatial normalization to create a single heat map. To do this, we upscale all intermediate attention score arrays to the original image size using bicubic interpolation, then sum them over the heads, layers, and time steps:

$$D_k^{\mathcal{R}}[x, y] := \sum_{i,j,\ell} \tilde{F}_{t_j,k,\ell}^{(i)\downarrow}[x, y] + \tilde{F}_{t_j,k,\ell}^{(i)\uparrow}[x, y], \qquad (6)$$

where $k$ is the $k$th word and $\tilde{F}_{t_j,k,\ell}^{(i)}[x, y]$ is shorthand for $F_{t_j}^{(i)}[x, y, \ell, k]$, bicubically upscaled to the fixed size $(w, h)$.$^1$ Since $D_k^{\mathcal{R}}$ is positive and scale-normalized (summing normalized values preserves linear scale), we can visualize it as a soft heat map, with higher values having greater attribution. To generate a hard, binary heat map (either a pixel is influenced or not), we can threshold $D_k^{\mathcal{R}}$ as

$$D_k^{\mathcal{I},\tau}[x, y] := \mathbb{I}\!\left( D_k^{\mathcal{R}}[x, y] \geq \tau \max_{i,j} D_k^{\mathcal{R}}[i, j] \right), \qquad (7)$$

where $\mathbb{I}(\cdot)$ is the indicator function and $\tau \in [0, 1]$. See Figure 2 for an illustration of DAAM.

$^1$We show that aggregating across all time steps and layers is indeed necessary in Section A.1.
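The sketch below is a minimal rendering of Eqns. (6) and (7), assuming the per-word attention maps have already been collected from every layer, head, and time step; the function and argument names are ours, not the released DAAM package's API, and the default threshold is only illustrative.

```python
# Minimal DAAM aggregation: bicubically upscale each per-word attention map to
# the image size, sum over heads, layers, and time steps (Eqn. 6), then
# threshold against the maximum (Eqn. 7).
import torch
import torch.nn.functional as F

def daam_heat_map(attn_maps, out_size=(512, 512), tau=0.4):
    """attn_maps: list of (heads, h_i, w_i) tensors, one per layer and time
    step, already sliced to the k-th word; tau is the free threshold in [0, 1]."""
    heat = torch.zeros(out_size)
    for a in attn_maps:
        up = F.interpolate(a.unsqueeze(1), size=out_size,
                           mode="bicubic", align_corners=False)  # upscale each head's map
        heat += up.squeeze(1).sum(dim=0)                         # sum over heads
    mask = heat >= tau * heat.max()                              # hard, binary heat map
    return heat, mask
```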