
Next, we characterize how relationships in the
syntactic space of prompts relate to those in the
pixel space of images. We assess head–dependent
DAAM map interactions across ten common syn-
tactic relationships, finding that, for some, the heat
map of the dependent strongly subsumes that of the
head, while the opposite is true for others. For still
others, such as coreferent word pairs, the words’
maps greatly overlap, indicating identity. We as-
sign visual intuition to our observations; for exam-
ple, we conjecture that the maps of verbs contain
those of their subjects, because verbs often contex-
tualize both the subjects and their surroundings.
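To make the comparison concrete, one simple way to quantify such directional subsumption between two heat maps is the fraction of one word's salient pixels that fall inside the other's; the sketch below is illustrative only, with a hypothetical binarization threshold, and is not necessarily the exact statistic used in our analysis.

```python
import numpy as np

def containment(dependent_map: np.ndarray, head_map: np.ndarray, tau: float = 0.4) -> float:
    """Illustrative directional overlap: the fraction of the head's salient
    pixels that also lie in the dependent's salient region. A value near 1.0
    suggests the dependent's map subsumes the head's. `tau` is an assumed
    relative threshold, not a value prescribed here."""
    dep = dependent_map >= tau * dependent_map.max()
    head = head_map >= tau * head_map.max()
    if head.sum() == 0:
        return 0.0
    return float(np.logical_and(dep, head).sum() / head.sum())

# Toy usage with random arrays standing in for real DAAM heat maps.
rng = np.random.default_rng(0)
verb_map, subject_map = rng.random((64, 64)), rng.random((64, 64))
print(containment(verb_map, subject_map))
```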
Finally, we form hypotheses to further examine
our syntactic findings, studying semantic phenom-
ena through the lens of DAAM, particularly those
affecting the generation quality. In Section 5.1,
we demonstrate that, in constructed prompts with two distinct nouns, cohyponym pairs yield worse generation quality; e.g., “a giraffe and a zebra” generates either a giraffe or a zebra, but not both. We observe that cohyponym status and generation incorrectness each increase the amount of overlap between the heat
maps. We also show in Section 5.2 that descriptive
adjectives attend too broadly across the image, far
beyond the nouns they modify. If we hold the scene
layout fixed (Hertz et al., 2022) and vary only the
adjective, the entire image changes, not just the
noun. These two phenomena suggest feature entan-
glement, where objects are entangled with both the
scene and other objects.
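The “amount of overlap” referred to above can be operationalized, for illustration, as an intersection-over-union between binarized heat maps; the snippet below is a minimal sketch under the same assumptions as before (normalized maps, an assumed threshold), not our exact measurement.

```python
import numpy as np

def heat_map_iou(map_a: np.ndarray, map_b: np.ndarray, tau: float = 0.4) -> float:
    """Symmetric overlap (intersection over union) between two words' heat
    maps, e.g., those of "giraffe" and "zebra" in "a giraffe and a zebra".
    `tau` is an assumed relative binarization threshold."""
    a = map_a >= tau * map_a.max()
    b = map_b >= tau * map_b.max()
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0
```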
In summary, our contributions are as follows: (1) we propose and evaluate an attribution method, novel within the context of interpreting diffusion models, measuring which parts of the generated image the words influence most; (2) we provide new insight into how syntactic relationships map to generated pixels, finding evidence for directional imbalance in head–dependent DAAM map overlap, alongside visual intuition (and counterintuition) in the behaviors of nominals, modifiers, and function words; and (3) we shine light on failure cases in diffusion models, showing that descriptive adjectival modifiers and cohyponyms result in entangled features and DAAM maps.
2 Our Approach
2.1 Preliminaries
Latent diffusion models (Rombach et al., 2022)
are a class of denoising generative models that are
trained to synthesize high-fidelity images from ran-
dom noise through a gradual denoising process, op-
tionally conditioned on text. They generally com-
prise three components: a deep language model
like CLIP (Radford et al., 2021) for producing
word embeddings; a variational autoencoder (VAE;
Kingma and Welling, 2013), which encodes and
decodes latent vectors for images; and a time-
conditional U-Net (Ronneberger et al., 2015) for
gradually denoising latent vectors. To generate an
image, we initialize the latent vectors to random
noise, feed in a text prompt, then iteratively denoise
the latent vectors with the U-Net and decode the
final vector into an image with the VAE.
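As a concrete (purely illustrative) sketch of this generation loop, the snippet below uses the HuggingFace Diffusers library mentioned at the end of this section; the checkpoint identifier and generation settings are assumptions for illustration rather than our exact experimental configuration.

```python
import torch
from diffusers import StableDiffusionPipeline

# Loads the text encoder, VAE, U-Net, and noise scheduler as one pipeline.
# The checkpoint id below is an assumption for illustration.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to("cuda")

# Initialize random latent noise, iteratively denoise it conditioned on the
# prompt with the U-Net, then decode the final latent into pixels with the VAE.
image = pipe("a giraffe and a zebra", num_inference_steps=50).images[0]
image.save("giraffe_and_zebra.png")
```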
Formally, given an image, the VAE encodes it as a latent vector $\ell_{t_0} \in \mathbb{R}^d$. Define a forward “noise injecting” Markov chain $p(\ell_{t_i} \mid \ell_{t_{i-1}}) := \mathcal{N}(\ell_{t_i}; \sqrt{1 - \alpha_{t_i}}\,\ell_{t_0}, \alpha_{t_i} I)$, where $\{\alpha_{t_i}\}_{i=1}^{T}$ is defined following a schedule so that $p(\ell_{t_T})$ is approximately zero-mean isotropic. The corresponding denoising reverse chain is then parameterized as
\[
p(\ell_{t_{i-1}} \mid \ell_{t_i}) := \mathcal{N}\!\Big(\ell_{t_{i-1}};\ \tfrac{1}{\sqrt{1 - \alpha_{t_i}}}\big(\ell_{t_i} + \alpha_{t_i}\,\epsilon_\theta(\ell_{t_i}, t_i)\big),\ \alpha_{t_i} I\Big), \quad (1)
\]
for some denoising neural network $\epsilon_\theta(\ell, t)$ with parameters $\theta$. Intuitively, the forward process iteratively adds noise to some signal at a fixed rate, while the reverse process, equipped with a neural network, removes noise until recovering the signal.
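As a toy numerical illustration (with assumed values, not a schedule we actually use), one forward step amounts to mixing the clean latent with Gaussian noise:

```python
import torch

# One forward "noise injecting" step: l_ti ~ N(sqrt(1 - alpha_ti) * l_t0, alpha_ti * I).
l_t0 = torch.randn(4)      # stand-in for a clean image latent
alpha_ti = 0.3             # assumed value from the noise schedule
l_ti = (1 - alpha_ti) ** 0.5 * l_t0 + alpha_ti ** 0.5 * torch.randn(4)
print(l_t0, l_ti)
```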
To train the network, given caption–image pairs, we optimize
\[
\min_\theta \sum_{i=1}^{T} \zeta_i\, \mathbb{E}_{p(\ell_{t_i} \mid \ell_{t_0})} \big\|\epsilon_\theta(\ell_{t_i}, t_i) - \nabla_{\ell_{t_i}} \log p(\ell_{t_i} \mid \ell_{t_0})\big\|_2^2, \quad (2)
\]
where $\{\zeta_i\}_{i=1}^{T}$ are constants computed as $\zeta_i := 1 - \prod_{j=1}^{i}(1 - \alpha_j)$. The objective is a reweighted form of the evidence lower bound for score matching (Song et al., 2021). To generate a latent vector, we initialize $\hat{\ell}_{t_T}$ as Gaussian noise and iterate
\[
\hat{\ell}_{t_{i-1}} = \tfrac{1}{\sqrt{1 - \alpha_{t_i}}}\big(\hat{\ell}_{t_i} + \alpha_{t_i}\,\epsilon_\theta(\hat{\ell}_{t_i}, t_i)\big) + \sqrt{\alpha_{t_i}}\, z_{t_i}, \quad (3)
\]
where each $z_{t_i}$ is standard Gaussian noise.
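The iteration in Eq. (3) can be sketched in a few lines of code; here an untrained placeholder module stands in for $\epsilon_\theta$ (and ignores the timestep), and the schedule values are assumptions, so the sketch only illustrates the control flow rather than a working generator.

```python
import torch

eps_theta = torch.nn.Linear(4, 4)        # placeholder for the U-Net denoiser
alphas = torch.linspace(0.3, 0.02, 50)   # assumed noise schedule, t_T down to t_1
l_hat = torch.randn(4)                   # \hat{l}_{t_T}: pure Gaussian noise

with torch.no_grad():
    for a in alphas:                     # apply Eq. (3) from t_T down to t_1
        z = torch.randn(4)               # z_{t_i}: standard Gaussian noise
        l_hat = (l_hat + a * eps_theta(l_hat)) / (1 - a).sqrt() + a.sqrt() * z
```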
In practice, we apply various optimizations to improve the convergence of the above step, like modeling the reverse process as an ODE (Song et al., 2021), but this definition suffices for us. We can additionally condition the latent vectors on text and pass word embeddings $X := [x_1; \cdots; x_{l_W}]$ to $\epsilon_\theta(\ell, t; X)$. Finally, the VAE decodes the denoised latent $\hat{\ell}_{t_0}$ to an image. For this paper, we use the publicly available weights of the state-of-the-art, 1.1 billion-parameter Stable Diffusion 2.0 model (Rombach et al., 2022), trained on 5 billion caption–image pairs (Schuhmann et al., 2022) and implemented in HuggingFace’s Diffusers library (von Platen et al., 2022).
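To make the text conditioning concrete, the sketch below shows how the word embeddings $X$ reach the U-Net in the Diffusers implementation; the checkpoint id, latent resolution, and timestep are assumptions for illustration, and in practice this call sits inside the sampling loop of Eq. (3).

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint id; loads the tokenizer, CLIP text encoder, U-Net, and VAE.
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-base")

# Word embeddings X = [x_1; ...; x_{l_W}] produced from the prompt.
tokens = pipe.tokenizer(
    "a giraffe and a zebra", padding="max_length",
    max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
)
X = pipe.text_encoder(tokens.input_ids)[0]

# One conditioned denoising call eps_theta(l, t; X) on a noisy latent
# (assumed 64x64 latent resolution, arbitrary timestep).
latents = torch.randn(1, pipe.unet.config.in_channels, 64, 64)
noise_pred = pipe.unet(latents, torch.tensor(999), encoder_hidden_states=X).sample
print(noise_pred.shape)
```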