DIFFUSIONDB: A Large-scale Prompt Gallery Dataset for Text-to-Image
Generative Models
Zijie J. Wang¹, Evan Montoya¹, David Munechika¹,
Haoyang Yang¹, Benjamin Hoover¹,², Duen Horng Chau¹
¹Georgia Tech  ²IBM Research
{jayw|emontoya30|david.munechika|alexanderyang|bhoov|polo}@gatech.edu
Abstract
With recent advancements in diffusion models,
users can generate high-quality images by writ-
ing text prompts in natural language. However,
generating images with desired details requires
proper prompts, and it is often unclear how a
model reacts to different prompts or what the
best prompts are. To help researchers tackle
these critical challenges, we introduce DIF-
FUSIONDB, the first large-scale text-to-image
prompt dataset totaling 6.5TB, containing 14
million images generated by Stable Diffusion,
1.8 million unique prompts, and hyperpa-
rameters specified by real users. We analyze
the syntactic and semantic characteristics of
prompts. We pinpoint specific hyperparameter
values and prompt styles that can lead to model
errors and present evidence of potentially
harmful model usage, such as the generation
of misinformation. The unprecedented scale
and diversity of this human-actuated dataset
provide exciting research opportunities in
understanding the interplay between prompts
and generative models, detecting deepfakes,
and designing human-AI interaction tools
to help users more easily use these models.
DIFFUSIONDB is publicly available at:
https://poloclub.github.io/diffusiondb.
1 Introduction
Recent diffusion models have gained immense pop-
ularity by enabling high-quality and controllable
image generation based on text prompts written in
natural language (Rombach et al.,2022;Ramesh
et al.,2022;Saharia et al.,2022). Since the re-
lease of these models, people from different do-
mains have quickly applied them to create award-
winning artworks (Roose,2022), synthetic radi-
ology images (Chambon et al.,2022), and even
hyper-realistic videos (Ho et al.,2022).
However, generating images with desired de-
tails is difficult, as it requires users to write proper
prompts specifying the exact expected results. De-
veloping such prompts requires trial and error,
Fig. 1: DIFFUSIONDB is the first large-scale dataset
featuring 6.5TB of data, including 1.8 million unique Stable
Diffusion prompts and 14 million generated images with
accompanying hyperparameters. It provides exciting
research opportunities in prompt engineering, deepfake
detection, and understanding large generative models.
and can often feel random and unprincipled (Liu
and Chilton, 2022). Willison et al. (2022) analogize
writing prompts to wizards learning “magical
spells”: users do not understand why some prompts
work, but they will add these prompts to their “spell
book.” For example, to generate highly detailed images,
it has become a common practice to add special
keywords such as “trending on artstation”
and “unreal engine” to the prompt.
Prompt engineering has become a field of study
in the context of text-to-text generation, where re-
searchers systematically investigate how to con-
struct prompts to effectively solve different down-
stream tasks (Branwen,2020;Reynolds and Mc-
Donell,2021). As large text-to-image models are
relatively new, there is a pressing need to under-
stand how these models react to prompts, how to
write effective prompts, and how to design tools to
help users generate images (Liu and Chilton,2022).
Our work helps researchers tackle these critical
challenges, through three major contributions:
DIFFUSIONDB (Fig. 1), the first large-scale
prompt dataset totaling 6.5TB, containing
14 million images generated by Stable Diffu-
sion (Rombach et al.,2022) using 1.8 million
arXiv:2210.14896v4 [cs.CV] 6 Jul 2023
Fig. 2: DIFFUSIONDB contains 14 million Stable Diffusion images, 1.8 million unique text prompts, and all model
hyperparameters: seed, step, CFG scale, sampler, and image size. Each image also has a unique filename,
a hash of its creator’s Discord username, and a creation timestamp. To help researchers filter out potentially unsafe
or harmful content, we employ state-of-the-art models to compute an NSFW score for each image and prompt.
unique prompts and hyperparameters specified
by real users. We construct this dataset by collect-
ing images shared on the Stable Diffusion public
Discord server (§ 2). We release DIFFUSIONDB
with a CC0 1.0 license, allowing users to flexi-
bly share and adapt the dataset for their use. In
addition, we open-source our code¹ that collects,
processes, and analyzes the images and prompts.
Revealing prompt patterns and model errors.
The unprecedented scale of DIFFUSIONDB
paves the path for researchers to systematically
investigate diverse prompts and associated im-
ages that were previously not possible. By char-
acterizing prompts and images, we discover com-
mon prompt patterns and find different distribu-
tions of the semantic representations of prompts
and images. Our error analysis highlights partic-
ular hyperparameters and prompt styles can lead
to model errors. Finally, we provide evidence of
image generative models being used for poten-
tially harmful purposes such as generating misin-
formation and nonconsensual pornography (§ 3).
Highlighting new research directions. As the
first-of-its-kind text-to-image prompt dataset,
DIFFUSIONDB opens up unique opportunities
for researchers from natural language processing
(NLP), computer vision, and human-computer
interaction (HCI) communities. The scale and
diversity of this human-actuated dataset will
provide new research opportunities in better
tooling for prompt engineering, explaining large
generative models, and detecting deepfakes (§ 4).
We believe DIFFUSIONDB will serve as an im-
portant resource for researchers to study the roles
of prompts in text-to-image generation and design
next-generation human-AI interaction tools.
¹ Code: https://github.com/poloclub/diffusiondb
2 Constructing DIFFUSIONDB
We construct DIFFUSIONDB (Fig. 2) by scraping
user-generated images from the official Stable Dif-
fusion Discord server. We choose Stable Diffusion
as it is currently the only open-source large text-to-
image generative model, and all generated images
have a CC0 1.0 license that allows uses for any pur-
pose (StabilityAI,2022b). We choose the official
public Discord server as it has strict rules against
generating illegal, hateful, or NSFW (not suitable
for work, such as sexual and violent content) im-
ages, and it prohibits sharing prompts with personal
information (StabilityAI,2022a).
Our construction process includes collecting im-
ages (§ 2.1), linking them to prompts and hyperpa-
rameters (§ 2.2), applying NSFW detectors (§ 2.3),
creating a flexible file structure (§ 2.4), and dis-
tributing the dataset (§ 2.5). We discuss DIFFU-
SIONDB’s limitations and broader impacts in § 7,
§ 8, and a Data Sheet (Gebru et al.,2020) (‡ A).
2.1 Collecting User Generated Images
We download chat messages from the Stable
Diffusion Discord channels with DiscordChatEx-
porter (Holub,2017), saving them as HTML files.
We focus on channels where users can command
a bot to run Stable Diffusion Version 1 to generate
images by typing a prompt, hyperparameters, and
the number of images. The bot then replies with
the generated images and the random seeds it used.
2.2 Extracting Image Metadata
We use Beautiful Soup (Richardson,2007) to parse
HTML files, mapping generated images with their
prompts, hyperparameters, seeds, timestamps, and
the requester’s Discord usernames. Some images
are collages, where the bot combines n generated
images as a grid (e.g., a 3×3 grid of n = 9 images);
these images have the same prompt and hyperparameters
but different seeds. We use Pillow (Clark, 2015)
to split a collage into n individual images and
assign them the correct metadata and unique
filenames. Finally, we compress all images in DIF-
FUSIONDB using lossless WebP (Google, 2010).
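The grid-splitting step can be sketched as follows. The paper uses Pillow for this; the sketch below shows the same tile-slicing logic on a numpy array instead (assuming an H × W × C layout that divides evenly into the grid):

```python
import numpy as np

def split_collage(collage: np.ndarray, rows: int, cols: int) -> list[np.ndarray]:
    """Split an H x W x C collage into rows * cols equally sized tiles,
    returned in row-major order, mirroring the n-image grid layout."""
    h, w = collage.shape[0] // rows, collage.shape[1] // cols
    return [
        collage[r * h:(r + 1) * h, c * w:(c + 1) * w]
        for r in range(rows)
        for c in range(cols)
    ]

# A 3x3 collage of 30x30 tiles yields nine 30x30 individual images.
grid = np.arange(90 * 90 * 3, dtype=np.uint8).reshape(90, 90, 3)
tiles = split_collage(grid, rows=3, cols=3)
assert len(tiles) == 9 and tiles[0].shape == (30, 30, 3)
```

Each returned tile then receives its own UUID filename and a copy of the shared prompt and hyperparameters, with its individual seed.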
2.3 Identifying NSFW Content
The Stable Diffusion Discord server prohibits gen-
erating NSFW images (StabilityAI,2022a). Also,
Stable Diffusion has a built-in NSFW filter that
automatically blurs generated images if it detects
NSFW content. However, we find DIFFUSIONDB
still includes NSFW images that were not detected
by the built-in filter or removed by server moder-
ators. To help researchers filter these images, we
apply state-of-the-art NSFW classifiers to compute
NSFW scores for each prompt and image. Re-
searchers can determine a suitable threshold to fil-
ter out potentially unsafe data for their tasks.
NSFW Prompts. We use a pre-trained multilin-
gual toxicity prediction model to detect unsafe
prompts (Hanu and Unitary team,2020). This
model outputs the probabilities of a sentence be-
ing toxic, obscene, threat, insult, identity attack,
and sexually explicit. We compute the text NSFW
score by taking the maximum of the probabilities
of being toxic and sexually explicit (Fig. 3 Top).
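A minimal sketch of this scoring rule, assuming the toxicity model's per-label probabilities are available as a plain dictionary (the label names here are illustrative stand-ins for the model's actual output keys):

```python
def prompt_nsfw_score(probs: dict[str, float]) -> float:
    """Text NSFW score = max of the 'toxic' and 'sexually explicit'
    probabilities; the other labels (obscene, threat, insult,
    identity attack) do not enter the score."""
    return max(probs["toxic"], probs["sexual_explicit"])

score = prompt_nsfw_score({
    "toxic": 0.12, "obscene": 0.05, "threat": 0.01,
    "insult": 0.02, "identity_attack": 0.01, "sexual_explicit": 0.34,
})
assert score == 0.34
```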
NSFW Images. We use a pre-trained Efficient-
Net classifier to detect images with sexual con-
tent (Schuhmann et al.,2022). This model predicts
the probabilities of five image types: drawing, hen-
tai, neutral, sexual, or porn. We compute the image
NSFW score by summing the probabilities of hen-
tai, sexual, and porn. We use a Laplacian convolution
kernel with a threshold of 10 to detect images
that have already been blurred by Stable Diffusion
and assign them a score of 2.0 (Fig. 3 Bottom). As
Stable Diffusion’s blur effect is strong, our blurred
image detector has high precision and recall (both
100% on 50k randomly sampled images).
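The image-side scoring and blur check can be sketched as below. The score arithmetic follows the text; the exact convolution and thresholding details of the blur detector are our assumptions:

```python
import numpy as np

LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]])

def looks_blurred(gray: np.ndarray, threshold: float = 10.0) -> bool:
    """Heuristic blur check: convolve a grayscale image with a Laplacian
    kernel and test whether the strongest edge response stays below the
    threshold (blurred images have weak edge responses)."""
    h, w = gray.shape
    resp = np.abs(sum(
        LAPLACIAN[i, j] * gray[i:h - 2 + i, j:w - 2 + j]
        for i in range(3) for j in range(3)
    ))
    return bool(resp.max() < threshold)

def image_nsfw_score(probs: dict[str, float], gray: np.ndarray) -> float:
    """Image NSFW score = P(hentai) + P(sexual) + P(porn); images
    detected as already blurred get the sentinel score 2.0."""
    if looks_blurred(gray):
        return 2.0
    return probs["hentai"] + probs["sexual"] + probs["porn"]

flat = np.full((32, 32), 128.0)  # constant image -> no edges -> "blurred"
probs = {"drawing": 0.1, "hentai": 0.2, "neutral": 0.3,
         "sexual": 0.25, "porn": 0.15}
assert image_nsfw_score(probs, flat) == 2.0
```

The sentinel 2.0 sits outside the valid probability range [0, 1], so any threshold a researcher chooses below 1.0 automatically excludes blurred images as well.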
NSFW Detector Accuracy. To assess the accuracy
of these two pre-trained state-of-the-art NSFW
detectors, we randomly sample 5k images and 2k
prompt texts and manually annotate them with two
binary NSFW labels (one for image and one for
prompt) and analyze the results. As the percentage
of samples predicted as NSFW (score > 0.5) is
small, we up-sample positive samples for annotation,
so that we have an equal number of positive
and negative examples in our annotation sample.
Fig. 3: To help researchers filter out potentially unsafe
data in DIFFUSIONDB, we apply NSFW detectors to
predict the probability that an image-prompt pair contains
NSFW content. For images, a score of 2.0 indicates
the image has been blurred by Stable Diffusion.
After annotation, we compute the precisions and
recalls. Because we have up-sampled positive predictions,
we adjust the recalls by multiplying false
negatives by a scalar to correct for the sampling bias.
The up-sampling does not affect precisions. Finally,
the precisions, recalls, and adjusted recalls are
0.3604, 0.9565, and 0.6661 for the prompt
NSFW detector, and 0.315, 0.9722, and 0.3037
for the image NSFW detector. These results suggest
that both detectors are aggressive classifiers,
favoring recall over precision. The lower
adjusted recall of the prompt NSFW detector can
be attributed to several potential factors, including
the use of a fixed binary threshold and a potential
discrepancy in the definition of NSFW prompts
between the detector and our annotation process.
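The recall adjustment can be sketched as simple arithmetic. The counts and the scaling factor below are illustrative only, not the paper's actual annotation tallies:

```python
def adjusted_recall(tp: int, fn: int, neg_scale: float) -> float:
    """Recall corrected for up-sampled positive predictions.

    Annotating equal numbers of predicted-positive and predicted-negative
    samples under-represents predicted negatives, which is where false
    negatives live, so observed FNs are multiplied by `neg_scale` (the
    inverse of the predicted-negative sampling rate). The exact scalar
    used in the paper is unspecified; this formula is our reading of
    the described procedure."""
    return tp / (tp + neg_scale * fn)

# With 44 true positives and 2 observed false negatives (hypothetical
# counts), raw recall is high, but scaling the false negatives back up
# shrinks it:
raw = adjusted_recall(44, 2, neg_scale=1.0)
adj = adjusted_recall(44, 2, neg_scale=11.0)
assert raw > adj
```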
2.4 Organizing DIFFUSIONDB
We organize DIFFUSIONDB using a flexible file
structure. We first give each image a unique file-
name using Universally Unique Identifier (UUID,
Version 4) (Leach et al.,2005). Then, we or-
ganize images into 14,000 sub-folders—each in-
cludes 1,000 images. Each sub-folder also includes
a JSON file that contains 1,000 key-value pairs
mapping an image name to its metadata. An exam-
ple of this image-prompt pair can be seen in Fig. 2.
This modular file structure enables researchers to
flexibly use a subset of DIFFUSIONDB.
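A sketch of what one sub-folder's JSON mapping might look like. The abbreviated field names (p = prompt, se = seed, c = CFG scale, st = step, sa = sampler) are our assumption about the on-disk keys; the real files may name them differently:

```python
import json
import uuid

# Hypothetical metadata for two images in one sub-folder, keyed by the
# UUID-based image filename, mirroring the structure described above.
folder_json = {
    f"{uuid.uuid4()}.webp": {
        "p": "a painting of a fox in a snowy forest",  # prompt
        "se": 3978237193,                              # seed
        "c": 7.0,                                      # CFG scale
        "st": 50,                                      # step
        "sa": "k_lms",                                 # sampler
    }
    for _ in range(2)
}

# Each sub-folder ships such a JSON file; loading one folder is fully
# independent of the other 13,999, which is what makes subsetting easy.
restored = json.loads(json.dumps(folder_json))
assert len(restored) == 2
assert all(name.endswith(".webp") for name in restored)
```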
We create a metadata table in Apache Parquet
format (Apache, 2013) with 13 columns: unique
image name, image path, prompt, seed, step, CFG
scale, sampler, width, height, username hash,
timestamp, image NSFW score, and prompt NSFW
score. We store the table in a column-based format
for efficient querying of individual columns.
Fig. 4: The distribution of token counts for all 1.8 million
unique prompts in DIFFUSIONDB. It is worth noting
that Stable Diffusion truncates prompts at 75 tokens.
2.5 Distributing DIFFUSIONDB
We distribute DIFFUSIONDB by bundling each im-
age sub-folder as a Zip file. We collect Discord
usernames of image creators (§ 2.2), but only in-
clude their SHA256 hashes in the distribution—as
some prompts may include sensitive information,
and explicitly linking them to their creators can
cause harm. We host our dataset on a publicly accessible
repository² under a CC0 1.0 license. We
provide scripts that allow users to download and
load DIFFUSIONDB by writing two lines of code.
We discuss the broader impacts of our distribution
in § 7,§ 8, and the Data Sheet (‡ A). To mitigate
the potential harms, we provide a form for people
to report harmful content for removal. Image cre-
ators can also use this form to remove their images.
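The username hashing can be sketched with the standard library. SHA256 is one-way but deterministic, so images by the same creator remain groupable in the distribution without revealing who the creator is:

```python
import hashlib

def hash_username(username: str) -> str:
    """One-way SHA256 hash of a Discord username, distributed in place
    of the raw name. The same input always yields the same digest, so
    per-creator analyses still work on the hashed column."""
    return hashlib.sha256(username.encode("utf-8")).hexdigest()

h1 = hash_username("example_user#1234")   # hypothetical username
h2 = hash_username("example_user#1234")
assert h1 == h2 and len(h1) == 64         # stable 256-bit hex digest
assert h1 != hash_username("other_user#5678")
```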
3 Data Analysis
To gain a comprehensive understanding of the
dataset, we analyze it from different perspectives.
We examine prompt length (§ 3.1), language (§ 3.2),
characteristics of both prompts (§ 3.3) and im-
ages (§ 3.4). We conduct an error analysis on
misaligned prompt-image pairs (§ 3.5) and provide
empirical evidence of potentially harmful uses of
image generative models (§ 3.6).
3.1 Prompt Length
We collect prompts from Discord, where users can
submit one prompt to generate multiple images and
experiment with different hyperparameters. Our
dataset contains 1,819,808 unique prompts. We
tokenize prompts using the same tokenizer as used
in Stable Diffusion (Platen et al., 2022). This
tokenizer truncates tokenized prompts at 75 tokens,
excluding the special tokens <|startoftext|>
and <|endoftext|>. We measure the length of
prompts by their tokenized length. The prompt
length distribution (Fig. 4) indicates that shorter
prompts (e.g., around 6 to 12 tokens) are the most
popular. The spike at 75 suggests many users submitted
prompts longer than the model’s limit, highlighting
the need for user interfaces that guide users
to write prompts within the token limit.

² Public dataset repository: https://huggingface.co/datasets/poloclub/diffusiondb
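The length measurement can be sketched as below. Whitespace splitting stands in for the CLIP BPE tokenizer (which requires the pre-trained vocabulary), so absolute counts differ from the paper's, but the cap-at-75 logic that produces the spike in Fig. 4 is the same:

```python
from collections import Counter

MAX_TOKENS = 75  # Stable Diffusion's tokenizer truncates prompts here

def token_length(prompt: str) -> int:
    """Token count capped at the model limit. Whitespace tokenization
    is a stand-in for the CLIP BPE tokenizer used in the paper."""
    return min(len(prompt.split()), MAX_TOKENS)

prompts = [
    "a cat", "a dog wearing a hat", "a castle at sunset, concept art",
    " ".join(["word"] * 120),   # over-long prompt, counted as 75
]
histogram = Counter(token_length(p) for p in prompts)
assert histogram[75] == 1      # over-long prompts pile up at the cap
```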
3.2 Prompt Language
We use a pre-trained language detector (Joulin et al.,
2017) to identify the languages used in prompts.
98.3% of the unique prompts in our dataset are
written in English. However, we also find a large
number of non-English languages, with the top
four being German (5.2k unique prompts), French
(4.6k), Italian (3.2k), and Spanish (3k). The lan-
guage detector identifies 34 languages with at least
100 unique prompts in total. Stable Diffusion is
trained on LAION-2B(en) (Schuhmann et al., 2022),
which primarily includes images with English descriptions;
our findings thus suggest that expanding
the training data’s language coverage could improve
the user experience for non-English communities.
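The aggregation step can be sketched as follows, with hypothetical detector outputs standing in for the fastText language predictions (the real detector needs a pre-trained model file) and a toy threshold in place of the 100-prompt cutoff:

```python
from collections import Counter

# Hypothetical (prompt, predicted language) pairs standing in for the
# language detector's per-prompt output.
predictions = [("ein rotes auto", "de"), ("un chat noir", "fr"),
               ("a red car", "en"), ("a black cat", "en")]

counts = Counter(lang for _, lang in predictions)
# The paper keeps languages with at least 100 unique prompts; with toy
# data we use a threshold of 2 to show the same filtering step.
frequent = {lang for lang, n in counts.items() if n >= 2}
assert frequent == {"en"}
```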
3.3 Characterizing Prompts
In this section, we explore the characteristics of
prompts in DIFFUSIONDB. We examine the syn-
tactic (§ 3.3.1) and semantic (§ 3.3.2) features
of prompt text via interactive data visualizations.
Lastly, we discuss the implications of our findings
and suggest future research directions.
3.3.1 Prompt Syntactic Features
To characterize the composition of prompts, we
parse phrases from all 1.8M unique prompts. We
split each prompt by commas and then extract
named entities (NE) and noun phrases (NP) from
each comma-separated component using spaCy (Honnibal
et al., 2020). If there is no noun phrase in a
comma-separated component, we extract the whole
component (C) as a phrase. We keep track of each
NP’s root to create a hierarchy of noun phrases.
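The control flow of this parser can be sketched without spaCy by stubbing out the noun-phrase detector. The stub below is purely hypothetical; only the comma-splitting and the NP-versus-whole-component (C) fallback mirror the described procedure:

```python
def extract_phrases(prompt: str) -> list[tuple[str, str]]:
    """Split a prompt by commas and label each component as a noun
    phrase (NP) or a whole component (C). A stub stands in for the
    spaCy NE/NP extraction, which requires a pre-trained model."""
    def find_noun_phrases(text: str) -> list[str]:
        # Hypothetical stand-in: treat components mentioning "art" or
        # "screen" as containing a noun phrase rooted there.
        return [text] if any(w in text for w in ("art", "screen")) else []

    phrases = []
    for component in (c.strip() for c in prompt.split(",") if c.strip()):
        nps = find_noun_phrases(component)
        if nps:
            phrases += [(np_, "NP") for np_ in nps]
        else:
            # No noun phrase found: keep the whole component as C.
            phrases.append((component, "C"))
    return phrases

out = extract_phrases("a loading screen, highly detailed, concept art")
assert out == [("a loading screen", "NP"), ("highly detailed", "C"),
               ("concept art", "NP")]
```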
For example, for the prompt “draw baby yoda
in a loading screen for grand theft auto
5, highly detailed, digital art, concept
art,” we extract six phrases: “baby yoda” (NE),
“a loading screen” (NP with root “screen”),
“grand theft auto 5” (NE), “highly detailed”
(C), “digital art” (NP with root “art”), and
“concept art” (NP with root “art”). We group