
images as a grid (e.g., a 3×3 grid of n = 9 images); these images have the same prompt and hyperparameters but different seeds. We use Pillow (Clark, 2015) to split a collage into n individual images and assign each one the correct metadata and a unique filename. Finally, we compress all images in DIFFUSIONDB using lossless WebP (Google, 2010).
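To make the splitting step concrete, here is a minimal sketch using Pillow; the grid size, filenames, and the omission of metadata handling are simplifying assumptions, not the exact pipeline.

```python
# Minimal sketch: split an n = rows * cols collage into tiles and re-save
# them with lossless WebP. Filenames and grid size are assumptions.
from PIL import Image

def split_collage(collage_path, rows=3, cols=3):
    collage = Image.open(collage_path)
    tile_w, tile_h = collage.width // cols, collage.height // rows
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            yield collage.crop(box)

for i, tile in enumerate(split_collage("collage.png")):
    tile.save(f"image_{i}.webp", format="WEBP", lossless=True)
```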
2.3 Identifying NSFW Content
The Stable Diffusion Discord server prohibits generating NSFW images (StabilityAI, 2022a). Also, Stable Diffusion has a built-in NSFW filter that automatically blurs generated images if it detects NSFW content. However, we find DIFFUSIONDB still includes NSFW images that were not detected by the built-in filter or removed by server moderators. To help researchers filter these images, we apply state-of-the-art NSFW classifiers to compute NSFW scores for each prompt and image. Researchers can determine a suitable threshold to filter out potentially unsafe data for their tasks.
NSFW Prompts. We use a pre-trained multilingual toxicity prediction model to detect unsafe prompts (Hanu and Unitary team, 2020). This model outputs the probabilities that a sentence is toxic, obscene, a threat, an insult, an identity attack, and sexually explicit. We compute the text NSFW score by taking the maximum of the probabilities of being toxic and sexually explicit (Fig. 3 Top).
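A hedged sketch of this scoring step, assuming the Detoxify library's multilingual checkpoint and its usual label names (which may differ across versions):

```python
# Sketch of the prompt NSFW score using Detoxify (Hanu and Unitary team, 2020).
# The label names "toxicity" and "sexual_explicit" are assumed; verify them
# against the installed Detoxify version.
from detoxify import Detoxify

model = Detoxify("multilingual")

def prompt_nsfw_score(prompt: str) -> float:
    probs = model.predict(prompt)
    # Text NSFW score = max of the toxicity and sexual-explicitness probabilities.
    return max(probs["toxicity"], probs["sexual_explicit"])
```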
NSFW Images. We use a pre-trained EfficientNet classifier to detect images with sexual content (Schuhmann et al., 2022). This model predicts the probabilities of five image types: drawing, hentai, neutral, sexual, and porn. We compute the image NSFW score by summing the probabilities of hentai, sexual, and porn. We use a Laplacian convolution kernel with a threshold of 10 to detect images that have already been blurred by Stable Diffusion and assign them a score of 2.0 (Fig. 3 Bottom). As Stable Diffusion's blur effect is strong, our blurred image detector has high precision and recall (both 100% on 50k randomly sampled images).
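The following sketch combines the two signals, assuming the classifier's five class probabilities are already available and interpreting the threshold of 10 as a cutoff on the variance of the Laplacian response (an assumption, since the exact blur criterion is not spelled out above):

```python
# Sketch of the image NSFW score with a blur check. Treating the threshold
# of 10 as a Laplacian-variance cutoff is an assumption.
import cv2

def image_nsfw_score(image_path: str, class_probs: dict) -> float:
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Images blurred by Stable Diffusion's safety filter have very little
    # high-frequency content, so the Laplacian response is near zero.
    if cv2.Laplacian(gray, cv2.CV_64F).var() < 10:
        return 2.0  # sentinel score for blurred images
    # Otherwise, sum the probabilities of the three unsafe classes.
    return class_probs["hentai"] + class_probs["sexual"] + class_probs["porn"]
```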
Fig. 3: To help researchers filter out potentially unsafe data in DIFFUSIONDB, we apply NSFW detectors to predict the probability that an image-prompt pair contains NSFW content. For images, a score of 2.0 indicates the image has been blurred by Stable Diffusion.

NSFW Detector Accuracy. To assess the accuracy of these two pre-trained state-of-the-art NSFW detectors, we randomly sample 5k images and 2k prompt texts, manually annotate them with two binary NSFW labels (one for the image and one for the prompt), and analyze the results. As the percentage of samples predicted as NSFW (score > 0.5) is small, we up-sample positive samples for annotation so that our annotation sample contains an equal number of positive and negative examples.
After annotation, we compute the precisions and recalls. Because we have up-sampled positive predictions, we adjust the recalls by multiplying the false negatives by a scalar that corrects for the sampling bias; the up-sampling does not affect the precisions. Finally, the precisions, recalls, and adjusted recalls are 0.3604, 0.9565, and 0.6661 for the prompt NSFW detector, and 0.315, 0.9722, and 0.3037 for the image NSFW detector. Our results suggest that both detectors are aggressive classifiers, favoring recall over precision. The lower adjusted recall of the prompt NSFW detector can be attributed to several potential factors, including the use of a fixed binary threshold and a potential discrepancy between the detector's definition of NSFW prompts and the one used in our annotation process.
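For concreteness, one way to realize this adjustment (our reading of the procedure above, not a formula given in it) is adjusted recall = TP / (TP + s · FN), with s = (N_neg / n_neg) / (N_pos / n_pos), where N_pos and N_neg count the predicted-positive and predicted-negative samples in the full dataset and n_pos and n_neg count them in the annotated sample. Because positives are up-sampled, s > 1, so the adjusted recall can only be lower than the raw recall.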
2.4 Organizing DIFFUSIONDB
We organize DIFFUSIONDB using a flexible file structure. We first give each image a unique filename using a Universally Unique Identifier (UUID, Version 4) (Leach et al., 2005). Then, we organize images into 14,000 sub-folders, each containing 1,000 images. Each sub-folder also includes a JSON file with 1,000 key-value pairs mapping each image name to its metadata; an example image-prompt pair is shown in Fig. 2. This modular file structure enables researchers to flexibly use a subset of DIFFUSIONDB.
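A hedged sketch of consuming one sub-folder at a time under this layout; the folder name, the JSON filename, and the assumption that the JSON file sits next to its images are illustrative, not guaranteed paths:

```python
# Read one DIFFUSIONDB sub-folder: 1,000 UUID-named WebP images plus a JSON
# file mapping each image name to its metadata. Paths are assumptions.
import json
from pathlib import Path
from PIL import Image

part_dir = Path("part-000001")              # one of the 14,000 sub-folders
with open(part_dir / "part-000001.json") as f:
    metadata = json.load(f)                 # {image name: metadata dict}

for image_name, image_meta in metadata.items():
    image = Image.open(part_dir / image_name)
    print(image.size, image_meta)           # prompt and hyperparameters live in image_meta
```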
We create a metadata table in Apache Parquet format (Apache, 2013) with 13 columns: unique image name, image path, prompt, seed, CFG scale, sampler, width, height, username hash, timestamp, image NSFW score, and prompt NSFW score.
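As a usage sketch, the metadata table can be loaded with pandas and thresholded on the two NSFW scores; the filename and the column names image_nsfw and prompt_nsfw are assumptions, so check the released schema first:

```python
# Load the Parquet metadata table and keep only likely-safe rows.
# "metadata.parquet", "image_nsfw", and "prompt_nsfw" are assumed names.
import pandas as pd

metadata = pd.read_parquet("metadata.parquet")

# Pick thresholds suited to the downstream task (Section 2.3); a score of
# 2.0 marks images already blurred by Stable Diffusion.
safe = metadata[(metadata["image_nsfw"] < 0.5) & (metadata["prompt_nsfw"] < 0.5)]
print(f"kept {len(safe)} of {len(metadata)} rows")
```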