Generated Faces in the Wild: Quantitative Comparison of
Stable Diffusion, Midjourney and DALL-E 2
Ali Borji
Quintic AI, San Francisco, CA
aliborji@gmail.com
June 7, 2023
Abstract
The field of image synthesis has made great strides over the last couple of years. Recent
models can generate images of astonishing quality. However, fine-grained evaluation of
these models on interesting categories such as faces is still missing. Here, we conduct a
quantitative comparison of three popular systems, Stable Diffusion, Midjourney, and
DALL-E 2, on their ability to generate photorealistic faces in the wild. We find that Stable
Diffusion generates better faces than the other two systems, according to the FID score. We also
introduce a dataset of generated faces in the wild, dubbed GFW, comprising a total of 15,076 faces.
We hope that our study spurs further research in assessing generative models
and improving them. Data and code are available at data and code, respectively.
1 Introduction
The field of image synthesis has made great strides in the last couple of years. Variations of Generative
Adversarial Networks (GANs) [4] and Variational Autoencoders (VAEs) [8] were the first to generate
high-quality images. Recent diffusion-based models trained on massive datasets have accelerated this
progress and have attracted a lot of attention among both AI scientists and the public. Several
blog posts (e.g. here, here, and here) and articles (e.g. [9]) have offered anecdotal evaluations.
Fine-grained evaluation of these models on some interesting categories still needs more work.
Quantitative evaluation of generative models is mostly concerned with the fidelity and diversity of
entire scenes rather than scene components or individual objects (with few exceptions [14, 1]).
Here, we focus on fine-grained evaluation of models, in particular their ability to generate
photorealistic faces. Note that our work differs from studies that attempt to build models
for generating portraits or to evaluate such models (e.g. StyleGAN [7]). Instead, we are interested
in evaluating the quality of generated faces in cluttered scenes containing multiple objects. To this
end, we use text-to-image generative models to synthesize scenes. We then run a face detector to
detect faces in these images. Finally, we use the well-established Fréchet Inception Distance (FID) [5]
to evaluate the quality of the generated faces against a set of real faces.
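For reference, the FID between two sets of faces compares Gaussian fits to the Inception-v3 features of each set:

```latex
\mathrm{FID} = \left\lVert \mu_r - \mu_g \right\rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features computed over the real and generated faces, respectively. Lower values indicate that the generated faces are statistically closer to the real ones.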
2 Comparison
2.1 Models
We consider the following three models:
1. Stable Diffusion [11].^1 Released by Stability AI in 2022, this model is primarily used to
generate detailed images conditioned on text descriptions. It can also be applied to other tasks
such as inpainting, outpainting, and image translation. The model is trained on 512×512
images from a subset of the LAION-5B database. It uses a frozen CLIP ViT-L/14 text encoder
to condition the model on text prompts. With its 860M UNet and 123M text encoder, the
model is relatively lightweight and runs on a GPU with at least 10 GB of VRAM. We use the
Colab notebook provided by Hugging Face^2 to run Stable Diffusion.

^1 https://en.wikipedia.org/wiki/Stable_Diffusion

arXiv:2210.00586v2 [cs.CV] 5 Jun 2023
2. Midjourney (https://www.midjourney.com/).^3 This model was created by an independent
research lab of the same name. It can synthesize images from textual descriptions and is
currently in open beta. Midjourney tends to generate surrealistic images and is popular among
artists. We used a collection of images generated by this model available via this Kaggle link.
3. DALL-E 2 [10].^4 Created by OpenAI, DALL-E 2 is the successor of DALL-E. It can create more
realistic images than DALL-E at higher resolutions and can combine concepts, attributes, and
styles. DALL-E 2 is trained on approximately 650 million image-text pairs scraped from the
Internet. Since the DALL-E 2 code is not available, we were not able to generate images on a large
scale. We accessed the system via its portal by manually entering the prompts and saving
the results.
2.2 Data
To evaluate the models, two sets of faces are required: a) generated faces, and b) real faces.
Generated faces. We avoided crawling the web for generated faces to reduce potential
biases; people tend to post only high-quality faces on social media (i.e. cherry-picking).
Instead, we use captions from the COCO dataset^5 (captions_train2017.json)
as prompts to synthesize images. Images generated by Stable Diffusion and DALL-E 2 have
a size of 512×512 pixels, while images generated by Midjourney are of variable size. To increase
the chance that generated images contain faces, we chose captions containing any word from the
following list: [‘person’, ‘man’, ‘woman’, ‘men’, ‘women’, ‘kid’, ‘child’, ‘face’,
‘girl’, ‘boy’]. We then ran the MediaPipe face detector^6 twice: first on the entire image
to detect faces, and a second time on the individual detections to prune false positives. Finally,
we manually removed the remaining false positives. Detected faces were resized to 100×100
pixels. In total, we collected 15,076 generated faces: 8,050 by Stable Diffusion, 6,350
by Midjourney, and 676 by DALL-E 2.
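The caption-selection step above can be sketched as follows. This is a minimal illustration, not the authors' exact code: the function names are hypothetical, and the whole-word matching (to avoid accidental substring hits) is an assumption about how the keyword filter was applied; the keyword list and the annotations filename come from the paper.

```python
import json

# Keywords used in the paper to select COCO captions likely to yield faces.
KEYWORDS = ['person', 'man', 'woman', 'men', 'women',
            'kid', 'child', 'face', 'girl', 'boy']

def select_face_prompts(captions):
    """Keep captions containing any keyword as a whole word (assumed matching rule)."""
    selected = []
    for cap in captions:
        # Crude tokenization: lowercase, strip common punctuation, split on whitespace.
        words = cap.lower().replace('.', ' ').replace(',', ' ').split()
        if any(k in words for k in KEYWORDS):
            selected.append(cap)
    return selected

def load_coco_captions(path='captions_train2017.json'):
    """Load caption strings from the COCO captions annotation file."""
    with open(path) as f:
        data = json.load(f)
    return [ann['caption'] for ann in data['annotations']]

# Example: only the first caption mentions a keyword.
prompts = select_face_prompts(["A man riding a horse.",
                               "A bowl of fruit on a table."])
# prompts == ["A man riding a horse."]
```

The selected captions are then fed as prompts to each text-to-image system, and the MediaPipe face detector is run on the resulting images.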
Real faces. We ran the face detector on the COCO training set (train2017.zip), as
above, to extract faces. In addition, we added 13,233 faces from the Labeled Faces in the Wild
(LFW)^7 dataset [6]. We cropped the central 100×100-pixel area from the 250×250 faces of
LFW. In total, we collected 30,000 real faces.
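The LFW central crop above reduces each 250×250 image to its middle 100×100 region. A small sketch of the crop-box arithmetic (the helper name is illustrative, not from the paper):

```python
def center_crop_box(img_size, crop_size):
    """Return a (left, top, right, bottom) box for a central square crop."""
    off = (img_size - crop_size) // 2
    return (off, off, off + crop_size, off + crop_size)

# For LFW: 250x250 -> central 100x100 region.
box = center_crop_box(250, 100)
# box == (75, 75, 175, 175)
# With PIL, for example: face = Image.open(path).crop(box)
```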
We removed faces (both generated and real) that were highly occluded (e.g. a person eating a
large piece of food or fruit, or wearing a mask), as well as faces that were too dark or too blurry. We also
removed dull faces, animal faces, drawings, and cartoons. Faces with eyeglasses were kept.
2.3 Evaluation Scheme
Several scores have been proposed for evaluating generative models [2, 3]; the FID score is the
most commonly used one. We utilize the implementation from https://github.com/mseitzer/pytorch-fid.
Each time, we shuffled the faces (generated and real), randomly selected 5,000
faces from each set, and computed the FID score between the two sets. We repeated this procedure
10 times and report the mean and standard deviation of FID across the runs. Since there are fewer
than 5K DALL-E 2 faces, we sampled faces with replacement for this model. To compute the FID for
real faces, we first shuffle the set of real faces and then split the first 10K faces into two sets of 5K
faces each; the FID between these two sets is then computed.
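The resampling protocol above can be sketched as follows. The FID computation itself is delegated to `fid_fn` (a placeholder for e.g. pytorch-fid run on two image folders); only the sampling logic, including the with-replacement case for DALL-E 2, is shown, and the function names are illustrative.

```python
import random
import statistics

def sample_set(faces, n=5000):
    """Sample n faces: with replacement when the pool is smaller than n
    (the DALL-E 2 case), without replacement otherwise."""
    if len(faces) < n:
        return random.choices(faces, k=n)  # with replacement
    return random.sample(faces, n)         # without replacement

def repeated_fid(gen_faces, real_faces, fid_fn, runs=10, n=5000):
    """Mean and standard deviation of FID over repeated random n-vs-n comparisons."""
    scores = [fid_fn(sample_set(gen_faces, n), sample_set(real_faces, n))
              for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)
```

The real-faces baseline reported in the paper follows the same pattern: shuffle the 30K real faces once, split the first 10K into two disjoint 5K halves, and compute the FID between the halves.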
^2 https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb
^3 https://en.wikipedia.org/wiki/Midjourney
^4 https://en.wikipedia.org/wiki/DALL-E, https://openai.com/dall-e-2/
^5 https://cocodataset.org/
^6 https://google.github.io/mediapipe/solutions/face_detection.html
^7 http://vis-www.cs.umass.edu/lfw/