Generated Faces in the Wild: Quantitative Comparison of
Stable Diffusion, Midjourney and DALL-E 2
Ali Borji
Quintic AI, San Francisco, CA
aliborji@gmail.com
June 7, 2023
Abstract
The field of image synthesis has made great strides over the last couple of years. Recent
models can generate images of astonishing quality. However, fine-grained evaluation of
these models on interesting categories such as faces is still missing. Here, we conduct a
quantitative comparison of three popular systems, Stable Diffusion, Midjourney, and
DALL-E 2, on their ability to generate photorealistic faces in the wild. We find that Stable
Diffusion generates better faces than the other two systems, according to the FID score. We also
introduce a dataset of generated faces in the wild, dubbed GFW, comprising a total of 15,076 faces.
We hope that our study spurs further research in assessing generative models
and improving them. Data and code are available at data and code, respectively.
1 Introduction
The field of image synthesis has made great strides in the last couple of years. Variations of Generative
Adversarial Networks (GANs) [4] and Variational Autoencoders (VAEs) [8] were the first to generate
high-quality images. Recent diffusion-based models trained on massive datasets have accelerated this
progress and have attracted a lot of attention among both AI scientists and the public. Several
blog posts (e.g. here, here, and here) and articles (e.g. [9]) have offered anecdotal evaluations.
Fine-grained evaluation of these models on some interesting categories still needs more work.
Quantitative evaluation of generative models is mostly concerned with the fidelity and diversity of
entire scenes rather than scene components or individual objects (with few exceptions [14, 1]).
Here, we focus on fine-grained evaluation of models, in particular their ability to generate
photorealistic faces. Note that our work differs from studies that attempt to build models
for generating portraits or to evaluate such models (e.g. StyleGAN [7]). Instead, we are interested
in evaluating the quality of generated faces in cluttered scenes containing multiple objects. To this
end, we use text-to-image generative models to synthesize scenes. We then run a face detector to
detect faces in these images. Finally, we use the well-established Fréchet Inception Distance (FID) [5]
to evaluate the quality of the generated faces against a set of real faces.
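For reference, the FID between two sets of faces compares Gaussian fits to the Inception-v3 features of each set:

```latex
\mathrm{FID} = \left\lVert \mu_r - \mu_g \right\rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features computed over the real and generated faces, respectively. Lower values indicate that the generated faces are statistically closer to the real ones.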
2 Comparison
2.1 Models
We consider the following three models:
1. Stable Diffusion [11].^1 Released by Stability AI in 2022, this model is primarily used to
generate detailed images conditioned on text descriptions. It can also be applied to other tasks
such as inpainting, outpainting, and image translation. The model is trained on 512×512
images from a subset of the LAION-5B database. It uses a frozen CLIP ViT-L/14 text encoder
to condition the model on text prompts. With its 860M UNet and 123M text encoder, the
model is relatively lightweight and runs on a GPU with at least 10 GB of VRAM. We use the
Colab notebook provided by Hugging Face^2 to run Stable Diffusion.

^1 https://en.wikipedia.org/wiki/Stable_Diffusion

arXiv:2210.00586v2 [cs.CV] 5 Jun 2023
2. Midjourney (https://www.midjourney.com/).^3 This model was created by an independent
research lab of the same name. It can synthesize images from textual descriptions and is
currently in open beta. Midjourney tends to generate surrealistic images and is popular among
artists. We used a collection of images generated by this model available via this Kaggle link.
3. DALL-E 2 [10].^4 Created by OpenAI, DALL-E 2 is the successor of DALL-E. It can create more
realistic images than DALL-E at higher resolutions and can combine concepts, attributes, and
styles. DALL-E 2 is trained on approximately 650 million image-text pairs scraped from the
Internet. Since the DALL-E 2 code is not available, we were not able to generate images on a large
scale. We accessed the system via its portal by manually entering the prompts and saving
the results.
2.2 Data
To evaluate the models, two sets of faces are required: a) generated faces, and b) real faces.
Generated faces. We avoided crawling the web for generated faces to reduce potential
biases; people tend to post only high-quality faces on social media (i.e. cherry-picking).
Instead, we use captions from the COCO dataset^5 (captions_train2017.json)
as prompts to synthesize images. Images generated by Stable Diffusion and DALL-E 2 have
a size of 512×512 pixels, while images generated by Midjourney are of variable size. To increase
the chance that generated images contain faces, we chose captions containing any word from the
following list: [‘person’, ‘man’, ‘woman’, ‘men’, ‘women’, ‘kid’, ‘child’, ‘face’,
‘girl’, ‘boy’]. We then ran the MediaPipe face detector^6 twice: first on the entire image
to detect faces, and a second time on the individual detections to prune false positives. Finally,
we manually removed the remaining false positives. Detected faces were resized to 100×100
pixels. In total, we collected 15,076 generated faces: 8,050 by Stable Diffusion, 6,350
by Midjourney, and 676 by DALL-E 2.
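The caption-selection step above can be sketched as follows. This is a minimal illustration, not the authors' exact code: the function names are hypothetical, and the whole-word matching (to avoid accidental substring hits) is an assumption about how the keyword filter was applied; the keyword list and the annotations filename come from the paper.

```python
import json

# Keywords used in the paper to select COCO captions likely to yield faces.
KEYWORDS = ['person', 'man', 'woman', 'men', 'women',
            'kid', 'child', 'face', 'girl', 'boy']

def select_face_prompts(captions):
    """Keep captions containing any keyword as a whole word (assumed matching rule)."""
    selected = []
    for cap in captions:
        # Crude tokenization: lowercase, strip common punctuation, split on whitespace.
        words = cap.lower().replace('.', ' ').replace(',', ' ').split()
        if any(k in words for k in KEYWORDS):
            selected.append(cap)
    return selected

def load_coco_captions(path='captions_train2017.json'):
    """Load caption strings from the COCO captions annotation file."""
    with open(path) as f:
        data = json.load(f)
    return [ann['caption'] for ann in data['annotations']]

# Example: only the first caption mentions a keyword.
prompts = select_face_prompts(["A man riding a horse.",
                               "A bowl of fruit on a table."])
# prompts == ["A man riding a horse."]
```

The selected captions are then fed as prompts to each text-to-image system, and the MediaPipe face detector is run on the resulting images.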
Real faces. We ran the face detector on the COCO training set (train2017.zip), as
above, to extract faces. In addition, we added 13,233 faces from the Labeled Faces in the Wild
(LFW)^7 dataset [6]. We cropped the central 100×100-pixel area from the 250×250 faces of
LFW. In total, we collected 30,000 real faces.
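The LFW central crop above reduces each 250×250 image to its middle 100×100 region. A small sketch of the crop-box arithmetic (the helper name is illustrative, not from the paper):

```python
def center_crop_box(img_size, crop_size):
    """Return a (left, top, right, bottom) box for a central square crop."""
    off = (img_size - crop_size) // 2
    return (off, off, off + crop_size, off + crop_size)

# For LFW: 250x250 -> central 100x100 region.
box = center_crop_box(250, 100)
# box == (75, 75, 175, 175)
# With PIL, for example: face = Image.open(path).crop(box)
```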
We removed faces (both generated and real) that were highly occluded (e.g. a person eating a
large piece of food or fruit, or wearing a mask), as well as faces that were too dark or too blurry. We also
removed dull faces, animal faces, drawings, and cartoons. Faces with eyeglasses were kept.
2.3 Evaluation Scheme
Several scores have been proposed for evaluating generative models [2, 3]; the FID score is the
most commonly used one. We utilize the implementation from https://github.com/mseitzer/pytorch-fid.
Each time, we shuffled the faces (generated and real), randomly selected 5,000
faces from each set, and computed the FID score between the two sets. We repeated this procedure
10 times and report the mean and standard deviation of FID across the runs. Since there are fewer
than 5K DALL-E 2 faces, we sampled faces with replacement for this model. To compute the FID for
real faces, we first shuffle the set of real faces and then split the first 10K faces into two sets of 5K
faces each; the FID between these two sets is then computed.
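The resampling protocol above can be sketched as follows. The FID computation itself is delegated to `fid_fn` (a placeholder for e.g. pytorch-fid run on two image folders); only the sampling logic, including the with-replacement case for DALL-E 2, is shown, and the function names are illustrative.

```python
import random
import statistics

def sample_set(faces, n=5000):
    """Sample n faces: with replacement when the pool is smaller than n
    (the DALL-E 2 case), without replacement otherwise."""
    if len(faces) < n:
        return random.choices(faces, k=n)  # with replacement
    return random.sample(faces, n)         # without replacement

def repeated_fid(gen_faces, real_faces, fid_fn, runs=10, n=5000):
    """Mean and standard deviation of FID over repeated random n-vs-n comparisons."""
    scores = [fid_fn(sample_set(gen_faces, n), sample_set(real_faces, n))
              for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)
```

The real-faces baseline reported in the paper follows the same pattern: shuffle the 30K real faces once, split the first 10K into two disjoint 5K halves, and compute the FID between the halves.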
^2 https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb
^3 https://en.wikipedia.org/wiki/Midjourney
^4 https://en.wikipedia.org/wiki/DALL-E, https://openai.com/dall-e-2/
^5 https://cocodataset.org/
^6 https://google.github.io/mediapipe/solutions/face_detection.html
^7 http://vis-www.cs.umass.edu/lfw/