image and text features, allowing the model to attend to either the image or the text modality in the late phase of generation. Wu and Hu [17] experimented with reversing the generated caption, allowing their backend model to refine earlier tokens based on later caption tokens. Le et al. [18] used an attention model to combine local and global image features, so that captions can more accurately identify occluded objects. Similarly, Wei et al. [19] used image features of both high and low generality and combined them with cross-attention; their work also involves multi-step refinement of the generated caption. Feng et al. [20] trained an image captioning model in a GAN framework, with an LSTM discriminator reproducing the original image feature from the generated text sequence. Similarly, Guo et al. [21] proposed a GAN-based method to train a model that predicts stylized text; multiple discriminators supervise whether the generated text captures image-related features, is in the desired style, and resembles a human-written caption. Kim et al. [22] used a variational autoencoder to extract image information; their model allows multi-caption generation by sampling from the learned image feature distribution, thus producing various captions for a single image. He et al. [4] used POS tagging to assist text generation: the image feature is used as additional input when the model predicts tokens related to image-specific information, i.e. objects, colours and the relative positions of objects. Mokady, Hertz and Bermano [23] experimented with using pre-trained CLIP image features for sequence generation; the CLIP features are transformed into a sequence of tokens and used as a prefix for a GPT-2 model during generation. Li et al. [5] introduced skip connections between transformer layers to address the information asymmetry between the vision and language modalities; the model achieves state-of-the-art performance and strong zero-shot ability on various tasks. Nguyen et al. [24] experimented with changing the cross-attention part of the transformer decoder to use both regional features from Faster R-CNN [25] and grid features from the Swin transformer [26].
2.2 Non-autoregressive image captioning
In contrast, non-autoregressive models benefit from the attention model's ability to pass textual information in both directions during generation: text generated at earlier timesteps can be adjusted based on text at later timesteps, which is expected to yield better performance. Gao et al. [6] used BERT [27] as the text decoder and employed a two-step generation method. Building on this work, Partially Non-Autoregressive Image Captioning by Fei [7] and Semi-Non-Autoregressive Image Captioning by Xu et al. [8] partition the generated text into subgroups: words in the same group are generated non-autoregressively, while different groups are generated autoregressively. Our model falls into the non-autoregressive category and is closest to Masked Non-Autoregressive Image Captioning [6]; the difference is that we use a diffusion model as the non-autoregressive generation model.
2.3 Diffusion models
Diffusion models train a network to incrementally denoise Gaussian noise until it reproduces the original features. Ho, Jain and Abbeel [28] proposed the Denoising Diffusion Probabilistic Model (DDPM), which simplifies the loss function by letting the model predict only the noise at each generation step, and proposed an alternative loss function that removes the weighting coefficients. In the following explanation, we refer to the diffusion model as DDPM for simplicity.
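Concretely, with $x_0$ the original feature, $\epsilon \sim \mathcal{N}(0, I)$ the sampled noise, $\bar{\alpha}_t$ the cumulative noise-schedule coefficient and $\epsilon_\theta$ the noise-prediction network, the simplified objective of [28] can be written as
\[
L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\Big[ \big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\; t\big) \big\|^2 \Big].
\]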
Nichol and Dhariwal [29] proposed several improvements to DDPM, including making the variance a learnable parameter, applying a cosine instead of a linear noise schedule, and speeding up generation by reducing the number of sampling steps. Song, Meng and Ermon [30] experimented with reducing the variance of the forward process; their results show that when the variance is reduced to 0, the resulting deterministic model achieves better FID scores for image generation on both CIFAR-10 and CelebA.
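In the DDPM notation, the sampler of [30] (DDIM) takes the update
\[
x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t \epsilon_t,
\]
with $\epsilon_t \sim \mathcal{N}(0, I)$, which becomes fully deterministic when the variance $\sigma_t$ is set to 0.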
Diffusion-LM by Li et al. [16] is a recent work applying continuous diffusion models to text generation; it explores various techniques to improve the performance of continuous diffusion models on this task.
Dhariwal and Nichol [31] proposed classifier guidance to improve the FID score of generated images. In a classifier-guided diffusion model, a classifier is pretrained to predict the object class of noised images. During sampling, the classifier provides a gradient indicating the direction in which to adjust the generated image, so that it resembles an object of the target class more closely.
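In their formulation, the mean $\mu_\theta(x_t)$ of each reverse step (with covariance $\Sigma$) is shifted by the scaled gradient of the classifier $p_\phi(y \mid x_t)$:
\[
\hat{\mu}_\theta(x_t \mid y) = \mu_\theta(x_t) + s\,\Sigma\,\nabla_{x_t} \log p_\phi(y \mid x_t),
\]
where $s$ is the guidance scale.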
To avoid training a classifier to guide the model, Ho and Salimans [32] proposed classifier-free guidance. In classifier-free guidance, the difference between the generative model's outputs when provided with and without the conditioning information is used as implicit guidance.
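Writing $\epsilon_\theta(x_t, c)$ and $\epsilon_\theta(x_t, \varnothing)$ for the noise predictions with and without the conditioning $c$, the guided prediction of [32] is the extrapolation
\[
\tilde{\epsilon}_\theta(x_t, c) = (1 + w)\,\epsilon_\theta(x_t, c) - w\,\epsilon_\theta(x_t, \varnothing),
\]
where $w$ controls the guidance strength.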
Using classifier-free diffusion models as text-to-image generators, DALL-E2 [9], GLIDE [11] and Latent Diffusion Models [33] achieve remarkable image generation performance. In particular, DALL-E2 uses a CLIP model to extract features from the text, predicts the corresponding image CLIP feature with a prior network, and then uses the predicted image CLIP feature for the final image generation. The model produces notably novel images and also inspired us to train an image-to-text model that uses a diffusion model in the generation step.