CLIP-DIFFUSION-LM: APPLY DIFFUSION MODEL ON IMAGE CAPTIONING
Shitong Xu
Imperial College London
shitong.xu19@imperial.ac.uk
ABSTRACT
The image captioning task has been extensively researched in previous work. However, few experiments focus on generating captions with a non-autoregressive text decoder. Inspired by the recent success of denoising diffusion models on image synthesis tasks, we apply denoising diffusion probabilistic models to text generation in the image captioning task. We show that our CLIP-Diffusion-LM is capable of generating image captions using significantly fewer inference steps than autoregressive models. On the Flickr8k dataset, the model achieves a 0.1876 BLEU-4 score. By training on the combined Flickr8k and Flickr30k dataset, our model achieves a 0.2470 BLEU-4 score. Our code is available at https://github.com/xu-shitong/diffusion-image-captioning.
Keywords Diffusion model · CLIP · Non-autoregressive text generation
1 Introduction
Image captioning has been a focus of research over recent years. Among previously proposed works, the text decoders used can be classified into two general classes, i.e. the autoregressive and the non-autoregressive class. Most of the state-of-the-art models fall in the autoregressive class [1, 2, 3, 4, 5]. However, autoregressive generation suffers from 1) slow generation speed, since tokens are generated one at a time, and 2) the inability to refine the prefix of a sentence based on later generated tokens. Multiple works have experimented with non-autoregressive models in the text generation step [6, 7, 8]. The closest work to ours is Masked Non-Autoregressive Image Captioning by Gao et al. [6], which uses a BERT model as the generator and involves a two-step refinement of the generated sequence. A Masked Language Model (MLM) objective is used in their work to supervise caption generation.
In contrast to MLM, which is a language model based on predicting discrete token embeddings, diffusion models based on continuous latent embeddings have been thriving in image and audio generation tasks [9, 10, 11, 12, 13]. To the best of our knowledge, there has been no previous work on generating caption embeddings with a diffusion language model. Our work aims at employing a model that continuously refines generated tokens in the sequence embedding space, and provides empirical insight into useful tricks for improving the generated captions. In particular, we use a pre-trained CLIP [14] model for extracting image and text features, and a DistilBERT [15] model based on Diffusion-LM [16] for text sequence generation. Our contribution consists of proposing a diffusion-based image captioning model (Sec. 4), and experiments on the effectiveness of multiple model designs and hyperparameter settings, including classifier-free guidance in Sec. 5.1, learning rate in Sec. 5.2, weight assignment to loss function terms in Sec. 5.3, x0-prediction in Sec. 5.4 and feature fusion methods in Sec. 5.5.
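To make the overall pipeline concrete, the following is a minimal sketch of the kind of setup described above, assuming the HuggingFace transformers implementations of CLIP and DistilBERT; the fusion by addition, the projection layers and the omission of timestep conditioning are illustrative simplifications, not the exact architecture detailed in Sec. 4.

# Minimal sketch (not the exact Sec. 4 architecture): a CLIP image feature
# conditions a DistilBERT denoiser that refines a noised caption embedding.
import torch
from torch import nn
from transformers import CLIPModel, DistilBertModel

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
denoiser = DistilBertModel.from_pretrained("distilbert-base-uncased")

# Project the 512-d CLIP image feature into DistilBERT's 768-d embedding space,
# and map the transformer output back to a predicted clean caption embedding.
img_proj = nn.Linear(clip.config.projection_dim, denoiser.config.dim)
out_proj = nn.Linear(denoiser.config.dim, denoiser.config.dim)

def denoise_step(pixel_values, noised_caption_emb):
    # Fuse the image feature with the noised caption embedding by addition
    # (one of several possible fusion methods, cf. Sec. 5.5); timestep
    # conditioning is omitted here for brevity.
    with torch.no_grad():
        img_feat = clip.get_image_features(pixel_values=pixel_values)   # (B, 512)
    fused = noised_caption_emb + img_proj(img_feat).unsqueeze(1)        # (B, L, 768)
    hidden = denoiser(inputs_embeds=fused).last_hidden_state
    return out_proj(hidden)                                             # predicted clean embedding

# Toy usage with random inputs: a batch of 2 images and length-16 noised captions.
pixels = torch.randn(2, 3, 224, 224)
x_t = torch.randn(2, 16, denoiser.config.dim)
x0_pred = denoise_step(pixels, x_t)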
2 Related Work
2.1 Autoregressive image captioning
Mao et al. proposed the mRNN model [1], which uses a CNN for image feature extraction and an RNN for text generation. Xu et al. [2] applied an LSTM to text generation and experimented with soft and hard attention for early fusion between image and text features. Based on this early fusion method, Lu et al. [3] experimented with late fusion of image and text features, allowing the model to attend to either the image or the text modality in the late phase of generation. Wu and Hu [17] experimented with reversing the generated caption, allowing their backend model to refine earlier tokens based on later caption tokens. Le et al. [18] used an attention model to combine local and global image features, so that captions can more accurately identify occluded objects. Similarly, Wei et al. [19] also used image features of both high and low generality and combined them using cross-attention; their work also involves multi-step refinement of the generated caption. Feng et al. [20] trained an image captioner in a GAN framework, with an LSTM discriminator reproducing the original image feature from the generated text sequence. Similarly, Guo et al. [21] proposed a GAN-based method to train a model predicting stylized text; multiple discriminators are used to supervise whether the generated text captures image-related features, is in the desired style, and is similar to a caption written by a human. Kim et al. [22] used a variational autoencoder for extracting image information; their model allows multi-caption generation by sampling from the learned image feature distribution, thus producing various captions for a single image. He et al. [4] used POS tagging to help the generation of text: the image feature is used as additional input when the model is predicting tokens related to image-specific information, i.e. objects, colours, and relative positions of objects. Mokady, Hertz and Bermano [23] experimented with using pre-trained CLIP image features for sequence generation; the CLIP features are transformed into a sequence of tokens and used as a prefix for a GPT-2 model during generation. Li et al. [5] introduced skip connections between transformer layers to address the information asymmetry between the vision and language modalities; their model achieves state-of-the-art performance and strong zero-shot ability on various tasks. Nguyen et al. [24] experimented with changing the cross-attention part of the transformer decoder to use both regional features from Faster R-CNN [25] and grid features from the Swin transformer [26].
2.2 Non-autoregressive image captioning
In contrast, non-autoregressive models benefit from the attention model's ability to pass textual information in both directions during generation. Text generated at earlier positions can be adjusted based on text at later positions, and is thus expected to achieve better performance. Gao et al. [6] used BERT [27] as the text decoder and employed a two-step generation method. Based on this work, Partially Non-Autoregressive Image Captioning by Fei [7] and Semi Non-Autoregressive Image Captioning by Xu et al. [8] partitioned the generated text into subgroups: words in the same group are generated non-autoregressively, while different groups are generated autoregressively. Our model falls in the non-autoregressive category and is closest to Masked Non-Autoregressive Image Captioning [6]; the difference is that we use a diffusion model as the non-autoregressive generation model.
2.3 Diffusion models
Diffusion models aim at training a model that denoises Gaussian noise incrementally to reproduce the original features. Ho, Jain and Abbeel [28] proposed the Denoising Diffusion Probabilistic Model (DDPM), which simplifies the loss function by only letting the model predict the noise added at each generation step, and proposed an alternative loss function that removes the weighting coefficients. In the following explanation, we refer to the diffusion model as DDPM for simplicity. Nichol and Dhariwal [29] proposed several improvements to DDPM, including making the variance a learnable parameter, applying a cosine instead of a linear noise schedule, and speeding up generation by reducing the number of sampling steps. Song, Meng and Ermon [30] experimented with reducing the variance of the generation process; their results show that by reducing the variance to 0, the resulting deterministic model achieves better FID scores in image generation on both CIFAR-10 and CelebA. Diffusion-LM by Li et al. [16] is a recent work applying a continuous diffusion model to text generation, exploring various techniques to improve its performance in this setting.
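For reference, the simplified objective of Ho, Jain and Abbeel [28] trains a network \epsilon_\theta to predict the noise added to x_0 at a uniformly sampled timestep t, dropping the weighting coefficients of the variational bound:

L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon \sim \mathcal{N}(0, I)} \left[ \left\lVert \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t \right) \right\rVert^2 \right], \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s).

The alternative x_0-prediction parameterisation, in which the network predicts x_0 itself rather than the noise, is the setting compared in Sec. 5.4.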
Dhariwal and Nichol [31] proposed classifier guidance to improve the FID score of generated images. In a classifier-guided diffusion model, a classifier is pretrained to predict the object class of noised images. During sampling, the classifier provides a gradient indicating in which direction to adjust the generated image, so that the generated image more closely resembles an object of the target class.
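Concretely, in the formulation of Dhariwal and Nichol [31] the classifier gradient shifts the mean of each reverse sampling step; the following standard form (with guidance scale s and predicted variance \Sigma_\theta) is quoted here for reference, as the notation is not restated in this work:

\hat{\mu}_\theta(x_t, t \mid y) = \mu_\theta(x_t, t) + s\, \Sigma_\theta(x_t, t)\, \nabla_{x_t} \log p_\phi(y \mid x_t).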
To avoid training a classifier for guiding the model, Ho and Salimans [32] proposed classifier-free guidance. In classifier-free guidance, the difference between the outputs of the generative model when provided with and without the conditioning context is used as implicit guidance. By using classifier-free diffusion models as text-to-image generators, DALL-E 2 [9], GLIDE [11] and High-Resolution Image Synthesis with Latent Diffusion Models [33] achieve remarkable image generation performance. In particular, DALL-E 2 uses a CLIP model to extract features from the text, predicts the corresponding CLIP image feature through a prior network, and then uses the predicted image feature for the final image generation. The model achieves significant novelty in its generated images and inspired us to train an image-to-text model with a diffusion model in the generation step.
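The implicit guidance can be written as an extrapolation between the conditional and unconditional predictions of the same network; the following standard form (with guidance weight w and \varnothing denoting the dropped-out, unconditional context) is given here for reference:

\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right),

so that w = 1 recovers the purely conditional model and larger w pushes samples further towards the conditioning signal c.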