image and text features, allowing the model to attend to either the image or the text modality in the late phase of generation. Wu and Hu [17] experimented with reversing the generated caption, allowing their backend model to refine earlier tokens based on later caption tokens. Le et al. [18] used an attention model to combine local and global image features, so that captions can more accurately identify occluded objects. Similarly, Wei et al. [19] used image features of both high and low generality and combined them with cross-attention; their work also involves multi-step refinement of the generated caption. Feng et al. [20] trained an image captioning model in a GAN framework, with an LSTM discriminator reproducing the original image feature from the generated text sequence. Similarly, Guo et al. [21] proposed a GAN-based method to train a model that predicts stylized text; multiple discriminators supervise whether the generated text captures image-related features, is in the desired style, and resembles a human-written caption. Kim et al. [22] used a variational autoencoder to extract image information; their model allows multi-caption generation by sampling from the learned image feature distribution, thus producing various captions for a single image. He et al. [4] used POS tagging to assist text generation: the image feature is used as additional input when the model predicts tokens related to image-specific information, i.e. objects, colours and the relative positions of objects. Mokady, Hertz and Bermano [23] experimented with using pre-trained CLIP image features for sequence generation; the CLIP features are transformed into a sequence of tokens and used as a prefix for a GPT-2 model during generation. Li et al. [5] introduced skip connections between transformer layers to address the information asymmetry between the vision and language modalities; the model achieves state-of-the-art performance and strong zero-shot ability on various tasks. Nguyen et al. [24] experimented with changing the cross-attention part of the transformer decoder to use both regional features from Faster R-CNN [25] and grid features from the Swin transformer [26].
2.2 Non-autoregressive image captioning
In contrast, non-autoregressive models benefit from the attention model's ability to pass textual information in both directions during generation: text generated at earlier timesteps can be adjusted based on text at later timesteps, which is expected to yield better performance. Gao et al. [6] used BERT [27] as the text decoder and employed a two-step generation method. Building on this work, Partially Non-Autoregressive Image Captioning by Fei [7] and Semi-Non-Autoregressive Image Captioning by Xu et al. [8] partition the generated text into subgroups: words in the same group are generated non-autoregressively, while different groups are generated autoregressively. Our model falls into the non-autoregressive category and is closest to Masked Non-Autoregressive Image Captioning [6]; the difference is that we use a diffusion model as the non-autoregressive generation model.
2.3 Diffusion models
Diffusion models train a network to incrementally denoise Gaussian noise until it reproduces the original features. Ho, Jain and Abbeel [28] proposed the Denoising Diffusion Probabilistic Model (DDPM), which simplifies the loss function by letting the model predict only the noise at each generation step, and proposed an alternative loss function that removes the weighting coefficients. In the following explanation, we refer to the diffusion model as DDPM for simplicity.
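Concretely, with $x_0$ the original feature, $\epsilon \sim \mathcal{N}(0, I)$ the sampled noise, $\bar{\alpha}_t$ the cumulative noise-schedule coefficient and $\epsilon_\theta$ the noise-prediction network, the simplified objective of [28] can be written as
\[
L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\Big[ \big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\; t\big) \big\|^2 \Big].
\]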
Nichol and Dhariwal [29] proposed several improvements to DDPM, including making the variance a learnable parameter, applying a cosine instead of a linear noise schedule, and speeding up generation by reducing the number of sampling steps. Song, Meng and Ermon [30] experimented with reducing the variance of the forward process; their results show that when the variance is reduced to 0, the resulting deterministic model achieves better FID scores for image generation on both CIFAR-10 and CelebA.
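In the DDPM notation, the sampler of [30] (DDIM) takes the update
\[
x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t \epsilon_t,
\]
with $\epsilon_t \sim \mathcal{N}(0, I)$, which becomes fully deterministic when the variance $\sigma_t$ is set to 0.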
Diffusion-LM by Li et al. [16] is a recent work applying continuous diffusion models to text generation; it explores various techniques to improve the performance of continuous diffusion models on this task.
Dhariwal and Nichol [31] proposed classifier guidance to improve the FID score of generated images. In a classifier-guided diffusion model, a classifier is pretrained to predict the object class of noised images. During sampling, the classifier provides a gradient indicating the direction in which to adjust the generated image, so that it resembles an object of the target class more closely.
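In their formulation, the mean $\mu_\theta(x_t)$ of each reverse step (with covariance $\Sigma$) is shifted by the scaled gradient of the classifier $p_\phi(y \mid x_t)$:
\[
\hat{\mu}_\theta(x_t \mid y) = \mu_\theta(x_t) + s\,\Sigma\,\nabla_{x_t} \log p_\phi(y \mid x_t),
\]
where $s$ is the guidance scale.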
To avoid training a classifier to guide the model, Ho and Salimans [32] proposed classifier-free guidance. In classifier-free guidance, the difference between the generative model's outputs when provided with and without the conditioning information is used as implicit guidance.
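Writing $\epsilon_\theta(x_t, c)$ and $\epsilon_\theta(x_t, \varnothing)$ for the noise predictions with and without the conditioning $c$, the guided prediction of [32] is the extrapolation
\[
\tilde{\epsilon}_\theta(x_t, c) = (1 + w)\,\epsilon_\theta(x_t, c) - w\,\epsilon_\theta(x_t, \varnothing),
\]
where $w$ controls the guidance strength.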
Using classifier-free diffusion models as text-to-image generators, DALL-E2 [9], GLIDE [11] and Latent Diffusion Models [33] achieve remarkable image generation performance. In particular, DALL-E2 uses a CLIP model to extract features from the text, predicts the corresponding image CLIP feature with a prior network, and then uses the predicted image CLIP feature for the final image generation. The model produces notably novel images and also inspired us to train an image-to-text model that uses a diffusion model in the generation step.