Vision Transformer Based Model for Describing
a Set of Images as a Story
Zainy M. Malakan1,2[0000-0002-6980-0992],
Ghulam Mubashar Hassan1[0000-0002-6636-8807], and
Ajmal Mian1[0000-0002-5206-3842]
1The University of Western Australia, Perth WA 6009, Australia
{ghulam.hassan,ajmal.mian}@uwa.edu.au
2Umm Al-Qura University, Makkah 24382, Saudi Arabia
{zmmalakan}@uqu.edu.sa
Abstract. Visual storytelling is the process of forming a multi-sentence
story from a set of images. Appropriately including the visual variation
and contextual information captured in the input images is one of the
most challenging aspects of visual storytelling. Consequently, stories
developed from a set of images often lack cohesiveness, relevance, and
semantic relationships. In this paper, we propose a novel Vision Trans-
former based model for describing a set of images as a story. The pro-
posed method extracts the distinct features of the input images using a
Vision Transformer (ViT). First, each input image is divided into 16×16
patches, which are flattened and passed through a linear projection. The
transformation from a single image to multiple image patches captures
the visual variety of the input visual patterns. These features are fed to
a Bidirectional-LSTM, which forms part of the sequence encoder and
captures the past and future context of all image patches. An attention
mechanism is then applied to increase the discriminative capacity of the
data fed into the language model, i.e. a Mogrifier-LSTM. The performance
of our proposed model is evaluated on the Visual Story-Telling dataset
(VIST), and the results show that our model outperforms the current
state-of-the-art models.
Keywords: Storytelling · Vision Transformer · Image Processing.
1 Introduction
Visual description or storytelling (VST) seeks to create a sequence of meaningful
sentences to narrate a set of images. It has attracted significant interest in the
vision-to-language field. However, compared to image [17,7] and video [24,28,18]
captioning, narrative storytelling [21,14,19] has a more complex structure and
incorporates themes that do not appear explicitly in the given set of images.
Moreover, describing a set of images is challenging because it demands that
algorithms not only comprehend the semantic information, such as the activities
and objects in each of the five images along with their relationships, but also
produce fluent phrases and express notions that are not visually represented.
Fig. 1. An example of three visual description techniques: a single-image caption,
a story-like caption in isolation, and narrative storytelling, which is our aim.
Recent storytelling techniques utilise sequence-to-sequence (seq2seq) models
[20,14] to produce a narrative-based story from a set of images. The key idea behind
these approaches is to use a convolutional neural network (CNN), i.e. a sequence
encoder, to extract the visual features of the set of images. These visual features
are then combined to obtain a complete representation of the image set. The next
step is to feed this representational vector into a hierarchical long short-term
memory (LSTM) model to form a sequence of sentences as a story. This approach
has dominated this area of research owing to its capacity to generate high-quality
and adaptable narratives.
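For concreteness, the sketch below shows one plausible PyTorch realisation of this
generic CNN-plus-hierarchical-LSTM baseline. It is a simplified illustration rather
than the implementation of any cited work: the ResNet-152 backbone, the class names,
and the dimensions are assumptions made purely for exposition.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNStoryEncoder(nn.Module):
    """Extracts one feature vector per image with a CNN backbone, then fuses
    the five vectors with an LSTM acting as the sequence encoder."""
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = models.resnet152()       # pretrained weights would normally be loaded
        backbone.fc = nn.Identity()         # keep the 2048-d pooled features
        self.cnn = backbone
        self.rnn = nn.LSTM(2048, feat_dim, batch_first=True)

    def forward(self, images):              # images: (B, 5, 3, 224, 224)
        b, n = images.shape[:2]
        feats = self.cnn(images.flatten(0, 1)).view(b, n, -1)
        context, _ = self.rnn(feats)         # (B, 5, feat_dim), one vector per image
        return context

class SentenceDecoder(nn.Module):
    """Word-level LSTM that generates one sentence per image context vector."""
    def __init__(self, vocab_size, feat_dim=512, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim + feat_dim, feat_dim, batch_first=True)
        self.out = nn.Linear(feat_dim, vocab_size)

    def forward(self, context, tokens):      # context: (B, feat_dim), tokens: (B, T)
        emb = self.embed(tokens)
        ctx = context.unsqueeze(1).expand(-1, emb.size(1), -1)
        h, _ = self.lstm(torch.cat([emb, ctx], dim=-1))
        return self.out(h)                   # (B, T, vocab_size) word logits
```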
Figure 1 illustrates the differences between the single-image captioning style,
the storytelling-in-isolation style, and the narrative storytelling style for a
set of five images. For example, the first sentences of the three annotation
blocks in Figure 1 are: “A picture of cars around.”, “The car is parked in the
street.”, and “I went to the park yesterday, and there were many cars there.”.
The first description is the image captioning style, which conveys the actual,
physical picture information. The second is the storytelling-in-isolation style,
which also captures the image content but is not linked to the following
sentence. The final description is the first-person storytelling style, which
draws more inferences from the image as a story-based sentence and also links
to the subsequent sentence.
To address the above challenges, we propose a novel methodology that explores
the significance of spatial dimension conversion and its efficacy in a Vision
Transformer (ViT) [6] based model. Our method proceeds by extracting feature
vectors from the given images, dividing each image into 16×16 patches and
feeding them into a Bidirectional-LSTM (Bi-LSTM). This models the visual
patches as a temporal link across the set of images; by using the Bi-LSTM, we
represent this temporal link between patches in both the forward and backward
directions. To preserve the visual-specific context and relevance, we convert
the visual features and contextual vectors from the Bi-LSTM into a shared latent
space using a Mogrifier-LSTM architecture [22]. During the first layer's gated
modulation, the initial gating step scales the input embedding based on the
ground-truth context, producing a contextualized representation of the input.
This combination of multi-view feature extraction and highly context-dependent
input information allows the language model to produce more meaningful and
contextual descriptions of the input set of images.
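To make the data flow concrete, the PyTorch sketch below shows one plausible way to
wire these components together. It is a minimal illustration under stated assumptions:
only the patch-projection front end of ViT is shown (a full pretrained ViT is used in
practice), a simple additive attention stands in for the attention mechanism, and the
Mogrifier-LSTM is represented by a single round of its mutual gating; all class names
and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Divides each image into 16x16 patches and linearly projects the
    flattened patches (the first stage of a ViT)."""
    def __init__(self, patch=16, in_ch=3, dim=768):
        super().__init__()
        # A conv with kernel = stride = patch size is equivalent to
        # flattening each patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        p = self.proj(x)                        # (B, dim, 14, 14)
        return p.flatten(2).transpose(1, 2)     # (B, 196, dim) patch tokens

class PatchSequenceEncoder(nn.Module):
    """Bi-LSTM over the patch sequence (past and future context), followed by
    additive attention that pools the contextualised patches per image."""
    def __init__(self, dim=768, hidden=512):
        super().__init__()
        self.patches = PatchEmbed(dim=dim)
        self.bilstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, images):                        # (B, 5, 3, 224, 224)
        b, n = images.shape[:2]
        tokens = self.patches(images.flatten(0, 1))   # (B*5, 196, dim)
        ctx, _ = self.bilstm(tokens)                  # (B*5, 196, 2*hidden)
        w = torch.softmax(self.attn(ctx), dim=1)      # attention over patches
        pooled = (w * ctx).sum(dim=1)                 # (B*5, 2*hidden)
        return pooled.view(b, n, -1)                  # one vector per image

class MogrifierGate(nn.Module):
    """One round of Mogrifier-style mutual gating: the word embedding is
    rescaled by the (visual) context vector, and vice versa, before the
    LSTM update."""
    def __init__(self, emb_dim, ctx_dim):
        super().__init__()
        self.q = nn.Linear(ctx_dim, emb_dim, bias=False)
        self.r = nn.Linear(emb_dim, ctx_dim, bias=False)

    def forward(self, x, h):
        x = 2 * torch.sigmoid(self.q(h)) * x    # contextualise the input embedding
        h = 2 * torch.sigmoid(self.r(x)) * h    # refresh the context with the gated input
        return x, h
```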
The following is a summary of the contributions presented in this paper:
– We propose a novel ViT-based sequence encoder framework that utilises multi-view
visual information extraction to generate an appropriate narrative-based story
from the given set of images.
– We take into account past as well as future context and employ an attention
mechanism over the contextualized features obtained from the Vision Transformer
(ViT) to construct semantically rich narratives with a language model.
– We propose to combine a Mogrifier-LSTM with enriched visual features (patches)
and semantic inputs to generate data-driven narratives that are coherent and
relevant.
– We demonstrate the utility of our proposed method through multiple evaluation
metrics on the largest known Visual Story-Telling dataset (VIST) [12]3. In
addition, we compare the performance of our technique with existing
state-of-the-art techniques and show that it outperforms them on various
evaluation metrics.
2 Related Works
This section presents a review of the literature on visual captioning approaches
that directly relate to narrative storytelling techniques, followed by the
literature on visual storytelling methods.
2.1 Visual Understanding
Visual understanding algorithms, which include image and video captioning, are
the most relevant class of networks used to tackle the problem of narrative
storytelling. Since it is most relevant to our study, we briefly discuss the recent
3 https://visionandlanguage.net/VIST/