Vision Transformer Based Model for Describing
a Set of Images as a Story
Zainy M. Malakan1,2[0000-0002-6980-0992],
Ghulam Mubashar Hassan1[0000-0002-6636-8807], and
Ajmal Mian1[0000-0002-5206-3842]
1The University of Western Australia, Perth WA 6009, Australia
{ghulam.hassan,ajmal.mian}@uwa.edu.au
2Umm Al-Qura University, Makkah 24382, Saudi Arabia
{zmmalakan}@uqu.edu.sa
Abstract. Visual storytelling is the process of forming a multi-sentence
story from a set of images. Appropriately including the visual variation
and contextual information captured in the input images is one of the
most challenging aspects of visual storytelling. Consequently, stories
developed from a set of images often lack cohesiveness, relevance, and
semantic relationships. In this paper, we propose a novel Vision Trans-
former based model for describing a set of images as a story. The pro-
posed method extracts the distinct features of the input images using a
Vision Transformer (ViT). First, each input image is divided into 16×16
patches, which are flattened and passed through a linear projection. The
transformation from a single image to multiple image patches captures
the visual variety of the input visual patterns. These features are fed to
a Bidirectional-LSTM, which forms part of the sequence encoder and
captures the past and future context of all image patches. An attention
mechanism is then applied to increase the discriminative capacity of the
data fed into the language model, i.e. a Mogrifier-LSTM. The performance
of our proposed model is evaluated on the Visual Story-Telling dataset
(VIST), and the results show that our model outperforms the current
state-of-the-art models.
Keywords: Storytelling · Vision Transformer · Image Processing.
1 Introduction
Visual description or storytelling (VST) seeks to create a sequence of meaningful
sentences to narrate a set of images. It has attracted significant interest in the
vision-to-language field. However, compared to image [17,7] and video [24,28,18]
captioning, narrative storytelling [21,14,19] has a more complex structure and
incorporates themes that do not appear explicitly in the given set of images.
Moreover, describing a set of images is challenging because it demands that
algorithms not only comprehend the semantic information, such as the activities
and objects in each of the five images along with their relationships, but also
produce fluent phrases and express notions that are not visually represented.
Fig. 1. An example of three visual description techniques: a single-image caption,
a story-like caption in isolation, and narrative storytelling, which is our aim.
Recent storytelling techniques utilise sequence-to-sequence (seq2seq) models
[20,14] to produce a narrative-based story from a set of images. The key idea behind
these approaches is to use a convolutional neural network (CNN), i.e. a sequence
encoder, to extract the visual features of the set of images. These visual features
are then combined to obtain a complete representation of the image set. The next
step is to feed this representational vector into a hierarchical long short-term
memory (LSTM) model to form a sequence of sentences as a story. This approach
has dominated this area of research owing to its capacity to generate high-quality
and adaptable narratives.
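For concreteness, the sketch below shows one plausible PyTorch realisation of this
generic CNN-plus-hierarchical-LSTM baseline. It is a simplified illustration rather
than the implementation of any cited work: the ResNet-152 backbone, the class names,
and the dimensions are assumptions made purely for exposition.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNStoryEncoder(nn.Module):
    """Extracts one feature vector per image with a CNN backbone, then fuses
    the five vectors with an LSTM acting as the sequence encoder."""
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = models.resnet152()       # pretrained weights would normally be loaded
        backbone.fc = nn.Identity()         # keep the 2048-d pooled features
        self.cnn = backbone
        self.rnn = nn.LSTM(2048, feat_dim, batch_first=True)

    def forward(self, images):              # images: (B, 5, 3, 224, 224)
        b, n = images.shape[:2]
        feats = self.cnn(images.flatten(0, 1)).view(b, n, -1)
        context, _ = self.rnn(feats)         # (B, 5, feat_dim), one vector per image
        return context

class SentenceDecoder(nn.Module):
    """Word-level LSTM that generates one sentence per image context vector."""
    def __init__(self, vocab_size, feat_dim=512, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim + feat_dim, feat_dim, batch_first=True)
        self.out = nn.Linear(feat_dim, vocab_size)

    def forward(self, context, tokens):      # context: (B, feat_dim), tokens: (B, T)
        emb = self.embed(tokens)
        ctx = context.unsqueeze(1).expand(-1, emb.size(1), -1)
        h, _ = self.lstm(torch.cat([emb, ctx], dim=-1))
        return self.out(h)                   # (B, T, vocab_size) word logits
```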
Figure 1 illustrates the differences between the single-image captioning style,
the storytelling-in-isolation style, and the narrative storytelling style for a
set of five images. For example, the first sentences of the three annotation
blocks in Figure 1 are: “A picture of cars around.”, “The car is parked in the
street.”, and “I went to the park yesterday, and there were many cars there.”.
The first description is the image captioning style, which conveys the actual,
physical picture information. The second is the storytelling-in-isolation style,
which also captures the image content but is not linked to the following
sentence. The final description is the first-person storytelling style, which
draws more inferences from the image as a story-based sentence and also links
to the subsequent sentence.
To address the above challenges, we propose a novel methodology that explores
the significance of spatial dimension conversion and its efficacy in a Vision
Transformer (ViT) [6] based model. Our method proceeds by extracting feature
vectors from the given images, dividing each image into 16×16 patches and
feeding them into a Bidirectional-LSTM (Bi-LSTM). This models the visual
patches as a temporal link across the set of images; by using the Bi-LSTM, we
represent this temporal link between patches in both the forward and backward
directions. To preserve the visual-specific context and relevance, we convert
the visual features and contextual vectors from the Bi-LSTM into a shared latent
space using a Mogrifier-LSTM architecture [22]. During the first layer's gated
modulation, the initial gating step scales the input embedding based on the
ground-truth context, producing a contextualized representation of the input.
This combination of multi-view feature extraction and highly context-dependent
input information allows the language model to produce more meaningful and
contextual descriptions of the input set of images.
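To make the data flow concrete, the PyTorch sketch below shows one plausible way to
wire these components together. It is a minimal illustration under stated assumptions:
only the patch-projection front end of ViT is shown (a full pretrained ViT is used in
practice), a simple additive attention stands in for the attention mechanism, and the
Mogrifier-LSTM is represented by a single round of its mutual gating; all class names
and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Divides each image into 16x16 patches and linearly projects the
    flattened patches (the first stage of a ViT)."""
    def __init__(self, patch=16, in_ch=3, dim=768):
        super().__init__()
        # A conv with kernel = stride = patch size is equivalent to
        # flattening each patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        p = self.proj(x)                        # (B, dim, 14, 14)
        return p.flatten(2).transpose(1, 2)     # (B, 196, dim) patch tokens

class PatchSequenceEncoder(nn.Module):
    """Bi-LSTM over the patch sequence (past and future context), followed by
    additive attention that pools the contextualised patches per image."""
    def __init__(self, dim=768, hidden=512):
        super().__init__()
        self.patches = PatchEmbed(dim=dim)
        self.bilstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, images):                        # (B, 5, 3, 224, 224)
        b, n = images.shape[:2]
        tokens = self.patches(images.flatten(0, 1))   # (B*5, 196, dim)
        ctx, _ = self.bilstm(tokens)                  # (B*5, 196, 2*hidden)
        w = torch.softmax(self.attn(ctx), dim=1)      # attention over patches
        pooled = (w * ctx).sum(dim=1)                 # (B*5, 2*hidden)
        return pooled.view(b, n, -1)                  # one vector per image

class MogrifierGate(nn.Module):
    """One round of Mogrifier-style mutual gating: the word embedding is
    rescaled by the (visual) context vector, and vice versa, before the
    LSTM update."""
    def __init__(self, emb_dim, ctx_dim):
        super().__init__()
        self.q = nn.Linear(ctx_dim, emb_dim, bias=False)
        self.r = nn.Linear(emb_dim, ctx_dim, bias=False)

    def forward(self, x, h):
        x = 2 * torch.sigmoid(self.q(h)) * x    # contextualise the input embedding
        h = 2 * torch.sigmoid(self.r(x)) * h    # refresh the context with the gated input
        return x, h
```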
The following is a summary of the contributions presented in this paper:
– We propose a novel ViT-based sequence encoder framework that utilises multi-view
visual information extraction to generate an appropriate narrative-based story
from the given set of images.
– We take into account past as well as future context and employ an attention
mechanism over the contextualized features obtained from the Vision Transformer
(ViT) to construct semantically rich narratives with a language model.
– We propose to combine a Mogrifier-LSTM with enriched visual features (patches)
and semantic inputs to generate data-driven narratives that are coherent and
relevant.
– We demonstrate the utility of our proposed method through multiple evaluation
metrics on the largest known Visual Story-Telling dataset (VIST) [12]3. In
addition, we compare the performance of our technique with existing
state-of-the-art techniques and show that it outperforms them on various
evaluation metrics.
2 Related Works
This section presents a review of the literature on visual captioning approaches
that directly relate to narrative storytelling techniques, followed by the
literature on visual storytelling methods.
2.1 Visual Understanding
Visual understanding algorithms, which include image and video captioning, are
the most relevant class of networks used to tackle the problem of narrative
storytelling. Since it is most relevant to our study, we briefly discuss the recent
3 https://visionandlanguage.net/VIST/