Image Semantic Relation Generation
Mingzhe Du
School of Computer Science and Engineering, NTU
50 Nanyang Ave, Singapore 639798
mingzhe001@e.ntu.edu.sg
Abstract
Scene graphs provide structured semantic understanding beyond images. For downstream tasks such as image retrieval, visual question answering, visual relationship detection, and even autonomous vehicle technology, scene graphs can not only distil complex image information but also correct the bias of visual models using semantic-level relations, giving them broad application prospects. However, the heavy labour cost of constructing graph annotations may hinder the application of PSG in practical scenarios. Inspired by the observation that people usually identify the subject and object first and then determine the relationship between them, we propose to decouple the scene graph generation task into two sub-tasks: 1) an image segmentation task to pick out the qualified objects, and 2) a restricted auto-regressive text generation task to generate the relation between given objects. In this work, we therefore introduce image semantic relation generation (ISRG), a simple but effective image-to-text model, which achieves 31 points on the OpenPSG dataset and outperforms strong baselines by 16 points (ResNet-50) and 5 points (CLIP), respectively.
1. Introduction
The PSG classification task aims to identify the three most salient relations in a given image [6]. Unlike generating a full scene graph [17], this task does not require the model to find the objects corresponding to the relations [6]. Therefore, if we can solve the scene graph generation task, or at least identify the relationship between a given subject and object, we can solve the PSG classification task completely. In this work, we first employ a panoptic image segmentation algorithm [7] from Detectron2 [4] to map each pixel to its object, then pick the top-k objects with the largest area ratios, and finally feed these objects into a multi-modal model to generate sequences, under a prefix-tree constraint, until we obtain three distinct relation descriptions.
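The object-selection step above (keep the top-k objects with the largest area ratios) can be sketched as follows. This is an illustrative sketch, not the paper's code: the function name `top_k_objects` and the flat per-pixel id list standing in for a real panoptic segmentation output are assumptions for the example.

```python
from collections import Counter

def top_k_objects(segmentation, k=3):
    """Pick the k objects covering the largest fraction of the image.

    `segmentation` is a flat list of per-pixel object ids, a stand-in
    for the pixel-to-object map a panoptic segmentation model produces.
    Returns (object_id, area_ratio) pairs, largest area first.
    """
    counts = Counter(segmentation)
    total = len(segmentation)
    return [(obj_id, n / total) for obj_id, n in counts.most_common(k)]

# Toy 4x4 image, flattened: object 1 covers 8 px, object 2 covers 5, object 3 covers 3.
seg = [1] * 8 + [2] * 5 + [3] * 3
print(top_k_objects(seg, k=2))  # [(1, 0.5), (2, 0.3125)]
```

The selected object pairs would then be handed to the text generator, which decodes relation descriptions under the prefix-tree constraint.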
The experimental results show that the generative relation extraction model has clear advantages over multi-label classification models (ResNet, ViT, and CLIP). The combination of a visual encoder and a language decoder enables the model to learn deeper relational semantic concepts, which makes our model significantly outperform the other baseline models.
In summary, we make the following explorations on the
OpenPSG dataset:
Multi-modal Image-to-text Model We introduce the vision encoder-decoder model [1] for the PSG classification task. Leveraging the strong learning abilities of a pre-trained vision encoder and language decoder, our model shows superior semantic understanding capability spanning image and text.
End-to-end Scene Graph Generation Compared with CLIP, which only exchanges multi-modal information between image features and text features through the attention matrix, our method ISRG fuses image and text information at each token generation step [12]. In this way, the text and the image can be fully integrated, encouraging the model to comprehend relational semantic concepts.
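The prefix-tree (trie) constraint used during decoding can be sketched as below. This is a minimal illustration of the idea, not the paper's implementation: the class name `PrefixTree` and the word-level tokens (standing in for subword token ids) are assumptions, and in practice `allowed_next` would mask the decoder's logits at each step so that only valid continuations of a known relation description can be generated.

```python
class PrefixTree:
    """Trie over the allowed relation token sequences. At each decoding
    step, only the children of the node reached by the current prefix
    are valid next tokens."""

    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix):
        """Return the tokens allowed after `prefix` ([] means end of sequence)."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node)

# Hypothetical relation vocabulary, tokenised at word level for clarity.
relations = [("in", "front", "of"), ("holding",), ("in", "side")]
tree = PrefixTree(relations)
print(tree.allowed_next(("in",)))  # ['front', 'side']
```

Constraining generation this way guarantees every decoded sequence is one of the known relation descriptions, which is what lets an open-ended text generator solve a closed-set classification task.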
Comparison of Model Performance In the experiment section, we compare several baseline models, including ResNet-50, ViT, and CLIP, against our method ISRG. Our model outperforms the other baselines on the OpenPSG dataset by addressing the shortcomings of previous models.
2. Related works
Scene Graph is a graph structure G = (O, E) extracted from a given image, which consists of N object nodes O = (O_1, O_2, ..., O_N) and M relations between nodes E = (E_1, E_2, ..., E_M). Therefore, we can use triplets T = (O_head, E_relation, O_tail) on this directed graph to describe relationships between two objects in a given image [17], for example, location relations like ("zebra in front of elephant") or state relations like ("person holding baseball bat").
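The triplet structure defined above can be written down directly as a small data structure. This is an illustrative sketch (the `Triplet` type and the example object/relation strings are not from the paper beyond the two quoted examples):

```python
from typing import NamedTuple

class Triplet(NamedTuple):
    """One directed edge T = (O_head, E_relation, O_tail) of a scene graph."""
    head: str
    relation: str
    tail: str

# Object nodes O and relation edges E, using the paper's two example relations.
objects = {"zebra", "elephant", "person", "baseball bat"}
edges = [
    Triplet("zebra", "in front of", "elephant"),
    Triplet("person", "holding", "baseball bat"),
]

# Every edge must connect two known object nodes.
assert all(t.head in objects and t.tail in objects for t in edges)
print(edges[0])  # Triplet(head='zebra', relation='in front of', tail='elephant')
```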
Residual Neural Network (ResNet) is an open-gated variant of HighwayNet [14]. Due to its skip-connection mechanism, ResNet is the first feed-forward neural network to reach
arXiv:2210.11253v1 [cs.CV] 19 Oct 2022