Image Semantic Relation Generation
Mingzhe Du
School of Computer Science and Engineering, NTU
50 Nanyang Ave, Singapore 639798
mingzhe001@e.ntu.edu.sg
Abstract
Scene graphs provide structured semantic understanding beyond images. For downstream tasks such as image retrieval, visual question answering, visual relationship detection, and even autonomous vehicle technology, scene graphs can not only distil complex image information but also correct the bias of visual models using semantic-level relations, giving them broad application prospects. However, the heavy labour cost of constructing graph annotations may hinder the application of PSG in practical scenarios. Inspired by the observation that people usually identify the subject and object first and then determine the relationship between them, we propose to decouple the scene graph generation task into two sub-tasks: 1) an image segmentation task to pick out the qualified objects, and 2) a restricted auto-regressive text generation task to generate the relation between given objects. In this work, we therefore introduce image semantic relation generation (ISRG), a simple but effective image-to-text model, which achieves 31 points on the OpenPSG dataset and outperforms strong baselines by 16 points (ResNet-50) and 5 points (CLIP), respectively.
1. Introduction
The PSG classification task aims to identify the three most salient relations in a given image [6]. Unlike generating a full scene graph [17], this task does not require the model to find the objects corresponding to the relations [6]. Therefore, if we can solve the scene graph generation task, or at least identify the relationship between a given subject and object, we can solve the PSG classification task completely. In this work, we first employ a panoptic image segmentation algorithm [7] from Detectron2 [4] to map each pixel to its object, then pick the top-k objects with the largest area ratios, and finally feed these objects into a multi-modal model to generate sequences, under a prefix-tree constraint, until we obtain three distinct relation descriptions.
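The object-selection step above (keep the top-k objects with the largest area ratios) can be sketched as follows. This is an illustrative sketch, not the paper's code: the function name `top_k_objects` and the flat per-pixel id list standing in for a real panoptic segmentation output are assumptions for the example.

```python
from collections import Counter

def top_k_objects(segmentation, k=3):
    """Pick the k objects covering the largest fraction of the image.

    `segmentation` is a flat list of per-pixel object ids, a stand-in
    for the pixel-to-object map a panoptic segmentation model produces.
    Returns (object_id, area_ratio) pairs, largest area first.
    """
    counts = Counter(segmentation)
    total = len(segmentation)
    return [(obj_id, n / total) for obj_id, n in counts.most_common(k)]

# Toy 4x4 image, flattened: object 1 covers 8 px, object 2 covers 5, object 3 covers 3.
seg = [1] * 8 + [2] * 5 + [3] * 3
print(top_k_objects(seg, k=2))  # [(1, 0.5), (2, 0.3125)]
```

The selected object pairs would then be handed to the text generator, which decodes relation descriptions under the prefix-tree constraint.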
The experimental results show that the generative relation extraction model has clear advantages over multi-label classification models (ResNet, ViT, and CLIP). The combination of a visual encoder and a language decoder enables the model to learn deeper relational semantic concepts, which makes our model significantly outperform the other baseline models.
In summary, we make the following explorations on the
OpenPSG dataset:
Multi-modal Image-to-text Model We introduce the vision encoder-decoder model [1] for the PSG classification task. Leveraging the strong learning abilities of a pre-trained vision encoder and language decoder, our model shows superior semantic understanding capability spanning image and text.
End-to-end Scene Graph Generation Compared with CLIP, which only exchanges multi-modal information between image features and text features through the attention matrix, our method ISRG fuses image and text information at each token generation step [12]. In this way, the text and the image can be fully integrated, encouraging the model to comprehend relational semantic concepts.
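The prefix-tree (trie) constraint used during decoding can be sketched as below. This is a minimal illustration of the idea, not the paper's implementation: the class name `PrefixTree` and the word-level tokens (standing in for subword token ids) are assumptions, and in practice `allowed_next` would mask the decoder's logits at each step so that only valid continuations of a known relation description can be generated.

```python
class PrefixTree:
    """Trie over the allowed relation token sequences. At each decoding
    step, only the children of the node reached by the current prefix
    are valid next tokens."""

    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix):
        """Return the tokens allowed after `prefix` ([] means end of sequence)."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node)

# Hypothetical relation vocabulary, tokenised at word level for clarity.
relations = [("in", "front", "of"), ("holding",), ("in", "side")]
tree = PrefixTree(relations)
print(tree.allowed_next(("in",)))  # ['front', 'side']
```

Constraining generation this way guarantees every decoded sequence is one of the known relation descriptions, which is what lets an open-ended text generator solve a closed-set classification task.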
Comparison of Model Performance In the experiment section, we compare several baseline models, including ResNet-50, ViT, and CLIP, against our method ISRG. Our model outperforms the other baselines on the OpenPSG dataset by addressing the shortcomings of previous models.
2. Related works
Scene Graph is a graph structure G = (O, E) extracted from a given image, which consists of N object nodes O = (O_1, O_2, ..., O_N) and M relations between nodes E = (E_1, E_2, ..., E_M). Therefore, we can use triplets T = (O_head, E_relation, O_tail) on this directed graph to describe relationships between two objects in a given image [17], for example, location relations like ("zebra in front of elephant") or state relations like ("person holding baseball bat").
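The triplet structure defined above can be written down directly as a small data structure. This is an illustrative sketch (the `Triplet` type and the example object/relation strings are not from the paper beyond the two quoted examples):

```python
from typing import NamedTuple

class Triplet(NamedTuple):
    """One directed edge T = (O_head, E_relation, O_tail) of a scene graph."""
    head: str
    relation: str
    tail: str

# Object nodes O and relation edges E, using the paper's two example relations.
objects = {"zebra", "elephant", "person", "baseball bat"}
edges = [
    Triplet("zebra", "in front of", "elephant"),
    Triplet("person", "holding", "baseball bat"),
]

# Every edge must connect two known object nodes.
assert all(t.head in objects and t.tail in objects for t in edges)
print(edges[0])  # Triplet(head='zebra', relation='in front of', tail='elephant')
```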
Residual Neural Network (ResNet) is an open-gated variant of HighwayNet [14]. Due to its skip-connection mechanism, ResNet is the first feed-forward neural network to reach
arXiv:2210.11253v1 [cs.CV] 19 Oct 2022