
Unbiased Scene Graph Generation using

Predicate Similarities

Misaki Ohashi∗ Yusuke Matsui∗

Abstract

Scene graphs are widely applied in computer vision as a graphical
representation of relationships between objects shown in images. However,
these applications have not yet reached a practical stage of development
owing to biased training caused by long-tailed predicate distributions. Many
recent studies have tackled this problem. In contrast, relatively few works
have considered predicate similarities, a characteristic of the dataset that
also leads to biased predictions. Because of these similarities, infrequent
predicates (e.g., “parked on”, “covered in”) are easily misclassified as
closely related frequent predicates (e.g., “on”, “in”). Utilizing predicate
similarities, we propose a new classification scheme that branches the
process into several fine-grained classifiers for groups of similar
predicates. The classifiers aim to capture the differences among similar
predicates in detail. We also introduce the idea of transfer learning to
enhance the features for predicates that lack sufficient training samples to
learn descriptive representations. The results of extensive experiments on
the Visual Genome dataset show that the combination of our method and an
existing debiasing approach greatly improves performance on tail predicates
in the challenging SGCls/SGDet tasks. Nonetheless, the overall performance of
the proposed approach does not reach that of the current state of the art, so
further analysis remains necessary as future work.

1 Introduction

Scene graphs describe the objects that appear in an image and the
relationships among them. Generally, scene graph generation (SGG) is divided
into three stages: object detection, object classification, and relationship
classification. Scene graphs comprehensively capture the content of image
scenes. Hence, they can be applied to high-level and wide-ranging practical
tasks, including visual question answering [2, 3, 4], image captioning
[5, 6, 7], and image retrieval [8, 9].

The relationship classification stage in SGG typically suffers from class
imbalance in the most widely used dataset, Visual Genome [10]. As shown in
Fig. 1, the number of training samples for “on” is about 50 times larger than
that for “standing on”. A model trained on such an imbalanced dataset is more
likely to predict a few frequent predicates (e.g., “on”, “in”) than the many
infrequent predicates (e.g., “lying on”, “covered in”). Hereafter, we refer
to frequent and infrequent predicates as head and tail predicates,
respectively.

∗The University of Tokyo

arXiv:2210.00920v1 [cs.CV] 3 Oct 2022

[Figure 1 panels: (a) Input Image; (b) Ground Truth; (c) bar chart with axes “Predicate” and “Frequency”, annotated “Label Similarities”; (d) Prediction by [1]]

Figure 1: The class imbalance problem in scene graph generation. (a) An input

image. (b) Ground-truth scene graph. (c) Frequency distribution of training

samples for top-30 most frequent labels. (d) Aﬀected by label similarities and

imbalanced data distribution, the prediction results of an earlier method [1]

misclassify some descriptive predicates as “on”.

Existing unbiased methods [11, 12, 13, 14, 15, 16, 17, 18, 19] have focused
on the long-tailed distribution in the dataset. However, few works have
focused on another characteristic of the dataset, predicate similarities,
which are also an important cause of biased predictions. In contrast to
general classification tasks, the dataset includes many semantically similar
predicates. These similarities make distinguishing between heads and tails
challenging and encourage misclassification of tail predicates as more
predictable head predicates. Because head predicates are less descriptive
than tail predicates, graphs dominated by head predicates are less
informative and less practical. For example, Fig. 1 (b) and (d) show that the
behavior “walking on” and the state “parked on” are both predicted as “on”,
resulting in an ambiguous description of the image content. Scene graphs that
represent limited visual information typically perform poorly when applied to
high-level tasks. Therefore, SGG models should be developed to predict as
specific a predicate as possible based on the subjects represented in the
image.

In this study, we propose a new relation predictor that utilizes the predi-

cate similarities of the dataset. Conventional all-class classiﬁers consider only

signiﬁcant diﬀerences between dissimilar predicates. In contrast, our proposed

predictor consists of several independent ﬁne-grained classiﬁers, each focusing

on slight diﬀerences between semantically similar predicates. The proposed ap-

proach is designed to recognize tail predicates that conventional classiﬁers tend

to misclassify as similar head predicates.
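To make the branching idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a two-step reading of the scheme: a coarse step scores groups of similar predicates, and a fine-grained step scores only the predicates inside the chosen group. The grouping in `GROUPS`, the function names, and the logits passed in are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical groups of semantically similar predicates (illustrative only;
# the actual grouping used in the paper is not given in this section).
GROUPS = {
    "on-like": ["on", "standing on", "parked on", "walking on"],
    "in-like": ["in", "covered in"],
}


def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(group_logits, fine_logits):
    """Pick a group first, then a predicate within that group.

    group_logits: dict mapping group name -> logit.
    fine_logits:  dict mapping group name -> list of logits aligned with
                  GROUPS[group name].
    """
    names = list(group_logits)
    group_probs = softmax([group_logits[g] for g in names])
    best_group = names[max(range(len(names)), key=group_probs.__getitem__)]

    # Only predicates similar to each other compete in this step, so the
    # fine-grained scores decide between e.g. "on" and "parked on".
    fine_probs = softmax(fine_logits[best_group])
    best_idx = max(range(len(fine_probs)), key=fine_probs.__getitem__)
    return best_group, GROUPS[best_group][best_idx]
```

For example, `predict({"on-like": 2.0, "in-like": 0.5}, {"on-like": [0.1, 0.3, 2.0, 0.2], "in-like": [1.0, 0.0]})` selects the “on-like” group and then “parked on” within it, because the head predicate “on” no longer competes against every dissimilar label at once.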

Furthermore, inspired by earlier work [14], we adopt a knowledge transfer

module for better representation learning. It enhances poorly learned features of

tail predicates by transferring the features of heads learned with suﬃcient sam-

ples. In contrast to the previous method [14], we transfer the knowledge within

similar predicates rather than all predicates. Because each ﬁne-grained classiﬁer

targets speciﬁc similar predicates, features would be noisy if the knowledge from

all predicates were incorporated, including dissimilar ones.

The contributions of this study are summarized as follows.

• We propose a method to handle the long-tailed distribution and semantic
similarities of predicate labels by combining a similarity-based branching
scheme and a knowledge transfer module.

• The proposed method effectively improves the prediction of tail predicates.
In particular, when combined with an existing debiasing inference method, it
achieved the best recall on the challenging SGCls/SGDet tasks.

• Although our approach improved the accuracy of tail labels, its overall
performance was lower than the current state of the art, especially for a
relatively easy task (PredCls). Further analysis remains as future work.

2 Related Work

2.1 Imbalanced Classiﬁcation

In recent years, three primary methods have been applied to perform classiﬁca-

tion tasks involving long-tailed datasets.

Data re-balancing is a classical approach that adjusts the amount of data to
achieve a more balanced distribution. This approach includes over-sampling
for minority classes [20, 21] and under-sampling for majority classes [22].
Over-sampling is prone to over-fitting on the tail classes, whereas
under-sampling discards a considerable portion of the data, which makes it
difficult to apply to highly imbalanced datasets.

Cost-sensitive re-weighting assigns different loss weights at the class or
sample level. Commonly used methods include weighting classes proportionally
to the inverse of the class frequency [23, 24] or the inverse square root of
the frequency [25, 26]. In recent years, Cui et al. [27] proposed
re-weighting by the inverse effective number of samples, and Lin et al. [28]
introduced sample-level re-weighting.
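The class-level weighting schemes above can be compared with a short sketch. It assumes per-class training counts are available as a plain list; the function names, the example counts, and the default `beta` are illustrative choices, not values from the paper. The effective number of samples follows the commonly cited form E_n = (1 - beta^n) / (1 - beta), with the class weight taken as 1 / E_n.

```python
import math


def inverse_frequency_weights(counts):
    """Weight each class by 1 / n, where n is its training sample count."""
    return [1.0 / n for n in counts]


def inverse_sqrt_weights(counts):
    """Weight each class by 1 / sqrt(n), a milder correction than 1 / n."""
    return [1.0 / math.sqrt(n) for n in counts]


def effective_number_weights(counts, beta=0.999):
    """Weight each class by the inverse effective number of samples.

    E_n = (1 - beta**n) / (1 - beta), so the weight is (1 - beta) / (1 - beta**n).
    beta is a hyperparameter in [0, 1); 0.999 is only an illustrative default.
    """
    return [(1.0 - beta) / (1.0 - beta ** n) for n in counts]


def normalize(weights):
    """Rescale weights so that they sum to the number of classes."""
    scale = len(weights) / sum(weights)
    return [w * scale for w in weights]


if __name__ == "__main__":
    # Hypothetical counts for a head class and a tail class (50:1 ratio).
    counts = [1000, 20]
    for fn in (inverse_frequency_weights, inverse_sqrt_weights,
               effective_number_weights):
        print(fn.__name__, normalize(fn(counts)))
```

With a 50:1 count ratio, inverse-frequency weighting gives the tail class 50 times the weight of the head class, the square-root variant gives it about 7 times, and the effective-number variant falls between these two for this choice of `beta`.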

Transfer learning involves transferring features learned from head classes

with abundant samples to tail classes that are learned insuﬃciently. Liu et

al. [29] introduced dynamic meta-embedding to exchange visual knowledge be-

tween heads and tails by combining a direct image feature and associated mem-

ory representations.

2.2 Scene Graph Generation

In the ﬁrst stage of SGG, an object detector (e.g., Faster R-CNN [30]) detects

several objects in an image. As the next step, object classiﬁcation is performed

after encoding the detections from the ﬁrst stage into object contextual infor-

mation. In most studies, the contexts are incorporated by message passing al-

gorithms such as graph attention networks [31], LSTM [1], and TreeLSTM [32].

Finally, the relationships among detected objects are predicted with a module

similar to object classiﬁcation.

Many studies [11, 12, 13, 14, 15, 16, 17, 18, 19] have proposed methods to
deal with the class imbalance problem since Chen et al. [33] and Tang et
al. [32] proposed the more balanced mean recall metric. Tang et al. [13]
adopted a counterfactual approach at inference time to remove a context
co-occurrence bias. Chiou et al. [19] recovered unbiased probabilities from
biased probabilities using label frequencies estimated dynamically during
training. Recent works have also adopted the general ideas described in
Sec. 2.1 to tackle long-tailed issues. Li et al. [18] proposed bi-level data
resampling, which includes image-level oversampling and instance-level
undersampling. Moreover, task-specific loss functions and weighting methods
have been proposed. Yan et al. [15] introduced loss re-weighting by the
inverse of the degree of predicate correlation. Yu et al. [16] proposed a
loss based on a hierarchical cognitive structure to support coarse-to-fine
classification. Suhail et al. [17] adopted a loss formulation using an
energy-based model for structured learning of scene graphs. Furthermore, He
et al. [14] applied transfer learning to SGG tasks.
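The mean recall metric mentioned above can be illustrated with a simplified sketch. Actual SGG evaluation computes recall over the top-K predicted triplets per image; the version below ignores ranking and images and assumes only a list of ground-truth predicate labels with a flag saying whether each was recovered. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def recall_and_mean_recall(ground_truth, hits):
    """Return (recall, mean recall) for parallel lists.

    ground_truth: predicate label of each ground-truth relationship.
    hits:         True if that relationship was recovered by the model.
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for label, hit in zip(ground_truth, hits):
        total[label] += 1
        correct[label] += int(hit)

    # Plain recall pools all relationships, so frequent predicates dominate.
    recall = sum(correct.values()) / sum(total.values())

    # Mean recall averages per-predicate recall, so each predicate counts once.
    per_class = [correct[p] / total[p] for p in total]
    mean_recall = sum(per_class) / len(per_class)
    return recall, mean_recall


if __name__ == "__main__":
    # A model that always answers "on": every "on" is right, every tail is wrong.
    gt = ["on"] * 8 + ["parked on"] * 2
    hits = [True] * 8 + [False] * 2
    print(recall_and_mean_recall(gt, hits))  # recall 0.8, mean recall 0.5
```

In this toy case a model that collapses every relationship to “on” still reaches a recall of 0.8, whereas its mean recall is only 0.5, which is why mean recall is the more informative metric for debiasing methods.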

These recent works [11, 12, 13, 14, 15, 16, 17, 18, 19] have improved SGG
performance, but few studies have addressed predicate similarities in the
dataset. Yan et al. [15] mentioned this characteristic but focused on
predicates having weak correlations with others, and thereby did not directly
take advantage of the relationship between similar predicates. Yu et al. [16]
adopted a focus similar to that of the present work, but their method only
considers parent-child relationships among predicates, whereas the proposed
method is not limited to such hierarchical similarities.

3 Proposed Approach

Scene graph generation tasks involve generating a graph representation com-

prising objects and the visual relationships among them shown in a given input

image. In particular, we aim to address the biased relationship classiﬁcation

caused by imbalanced predicate distributions and semantic overlaps among the

predicates. To this end, we introduce a classiﬁcation strategy which focuses on

predicate similarities and utilizes the idea of transfer learning. In this section,

we ﬁrst present the problem setting in Sec. 3.1. We then explain the details of

our proposed predictor in Sec. 3.2. Fig. 2 shows an overview of the model.

3.1 Problem Setting

We first detect object candidates using a standard object detector such as
Faster R-CNN [30]. Given an image $I$, the detector outputs $N$ bounding
boxes $B = \{b_i\}_{i=1}^{N} \subset \mathbb{R}^4$. Each box also includes an
RoIAlign feature [34] and a tentative object label such as “dog” or “man”. We
then refine these features with a message-passing module for the final object
classification and relationship classification.

Relationship classification is then performed as follows. Given a pair of
bounding boxes, a relation predictor classifies the pair into one of $A$
predicate labels (e.g., “on”, “in”), denoted as
$\mathcal{A} = \{1, 2, \dots, A\}$. Here, for each pair of bounding boxes, we
have three input features, $e$, $z$, and $u$ (see an example in Fig. 2). A
$P$-dimensional pairwise relation feature $e \in \mathbb{R}^P$ is obtained from
