Many studies [11, 12, 13, 14, 15, 16, 17, 18, 19] have proposed various meth-
ods to deal with the class imbalance problem since Chen et al. [33] and Tang
et al. [32] proposed the more balanced mean recall metrics. Tang et al. [13]
adopted a counterfactual approach in making inferences to remove a context
co-occurrence bias. Chiou et al. [19] recovered the unbiased probabilities from
biased probabilities by label frequencies estimated dynamically in training. Also,
recent works have adopted general ideas to tackle long-tailed issues, as
shown in Sec. 2.1. Li et al. [18] proposed bi-level data resampling, including
image-level oversampling and instance-level undersampling. Moreover, task-
specific loss functions and weighting methods have also been proposed. Yan
et al. [15] introduced loss re-weighting by the inverse of the degree of predicate
correlation. Yu et al. [16] proposed a loss for a hierarchical cognitive structure
to support coarse-to-fine classification. Suhail et al. [17] adopted a loss for-
mulation using an energy-based model for structured learning of scene graphs.
Furthermore, He et al. [14] applied the approach of transfer learning to SGG
tasks.
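To illustrate why mean recall is considered more balanced than plain recall under class imbalance, the following minimal sketch compares the two on a toy example. It is a simplification under stated assumptions: the function name, the toy labels, and the hit flags are hypothetical, and the top-K cutoff and per-image evaluation used in practice are omitted.

```python
from collections import defaultdict


def recall_and_mean_recall(gt_labels, hits):
    """Compare plain recall with mean (per-predicate) recall.

    gt_labels: predicate label of each ground-truth relationship.
    hits: True if that ground-truth relationship was recovered.
    Simplified illustration: no top-K cutoff, no per-image averaging.
    """
    total = defaultdict(int)
    found = defaultdict(int)
    for label, hit in zip(gt_labels, hits):
        total[label] += 1
        found[label] += int(hit)

    # Plain recall pools all relationships, so frequent predicates dominate.
    recall = sum(found.values()) / sum(total.values())

    # Mean recall averages per-predicate recall, weighting each predicate equally.
    per_class = {c: found[c] / total[c] for c in total}
    mean_recall = sum(per_class.values()) / len(per_class)
    return recall, mean_recall


# Hypothetical toy data: a frequent predicate that is always recovered
# and a rare predicate that is never recovered.
gt = ["on"] * 8 + ["standing on"] * 2
hits = [True] * 8 + [False] * 2
recall, mean_recall = recall_and_mean_recall(gt, hits)
print(recall, mean_recall)  # 0.8 0.5
```

A predictor that ignores the rare predicate still reaches a recall of 0.8 here, whereas its mean recall drops to 0.5, which is why the latter exposes bias toward frequent predicates.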
These recent works [11, 12, 13, 14, 15, 16, 17, 18, 19] have improved SGG per-
formance, but few studies have addressed predicate similarities in the dataset.
Yan et al. [15] mentioned this property but focused on predicates having weak
correlations with others, and thereby did not directly take advantage of the
relationship between similar predicates. Yu et al. [16] adopted a similar focus
to that of the present work, but their method considers only parent-child
relationships among predicates, whereas the proposed method is not limited to
such hierarchical similarities.
3 Proposed Approach
Scene graph generation tasks involve generating a graph representation com-
prising objects and the visual relationships among them shown in a given input
image. In particular, we aim to address the biased relationship classification
caused by imbalanced predicate distributions and semantic overlaps among the
predicates. To this end, we introduce a classification strategy which focuses on
predicate similarities and utilizes the idea of transfer learning. In this section,
we first present the problem setting in Sec. 3.1. We then explain the details of
our proposed predictor in Sec. 3.2. Fig. 2 shows an overview of the model.
3.1 Problem Setting
We first detect object candidates using a standard object detector such as Faster
R-CNN [30]. Given an image $I$, the detector outputs $N$ bounding boxes
$B = \{b_i\}_{i=1}^{N} \subset \mathbb{R}^4$. Each box also includes an ROIAlign feature [34] and a tentative
object label such as “dog” or “man”. We then refine these features with
a message-passing module for the final object classification and relationship
classification.
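As a rough illustration of this setting, the sketch below represents the detector output as a list of 4-dimensional boxes and enumerates the box pairs passed on to relationship classification. It rests on assumptions not stated in the text: the function name and box values are hypothetical, boxes are taken to be (x1, y1, x2, y2) tuples, and every ordered (subject, object) pair of distinct boxes is treated as a candidate.

```python
from itertools import permutations


def candidate_pairs(boxes):
    """Enumerate ordered (subject, object) index pairs over detected boxes.

    boxes: list of N boxes, each a 4-tuple (assumed here to be x1, y1, x2, y2).
    Assumption: all ordered pairs of distinct boxes are candidates.
    """
    for box in boxes:
        if len(box) != 4:
            raise ValueError("each bounding box must have 4 coordinates")
    # N boxes give N * (N - 1) ordered pairs.
    return list(permutations(range(len(boxes)), 2))


# Hypothetical detector output for one image (N = 3).
boxes = [
    (10.0, 20.0, 110.0, 220.0),
    (50.0, 60.0, 150.0, 260.0),
    (0.0, 0.0, 40.0, 40.0),
]
pairs = candidate_pairs(boxes)
print(len(pairs))  # 6
```

Ordered pairs are used in this sketch because a relationship is directional: “man riding horse” and “horse riding man” are different candidates.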
Relationship classification is then performed as follows. Given a pair of
bounding boxes, a relation predictor classifies the pair from a set of $A$ predicate
labels (e.g., “on”, “in”) denoted as $\mathcal{A} = \{1, 2, \ldots, A\}$. Here, for each pair of
bounding boxes, we have three input features, $e$, $z$, and $u$ (see an example in
Fig. 2). A $P$-dimensional pairwise relation feature $e \in \mathbb{R}^P$ is obtained from