Context-driven Visual Object Recognition based
on Knowledge Graphs
Sebastian Monka1,2, Lavdim Halilaj1, and Achim Rettinger2
1Bosch Center for Artificial Intelligence, Renningen, Germany
{sebastian.monka,lavdim.halilaj}@de.bosch.com
2Trier University, Trier, Germany
{rettinger}@uni-trier.de
Abstract. Current deep learning methods for object recognition are
purely data-driven and require a large number of training samples to
achieve good results. Due to their sole dependence on image data, these
methods tend to fail when confronted with new environments where even
small deviations occur. Human perception, however, has proven to be
significantly more robust to such distribution shifts. It is assumed that
this ability to deal with unknown scenarios is based on the extensive
incorporation of contextual knowledge. Context can be based either on object
co-occurrences in a scene or on memory of experience. In accordance
with the human visual cortex, which uses context to form different object
representations for a seen image, we propose an approach that enhances
deep learning methods by using external contextual knowledge encoded
in a knowledge graph. To this end, we extract different contextual views
from a generic knowledge graph, transform the views into vector space,
and infuse them into a DNN. We conduct a series of experiments to
investigate the impact of different contextual views on the learned object
representations for the same image dataset. The experimental results
provide evidence that the contextual views influence the image represen-
tations in the DNN differently and therefore lead to different predictions
for the same images. We also show that context helps to strengthen the
robustness of object recognition models for out-of-distribution images,
usually occurring in transfer learning tasks or real-world scenarios.
Keywords: Neuro-Symbolic · Knowledge Graph · Contextual Learning
1 Introduction
How humans perceive the real world is strongly dependent on the context [38,29].
Especially in situations with poor-quality visual input, for instance caused by
large distances or short capture times, context appears to play a major role
in improving the reliability of recognition [43]. Perception is not only influenced
by co-occurring objects or visual features in the same image, but also by
experience and memory [39]. There is evidence that humans perceive similar
images differently depending on the given context [10]. A famous example is
the class of ambiguous figures shown in Figure 1.
arXiv:2210.11233v1 [cs.AI] 20 Oct 2022
(a) Duck or rabbit? [25] (b) Young lady or old woman? [1].
Fig. 1: The mental representation for ambiguous images can change based on the
context, although the perceived image is still the same.
Depending on the context, i.e., whether it is Easter or Christmas [9], Figure 1a can
be seen as either a duck or a rabbit. Likewise, influenced by own-age social biases [36],
Figure 1b can be seen as either a young lady or an old woman. Humans categorize images
based on various types of context. Known categories are based on visual features
or semantic concepts [5], but may also be based on other information such as
attributes describing their function. Accordingly, neuroscience has shown that
the human brain encodes visual input into individual contextual object represen-
tations [15,17,49], namely visual, taxonomical, and functional [32]. Concretely,
in a visual context, images of a drum and a barrel have a high similarity, as they
share similar visual features. In a taxonomical context, a drum would be similar
to a violin, as they both are musical instruments. And in a functional context,
the drum would be similar to a hammer, since the same action of hitting can be
performed with both objects [7].
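The three contextual notions of similarity above can be sketched as context-specific relations over the same set of classes. The class names and relation structure below are illustrative placeholders motivated by the drum/barrel/violin/hammer example, not the actual knowledge graph used in the paper:

```python
# Hypothetical contextual relations over the same classes; each context
# induces a different nearest neighbor for the class "drum".
contextual_neighbors = {
    "visual": {"drum": "barrel"},        # shared visual features (shape)
    "taxonomical": {"drum": "violin"},   # both are musical instruments
    "functional": {"drum": "hammer"},    # both afford the action of hitting
}

def most_similar(cls, context):
    """Return the class most similar to `cls` under the given context."""
    return contextual_neighbors[context][cls]

print(most_similar("drum", "visual"))       # barrel
print(most_similar("drum", "taxonomical"))  # violin
print(most_similar("drum", "functional"))   # hammer
```

The point of the sketch is that similarity is not a property of the image alone: the same class has a different neighborhood depending on which contextual view is applied.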
While there is much evidence that intelligent machines should likewise repre-
sent information in contextualized embeddings, deep neural networks (DNNs)
form their object representations based solely on the feature distribution of the
image dataset [8,56]. Consequently, they fail if objects are placed in an incon-
gruent context that was not present in previously seen images [4].
For the scope of this work we investigate the following research questions:
RQ1 - Can context, provided in the form of a knowledge graph (KG), influence
the learned image representations of a DNN, the final accuracy, and the image predictions?
RQ2 - Can context help to avoid critical errors in domain-changing scenarios
where DNNs fail?
To enable standard DNNs to build contextual object representations, we pro-
vide the context using a knowledge graph (KG) and its corresponding knowledge
graph embedding h_KG. Similar to the process in the human brain, we conduct
experiments with three different types of context, namely visual context, taxo-
nomical context, and functional context. We provide two versions of knowledge
infusion into a DNN and compare the induction of different contextual models in
depth by quantitatively investigating their learned contextual embedding spaces
using class-related cosine similarities. In addition, we evaluate our approach quan-
titatively by comparing the final accuracies on object recognition tasks on source
and target domains, and provide insights and challenges. The remainder of this
paper is organized as follows: Section 6 outlines related work. In Section 3.1 we
introduce the three different types of context and an option to model these views
in a contextual knowledge graph. Section 3 shows two ways of infusing context
into a visual DNN. In Section 4 we conduct experiments on seven image datasets
in two transfer learning scenarios. In Section 5 we answer the research questions
and summarize the main insights of our approach.
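The class-related cosine similarity used for the quantitative analysis of the embedding spaces can be sketched in pure Python. The toy embedding vectors below are invented for illustration; the actual class embeddings are learned by the models:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy class embeddings: under a given contextual view, classes that the
# context relates should end up closer together in the embedding space.
drum   = [1.0, 0.9, 0.1]
barrel = [0.9, 1.0, 0.2]
hammer = [0.1, 0.2, 1.0]

# In a visual context, drum should be closer to barrel than to hammer.
print(cosine_similarity(drum, barrel) > cosine_similarity(drum, hammer))  # True
```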
2 Preliminaries
Contextual Image Representations in the Brain. Cognitive and neuroscience re-
search has recently begun to investigate the relationship between viewed objects
and the corresponding fMRI scan activities of the human brain. It is assumed
that the primate visual system is organized into two separate processing path-
ways in the visual cortex, namely, the dorsal pathway and the ventral pathway.
While the dorsal pathway is responsible for the spatial recognition of objects
as well as actions and manipulations such as grasping, the ventral pathway is
responsible for recognizing the type of object based on its form or motion [52].
Bonner et al. [6] recently showed that the sensory coding of objects in the ven-
tral cortex of the human brain is related to statistical embeddings of object or
word co-occurrences. Moreover, these object representations potentially reflect a
number of different properties, which together are considered to form an object
concept [32]. Such a concept can be learned based on the context in which the object is seen.
For example, an object concept may include the visual features, its taxonomy,
or the function of the object [49,17].
Image Representations in the DNN. Recent work has shown that while the
performance of humans, monkeys, and DNNs is quite similar for object-level
confusions, the image-level performance does not match between different do-
mains [49]. In contrast to visual object representations in the brain, which also
include high level contextual knowledge of concepts and their functions, image
representations of DNNs only depend on the statistical co-occurrence of visual
features and a specific task. We consider the context extracted from the dataset
as dataset bias. Even in balanced datasets, i.e., datasets containing the same
number of images for each class, there still exists imbalance due to overlap of
features between different classes. For instance, it must be taken into account
that a cat and a dog have similar visual features and that in composite datasets
certain classes can have different meta-information for the images, such as illu-
mination, perspective or sensor resolution. This dataset bias leads to predefined
neighborhoods in the visual embedding space, as well as predefined similarities
between distinct classes. In a DNN, an encoder network E(·) maps images x
to a visual embedding h_v = E(x) ∈ R^{d_E}, where the activations of the final
pooling layer, and thus of the representation layer, have dimensionality d_E,
which depends on the encoder network itself.
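The mapping h_v = E(x) ∈ R^{d_E} can be illustrated with a stand-in encoder whose pooling step fixes the embedding dimensionality. This is a minimal stub for the shape contract only, not the trained DNN used in the experiments:

```python
def mock_encoder(image, d_E=4):
    """Stand-in for E(.): maps an image (a 2-D list of pixel values) to a
    d_E-dimensional embedding by average-pooling over d_E column strips.
    A real encoder would be a trained CNN whose final pooling layer
    determines d_E."""
    h, w = len(image), len(image[0])
    strip = max(w // d_E, 1)
    embedding = []
    for i in range(d_E):
        cols = range(i * strip, min((i + 1) * strip, w))
        vals = [image[r][c] for r in range(h) for c in cols]
        embedding.append(sum(vals) / len(vals) if vals else 0.0)
    return embedding

x = [[0.0, 0.5, 1.0, 0.5] for _ in range(4)]  # toy 4x4 "image"
h_v = mock_encoder(x, d_E=4)
print(len(h_v))  # 4 -- the embedding dimensionality d_E
```

Whatever the internals of E(·), the contract is the same: every image, regardless of its resolution, is mapped to a fixed-length vector of dimensionality d_E.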
Contextual Representations in the KG. A knowledge graph is a graph of data
aiming to accumulate and convey real-world knowledge, where entities are repre-
[Figure 2: diagram omitted; it depicts a generic KG and an image dataset, with visual, taxonomical, and functional contextual views extracted via KGE and each infused into a DNN.]
Fig. 2: Our approach to learn contextual image representations consists of two
main parts: 1) the contextual view extraction; and 2) the contextual view infusion.
sented by nodes and relationships between entities are represented by edges [20].
We define a generic knowledge graph (GKG) as a graph of data that relates
different classes of a dataset based on defined contextual properties. These con-
textual properties can be both learned and manually curated. They bring in
prior knowledge about classes, even those that may not necessarily be present
in the image dataset, and thus place them in contextual relationships with each
other. A KG comprises a set of triples G = (H, R, T), where H represents enti-
ties, T ⊆ E ∪ L denotes entities or literal values, and R is a set of relationships
connecting H and T.
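A KG of this form, together with the extraction of a contextual view by filtering on relations, can be sketched as follows. The relation names and triples are illustrative placeholders, not the paper's actual generic KG:

```python
# A KG as a set of (head, relation, tail) triples; each relation family
# corresponds to one contextual view of the generic KG.
generic_kg = {
    ("drum", "visuallySimilarTo", "barrel"),   # visual view
    ("drum", "subClassOf", "instrument"),      # taxonomical view
    ("violin", "subClassOf", "instrument"),    # taxonomical view
    ("drum", "usedFor", "hit"),                # functional view
    ("hammer", "usedFor", "hit"),              # functional view
}

def extract_view(kg, relations):
    """Extract a contextual view: the sub-graph of all triples whose
    relation belongs to the given relation set."""
    return {(h, r, t) for (h, r, t) in kg if r in relations}

taxonomical_view = extract_view(generic_kg, {"subClassOf"})
print(len(taxonomical_view))  # 2
```

Each extracted sub-graph is what the approach later embeds via a KG embedding method and infuses into the DNN as one contextual view.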
3 Learning Contextual Image Representations
The framework, as shown in Figure 2 consists of two main parts: 1) the contex-
tual view extraction, where task relevant knowledge is extracted from a generic
knowledge graph; and 2) the contextual view infusion, where the contextual view
is infused into the DNN.
3.1 Contextual View Extraction
A knowledge graph can represent prior knowledge encoded with rich semantics
in a graph structure. A GKG encapsulating n contextual views,
GKG ⊇ {GKG_1, GKG_2, ..., GKG_n},
is a collection of heterogeneous knowledge sources, where each contextual view
defines specific relationships between encoded classes. However, for a particular