Context-driven Visual Object Recognition based
on Knowledge Graphs
Sebastian Monka1,2, Lavdim Halilaj1, and Achim Rettinger2
1Bosch Center for Artificial Intelligence, Renningen, Germany
{sebastian.monka,lavdim.halilaj}@de.bosch.com
2Trier University, Trier, Germany
{rettinger}@uni-trier.de
Abstract. Current deep learning methods for object recognition are
purely data-driven and require a large number of training samples to
achieve good results. Due to their sole dependence on image data, these
methods tend to fail when confronted with new environments where even
small deviations occur. Human perception, however, has proven to be
significantly more robust to such distribution shifts. It is assumed that
this ability to deal with unknown scenarios is based on the extensive
incorporation of contextual knowledge. Context can be based either on object
co-occurrences in a scene or on memory of experience. In accordance
with the human visual cortex, which uses context to form different object
representations for a seen image, we propose an approach that enhances
deep learning methods by using external contextual knowledge encoded
in a knowledge graph. To this end, we extract different contextual views
from a generic knowledge graph, transform the views into vector space,
and infuse them into a DNN. We conduct a series of experiments to
investigate the impact of different contextual views on the learned object
representations for the same image dataset. The experimental results
provide evidence that the contextual views influence the image represen-
tations in the DNN differently and therefore lead to different predictions
for the same images. We also show that context helps to strengthen the
robustness of object recognition models for out-of-distribution images,
usually occurring in transfer learning tasks or real-world scenarios.
Keywords: Neuro-Symbolic · Knowledge Graph · Contextual Learning
1 Introduction
How humans perceive the real world is strongly dependent on the context [38,29].
Especially in situations with poor-quality visual input, for instance caused by
large distances or short capture times, context appears to play a major role
in improving the reliability of recognition [43]. Perception is not only influenced
by co-occurring objects or visual features in the same image, but also by
experience and memory [39]. There is evidence that humans perceive similar
images differently depending on the given context [10]. A famous example is
the class of ambiguous figures shown in Figure 1.
arXiv:2210.11233v1 [cs.AI] 20 Oct 2022
(a) Duck or rabbit? [25] (b) Young lady or old woman? [1].
Fig. 1: The mental representation for ambiguous images can change based on the
context, although the perceived image is still the same.
Depending on the context, i.e., whether it is Easter or Christmas [9], Figure 1a can
be seen as either a duck or a rabbit. Likewise, influenced by own-age social biases [36],
Figure 1b can be seen as either a young lady or an old woman. Humans categorize images
based on various types of context. Known categories are based on visual features
or semantic concepts [5], but may also be based on other information such as
attributes describing their function. Accordingly, neuroscience has shown that
the human brain encodes visual input into individual contextual object represen-
tations [15,17,49], namely visual, taxonomical, and functional [32]. Concretely,
in a visual context, images of a drum and a barrel have a high similarity, as they
share similar visual features. In a taxonomical context, a drum would be similar
to a violin, as they both are musical instruments. And in a functional context,
the drum would be similar to a hammer, since the same action of hitting can be
performed with both objects [7].
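The three contextual notions of similarity above can be sketched as context-specific relations over the same set of classes. The class names and relation structure below are illustrative placeholders motivated by the drum/barrel/violin/hammer example, not the actual knowledge graph used in the paper:

```python
# Hypothetical contextual relations over the same classes; each context
# induces a different nearest neighbor for the class "drum".
contextual_neighbors = {
    "visual": {"drum": "barrel"},        # shared visual features (shape)
    "taxonomical": {"drum": "violin"},   # both are musical instruments
    "functional": {"drum": "hammer"},    # both afford the action of hitting
}

def most_similar(cls, context):
    """Return the class most similar to `cls` under the given context."""
    return contextual_neighbors[context][cls]

print(most_similar("drum", "visual"))       # barrel
print(most_similar("drum", "taxonomical"))  # violin
print(most_similar("drum", "functional"))   # hammer
```

The point of the sketch is that similarity is not a property of the image alone: the same class has a different neighborhood depending on which contextual view is applied.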
While there is much evidence that intelligent machines should likewise repre-
sent information in contextualized embeddings, deep neural networks (DNNs)
form their object representations based solely on the feature distribution of the
image dataset [8,56]. Consequently, they fail if objects are placed in an incon-
gruent context that was not present in previously seen images [4].
For the scope of this work we investigate the following research questions:
RQ1 - Can context, provided in the form of a knowledge graph (KG), influence
the learned image representations of a DNN, the final accuracy, and the image predictions?
RQ2 - Can context help to avoid critical errors in domain-changing scenarios
where DNNs fail?
To enable standard DNNs to build contextual object representations, we pro-
vide the context using a knowledge graph (KG) and its corresponding knowledge
graph embedding h_KG. Similar to the process in the human brain, we conduct
experiments with three different types of context, namely visual context, taxo-
nomical context, and functional context. We provide two versions of knowledge
infusion into a DNN and compare the induction of different contextual models in
depth by quantitatively investigating their learned contextual embedding spaces
using class-related cosine similarities. In addition, we evaluate our approach quan-
titatively by comparing the final accuracies on object recognition tasks on source
and target domains, and provide insights and challenges. The remainder of this
paper is organized as follows: Section 6 outlines related work. In Section 3.1 we
introduce the three different types of context and an option to model these views
in a contextual knowledge graph. Section 3 shows two ways of infusing context
into a visual DNN. In Section 4 we conduct experiments on seven image datasets
in two transfer learning scenarios. In Section 5 we answer the research questions
and summarize the main insights of our approach.
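The class-related cosine similarity used for the quantitative analysis of the embedding spaces can be sketched in pure Python. The toy embedding vectors below are invented for illustration; the actual class embeddings are learned by the models:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy class embeddings: under a given contextual view, classes that the
# context relates should end up closer together in the embedding space.
drum   = [1.0, 0.9, 0.1]
barrel = [0.9, 1.0, 0.2]
hammer = [0.1, 0.2, 1.0]

# In a visual context, drum should be closer to barrel than to hammer.
print(cosine_similarity(drum, barrel) > cosine_similarity(drum, hammer))  # True
```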
2 Preliminaries
Contextual Image Representations in the Brain. Cognitive and neuroscience re-
search has recently begun to investigate the relationship between viewed objects
and the corresponding fMRI scan activities of the human brain. It is assumed
that the primate visual system is organized into two separate processing path-
ways in the visual cortex, namely, the dorsal pathway and the ventral pathway.
While the dorsal pathway is responsible for the spatial recognition of objects
as well as actions and manipulations such as grasping, the ventral pathway is
responsible for recognizing the type of object based on its form or motion [52].
Bonner et al. [6] recently showed that the sensory coding of objects in the ven-
tral cortex of the human brain is related to statistical embeddings of object or
word co-occurrences. Moreover, these object representations potentially reflect a
number of different properties, which together are considered to form an object
concept [32]. Such a concept can be learned based on the context in which the object is seen.
For example, an object concept may include the visual features, its taxonomy,
or the function of the object [49,17].
Image Representations in the DNN. Recent work has shown that while the
performance of humans, monkeys, and DNNs is quite similar for object-level
confusions, the image-level performance does not match between different do-
mains [49]. In contrast to visual object representations in the brain, which also
include high level contextual knowledge of concepts and their functions, image
representations of DNNs only depend on the statistical co-occurrence of visual
features and a specific task. We consider the context extracted from the dataset
as dataset bias. Even in balanced datasets, i.e., datasets containing the same
number of images for each class, there still exists imbalance due to overlap of
features between different classes. For instance, it must be taken into account
that a cat and a dog have similar visual features and that in composite datasets
certain classes can have different meta-information for the images, such as illu-
mination, perspective or sensor resolution. This dataset bias leads to predefined
neighborhoods in the visual embedding space, as well as predefined similarities
between distinct classes. In a DNN, an encoder network E(·) maps images x
to a visual embedding h_v = E(x) ∈ R^{d_E}, where the activations of the final
pooling layer, and thus of the representation layer, have dimensionality d_E,
which depends on the encoder network itself.
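The mapping h_v = E(x) ∈ R^{d_E} can be illustrated with a stand-in encoder whose pooling step fixes the embedding dimensionality. This is a minimal stub for the shape contract only, not the trained DNN used in the experiments:

```python
def mock_encoder(image, d_E=4):
    """Stand-in for E(.): maps an image (a 2-D list of pixel values) to a
    d_E-dimensional embedding by average-pooling over d_E column strips.
    A real encoder would be a trained CNN whose final pooling layer
    determines d_E."""
    h, w = len(image), len(image[0])
    strip = max(w // d_E, 1)
    embedding = []
    for i in range(d_E):
        cols = range(i * strip, min((i + 1) * strip, w))
        vals = [image[r][c] for r in range(h) for c in cols]
        embedding.append(sum(vals) / len(vals) if vals else 0.0)
    return embedding

x = [[0.0, 0.5, 1.0, 0.5] for _ in range(4)]  # toy 4x4 "image"
h_v = mock_encoder(x, d_E=4)
print(len(h_v))  # 4 -- the embedding dimensionality d_E
```

Whatever the internals of E(·), the contract is the same: every image, regardless of its resolution, is mapped to a fixed-length vector of dimensionality d_E.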
Contextual Representations in the KG. A knowledge graph is a graph of data
aiming to accumulate and convey real-world knowledge, where entities are repre-
[Figure 2: diagram omitted; it depicts a generic KG and an image dataset, with visual, taxonomical, and functional contextual views extracted via KGE and each infused into a DNN.]
Fig. 2: Our approach to learn contextual image representations consists of two
main parts: 1) the contextual view extraction; and 2) the contextual view infusion.
sented by nodes and relationships between entities are represented by edges [20].
We define a generic knowledge graph (GKG) as a graph of data that relates
different classes of a dataset based on defined contextual properties. These con-
textual properties can be both learned and manually curated. They bring in
prior knowledge about classes, even those that may not necessarily be present
in the image dataset, and thus place them in contextual relationships with each
other. A KG comprises a set of triples G = (H, R, T), where H represents enti-
ties, T ⊆ E ∪ L denotes entities or literal values, and R is a set of relationships
connecting H and T.
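A KG of this form, together with the extraction of a contextual view by filtering on relations, can be sketched as follows. The relation names and triples are illustrative placeholders, not the paper's actual generic KG:

```python
# A KG as a set of (head, relation, tail) triples; each relation family
# corresponds to one contextual view of the generic KG.
generic_kg = {
    ("drum", "visuallySimilarTo", "barrel"),   # visual view
    ("drum", "subClassOf", "instrument"),      # taxonomical view
    ("violin", "subClassOf", "instrument"),    # taxonomical view
    ("drum", "usedFor", "hit"),                # functional view
    ("hammer", "usedFor", "hit"),              # functional view
}

def extract_view(kg, relations):
    """Extract a contextual view: the sub-graph of all triples whose
    relation belongs to the given relation set."""
    return {(h, r, t) for (h, r, t) in kg if r in relations}

taxonomical_view = extract_view(generic_kg, {"subClassOf"})
print(len(taxonomical_view))  # 2
```

Each extracted sub-graph is what the approach later embeds via a KG embedding method and infuses into the DNN as one contextual view.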
3 Learning Contextual Image Representations
The framework, as shown in Figure 2 consists of two main parts: 1) the contex-
tual view extraction, where task relevant knowledge is extracted from a generic
knowledge graph; and 2) the contextual view infusion, where the contextual view
is infused into the DNN.
3.1 Contextual View Extraction
A knowledge graph can represent prior knowledge encoded with rich semantics
in a graph structure. A GKG encapsulating n contextual views,
GKG ⊇ {GKG_1, GKG_2, ..., GKG_n},
is a collection of heterogeneous knowledge sources, where each contextual view
defines specific relationships between encoded classes. However, for a particular