Learning by Asking Questions for Knowledge-based Novel Object Recognition

Kohei Uehara

The University of Tokyo

uehara@mi.t.u-tokyo.ac.jp

Tatsuya Harada

The University of Tokyo

RIKEN

harada@mi.t.u-tokyo.ac.jp

Abstract

In real-world object recognition, there are numerous object classes to be recognized. Conventional image recognition based on supervised learning can only recognize object classes that exist in the training data, and thus has limited applicability in the real world. On the other hand, humans can recognize novel objects by asking questions and acquiring knowledge about them. Inspired by this, we study a framework for acquiring external knowledge through question generation that would help the model instantly recognize novel objects. Our pipeline consists of two components: the Object Classifier, which performs knowledge-based object recognition, and the Question Generator, which generates knowledge-aware questions to acquire novel knowledge. We also propose a question generation strategy based on the confidence of the knowledge-aware prediction of the Object Classifier. To train the Question Generator, we construct a dataset that contains knowledge-aware questions about objects in the images. Our experiments show that the proposed pipeline effectively acquires knowledge about novel objects compared to several baselines.

1 Introduction

Object category recognition has long been a central topic in computer vision research. Traditionally, object recognition has been addressed by supervised learning using a large dataset of image-label pairs (Deng et al., 2009). However, with supervised approaches, the model can only recognize a frozen set of object classes and is not suitable for real-world object recognition, where numerous object classes exist. Recently, image recognition methods based on contrastive learning using image-text pair datasets have emerged (Radford et al., 2021; Jia et al., 2021). By training on hundreds of millions of image-text pairs, these models have acquired remarkable zero-shot recognition capabilities for a wide variety of objects. However, these models can recognize objects that commonly appear in the pre-training dataset but are not as effective for rare objects (Shen et al., 2022). Collecting new data and retraining the entire model to make these models recognize novel objects is impractical considering the cost of data collection and computation. Therefore, it is essential to develop a method that enables the model to recognize novel objects while maintaining low data collection costs and avoiding model retraining as much as possible.

When humans acquire knowledge about the world, asking questions and explicitly acquiring knowledge are important skills (Chouinard et al., 2007; Ronfard et al., 2018). Inspired by this, we explored methods to dynamically increase knowledge in image recognition by asking questions. This approach has several advantages over the traditional supervised learning method: (1) it requires only a small amount of data to acquire knowledge because the system acquires only the knowledge it needs, and (2) it has a low data collection cost because the system itself seeks the required data.

arXiv:2210.05879v1 [cs.CV] 12 Oct 2022

[Figure 1 panels. Confirmation: Knowledge prediction: “This is a mammal.” Question Generation: “What is the mammal in the left side of the image?” Answerer: “It is a chihuahua.” Acquired Knowledge: “Chihuahua is a mammal.” Exploration: Knowledge prediction: None/Unconfident. Question Generation: “What is the object sitting next to the dog made of?” Answerer: “Teddy-bear is made of fur.” Acquired Knowledge: “Teddy-bear is made of fur.” Additional label: Knowledge Source.]

Figure 1: Conceptual illustration of our proposed pipeline. If the model is confident about the predicted knowledge, question generation is performed in confirmation mode. If the model is not confident, question generation is performed in exploration mode.

We propose a pipeline consisting of a knowledge-based object classifier (OC) and a question generator (QG) for knowledge acquisition. Following previous research on structured knowledge (Ji et al., 2022), we represent knowledge as a knowledge triplet, that is, a list of three words or phrases: head, relation, and tail, such as 〈dog, IsA, mammal〉. We train the OC to retrieve knowledge from the knowledge source and to output the corresponding head in the knowledge source as the predicted object class (e.g., 〈IsA, mammal〉 → dog). The QG model then generates questions to add new knowledge to the knowledge source for novel object recognition. In the QG model, we use two modes in question generation: confirmation and exploration, as illustrated in Figure 1. First, “confirmation” is used when the unknown object is relatively close to a known object category. For example, if the model knows about “dog,” then a novel category “chihuahua” is considered to be a close concept to “dog.” In this case, the model can infer reasonable knowledge (e.g., both “chihuahua” and “dog” are a type of mammal) and ask questions to confirm it, such as “What is the mammal on the left side of the image?” In contrast, the “exploration” mode is used when the unknown object is far from the existing object category (e.g., “teddy-bear” may not resemble any known object class). In this case, the model is unable to estimate the proper knowledge and attempts to obtain all the necessary knowledge by asking questions (“What is the object sitting next to the dog made of?”).
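The retrieval step and the two question-generation modes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Triplet` type, the `KnowledgeSource` class, the `choose_mode` function, the `MadeOf` relation name, and the 0.5 confidence threshold are all hypothetical. In the paper the confidence comes from the CLIP-based OC, whereas here it is simply passed in as a number.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Triplet:
    """A knowledge triplet <head, relation, tail>, e.g. <dog, IsA, mammal>."""
    head: str
    relation: str
    tail: str


class KnowledgeSource:
    """Hypothetical in-memory triplet store (an assumption for illustration)."""

    def __init__(self) -> None:
        self._triplets: List[Triplet] = []

    def add(self, triplet: Triplet) -> None:
        # Skip exact duplicates so repeated answers do not bloat the store.
        if triplet not in self._triplets:
            self._triplets.append(triplet)

    def heads_for(self, relation: str, tail: str) -> List[str]:
        """Return every head stored with the given <relation, tail> pair."""
        return [t.head for t in self._triplets
                if t.relation == relation and t.tail == tail]


# Assumed threshold; the source text does not state a value.
CONFIDENCE_THRESHOLD = 0.5


def choose_mode(predicted: Optional[Tuple[str, str]], confidence: float) -> str:
    """Pick the question-generation mode from the OC's knowledge prediction."""
    if predicted is not None and confidence >= CONFIDENCE_THRESHOLD:
        return "confirmation"
    return "exploration"


source = KnowledgeSource()
source.add(Triplet("dog", "IsA", "mammal"))

# Confident prediction <IsA, mammal>: ask which mammal it is to confirm it.
mode_a = choose_mode(("IsA", "mammal"), confidence=0.9)
# The answer "chihuahua" becomes a new head for the confirmed knowledge.
source.add(Triplet("chihuahua", "IsA", "mammal"))

# No confident prediction: ask for the knowledge itself.
mode_b = choose_mode(None, confidence=0.0)
source.add(Triplet("teddy-bear", "MadeOf", "fur"))

print(mode_a, mode_b)
print(source.heads_for("IsA", "mammal"))
```

In this sketch, knowledge gained in either mode ends up in the same store, so a later lookup by 〈relation, tail〉 can return the newly learned head without retraining anything.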

Our contributions and findings can be summarized as follows:

• We propose a novel pipeline to acquire knowledge about novel objects by asking questions. We designed the OC model based on CLIP (Radford et al., 2021) and the QG model as a Transformer (Vaswani et al., 2017) based text generation model.

• We built a novel dataset to train the QG model, namely, Professional K-VQG. This dataset contains a variety of annotations such as object labels, bounding boxes, knowledge, and knowledge-aware questions.

• We compare our proposed pipeline with several baselines and show that the knowledge acquired through question generation is effective for novel object recognition.

2 Related Work

Novel object recognition

Increasing the number of recognizable object classes is a widely studied problem in object recognition. A typical approach in novel object recognition is to train a model that computes the similarity between the visual and semantic features of objects. To compute semantic features of a novel object, external knowledge about the object (e.g., attributes (Lampert et al., 2009; Farhadi et al., 2009; Jayaraman and Grauman, 2014; Akata et al., 2016; Li et al., 2021), class hierarchy (Rohrbach et al., 2011; Wang et al., 2018), or textual description (Ba et al., 2015; Qiao et al., 2016; Reed et al., 2016; Zareian et al., 2021)) is often used. Recently proposed vision-and-language contrastive learning methods, such as CLIP (Radford et al., 2021) or ALIGN (Jia et al., 2021), use extremely large-scale image caption data to learn the relationship between images and their textual descriptions. With the help of the prefix-tuning technique, these models exhibited a strong zero-shot recognition ability. However, the abovementioned studies require either a well-prepared knowledge database on novel objects or a large number of image-text pairs and appropriately designed prompts, both of which are labor-intensive for humans to prepare. In our method, once the question generation model is trained, the model dynamically acquires the necessary knowledge, thereby reducing human effort.
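The similarity-based approach mentioned at the start of this paragraph can be illustrated with a toy example. The sketch below assumes that visual and semantic features are already available as plain vectors. The feature values, the class names, and the `classify` helper are invented for illustration and do not come from any of the cited models.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(visual: List[float],
             semantic: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the class whose semantic feature is most similar to the image."""
    scores = {name: cosine_similarity(visual, vec)
              for name, vec in semantic.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy semantic features (e.g. derived from attributes or text); values are made up.
semantic_features = {
    "dog": [0.9, 0.1, 0.0],
    "teddy-bear": [0.1, 0.9, 0.2],
}

# In this toy setting, a novel class is added by supplying one more
# semantic vector; the similarity function itself is left unchanged.
semantic_features["chihuahua"] = [0.8, 0.0, 0.6]

image_feature = [0.7, 0.05, 0.5]  # made-up visual feature of a query image
label, score = classify(image_feature, semantic_features)
print(label, round(score, 3))
```

The point of the sketch is that recognition quality depends entirely on how good the semantic vector for the novel class is, which is why the source of that external knowledge matters.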

Learning by asking (LBA)

LBA generates questions to collect additional data to train a model. With the development of natural language generation methods, several studies using question generation to acquire the information necessary to solve a task (e.g., reading comprehension (Du et al., 2017; Yuan et al., 2017) or question answering (Scialom and Staiano, 2020)) have been conducted. In addition, in vision-and-language fields, LBA is applied to VQA tasks (Misra et al., 2018) or image captioning tasks (Shen et al., 2019). However, our
