Learning by Asking Questions for Knowledge-based Novel Object Recognition Kohei Uehara

Learning by Asking Questions for Knowledge-based Novel Object
Recognition
Kohei Uehara
The University of Tokyo
uehara@mi.t.u-tokyo.ac.jp
Tatsuya Harada
The University of Tokyo
RIKEN
harada@mi.t.u-tokyo.ac.jp
Abstract
In real-world object recognition, there are
numerous object classes to be recognized.
Conventional image recognition based on su-
pervised learning can only recognize object
classes that exist in the training data, and
thus has limited applicability in the real world.
On the other hand, humans can recognize
novel objects by asking questions and acquir-
ing knowledge about them. Inspired by this,
we study a framework for acquiring exter-
nal knowledge through question generation
that would help the model instantly recog-
nize novel objects. Our pipeline consists of
two components: the Object Classifier, which
performs knowledge-based object recognition,
and the Question Generator, which generates
knowledge-aware questions to acquire novel
knowledge. We also propose a question gen-
eration strategy based on the confidence of
the knowledge-aware prediction of the Ob-
ject Classifier. To train the Question Gen-
erator, we construct a dataset that contains
knowledge-aware questions about objects in
the images. Our experiments show that the pro-
posed pipeline effectively acquires knowledge
about novel objects compared to several base-
lines.
1 Introduction
Object category recognition has long been a central
topic in computer vision research. Traditionally,
object recognition has been addressed by super-
vised learning using a large dataset of image-label
pairs (Deng et al., 2009). However, with supervised
approaches, the model can only recognize a frozen
set of object classes and is not suitable for real-
world object recognition, where numerous object
classes exist. Recently, image recognition methods
based on contrastive learning using image-text pair
datasets have emerged (Radford et al., 2021; Jia
et al., 2021). By training on hundreds of millions
of image-text pairs, these models have acquired
remarkable zero-shot recognition capabilities for
a wide variety of objects. However, these mod-
els can recognize objects that commonly appear
in the pre-training dataset but are not as effective
for rare objects (Shen et al., 2022). Collecting new
data and retraining the entire model to make these
models recognize novel objects is impractical con-
sidering the cost of data collection and computation.
Therefore, it is essential to develop a method that
enables the model to recognize novel objects while
maintaining low data collection costs and avoiding
model retraining as much as possible.
When humans acquire knowledge about the
world, asking questions and explicitly acquiring
knowledge are important skills (Chouinard et al.,
2007; Ronfard et al., 2018). Inspired by this,
we explored methods to
dynamically increase knowledge in image recogni-
tion by asking questions. This approach has several
advantages over the traditional supervised learning
method: (1) it requires only a small amount of data
to acquire knowledge because the system acquires
only the knowledge it needs, and (2) it has a low
data collection cost because the system itself seeks
the required data.
We propose a pipeline consisting of a knowledge-
based object classifier (OC) and a question gener-
ator (QG) for knowledge acquisition. Following
previous research on structured knowledge (Ji et al.,
2022), we represent knowledge as a knowledge
triplet, that is, a list of three words or phrases: head,
relation, and tail, such as ⟨dog, IsA, mammal⟩. We
train the OC to retrieve knowledge from knowledge
sources, which outputs the corresponding head in
the knowledge source as the predicted object class
(e.g., ⟨IsA, mammal⟩ → dog). The QG model then
generates questions to add new knowledge to the
knowledge source for novel object recognition. In
the QG model, we use two modes in question
generation: confirmation and exploration, as
illustrated in Figure 1. First, “confirmation” is used
when the unknown object is relatively close to a
known object category. For example, if the model
knows about “dog,” then a novel category
“chihuahua” is considered to be a close concept to
“dog.” In this case, the model can infer reasonable
knowledge (e.g., both “chihuahua” and “dog”
are types of mammal) and ask questions to confirm
it, such as “What is the mammal on the left side of
the image?” In contrast, the “exploration” mode is
used when the unknown object is far from the
existing object categories (e.g., “teddy-bear” may not
resemble any known object class). In this case, the
model is unable to estimate the proper knowledge
and attempts to obtain all the necessary knowledge
by asking questions (“What is the object sitting
next to the dog made of?”).
arXiv:2210.05879v1 [cs.CV] 12 Oct 2022
[Figure 1 panels. Confirmation: knowledge prediction “This is a mammal.”; generated question “What is the mammal in the left side of the image?”; answer “It is a chihuahua.”; acquired knowledge “Chihuahua is a mammal.” Exploration: knowledge prediction None/Unconfident; generated question “What is the object sitting next to the dog made of?”; answer “Teddy-bear is made of fur.”; acquired knowledge “Teddy-bear is made of fur.” Remaining label: Knowledge Source.]
Figure 1: Conceptual illustration of our proposed pipeline. If the model is confident about the predicted knowledge,
question generation is performed in confirmation mode. If the model is not confident, question generation is
performed in exploration mode.
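The two question-generation modes and the triplet representation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Triplet` and `KnowledgeSource` classes, the `choose_mode` function, and the threshold value of 0.5 are all hypothetical names and settings chosen for illustration, and the confidence score is assumed to come from the Object Classifier.

```python
from typing import List, NamedTuple, Set


class Triplet(NamedTuple):
    """A knowledge triplet: head, relation, tail."""
    head: str
    relation: str
    tail: str


class KnowledgeSource:
    """A toy in-memory knowledge source (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.triplets: Set[Triplet] = set()

    def add(self, triplet: Triplet) -> None:
        """Store newly acquired knowledge."""
        self.triplets.add(triplet)

    def heads_for(self, relation: str, tail: str) -> List[str]:
        """Return, in sorted order, every head stored with this (relation, tail) pair."""
        return sorted(
            t.head for t in self.triplets
            if t.relation == relation and t.tail == tail
        )


def choose_mode(confidence: float, threshold: float = 0.5) -> str:
    """Select the question-generation mode from the classifier's confidence.

    The threshold of 0.5 is an assumption made for this sketch.
    """
    return "confirmation" if confidence >= threshold else "exploration"


# Known knowledge: <dog, IsA, mammal>.
source = KnowledgeSource()
source.add(Triplet("dog", "IsA", "mammal"))

# A confident "IsA mammal" prediction triggers a confirmation question;
# the answer ("chihuahua") supplies the missing head of the new triplet.
if choose_mode(confidence=0.9) == "confirmation":
    source.add(Triplet("chihuahua", "IsA", "mammal"))

print(source.heads_for("IsA", "mammal"))
print(choose_mode(confidence=0.1))
```

In this sketch, a confirmation answer only has to supply the head, because the relation and tail were already predicted. An exploration answer would have to supply the whole triplet.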
Our contributions and findings can be summa-
rized as follows:
  • We propose a novel pipeline to acquire
    knowledge about novel objects by asking
    questions. We designed the OC model based on
    CLIP (Radford et al., 2021) and the QG model
    as a Transformer-based (Vaswani et al., 2017)
    text generation model.
  • We built a novel dataset to train the QG model,
    namely, Professional K-VQG. This dataset
    contains a variety of annotations such as
    object labels, bounding boxes, knowledge, and
    knowledge-aware questions.
  • We compare our proposed pipeline with several
    baselines and show that the knowledge
    acquired through question generation is
    effective for novel object recognition.
2 Related Work
Novel object recognition. Increasing the number
of recognizable object classes is a widely studied
problem in object recognition. A typical approach
in novel object recognition is to train a model that
computes the similarity between the visual and
semantic features of objects. To compute seman-
tic features of a novel object, external knowledge
about the object (e.g., attributes (Lampert et al.,
2009; Farhadi et al., 2009; Jayaraman and Grauman,
2014; Akata et al., 2016; Li et al., 2021),
class hierarchy (Rohrbach et al., 2011; Wang et al.,
2018), or textual description (Ba et al., 2015; Qiao
et al., 2016; Reed et al., 2016; Zareian et al.,
2021)) is often used. Recently proposed
vision-and-language contrastive learning methods, such
as CLIP (Radford et al., 2021) or ALIGN (Jia et al.,
2021), use extremely large-scale image caption
data to learn the relationship between images and
their textual descriptions. With the help of the
prompt-tuning technique, these models exhibited a
strong zero-shot recognition ability. However, the
above-mentioned studies require either a
well-prepared knowledge database on novel objects
or large-scale image-text pair datasets together with
appropriately designed prompts, both of which are
labor-intensive for humans to prepare.
In our method, once the question generation model
is trained, the model dynamically acquires the nec-
essary knowledge, thereby reducing human effort.
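The similarity-based approach mentioned in this paragraph, which matches the visual features of an image against the semantic features of candidate classes, can be sketched as follows. This is a minimal illustration in pure Python, not the method of any cited work: the function names `cosine_similarity` and `classify` are hypothetical, and the two-dimensional feature vectors are made-up toy values standing in for the outputs of real image and text encoders.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(
    image_feature: List[float],
    class_features: Dict[str, List[float]],
) -> Tuple[str, float]:
    """Return the class whose semantic feature is most similar to the image feature."""
    if not class_features:
        raise ValueError("at least one class is required")
    scores = {
        label: cosine_similarity(image_feature, feature)
        for label, feature in class_features.items()
    }
    best = max(scores, key=lambda name: scores[name])
    return best, scores[best]


# Toy features (assumed values, not produced by any real encoder).
image_feature = [1.0, 0.0]
class_features = {
    "dog": [0.9, 0.1],
    "teddy-bear": [0.0, 1.0],
}

label, score = classify(image_feature, class_features)
print(label, round(score, 3))
```

Adding a novel class in this setting only requires a new entry in `class_features`, which is why the quality of the external knowledge used to build that entry matters.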
Learning by asking (LBA). LBA generates questions
to collect additional data to train a model.
With the development of natural language generation
methods, several studies have used question
generation to acquire the information necessary to
solve a task (e.g., reading comprehension (Du et al.,
2017; Yuan et al., 2017) or question answering
(Scialom and Staiano, 2020)). In addition, in
vision-and-language fields, LBA has been applied
to VQA tasks (Misra et al., 2018) and image
captioning tasks (Shen et al., 2019). However, our