Collaborative Image Understanding

Koby Bibas
kobybibas@gmail.com
Meta
Tel Aviv, Israel
Oren Sar Shalom
oren.sarshalom@gmail.com
Amazon
Tel Aviv, Israel
Dietmar Jannach
dietmar.jannach@aau.at
University of Klagenfurt
Klagenfurt, Austria
ABSTRACT
Automatically understanding the contents of an image is a highly relevant problem in practice. In e-commerce and social media settings, for example, a common problem is to automatically categorize user-provided pictures. Nowadays, a standard approach is to fine-tune pre-trained image models with application-specific data. Besides images, organizations however often also collect collaborative signals in the context of their application, in particular how users interacted with the provided online content, e.g., in forms of viewing, rating, or tagging. Such signals are commonly used for item recommendation, typically by deriving latent user and item representations from the data. In this work, we show that such collaborative information can be leveraged to improve the classification process of new images. Specifically, we propose a multitask learning framework, where the auxiliary task is to reconstruct collaborative latent item representations. A series of experiments on datasets from e-commerce and social media demonstrates that considering collaborative signals helps to significantly improve the performance of the main task of image classification by up to 9.1%.
CCS CONCEPTS
• Information systems → Information extraction; • Computing methodologies → Multi-task learning; Computer vision.
KEYWORDS
Information Extraction, Image Categorization, Collaborative Filtering, Multitask Learning
ACM Reference Format:
Koby Bibas, Oren Sar Shalom, and Dietmar Jannach. 2022. Collaborative Image Understanding. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM '22), October 17–21, 2022, Atlanta, GA, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3511808.3557260
1 INTRODUCTION
Image understanding can be described as the process of automatically analyzing unstructured image data in order to extract knowledge for specific tasks. Automatically assigning images to predefined categories (i.e., image categorization) is one important form
of image understanding and an area in which we observed substantial progress in recent years, largely due to innovations in deep learning [6, 14, 31]. Nowadays, one standard way of developing an image categorization solution for a specific task or application is to rely on pre-trained image models and to fine-tune them with application-specific data for the particular problem at hand.
In a specic application setting, additionally to the raw image
information (pixels), other types of information may be available
as well. In particular, providers of e-commerce or social media sites
nowadays often have access to collaborative information that is
collected in the context of an application. Such sites usually often
record how users interact with the provided online content, e.g., in
the form of viewing, rating, commenting, tagging, or resharing.
Such collaborative signals can be used for dierent purposes and
in particular for personalization. Most importantly, these signals
have been successfully used for many years by online retailers and
service providers to build collaborative ltering (CF) recommender
systems [
47
]. Today’s most eective CF systems are based on de-
riving latent representations (embeddings) of users and items from
a given user-item interaction matrix through matrix factorization
[26] or deep learning techniques [22, 34].
The latent representations that are derived from collaborative information cannot be directly interpreted. However, it seems intuitive to assume that the item representations may encode some information about the relevant categories in a domain.
In this work, we explore an approach which we call collaborative image understanding. Specifically, the idea is to leverage the collaborative information as additional supervision for the training phase of image understanding models. By incorporating this collaborative information only during training, the model can also support new images for which collaborative information is not available. Such problems for example arise when online platforms have to automatically process images as soon as they are uploaded, before they gain any collaborative information. Any application that combines images and interaction data fits this setting; examples include e-commerce, social media, and online content platforms.
The main contributions of our work are as follows.
(1) We introduce a new signal (i.e., collaborative information) to support the training procedure of automated image understanding processes.
(2) We propose a general multitask-learning (MTL) [9] framework to incorporate collaborative information in the form of latent item representations into existing single-task content understanding models.
(3) We explore several alternative approaches to combining collaborative information in the training procedure of image classification models.
(4) We show that our approach is also particularly effective in typical real-world situations when labels are missing for a fraction of the training images.
Our MTL approach, where the auxiliary task is to reconstruct collaborative latent item representations, is highly promising. An in-depth empirical evaluation on the Pinterest [65], MovieLens [19], Amazon Clothing and Amazon Toys [38] datasets revealed that the classification performance can be improved by 5.2%, 2.6%, 9.1%, and 3.0%, respectively.
The rest of this paper is organized as follows. In Section 2 we provide the necessary background and review earlier work. Section 3 details the proposed approach; thorough experiments are described in Section 4. Section 5 concludes the paper.
2 BACKGROUND AND RELATED WORK
In this section, we rst briey provide some relevant background
in three research areas: image classication, collaborative ltering
and multi-task Learning (Section 2.1). Then, we discuss earlier
works, which use side information or collaborative information in
combination with image data (Section 2.2).
2.1 Background
2.1.1 Image Classification with Neural Networks. Classifying images to predefined categories based on their content is an important task in the areas of computer vision and information extraction. The training input for this learning problem is a set {(𝑥𝑖, 𝑦𝑖)}, where 𝑥𝑖 is an image of an entity 𝑖 and 𝑦𝑖 is the corresponding category. The ImageNet database [14], for example, contains millions of hand-labeled images, which are assigned to thousands of categories. Usually, models trained on one domain do not perform well on other domains, and labels from the target domain are required [66]. As manual annotation is expensive and time-consuming, label paucity is a ubiquitous problem.
In the last ten years, substantial progress was made in the field due to advances in deep learning technology, and in particular through the use of convolutional neural networks (CNN) [31, 57, 58]. New architectures were continuously proposed over the years, including the ResNet [20], MobileNet [25], and the recent RegNet [42] approaches that we use in the experiments in this paper.

Given a particular application problem, a common approach to mitigate the paucity of labeled data is to rely on existing models that are pre-trained on a large corpus like ImageNet, and to then fine-tune them with the data of the particular application. Such an approach is also followed in our work. However, in addition to the application-specific images and labels, we also consider collaborative information to further improve classification performance.
2.1.2 Collaborative Filtering and Latent Factor Models. Collaborative filtering recommender systems are commonly based on a matrix that contains recorded interactions between users and items [47], with the typical goal of predicting the missing entries of the matrix and/or ranking the items for each user in terms of their estimated relevance. Matrix factorization (MF) techniques prove to be highly effective for this task and are based on representing users and items in a lower-dimensional latent space. While early methods were based on Singular Value Decomposition [7], later approaches used various types of machine learning models to compute the latent factors, e.g., [26]. Note that while various deep learning approaches were proposed for the task in recent years, latent factor models remain highly competitive and relevant today [45].
Technically, MF methods project users and items into a shared latent space, where each user and each item is represented through a real-valued vector. A user vector indicates to what extent the corresponding user prefers each of the latent traits; likewise, an item vector represents to what extent the corresponding item possesses them [30]. The traits are called latent because their meaning is unknown. Yet, we know that latent factor models are highly effective for recommendations and, furthermore, that user vectors hold information on the demographics of the users [46]. Therefore, we hypothesize that item vectors encode certain types of relevant information about the items, such as category or brand, which we try to leverage in our work for improved image understanding.
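As a minimal sketch of such a latent factor model (assuming a plain dot-product scoring function and an embedding size of 64; the actual CF technique is left open by our framework):

```python
import torch
import torch.nn as nn

class MatrixFactorization(nn.Module):
    """Projects users and items into a shared f-dimensional latent space."""

    def __init__(self, num_users: int, num_items: int, f: int = 64):
        super().__init__()
        self.user_vecs = nn.Embedding(num_users, f)  # one latent vector per user
        self.item_vecs = nn.Embedding(num_items, f)  # one latent vector per item

    def forward(self, user_ids, item_ids):
        # Affinity of user u for item i is the dot product of their vectors.
        return (self.user_vecs(user_ids) * self.item_vecs(item_ids)).sum(-1)

# After fitting this model on the interaction matrix, item_vecs.weight
# holds the item CF vectors q_i that we later reconstruct (Section 3).
```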
2.1.3 Multi-Task Learning. MTL is a machine learning approach which optimizes multiple learning tasks in parallel, based on a shared representation for all tasks. The promise of MTL is that if the tasks share significant commonalities, then this approach may lead to improved performance on each individual task, compared to training task-specific models separately [9]. The rationale behind MTL and its success lies in the potential of improved generalization by using the information contained in the signals of related tasks. Thus, each individual task can “help” other tasks, often also preventing overfitting on one particular task. Concretely, this is achieved by having some shared parameters for all tasks; that way, during training each task directly affects the other ones.
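In its simplest form (hard parameter sharing), this amounts to a shared trunk with one output head per task, trained on a weighted sum of the task losses. The sketch below is a generic illustration of that structure, not our specific model:

```python
import torch.nn as nn

class HardSharingModel(nn.Module):
    """Generic MTL model: shared trunk, one output head per task."""

    def __init__(self, trunk: nn.Module, feat_dim: int, out_dims: list[int]):
        super().__init__()
        self.trunk = trunk  # shared parameters, updated by every task
        self.heads = nn.ModuleList(nn.Linear(feat_dim, d) for d in out_dims)

    def forward(self, x):
        shared = self.trunk(x)
        return [head(shared) for head in self.heads]

# Training minimizes a weighted sum of per-task losses, so gradients
# from every task flow into the shared trunk:
#   total_loss = sum(w_t * loss_t for each task t)
```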
Various successful applications of MTL are reported in the literature, both for computer vision problems and other areas; see Zhang and Yang [64] for a survey. However, using MTL to combine collaborative signals and image information for higher categorization accuracy has not been explored in the literature before. As our experimental evaluations will show, these two types of information indeed have commonalities that contribute to the learning process in an MTL setting. In addition, the proposed MTL approach is favorable over other combination approaches, such as training the models independently.
2.2 Related Work
There is a plethora of classification models that combine raw image information together with additional metadata, which generally fall under the umbrella of multi-modal classification. Notable examples are image and text classification; see Baltrušaitis et al. [1] and Ramachandram and Taylor [43] for related surveys on multi-modal learning. Our work is fundamentally different from multi-modal learning since in those works the other modalities are part of the features of the model. That is, they serve as input to the model both at training and inference time. In our work, we assume that the other modality, namely collaborative information, is not available at inference time, yet we aim to leverage its existence at training time by using it to guide the learning of the model. Thus, the model may benefit from this signal without relying on it for inference.
Several works have tried to harness the power of CF for various image understanding problems. The work of Dabov et al. [13], for example, applies a CF technique to improve image denoising. Choi et al. [12], on the other hand, showed how to leverage collaborative information for better pose estimation results. However, the neighborhood of an image in these works is determined not by the interactions with it, but according to visual similarity.
There also exist several works that aim to improve the image content understanding process using collaborative information, as we do in our work [35, 37, 56, 61]. However, in these works the user interaction signal is used as an input to their image understanding models. Therefore, the collaborative information cannot be used to model new images, which is the focus of our work. Our proposed method, in contrast, does not treat user interactions as a feature of the model, but only as a means of guidance.
Another type of related work can be found in Li and Tang [32] and Li et al. [33], which both aim to improve the training of content understanding models using user data. However, they treat the user interaction data as a type of content. That is, they omit the identities of the interacting users and do not apply collaborative filtering as we do. As a result of such an approach, if the user interaction is, for example, image tagging, then their work is limited to enhancing image tagging processes. An earlier work described in Sang et al. [51] retains the users' identities, and its objective is to improve image tagging using interaction data. This is done by applying tensor factorization to model users, images, and tags in the same joint latent space. However, this approach also binds the type of the interactions and the prediction task. In our approach, the prediction task is not tied to a specific type of user interaction.
Common to all discussed approaches based on user data is their need to process the raw image data for each user interaction. As the number of pixels 𝑃 in a typical image reaches a million (10^6) and the number 𝑁 of interactions in real-life datasets can be hundreds of millions (10^8), their running time is in 𝑂(𝑁 · 𝑃), i.e., on the order of 10^14 operations, which is computationally expensive or even infeasible for large real-life datasets. Differently from that, in our approach we first apply a CF method on the interaction data to create a compact representation of dimensionality 𝑓 for each item, and we then train a content understanding model using those fixed-size vectors. Therefore, the running time is 𝑂(𝑓 · (𝑃 + 𝑁)) ≪ 𝑂(𝑁 · 𝑃). This is achieved because, instead of taking a naive end-to-end solution, we adopt a two-phase solution, which first transforms the interaction data into a label set.
Finally, it is worth mentioning that the other direction, i.e., content understanding to enhance recommender systems, has shown striking success [2, 15, 21, 50, 55]. The relationship of these works to our current approach is however limited, as our goal is to improve the image understanding process and not item recommendations.
3 PROPOSED TECHNICAL APPROACH
3.1 Overview
Borrowing terminology from the recommender systems community, we use the term item to refer to an entity that users can interact with and which may furthermore have an image as side information. In this paper, as the main focus is image understanding, we use the terms image and item mostly interchangeably. In some cases the image can be the item itself, as in image sharing services (e.g., Pinterest), where users interact with images.

Our proposed approach consists of two main phases, as illustrated in Figure 1. The input to the learning task is a collection of images, where (i) each image has category labels assigned and (ii) for each image we are given a set of users who have interacted with it, e.g., through rating, tagging, or sharing. In the first phase of our approach, a user-item interaction matrix is derived from the input data and a CF technique is applied to create fixed-size latent representations (vectors) of users and items. The purpose of this phase is solely to generate augmented labels for the classification problem. In the second phase, multitask learning is applied, where the main task is to predict the categories of the images, and the auxiliary task is to reconstruct the latent item representations created in the first phase¹. As mentioned, we assume that the latent item vectors (dubbed CF vectors from here on) encode category information that can help the training of the main task. Technically, instead of learning the image categorization from scratch, we rely on a pre-trained image model, which we then fine-tune with the given image data of the problem at hand.

Symbol   Explanation
𝑥𝑖       Content of item 𝑖
𝑦𝑖       Category label of item 𝑖 (a binary vector)
ŷ𝑖       A predicted label of item 𝑖 (a real-valued vector)
𝑈𝑖       Set of users who interacted with item 𝑖
𝑞𝑖       CF vector of item 𝑖
q̂𝑖       A predicted CF vector of item 𝑖
𝜔𝑖       Weight of CF vector of item 𝑖

Table 1: Main symbols used in the paper.
Once the model is trained, we use it to predict categories of
new images, i.e., for images for which we do not have collaborative
information. Typical important use cases could be pictures that
users upload on a social media platform or pictures of newly added
shop items that should be automatically categorized.
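A minimal sketch of this instantiation follows (PyTorch, with an assumed ResNet backbone, BCE/MSE losses, and auxiliary-loss weight; the framework deliberately leaves these choices open):

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CATEGORIES = 32  # assumption: number of category labels
F_DIM = 64           # assumption: dimensionality f of the CF vectors q_i

# Shared backbone: pre-trained image model with its classifier removed.
backbone = models.resnet50(pretrained=True)
feat_dim = backbone.fc.in_features
backbone.fc = nn.Identity()  # expose image features instead of ImageNet logits

head_cls = nn.Linear(feat_dim, NUM_CATEGORIES)  # main task: predict y_i
head_cf = nn.Linear(feat_dim, F_DIM)            # auxiliary task: reconstruct q_i

bce, mse = nn.BCEWithLogitsLoss(), nn.MSELoss()
lam = 0.5  # assumption: weight of the auxiliary task in the combined loss

def training_loss(images, labels, cf_vectors):
    """Phase 2: multitask loss combining both tasks on shared features."""
    feats = backbone(images)
    return bce(head_cls(feats), labels) + lam * mse(head_cf(feats), cf_vectors)

@torch.no_grad()
def predict_categories(new_images):
    """Inference: no collaborative information is needed; the CF head is unused."""
    return torch.sigmoid(head_cls(backbone(new_images)))
```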
It is important to note that, at this level, we consider our technical contribution as a framework, because we propose a general form of incorporating collaborative information into the image understanding process. In a given application scenario, the framework can then be instantiated in appropriate ways, using a suitable combination of matrix factorization technique, training set for pre-training, and deep learning architecture for image categorization.
3.2 Technical Approach
In the following, we describe our approach more formally. In this formalization, we use 𝑖 ∈ {1, . . . , 𝐼} to index items. Table 1 summarizes the notation used in this paper.
At training time, we assume each item 𝑖 is associated with an image 𝑥𝑖 and with some label 𝑦𝑖, a category in our use case. Furthermore, the item may be associated with a set 𝑈𝑖 of users who interacted with it. At inference time, given an image to be modeled, we assume it is not associated with historical interaction data.
In many practical scenarios, the content understanding model must be invoked upon introducing the new item to the system, and therefore it has not gathered usage data (also known as a cold item, a ubiquitous problem in recommender systems [52]). We emphasize that integrating collaborative filtering information at inference time
¹ The latent user vectors are not used in this process.