
Collaborative Image Understanding CIKM ’22, October 17–21, 2022, Atlanta, GA, USA
et al. [12], on the other hand, showed how to leverage collaborative information for better pose estimation results. However, the neighborhood of an image in these works is determined not by the interactions with it, but by visual similarity.
Several existing works also aim to improve the image content understanding process using collaborative information, as we do in our work [35, 37, 56, 61]. However, in these works the user interaction signal is used as an input to the image understanding models. Therefore, the collaborative information cannot be used to model new images, which is the focus of our work. In contrast, our proposed method does not treat user interactions as a feature for the model, but only as a means of guidance.
Another type of related work can be found in Li and Tang [32] and Li et al. [33], which both aim to improve the training of content understanding models using user data. However, they treat the user interaction data as a type of content. That is, they omit the identities of the interacting users and do not apply collaborative filtering as we do. As a result of such an approach, if the user interaction is, for example, image tagging, then their work is limited to enhancing image tagging processes. An earlier work described in Sang et al. [51] retains the users' identities, and its objective is to improve image tagging using interaction data. This is done by applying tensor factorization to model users, images, and tags in the same joint latent space. However, this approach also binds the type of the interactions to the prediction task. In our approach, the prediction task is not tied to a specific type of user interaction.
Common to all discussed approaches based on user data is their need to process the raw image data for each user interaction. As the number of pixels 𝑃 in a typical image reaches a million (10⁶) and the number 𝑁 of interactions in real-life datasets can be hundreds of millions (10⁸), their running time is in 𝑂(𝑁·𝑃) = 10¹⁴, which is computationally expensive or even infeasible for large real-life datasets. Differently from that, in our approach we first apply a CF method on the interaction data to create a compact representation of dimensionality 𝑓 for each item, and we then train a content understanding model using those fixed-size vectors. Therefore, the running time is 𝑂(𝑓·(𝑃+𝑁)) ≪ 𝑂(𝑁·𝑃). This is achieved because, instead of taking a naive end-to-end solution, we adopt a two-phase solution, which first transforms the interaction data into a label set.
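To make the magnitudes concrete, the following back-of-the-envelope sketch plugs in the representative values from the text (𝑃 = 10⁶ pixels, 𝑁 = 10⁸ interactions) together with an assumed CF dimensionality of 𝑓 = 128, which is a typical choice but is not specified here:

```python
# Back-of-the-envelope comparison of the two running times.
P = 10**6   # pixels per image (~1 megapixel, as in the text)
N = 10**8   # user-item interactions (hundreds of millions)
f = 128     # CF vector dimensionality (assumed; a typical choice)

naive_cost = N * P             # O(N*P): raw pixels touched per interaction
two_phase_cost = f * (P + N)   # O(f*(P+N)): CF first, then fixed-size vectors

print(f"naive:     {naive_cost:.1e}")      # 1.0e+14
print(f"two-phase: {two_phase_cost:.1e}")  # 1.3e+10
print(f"speedup:   {naive_cost / two_phase_cost:.0f}x")
```

Under these (assumed) values the two-phase cost is roughly four orders of magnitude smaller; the exact ratio of course depends on 𝑓.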
Finally, it is worth mentioning that the other direction, i.e., content understanding to enhance recommender systems, has shown striking success [2, 15, 21, 50, 55]. The relationship of these works to our current approach is, however, limited, as our goal is to improve the image understanding process and not item recommendations.
3 PROPOSED TECHNICAL APPROACH
3.1 Overview
Borrowing terminology from the recommender systems community, the term item refers to an entity that users can interact with, and which may furthermore have an image as side-information. In this paper, as the main focus is image understanding, we use the terms image and item mostly interchangeably. In some cases the image can be the item itself, as in image sharing services (e.g., Pinterest), where users interact with images.
Our proposed approach consists of two main phases, as illus-
trated in Figure 1. The input to the learning task is a collection of
Symbol | Explanation
𝑥𝑖 | Content of item 𝑖
𝑦𝑖 | Category label of item 𝑖 (a binary vector)
ŷ𝑖 | A predicted label of item 𝑖 (a real-valued vector)
𝑈𝑖 | Set of users who interacted with item 𝑖
𝑞𝑖 | CF vector of item 𝑖
q̂𝑖 | A predicted CF vector of item 𝑖
𝜔𝑖 | Weight of CF vector of item 𝑖
Table 1: Main symbols used in the paper.
images, where (i) each image has category labels assigned and (ii) for each image we are given a set of users who have interacted with it, e.g., through rating, tagging, or sharing. In the first phase of our approach, a user-item interaction matrix is derived from the input data and a CF technique is applied to create fixed-size latent representations (vectors) of users and items. The purpose of this phase is solely to generate augmented labels for the classification problem. In the second phase, multitask learning is applied, where the main task is to predict the categories of the images, and the auxiliary task is to reconstruct the latent item representations created in the first phase¹. As mentioned, we assume that the latent item vectors (dubbed CF vectors from here on) encode category information that can help the training of the main task. Technically, instead of learning the image categorization from scratch, we rely on a pre-trained image model, which we then fine-tune with the given image data of the problem at hand.
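A minimal sketch of the second phase is given below. It stands in for the multitask setup with plain NumPy: a shared backbone (a placeholder for the pre-trained image model), a main head for category logits, and an auxiliary head that reconstructs the CF vectors; the layer sizes, loss weight, and toy linear backbone are our own illustrative choices, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_model(in_dim=3072, feat_dim=64, n_categories=10, cf_dim=16):
    """Shared backbone (stand-in for a pre-trained image model) plus two heads."""
    return {
        "W_feat": rng.normal(scale=0.01, size=(in_dim, feat_dim)),
        "W_cat": rng.normal(scale=0.01, size=(feat_dim, n_categories)),  # main head
        "W_cf": rng.normal(scale=0.01, size=(feat_dim, cf_dim)),         # auxiliary head
    }

def forward(model, x):
    h = np.maximum(x @ model["W_feat"], 0.0)       # shared ReLU representation
    return h @ model["W_cat"], h @ model["W_cf"]   # (category logits, predicted CF vectors)

def multitask_loss(y_logits, y, q_hat, q, omega=0.5):
    p = 1.0 / (1.0 + np.exp(-y_logits))            # sigmoid per category
    main = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    aux = omega * np.mean((q_hat - q) ** 2)        # CF-vector reconstruction
    return main + aux

model = make_model()
x = rng.normal(size=(4, 3072))         # a batch of 4 flattened images
y = rng.integers(0, 2, size=(4, 10))   # binary category labels
q = rng.normal(size=(4, 16))           # CF vectors produced by the first phase
y_logits, q_hat = forward(model, x)
loss = multitask_loss(y_logits, y.astype(float), q_hat, q)
```

At inference time only the category head is used, which is why new (cold) items need no interaction data.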
Once the model is trained, we use it to predict categories of new images, i.e., images for which we do not have collaborative information. Typical important use cases could be pictures that users upload on a social media platform or pictures of newly added shop items that should be automatically categorized.
It is important to note that at this level, we consider our technical
contribution as a framework because we propose a general form of
incorporating collaborative information into the image understand-
ing process. In a given application scenario, the framework can then
be instantiated in appropriate ways, using a suitable combination
of matrix factorization technique, training set for pre-training, and
deep learning architecture for image categorization.
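As one possible instantiation of the first phase, the sketch below derives item CF vectors from the sets 𝑈𝑖 by a plain truncated SVD of the binary user-item interaction matrix. The paper leaves the concrete matrix factorization technique open; SVD is merely an illustrative choice here, and the toy data is our own:

```python
import numpy as np

def item_cf_vectors(user_sets, n_users, f=2):
    """Build the binary user-item matrix from the sets U_i and return
    f-dimensional item CF vectors q_i via truncated SVD."""
    n_items = len(user_sets)
    R = np.zeros((n_users, n_items))
    for i, users in enumerate(user_sets):
        for u in users:
            R[u, i] = 1.0  # user u interacted with item i
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # Item vectors: rows of V truncated to f factors, scaled by sqrt(singular values).
    return Vt[:f].T * np.sqrt(s[:f])

# Toy example: 5 users, 3 items.
U_sets = [{0, 1, 2}, {1, 2}, {3, 4}]
Q = item_cf_vectors(U_sets, n_users=5, f=2)
print(Q.shape)  # (3, 2)
```

Each row of `Q` is one fixed-size vector 𝑞𝑖, which the second phase then uses as an auxiliary regression target.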
3.2 Technical Approach
In the following, we describe our approach more formally. In this formalization, we use 𝑖 ∈ {1, . . . , 𝐼} to index items. Table 1 summarizes the notations used in this paper.
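As a reading aid, the per-item data of Table 1 can be mirrored in a small data structure; this is only a sketch, and the field names are our own, not the paper's:

```python
from dataclasses import dataclass, field
from typing import Optional, Set
import numpy as np

@dataclass
class Item:
    """Per-item training data, mirroring Table 1 (field names are illustrative)."""
    x: np.ndarray                                  # x_i: image content
    y: np.ndarray                                  # y_i: category labels (binary vector)
    users: Set[int] = field(default_factory=set)   # U_i: users who interacted with the item
    q: Optional[np.ndarray] = None                 # q_i: CF vector (absent for cold items)
    omega: float = 1.0                             # omega_i: weight of the CF vector

# A cold item at inference time carries only its image (q stays None).
item = Item(x=np.zeros((224, 224, 3)), y=np.array([0, 1, 0]), users={7, 42})
```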
At training time, we assume each item 𝑖 is associated with an image 𝑥𝑖 and with some label 𝑦𝑖, a category in our use case. Furthermore, the item may be associated with a set 𝑈𝑖 of users who interacted with it. At inference time, given an image to be modeled, we assume it is not associated with historical interaction data.
In many practical scenarios, the content understanding model must be invoked upon introducing a new item to the system, before it has gathered any usage data (also known as a cold item, a ubiquitous problem in recommender systems [52]). We emphasize that integrating collaborative filtering information at inference time
¹ The latent user vectors are not used in this process.