
Collaborative Image Understanding CIKM ’22, October 17–21, 2022, Atlanta, GA, USA
et al. [12], on the other hand, showed how to leverage collaborative information for better pose estimation results. However, the neighborhood of an image in these works is determined not by the interactions with it, but by visual similarity.
Several existing works also aim to improve the image content understanding process using collaborative information, as we do in our work [35, 37, 56, 61]. However, in these works the user interaction signal is used as an input to the image understanding models. Therefore, the collaborative information cannot be used to model new images, which is the focus of our work. In contrast, our proposed method does not treat user interactions as a feature for the model, but only as a means of guidance.
Another type of related work can be found in Li and Tang [32] and Li et al. [33], which both aim to improve the training of content understanding models using user data. However, they treat the user interaction data as a type of content. That is, they omit the identities of the interacting users and do not apply collaborative filtering as we do. As a result of such an approach, if the user interaction is, for example, image tagging, then their work is limited to enhancing image tagging processes. An earlier work described in Sang et al. [51] retains the users' identities, and its objective is to improve image tagging using interaction data. This is done by applying tensor factorization to model users, images, and tags in the same joint latent space. However, this approach also binds the type of the interactions to the prediction task. In our approach, the prediction task is not tied to a specific type of user interaction.
Common to all discussed approaches based on user data is their need to process the raw image data for each user interaction. As the number of pixels 𝑃 in a typical image reaches a million (10⁶) and the number 𝑁 of interactions in real-life datasets can be hundreds of millions (10⁸), their running time is in 𝑂(𝑁·𝑃) = 10¹⁴, which is computationally expensive or even infeasible for large real-life datasets. Differently from that, in our approach we first apply a CF method on the interaction data to create a compact representation of dimensionality 𝑓 for each item, and we then train a content understanding model using those fixed-size vectors. Therefore, the running time is 𝑂(𝑓·(𝑃+𝑁)) ≪ 𝑂(𝑁·𝑃). This is achieved because, instead of taking a naive end-to-end solution, we adopt a two-phase solution, which first transforms the interaction data into a label set.
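To make the magnitudes concrete, the following back-of-the-envelope sketch plugs in the representative values from the text (𝑃 = 10⁶ pixels, 𝑁 = 10⁸ interactions) together with an assumed CF dimensionality of 𝑓 = 128, which is a typical choice but is not specified here:

```python
# Back-of-the-envelope comparison of the two running times.
P = 10**6   # pixels per image (~1 megapixel, as in the text)
N = 10**8   # user-item interactions (hundreds of millions)
f = 128     # CF vector dimensionality (assumed; a typical choice)

naive_cost = N * P             # O(N*P): raw pixels touched per interaction
two_phase_cost = f * (P + N)   # O(f*(P+N)): CF first, then fixed-size vectors

print(f"naive:     {naive_cost:.1e}")      # 1.0e+14
print(f"two-phase: {two_phase_cost:.1e}")  # 1.3e+10
print(f"speedup:   {naive_cost / two_phase_cost:.0f}x")
```

Under these (assumed) values the two-phase cost is roughly four orders of magnitude smaller; the exact ratio of course depends on 𝑓.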
Finally, it is worth mentioning that the other direction, i.e., content understanding to enhance recommender systems, has shown striking success [2, 15, 21, 50, 55]. The relationship of these works to our current approach is, however, limited, as our goal is to improve the image understanding process and not item recommendations.
3 PROPOSED TECHNICAL APPROACH
3.1 Overview
Borrowing terminology from the recommender systems community, the term item refers to an entity that users can interact with, and which may furthermore have an image as side-information. In this paper, as the main focus is image understanding, we use the terms image and item mostly interchangeably. In some cases the image can be the item itself, as in image sharing services (e.g., Pinterest), where users interact with images.
Our proposed approach consists of two main phases, as illus-
trated in Figure 1. The input to the learning task is a collection of
Symbol | Explanation
𝑥𝑖 | Content of item 𝑖
𝑦𝑖 | Category label of item 𝑖 (a binary vector)
ŷ𝑖 | A predicted label of item 𝑖 (a real-valued vector)
𝑈𝑖 | Set of users who interacted with item 𝑖
𝑞𝑖 | CF vector of item 𝑖
q̂𝑖 | A predicted CF vector of item 𝑖
𝜔𝑖 | Weight of CF vector of item 𝑖
Table 1: Main symbols used in the paper.
images, where (i) each image has category labels assigned and (ii) for each image we are given a set of users who have interacted with it, e.g., through rating, tagging, or sharing. In the first phase of our approach, a user-item interaction matrix is derived from the input data and a CF technique is applied to create fixed-size latent representations (vectors) of users and items. The purpose of this phase is solely to generate augmented labels for the classification problem. In the second phase, multitask learning is applied, where the main task is to predict the categories of the images, and the auxiliary task is to reconstruct the latent item representations created in the first phase¹. As mentioned, we assume that the latent item vectors (dubbed CF vectors from here on) encode category information that can help the training of the main task. Technically, instead of learning the image categorization from scratch, we rely on a pre-trained image model, which we then fine-tune with the given image data of the problem at hand.
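A minimal sketch of the second phase is given below. It stands in for the multitask setup with plain NumPy: a shared backbone (a placeholder for the pre-trained image model), a main head for category logits, and an auxiliary head that reconstructs the CF vectors; the layer sizes, loss weight, and toy linear backbone are our own illustrative choices, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_model(in_dim=3072, feat_dim=64, n_categories=10, cf_dim=16):
    """Shared backbone (stand-in for a pre-trained image model) plus two heads."""
    return {
        "W_feat": rng.normal(scale=0.01, size=(in_dim, feat_dim)),
        "W_cat": rng.normal(scale=0.01, size=(feat_dim, n_categories)),  # main head
        "W_cf": rng.normal(scale=0.01, size=(feat_dim, cf_dim)),         # auxiliary head
    }

def forward(model, x):
    h = np.maximum(x @ model["W_feat"], 0.0)       # shared ReLU representation
    return h @ model["W_cat"], h @ model["W_cf"]   # (category logits, predicted CF vectors)

def multitask_loss(y_logits, y, q_hat, q, omega=0.5):
    p = 1.0 / (1.0 + np.exp(-y_logits))            # sigmoid per category
    main = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    aux = omega * np.mean((q_hat - q) ** 2)        # CF-vector reconstruction
    return main + aux

model = make_model()
x = rng.normal(size=(4, 3072))         # a batch of 4 flattened images
y = rng.integers(0, 2, size=(4, 10))   # binary category labels
q = rng.normal(size=(4, 16))           # CF vectors produced by the first phase
y_logits, q_hat = forward(model, x)
loss = multitask_loss(y_logits, y.astype(float), q_hat, q)
```

At inference time only the category head is used, which is why new (cold) items need no interaction data.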
Once the model is trained, we use it to predict categories of new images, i.e., images for which we do not have collaborative information. Typical important use cases could be pictures that users upload on a social media platform or pictures of newly added shop items that should be automatically categorized.
It is important to note that at this level, we consider our technical
contribution as a framework because we propose a general form of
incorporating collaborative information into the image understand-
ing process. In a given application scenario, the framework can then
be instantiated in appropriate ways, using a suitable combination
of matrix factorization technique, training set for pre-training, and
deep learning architecture for image categorization.
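As one possible instantiation of the first phase, the sketch below derives item CF vectors from the sets 𝑈𝑖 by a plain truncated SVD of the binary user-item interaction matrix. The paper leaves the concrete matrix factorization technique open; SVD is merely an illustrative choice here, and the toy data is our own:

```python
import numpy as np

def item_cf_vectors(user_sets, n_users, f=2):
    """Build the binary user-item matrix from the sets U_i and return
    f-dimensional item CF vectors q_i via truncated SVD."""
    n_items = len(user_sets)
    R = np.zeros((n_users, n_items))
    for i, users in enumerate(user_sets):
        for u in users:
            R[u, i] = 1.0  # user u interacted with item i
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # Item vectors: rows of V truncated to f factors, scaled by sqrt(singular values).
    return Vt[:f].T * np.sqrt(s[:f])

# Toy example: 5 users, 3 items.
U_sets = [{0, 1, 2}, {1, 2}, {3, 4}]
Q = item_cf_vectors(U_sets, n_users=5, f=2)
print(Q.shape)  # (3, 2)
```

Each row of `Q` is one fixed-size vector 𝑞𝑖, which the second phase then uses as an auxiliary regression target.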
3.2 Technical Approach
In the following, we describe our approach more formally. In this formalization, we use 𝑖 ∈ {1, . . . , 𝐼} to index items. Table 1 summarizes the notations used in this paper.
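As a reading aid, the per-item data of Table 1 can be mirrored in a small data structure; this is only a sketch, and the field names are our own, not the paper's:

```python
from dataclasses import dataclass, field
from typing import Optional, Set
import numpy as np

@dataclass
class Item:
    """Per-item training data, mirroring Table 1 (field names are illustrative)."""
    x: np.ndarray                                  # x_i: image content
    y: np.ndarray                                  # y_i: category labels (binary vector)
    users: Set[int] = field(default_factory=set)   # U_i: users who interacted with the item
    q: Optional[np.ndarray] = None                 # q_i: CF vector (absent for cold items)
    omega: float = 1.0                             # omega_i: weight of the CF vector

# A cold item at inference time carries only its image (q stays None).
item = Item(x=np.zeros((224, 224, 3)), y=np.array([0, 1, 0]), users={7, 42})
```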
At training time, we assume each item 𝑖 is associated with an image 𝑥𝑖 and with some label 𝑦𝑖, a category in our use case. Furthermore, the item may be associated with a set 𝑈𝑖 of users who interacted with it. At inference time, given an image to be modeled, we assume it is not associated with historical interaction data.
In many practical scenarios, the content understanding model must be invoked upon introducing a new item to the system, before it has gathered any usage data (also known as a cold item, a ubiquitous problem in recommender systems [52]). We emphasize that integrating collaborative filtering information at inference time
¹ The latent user vectors are not used in this process.