In this paper, we adopt the idea of projecting image features
in a latent embedding space via a Neural Network (NN) model.
We propose a class-balanced triplet loss that separates image
features in a latent embedding space for class-imbalanced
datasets. We also propose a Gaussian Process (GP) model to
learn a mapping between features and a semantic space. The
classical Gaussian Process (GP), when used in the setting of
regression, is robust to overfitting [18]. If training and testing
data come from the same distribution, a PAC-Bayesian Bound
[19] guarantees that the training error will be close to the
testing error.
Our experiments demonstrate that our model, though em-
ploying a simple design, can reach SOTA performance on the
class-imbalanced ZSL datasets AWA1, AWA2 and APY in the
Generalized ZSL setting.
The main contributions of our work are:
1) We propose a novel, simple framework for ZSL, where
image features from a deep Neural Network are mapped
into a latent embedding space to generate latent pro-
totypes for each seen class by a novel triplet training
model. A Gaussian Process (GP) regression model is
then trained by maximizing the marginal likelihood to
predict latent prototypes of unseen classes.
2) The mapping from image features to a latent space is
performed by our proposed triplet training model for
ZSL, using a novel triplet loss that is robust on
class-imbalanced ZSL datasets. Our experiments show
improved performance over the traditional triplet loss
on all ZSL datasets, including SOTA performance on
class-imbalanced datasets, specifically, AWA1, AWA2
and APY.
3) Given feature vectors extracted by a pre-trained ResNet,
our model has an average training time of 5 minutes on
all ZSL datasets, faster than several high-accuracy
SOTA models.
II. RELATED WORK
Traditional and Generalized ZSL: Early ZSL research
adopts a so-called Traditional ZSL setting [1], [20]. In the
Traditional ZSL setting, the model is trained on images of
seen classes and semantic vectors of both seen and unseen classes.
Test images are restricted to the unseen classes. However, in
practice, test images may also come from the seen classes
[17]. The Generalized ZSL setting was proposed to handle
test sets that contain images from both seen and unseen
classes. According to Xian et al. [17], models that have good
performance in the Traditional ZSL setting may not work well
in the Generalized ZSL setting.
Prototypical Methods. Our classification model is related
to prototypical methods proposed in Zero-Shot and Few-
Shot learning [21], [22], [23]. In prototypical methods,
a prototype is learned for each class to aid classification. For
example, Snell et al. [21] propose a neural network to learn a
projection from semantic vectors to feature prototypes of each
class. Test samples are classified via Nearest Neighbor among
prototypes. While the classification process of our model is
similar to prototypical methods, our model uses a Gaussian
Process Regression instead of Neural Networks to predict
prototypes of unseen classes.
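The prototype-based classification step can be sketched as a nearest-neighbor rule over latent prototypes. The function below is our own illustrative sketch; the function name, array shapes, and the choice of Euclidean distance are assumptions, not details taken from the paper:

```python
import numpy as np

def classify_by_prototype(features, prototypes):
    """Assign each sample to the class of its nearest prototype.

    features:   (n, d) array of embedded test samples
    prototypes: (c, d) array holding one latent prototype per class
    Returns an (n,) array of predicted class indices.
    """
    # Pairwise Euclidean distances between every sample and every prototype
    dists = np.linalg.norm(features[:, None, :] - prototypes[None, :, :], axis=-1)
    # Nearest prototype wins
    return dists.argmin(axis=1)
```

In the Generalized ZSL setting, `prototypes` would stack the learned seen-class prototypes with the regression-predicted unseen-class prototypes.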
Inductive and Transductive ZSL: Similar to most ZSL
models, the model we propose is an inductive ZSL model.
Inductive ZSL requires that no feature information of unseen
classes is present during the training phase [17]. Models that
introduce unlabeled unseen images during the training phase
are called transductive ZSL models [24]. To ensure a fair
comparison, results from such models are usually reported
separately from those of inductive models, since additional
information is introduced [3], [25], [26].
Triplet Loss. Many ZSL models incorporate a triplet
loss in their framework to separate samples from different
classes. Chacheux et al. [8] proposed a variant of a triplet loss
in their model to learn feature prototypes for different classes.
Han et al. [9] adopt an improved version of the triplet loss
called “center loss” proposed in [27] that separates samples
in a latent space. In contrast to these models, we observe
that existing triplet losses proposed for the ZSL problem
may not perform well on class-imbalanced datasets such as
AWA1, AWA2, and APY. We propose an improved triplet
training model to mitigate this problem.
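To illustrate the kind of balancing involved, the sketch below reweights a standard triplet hinge loss by inverse class frequency, so that triplets anchored in rare classes contribute as much as those from frequent ones. This is a minimal sketch under our own assumptions, not the exact loss proposed in the paper:

```python
import numpy as np

def balanced_triplet_loss(anchor, positive, negative,
                          anchor_labels, class_counts, margin=1.0):
    """Triplet hinge loss reweighted by inverse class frequency.

    anchor, positive, negative: (n, d) arrays of embedded triplets
    anchor_labels: (n,) class index of each anchor
    class_counts:  (c,) number of training samples per class
    """
    class_counts = np.asarray(class_counts, dtype=float)
    anchor_labels = np.asarray(anchor_labels)
    # Standard triplet hinge term per triplet
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    per_triplet = np.maximum(0.0, d_pos - d_neg + margin)
    # Inverse-frequency weights, normalized to mean 1 so the loss
    # scale stays comparable to the unweighted version
    w = 1.0 / class_counts[anchor_labels]
    w = w * (len(w) / w.sum())
    return float((w * per_triplet).mean())
```

With uniform class counts the weights reduce to 1 and the loss coincides with the ordinary triplet loss.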
Gaussian Process Regression. For the ZSL problem,
Dolma et al. [28] proposed a model that performs k-nearest
neighbor search for test samples over training samples and per-
forms a GP regression based on the search result. Mukherjee
et al. [29] model image features and semantic vectors for each
class with Gaussians, and learn a linear projection between
the two distributions. Our model is closest to Elhoseiny
et al. [30], where Gaussian Process Regression is used to
predict unseen class prototypes based on seen class prototypes.
However, they apply the Gaussian Process directly, without
the benefit of a learned embedding network, and report
relatively weak results. Verma and Rai [3] proposed a
Kernel Ridge Regression (KRR) approach called GFZSL for
the Traditional ZSL problem. Our experiments demonstrate
that our model outperforms GFZSL by a large margin.
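The prototype-regression idea can be sketched with the standard GP posterior mean: given seen-class semantic vectors and their learned latent prototypes, predict unseen-class prototypes from the unseen semantic vectors. The RBF kernel, fixed hyperparameters, and function names below are illustrative assumptions; the actual model additionally fits kernel hyperparameters by maximizing the marginal likelihood, which this sketch omits:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale ** 2)

def gp_predict_prototypes(sem_seen, proto_seen, sem_unseen,
                          noise=1e-4, length_scale=1.0):
    """GP-regression posterior mean: predict unseen-class prototypes
    from seen-class (semantic vector, prototype) pairs."""
    K = rbf_kernel(sem_seen, sem_seen, length_scale) + noise * np.eye(len(sem_seen))
    K_star = rbf_kernel(sem_unseen, sem_seen, length_scale)
    # Posterior mean K_* K^{-1} Y, computed with a linear solve for stability
    return K_star @ np.linalg.solve(K, proto_seen)
```

Each output dimension of the prototype is regressed independently, which is what the matrix-valued solve above does implicitly.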
III. PROPOSED APPROACH
We propose a hybrid model for the ZSL problem: a Latent
Feature Embedding model to separate inter-class features that
is robust to class-imbalanced datasets, a GP Regression model
to predict prototypes of unseen classes based on seen classes
and semantic information, and a calibrated classifier to balance
the trade-off between seen and unseen class accuracy.
A. Latent Feature Embedding Model
Model Structure. We propose to learn a linear NN mapping
from image features to latent embeddings. We argue that for
the ZSL task, a linear projection with limited flexibility can
help prevent the model from overfitting on seen class training
samples. Following others [3], [28], we model feature vectors
from each class using the multivariate Gaussian distribution.
We exploit the fact that Gaussian random vectors are closed
under linear transformations.
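Concretely, closure under linear maps means that if a class's features satisfy x ~ N(mu, Sigma), the latent embedding z = Wx satisfies z ~ N(W mu, W Sigma W^T), so class-conditional Gaussians survive the projection. The snippet below is a quick numerical sanity check of this identity; the dimensions and sample count are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])          # class-conditional feature mean
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 0.1 * np.eye(3)        # a valid (positive-definite) covariance
W = rng.normal(size=(2, 3))              # linear embedding: 3-d features -> 2-d latent

# Sample x ~ N(mu, Sigma) and push the samples through the linear map
x = rng.multivariate_normal(mu, Sigma, size=200_000)
z = x @ W.T

# Empirical moments of z should match the closed-form W mu and W Sigma W^T
emp_mean = z.mean(axis=0)
emp_cov = np.cov(z, rowvar=False)
```

This is what lets the model reason about class-conditional Gaussians directly in the latent space after the linear projection.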