A Lower Bound of Hash Codes’ Performance
Xiaosu Zhu1
xiaosu.zhu@outlook.com
Jingkuan Song1
jingkuan.song@gmail.com
Yu Lei1
leiyu6969@gmail.com
Lianli Gao1
lianli.gao@uestc.edu.cn
Heng Tao Shen1,2
shenhengtao@hotmail.com
1Center for Future Media, University of Electronic Science and Technology of China
2Peng Cheng Laboratory
Abstract
As a crucial approach for compact representation learning, hashing has achieved
great success in effectiveness and efficiency. Numerous heuristic Hamming space
metric learning objectives are designed to obtain high-quality hash codes. Nevertheless, a theoretical analysis of the criteria for learning good hash codes remains largely unexplored. In this paper, we prove that inter-class distinctiveness and intra-class compactness among hash codes determine the lower bound of hash codes' performance. Promoting these two characteristics could lift the bound and improve hash learning. We then propose a surrogate model to fully exploit the above objective by estimating the posterior of hash codes and controlling it, which results in a low-bias optimization. Extensive experiments reveal the effectiveness of the proposed method. By testing on a series of hash-models, we obtain performance improvements among all of them, with an up to 26.5% increase in mean Average Precision and an up to 20.5% increase in accuracy. Our code is publicly available at https://github.com/VL-Group/LBHash.
1 Introduction
The explosive increase of multimedia digital content requires people to develop large-scale processing techniques for effective and efficient multimedia understanding [55, 58, 60, 61]. Meanwhile, the growing focus on carbon neutrality initiatives urges researchers to handle the above issues with low power and memory consumption [34, 36]. Fortunately, hashing, which plays a key role in compact representation learning, has demonstrated efficiency and effectiveness over the past few decades towards the above demands [26]. By digesting multimedia contents into short binary codes, hashing is able to significantly reduce storage, boost speed, and preserve key information [18, 25, 29, 31, 64].
In recent years, learning to hash has shown its advantages in alleviating information loss by performing data-dependent Hamming space projection with a non-linear model [28, 37, 54], compared to hand-crafted hashing functions. The purpose of optimizing hash-models is to generate semantically meaningful hash codes for downstream tasks. A practical convention is to increase similarities among codes of the same category and decrease them otherwise. Such a convention is validated to be effective in many applications [9, 30, 38, 45, 46, 56].
Practices in current advances still leave two questions that remain primarily unanswered: when should we determine that a hash-model produces good codes, and by what means can we achieve this goal? Taking a comprehensive study of the above questions is essential since it not only guides us to develop high-performance algorithms in typical hash-based applications but also provides potential possibilities to inspire future works in hash learning.

Corresponding author.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

arXiv:2210.05899v1 [cs.CV] 12 Oct 2022
Although some works [1, 54] try to answer the above questions by building connections between hash codes and common descriptors with domain knowledge, we still seek theoretical guarantees for hash codes' performance. Similar situations occur in related works [34, 52, 53], where objectives to optimize hash-models are designed by intuition or heuristics.
By formulating correlations among hash codes as inter-class distinctiveness and intra-class compactness, we try to address the above issues and provide a lower bound of hash codes' performance. Based on this, we could further derive objectives from these two characteristics to lift the lower bound. Moreover, to effectively train the model towards such targets, estimating and further controlling the posterior of hash codes is necessary since bit-level dependence exists in the posterior distribution. If we neglect this, bias will be introduced during hash-model optimization, hindering performance. Our main contributions are summarized as follows:
1) To the best of our knowledge, we are the first to formulate a lower bound of hash codes' performance, which is directly proportional to inter-class distinctiveness and intra-class compactness among codes. Based on this, an objective is further proposed to lift this lower bound and thus enhance performance.
2) A novel posterior estimation over hash codes is proposed to perform low-bias optimization towards the above objective. As a non-linear function, it could be utilized for optimizing hash-models via gradient descent. Such a plug-and-play component is able to integrate into current hash-models to boost performance.
3) Extensive experiments on three benchmark datasets reveal the effectiveness of the proposed method in two typical tasks. Specifically, we obtain an up to 26.5% increase in mean Average Precision for fast retrieval. We also observe an up to 20.5% increase in accuracy for hash-based recognition.
The rest of the paper is organized as follows: we first briefly review current hashing methods and give preliminaries (Secs. 2 and 3). Then, a lower bound on hash codes' performance determined by inter-class distinctiveness and intra-class compactness is proposed (Sec. 4). To fully exploit such guidance, a posterior estimation over the multivariate Bernoulli distribution is designed for code control (Sec. 5). Experiments (Sec. 6) are then conducted for evaluation.
2 Related Works
This paper focuses on data-dependent learning-to-hash methods. Recent advances in building these models are validated to be effective in hash-based recognition and fast retrieval [1, 21, 22, 30, 45]. The typical approach utilizes a projection function to convert raw inputs into compact binary codes and conducts classification or approximate nearest neighbour search on the codes for downstream tasks [9, 46, 56]. As mentioned in the introduction, to obtain such a projector, current practices suggest formulating it as a Hamming space metric learning problem under binary constraints. Pointwise [25, 33, 57], pairwise [5, 6, 29, 30, 42, 63], triplet [13, 32], listwise [51, 62] and prototype-based [2, 7, 21, 59] objectives are all studied for model training, where the core idea is to adjust similarities among samples from the same / different categories.
Besides the above objectives, we should also note that hash-model optimization is NP-hard [1, 33, 45, 54]. That is caused by the discrete, binarized hash code formulation. To tackle the issue, previous works propose many methods that target the following purposes: a) Surrogate functions that approximate hashing by continuous functions, such as tanh [6], stochastic neurons [12], affine functions [43] or Straight-Through estimation [3]. By adopting these, the discrete hash-model optimization is transformed into an approximated but easier one solvable via gradient descent. b) Non-differentiable optimization algorithm design, which focuses on solutions that directly handle discrete inputs, including coordinate descent [41], discrete gradient estimation [16], etc. There is also a bag of tricks that helps stabilize training. For example, minimizing the quantization error reduces the gap between the raw outputs of the model and the final hash codes [30], and code-balancing alleviates the problem of producing trivial codes [15].
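As a concrete illustration of the surrogate-function family, below is a minimal PyTorch sketch of straight-through sign binarization (an illustrative sketch, not the exact formulation of [3] or of any other cited work):

import torch

class STESign(torch.autograd.Function):
    """Binarize to {-1, +1} in the forward pass; pass gradients straight through."""
    @staticmethod
    def forward(ctx, logits):
        return torch.where(logits >= 0, torch.ones_like(logits), -torch.ones_like(logits))

    @staticmethod
    def backward(ctx, grad_output):
        # Identity (straight-through) gradient; some variants also clip it to [-1, 1].
        return grad_output

codes = STESign.apply(torch.randn(8, 64, requires_grad=True))  # 8 samples, 64-bit codes

In this way, the non-differentiable binarization still lets gradients reach the layers before it, which is the purpose shared by the surrogate functions listed above.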
Please refer to the literature reviews on the above works [34, 52, 53] for comprehensive studies on learning to hash. Two issues exist and remain primarily unaddressed in the above works. The first is objective design, which is commonly driven by intuition and comes with few theoretical guarantees.
The second is model optimization, where bit-level dependence is essential for code control but most works neglect it. To tackle these, we first give a lower bound of hash codes' performance and utilize it as guidance for hash-model training. Then, a novel model optimization approach is proposed from a multivariate perspective for low-bias code control.
3 Preliminaries
Learning to hash performs data-dependent compact representation learning. Inputs $\boldsymbol{x} \in \mathcal{X} \subset \mathbb{R}^d$ are firstly transformed into an $h$-dim real-valued vector $\boldsymbol{\ell}$ by $F_\theta$, a projection parameterized by $\theta$: $\mathbb{R}^d \xrightarrow{F_\theta} \mathbb{R}^h$. Then, $\boldsymbol{\ell}$ is binarized to an $h$-bit binary code $\boldsymbol{b} \in \mathcal{B} \subseteq \mathbb{H}^h$:
$$\boldsymbol{x} \xrightarrow{F_\theta} \boldsymbol{\ell} \xrightarrow{\mathrm{bin}} \boldsymbol{b}, \quad \text{where } \mathrm{bin}(\cdot) = \begin{cases} +1, & (\cdot) \geq 0, \\ -1, & (\cdot) < 0. \end{cases}$$
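A minimal sketch of this pipeline (the two-layer projection and the sizes d, h below are placeholders for F_theta, not the paper's actual architecture):

import torch
import torch.nn as nn

d, h = 2048, 32                              # input dimension d, code length h
F_theta = nn.Sequential(                     # placeholder projection F_theta: R^d -> R^h
    nn.Linear(d, 512), nn.ReLU(),
    nn.Linear(512, h),
)

x = torch.randn(4, d)                        # a batch of inputs from X
ell = F_theta(x)                             # real-valued h-dim outputs l
b = torch.where(ell >= 0, torch.ones_like(ell), -torch.ones_like(ell))  # bin(.)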
Typical usage of hash codes. Common hashing-based tasks include fast retrieval [2, 23, 31, 44] and recognition [30, 37, 41, 45]. For fast retrieval, given a query under the same distribution as $\mathcal{X}$, we convert it into a query hash code $\boldsymbol{q}$ and conduct fast approximate nearest neighbour search in $\mathcal{B}$ to produce a rank list $R$. In $R$, samples are organized from nearest to furthest to the query. Correspondingly, among all results in $R$, true positives $\boldsymbol{tp}_i \in TP$ indicate samples that match the query, while the others are false positives $\boldsymbol{fp}_i \in FP$ ($i$ indicates the rank). As a criterion, Average Precision (AP) is commonly adopted to determine how good the retrieved results are. It encourages true positives to have higher ranks than false positives: $AP = \frac{1}{|TP|} \sum_{\boldsymbol{tp}_{i,m} \in TP} P@i$, where $P@i$ is the rank-$i$ precision. As for hash-based recognition, since current works formulate it as a variant of retrieval, we could still adopt AP as a criterion to indicate performance. The detailed explanation is placed in Supp. Sec. C.
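For reference, the AP criterion described above can be computed from a rank list of binary relevance flags as follows (a plain-Python sketch; the function name is ours):

def average_precision(relevance):
    """relevance: 0/1 flags ordered from nearest to furthest from the query."""
    num_tp, precisions = 0, []
    for i, rel in enumerate(relevance, start=1):
        if rel:                                  # a true positive at rank i
            num_tp += 1
            precisions.append(num_tp / i)        # P@i
    return sum(precisions) / len(precisions) if precisions else 0.0

print(average_precision([1, 0, 1, 1, 0]))        # 0.805...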
4 Lower Bound by Inter-Class Distinctiveness and Intra-Class Compactness
We start from an arbitrary rank list to study the characteristics of codes in the above scenarios. To calculate AP easily, we first introduce the mis-rank, which indicates how many false positives take higher places than true positives. We refer readers to Supp. Secs. A and B for detailed descriptions and proofs.
Definition. The mis-rank $m$ indicates how many false positives have higher ranks than a true positive, e.g., $\boldsymbol{tp}_{i,m}$ is at rank $i$, meanwhile $m = |\{d(\boldsymbol{q}, \boldsymbol{fp}) < d(\boldsymbol{q}, \boldsymbol{tp}_{i,m})\}|$, where $d(\boldsymbol{q}, \cdot)$ is the distance between $\boldsymbol{q}$ and a sample, and $|\cdot|$ is the number of elements in a set.

Remark. The rank-$i$ precision $P@i$ is derived to be $(i - m)/i$.
Then, the average precision can be derived as:
$$AP = \frac{1}{|TP|} \sum_{\boldsymbol{tp}_{i,m} \in TP} P@i = \frac{1}{|TP|} \sum_{\boldsymbol{tp}_{i,m} \in TP} \frac{i - m}{i},$$
for all true positives in the rank list. This is because the rank-$i$ precision $P@i$ for any true positive $\boldsymbol{tp}_{i,m}$ equals $(i - m)/i$.
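A quick sanity check of this identity: for each true positive at rank i, m counts the false positives ranked above it, so (i - m)/i recovers P@i and the mis-rank form of AP coincides with the usual one (a sketch, reusing the rank-list convention from Sec. 3):

def ap_via_misrank(relevance):
    """AP computed as (1/|TP|) * sum over true positives of (i - m) / i."""
    terms = []
    for i, rel in enumerate(relevance, start=1):
        if rel:
            m = sum(1 for r in relevance[:i - 1] if not r)   # false positives above rank i
            terms.append((i - m) / i)
    return sum(terms) / len(terms) if terms else 0.0

assert abs(ap_via_misrank([1, 0, 1, 1, 0]) - 0.80555) < 1e-3  # matches the usual AP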
From the derivation, we immediately obtain that $AP$ increases iff $m$ decreases. Therefore, we can now focus on how to reduce $m$ and thereby increase the average precision. Noticing that the value of $m$ is highly related to two distances, $d(\boldsymbol{q}, \boldsymbol{fp})$ and $d(\boldsymbol{q}, \boldsymbol{tp})$, the following proposition is raised:
Proposition.
$$\bar{m} \propto \frac{\max d(\boldsymbol{q}, \boldsymbol{tp})}{\min d(\boldsymbol{q}, \boldsymbol{fp})}, \quad \forall \boldsymbol{tp} \in TP,\ \boldsymbol{fp} \in FP,$$
where $\bar{\cdot}$ denotes the upper bound. Correspondingly,
$$\underline{AP} \propto \frac{\min d(\boldsymbol{q}, \boldsymbol{fp})}{\max d(\boldsymbol{q}, \boldsymbol{tp})}, \quad \forall \boldsymbol{tp} \in TP,\ \boldsymbol{fp} \in FP,$$
where $\underline{\cdot}$ denotes the lower bound.
If inputs $\boldsymbol{x}_j$ have associated labels $y_j$, the above results can be further generalized. Class-wise centers, inter-class distinctiveness and intra-class compactness are introduced to formulate our final lower bound of hash codes' performance.
Definition. A center $\boldsymbol{c}_c \in C$ is a representative hash code of a specific class $c$:
$$\boldsymbol{c}_c = \arg\min_{\boldsymbol{b}} \sum_j d(\boldsymbol{b}, \boldsymbol{b}_j), \quad \forall \boldsymbol{b}_j \mid y_j = c,$$
where $\boldsymbol{b}_j$ is the $j$-th sample in the set $\mathcal{B}$ and $y_j$ is the label of $\boldsymbol{b}_j$.
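For the Hamming distance over ±1 codes, this argmin can be computed bit-wise: the sign of the per-bit sum (a majority vote, with ties broken towards +1 as in bin(·)) minimizes the summed distance. A minimal sketch:

import torch

def class_center(codes):
    """codes: (n, h) tensor of +1/-1 codes of one class; returns its (h,) center."""
    s = codes.sum(dim=0)
    return torch.where(s >= 0, torch.ones_like(s), -torch.ones_like(s))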
Definition. Inter-class distinctiveness is $\min D_{inter}$, where $D_{inter}$ is a set that measures distances between all centers over $C$: $\{d(\boldsymbol{c}_j, \boldsymbol{c}_k) \mid \forall \boldsymbol{c}_j, \boldsymbol{c}_k \in C\}$. In contrast, intra-class compactness is $1 / \max D_{intra}$, where $D_{intra} = \{d(\boldsymbol{b}_j, \boldsymbol{c}_c) \mid \forall y_j = c\}$ measures the distances from samples of the same class to their corresponding center.
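Both quantities can be read off directly from a set of codes, their labels, and the centers, e.g. (a sketch; the helper names are ours):

import torch

def hamming(a, b):
    return (a != b).sum(dim=-1)              # Hamming distance between +1/-1 codes

def bound_terms(codes, labels, centers):
    """codes: (n, h); labels: (n,); centers: (num_classes, h), all +1/-1 tensors."""
    c = centers.shape[0]
    d_inter = [hamming(centers[j], centers[k]) for j in range(c) for k in range(c) if j != k]
    min_d_inter = torch.stack(d_inter).min()              # inter-class distinctiveness
    max_d_intra = hamming(codes, centers[labels]).max()   # 1 / intra-class compactness
    return min_d_inter, max_d_intra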
Combining the above propositions and definitions, we reach:

Proposition.
$$\underline{AP} \propto \frac{\min d(\boldsymbol{q}, \boldsymbol{fp})}{\max d(\boldsymbol{q}, \boldsymbol{tp})} \geq \mathrm{const} \cdot \frac{\min D_{inter}}{\max D_{intra}}. \qquad (1)$$
This proposition reveals how AP is affected by the above two inter- and intra-class factors. Actually, it could cover common scenarios as a criterion of hash codes' performance, e.g., fast retrieval and recognition, which will be explained in the supplementary materials. Intuitively, lifting this lower bound would enhance performance, which leads to the following objective:
$$\text{maximize } \min D_{inter}, \qquad (2)$$
$$\text{minimize } \max D_{intra}, \qquad (3)$$
$$\text{s.t. } \mathcal{X} \subset \mathbb{R}^d \xrightarrow{F_\theta,\ \mathrm{bin}} \mathcal{B} \subseteq \mathbb{H}^h.$$
Inspired by [59], we give a specific solution to implement Eqs. (2) and (3) under supervised hashing by first defining distance-maximized class-specific centers and then shrinking hash codes towards their corresponding centers. We will validate the effectiveness of such a solution in the experiments. It is worth noting that achieving the above goal in general is an open topic, including unsupervised or other complex scenarios, which we leave for future study. Next, to optimize hash-models with the above supervision, we introduce a posterior estimation over hash codes and then control it.
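One common way to pre-define such distance-maximized centers in prototype-based hashing (an assumption on our part; the construction actually used by the paper may differ) is to take rows of a Hadamard matrix, whose pairwise Hamming distance is exactly h/2:

from scipy.linalg import hadamard   # requires h to be a power of two

def hadamard_centers(num_classes, h):
    """Rows of a Hadamard matrix as +1/-1 class centers; pairwise distance is h/2."""
    H = hadamard(h)                 # (h, h) matrix with +1/-1 entries
    assert num_classes <= h, "this simple construction supports at most h classes"
    return H[:num_classes]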
5 Posterior Estimation for Code Control
It is essentially hard to minimize the distance between a hash code and a target to achieve Eq. (3), for two reasons. Firstly, the posterior distribution of $\mathcal{B}$ given inputs $\mathcal{X}$ is formulated to follow a multivariate Bernoulli distribution: $p(\mathcal{B} \mid \mathcal{X}) \sim \mathrm{MVB}$. If we do not take correlations among different variables into account, optimization on $\mathcal{B}$ is biased. Secondly, dealing with hash codes is a $\{-1, +1\}$ discrete optimization, which is not trivial to handle. Therefore, we try to organize Eq. (3) as a Maximum Likelihood Estimation (MLE) to tackle the above issues. Consider the definition of the Hamming distance between an arbitrary hash code $\boldsymbol{b}$ and its target $\boldsymbol{t}$:
$$d(\boldsymbol{b}, \boldsymbol{t}) = \sum_{i}^{h} \mathbb{1}\{b_i \neq t_i\} = \sum_{i}^{h} \mathbb{1}\{\ell_i t_i < 0\},$$
where $\mathbb{1}$ is the characteristic function and $(\cdot)_i$ indicates the value on the $i$-th dimension. Since the Hamming distance measures how many bits differ between two codes, the probability of $d(\boldsymbol{b}, \boldsymbol{t}) = \delta$ can be formulated as:
$$p(d(\boldsymbol{b}, \boldsymbol{t}) = \delta) = \sum_{i \in \binom{h}{\delta},\ j \notin \binom{h}{\delta}} p(b_i \neq t_i,\ b_j = t_j).$$
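For small h, this probability can be evaluated by brute force from the full joint table, which also makes the O(2^h) cost discussed below explicit (a sketch; joint[i] is assumed to store the probability of the code whose k-th bit is +1 iff the k-th bit of i is 1):

def prob_distance(joint, target, h, delta):
    """Sum the joint probability over all codes at Hamming distance delta from target."""
    total = 0.0
    for i in range(2 ** h):                                      # O(2^h) enumeration
        code = [1 if (i >> k) & 1 else -1 for k in range(h)]
        if sum(c != t for c, t in zip(code, target)) == delta:
            total += joint[i]
    return total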
Therefore, $\boldsymbol{b}$ and $\boldsymbol{t}$ have distance $\delta$ iff $p(d(\boldsymbol{b}, \boldsymbol{t}) = \delta)$ is maximized. However, precisely calculating it is difficult, since it involves the joint probability of $\boldsymbol{b}$. Therefore, we estimate all joint probabilities by adopting a surrogate model $P_\pi$ parameterized by $\pi$:
$$\boldsymbol{\ell} \xrightarrow{P_\pi} \boldsymbol{o}, \quad \text{where } \boldsymbol{o} \in O \subset \mathbb{R}^{2^h}, \qquad (4)$$
where $\boldsymbol{o}$ contains the probabilities of a Categorical distribution $p(O \mid \mathcal{X}) \triangleq p(\mathcal{B} \mid \mathcal{X})$, which has $2^h$ entries. Each entry of $\boldsymbol{o}$ estimates one specific joint probability by feeding $\boldsymbol{\ell}$ into $P_\pi$:
$$o_i \triangleq p\left( (b_k > 0)^{\mathbb{1}\{\iota_k = 1\}},\ (b_k < 0)^{\mathbb{1}\{\iota_k = 0\}},\ 1 \leq k \leq h \right), \quad 0 \leq i \leq 2^h - 1,$$
where $\iota$ is the $h$-bit binary representation of $i$. For example, $o_3 = p(\boldsymbol{b} = (-1\ -1\ +1\ +1))$ when $h = 4$.
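The mapping between a code b and its entry index i in o follows the binary representation ι; a sketch of both directions, using the indexing introduced in Eq. (5) below:

def code_to_index(b):
    """b: list of +1/-1 values; returns i = sum_k 1{b_k > 0} * 2^(k-1)."""
    return sum(1 << k for k, bit in enumerate(b) if bit > 0)

def index_to_code(i, h):
    """Inverse mapping: recover the +1/-1 code from the h-bit representation of i."""
    return [1 if (i >> k) & 1 else -1 for k in range(h)]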
To train $P_\pi$, we perform Maximum Likelihood Estimation (MLE) over $p(O \mid \mathcal{X})$:
$$\mathcal{L}_\pi(P_\pi; \boldsymbol{o}) = -\log o_i, \quad \text{where } i = \sum_{k=1}^{h} \mathbb{1}\{b_k > 0\} \cdot 2^{k-1}. \qquad (5)$$
If $P_\pi$ is ready for estimation, we could directly adopt it as a nice non-linear function to maximize $p(d(\boldsymbol{b}, \boldsymbol{t}) = k)$. Note that to realize Eq. (3), we want $\boldsymbol{b}$ and its corresponding center $\boldsymbol{c}$ to be closest, i.e., to maximize $p(d(\boldsymbol{b}, \boldsymbol{c}) = 0)$. So, we could reformulate this maximization as another MLE:
$$\mathcal{L}_\theta(F_\theta; \boldsymbol{b}) = -\log p(d(\boldsymbol{b}, \boldsymbol{c}) = 0), \quad \text{where } p(d(\boldsymbol{b}, \boldsymbol{c}) = 0) = p(b_i = c_i,\ 1 \leq i \leq h), \qquad (6)$$
which could be optimized by calculating surrogate gradients of Eq. (6) through $P_\pi$:
$$\frac{\partial \hat{\mathcal{L}}_\theta}{\partial \theta} = \frac{\partial \left\{ -\log o_i;\ i = \sum_{k=1}^{h} \mathbb{1}\{c_k > 0\} \cdot 2^{k-1} \right\}}{\partial \pi} \frac{\partial \pi}{\partial \boldsymbol{\ell}} \frac{\partial \boldsymbol{\ell}}{\partial \theta} \triangleq \frac{\partial \mathcal{L}_\theta}{\partial \theta}. \qquad (7)$$
Unfortunately, such a joint probability estimation on $\mathrm{MVB}$ requires $O(2^h)$ complexity, which is not flexible when $h$ is large. By adopting the idea of block codes [49], the above estimation can be performed on long hash bits by separating hash codes into a series of blocks. Specifically, any $h$-bit hash code can be split into $u$ blocks, where each block consumes $h/u$ bits. Correspondingly, $u$ independent surrogate networks $P_{\pi_i},\ 1 \leq i \leq u$, are adopted to perform the above estimation and back-propagation simultaneously. With such a decomposition, the whole problem is transformed into $u$ sub-problems with a reduced complexity $O(2^h) \rightarrow O(u \cdot 2^{h/u})$.
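A minimal PyTorch sketch of such a block-decomposed surrogate (layer sizes and names are illustrative, not the paper's exact design): each of the u heads maps its h/u-dim slice of ℓ to a categorical distribution over the 2^(h/u) joint outcomes of its block, and the MLE of Eq. (5) becomes a cross-entropy against the index of the observed sub-code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockSurrogate(nn.Module):
    """u independent heads, each estimating the joint posterior of one h/u-bit block."""
    def __init__(self, h, u, hidden=256):
        super().__init__()
        assert h % u == 0
        self.block = h // u
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(self.block, hidden), nn.ReLU(),
                          nn.Linear(hidden, 2 ** self.block))
            for _ in range(u)
        )

    def forward(self, ell):
        # ell: (n, h) outputs of F_theta; returns u logit tensors of shape (n, 2^(h/u)).
        chunks = ell.split(self.block, dim=1)
        return [head(chunk) for head, chunk in zip(self.heads, chunks)]

def code_indices(codes, block):
    """Map +1/-1 (sub-)codes to per-block categorical indices, as in Eq. (5)."""
    bits = (codes > 0).long()                       # (n, h)
    weights = 2 ** torch.arange(block)              # 2^(k-1) for k = 1..block
    return [(chunk * weights).sum(dim=1) for chunk in bits.split(block, dim=1)]

def surrogate_losses(logits_per_block, indices_per_block):
    """Sum of per-block -log o_i terms; serves as L_pi (Eq. (5)) when the indices come
    from the observed codes, and as L_theta-hat (Eqs. (6)-(7)) when they come from centers."""
    return sum(F.cross_entropy(lg, idx) for lg, idx in zip(logits_per_block, indices_per_block))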
To summarize, the overall optimization via gradient descent is presented in Alg. 1. We first perform the Eq. (2) step by using the pre-defined centers, and then perform the Eq. (3) step by back-propagation via Eq. (7). The two models $F_\theta, P_\pi$ are optimized with learning rates $\eta_1, \eta_2$, respectively.
Algorithm 1 One implementation under the supervised circumstance.
1: procedure TRAIN(F_θ, P_π)                          ▷ Training procedure of the two models F_θ, P_π.
2:     Generate class-specific centers c ∈ C, |C| = Class-num;    ▷ Eq. (2).
3:     repeat                                         ▷ Main training loop.
4:         Sample x from X with label y;
5:         ℓ = F_θ(x);
6:         o = P_π(ℓ);
7:         π ← π − η_1 ∂L_π/∂π;                        ▷ Eq. (5).
8:         θ ← θ − η_2 ∂L̂_θ/∂θ with the corresponding center c;    ▷ Eq. (3), Eq. (7).
9:     until the total number of epochs is exceeded;
10:    return F_θ;                                     ▷ Optimized hash-model F_θ.
11: end procedure
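Putting the pieces together, a hedged sketch of the loop in Alg. 1 follows, reusing the illustrative BlockSurrogate and helpers from above; F_theta, surrogate, centers, and loader are assumed to exist, and the detach/optimizer arrangement is one plausible reading of steps 7-8, not necessarily the authors' exact implementation.

import torch

opt_pi = torch.optim.Adam(surrogate.parameters(), lr=1e-3)    # eta_1
opt_theta = torch.optim.Adam(F_theta.parameters(), lr=1e-4)   # eta_2

for x, y in loader:
    ell = F_theta(x)
    # Step 7: update pi with Eq. (5); targets are the currently produced codes.
    logits = surrogate(ell.detach())
    loss_pi = surrogate_losses(logits, code_indices(torch.sign(ell.detach()), surrogate.block))
    opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()
    # Step 8: update theta with Eqs. (3) and (7); targets are the class centers.
    logits = surrogate(ell)
    loss_theta = surrogate_losses(logits, code_indices(centers[y], surrogate.block))
    opt_theta.zero_grad(); loss_theta.backward(); opt_theta.step()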
6 Experiments
We conduct extensive experiments on three benchmark datasets to confirm the effectiveness of our proposed method. To make fair comparisons, we first describe the experimental setup.