A Lower Bound of Hash Codes’ Performance
Xiaosu Zhu1
xiaosu.zhu@outlook.com
Jingkuan Song1
jingkuan.song@gmail.com
Yu Lei1
leiyu6969@gmail.com
Lianli Gao1
lianli.gao@uestc.edu.cn
Heng Tao Shen1,2
shenhengtao@hotmail.com
1Center for Future Media, University of Electronic Science and Technology of China
2Peng Cheng Laboratory
Abstract
As a crucial approach for compact representation learning, hashing has achieved
great success in effectiveness and efficiency. Numerous heuristic Hamming space
metric learning objectives are designed to obtain high-quality hash codes. Nevertheless, a theoretical analysis of the criteria for learning good hash codes remains largely unexplored. In this paper, we prove that inter-class distinctiveness and intra-class compactness among hash codes determine the lower bound of hash codes' performance. Promoting these two characteristics could lift the bound and improve hash learning. We then propose a surrogate model to fully exploit the above objective by estimating the posterior of hash codes and controlling it, which results in a low-bias optimization. Extensive experiments reveal the effectiveness of the proposed method. By testing on a series of hash-models, we obtain performance improvements among all of them, with an up to 26.5% increase in mean Average Precision and an up to 20.5% increase in accuracy. Our code is publicly available at https://github.com/VL-Group/LBHash.
1 Introduction
The explosive increase of multimedia digital content requires people to develop large-scale processing techniques for effective and efficient multimedia understanding [55, 58, 60, 61]. Meanwhile, the growing focus on carbon neutrality initiatives urges researchers to handle the above issues with low power and memory consumption [34, 36]. Fortunately, hashing, which plays a key role in compact representation learning, has demonstrated efficiency and effectiveness over the past few decades towards the above demands [26]. By digesting multimedia contents into short binary codes, hashing is able to significantly reduce storage, boost speed, and preserve key information [18, 25, 29, 31, 64].
In recent years, learning to hash has shown its advantages in alleviating information loss by performing data-dependent Hamming space projection with a non-linear model [28, 37, 54], compared to hand-crafted hashing functions. The purpose of optimizing hash-models is to generate semantically meaningful hash codes for downstream tasks. A practical convention is to increase similarities among codes of the same category and decrease them otherwise. Such a convention is validated to be effective in many applications [9, 30, 38, 45, 46, 56].
Practices in current advances still leave two questions that remain primarily unanswered: when should we determine that a hash-model produces good codes, and by what means can we achieve this goal? Taking a comprehensive study of the above questions is essential since it not only guides us to develop high-performance algorithms in typical hash-based applications but also provides potential possibilities to inspire future works in hash learning.

Corresponding author.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

arXiv:2210.05899v1 [cs.CV] 12 Oct 2022
Although some works [1, 54] try to answer the above questions by building connections between hash codes and common descriptors with domain knowledge, we still seek theoretical guarantees for hash codes' performance. Similar situations occur in related works [34, 52, 53], where objectives to optimize hash-models are designed by intuition or heuristics.
By formulating correlations among hash codes as inter-class distinctiveness and intra-class compactness, we try to address the above issues and provide a lower bound of hash codes' performance. Based on this, we could further derive objectives from these two characteristics to lift the lower bound. Moreover, to effectively train the model towards such targets, estimating and further controlling the posterior of hash codes is necessary since bit-level dependence exists in the posterior distribution. If we neglect this, bias will be introduced during hash-model optimization, hindering performance. Our main contributions are summarized as follows:
1) To the best of our knowledge, we are the first to formulate a lower bound of hash codes' performance, which is directly proportional to inter-class distinctiveness and intra-class compactness among codes. Based on this, an objective is further proposed to lift this lower bound and thus enhance performance.
2) A novel posterior estimation over hash codes is proposed to perform low-bias optimization towards the above objective. As a non-linear function, it could be utilized for optimizing hash-models via gradient descent. Such a plug-and-play component is able to integrate into current hash-models to boost performance.
3) Extensive experiments on three benchmark datasets reveal the effectiveness of the proposed method in two typical tasks. Specifically, we obtain an up to 26.5% increase in mean Average Precision for fast retrieval. We also observe an up to 20.5% increase in accuracy for hash-based recognition.
The rest of the paper is organized as follows: we first briefly review current hashing methods and give preliminaries (Secs. 2 and 3). Then, a lower bound on hash codes' performance determined by inter-class distinctiveness and intra-class compactness is proposed (Sec. 4). To fully exploit such guidance, a posterior estimation over the multivariate Bernoulli distribution is designed for code control (Sec. 5). Experiments (Sec. 6) are then conducted for evaluation.
2 Related Works
This paper focuses on data-dependent learning-to-hash methods. Recent advances in building these models are validated to be effective in hash-based recognition and fast retrieval [1, 21, 22, 30, 45]. The typical approach utilizes a projection function to convert raw inputs into compact binary codes and conducts classification or approximate nearest neighbour search on the codes for downstream tasks [9, 46, 56]. As mentioned in the introduction, to obtain such a projector, current practices suggest formulating it as a Hamming space metric learning problem under binary constraints. Pointwise [25, 33, 57], pairwise [5, 6, 29, 30, 42, 63], triplet [13, 32], listwise [51, 62] and prototype-based [2, 7, 21, 59] objectives are all studied for model training, where the core idea is to adjust similarities among samples from the same / different categories.
Besides the above objectives, we should also note that hash-model optimization is NP-hard [1, 33, 45, 54]. That is caused by the discrete, binarized hash code formulation. To tackle the issue, previous works propose many methods that target the following purposes: a) Surrogate functions that approximate hashing by continuous functions, such as tanh [6], stochastic neurons [12], affine functions [43] or Straight-Through estimation [3]. By adopting these, the discrete hash-model optimization is transformed into an approximated but easier one solvable via gradient descent. b) Non-differentiable optimization algorithm design, which focuses on solutions that directly handle discrete inputs, including coordinate descent [41], discrete gradient estimation [16], etc. There is also a bag of tricks that helps stabilize training. For example, minimizing the quantization error reduces the gap between the raw outputs of the model and the final hash codes [30], and code-balancing alleviates the problem of producing trivial codes [15].
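As a concrete illustration of the surrogate-function family, below is a minimal PyTorch sketch of straight-through sign binarization (an illustrative sketch, not the exact formulation of [3] or of any other cited work):

import torch

class STESign(torch.autograd.Function):
    """Binarize to {-1, +1} in the forward pass; pass gradients straight through."""
    @staticmethod
    def forward(ctx, logits):
        return torch.where(logits >= 0, torch.ones_like(logits), -torch.ones_like(logits))

    @staticmethod
    def backward(ctx, grad_output):
        # Identity (straight-through) gradient; some variants also clip it to [-1, 1].
        return grad_output

codes = STESign.apply(torch.randn(8, 64, requires_grad=True))  # 8 samples, 64-bit codes

In this way, the non-differentiable binarization still lets gradients reach the layers before it, which is the purpose shared by the surrogate functions listed above.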
Please refer to the literature reviews on the above works [34, 52, 53] for comprehensive studies on learning to hash. Two issues exist and remain primarily unaddressed in the above works. The first is objective design, which is commonly driven by intuition and comes with few theoretical guarantees.
The second is model optimization, where bit-level dependence is essential for code control but most works neglect it. To tackle these, we first give a lower bound of hash codes' performance and utilize it as guidance for hash-model training. Then, a novel model optimization approach is proposed from a multivariate perspective for low-bias code control.
3 Preliminaries
Learning to hash performs data-dependent compact representation learning. Inputs $\boldsymbol{x} \in \mathcal{X} \subset \mathbb{R}^d$ are firstly transformed into an $h$-dim real-valued vector $\boldsymbol{\ell}$ by $F_\theta$, a projection parameterized by $\theta$: $\mathbb{R}^d \xrightarrow{F_\theta} \mathbb{R}^h$. Then, $\boldsymbol{\ell}$ is binarized to an $h$-bit binary code $\boldsymbol{b} \in \mathcal{B} \subseteq \mathbb{H}^h$:
$$\boldsymbol{x} \xrightarrow{F_\theta} \boldsymbol{\ell} \xrightarrow{\mathrm{bin}} \boldsymbol{b}, \quad \text{where } \mathrm{bin}(\cdot) = \begin{cases} +1, & (\cdot) \geq 0, \\ -1, & (\cdot) < 0. \end{cases}$$
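A minimal sketch of this pipeline (the two-layer projection and the sizes d, h below are placeholders for F_theta, not the paper's actual architecture):

import torch
import torch.nn as nn

d, h = 2048, 32                              # input dimension d, code length h
F_theta = nn.Sequential(                     # placeholder projection F_theta: R^d -> R^h
    nn.Linear(d, 512), nn.ReLU(),
    nn.Linear(512, h),
)

x = torch.randn(4, d)                        # a batch of inputs from X
ell = F_theta(x)                             # real-valued h-dim outputs l
b = torch.where(ell >= 0, torch.ones_like(ell), -torch.ones_like(ell))  # bin(.)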
Typical usage of hash codes. Common hashing-based tasks include fast retrieval [2, 23, 31, 44] and recognition [30, 37, 41, 45]. For fast retrieval, given a query under the same distribution as $\mathcal{X}$, we convert it into a query hash code $\boldsymbol{q}$ and conduct fast approximate nearest neighbour search in $\mathcal{B}$ to produce a rank list $R$. In $R$, samples are organized from nearest to furthest to the query. Correspondingly, among all results in $R$, true positives $\boldsymbol{tp}_i \in TP$ indicate samples that match the query, while the others are false positives $\boldsymbol{fp}_i \in FP$ ($i$ indicates the rank). As a criterion, Average Precision (AP) is commonly adopted to determine how good the retrieved results are. It encourages true positives to have higher ranks than false positives: $AP = \frac{1}{|TP|} \sum_{\boldsymbol{tp}_{i,m} \in TP} P@i$, where $P@i$ is the rank-$i$ precision. As for hash-based recognition, since current works formulate it as a variant of retrieval, we could still adopt AP as a criterion to indicate performance. The detailed explanation is placed in Supp. Sec. C.
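For reference, the AP criterion described above can be computed from a rank list of binary relevance flags as follows (a plain-Python sketch; the function name is ours):

def average_precision(relevance):
    """relevance: 0/1 flags ordered from nearest to furthest from the query."""
    num_tp, precisions = 0, []
    for i, rel in enumerate(relevance, start=1):
        if rel:                                  # a true positive at rank i
            num_tp += 1
            precisions.append(num_tp / i)        # P@i
    return sum(precisions) / len(precisions) if precisions else 0.0

print(average_precision([1, 0, 1, 1, 0]))        # 0.805...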
4 Lower Bound by Inter-Class Distinctiveness and Intra-Class Compactness
We start from an arbitrary rank list to study the characteristics of codes in the above scenarios. To calculate AP easily, we first introduce the mis-rank, which indicates how many false positives take higher places than true positives. We refer readers to Supp. Secs. A and B for detailed descriptions and proofs.
Definition. The mis-rank $m$ indicates how many false positives have higher ranks than a true positive, e.g., $\boldsymbol{tp}_{i,m}$ is at rank $i$, meanwhile $m = |\{d(\boldsymbol{q}, \boldsymbol{fp}) < d(\boldsymbol{q}, \boldsymbol{tp}_{i,m})\}|$, where $d(\boldsymbol{q}, \cdot)$ is the distance between $\boldsymbol{q}$ and a sample, and $|\cdot|$ is the number of elements in a set.

Remark. The rank-$i$ precision $P@i$ is derived to be $(i - m)/i$.
Then, the average precision can be derived as:
$$AP = \frac{1}{|TP|} \sum_{\boldsymbol{tp}_{i,m} \in TP} P@i = \frac{1}{|TP|} \sum_{\boldsymbol{tp}_{i,m} \in TP} \frac{i - m}{i},$$
for all true positives in the rank list. This is because the rank-$i$ precision $P@i$ for any true positive $\boldsymbol{tp}_{i,m}$ equals $(i - m)/i$.
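A quick sanity check of this identity: for each true positive at rank i, m counts the false positives ranked above it, so (i - m)/i recovers P@i and the mis-rank form of AP coincides with the usual one (a sketch, reusing the rank-list convention from Sec. 3):

def ap_via_misrank(relevance):
    """AP computed as (1/|TP|) * sum over true positives of (i - m) / i."""
    terms = []
    for i, rel in enumerate(relevance, start=1):
        if rel:
            m = sum(1 for r in relevance[:i - 1] if not r)   # false positives above rank i
            terms.append((i - m) / i)
    return sum(terms) / len(terms) if terms else 0.0

assert abs(ap_via_misrank([1, 0, 1, 1, 0]) - 0.80555) < 1e-3  # matches the usual AP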
From the derivation, we immediately obtain that $AP$ increases iff $m$ decreases. Therefore, we can now focus on how to reduce $m$ and thereby increase the average precision. Noticing that the value of $m$ is highly related to two distances, $d(\boldsymbol{q}, \boldsymbol{fp})$ and $d(\boldsymbol{q}, \boldsymbol{tp})$, the following proposition is raised:
Proposition.
$$\bar{m} \propto \frac{\max d(\boldsymbol{q}, \boldsymbol{tp})}{\min d(\boldsymbol{q}, \boldsymbol{fp})}, \quad \forall \boldsymbol{tp} \in TP,\ \boldsymbol{fp} \in FP,$$
where $\bar{\cdot}$ denotes the upper bound. Correspondingly,
$$\underline{AP} \propto \frac{\min d(\boldsymbol{q}, \boldsymbol{fp})}{\max d(\boldsymbol{q}, \boldsymbol{tp})}, \quad \forall \boldsymbol{tp} \in TP,\ \boldsymbol{fp} \in FP,$$
where $\underline{\cdot}$ denotes the lower bound.
If inputs $\boldsymbol{x}_j$ have associated labels $y_j$, the above results can be further generalized. Class-wise centers, inter-class distinctiveness and intra-class compactness are introduced to formulate our final lower bound of hash codes' performance.
Definition. A center $\boldsymbol{c}_c \in C$ is a representative hash code of a specific class $c$:
$$\boldsymbol{c}_c = \arg\min_{\boldsymbol{b}} \sum_j d(\boldsymbol{b}, \boldsymbol{b}_j), \quad \forall \boldsymbol{b}_j \mid y_j = c,$$
where $\boldsymbol{b}_j$ is the $j$-th sample in the set $\mathcal{B}$ and $y_j$ is the label of $\boldsymbol{b}_j$.
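For the Hamming distance over ±1 codes, this argmin can be computed bit-wise: the sign of the per-bit sum (a majority vote, with ties broken towards +1 as in bin(·)) minimizes the summed distance. A minimal sketch:

import torch

def class_center(codes):
    """codes: (n, h) tensor of +1/-1 codes of one class; returns its (h,) center."""
    s = codes.sum(dim=0)
    return torch.where(s >= 0, torch.ones_like(s), -torch.ones_like(s))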
Definition. Inter-class distinctiveness is $\min D_{inter}$, where $D_{inter}$ is a set that measures distances between all centers over $C$: $\{d(\boldsymbol{c}_j, \boldsymbol{c}_k) \mid \forall \boldsymbol{c}_j, \boldsymbol{c}_k \in C\}$. In contrast, intra-class compactness is $1 / \max D_{intra}$, where $D_{intra} = \{d(\boldsymbol{b}_j, \boldsymbol{c}_c) \mid \forall y_j = c\}$ measures the distances from samples of the same class to their corresponding center.
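Both quantities can be read off directly from a set of codes, their labels, and the centers, e.g. (a sketch; the helper names are ours):

import torch

def hamming(a, b):
    return (a != b).sum(dim=-1)              # Hamming distance between +1/-1 codes

def bound_terms(codes, labels, centers):
    """codes: (n, h); labels: (n,); centers: (num_classes, h), all +1/-1 tensors."""
    c = centers.shape[0]
    d_inter = [hamming(centers[j], centers[k]) for j in range(c) for k in range(c) if j != k]
    min_d_inter = torch.stack(d_inter).min()              # inter-class distinctiveness
    max_d_intra = hamming(codes, centers[labels]).max()   # 1 / intra-class compactness
    return min_d_inter, max_d_intra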
Combining the above propositions and definitions, we reach:

Proposition.
$$\underline{AP} \propto \frac{\min d(\boldsymbol{q}, \boldsymbol{fp})}{\max d(\boldsymbol{q}, \boldsymbol{tp})} \geq \mathrm{const} \cdot \frac{\min D_{inter}}{\max D_{intra}}. \qquad (1)$$
This proposition reveals how AP is affected by the above two inter- and intra-class factors. Actually, it could cover common scenarios as a criterion of hash codes' performance, e.g., fast retrieval and recognition, which will be explained in the supplementary materials. Intuitively, lifting this lower bound would enhance performance, which leads to the following objective:
$$\text{maximize } \min D_{inter}, \qquad (2)$$
$$\text{minimize } \max D_{intra}, \qquad (3)$$
$$\text{s.t. } \mathcal{X} \subset \mathbb{R}^d \xrightarrow{F_\theta,\ \mathrm{bin}} \mathcal{B} \subseteq \mathbb{H}^h.$$
Inspired by [59], we give a specific solution to implement Eqs. (2) and (3) under supervised hashing by first defining distance-maximized class-specific centers and then shrinking hash codes towards their corresponding centers. We will validate the effectiveness of such a solution in the experiments. It is worth noting that achieving the above goal in general is an open topic, including unsupervised or other complex scenarios, which we leave for future study. Next, to optimize hash-models with the above supervision, we introduce a posterior estimation over hash codes and then control it.
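One common way to pre-define such distance-maximized centers in prototype-based hashing (an assumption on our part; the construction actually used by the paper may differ) is to take rows of a Hadamard matrix, whose pairwise Hamming distance is exactly h/2:

from scipy.linalg import hadamard   # requires h to be a power of two

def hadamard_centers(num_classes, h):
    """Rows of a Hadamard matrix as +1/-1 class centers; pairwise distance is h/2."""
    H = hadamard(h)                 # (h, h) matrix with +1/-1 entries
    assert num_classes <= h, "this simple construction supports at most h classes"
    return H[:num_classes]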
5 Posterior Estimation for Code Control
It is essentially hard to minimize the distance between a hash code and a target to achieve Eq. (3), for two reasons. Firstly, the posterior distribution of $\mathcal{B}$ given inputs $\mathcal{X}$ is formulated to follow a multivariate Bernoulli distribution: $p(\mathcal{B} \mid \mathcal{X}) \sim \mathrm{MVB}$. If we do not take correlations among different variables into account, optimization on $\mathcal{B}$ is biased. Secondly, dealing with hash codes is a $\{-1, +1\}$ discrete optimization, which is not trivial to handle. Therefore, we try to organize Eq. (3) as a Maximum Likelihood Estimation (MLE) to tackle the above issues. Consider the definition of the Hamming distance between an arbitrary hash code $\boldsymbol{b}$ and its target $\boldsymbol{t}$:
$$d(\boldsymbol{b}, \boldsymbol{t}) = \sum_{i}^{h} \mathbb{1}\{b_i \neq t_i\} = \sum_{i}^{h} \mathbb{1}\{\ell_i t_i < 0\},$$
where $\mathbb{1}$ is the characteristic function and $(\cdot)_i$ indicates the value on the $i$-th dimension. Since the Hamming distance measures how many bits differ between two codes, the probability of $d(\boldsymbol{b}, \boldsymbol{t}) = \delta$ can be formulated as:
$$p(d(\boldsymbol{b}, \boldsymbol{t}) = \delta) = \sum_{i \in \binom{h}{\delta},\ j \notin \binom{h}{\delta}} p(b_i \neq t_i,\ b_j = t_j).$$
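For small h, this probability can be evaluated by brute force from the full joint table, which also makes the O(2^h) cost discussed below explicit (a sketch; joint[i] is assumed to store the probability of the code whose k-th bit is +1 iff the k-th bit of i is 1):

def prob_distance(joint, target, h, delta):
    """Sum the joint probability over all codes at Hamming distance delta from target."""
    total = 0.0
    for i in range(2 ** h):                                      # O(2^h) enumeration
        code = [1 if (i >> k) & 1 else -1 for k in range(h)]
        if sum(c != t for c, t in zip(code, target)) == delta:
            total += joint[i]
    return total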
Therefore, $\boldsymbol{b}$ and $\boldsymbol{t}$ have distance $\delta$ iff $p(d(\boldsymbol{b}, \boldsymbol{t}) = \delta)$ is maximized. However, precisely calculating it is difficult, since it involves the joint probability of $\boldsymbol{b}$. Therefore, we estimate all joint probabilities by adopting a surrogate model $P_\pi$ parameterized by $\pi$:
$$\boldsymbol{\ell} \xrightarrow{P_\pi} \boldsymbol{o}, \quad \text{where } \boldsymbol{o} \in O \subset \mathbb{R}^{2^h}, \qquad (4)$$
where $\boldsymbol{o}$ contains the probabilities of a Categorical distribution $p(O \mid \mathcal{X}) \triangleq p(\mathcal{B} \mid \mathcal{X})$, which has $2^h$ entries. Each entry of $\boldsymbol{o}$ estimates one specific joint probability by feeding $\boldsymbol{\ell}$ into $P_\pi$:
$$o_i \triangleq p\left( (b_k > 0)^{\mathbb{1}\{\iota_k = 1\}},\ (b_k < 0)^{\mathbb{1}\{\iota_k = 0\}},\ 1 \leq k \leq h \right), \quad 0 \leq i \leq 2^h - 1,$$
where $\iota$ is the $h$-bit binary representation of $i$. For example, $o_3 = p(\boldsymbol{b} = (-1\ -1\ +1\ +1))$ when $h = 4$.
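The mapping between a code b and its entry index i in o follows the binary representation ι; a sketch of both directions, using the indexing introduced in Eq. (5) below:

def code_to_index(b):
    """b: list of +1/-1 values; returns i = sum_k 1{b_k > 0} * 2^(k-1)."""
    return sum(1 << k for k, bit in enumerate(b) if bit > 0)

def index_to_code(i, h):
    """Inverse mapping: recover the +1/-1 code from the h-bit representation of i."""
    return [1 if (i >> k) & 1 else -1 for k in range(h)]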
To train $P_\pi$, we perform Maximum Likelihood Estimation (MLE) over $p(O \mid \mathcal{X})$:
$$\mathcal{L}_\pi(P_\pi; \boldsymbol{o}) = -\log o_i, \quad \text{where } i = \sum_{k=1}^{h} \mathbb{1}\{b_k > 0\} \cdot 2^{k-1}. \qquad (5)$$
If $P_\pi$ is ready for estimation, we could directly adopt it as a nice non-linear function to maximize $p(d(\boldsymbol{b}, \boldsymbol{t}) = k)$. Note that to realize Eq. (3), we want $\boldsymbol{b}$ and its corresponding center $\boldsymbol{c}$ to be closest, i.e., to maximize $p(d(\boldsymbol{b}, \boldsymbol{c}) = 0)$. So, we could reformulate this maximization as another MLE:
$$\mathcal{L}_\theta(F_\theta; \boldsymbol{b}) = -\log p(d(\boldsymbol{b}, \boldsymbol{c}) = 0), \quad \text{where } p(d(\boldsymbol{b}, \boldsymbol{c}) = 0) = p(b_i = c_i,\ 1 \leq i \leq h), \qquad (6)$$
which could be optimized by calculating surrogate gradients of Eq. (6) through $P_\pi$:
$$\frac{\partial \hat{\mathcal{L}}_\theta}{\partial \theta} = \frac{\partial \left\{ -\log o_i;\ i = \sum_{k=1}^{h} \mathbb{1}\{c_k > 0\} \cdot 2^{k-1} \right\}}{\partial \pi} \frac{\partial \pi}{\partial \boldsymbol{\ell}} \frac{\partial \boldsymbol{\ell}}{\partial \theta} \triangleq \frac{\partial \mathcal{L}_\theta}{\partial \theta}. \qquad (7)$$
Unfortunately, such a joint probability estimation on $\mathrm{MVB}$ requires $O(2^h)$ complexity, which is not flexible when $h$ is large. By adopting the idea of block codes [49], the above estimation can be performed on long hash bits by separating hash codes into a series of blocks. Specifically, any $h$-bit hash code can be split into $u$ blocks, where each block consumes $h/u$ bits. Correspondingly, $u$ independent surrogate networks $P_{\pi_i},\ 1 \leq i \leq u$, are adopted to perform the above estimation and back-propagation simultaneously. With such a decomposition, the whole problem is transformed into $u$ sub-problems with a reduced complexity $O(2^h) \rightarrow O(u \cdot 2^{h/u})$.
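A minimal PyTorch sketch of such a block-decomposed surrogate (layer sizes and names are illustrative, not the paper's exact design): each of the u heads maps its h/u-dim slice of ℓ to a categorical distribution over the 2^(h/u) joint outcomes of its block, and the MLE of Eq. (5) becomes a cross-entropy against the index of the observed sub-code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockSurrogate(nn.Module):
    """u independent heads, each estimating the joint posterior of one h/u-bit block."""
    def __init__(self, h, u, hidden=256):
        super().__init__()
        assert h % u == 0
        self.block = h // u
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(self.block, hidden), nn.ReLU(),
                          nn.Linear(hidden, 2 ** self.block))
            for _ in range(u)
        )

    def forward(self, ell):
        # ell: (n, h) outputs of F_theta; returns u logit tensors of shape (n, 2^(h/u)).
        chunks = ell.split(self.block, dim=1)
        return [head(chunk) for head, chunk in zip(self.heads, chunks)]

def code_indices(codes, block):
    """Map +1/-1 (sub-)codes to per-block categorical indices, as in Eq. (5)."""
    bits = (codes > 0).long()                       # (n, h)
    weights = 2 ** torch.arange(block)              # 2^(k-1) for k = 1..block
    return [(chunk * weights).sum(dim=1) for chunk in bits.split(block, dim=1)]

def surrogate_losses(logits_per_block, indices_per_block):
    """Sum of per-block -log o_i terms; serves as L_pi (Eq. (5)) when the indices come
    from the observed codes, and as L_theta-hat (Eqs. (6)-(7)) when they come from centers."""
    return sum(F.cross_entropy(lg, idx) for lg, idx in zip(logits_per_block, indices_per_block))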
To summarize, the overall optimization via gradient descent is presented in Alg. 1. We first perform the Eq. (2) step by using the pre-defined centers, and then perform the Eq. (3) step by back-propagation via Eq. (7). The two models $F_\theta, P_\pi$ are optimized with learning rates $\eta_1, \eta_2$, respectively.
Algorithm 1 One implementation under the supervised circumstance.
1: procedure TRAIN(F_θ, P_π)                          ▷ Training procedure of the two models F_θ, P_π.
2:     Generate class-specific centers c ∈ C, |C| = Class-num;    ▷ Eq. (2).
3:     repeat                                         ▷ Main training loop.
4:         Sample x from X with label y;
5:         ℓ = F_θ(x);
6:         o = P_π(ℓ);
7:         π ← π − η_1 ∂L_π/∂π;                        ▷ Eq. (5).
8:         θ ← θ − η_2 ∂L̂_θ/∂θ with the corresponding center c;    ▷ Eq. (3), Eq. (7).
9:     until the total number of epochs is exceeded;
10:    return F_θ;                                     ▷ Optimized hash-model F_θ.
11: end procedure
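Putting the pieces together, a hedged sketch of the loop in Alg. 1 follows, reusing the illustrative BlockSurrogate and helpers from above; F_theta, surrogate, centers, and loader are assumed to exist, and the detach/optimizer arrangement is one plausible reading of steps 7-8, not necessarily the authors' exact implementation.

import torch

opt_pi = torch.optim.Adam(surrogate.parameters(), lr=1e-3)    # eta_1
opt_theta = torch.optim.Adam(F_theta.parameters(), lr=1e-4)   # eta_2

for x, y in loader:
    ell = F_theta(x)
    # Step 7: update pi with Eq. (5); targets are the currently produced codes.
    logits = surrogate(ell.detach())
    loss_pi = surrogate_losses(logits, code_indices(torch.sign(ell.detach()), surrogate.block))
    opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()
    # Step 8: update theta with Eqs. (3) and (7); targets are the class centers.
    logits = surrogate(ell)
    loss_theta = surrogate_losses(logits, code_indices(centers[y], surrogate.block))
    opt_theta.zero_grad(); loss_theta.backward(); opt_theta.step()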
6 Experiments
We conduct extensive experiments on three benchmark datasets to confirm the effectiveness of our proposed method. To make fair comparisons, we first describe the experimental setup.