IDa-Det: An Information Discrepancy-aware Distillation for 1-bit Detectors
Sheng Xu1†, Yanjing Li1†, Bohan Zeng1†, Teli Ma2, Baochang Zhang1,3∗, Xianbin Cao1, Peng Gao2, Jinhu Lü1,3
1Beihang University, Beijing, China
2Shanghai Artificial Intelligence Laboratory, Shanghai, China
3Zhongguancun Laboratory, Beijing, China
{shengxu, yanjingli, bohanzeng, bczhang}@buaa.edu.cn
Abstract. Knowledge distillation (KD) has been proven to be useful for training compact object detection models. However, we observe that KD is often effective only when the teacher model and the student counterpart share similar proposal information. This explains why existing KD methods are less effective for 1-bit detectors, where a significant information discrepancy exists between the real-valued teacher and the 1-bit student. This paper presents an Information Discrepancy-aware strategy (IDa-Det) to distill 1-bit detectors that can effectively eliminate information discrepancies and significantly reduce the performance gap between a 1-bit detector and its real-valued counterpart. We formulate the distillation process as a bi-level optimization problem. At the inner level, we select the representative proposals with maximum information discrepancy. We then introduce a novel entropy distillation loss to reduce the disparity based on the selected proposals. Extensive experiments demonstrate IDa-Det's superiority over state-of-the-art 1-bit detectors and KD methods on both the PASCAL VOC and COCO datasets. IDa-Det achieves 76.9% mAP for a 1-bit Faster-RCNN with a ResNet-18 backbone. Our code is open-sourced at https://github.com/SteveTsui/IDa-Det.
Keywords: 1-bit detector, Knowledge distillation, Information discrepancy
1 Introduction
Recently, the object detection task [6,19] has been greatly advanced by deep convolutional neural networks (DNNs) [11]. However, DNN models comprise a large number of parameters and floating-point operations (FLOPs), restricting their deployment on embedded platforms. Techniques such as compact network design [14,23], network pruning [12,15,37], low-rank decomposition [5], and quantization [25,32,35] have been developed to address these restrictions and achieve efficient inference for the detection task.
† Equal contribution. ∗ Corresponding author.
Fig. 1. Input images and saliency maps following [9]. Images are randomly selected from VOC test2007. Each row shows: (a) input images, and saliency maps of (b) Faster-RCNN with a ResNet-101 backbone (Res101), (c) Faster-RCNN with a ResNet-18 backbone (Res18), and (d) 1-bit Faster-RCNN with a ResNet-18 backbone (BiRes18). Annotations mark object regions, false positives, and missed detections.
Among these, binarized detectors have contributed to object detection by accelerating CNN feature extraction for real-time bounding box localization and foreground classification [33,30,34]. For example, the 1-bit SSD300 [20] with a VGG-16 backbone [27] theoretically achieves an acceleration rate of up to 15× with XNOR and bit-count operations by using binarized weights and activations, as described in [30]. With extremely high energy efficiency on embedded devices, such detectors can be deployed directly on next-generation AI chips. Despite these appealing features, the performance of 1-bit detectors often deteriorates to the point that they are not widely used in real-world embedded systems.
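To make the source of this speed-up concrete, the following sketch illustrates how a single binary convolution can be emulated with XNOR-style agreement counting under the usual ±1 binarization convention. It is an illustrative sketch only (all function names are hypothetical), not the implementation of [30]; on real hardware the ±1 values would be packed into machine words so that one popcount processes 64 weights at a time.

import numpy as np

def binarize(x):
    # Map real values to {-1, +1} with the sign function (0 -> +1 by convention).
    return np.where(x >= 0, 1, -1).astype(np.int8)

def xnor_popcount_dot(a_bits, w_bits):
    # For vectors in {-1, +1}, the dot product equals
    # 2 * (#positions where the signs agree) - length,
    # i.e. an XNOR followed by a bit count on packed representations.
    agree = np.sum(a_bits == w_bits)
    return 2 * agree - a_bits.size

def binary_conv2d(x, w, alpha):
    # x: (C, H, W) activations, w: (K, C, kh, kw) weights, alpha: (K,) scaling factors.
    C, H, W = x.shape
    K, _, kh, kw = w.shape
    xb, wb = binarize(x), binarize(w)
    out = np.zeros((K, H - kh + 1, W - kw + 1), dtype=np.float32)
    for k in range(K):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                patch = xb[:, i:i + kh, j:j + kw].ravel()
                out[k, i, j] = alpha[k] * xnor_popcount_dot(patch, wb[k].ravel())
    return out

The channel-wise scaling factor alpha follows the common practice of approximating a real-valued kernel by a scaled binary kernel; the XNOR/popcount replacement of multiply-accumulate operations is what yields the theoretical acceleration quoted above.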
The recent art [34] employs fine-grained feature imitation (FGFI) [29] to enhance the performance of 1-bit detectors. However, it neglects the intrinsic information discrepancy between 1-bit detectors and real-valued detectors. In Fig. 1, we compare the saliency maps of a real-valued Faster-RCNN with a ResNet-101 backbone (often used as the teacher network), a real-valued Faster-RCNN with a ResNet-18 backbone, and a 1-bit Faster-RCNN with a ResNet-18 backbone (often used as the student network), from top to bottom. They show that knowledge distillation (KD) methods like [29] are effective for distilling real-valued Faster-RCNNs only when the teacher model and the student counterpart share a small information discrepancy on proposals, as shown in Fig. 1 (b) and (c). This does not hold for the 1-bit Faster-RCNN, as shown in Fig. 1 (b) and (d), which might explain why existing KD methods are less effective for 1-bit detectors. Statistics on the COCO and PASCAL VOC datasets in Fig. 2 show that the distance between the proposal saliency maps of Res101 and Res18 (blue) is much smaller than that between Res101 and BiRes18 (orange); the smaller the distance, the smaller the discrepancy. In brief, conventional KD methods are effective for distilling real-valued detectors but appear much less effective for distilling 1-bit detectors.
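The gap plotted in Fig. 2 can be reproduced in spirit with a short script. The sketch below is a simplified illustration (the helper names and the diagonal-covariance estimate are assumptions, not the exact procedure used for the figure): each proposal's gradient map from the neck feature is reduced to a channel-wise descriptor, and the Mahalanobis distance of the teacher-student differences is averaged, so larger values indicate a larger information discrepancy.

import numpy as np

def channel_descriptor(grad_feat):
    # grad_feat: (C, H, W) gradient of the detection loss w.r.t. a neck feature map,
    # reduced to a C-dimensional descriptor by spatial averaging.
    return grad_feat.reshape(grad_feat.shape[0], -1).mean(axis=1)

def mahalanobis_discrepancy(teacher_grads, student_grads, eps=1e-6):
    # teacher_grads, student_grads: lists of (C, H, W) arrays, one per proposal.
    t = np.stack([channel_descriptor(g) for g in teacher_grads])   # (N, C)
    s = np.stack([channel_descriptor(g) for g in student_grads])   # (N, C)
    diff = s - t
    # A diagonal covariance keeps the estimate stable for small N
    # (a simplification; a full covariance could be used instead).
    var = diff.var(axis=0) + eps
    d = np.sqrt(np.sum(diff ** 2 / var, axis=1))                    # per-proposal distances
    return d.mean()

Under this measure, (teacher, 1-bit student) pairs yield consistently larger values than (teacher, real-valued student) pairs, mirroring the blue-versus-orange gap in Fig. 2.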
In this paper, we are motivated by the above observation and present an information discrepancy-aware distillation method for 1-bit detectors (IDa-Det), which effectively addresses the information discrepancy problem and leads to an efficient distillation process.
Fig. 2. The Mahalanobis distance of the gradient in the intermediate neck feature between Res101-Res18 (blue) and Res101-BiRes18 (orange) on (a) VOC trainval0712, (b) VOC test2007, (c) COCO trainval35k, and (d) COCO minival.
As shown in Fig. 3, we introduce a discrepancy-aware method to select proposal pairs and facilitate distilling 1-bit detectors, rather than relying only on the object anchor locations of student models or the ground truth as in existing methods [29,34,9]. We further introduce a novel entropy distillation loss to leverage more comprehensive information than conventional loss functions. In this way, we obtain a powerful information discrepancy-aware distillation method for 1-bit detectors (IDa-Det). Our contributions are summarized as follows:
- Unlike existing KD methods, we distill 1-bit detectors by explicitly incorporating the information discrepancy into the optimization, which is simple yet effective for learning 1-bit detectors.
- We propose an entropy distillation loss to further improve the representation ability of the 1-bit detector and effectively eliminate the information discrepancy.
- We compare our IDa-Det against state-of-the-art 1-bit detectors and KD methods on the VOC and large-scale COCO datasets. Extensive results reveal that our method outperforms the others by a considerable margin. For instance, on VOC test2007, the 1-bit Faster-RCNN with a ResNet-18 backbone trained with IDa-Det obtains 76.9% mAP, setting a new state of the art.
2 Related Work
1-bit Detectors. By removing the foreground redundancy, BiDet [30] fully exploits the representational capability of binarized convolutions. It introduces the information bottleneck principle, which limits the amount of information in high-level feature maps while maximizing the mutual information between the feature maps and object detection. ASDA-FRCNN [33] significantly enhances the performance of the Faster R-CNN detector by suppressing the shared amplitude between the real-valued and binary kernels. LWS-Det [34] proposes a novel layer-wise search approach that minimizes the angular and amplitude errors of 1-bit detectors; it also employs FGFI [29] to further distill the backbone feature maps.
Knowledge Distillation. Knowledge distillation (KD), a significant subset of model compression methods, aims to transfer knowledge from a well-trained teacher network to a more compact student model.
Fig. 3. Overview of the proposed information discrepancy-aware distillation (IDa-Det) framework. Proposal features of the real-valued teacher and the 1-bit student are modeled as channel-wise Gaussian distributions. We first select representative proposal pairs based on the information discrepancy, and then apply the proposed entropy distillation loss to eliminate the discrepancy.
The student is supervised with soft labels created by the teacher, as first proposed by [1]. Knowledge distillation is redefined by [13] as training a shallower network to approximate the teacher's output after the softmax layer. Numerous recent works compress object detectors with knowledge distillation. Chen et al. [2] distill the student through all backbone features, the regression head, and the classification head; however, imitating whole feature maps and distilling the classification head both fail to focus attention on the important foreground, potentially resulting in sub-optimal results. Mimicking [16] distills features from sampled region proposals. However, simply replicating these regions may mislead the student, because the proposals occasionally perform poorly. To distill the student, FGFI [29] introduces an attention mask that generates fine-grained features from foreground object regions. DeFeat [9] balances the background and foreground object regions to distill the student efficiently.
In summary, existing KD frameworks for object detection can only be employed for real-valued students that share similar information with their teachers. Thus, they are often ineffective in distilling 1-bit detectors. Unlike prior arts, we identify that the information discrepancy between the real-valued teacher and the 1-bit student is significant for distillation. We first introduce the Mahalanobis distance to identify the information discrepancy and then distill the features accordingly. Meanwhile, we propose a novel entropy distillation loss to further promote the discrimination ability of 1-bit detectors.
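To make this idea concrete, the snippet below sketches one possible instantiation of an entropy-aware distillation term, assuming each selected proposal feature is summarized by channel-wise Gaussian statistics as depicted in Fig. 3. It is a minimal sketch: the precise loss is derived in Sec. 3, and the helper names here are illustrative only.

import math
import torch

def channel_gaussian(feat):
    # feat: (C, H, W) proposal feature; summarize each channel by a 1-D Gaussian.
    flat = feat.flatten(1)                                   # (C, H*W)
    return flat.mean(dim=1), flat.var(dim=1, unbiased=False) + 1e-6

def entropy_distill_loss(student_feat, teacher_feat):
    # Match the channel statistics and the differential entropy of each channel,
    # H = 0.5 * log(2 * pi * e * sigma^2) for a Gaussian.
    mu_s, var_s = channel_gaussian(student_feat)
    mu_t, var_t = channel_gaussian(teacher_feat)
    ent_s = 0.5 * torch.log(2 * math.pi * math.e * var_s)
    ent_t = 0.5 * torch.log(2 * math.pi * math.e * var_t)
    moment_term = ((mu_s - mu_t) ** 2 + (var_s - var_t) ** 2).mean()
    entropy_term = (ent_s - ent_t).abs().mean()
    return moment_term + entropy_term

# Usage on RoI-pooled proposal features of matching shape, e.g. (256, 7, 7):
# loss = entropy_distill_loss(student_roi_feat, teacher_roi_feat)

Compared with a plain L2 imitation loss, a term of this kind also penalizes mismatched channel entropies, which is what allows the student to capture the more comprehensive information mentioned above.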
3 The Proposed Method
In this section, we describe our IDa-Det in detail. We first give an overview of 1-bit CNNs. We then describe how we employ the information discrepancy to select representative proposal pairs, followed by the entropy distillation loss used to eliminate the discrepancy.
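As a roadmap for this section, the sketch below shows how the two levels of the optimization interact in a single training step: the inner level picks the proposal pairs with the largest discrepancy, and the outer level distills only those pairs alongside the ordinary detection loss. It is a paraphrase of the abstract and Fig. 3 under simplifying assumptions (a plain L2 gap stands in for both the discrepancy measure and the distillation term, and all names are hypothetical), not the released implementation.

import torch

def pairwise_discrepancy(t_feats, s_feats):
    # t_feats, s_feats: (N, C, H, W) RoI-pooled proposal features from teacher and student.
    # A simple per-pair score: squared gap between channel statistics, standing in for
    # the information-discrepancy measure derived later in this section.
    t = t_feats.flatten(2).mean(dim=2)          # (N, C)
    s = s_feats.flatten(2).mean(dim=2)          # (N, C)
    return ((t - s) ** 2).sum(dim=1)            # (N,)

def ida_det_step(t_feats, s_feats, det_loss, k=4, lam=1.0):
    # Inner level: keep the k proposal pairs with the largest discrepancy.
    scores = pairwise_discrepancy(t_feats, s_feats)
    idx = torch.topk(scores, k=min(k, scores.numel())).indices
    # Outer level: distill only the selected pairs and add the usual detection loss.
    distill = ((t_feats[idx] - s_feats[idx]) ** 2).mean()
    return det_loss + lam * distill

# Example usage with random tensors standing in for RoI-pooled neck features:
t = torch.randn(16, 256, 7, 7)                           # teacher proposals
s = torch.randn(16, 256, 7, 7, requires_grad=True)       # 1-bit student proposals
loss = ida_det_step(t, s, det_loss=torch.tensor(0.0))
loss.backward()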