IDa-Det: An Information Discrepancy-aware Distillation for 1-bit Detectors
Sheng Xu1†, Yanjing Li1†, Bohan Zeng1†, Teli Ma2, Baochang Zhang1,3∗, Xianbin Cao1, Peng Gao2, Jinhu Lü1,3
1Beihang University, Beijing, China
2Shanghai Artificial Intelligence Laboratory, Shanghai, China
3Zhongguancun Laboratory, Beijing, China
{shengxu, yanjingli, bohanzeng, bczhang}@buaa.edu.cn
Abstract. Knowledge distillation (KD) has been proven to be useful for training compact object detection models. However, we observe that KD is often effective only when the teacher model and the student counterpart share similar proposal information. This explains why existing KD methods are less effective for 1-bit detectors, where a significant information discrepancy exists between the real-valued teacher and the 1-bit student. This paper presents an Information Discrepancy-aware strategy (IDa-Det) to distill 1-bit detectors that can effectively eliminate information discrepancies and significantly reduce the performance gap between a 1-bit detector and its real-valued counterpart. We formulate the distillation process as a bi-level optimization problem. At the inner level, we select the representative proposals with maximum information discrepancy. We then introduce a novel entropy distillation loss to reduce the disparity based on the selected proposals. Extensive experiments demonstrate IDa-Det's superiority over state-of-the-art 1-bit detectors and KD methods on both the PASCAL VOC and COCO datasets. IDa-Det achieves 76.9% mAP for a 1-bit Faster-RCNN with a ResNet-18 backbone. Our code is open-sourced at https://github.com/SteveTsui/IDa-Det.
Keywords: 1-bit detector, Knowledge distillation, Information discrepancy
1 Introduction
Recently, the object detection task [6,19] has been greatly advanced by deep convolutional neural networks (DNNs) [11]. However, DNN models comprise a large number of parameters and floating-point operations (FLOPs), restricting their deployment on embedded platforms. Techniques such as compact network design [14,23], network pruning [12,15,37], low-rank decomposition [5], and quantization [25,32,35] have been developed to address these restrictions and achieve efficient inference for the detection task.
† Equal contribution. ∗ Corresponding author.
Fig. 1. Input images and saliency maps following [9]. Images are randomly selected from VOC test2007. Each row shows: (a) input images, and saliency maps of (b) Faster-RCNN with a ResNet-101 backbone (Res101), (c) Faster-RCNN with a ResNet-18 backbone (Res18), and (d) 1-bit Faster-RCNN with a ResNet-18 backbone (BiRes18). Annotations mark object regions, false positives, and missed detections.
Among these, binarized detectors have contributed to object detection by accelerating CNN feature extraction for real-time bounding box localization and foreground classification [33,30,34]. For example, the 1-bit SSD300 [20] with a VGG-16 backbone [27] theoretically achieves an acceleration rate of up to 15× with XNOR and bit-count operations by using binarized weights and activations, as described in [30]. With extremely high energy efficiency on embedded devices, such detectors can be deployed directly on next-generation AI chips. Despite these appealing features, the performance of 1-bit detectors often deteriorates to the point that they are not widely used in real-world embedded systems.
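To make the source of this speed-up concrete, the following sketch illustrates how a single binary convolution can be emulated with XNOR-style agreement counting under the usual ±1 binarization convention. It is an illustrative sketch only (all function names are hypothetical), not the implementation of [30]; on real hardware the ±1 values would be packed into machine words so that one popcount processes 64 weights at a time.

import numpy as np

def binarize(x):
    # Map real values to {-1, +1} with the sign function (0 -> +1 by convention).
    return np.where(x >= 0, 1, -1).astype(np.int8)

def xnor_popcount_dot(a_bits, w_bits):
    # For vectors in {-1, +1}, the dot product equals
    # 2 * (#positions where the signs agree) - length,
    # i.e. an XNOR followed by a bit count on packed representations.
    agree = np.sum(a_bits == w_bits)
    return 2 * agree - a_bits.size

def binary_conv2d(x, w, alpha):
    # x: (C, H, W) activations, w: (K, C, kh, kw) weights, alpha: (K,) scaling factors.
    C, H, W = x.shape
    K, _, kh, kw = w.shape
    xb, wb = binarize(x), binarize(w)
    out = np.zeros((K, H - kh + 1, W - kw + 1), dtype=np.float32)
    for k in range(K):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                patch = xb[:, i:i + kh, j:j + kw].ravel()
                out[k, i, j] = alpha[k] * xnor_popcount_dot(patch, wb[k].ravel())
    return out

The channel-wise scaling factor alpha follows the common practice of approximating a real-valued kernel by a scaled binary kernel; the XNOR/popcount replacement of multiply-accumulate operations is what yields the theoretical acceleration quoted above.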
The recent art [34] employs fine-grained feature imitation (FGFI) [29] to enhance the performance of 1-bit detectors. However, it neglects the intrinsic information discrepancy between 1-bit detectors and real-valued detectors. In Fig. 1, we compare the saliency maps of a real-valued Faster-RCNN with a ResNet-101 backbone (often used as the teacher network), a real-valued Faster-RCNN with a ResNet-18 backbone, and a 1-bit Faster-RCNN with a ResNet-18 backbone (often used as the student network), from top to bottom. They show that knowledge distillation (KD) methods like [29] are effective for distilling real-valued Faster-RCNNs only when the teacher model and the student counterpart share a small information discrepancy on proposals, as shown in Fig. 1 (b) and (c). This does not hold for the 1-bit Faster-RCNN, as shown in Fig. 1 (b) and (d), which might explain why existing KD methods are less effective for 1-bit detectors. Statistics on the COCO and PASCAL VOC datasets in Fig. 2 show that the distance between the proposal saliency maps of Res101 and Res18 (blue) is much smaller than that between Res101 and BiRes18 (orange); the smaller the distance, the smaller the discrepancy. In brief, conventional KD methods are effective for distilling real-valued detectors but appear much less effective for distilling 1-bit detectors.
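The gap plotted in Fig. 2 can be reproduced in spirit with a short script. The sketch below is a simplified illustration (the helper names and the diagonal-covariance estimate are assumptions, not the exact procedure used for the figure): each proposal's gradient map from the neck feature is reduced to a channel-wise descriptor, and the Mahalanobis distance of the teacher-student differences is averaged, so larger values indicate a larger information discrepancy.

import numpy as np

def channel_descriptor(grad_feat):
    # grad_feat: (C, H, W) gradient of the detection loss w.r.t. a neck feature map,
    # reduced to a C-dimensional descriptor by spatial averaging.
    return grad_feat.reshape(grad_feat.shape[0], -1).mean(axis=1)

def mahalanobis_discrepancy(teacher_grads, student_grads, eps=1e-6):
    # teacher_grads, student_grads: lists of (C, H, W) arrays, one per proposal.
    t = np.stack([channel_descriptor(g) for g in teacher_grads])   # (N, C)
    s = np.stack([channel_descriptor(g) for g in student_grads])   # (N, C)
    diff = s - t
    # A diagonal covariance keeps the estimate stable for small N
    # (a simplification; a full covariance could be used instead).
    var = diff.var(axis=0) + eps
    d = np.sqrt(np.sum(diff ** 2 / var, axis=1))                    # per-proposal distances
    return d.mean()

Under this measure, (teacher, 1-bit student) pairs yield consistently larger values than (teacher, real-valued student) pairs, mirroring the blue-versus-orange gap in Fig. 2.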
In this paper, we are motivated by the above observation and present an information discrepancy-aware distillation method for 1-bit detectors (IDa-Det), which effectively addresses the information discrepancy problem and leads to an efficient distillation process.
Fig. 2. The Mahalanobis distance of the gradient in the intermediate neck feature between Res101-Res18 (blue) and Res101-BiRes18 (orange) on (a) VOC trainval0712, (b) VOC test2007, (c) COCO trainval35k, and (d) COCO minival.
As shown in Fig. 3, we introduce a discrepancy-aware method to select proposal pairs and facilitate distilling 1-bit detectors, rather than relying only on the object anchor locations of student models or the ground truth as in existing methods [29,34,9]. We further introduce a novel entropy distillation loss to leverage more comprehensive information than conventional loss functions. In this way, we obtain a powerful information discrepancy-aware distillation method for 1-bit detectors (IDa-Det). Our contributions are summarized as follows:
- Unlike existing KD methods, we distill 1-bit detectors by explicitly incorporating the information discrepancy into the optimization, which is simple yet effective for learning 1-bit detectors.
- We propose an entropy distillation loss to further improve the representation ability of the 1-bit detector and effectively eliminate the information discrepancy.
- We compare our IDa-Det against state-of-the-art 1-bit detectors and KD methods on the VOC and large-scale COCO datasets. Extensive results reveal that our method outperforms the others by a considerable margin. For instance, on VOC test2007, the 1-bit Faster-RCNN with a ResNet-18 backbone trained with IDa-Det obtains 76.9% mAP, setting a new state of the art.
2 Related Work
1-bit Detectors. By removing the foreground redundancy, BiDet [30] fully exploits the representational capability of binarized convolutions. It introduces the information bottleneck principle, which limits the amount of information in high-level feature maps while maximizing the mutual information between the feature maps and object detection. ASDA-FRCNN [33] significantly enhances the performance of the Faster R-CNN detector by suppressing the shared amplitude between the real-valued and binary kernels. LWS-Det [34] proposes a novel layer-wise search approach that minimizes the angular and amplitude errors of 1-bit detectors; it also employs FGFI [29] to further distill the backbone feature maps.
Knowledge Distillation. Knowledge distillation (KD), a significant subset of model compression methods, aims to transfer knowledge from a well-trained teacher network to a more compact student model.
Fig. 3. Overview of the proposed information discrepancy-aware distillation (IDa-Det) framework. Proposal features of the real-valued teacher and the 1-bit student are modeled as channel-wise Gaussian distributions. We first select representative proposal pairs based on the information discrepancy, and then apply the proposed entropy distillation loss to eliminate the discrepancy.
The student is supervised with soft labels created by the teacher, as first proposed by [1]. Knowledge distillation is redefined by [13] as training a shallower network to approximate the teacher's output after the softmax layer. Numerous recent works compress object detectors with knowledge distillation. Chen et al. [2] distill the student through all backbone features, the regression head, and the classification head; however, imitating whole feature maps and distilling the classification head both fail to focus attention on the important foreground, potentially resulting in sub-optimal results. Mimicking [16] distills features from sampled region proposals. However, simply replicating these regions may mislead the student, because the proposals occasionally perform poorly. To distill the student, FGFI [29] introduces an attention mask that generates fine-grained features from foreground object regions. DeFeat [9] balances the background and foreground object regions to distill the student efficiently.
In summary, existing KD frameworks for object detection can only be employed for real-valued students that share similar information with their teachers. Thus, they are often ineffective in distilling 1-bit detectors. Unlike prior arts, we identify that the information discrepancy between the real-valued teacher and the 1-bit student is significant for distillation. We first introduce the Mahalanobis distance to identify the information discrepancy and then distill the features accordingly. Meanwhile, we propose a novel entropy distillation loss to further promote the discrimination ability of 1-bit detectors.
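To make this idea concrete, the snippet below sketches one possible instantiation of an entropy-aware distillation term, assuming each selected proposal feature is summarized by channel-wise Gaussian statistics as depicted in Fig. 3. It is a minimal sketch: the precise loss is derived in Sec. 3, and the helper names here are illustrative only.

import math
import torch

def channel_gaussian(feat):
    # feat: (C, H, W) proposal feature; summarize each channel by a 1-D Gaussian.
    flat = feat.flatten(1)                                   # (C, H*W)
    return flat.mean(dim=1), flat.var(dim=1, unbiased=False) + 1e-6

def entropy_distill_loss(student_feat, teacher_feat):
    # Match the channel statistics and the differential entropy of each channel,
    # H = 0.5 * log(2 * pi * e * sigma^2) for a Gaussian.
    mu_s, var_s = channel_gaussian(student_feat)
    mu_t, var_t = channel_gaussian(teacher_feat)
    ent_s = 0.5 * torch.log(2 * math.pi * math.e * var_s)
    ent_t = 0.5 * torch.log(2 * math.pi * math.e * var_t)
    moment_term = ((mu_s - mu_t) ** 2 + (var_s - var_t) ** 2).mean()
    entropy_term = (ent_s - ent_t).abs().mean()
    return moment_term + entropy_term

# Usage on RoI-pooled proposal features of matching shape, e.g. (256, 7, 7):
# loss = entropy_distill_loss(student_roi_feat, teacher_roi_feat)

Compared with a plain L2 imitation loss, a term of this kind also penalizes mismatched channel entropies, which is what allows the student to capture the more comprehensive information mentioned above.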
3 The Proposed Method
In this section, we describe our IDa-Det in detail. We first give an overview of 1-bit CNNs. We then describe how we employ the information discrepancy to select representative proposal pairs, followed by the entropy distillation loss used to eliminate the discrepancy.
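As a roadmap for this section, the sketch below shows how the two levels of the optimization interact in a single training step: the inner level picks the proposal pairs with the largest discrepancy, and the outer level distills only those pairs alongside the ordinary detection loss. It is a paraphrase of the abstract and Fig. 3 under simplifying assumptions (a plain L2 gap stands in for both the discrepancy measure and the distillation term, and all names are hypothetical), not the released implementation.

import torch

def pairwise_discrepancy(t_feats, s_feats):
    # t_feats, s_feats: (N, C, H, W) RoI-pooled proposal features from teacher and student.
    # A simple per-pair score: squared gap between channel statistics, standing in for
    # the information-discrepancy measure derived later in this section.
    t = t_feats.flatten(2).mean(dim=2)          # (N, C)
    s = s_feats.flatten(2).mean(dim=2)          # (N, C)
    return ((t - s) ** 2).sum(dim=1)            # (N,)

def ida_det_step(t_feats, s_feats, det_loss, k=4, lam=1.0):
    # Inner level: keep the k proposal pairs with the largest discrepancy.
    scores = pairwise_discrepancy(t_feats, s_feats)
    idx = torch.topk(scores, k=min(k, scores.numel())).indices
    # Outer level: distill only the selected pairs and add the usual detection loss.
    distill = ((t_feats[idx] - s_feats[idx]) ** 2).mean()
    return det_loss + lam * distill

# Example usage with random tensors standing in for RoI-pooled neck features:
t = torch.randn(16, 256, 7, 7)                           # teacher proposals
s = torch.randn(16, 256, 7, 7, requires_grad=True)       # 1-bit student proposals
loss = ida_det_step(t, s, det_loss=torch.tensor(0.0))
loss.backward()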