Decoupled Mixup for Generalized Visual
Recognition
Haozhe Liu†1, Wentian Zhang†2, Jinheng Xie†2, Haoqian Wu3, Bing Li(B)1,
Ziqi Zhang4, Yuexiang Li(B)2, Yawen Huang2, Bernard Ghanem1, Yefeng Zheng2
1King Abdullah University of Science and Technology, Saudi Arabia
2Jarvis Lab, Tencent, Shenzhen, China
3YouTu Lab, Tencent, Shenzhen, China
4Tsinghua University, Shenzhen, China
{haozhe.liu;bing.li;bernard.ghanem}@kaust.edu.sa;
zhangwentianml@gmail.com; xiejinheng2020@email.szu.edu.cn
zq-zhang18@mails.tsinghua.edu.cn;
{linuswu;vicyxli;yawenhuang;yefengzheng}@tencent.com
Abstract. Convolutional neural networks (CNN) have demonstrated
remarkable performance when the training and testing data are drawn
from the same distribution. However, such trained CNN models often
largely degrade on testing data that are unseen and out-of-distribution
(OOD). To address this issue, we propose a novel "Decoupled-Mixup"
method to train CNN models for OOD visual recognition. Different from
previous work that combines pairs of images homogeneously, our method
decouples each image into discriminative and noise-prone regions, and
then heterogeneously combines these regions of image pairs to train CNN
models. Based on the observation that noise-prone regions, such as texture
and cluttered background, are adverse to the generalization ability of CNN
models during training, we enhance features from discriminative regions
and suppress noise-prone ones when combining an image pair. To further
improve the generalization ability of trained models, we propose to
disentangle discriminative and noise-prone regions in frequency-based
and context-based fashions. Experimental results show the high
generalization performance of our method on testing data composed of
unseen contexts: our method achieves 85.76% top-1 accuracy in Track-1
and 79.92% in Track-2 of the NICO Challenge. The source code is available
at https://github.com/HaozheLiu-ST/NICOChallenge-OOD-Classification.
† Equal contribution.
This paper is accepted by ECCV'22 Workshop (Causality in Vision).

1 Introduction

Convolutional neural networks (CNN) have been successfully applied to various
tasks such as visual recognition and image generation. However, learned CNN
models are vulnerable to samples that are unseen and out-of-distribution
(OOD) [12,28,11]. To address this issue, research efforts have been devoted to
data augmentation and regularization, which have shown promising results.
Zhang et al. [23] propose a data augmentation method named Mixup, which
mixes image pairs and their corresponding labels to form smooth annotations
for training models. Mixup can be regarded as a locally linear out-of-manifold
regularization [3], and relates to the decision boundary of adversarially robust
training [24]. Hence, this simple technique has been shown to substantially
improve both model robustness and generalization. Following this direction,
many variants have been proposed to explore the form of interpolation.
Manifold Mixup [16] generalizes Mixup to the feature space. Guo et al. [3]
propose an adaptive Mixup that reduces misleading random generation. Yun et
al. [21] then propose CutMix, which introduces region-based interpolation
between images to replace global mixing. Adopting region-based mixing like
CutMix, Kim et al. [9] propose Puzzle Mix, which generates virtual samples by
utilizing the saliency information of each input. Liu et al. [11] propose to regard
mixing-based data augmentation as a dynamic feature aggregation method,
which yields a compact feature space with strong robustness against adversarial
attacks. More recently, Hong et al. [6] propose StyleMix to separate content and
style for enhanced data augmentation. As a contemporaneous work similar to
StyleMix, Zhou et al. [32] propose MixStyle, which mixes the style information
in the bottom layers of a deep model while keeping labels unmixed. By implicitly
shuffling style information, MixStyle improves model generalization and
achieves satisfactory OOD visual recognition performance. Despite this
progress, MixStyle and StyleMix rely on AdaIN [8] to disentangle style
information, which requires intermediate feature maps as input. However,
according to an empirical study [2], style information is sensitive to the depth
of the layer and the network architecture, which limits the practical
applicability of these methods.
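To make the interpolated training pairs concrete, below is a minimal
PyTorch-style sketch of the original Mixup [23]. The function name and batch
layout are our own illustration, but the convex combination of inputs and labels
with a coefficient drawn from a Beta(α, α) distribution follows the formulation
of Zhang et al.

import torch

def mixup_batch(x, y, alpha=0.2):
    # x: images of shape (B, C, H, W); y: one-hot labels of shape (B, K).
    # Sample one mixing coefficient lambda ~ Beta(alpha, alpha) per batch.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))  # random pairing within the batch
    x_mixed = lam * x + (1.0 - lam) * x[perm]  # homogeneous, global mixing
    y_mixed = lam * y + (1.0 - lam) * y[perm]  # labels mixed with the same lambda
    return x_mixed, y_mixed

Note that every pixel is weighted by the same coefficient, regardless of whether
it lies in a discriminative region or in background; this homogeneity is exactly
what the method introduced below relaxes.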
In this paper, inspired by Mixup and StyleMix, we propose a novel method
named Decoupled-Mixup to combine image pairs for training CNN models. Our
insight is that not all image regions benefit OOD visual recognition: noise-prone
regions, such as texture and cluttered background, are often adverse to the
generalization of CNN models during training. Yet, previous work such as
Mixup treats all image regions equally when combining a pair of images. In
contrast, we propose to decouple each image into discriminative and noise-prone
regions, and to suppress the noise-prone regions during image combination,
such that the CNN model pays more attention to discriminative regions during
training. In particular, we propose a universal form based on Mixup, in which
StyleMix can be regarded as a special case of Decoupled-Mixup in the feature
space. Furthermore, by extending Decoupled-Mixup to the context and
frequency domains respectively, we propose Context-aware Decoupled-Mixup
(CD-Mixup) and Frequency-aware Decoupled-Mixup (FD-Mixup), which
capture discriminative and noise-prone regions using saliency and texture
information, respectively, and suppress the noise-prone regions when combining
image pairs. By such heterogeneous combination, our method
trains the CNN model to emphasize more informative regions, which improves
the generalization ability of the trained model.
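As a purely hypothetical illustration of this decouple-and-suppress idea in the
image domain (not the paper's exact formulation), the sketch below assumes
each image comes with a saliency-style mask s in [0, 1] marking its
discriminative regions, e.g., produced by a pre-trained saliency or CAM model,
and attenuates the noise-prone complement before a Mixup-style combination;
the suppression factor is an illustrative parameter.

import torch

def decoupled_mix(x1, x2, s1, s2, lam=0.6, suppress=0.5):
    # x1, x2: images of shape (C, H, W); s1, s2: masks in [0, 1] of shape
    # (1, H, W) marking discriminative regions (assumed to be given).
    # Noise-prone regions (1 - s) are attenuated by `suppress` < 1, so the
    # combination emphasizes discriminative content from both images.
    d1 = (s1 + suppress * (1.0 - s1)) * x1  # keep salient, damp noise-prone
    d2 = (s2 + suppress * (1.0 - s2)) * x2
    return lam * d1 + (1.0 - lam) * d2  # heterogeneous combination

The corresponding labels can still be mixed with the same coefficient lam, as in
vanilla Mixup.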
In summary, the contribution of this paper is three-fold:
- We propose a novel method to train CNN models for OOD visual recognition.
Our method suppresses noise-prone regions when combining image pairs for
training, such that the trained CNN model emphasizes discriminative image
regions, which improves its generalization ability.
- Our CD-Mixup and FD-Mixup modules effectively decouple each image into
discriminative and noise-prone regions by separately exploiting the context and
texture domains, without requiring extra object/instance-level annotations.
- Experimental results show that our method achieves superior performance
and better generalization ability on testing data composed of unseen contexts,
compared with state-of-the-art Mixup-based methods.
2 Related Works
OOD Generalization. OOD generalization considers the ability of deep models
trained with limited data to generalize to unseen distributions in real-world
scenarios. Recently, OOD generalization has been introduced into many visual
applications [13,25,10,27,26]. In general, the unseen domains of OOD samples
greatly confuse deep models in visual recognition. To address this issue, domain
generalization methods are proposed to train models only on accessible source
domains while making them generalize well to unseen domains. Several
works [15,26,29] propose to obtain domain-invariant features across source
domains and inhibit the negative effect of domain-specific ones, leading to
better generalization ability on unseen domains. Another simple but effective
domain generalization approach is to enlarge the data space with data
augmentation and regularization of the accessible source domains [23,30,31].
Following this direction, we further decouple and suppress the noise-prone
regions (e.g., background and texture information) of source domains to
improve the OOD generalization of deep models.
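Since texture is largely carried by high-frequency components, one simple way
to separate texture-like, noise-prone information from low-frequency content,
in the spirit of the frequency-aware variant introduced above, is a radial
low-/high-pass split in the Fourier domain. The sketch below is a hypothetical
illustration under that assumption; the cutoff value is ours, not the paper's.

import torch

def frequency_split(x, cutoff=0.1):
    # Split an image x of shape (C, H, W) into low-frequency content and
    # high-frequency texture via a hard radial mask in the Fourier domain.
    # `cutoff` is the mask radius as a fraction of the spectrum extent.
    f = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    _, h, w = x.shape
    yy, xx = torch.meshgrid(torch.arange(h) - h // 2,
                            torch.arange(w) - w // 2, indexing="ij")
    radius = ((yy.float() / h) ** 2 + (xx.float() / w) ** 2).sqrt()
    mask = (radius <= cutoff).float()
    low = torch.fft.ifft2(torch.fft.ifftshift(f * mask, dim=(-2, -1))).real
    high = x - low  # the residual keeps the high-frequency texture
    return low, high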
Self-/Weakly-Supervised Segmentation. A series of methods [18,17,19]
demonstrate a good ability to segment objects of interest out of complex
backgrounds in self- or weakly-supervised manners. However, in the absence of
pixel-level annotations, spurious correlations can result in the incorrect
segmentation of class-related backgrounds. To handle this problem, CLIMS [17]
proposes a language-image matching-based suppression, and C2AM [19]
proposes contrastive learning of foreground-background discrimination. The
aforementioned methods can serve as the disentanglement function in the
proposed Context-aware Decoupled-Mixup.