Decoupled Mixup for Generalized Visual
Recognition
Haozhe Liu†1, Wentian Zhang†2, Jinheng Xie†2, Haoqian Wu3, Bing Li(B)1,
Ziqi Zhang4, Yuexiang Li(B)2, Yawen Huang2, Bernard Ghanem1, Yefeng Zheng2
1King Abdullah University of Science and Technology, Saudi Arabia
2Jarvis Lab, Tencent, Shenzhen, China
3YouTu Lab, Tencent, Shenzhen, China
4Tsinghua University, Shenzhen, China
{haozhe.liu;bing.li;bernard.ghanem}@kaust.edu.sa;
zhangwentianml@gmail.com; xiejinheng2020@email.szu.edu.cn
zq-zhang18@mails.tsinghua.edu.cn;
{linuswu;vicyxli;yawenhuang;yefengzheng}@tencent.com
Abstract. Convolutional neural networks (CNN) have demonstrated
remarkable performance when the training and testing data are drawn
from the same distribution. However, such trained CNN models often
largely degrade on testing data that are unseen and out-of-distribution
(OOD). To address this issue, we propose a novel "Decoupled-Mixup"
method to train CNN models for OOD visual recognition. Different from
previous work that combines pairs of images homogeneously, our method
decouples each image into discriminative and noise-prone regions, and
then heterogeneously combines these regions of image pairs to train CNN
models. Based on the observation that noise-prone regions, such as texture
and cluttered background, are adverse to the generalization ability of CNN
models during training, we enhance features from discriminative regions
and suppress noise-prone ones when combining an image pair. To further
improve the generalization ability of trained models, we propose to
disentangle discriminative and noise-prone regions in frequency-based
and context-based fashions. Experimental results show the high
generalization performance of our method on testing data composed of
unseen contexts: our method achieves 85.76% top-1 accuracy in Track-1
and 79.92% in Track-2 of the NICO Challenge. The source code is available
at https://github.com/HaozheLiu-ST/NICOChallenge-OOD-Classification.
† Equal contribution.
This paper is accepted by ECCV'22 Workshop (Causality in Vision).

1 Introduction

Convolutional neural networks (CNN) have been successfully applied to various
tasks such as visual recognition and image generation. However, learned CNN
models are vulnerable to samples that are unseen and out-of-distribution
(OOD) [12,28,11]. To address this issue, research efforts have been devoted to
data augmentation and regularization, which have shown promising results.
Zhang et al. [23] propose a data augmentation method named Mixup, which
mixes image pairs and their corresponding labels to form smooth annotations
for training models. Mixup can be regarded as a locally linear out-of-manifold
regularization [3], and relates to the decision boundary of adversarially robust
training [24]. Hence, this simple technique has been shown to substantially
improve both model robustness and generalization. Following this direction,
many variants have been proposed to explore the form of interpolation.
Manifold Mixup [16] generalizes Mixup to the feature space. Guo et al. [3]
propose an adaptive Mixup that reduces misleading random generation. Yun et
al. [21] then propose CutMix, which introduces region-based interpolation
between images to replace global mixing. Adopting region-based mixing like
CutMix, Kim et al. [9] propose Puzzle Mix, which generates virtual samples by
utilizing the saliency information of each input. Liu et al. [11] propose to regard
mixing-based data augmentation as a dynamic feature aggregation method,
which yields a compact feature space with strong robustness against adversarial
attacks. More recently, Hong et al. [6] propose StyleMix to separate content and
style for enhanced data augmentation. As a contemporaneous work similar to
StyleMix, Zhou et al. [32] propose MixStyle, which mixes the style information
in the bottom layers of a deep model while keeping labels unmixed. By implicitly
shuffling style information, MixStyle improves model generalization and
achieves satisfactory OOD visual recognition performance. Despite this
progress, MixStyle and StyleMix rely on AdaIN [8] to disentangle style
information, which requires intermediate feature maps as input. However,
according to an empirical study [2], style information is sensitive to the depth
of the layer and the network architecture, which limits the practical
applicability of these methods.
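To make the interpolated training pairs concrete, below is a minimal
PyTorch-style sketch of the original Mixup [23]. The function name and batch
layout are our own illustration, but the convex combination of inputs and labels
with a coefficient drawn from a Beta(α, α) distribution follows the formulation
of Zhang et al.

import torch

def mixup_batch(x, y, alpha=0.2):
    # x: images of shape (B, C, H, W); y: one-hot labels of shape (B, K).
    # Sample one mixing coefficient lambda ~ Beta(alpha, alpha) per batch.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))  # random pairing within the batch
    x_mixed = lam * x + (1.0 - lam) * x[perm]  # homogeneous, global mixing
    y_mixed = lam * y + (1.0 - lam) * y[perm]  # labels mixed with the same lambda
    return x_mixed, y_mixed

Note that every pixel is weighted by the same coefficient, regardless of whether
it lies in a discriminative region or in background; this homogeneity is exactly
what the method introduced below relaxes.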
In this paper, inspired by Mixup and StyleMix, we propose a novel method
named Decoupled-Mixup to combine image pairs for training CNN models. Our
insight is that not all image regions benefit OOD visual recognition: noise-prone
regions, such as texture and cluttered background, are often adverse to the
generalization of CNN models during training. Yet, previous work such as
Mixup treats all image regions equally when combining a pair of images. In
contrast, we propose to decouple each image into discriminative and noise-prone
regions, and to suppress the noise-prone regions during image combination,
such that the CNN model pays more attention to discriminative regions during
training. In particular, we propose a universal form based on Mixup, in which
StyleMix can be regarded as a special case of Decoupled-Mixup in the feature
space. Furthermore, by extending Decoupled-Mixup to the context and
frequency domains respectively, we propose Context-aware Decoupled-Mixup
(CD-Mixup) and Frequency-aware Decoupled-Mixup (FD-Mixup), which
capture discriminative and noise-prone regions using saliency and texture
information, respectively, and suppress the noise-prone regions when combining
image pairs. By such heterogeneous combination, our method
trains the CNN model to emphasize more informative regions, which improves
the generalization ability of the trained model.
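As a purely hypothetical illustration of this decouple-and-suppress idea in the
image domain (not the paper's exact formulation), the sketch below assumes
each image comes with a saliency-style mask s in [0, 1] marking its
discriminative regions, e.g., produced by a pre-trained saliency or CAM model,
and attenuates the noise-prone complement before a Mixup-style combination;
the suppression factor is an illustrative parameter.

import torch

def decoupled_mix(x1, x2, s1, s2, lam=0.6, suppress=0.5):
    # x1, x2: images of shape (C, H, W); s1, s2: masks in [0, 1] of shape
    # (1, H, W) marking discriminative regions (assumed to be given).
    # Noise-prone regions (1 - s) are attenuated by `suppress` < 1, so the
    # combination emphasizes discriminative content from both images.
    d1 = (s1 + suppress * (1.0 - s1)) * x1  # keep salient, damp noise-prone
    d2 = (s2 + suppress * (1.0 - s2)) * x2
    return lam * d1 + (1.0 - lam) * d2  # heterogeneous combination

The corresponding labels can still be mixed with the same coefficient lam, as in
vanilla Mixup.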
In summary, the contribution of this paper is three-fold:
- We propose a novel method to train CNN models for OOD visual recognition.
Our method suppresses noise-prone regions when combining image pairs for
training, such that the trained CNN model emphasizes discriminative image
regions, which improves its generalization ability.
- Our CD-Mixup and FD-Mixup modules effectively decouple each image into
discriminative and noise-prone regions by separately exploiting the context and
texture domains, without requiring extra object/instance-level annotations.
- Experimental results show that our method achieves superior performance
and better generalization ability on testing data composed of unseen contexts,
compared with state-of-the-art Mixup-based methods.
2 Related Works
OOD Generalization. OOD generalization considers the ability of deep models
trained with limited data to generalize to unseen distributions in real-world
scenarios. Recently, OOD generalization has been introduced into many visual
applications [13,25,10,27,26]. In general, the unseen domains of OOD samples
greatly confuse deep models in visual recognition. To address this issue, domain
generalization methods are proposed to train models only on accessible source
domains while making them generalize well to unseen domains. Several
works [15,26,29] propose to obtain domain-invariant features across source
domains and inhibit the negative effect of domain-specific ones, leading to
better generalization ability on unseen domains. Another simple but effective
domain generalization approach is to enlarge the data space with data
augmentation and regularization of the accessible source domains [23,30,31].
Following this direction, we further decouple and suppress the noise-prone
regions (e.g., background and texture information) of source domains to
improve the OOD generalization of deep models.
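Since texture is largely carried by high-frequency components, one simple way
to separate texture-like, noise-prone information from low-frequency content,
in the spirit of the frequency-aware variant introduced above, is a radial
low-/high-pass split in the Fourier domain. The sketch below is a hypothetical
illustration under that assumption; the cutoff value is ours, not the paper's.

import torch

def frequency_split(x, cutoff=0.1):
    # Split an image x of shape (C, H, W) into low-frequency content and
    # high-frequency texture via a hard radial mask in the Fourier domain.
    # `cutoff` is the mask radius as a fraction of the spectrum extent.
    f = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    _, h, w = x.shape
    yy, xx = torch.meshgrid(torch.arange(h) - h // 2,
                            torch.arange(w) - w // 2, indexing="ij")
    radius = ((yy.float() / h) ** 2 + (xx.float() / w) ** 2).sqrt()
    mask = (radius <= cutoff).float()
    low = torch.fft.ifft2(torch.fft.ifftshift(f * mask, dim=(-2, -1))).real
    high = x - low  # the residual keeps the high-frequency texture
    return low, high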
Self-/Weakly-Supervised Segmentation. A series of methods [18,17,19]
demonstrate a good ability to segment objects of interest out of complex
backgrounds in self- or weakly-supervised manners. However, in the absence of
pixel-level annotations, spurious correlations can result in the incorrect
segmentation of class-related backgrounds. To handle this problem, CLIMS [17]
proposes a language-image matching-based suppression, and C2AM [19]
proposes contrastive learning of foreground-background discrimination. The
aforementioned methods can serve as the disentanglement function in the
proposed Context-aware Decoupled-Mixup.