A Survey of Dataset Refinement for Problems in Computer Vision Datasets

ZHIJING WAN, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University, China
ZHIXIANG WANG, Graduate School of Information Science and Technology, The University of Tokyo, Japan
CHEUKTING CHUNG, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University, China
ZHENG WANG, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University, China
Large-scale datasets have played a crucial role in the advancement of computer vision. However, they often suffer from problems such as class imbalance, noisy labels, dataset bias, or high resource costs, which can inhibit model performance and reduce trustworthiness. With the advocacy of data-centric research, various data-centric solutions have been proposed to solve the dataset problems mentioned above. They improve the quality of datasets by re-organizing them, which we call dataset refinement. In this survey, we provide a comprehensive and structured overview of recent advances in dataset refinement for problematic computer vision datasets¹. Firstly, we summarize and analyze the various problems encountered in large-scale computer vision datasets. Then, we classify the dataset refinement algorithms into three categories based on the refinement process: data sampling, data subset selection, and active learning. In addition, we organize these dataset refinement methods according to the addressed data problems and provide a systematic comparative description. We point out that these three types of dataset refinement have distinct advantages and disadvantages for dataset problems, which informs the choice of the data-centric method appropriate to a particular research objective. Finally, we summarize the current literature and propose potential future research topics.
CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Computer vision; • Security and privacy;
Additional Key Words and Phrases: Dataset refinement, data sampling, subset selection, active learning
ACM Reference Format:
Zhijing Wan, Zhixiang Wang, CheukTing Chung, and Zheng Wang. 2023. A Survey of Dataset Refinement for Problems in Computer Vision Datasets. J. ACM 37, 4, Article 111 (August 2023), 33 pages. https://doi.org/XXXXXXX.XXXXXXX
Corresponding author
¹All resources are available at https://github.com/Vivian-wzj/DatasetRefinement-CV.
Authors’ addresses: Zhijing Wan, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China, wanzjwhu@whu.edu.cn; Zhixiang Wang, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan, wangzx1994@gmail.com; CheukTing Chung, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China, 2271406579@qq.com; Zheng Wang, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China, wangzwhu@whu.edu.cn.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2023 Association for Computing Machinery.
Manuscript submitted to ACM
arXiv:2210.11717v2 [cs.CV] 6 Oct 2023
Fig. 1. Comparison of different data operations. (a) Without (W/O) data operation: typically, once a dataset is collected, it is directly used as a training set without any processing. In this case, the size of the training set D_tr is equal to the size N of the collected dataset D_cl. Problematic data in the collected dataset have received less attention and remain to be addressed. (b) With (W/) data augmentation: in the case of data starvation, data augmentation is often used to make small changes to the collected data as a way to increase the size M and diversity of the training data, and it has received increasing attention in recent years. However, instead of being addressed, problematic data increase, inhibiting the effectiveness of augmentation. (c) W/ dataset refinement: it is used to address the problems faced in the collected dataset to improve the dataset quality and, thus, the performance and efficiency of model learning. In most cases, the size n of the refined dataset satisfies n ≤ N. (Figure legend: problematic data, clean data, informative data; N, M, and n denote dataset sizes.)
1 INTRODUCTION AND MOTIVATION
Recently, deep learning has achieved impressive progress in computer vision [85, 87]. The success of deep learning mainly owes to three points, namely advanced deep network architectures (e.g., residual networks [55]), powerful computing devices (e.g., Graphics Processing Units (GPUs)), and large labeled datasets (e.g., ImageNet [29]). Among them, deep network architectures and computing devices are well-developed, but obtaining high-quality training datasets is still very difficult. As the attention of researchers was mainly focused on the development and optimization of deep models and computational devices, once the data was ready, it became a fixed asset and received less attention (as shown in Fig. 1 (a)). With the development of model architectures, the incremental gains from improving models are diminishing in many tasks [68]. At the same time, relatively minor improvements in data can make Artificial Intelligence (AI) models much more reliable [85]. Therefore, more attention should be paid to data development. Furthermore, there is a prevalent assumption that all data points are equally relevant to model parameter updating; in other words, all the training data are presented equally and randomly to the model. However, numerous works have challenged this assumption and proven that not all samples are created equal [51, 64]. Networks should be aware of the varying complexity of the data and spend most of the computation on critical examples. With all these concerns, there is a strong need for the AI community to move from model-centric research to data-centric research for further improvements in model learning.
Ideally, the collected dataset is correct, without any problem, and can be directly used for model learning. However, since there is no uniform standard for the data collection and labeling process, and the labeling is usually left to crowdsourcing companies rather than labeling experts, the collected dataset is often of low quality, i.e., it contains redundant and non-informative data, and often suffers from various problems such as noisy labels [27], class imbalance [65], representation bias [83], and distribution mismatch [144]. For example, ImageNet, the most influential ultra-large benchmark in computer vision, was identified by [12] as containing noisy labels. For many computer vision tasks such
as face recognition [18, 173], medical image diagnosis [62], and image classification [147], the collected training datasets typically exhibit a long-tailed class distribution, where a minority of classes have a large number of samples, while the remaining classes have only a small number of samples each. Since deep neural networks have the capacity to essentially memorize any characteristics of the data [169], such problematic data can drastically inhibit the performance of model training. Besides, the collected data may also face challenges such as compressing the data volume and reducing the annotation cost. This is because, under the advocacy of green AI [122] in the AI community, the training data should not only improve model accuracy but also enable efficient training, thus reducing resource consumption, decreasing AI's environmental footprint, and increasing its inclusivity. Therefore, how to effectively refine the problematic dataset before model training has become one of the bottleneck problems for a trustworthy and robust AI system.
With the advocacy of data-centric research, some data-centric studies have been conducted for dataset problems in recent years. They aim to make the dataset more accurate and useful by removing irrelevant or redundant data and correcting problems, which we call dataset refinement (as shown in Fig. 1 (c)). Dataset refinement studies can be divided into three main directions according to the refinement process: data sampling, data subset selection, and active learning. Data sampling adjusts the frequency and order of training data to promote effective model training; data subset selection selects the most representative samples from the labeled training set to promote efficient model training; active learning selects the most useful samples from the unlabeled dataset for labeling to minimize the cost of labeling. When comparing them with another type of data operation, namely data augmentation, it can be seen that the latter has received more attention in recent years. This is because data augmentation is a powerful technique for alleviating data-hungry situations in deep learning. Data augmentation [128] increases the amount and diversity of training data by adding small changes to the original data or creating new synthetic data based on the original data, which can improve the performance of models. However, it does not address problematic data, so direct augmentation based on the original data can increase the amount of problematic data (as shown in Fig. 1 (b)) and inhibit the effectiveness of augmentation. Therefore, dataset refinement should be carried out before model training, either before or after data augmentation.
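To make these directions concrete, the sketch below implements least-confidence uncertainty sampling, a classical query strategy for the active learning direction defined above. It is a minimal illustration rather than the method of any particular paper surveyed here; the PyTorch model, the tensor of unlabeled images, and the labeling budget are assumed inputs.

```python
import torch
import torch.nn.functional as F

def uncertainty_query(model: torch.nn.Module,
                      unlabeled_images: torch.Tensor,
                      budget: int) -> torch.Tensor:
    """Return indices of the `budget` unlabeled samples the model is least
    confident about; these are the ones sent to annotators for labeling."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_images), dim=1)
    confidence, _ = probs.max(dim=1)                 # highest class probability per sample
    query_idx = torch.argsort(confidence)[:budget]   # least confident first
    return query_idx
```

Data sampling and subset selection methods replace the acquisition score above with criteria of their own (e.g., per-class frequencies or per-sample losses), as reviewed in the following sections.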
1.1 Motivation for Work
The motivation for this comprehensive overview is outlined below:
• To thoroughly inspect the various problems inherent in or external to computer vision datasets.
• To focus on the data-centric solution and review the dataset refinement methods for each dataset problem.
This article focuses on three main directions in dataset refinement (i.e., data sampling, data subset selection, and active learning), and reviews and analyzes their respective developments in light of the problems addressed. This is a data-centric survey that aims to provide insights into dataset refinement methods and to inform researchers and practitioners in the field of computer vision on the selection of appropriate methods to solve specific dataset problems.
1.2 Our Contributions
• The problems faced by computer vision datasets have been elaborated.
• A comprehensive survey has been conducted to investigate dataset refinement for solving problems in computer vision datasets.
• Comparative analysis of different dataset refinement methods for solving the same dataset problems has been performed.
Table 1. Related surveys. They may cover data problems such as Class Imbalance (CI), Noisy Labels (NL), Dataset Biases (DB), High Computational Cost (HCC), and High Labeling Cost (HLC). "DS" indicates data sampling; "SS" indicates subset selection; "AL" indicates active learning. "—" means that the corresponding column is not covered.

Research Articles | Data Problems Covered (CI, NL, DB, HCC, HLC) | Improvement Perspective (Model-level, Data-level) | Dataset Refinement (DS, SS, AL)
Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective [157] | ✓ ✓
A Survey on Curriculum Learning [152] | ✓ ✓
A Review of Instance Selection Methods [104] | ✓ ✓
DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning [47] | ✓ ✓
A Survey on Active Deep Learning: From Model Driven to Data Driven [87] | ✓ ✓
Deep Long-tailed Learning: A Survey [175] | ✓ ✓
Learning from Noisy Labels with Deep Neural Networks: A Survey [135] | ✓ ✓
A Survey on Bias in Visual Datasets [36] | ✓ ✓ — —
Our Survey | ✓ ✓ ✓ ✓
1.3 Related Surveys
In this section, we present research relevant to our survey. There is a systematic study [157] reviewing the data collection and quality challenges in deep learning from a data-centric AI perspective, which divides the whole machine learning process into data collection, data cleaning, and robust model training. While the survey [157] covered the data cleaning associated with dataset refinement, it provided only a cursory summary of data cleaning rather than a detailed account of how it has developed in recent years, and it did not elaborate on the problems in datasets. In addition, there have been some surveys on individual directions of dataset refinement (data sampling, data subset selection, or active learning), such as a survey on curriculum learning [152], reviews of instance selection methods [47, 104], and a survey on active deep learning [87]. Furthermore, there are surveys [36, 135, 175] that focus on one or two data problems but do not cover the full range of dataset problems. Moreover, they mainly cover methods that overcome data problems at the model level, while less attention is paid to dataset refinement methods. In summary, there is no systematic study that comprehensively reviews and discusses dataset refinement from the perspective of data problems. To fill this gap, we aim to provide a comprehensive survey of recent dataset refinement studies in computer vision conducted before mid-2022. Table 1 presents a comparison of existing surveys related to problematic data in computer vision.
1.4 Article Organization
The survey is organized as follows: we first summarize and describe the various problems inherent in, or external to, computer vision datasets in Section 2. Then, in Section 3, we provide an overview of three directions of dataset refinement (i.e., data sampling, subset selection, and active learning), including their notation, components, definitions, and taxonomy. We classify advanced dataset refinement methods according to the problem solved and the result achieved, and review them in Sections 4, 5, 6, and 7, respectively. We further compare and analyze several related learning problems in Section 8, including data distillation, feature selection, and semi-supervised learning. Afterward, in Section 9, we discuss some of the potential future research directions. Finally, we conclude the survey in Section 10.
Fig. 2. Illustrations of the problems inherent in computer vision datasets: (a) class imbalance (number of training samples over sorted class index, with head and tail classes), (b) noisy labels (e.g., mislabeled "dog"/"cat" images and MNIST digits), and (c) dataset biases (trained classifier vs. optimal classifier on training and test data).
2 SUMMARY OF PROBLEMS IN COMPUTER VISION DATASETS
Generally, due to the lack of uniform standards and specifications for the current data collection and labeling process, published datasets are not perfect [85] and will suffer from some inherent problems, such as class imbalance [105, 175], noisy labels [27, 46], or dataset biases [36, 44, 144]. In addition, with the increasing size of datasets, data collection or application development may face problems such as high labeling costs and excessive computational overhead, which can also inhibit the development of AI. In the following, we will detail the main data problems faced in the field of computer vision and the corresponding research challenges, and briefly describe how to solve them from the perspective of dataset refinement.
Class imbalance. Technically, any dataset that shows an uneven distribution between classes can be considered imbalanced [54, 65]. This issue has attracted considerable interest from academia, industry, and government funding agencies. The main class imbalance of interest to computer vision researchers is the problem of long-tailed class distribution. The long-tailed class distribution (hereafter termed the long-tailed distribution) is a classical class imbalance problem in which a few classes (the head classes) have massive numbers of samples while the others (the tail classes) have only a small number of samples, as shown in Fig. 2(a). Therefore, in most cases, the class imbalance problem in this article refers to the long-tailed distribution problem. During training, this problem can bias the model toward the head classes and cause it to perform poorly on the tail classes. As such, it poses a challenge to the robust learning of models. For this problem, data sampling and data subset selection can be used to counteract the long-tail effect by balancing the distribution of head and tail classes and enhancing the learning of tail classes, as summarized in Section 4.1.1 and Section 4.1.2, respectively.
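As a concrete illustration of the data sampling route, the snippet below sketches class-balanced re-sampling on a hypothetical long-tailed toy dataset, where each sample is drawn with probability inversely proportional to its class frequency. This is a minimal sketch of one common strategy, assuming a PyTorch setup; the dataset sizes, feature dimension, and batch size are arbitrary placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Hypothetical long-tailed toy dataset: class 0 (head) has 900 samples,
# class 1 (tail) has 100 samples.
features = torch.randn(1000, 8)
labels = torch.cat([torch.zeros(900, dtype=torch.long),
                    torch.ones(100, dtype=torch.long)])
dataset = TensorDataset(features, labels)

# Weight each sample by the inverse frequency of its class, so head and
# tail classes are drawn roughly equally often during training.
class_counts = torch.bincount(labels).float()          # tensor([900., 100.])
sample_weights = 1.0 / class_counts[labels]            # one weight per sample
sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=len(dataset),
                                replacement=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Mini-batches drawn from `loader` now contain head and tail samples in
# roughly equal proportion, counteracting the long-tail effect.
```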
Noisy labels. Generally, we assume by default that the data are correctly labeled. However, there may be label issues in the datasets due to inevitable errors by human annotators or automated label extraction tools for images, such as crowdsourcing and web crawling. For example, there are likely at least 100,000 label issues in ImageNet, such as samples that are incorrectly labeled. Some samples with noisy labels from realistic datasets or the MNIST [79] dataset are shown in Fig. 2(b). Deep learning with noisy labels is practically challenging, as the capacity of deep models is so high that they can totally memorize these noisy labels sooner or later during model training [50]. As a result, these noisy labels inevitably degrade the robustness of learned models [98]. To improve model robustness, there has been a lot of work that overcomes this problem by data sampling or data subset selection, as summarized in Section 4.2.1 and Section 4.2.2, respectively; a minimal sketch of one widely used selection criterion follows.
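The sketch below illustrates the small-loss criterion that many of these selection methods build on (in the spirit of co-teaching-style approaches): within each mini-batch, only the samples with the smallest per-sample loss are treated as likely clean and kept for the parameter update. The fixed `keep_ratio` and the purely batch-wise selection are simplifying assumptions for illustration, not the procedure of any single surveyed method.

```python
import torch
import torch.nn.functional as F

def small_loss_selection(model: torch.nn.Module,
                         images: torch.Tensor,
                         labels: torch.Tensor,
                         keep_ratio: float = 0.7):
    """Keep the fraction of a mini-batch with the smallest cross-entropy
    loss. Deep networks tend to fit clean labels before memorizing noisy
    ones, so low-loss samples are more likely to be correctly labeled."""
    with torch.no_grad():
        per_sample_loss = F.cross_entropy(model(images), labels,
                                          reduction="none")
    num_keep = max(1, int(keep_ratio * labels.size(0)))
    keep_idx = torch.argsort(per_sample_loss)[:num_keep]
    return images[keep_idx], labels[keep_idx]
```

The selected subset then replaces the full noisy batch in the training step.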
The noisy labels studied in most of the current work are synthetic random label noise (symmetric or asymmetric label noise [27]). In fact, there exists much more instance-dependent label noise [23] in real-world datasets. For example, annotators could easily label a cat as a lion, but would not easily label a cat as a table. Compared to random