A Survey of Dataset Refinement for Problems in Computer Vision Datasets

ZHIJING WAN, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University, China
ZHIXIANG WANG, Graduate School of Information Science and Technology, The University of Tokyo, Japan
CHEUKTING CHUNG, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University, China
ZHENG WANG, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University, China
Large-scale datasets have played a crucial role in the advancement of computer vision. However, they often suffer from problems such as class imbalance, noisy labels, dataset bias, or high resource costs, which can inhibit model performance and reduce trustworthiness. With the advocacy of data-centric research, various data-centric solutions have been proposed to solve the dataset problems mentioned above. They improve the quality of datasets by re-organizing them, which we call dataset refinement. In this survey, we provide a comprehensive and structured overview of recent advances in dataset refinement for problematic computer vision datasets¹. Firstly, we summarize and analyze the various problems encountered in large-scale computer vision datasets. Then, we classify the dataset refinement algorithms into three categories based on the refinement process: data sampling, data subset selection, and active learning. In addition, we organize these dataset refinement methods according to the addressed data problems and provide a systematic comparative description. We point out that these three types of dataset refinement have distinct advantages and disadvantages for dataset problems, which informs the choice of the data-centric method appropriate to a particular research objective. Finally, we summarize the current literature and propose potential future research topics.
CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Computer vision; • Security and privacy;
Additional Key Words and Phrases: Dataset refinement, data sampling, subset selection, active learning
ACM Reference Format:
Zhijing Wan, Zhixiang Wang, CheukTing Chung, and Zheng Wang. 2023. A Survey of Dataset Refinement for Problems in Computer Vision Datasets. J. ACM 37, 4, Article 111 (August 2023), 33 pages. https://doi.org/XXXXXXX.XXXXXXX
Corresponding author
¹All resources are available at https://github.com/Vivian-wzj/DatasetRefinement-CV.
Authors’ addresses: Zhijing Wan, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China, wanzjwhu@whu.edu.cn; Zhixiang Wang, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan, wangzx1994@gmail.com; CheukTing Chung, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China, 2271406579@qq.com; Zheng Wang, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China, wangzwhu@whu.edu.cn.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2023 Association for Computing Machinery.
Manuscript submitted to ACM
arXiv:2210.11717v2 [cs.CV] 6 Oct 2023
Fig. 1. Comparison of different data operations. (a) Without (W/O) data operation: typically, once a dataset is collected, it is directly used as a training set without any processing. In this case, the size of the training set D_tr is equal to the size N of the collected dataset D_cl. Problematic data in the collected dataset have received less attention and remain to be addressed. (b) With (W/) data augmentation: in the case of data starvation, data augmentation is often used to make small changes to the collected data as a way to increase the size M and diversity of the training data, and it has received increasing attention in recent years. However, instead of being addressed, problematic data increase, inhibiting the effectiveness of augmentation. (c) W/ dataset refinement: it is used to address the problems faced in the collected dataset to improve the dataset quality and, thus, the performance and efficiency of model learning. In most cases, the size n of the refined dataset satisfies n ≤ N. (Figure legend: problematic data, clean data, informative data; N, M, and n denote dataset sizes.)
1 INTRODUCTION AND MOTIVATION
Recently, deep learning has achieved impressive progress in computer vision [85, 87]. The success of deep learning mainly owes to three points, namely advanced deep network architectures (e.g., residual networks [55]), powerful computing devices (e.g., Graphics Processing Units (GPUs)), and large labeled datasets (e.g., ImageNet [29]). Among them, deep network architectures and computing devices are well-developed, but obtaining high-quality training datasets is still very difficult. As the attention of researchers was mainly focused on the development and optimization of deep models and computational devices, once the data was ready, it became a fixed asset and received less attention (as shown in Fig. 1 (a)). With the development of model architectures, the incremental gains from improving models are diminishing in many tasks [68]. At the same time, relatively minor improvements in data can make Artificial Intelligence (AI) models much more reliable [85]. Therefore, more attention should be paid to data development. Furthermore, there is a prevalent assumption that all data points are equally relevant to model parameter updating; in other words, all the training data are presented equally and randomly to the model. However, numerous works have challenged this assumption and proven that not all samples are created equal [51, 64]. Networks should be aware of the varying complexity of the data and spend most of the computation on critical examples. With all these concerns, there is a strong need for the AI community to move from model-centric research to data-centric research for further improvements in model learning.
Ideally, the collected dataset is correct, without any problem, and can be directly used for model learning. However, since there is no uniform standard for the data collection and labeling process, and the labeling is usually left to crowdsourcing companies rather than labeling experts, the collected dataset is often of low quality, i.e., it contains redundant and non-informative data, and often suffers from various problems such as noisy labels [27], class imbalance [65], representation bias [83], and distribution mismatch [144]. For example, ImageNet, the most influential ultra-large benchmark in computer vision, was identified by [12] as containing noisy labels. For many computer vision tasks such
as face recognition [18, 173], medical image diagnosis [62], and image classification [147], the collected training datasets typically exhibit a long-tailed class distribution, where a minority of classes have a large number of samples, while the remaining classes have only a small number of samples each. Since deep neural networks have the capacity to essentially memorize any characteristics of the data [169], such problematic data can drastically inhibit the performance of model training. Besides, the collected data may also face challenges such as compressing the data volume and reducing the annotation cost. This is because, under the advocacy of green AI [122] in the AI community, the training data should not only improve model accuracy but also enable efficient training, thus reducing resource consumption, decreasing AI's environmental footprint, and increasing its inclusivity. Therefore, how to effectively refine the problematic dataset before model training has become one of the bottleneck problems for a trustworthy and robust AI system.
With the advocacy of data-centric research, some data-centric studies have been conducted for dataset problems in recent years. They aim to make the dataset more accurate and useful by removing irrelevant or redundant data and correcting problems, which we call dataset refinement (as shown in Fig. 1 (c)). Dataset refinement studies can be divided into three main directions according to the refinement process: data sampling, data subset selection, and active learning. Data sampling adjusts the frequency and order of training data to promote effective model training; data subset selection selects the most representative samples from the labeled training set to promote efficient model training; active learning selects the most useful samples from the unlabeled dataset for labeling to minimize the cost of labeling. When comparing them with another type of data operation, namely data augmentation, it can be seen that the latter has received more attention in recent years. This is because data augmentation is a powerful technique for alleviating data-hungry situations in deep learning. Data augmentation [128] increases the amount and diversity of training data by adding small changes to the original data or creating new synthetic data based on the original data, which can improve the performance of models. However, it does not address problematic data, so direct augmentation based on the original data can increase the amount of problematic data (as shown in Fig. 1 (b)) and inhibit the effectiveness of augmentation. Therefore, dataset refinement should be carried out before model training, either before or after data augmentation.
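To make these directions concrete, the sketch below implements least-confidence uncertainty sampling, a classical query strategy for the active learning direction defined above. It is a minimal illustration rather than the method of any particular paper surveyed here; the PyTorch model, the tensor of unlabeled images, and the labeling budget are assumed inputs.

```python
import torch
import torch.nn.functional as F

def uncertainty_query(model: torch.nn.Module,
                      unlabeled_images: torch.Tensor,
                      budget: int) -> torch.Tensor:
    """Return indices of the `budget` unlabeled samples the model is least
    confident about; these are the ones sent to annotators for labeling."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_images), dim=1)
    confidence, _ = probs.max(dim=1)                 # highest class probability per sample
    query_idx = torch.argsort(confidence)[:budget]   # least confident first
    return query_idx
```

Data sampling and subset selection methods replace the acquisition score above with criteria of their own (e.g., per-class frequencies or per-sample losses), as reviewed in the following sections.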
1.1 Motivation for Work
The motivation for this comprehensive overview is outlined below:
• To thoroughly inspect the various problems inherent in or external to computer vision datasets.
• To focus on the data-centric solution and review the dataset refinement methods for each dataset problem.
This article focuses on three main directions in dataset refinement (i.e., data sampling, data subset selection, and active learning), and reviews and analyzes their respective developments in light of the problems addressed. This is a data-centric survey that aims to provide insights into dataset refinement methods and to inform researchers and practitioners in the field of computer vision on the selection of appropriate methods to solve specific dataset problems.
1.2 Our Contributions
• The problems faced by computer vision datasets have been elaborated.
• A comprehensive survey has been conducted to investigate dataset refinement for solving problems in computer vision datasets.
• Comparative analysis of different dataset refinement methods for solving the same dataset problems has been performed.
Table 1. Related surveys. They may cover data problems such as Class Imbalance (CI), Noisy Labels (NL), Dataset Biases (DB), High Computational Cost (HCC), and High Labeling Cost (HLC). "DS" indicates data sampling; "SS" indicates subset selection; "AL" indicates active learning. "—" means that the corresponding column is not covered.

Research Articles | Data Problems Covered (CI, NL, DB, HCC, HLC) | Improvement Perspective (Model-level, Data-level) | Dataset Refinement (DS, SS, AL)
Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective [157] | ✓ ✓
A Survey on Curriculum Learning [152] | ✓ ✓
A Review of Instance Selection Methods [104] | ✓ ✓
DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning [47] | ✓ ✓
A Survey on Active Deep Learning: From Model Driven to Data Driven [87] | ✓ ✓
Deep Long-tailed Learning: A Survey [175] | ✓ ✓
Learning from Noisy Labels with Deep Neural Networks: A Survey [135] | ✓ ✓
A Survey on Bias in Visual Datasets [36] | ✓ ✓ — —
Our Survey | ✓ ✓ ✓ ✓
1.3 Related Surveys
In this section, we present research relevant to our survey. There is a systematic study [157] reviewing the data collection and quality challenges in deep learning from a data-centric AI perspective, which divides the whole machine learning process into data collection, data cleaning, and robust model training. While the survey [157] covered the data cleaning associated with dataset refinement, it provided only a cursory summary of data cleaning rather than a detailed account of how it has developed in recent years, and it did not elaborate on the problems in datasets. In addition, there have been some surveys on individual directions of dataset refinement (data sampling, data subset selection, or active learning), such as a survey on curriculum learning [152], reviews of instance selection methods [47, 104], and a survey on active deep learning [87]. Furthermore, there are surveys [36, 135, 175] that focus on one or two data problems but do not cover the full range of dataset problems. Moreover, they mainly cover methods that overcome data problems at the model level, while less attention is paid to dataset refinement methods. In summary, there is no systematic study that comprehensively reviews and discusses dataset refinement from the perspective of data problems. To fill this gap, we aim to provide a comprehensive survey of recent dataset refinement studies in computer vision conducted before mid-2022. Table 1 presents a comparison of existing surveys related to problematic data in computer vision.
1.4 Article Organization
The survey is organized as follows: we first summarize and describe the various problems inherent in, or external to, computer vision datasets in Section 2. Then, in Section 3, we provide an overview of three directions of dataset refinement (i.e., data sampling, subset selection, and active learning), including their notation, components, definitions, and taxonomy. We classify advanced dataset refinement methods according to the problem solved and the result achieved, and review them in Sections 4, 5, 6, and 7, respectively. We further compare and analyze several related learning problems in Section 8, including data distillation, feature selection, and semi-supervised learning. Afterward, in Section 9, we discuss some of the potential future research directions. Finally, we conclude the survey in Section 10.
Fig. 2. Illustrations of the problems inherent in computer vision datasets: (a) class imbalance (number of training samples over sorted class index, with head and tail classes), (b) noisy labels (e.g., mislabeled "dog"/"cat" images and MNIST digits), and (c) dataset biases (trained classifier vs. optimal classifier on training and test data).
2 SUMMARY OF PROBLEMS IN COMPUTER VISION DATASETS
Generally, due to the lack of uniform standards and specifications for the current data collection and labeling process, published datasets are not perfect [85] and will suffer from some inherent problems, such as class imbalance [105, 175], noisy labels [27, 46], or dataset biases [36, 44, 144]. In addition, with the increasing size of datasets, data collection or application development may face problems such as high labeling costs and excessive computational overhead, which can also inhibit the development of AI. In the following, we will detail the main data problems faced in the field of computer vision and the corresponding research challenges, and briefly describe how to solve them from the perspective of dataset refinement.
Class imbalance. Technically, any dataset that shows an uneven distribution between classes can be considered imbalanced [54, 65]. This issue has attracted considerable interest from academia, industry, and government funding agencies. The main class imbalance of interest to computer vision researchers is the problem of long-tailed class distribution. The long-tailed class distribution (hereafter termed the long-tailed distribution) is a classical class imbalance problem in which a few classes (the head classes) have massive numbers of samples while the others (the tail classes) have only a small number of samples, as shown in Fig. 2(a). Therefore, in most cases, the class imbalance problem in this article refers to the long-tailed distribution problem. During training, this problem can bias the model toward the head classes and cause it to perform poorly on the tail classes. As such, it poses a challenge to the robust learning of models. For this problem, data sampling and data subset selection can be used to counteract the long-tail effect by balancing the distribution of head and tail classes and enhancing the learning of tail classes, as summarized in Section 4.1.1 and Section 4.1.2, respectively.
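As a concrete illustration of the data sampling route, the snippet below sketches class-balanced re-sampling on a hypothetical long-tailed toy dataset, where each sample is drawn with probability inversely proportional to its class frequency. This is a minimal sketch of one common strategy, assuming a PyTorch setup; the dataset sizes, feature dimension, and batch size are arbitrary placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Hypothetical long-tailed toy dataset: class 0 (head) has 900 samples,
# class 1 (tail) has 100 samples.
features = torch.randn(1000, 8)
labels = torch.cat([torch.zeros(900, dtype=torch.long),
                    torch.ones(100, dtype=torch.long)])
dataset = TensorDataset(features, labels)

# Weight each sample by the inverse frequency of its class, so head and
# tail classes are drawn roughly equally often during training.
class_counts = torch.bincount(labels).float()          # tensor([900., 100.])
sample_weights = 1.0 / class_counts[labels]            # one weight per sample
sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=len(dataset),
                                replacement=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Mini-batches drawn from `loader` now contain head and tail samples in
# roughly equal proportion, counteracting the long-tail effect.
```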
Noisy labels. Generally, we assume by default that the data are correctly labeled. However, there may be label issues in the datasets due to inevitable errors by human annotators or automated label extraction tools for images, such as crowdsourcing and web crawling. For example, there are likely at least 100,000 label issues in ImageNet, such as samples that are incorrectly labeled. Some samples with noisy labels from realistic datasets or the MNIST [79] dataset are shown in Fig. 2(b). Deep learning with noisy labels is practically challenging, as the capacity of deep models is so high that they can totally memorize these noisy labels sooner or later during model training [50]. As a result, these noisy labels inevitably degrade the robustness of learned models [98]. To improve model robustness, there has been a lot of work that overcomes this problem by data sampling or data subset selection, as summarized in Section 4.2.1 and Section 4.2.2, respectively; a minimal sketch of one widely used selection criterion follows.
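The sketch below illustrates the small-loss criterion that many of these selection methods build on (in the spirit of co-teaching-style approaches): within each mini-batch, only the samples with the smallest per-sample loss are treated as likely clean and kept for the parameter update. The fixed `keep_ratio` and the purely batch-wise selection are simplifying assumptions for illustration, not the procedure of any single surveyed method.

```python
import torch
import torch.nn.functional as F

def small_loss_selection(model: torch.nn.Module,
                         images: torch.Tensor,
                         labels: torch.Tensor,
                         keep_ratio: float = 0.7):
    """Keep the fraction of a mini-batch with the smallest cross-entropy
    loss. Deep networks tend to fit clean labels before memorizing noisy
    ones, so low-loss samples are more likely to be correctly labeled."""
    with torch.no_grad():
        per_sample_loss = F.cross_entropy(model(images), labels,
                                          reduction="none")
    num_keep = max(1, int(keep_ratio * labels.size(0)))
    keep_idx = torch.argsort(per_sample_loss)[:num_keep]
    return images[keep_idx], labels[keep_idx]
```

The selected subset then replaces the full noisy batch in the training step.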
The noisy labels studied in most of the current work are synthetic random label noise (symmetric or asymmetric label noise [27]). In fact, there exists much more instance-dependent label noise [23] in real-world datasets. For example, annotators could easily label a cat as a lion, but would not easily label a cat as a table. Compared to random