
A Survey of Dataset Refinement for Problems in Computer Vision Datasets
ZHIJING WAN, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence,
School of Computer Science, Wuhan University, China
ZHIXIANG WANG, Graduate School of Information Science and Technology, The University of Tokyo, Japan
CHEUKTING CHUNG, National Engineering Research Center for Multimedia Software, Institute of Artificial
Intelligence, School of Computer Science, Wuhan University, China
ZHENG WANG†, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence,
School of Computer Science, Wuhan University, China
Large-scale datasets have played a crucial role in the advancement of computer vision. However, they often suffer from problems such
as class imbalance, noisy labels, dataset bias, or high resource costs, which can inhibit model performance and reduce trustworthiness.
With the advocacy of data-centric research, various data-centric solutions have been proposed to solve the dataset problems mentioned
above. They improve the quality of datasets by re-organizing them, which we call dataset refinement. In this survey, we provide a
comprehensive and structured overview of recent advances in dataset refinement for problematic computer vision datasets¹. Firstly,
we summarize and analyze the various problems encountered in large-scale computer vision datasets. Then, we classify the dataset
refinement algorithms into three categories based on the refinement process: data sampling, data subset selection, and active learning.
In addition, we organize these dataset refinement methods according to the addressed data problems and provide a systematic
comparative description. We point out that these three types of dataset refinement have distinct advantages and disadvantages for
dataset problems, which informs the choice of the data-centric method appropriate to a particular research objective. Finally, we
summarize the current literature and propose potential future research topics.
CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Computer vision; • Security and privacy;
Additional Key Words and Phrases: Dataset refinement, data sampling, subset selection, active learning
ACM Reference Format:
Zhijing Wan, Zhixiang Wang, CheukTing Chung, and Zheng Wang. 2023. A Survey of Dataset Refinement for Problems in Computer
Vision Datasets. J. ACM 37, 4, Article 111 (August 2023), 33 pages. https://doi.org/XXXXXXX.XXXXXXX
†Corresponding author
¹All resources are available at https://github.com/Vivian-wzj/DatasetRefinement-CV.
Authors’ addresses: Zhijing Wan, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer
Science, Wuhan University, Wuhan, China, wanzjwhu@whu.edu.cn; Zhixiang Wang, Graduate School of Information Science and Technology, The
University of Tokyo, Tokyo, Japan, wangzx1994@gmail.com; CheukTing Chung, National Engineering Research Center for Multimedia Software, Institute
of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China, 2271406579@qq.com; Zheng Wang, National Engineering
Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China,
wangzwhu@whu.edu.cn.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
©2023 Association for Computing Machinery.
Manuscript submitted to ACM
arXiv:2210.11717v2 [cs.CV] 6 Oct 2023