
Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees. A Preprint
In fact, different examples might not contribute equally towards learning [Vodrahalli et al. (2018), Sorscher et al. (2022)]. This motivates studies to better understand the role of data instances with respect to their contribution in obtaining good metrics. Instance hardness may be a way towards this idea [Zhou et al. (2020)]. However, more than identifying how hard a given example is for the task at hand, we believe there is significant benefit in separating the examples that are useful for discovering patterns from those that are useless for knowledge discovery [Hao et al. (2022), Saha and Srivastava (2014), Frénay and Verleysen (2013)]. Trustworthy data are those with correct labels, ranging from typical examples that are easy to learn, through ambiguous or borderline instances that may require a more complex model to be learned, to atypical (or rare) instances that are hard to learn.
Therefore, in this paper we propose a method based on metrics computed from the training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. In particular, it uses either XGBoost [Chen and Guestrin (2016)] or LightGBM [Ke et al. (2017)] as base models. Our algorithm builds on the Dataset Cartography idea, originally proposed for neural networks in the context of natural language processing datasets [Swayamdipta et al. (2020)]. In contrast, we focus on datasets containing mostly tabular or structured data, for which decision tree ensembles are the state of the art in terms of performance, classification metrics, and interpretability [Shwartz-Ziv and Armon (2022)]. Moreover, [Swayamdipta et al. (2020)] devote their main efforts to investigating the use of ambiguous examples to improve generalization, and only briefly address mislabeled examples. In this study we instead focus on detecting noisy labels in order to either remove or relabel them to improve models' metrics.
Our contributions are as follows:
1. We are the first to introduce training dynamics metrics for dataset instances, a.k.a. Dataset Cartography, using ensembles of boosted decision trees (GBDTs);
2. We use the method as part of a pipeline to deploy a production model that classifies forbidden items on a marketplace platform, and provide guidelines for practitioners who may benefit from the practices shown in our applied data science paper;
3. We propose a novel algorithm that uses the computed training dynamics metrics, in particular the product of correctness and confidence, in conjunction with LightGBM iterative instance weights, to improve noisy label detection;
4. By investigating both noise completely at random (NCAR) and noise not at random (NNAR), we show that removing mislabeled instances may improve model performance, outperforming previous work in many scenarios, including real, synthetic, and production datasets.
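The per-instance training dynamics metrics can be sketched as follows. We use scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost/LightGBM, since its staged_predict_proba exposes per-iteration predictions; the function name, the synthetic data, and the final scoring line are our illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def training_dynamics(X, y, n_iterations=50):
    """Dataset Cartography-style metrics over boosting iterations.

    Returns, per training instance: confidence (mean probability of the
    true label across iterations), variability (std of that probability),
    and correctness (fraction of iterations predicting the true label).
    """
    model = GradientBoostingClassifier(n_estimators=n_iterations)
    model.fit(X, y)
    # probs[t, i] = P(true label of instance i | model after t+1 rounds)
    probs = np.stack(
        [p[np.arange(len(y)), y] for p in model.staged_predict_proba(X)]
    )
    # correct[t, i] = 1 iff instance i is correctly classified at round t+1
    correct = np.stack([pred == y for pred in model.staged_predict(X)])
    return probs.mean(axis=0), probs.std(axis=0), correct.mean(axis=0)

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
conf, var, corr = training_dynamics(X, y)
# A low correctness * confidence product flags candidate label errors
# (following the product heuristic of our contribution 3).
suspect = np.argsort(conf * corr)[:10]
```

With XGBoost or LightGBM, the same per-iteration probabilities can be obtained by limiting prediction to the first t trees at each iteration t.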
2 Related Work
Previous work includes approaches to score dataset instances using confidence [Hovy et al. (2013)] and hardness metrics [Lorena et al. (2019)]. Beyond measuring confidence or hardness, the field known as "confident learning" [Northcutt et al. (2021)] intends to address the issue of uncertainty in data labels during neural network training. Some important conclusions were drawn in this scenario for multiclass problems, in particular: (i) label noise is class-conditional [Angluin and Laird (1988)], e.g., in a natural image scenario a dog is more likely to be mislabeled as a wolf than as an airplane; (ii) the joint distribution between given (noisy) labels and unknown (true) labels can be estimated via a series of approaches: pruning, counting, and ranking. According to [Northcutt et al. (2021)], pruning searches for label errors, for example via loss reweighting to avoid iterative relabeling [Chen et al. (2019), Patrini et al. (2016)], or by using unlabeled data to prune labeled datasets [Sorscher et al. (2022)]. Counting trains on clean data in order to avoid propagating errors into learned models [Natarajan et al. (2013)]. Ranking orders the examples to be used during training, as in curriculum learning [Zhou et al. (2020)].
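The counting idea behind confident learning can be illustrated with a small NumPy sketch: for each class j, a threshold is the mean predicted probability of class j among examples labeled j, and an example labeled i is counted towards true class j when its probability for j reaches that threshold. This is our simplified reading of [Northcutt et al. (2021)], not their reference implementation; all names and the toy data are ours.

```python
import numpy as np

def confident_joint(probs, noisy_labels, n_classes):
    """Count matrix C[i, j]: examples carrying noisy label i whose
    predictions confidently suggest true label j."""
    # Per-class threshold: mean self-confidence of examples labeled j.
    thresholds = np.array(
        [probs[noisy_labels == j, j].mean() for j in range(n_classes)]
    )
    C = np.zeros((n_classes, n_classes), dtype=int)
    for p, i in zip(probs, noisy_labels):
        above = np.where(p >= thresholds)[0]
        if len(above) > 0:
            # Among confident classes, pick the most likely one.
            j = above[np.argmax(p[above])]
            C[i, j] += 1
    return C

# Toy example: two classes; the last example (labeled 0 but predicted
# strongly as class 1) looks mislabeled.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
labels = np.array([0, 0, 1, 0])
C = confident_joint(probs, labels, n_classes=2)
```

Off-diagonal counts (here C[0, 1]) point to candidate label errors, which pruning can then remove or rank for relabeling.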
The learning or training dynamics of neural network models have also been shown to be useful for identifying instance quality: for example, by comparing an instance's score for its assigned label with that of its highest non-assigned class [Pleiss et al. (2020)], or by flagging instances with low loss values [Shen and Sanghavi (2019)]. Other studies identify which instances represent simpler patterns and are easy to learn [Liu et al. (2020)], as well as those that are easily forgotten [Toneva et al. (2018)], i.e., misclassified again in a later epoch. Such studies show that deep networks are biased towards learning easier examples faster during training. In this context, making sure the deep network memorizes rare and ambiguous instances, while avoiding memorization of easy ones, leads to better generalization [Feldman (2020), Swayamdipta et al. (2020), Li and Vasconcelos (2019)]. Noise in both training and testing data leads to practical limits in performance metrics, requiring novel training approaches [Ponti et al. (2021)]. This is important