Improving Data Quality with Training Dynamics of
Gradient Boosting Decision Trees
A Preprint
Moacir A. Ponti∗, Lucas de Angelis Oliveira
Mercado Livre
Osasco, Brazil
moacir.ponti@mercadolibre.com
Valentina Garcia
Mercado Libre
Medellín, Colombia
Mathias Esteban
Mercado Libre
Montevideo, Uruguay
Juan Martín Román, Luis Argerich
Mercado Libre
Buenos Aires, Argentina
Abstract
Real-world datasets contain incorrectly labeled instances that hamper the performance of a model and, in particular, its ability to generalize out of distribution. Moreover, each example might contribute differently towards learning. This motivates studies to better understand the role of data instances with respect to their contribution to good metrics in models. In this paper we propose a method based on metrics computed from the training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We focus on datasets containing mostly tabular or structured data, for which ensembles of Decision Trees are still the state-of-the-art in terms of performance. Our methods achieved the best results overall when compared with confident learning, direct heuristics, and a robust boosting algorithm. We show results on detecting noisy labels in order to clean datasets, improving models' metrics on synthetic and real public datasets, as well as on an industry case in which we deployed a model based on the proposed solution.
1 Introduction
Investigating data quality is paramount to allow business analytics and data science teams to extract useful
knowledge from databases. A business rule may be incorrectly defined, or unrealistic conclusions may be
drawn from bad data. Machine Learning models may output useless scores, and Data Science techniques may
provide wrong information for decision support in this context. Therefore, it is important to be able to assess
the quality of training data [Jain et al.(2020), Smith et al.(2015)].
Datasets for learning models can grow fast due to the possibility of leveraging data from the Internet, crowdsourcing of data in the case of academia, or storing transactions and business information in data lakes in the case of industry. However, such sources are prone to noise, in particular when it comes to annotations [Johnson and Khoshgoftaar(2022)]. Even benchmark datasets contain incorrectly labeled instances that affect the performance of the model and, in particular, the ability to generalize out of distribution [Ekambaram et al.(2017), Pulastya et al.(2021)]. In this context, while Machine Learning theory often shows the benefits of having large quantities of data in order to improve the generalization of supervised models, usually via the Law of Large Numbers [Mello and Ponti(2018)], it does not directly address the case of data with a high noise ratio.
∗M. Ponti is also with ICMC/Universidade de São Paulo, São Carlos-SP, Brazil
In fact, different examples might not contribute equally towards learning [Vodrahalli et al.(2018), Sorscher et al.(2022)]. This motivates studies to better understand the role of data instances with respect to their contribution in obtaining good metrics. Instance hardness may be a way towards this idea [Zhou et al.(2020)]. However, more than identifying how hard a given example is for the task at hand, we believe there is significant benefit in segmenting the dataset into examples that are useful to discover patterns and those that are useless for knowledge discovery [Hao et al.(2022), Saha and Srivastava(2014), Frénay and Verleysen(2013)]. Trustworthy data are those with correct labels, ranging from typical examples that are easy to learn, through ambiguous or borderline instances which may require a more complex model to allow learning, to atypical (or rare) instances that are hard to learn.
Therefore, in this paper we propose a method based on metrics computed from the training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. In particular, it uses either XGBoost [Chen and Guestrin(2016)] or LightGBM [Ke et al.(2017)] as base models. Our algorithm is based on the Dataset Cartography idea, originally proposed for Neural Networks in the context of natural language processing datasets [Swayamdipta et al.(2020)]. In contrast, we focus on datasets containing mostly tabular or structured data, for which Decision Tree ensembles are the state-of-the-art in terms of performance, classification metrics, as well as interpretability [Shwartz-Ziv and Armon(2022)]. Also, [Swayamdipta et al.(2020)] devote their main efforts to investigating the use of ambiguous examples to improve generalization, and address mislabeled examples only briefly. In this study we instead focus on detecting noisy labels in order to either remove or relabel them to improve models' metrics.
Our contributions are as follows:
1. We are the first to introduce training dynamics metrics for dataset instances, a.k.a. Dataset Cartography, using ensembles of boosted decision trees (GBDTs);
2. We use the method as part of the pipeline to deploy a production model that classifies forbidden items in a Marketplace platform, and provide guidelines for users that may benefit from the practices shown in our applied data science paper;
3. We propose a novel algorithm that uses the computed training dynamics metrics, in particular the product between correctness and confidence, in conjunction with LightGBM iterative instance weights, to improve noisy label detection;
4. By investigating both Noise Completely At Random (NCAR) and Noise Not At Random (NNAR) settings, we show that removing mislabeled instances may improve the performance of models, outperforming previous work in many scenarios, including real, synthetic, and a production dataset.
2 Related Work
Previous work includes approaches to score dataset instances using confidence [Hovy et al.(2013)] and metrics of hardness [Lorena et al.(2019)]. Beyond measuring confidence or hardness, the field known as “confident learning” [Northcutt et al.(2021)] intends to address the issue of uncertainty in data labels during neural network training. Some important conclusions were drawn in this scenario for multiclass problems, in particular: (i) that label noise is class-conditional [Angluin and Laird(1988)], e.g. in a natural image scenario a dog is more likely to be mislabeled as wolf than as airplane; (ii) that the joint distribution between given (noisy) labels and unknown (true) labels can be estimated via a series of approaches: pruning, counting, and ranking. According to [Northcutt et al.(2021)], pruning is to search for label errors, for example via loss-reweighing to avoid iterative re-labeling [Chen et al.(2019), Patrini et al.(2016)], or using unlabeled data to prune labeled datasets [Sorscher et al.(2022)]. Counting is to train on clean data in order to avoid propagating errors in learned models [Natarajan et al.(2013)]. Ranking is to order the examples to use during training, as in curriculum learning [Zhou et al.(2020)].
Using the learning or training dynamics of neural network models was shown to be useful to identify the quality of instances: for example, comparing score values with those of the highest non-assigned class [Pleiss et al.(2020)], finding instances with low loss values [Shen and Sanghavi(2019)], understanding which instances represent simpler patterns and are easy to learn [Liu et al.(2020)], as well as which are easily forgotten [Toneva et al.(2018)], i.e., misclassified in a later epoch. Such studies show that deep networks are biased towards learning easier examples faster during training. In this context, making sure the deep network memorizes rare and ambiguous instances, while avoiding memorization of easy ones, leads to better generalization [Feldman(2020), Swayamdipta et al.(2020), Li and Vasconcelos(2019)]. Noise in both training and testing data imposed practical limits on performance metrics, requiring novel training approaches [Ponti et al.(2021)]. This is important in the context of neural networks since such models usually require an order of magnitude more data in order to improve metrics by 32% [Sorscher et al.(2022)].
Versions of AdaBoost designed to be robust to noise have been proposed, such as LogitBoost [Friedman et al.(2000)] and later BrownBoost [Freund(2001)]. Also, in [Rätsch et al.(2000)] boosting is defined as a margin maximization problem, inspired by Statistical Learning Theory, and slack variables are introduced to allow for a soft version in which a fraction of instances may lie inside the margin. More than a decade after such studies, Gradient Boosting Decision Trees (GBDTs) were proposed and came to dominate the class of tabular problems, excelling in both performance and speed [Shwartz-Ziv and Armon(2022)]. The most recent one, LightGBM [Ke et al.(2017)], is currently the standard choice in this sense. Decision trees are also shown to be robust to low label noise [Ghosh et al.(2017)], making them a feasible model to investigate under significant noise regimes.
While more recent work addresses issues closely related to deep neural networks and large-scale image and text datasets, studies on datasets containing tabular or structured data are still to be conducted [Renggli et al.(2023)]. The concepts defined in the next section were defined before in different studies such as [Smith et al.(2015), Swayamdipta et al.(2020)], or focus on AdaBoost using an arbitrary or manual choice as a threshold for noise robustness [Karmaker and Kwek(2006), Friedman et al.(2000)]. In this paper we define training dynamics metrics for the first time for GBDTs; we are also the first to use a combination of training dynamics metrics, and we propose an automatic algorithm to define a threshold to assess label noise.
Figure 1: Dataset cartography illustration based on training dynamics: average confidence, variability and correctness allow mapping each instance according to how the model's estimates of its output evolved along iterations, and classifying points as easy, ambiguous, hard, and even noisy.
3 Dataset Cartography using Training Dynamics of Decision Trees
In the context of Boosting-based Decision Tree ensembles, the training dynamics are given by a sequence of trees, each learned using as input the weights of the instances misclassified in the previous iteration. At each iteration i of a GBDT model, the first i trees (estimators) are used to compute the probabilities/scores for each class and all instances of the dataset. An advantage of such ensembles, including LightGBM and XGBoost, is that we are able to compute scores at any iteration using an already trained model, without the need to retrain it from scratch, as would be required for neural networks.
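As a brief illustration, the following minimal sketch (our own, assuming the LightGBM scikit-learn API; the dataset and hyperparameters are purely illustrative) shows how per-iteration scores can be obtained from a single trained model via the num_iteration argument:

    import numpy as np
    import lightgbm as lgb
    from sklearn.datasets import make_classification

    # Toy dataset and a single trained model (hyperparameters are illustrative).
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    model = lgb.LGBMClassifier(n_estimators=50).fit(X, y)

    # Scores at every boosting iteration, from the same trained model:
    # predict_proba(..., num_iteration=i) uses only the first i trees.
    scores_per_iter = np.stack(
        [model.predict_proba(X, num_iteration=i)
         for i in range(1, model.n_estimators + 1)]
    )  # shape: (T, n_samples, n_classes)

XGBoost offers an analogous iteration_range argument in its predict methods.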
We define $p^{(i)}(y_j \mid x_j)$ as the score the model predicts for the training label $y_j$ of each instance $j$, where the input is $x_j$. The predicted label is $\hat{y}_j$. Note that $\hat{y}_j$ (predicted by the model) may be equal to or different from $y_j$ (the training label). This method is supervised, requiring $y_j$, and therefore only allows assessing training instances.
The following training dynamics statistics are computed for each instance:
Confidence: the average score for the true label $y_j$ across all iterations:
$$\mu_j = \frac{1}{T} \sum_{i=1}^{T} p^{(i)}(y_j \mid x_j),$$
where $p^{(i)}$ is the model's score at iteration $i$ for the true label (not the highest score estimated by the model);

Correctness: the fraction of iterations for which the model correctly labels $x_j$:
$$c_j = \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}\left[\hat{y}_j^{(i)} = y_j\right];$$

Variability: the standard deviation of $p^{(i)}$ across iterations:
$$\sigma_j = \sqrt{\frac{\sum_{i=1}^{T} \left(p^{(i)}(y_j \mid x_j) - \mu_j\right)^2}{T}}.$$
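As a toy worked example (the numbers are ours, purely illustrative), consider an instance $j$ with $T = 4$ iterations, true-label scores $p^{(i)}(y_j \mid x_j) = (0.9, 0.8, 0.4, 0.9)$, and the true label predicted in 3 of the 4 iterations. Then
$$\mu_j = \frac{0.9 + 0.8 + 0.4 + 0.9}{4} = 0.75, \qquad c_j = \frac{3}{4} = 0.75,$$
$$\sigma_j = \sqrt{\frac{0.15^2 + 0.05^2 + (-0.35)^2 + 0.15^2}{4}} = \sqrt{\frac{0.17}{4}} \approx 0.21,$$
a moderately confident instance with nontrivial variability, which would fall toward the ambiguous region of the cartography map.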
The metrics are in the range $[0, 1]$. The name “dataset cartography” comes from a visualization of such metrics as proposed by [Swayamdipta et al.(2020)], and illustrated in Figure 1. Algorithm 1 details how to compute these metrics using GBDTs, given a trained model $h(\cdot, \cdot)$, for which it is possible to obtain the output at any iteration $i$, and the dataset $X, Y$ used to train this model.
Algorithm 1 Dataset Cartography: Training Dynamics Metrics for GBDTs
Require: Training set: X, Y
Require: Trained model: h(., .)
Require: Number of estimators/iterations: T
  Ŝ ← ∅  {initialize set of scores}
  Ŷ ← ∅  {initialize set of predicted labels}
  for i = 0 to T − 1 do  {for each iteration}
    ŝ_i ← h(X, i)  {predictions for all classes at iteration i}
    ŷ_i ← arg max(ŝ_i)  {predicted labels of all items}
    Ŝ ← Ŝ ∪ {p^(i)(y|x)}  {scores for the training labels at iteration i}
    Ŷ ← Ŷ ∪ {ŷ_i}  {store predicted labels at iteration i}
  end for
  µ ← mean(Ŝ)  {average of scores for training labels}
  σ ← std(Ŝ)  {spread of scores}
  c ← sum(Ŷ == Y) / T  {fraction of correct predictions}
  return µ, σ, c
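A minimal Python sketch of Algorithm 1 follows, assuming the LightGBM scikit-learn API (the function name training_dynamics and all variable names are ours; a production version would need care with multiclass label encoding):

    import numpy as np
    import lightgbm as lgb
    from sklearn.datasets import make_classification

    def training_dynamics(model, X, y, T):
        # Training dynamics per instance: confidence (mu), variability (sigma)
        # and correctness (c), following Algorithm 1.
        n = X.shape[0]
        true_scores = np.empty((T, n))   # p^(i)(y_j | x_j) for each iteration i
        correct = np.zeros(n)
        for i in range(1, T + 1):
            proba = model.predict_proba(X, num_iteration=i)  # scores using the first i trees
            true_scores[i - 1] = proba[np.arange(n), y]      # score assigned to the training label
            correct += (proba.argmax(axis=1) == y)           # 1 when predicted label matches y_j
        return true_scores.mean(axis=0), true_scores.std(axis=0), correct / T

    # Example usage on a toy dataset; assumes y holds integer class indices
    # aligned with model.classes_ (true for make_classification).
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    model = lgb.LGBMClassifier(n_estimators=30).fit(X, y)
    mu, sigma, c = training_dynamics(model, X, y, T=30)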
Possible interpretations are as follows. A high-confidence example can be considered easy for the model to learn. An example to which the model assigns the same label across iterations has low variability, which together with high confidence makes the example even easier. In contrast, high variability means different subsets of trees provide different scores for an instance, indicating harder instances, usually associated with complex patterns or rare/atypical ones. High correctness is associated with easy instances, although such instances may have confidence ranging from 0.5 to 1.0, while near-zero correctness indicates an instance that the model cannot learn.
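As a hedged illustration of a downstream use, continuing from the sketch above, one can rank instances by the product of confidence and correctness mentioned in our contributions (the fixed cutoff below is arbitrary and merely stands in for the automatic threshold algorithm proposed in this paper):

    # Continuing from the previous sketch: flag likely-mislabeled instances.
    score = mu * c                   # low confidence x correctness suggests a noisy label
    threshold = 0.2                  # illustrative fixed cutoff, not the paper's automatic threshold
    noisy_candidates = np.where(score < threshold)[0]
    print(f"{noisy_candidates.size} instances flagged for review or relabeling")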