Improving Data Quality with Training Dynamics of
Gradient Boosting Decision Trees
A Preprint
Moacir A. Ponti∗, Lucas de Angelis Oliveira
Mercado Livre
Osasco, Brazil
moacir.ponti@mercadolibre.com
Valentina Garcia
Mercado Libre
Medellín, Colombia
Mathias Esteban
Mercado Libre
Montevideo, Uruguay
Juan Martín Román, Luis Argerich
Mercado Libre
Buenos Aires, Argentina
Abstract
Real-world datasets contain incorrectly labeled instances that hamper the performance of a model and, in particular, its ability to generalize out of distribution. Moreover, each example might contribute differently towards learning. This motivates studies to better understand the role of data instances with respect to their contribution to good metrics in models. In this paper we propose a method based on metrics computed from the training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We focus on datasets containing mostly tabular or structured data, for which ensembles of Decision Trees are still the state-of-the-art in terms of performance. Our methods achieved the best results overall when compared with confident learning, direct heuristics, and a robust boosting algorithm. We show results on detecting noisy labels in order to clean datasets, improving models' metrics on synthetic and real public datasets, as well as on an industry case in which we deployed a model based on the proposed solution.
1 Introduction
Investigating data quality is paramount to allow business analytics and data science teams to extract useful
knowledge from databases. A business rule may be incorrectly defined, or unrealistic conclusions may be
drawn from bad data. Machine Learning models may output useless scores, and Data Science techniques may
provide wrong information for decision support in this context. Therefore, it is important to be able to assess
the quality of training data [Jain et al.(2020), Smith et al.(2015)].
Datasets for learning models can grow fast due to the possibility of leveraging data from the Internet, crowdsourcing of data in the case of academia, or storing transactions and business information in data lakes in the case of industry. However, such sources are prone to noise, in particular when it comes to annotations [Johnson and Khoshgoftaar(2022)]. Even benchmark datasets contain incorrectly labeled instances that affect the performance of the model and, in particular, the ability to generalize out of distribution [Ekambaram et al.(2017), Pulastya et al.(2021)]. In this context, while Machine Learning theory often shows the benefits of having large quantities of data in order to improve the generalization of supervised models, usually via the Law of Large Numbers [Mello and Ponti(2018)], it does not directly address the case of data with a high noise ratio.
∗M. Ponti is also with ICMC/Universidade de São Paulo, São Carlos-SP, Brazil
In fact, different examples might not contribute equally towards learning [Vodrahalli et al.(2018), Sorscher et al.(2022)]. This motivates studies to better understand the role of data instances with respect to their contribution in obtaining good metrics. Instance hardness may be a way towards this idea [Zhou et al.(2020)]. However, more than identifying how hard a given example is for the task at hand, we believe there is significant benefit in segmenting the dataset into examples that are useful to discover patterns and those that are useless for knowledge discovery [Hao et al.(2022), Saha and Srivastava(2014), Frénay and Verleysen(2013)]. Trustworthy data are those with correct labels, ranging from typical examples that are easy to learn, through ambiguous or borderline instances which may require a more complex model to allow learning, to atypical (or rare) instances that are hard to learn.
Therefore, in this paper we propose a method based on metrics computed from the training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. In particular, it uses either XGBoost [Chen and Guestrin(2016)] or LightGBM [Ke et al.(2017)] as base models. Our algorithm is based on the Dataset Cartography idea, originally proposed for Neural Networks in the context of natural language processing datasets [Swayamdipta et al.(2020)]. In contrast, we focus on datasets containing mostly tabular or structured data, for which Decision Tree ensembles are the state-of-the-art in terms of performance, classification metrics, as well as interpretability [Shwartz-Ziv and Armon(2022)]. Also, [Swayamdipta et al.(2020)] devote their main efforts to investigating the use of ambiguous examples to improve generalization, and address mislabeled examples only briefly. In this study we instead focus on detecting noisy labels in order to either remove or relabel them to improve models' metrics.
Our contributions are as follows:
1. We are the first to introduce training dynamics metrics for dataset instances, a.k.a. Dataset Cartography, using ensembles of boosted decision trees (GBDTs);
2. We use the method as part of the pipeline to deploy a production model that classifies forbidden items in a Marketplace platform, and provide guidelines for users that may benefit from the practices shown in our applied data science paper;
3. We propose a novel algorithm that uses the computed training dynamics metrics, in particular the product between correctness and confidence, in conjunction with LightGBM iterative instance weights, to improve noisy label detection;
4. By investigating both Noise Completely At Random (NCAR) and Noise Not At Random (NNAR) settings, we show that removing mislabeled instances may improve the performance of models, outperforming previous work in many scenarios, including real, synthetic, and a production dataset.
2 Related Work
Previous work includes approaches to score dataset instances using confidence [Hovy et al.(2013)] and metrics of hardness [Lorena et al.(2019)]. Beyond measuring confidence or hardness, the field known as “confident learning” [Northcutt et al.(2021)] intends to address the issue of uncertainty in data labels during neural network training. Some important conclusions were drawn in this scenario for multiclass problems, in particular: (i) that label noise is class-conditional [Angluin and Laird(1988)], e.g. in a natural image scenario a dog is more likely to be mislabeled as wolf than as airplane; (ii) that the joint distribution between given (noisy) labels and unknown (true) labels can be estimated via a series of approaches: pruning, counting, and ranking. According to [Northcutt et al.(2021)], pruning is to search for label errors, for example via loss-reweighing to avoid iterative re-labeling [Chen et al.(2019), Patrini et al.(2016)], or using unlabeled data to prune labeled datasets [Sorscher et al.(2022)]. Counting is to train on clean data in order to avoid propagating errors in learned models [Natarajan et al.(2013)]. Ranking is to order the examples to use during training, as in curriculum learning [Zhou et al.(2020)].
Using the learning or training dynamics of neural network models was shown to be useful to identify the quality of instances: for example, comparing score values with those of the highest non-assigned class [Pleiss et al.(2020)], finding instances with low loss values [Shen and Sanghavi(2019)], understanding which instances represent simpler patterns and are easy to learn [Liu et al.(2020)], as well as which are easily forgotten [Toneva et al.(2018)], i.e., misclassified in a later epoch. Such studies show that deep networks are biased towards learning easier examples faster during training. In this context, making sure the deep network memorizes rare and ambiguous instances, while avoiding memorization of easy ones, leads to better generalization [Feldman(2020), Swayamdipta et al.(2020), Li and Vasconcelos(2019)]. Noise in both training and testing data imposed practical limits on performance metrics, requiring novel training approaches [Ponti et al.(2021)]. This is important in the context of neural networks since such models usually require an order of magnitude more data in order to improve metrics by 32% [Sorscher et al.(2022)].
Versions of AdaBoost designed to be robust to noise have been proposed, such as LogitBoost [Friedman et al.(2000)] and later BrownBoost [Freund(2001)]. Also, in [Rätsch et al.(2000)] boosting is defined as a margin maximization problem, inspired by Statistical Learning Theory, and slack variables are introduced to allow for a soft version in which a fraction of instances may lie inside the margin. More than a decade after such studies, Gradient Boosting Decision Trees (GBDTs) were proposed and came to dominate the class of tabular problems, excelling in both performance and speed [Shwartz-Ziv and Armon(2022)]. The most recent one, LightGBM [Ke et al.(2017)], is currently the standard choice in this sense. Decision trees are also shown to be robust to low label noise [Ghosh et al.(2017)], making them a feasible model to investigate under significant noise regimes.
While more recent work addresses issues closely related to deep neural networks and large-scale image and text datasets, studies on datasets containing tabular or structured data are still to be conducted [Renggli et al.(2023)]. The concepts defined in the next section were defined before in different studies such as [Smith et al.(2015), Swayamdipta et al.(2020)], or focus on AdaBoost using an arbitrary or manual choice as a threshold for noise robustness [Karmaker and Kwek(2006), Friedman et al.(2000)]. In this paper we define training dynamics metrics for the first time for GBDTs; we are also the first to use a combination of training dynamics metrics, and we propose an automatic algorithm to define a threshold to assess label noise.
Figure 1: Dataset cartography illustration based on training dynamics: average confidence, variability and correctness allow mapping each instance according to how the model's estimates of its output evolved along iterations, and classifying points as easy, ambiguous, hard, and even noisy.
3 Dataset Cartography using Training Dynamics of Decision Trees
In the context of Boosting-based Decision Tree ensembles, the training dynamics are given by a sequence of trees, each learned using as input the weights of the instances misclassified in the previous iteration. At each iteration i of a GBDT model, the first i trees (estimators) are used to compute the probabilities/scores for each class and all instances of the dataset. An advantage of such ensembles, including LightGBM and XGBoost, is that we are able to compute scores at any iteration using an already trained model, without the need to retrain it from scratch, as would be required for neural networks.
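As a brief illustration, the following minimal sketch (our own, assuming the LightGBM scikit-learn API; the dataset and hyperparameters are purely illustrative) shows how per-iteration scores can be obtained from a single trained model via the num_iteration argument:

    import numpy as np
    import lightgbm as lgb
    from sklearn.datasets import make_classification

    # Toy dataset and a single trained model (hyperparameters are illustrative).
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    model = lgb.LGBMClassifier(n_estimators=50).fit(X, y)

    # Scores at every boosting iteration, from the same trained model:
    # predict_proba(..., num_iteration=i) uses only the first i trees.
    scores_per_iter = np.stack(
        [model.predict_proba(X, num_iteration=i)
         for i in range(1, model.n_estimators + 1)]
    )  # shape: (T, n_samples, n_classes)

XGBoost offers an analogous iteration_range argument in its predict methods.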
We define $p^{(i)}(y_j \mid x_j)$ as the score the model predicts for the training label $y_j$ of each instance $j$, where the input is $x_j$. The predicted label is $\hat{y}_j$. Note that $\hat{y}_j$ (predicted by the model) may be equal to or different from $y_j$ (the training label). This method is supervised, requiring $y_j$, and therefore only allows assessing training instances.
The following training dynamics statistics are computed for each instance:
Confidence: the average score for the true label $y_j$ across all iterations:
$$\mu_j = \frac{1}{T} \sum_{i=1}^{T} p^{(i)}(y_j \mid x_j),$$
where $p^{(i)}$ is the model's score at iteration $i$ for the true label (not the highest score estimated by the model);

Correctness: the fraction of iterations for which the model correctly labels $x_j$:
$$c_j = \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}\left[\hat{y}_j^{(i)} = y_j\right];$$

Variability: the standard deviation of $p^{(i)}$ across iterations:
$$\sigma_j = \sqrt{\frac{\sum_{i=1}^{T} \left(p^{(i)}(y_j \mid x_j) - \mu_j\right)^2}{T}}.$$
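As a toy worked example (the numbers are ours, purely illustrative), consider an instance $j$ with $T = 4$ iterations, true-label scores $p^{(i)}(y_j \mid x_j) = (0.9, 0.8, 0.4, 0.9)$, and the true label predicted in 3 of the 4 iterations. Then
$$\mu_j = \frac{0.9 + 0.8 + 0.4 + 0.9}{4} = 0.75, \qquad c_j = \frac{3}{4} = 0.75,$$
$$\sigma_j = \sqrt{\frac{0.15^2 + 0.05^2 + (-0.35)^2 + 0.15^2}{4}} = \sqrt{\frac{0.17}{4}} \approx 0.21,$$
a moderately confident instance with nontrivial variability, which would fall toward the ambiguous region of the cartography map.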
The metrics are in the range $[0, 1]$. The name “dataset cartography” comes from a visualization of such metrics as proposed by [Swayamdipta et al.(2020)], and illustrated in Figure 1. Algorithm 1 details how to compute these metrics using GBDTs, given a trained model $h(\cdot, \cdot)$, for which it is possible to obtain the output at any iteration $i$, and the dataset $X, Y$ used to train this model.
Algorithm 1 Dataset Cartography: Training Dynamics Metrics for GBDTs
Require: Training set: X, Y
Require: Trained model: h(., .)
Require: Number of estimators/iterations: T
  Ŝ ← ∅  {initialize set of scores}
  Ŷ ← ∅  {initialize set of predicted labels}
  for i = 0 to T − 1 do  {for each iteration}
    ŝ_i ← h(X, i)  {predictions for all classes at iteration i}
    ŷ_i ← arg max(ŝ_i)  {predicted labels of all items}
    Ŝ ← Ŝ ∪ {p^(i)(y|x)}  {scores for the training labels at iteration i}
    Ŷ ← Ŷ ∪ {ŷ_i}  {store predicted labels at iteration i}
  end for
  µ ← mean(Ŝ)  {average of scores for training labels}
  σ ← std(Ŝ)  {spread of scores}
  c ← sum(Ŷ == Y) / T  {fraction of correct predictions}
  return µ, σ, c
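A minimal Python sketch of Algorithm 1 follows, assuming the LightGBM scikit-learn API (the function name training_dynamics and all variable names are ours; a production version would need care with multiclass label encoding):

    import numpy as np
    import lightgbm as lgb
    from sklearn.datasets import make_classification

    def training_dynamics(model, X, y, T):
        # Training dynamics per instance: confidence (mu), variability (sigma)
        # and correctness (c), following Algorithm 1.
        n = X.shape[0]
        true_scores = np.empty((T, n))   # p^(i)(y_j | x_j) for each iteration i
        correct = np.zeros(n)
        for i in range(1, T + 1):
            proba = model.predict_proba(X, num_iteration=i)  # scores using the first i trees
            true_scores[i - 1] = proba[np.arange(n), y]      # score assigned to the training label
            correct += (proba.argmax(axis=1) == y)           # 1 when predicted label matches y_j
        return true_scores.mean(axis=0), true_scores.std(axis=0), correct / T

    # Example usage on a toy dataset; assumes y holds integer class indices
    # aligned with model.classes_ (true for make_classification).
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    model = lgb.LGBMClassifier(n_estimators=30).fit(X, y)
    mu, sigma, c = training_dynamics(model, X, y, T=30)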
Possible interpretations are as follows. A high-confidence example can be considered easy for the model to learn. An example to which the model assigns the same label across iterations has low variability, which together with high confidence makes the example even easier. In contrast, high variability means different subsets of trees provide different scores for an instance, indicating harder instances, usually associated with complex patterns or rare/atypical ones. High correctness is associated with easy instances, although such instances may have confidence ranging from 0.5 to 1.0, while near-zero correctness indicates an instance that the model cannot learn.
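As a hedged illustration of a downstream use, continuing from the sketch above, one can rank instances by the product of confidence and correctness mentioned in our contributions (the fixed cutoff below is arbitrary and merely stands in for the automatic threshold algorithm proposed in this paper):

    # Continuing from the previous sketch: flag likely-mislabeled instances.
    score = mu * c                   # low confidence x correctness suggests a noisy label
    threshold = 0.2                  # illustrative fixed cutoff, not the paper's automatic threshold
    noisy_candidates = np.where(score < threshold)[0]
    print(f"{noisy_candidates.size} instances flagged for review or relabeling")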