Parametric PDF for Goodness of Fit
Natan Katz
natan.katz@gmail.com
Uri Itai
uri.itai@gmail.com
October 2022
Abstract
The methods for the goodness of fit in classification problems require
a prior threshold for determining the confusion matrix. Nonetheless, this
fixed threshold removes information that the model’s curves provide and
that can be used for further studies such as risk evaluation and stability
analysis. We present a different framework that allows us to perform this
study using a parametric PDF.
1 Introduction
Machine learning (ML) projects have become a leading tool in numerous
domains of the computer industry. Their role goes far beyond computational
aspects. Indeed, they are a focal point in designing analytical business
decisions. The commercial usage of these models raises new challenges.
Academic ML research often assumes that:
• The data in the database represents the global data distribution well.
• The training methodology aligns with the model’s KPI.
• There are no production-driven drawbacks.
Unfortunately, none of these assumptions hold in real-world models. In
addition, cardinal issues concerning complexity and stability, and questions
such as “what is the efficient way to set a threshold to have both good
and stable performance”, are rarely addressed in academia. Hence, deploying
ML models in the real world requires a methodology that academia does not
provide. In academia, researchers focus mainly on common KPIs such as
accuracy and precision. These KPIs are also used to build other scaling
indicators such as Cramér’s V, F1-score, AUC [Uri22] and the Matthews
correlation coefficient (MCC) [CJ20; JRF12; AD54; Uri22]. These indicators
require a prior threshold. Thus they all act as discrete signals. In the
following sections, we discuss the derived drawbacks of discrete signals
and suggest solutions.
arXiv:2210.14005v2 [cs.LG] 1 Nov 2022
2 Discrete Signals
In this section, we discuss the disadvantages of discrete signals. To do so, we
need to review the typical inference process.
2.1 Inference Overview
Consider a well-trained model $M$ and an evaluation set $D_{\text{test}}$.
One can easily deduce from Figure 1 that the confusion matrix fully
determines the model’s evaluation. This leads to the following definition.
Definition [Discrete signal]: Let $M$ be the confusion matrix. Consider the
function
$$F : M \to \mathbb{R}$$
If $F$ is monotone in each entry of $M$, then $F$ is a discrete signal.
If $F$ does not depend on $M$, then it is called a continuous signal. We
note that the domain of a discrete signal can be any nonempty subset of
the entries of $M$.
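As an illustration of the definition above, the following minimal Python sketch writes a few familiar KPIs as functions of the four confusion-matrix entries. The `ConfusionMatrix` type, the function names, and the example counts are assumptions made for this illustration and do not come from the paper. Returning 0.0 when a denominator is empty is likewise a convention chosen here.

```python
import math
from typing import NamedTuple


class ConfusionMatrix(NamedTuple):
    """Binary confusion matrix (hypothetical helper type for this sketch)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def accuracy(m: ConfusionMatrix) -> float:
    total = m.tp + m.fp + m.fn + m.tn
    return (m.tp + m.tn) / total if total else 0.0


def precision(m: ConfusionMatrix) -> float:
    denom = m.tp + m.fp
    return m.tp / denom if denom else 0.0


def f1_score(m: ConfusionMatrix) -> float:
    denom = 2 * m.tp + m.fp + m.fn
    return 2 * m.tp / denom if denom else 0.0


def mcc(m: ConfusionMatrix) -> float:
    """Matthews correlation coefficient from the four entries."""
    num = m.tp * m.tn - m.fp * m.fn
    denom = math.sqrt(
        (m.tp + m.fp) * (m.tp + m.fn) * (m.tn + m.fp) * (m.tn + m.fn)
    )
    return num / denom if denom else 0.0


# Illustrative counts only; not data from the paper.
example = ConfusionMatrix(tp=40, fp=10, fn=5, tn=45)

if __name__ == "__main__":
    for signal in (accuracy, precision, f1_score, mcc):
        print(signal.__name__, round(signal(example), 4))
```

Each function reads only the entries of the matrix, so its value is fixed once the threshold that produced the matrix is fixed.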
The output of a classification model is a vector of probabilities [pyt16;
skl]. We use these vectors to calculate the FR and TR curves. To classify
the data, we set a threshold. This threshold determines the confusion
matrix, which is the domain of the discrete signals [Uri22]. Most of the
common goodness-of-fit KPIs are discrete signals. Nonetheless, these
signals may suffer from three essential disadvantages:
• Instability with respect to the threshold
• Difficulty in risk calculations
• Absence of a good mathematical toolbox
In the following subsections, we discuss these disadvantages.
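The thresholding step described in Section 2.1 can be sketched as follows. This is a minimal illustration for the binary case, assuming the model emits one positive-class probability per sample and that a sample is classified as positive when its probability is at least the threshold. The function name `confusion_at_threshold` and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def confusion_at_threshold(
    probs: Sequence[float],
    labels: Sequence[int],
    threshold: float,
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}.

    Assumption of this sketch: predict positive when prob >= threshold.
    """
    tp = fp = fn = tn = 0
    for p, y in zip(probs, labels):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Toy scores and labels; illustrative only.
probs = [0.95, 0.80, 0.62, 0.55, 0.48, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0]

if __name__ == "__main__":
    for t in (0.45, 0.50, 0.60):
        print(t, confusion_at_threshold(probs, labels, t))
```

The same probability vector yields a different confusion matrix for each threshold, which is why every KPI computed from the matrix inherits the choice of threshold.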
2.2 Instability
A model’s performance has a substantial financial impact. Therefore, it is
crucial to evaluate our indicators accurately. Setting a fixed threshold on
the model graphs raises two caveats:
• Typical graphs have steep slopes with respect to the threshold.
• Real-world statistics are not always identical to the distribution of
  the evaluation set.
Academically, these phenomena are seldom studied. Nonetheless, differing
distributions and steep slopes often indicate instability. Thus, we
consider these caveats cardinal in the commercial world.
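One simple way to see the steep-slope caveat is to evaluate a discrete signal on a grid of thresholds and look at its finite-difference slope. The sketch below does this for the F1-score. It is an illustration under stated assumptions, not the framework proposed in this paper: the helper names, the threshold grid, and the toy scores (deliberately clustered around 0.5) are all hypothetical.

```python
from typing import List, Sequence, Tuple


def f1_at_threshold(
    probs: Sequence[float], labels: Sequence[int], threshold: float
) -> float:
    """F1-score when predicting positive for prob >= threshold."""
    pairs = list(zip(probs, labels))
    tp = sum(1 for p, y in pairs if p >= threshold and y == 1)
    fp = sum(1 for p, y in pairs if p >= threshold and y == 0)
    fn = sum(1 for p, y in pairs if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def threshold_sensitivity(
    probs: Sequence[float],
    labels: Sequence[int],
    thresholds: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return F1 values on the grid and slopes between adjacent points."""
    values = [f1_at_threshold(probs, labels, t) for t in thresholds]
    slopes = [
        (values[i + 1] - values[i]) / (thresholds[i + 1] - thresholds[i])
        for i in range(len(thresholds) - 1)
    ]
    return values, slopes


# Toy scores clustered near 0.5; illustrative only.
probs = [0.52, 0.51, 0.50, 0.49, 0.48, 0.47, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
thresholds = [0.40, 0.50, 0.60]

values, slopes = threshold_sensitivity(probs, labels, thresholds)

if __name__ == "__main__":
    for t, v in zip(thresholds, values):
        print(f"threshold={t:.2f}  F1={v:.3f}")
    print("slopes:", [round(s, 2) for s in slopes])
```

In this toy example a small move of the threshold past the cluster of scores changes the F1-score sharply, which is the kind of instability a single fixed threshold can hide.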
Figure 1: Generic Inference Process