Classification by estimating the cumulative distribution function for small data

Classification by estimating the cumulative distribution
function for small data
Meng-Xian Zhu^a, Yuan-Hai Shao^{b,*}
^a School of Economics, Hainan University, Haikou, 570228, P.R. China
^b School of Management, Hainan University, Haikou, 570228, P.R. China
Abstract
In this paper, we study the classification problem by estimating the conditional probability function of the given data. Different from the traditional expected risk estimation theory on empirical data, we compute the probability via a Fredholm equation, which leads to estimating the distribution of the data. Based on the Fredholm equation, a new expected risk estimation theory built on estimating the cumulative distribution function is presented. The main characteristic of the new expected risk estimation is that it measures the risk on the distribution of the input space. The corresponding empirical risk estimation is also presented, and an ε-insensitive L1 cumulative support vector machine (ε-L1VSVM) is proposed by introducing an insensitive loss. It is worth mentioning that the classification models and the classification evaluation indicators based on the new mechanism are different from the traditional ones. Experimental results show the effectiveness of the proposed ε-L1VSVM and the corresponding cumulative distribution function indicator regarding the validity and interpretability of small-data classification.
Keywords: Classification, cumulative distribution function, Fredholm equation,
expected risk estimation, support vector machines.
1. Introduction
Classification [1, 2] is also known as pattern recognition. It is one of the main research problems of machine learning. For binary classification, we are generally given a set of training samples $S = \{(x_i, y_i),\ i = 1, 2, \dots, m\}$, where $x_i \in \mathbb{R}^d$ is the input and $y_i \in \{0, 1\}$ is the corresponding output, $i = 1, \dots, m$.
Corresponding author.
Email address: shaoyuanhai21@163.com (Yuan-Hai Shao)
Preprint submitted to Elsevier October 14, 2022
arXiv:2210.05953v2 [cs.LG] 13 Oct 2022
Suppose the given data are drawn independently from an identical distribution (i.i.d.); our task is to deduce the output $y$ of any input $x$ from the given data set. In fact, the classification problem corresponds to the problem of estimating the conditional probability $P(y|x)$ in statistical inference [3, 4], and it has been studied extensively in both machine learning and statistical inference [5-7].
A common way to estimate the conditional probability in classification is to suppose a function $f(x) \approx P(y|x)$. For the binary classification problem, since $P(y=1|x) + P(y=0|x) = 1$, without loss of generality, we describe the problem of estimating the conditional probability as estimating $P(y=1|x)$ from now on. One of the most important theories for classification is the expected risk minimization theory [8], whose core idea is to minimize the error between the estimation function $f(x)$ and the real output in the expected space. The advantage of the expected risk minimization theory is that it does not require many assumptions about the estimation function. Most discriminant classifiers are based on the expected risk minimization theory, such as least squares classifiers (LSC) [9], neural networks (NN) [10], support vector machines (SVM) [11, 12], and ensemble learning (EL) [13]. In fact, when estimating the expected risk, it is necessary to estimate the distribution structure $F(x, y)$ of the data. For simplicity of calculation, most models assume that the data are uniformly distributed, so there is no operation to estimate the data distribution.
Recently, Vapnik and Izmailov [14] further studied the classification problem by estimating the cumulative distribution function and the conditional probability, and proposed a new way to estimate the conditional probability $P(y=1|x)$ by directly solving the integral equation based on the definitions of the conditional probability and the density function. To solve the integral equation, [14, 15] presented a least squares implementation called VSVM and compared it with the traditional least squares and support vector machine classifiers. The advantage of using the integral equation is that $f(x)$ can be solved by estimating the distribution function in the input space, and the integral equation has wider convergence properties. Therefore, this method has attracted attention in different machine learning fields [16-19]. However, the general paradigm of inferring $f(x)$ by using the integral equation has not been discussed, and the corresponding expected risk minimization theory and empirical estimation theory are not yet clear.
In this paper, different from the traditional expected risk estimation theory on empirical data, we establish the original theoretical model for directly estimating the conditional probability within the expected risk framework via the Fredholm equation, and present the corresponding empirical model based on cumulative distribution function estimation. An ε-insensitive empirical risk estimation is established and an ε-insensitive L1-loss cumulative support vector machine (ε-L1VSVM) is presented. It is worth mentioning that the classification evaluation indicator obtained by estimating the Fredholm equation is different from the traditional one. Experimental results show the effectiveness of the proposed ε-L1VSVM and the corresponding cumulative distribution function indicator on the validity and interpretability of classification, especially when the training data is insufficient. The main contributions are summarized as follows:
i) For estimating $P(y=1|x)$ by using the cumulative distribution function, we perform an objective transformation of the Fredholm integral equation to estimate conditional probabilities from a probabilistic perspective. The main characteristic of using the Fredholm integral equation is that the function to be estimated and the real output are equal under the cumulative distribution function.
ii) For estimating $P(y=1|x)$ by using statistical estimation, the expected risk minimization and empirical risk minimization theories based on the Fredholm integral equation are presented. Since the cumulative distribution function must be estimated, the loss function and the expected risk defined by the Fredholm integral equation differ from the traditional ones.
iii) Different from the traditional 0-1 loss, an insensitive way of estimating the empirical risk using the ε-insensitive loss is proposed; it yields sparsity and higher efficiency than the least squares loss (a sketch of this loss is given after this list). An ε-L1VSVM is then proposed based on the ε-insensitive loss. A new evaluation criterion with distribution information is also proposed: an out-of-sample estimate of the $v$ value of a testing sample.
iv) Experimental results show the effectiveness of the proposed ε-L1VSVM and
the corresponding cumulative distribution function indicator on validity and
interpretability of classification, especially when the training data is insufficient.
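As a small illustration of the loss mentioned in contribution iii), the sketch below implements the standard ε-insensitive L1 loss, $L_\varepsilon(u) = \max(0, |u| - \varepsilon)$; the full ε-L1VSVM model built on it is developed later in the paper, so the function and parameter names here (eps_insensitive_loss, eps) are illustrative assumptions only.

```python
import numpy as np

def eps_insensitive_loss(residual, eps=0.1):
    """Standard epsilon-insensitive L1 loss: zero inside the eps-tube,
    linear (absolute) outside it. `residual` may be a scalar or array."""
    return np.maximum(0.0, np.abs(residual) - eps)

# Example: residuals inside the tube contribute no loss, which is the
# source of the sparsity mentioned above.
r = np.array([-0.05, 0.0, 0.08, 0.5, -1.2])
print(eps_insensitive_loss(r, eps=0.1))   # [0.  0.  0.  0.4 1.1]
```

Compared with the squared loss, residuals smaller than ε are ignored entirely, which is what produces a sparse set of support vectors.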
The rest of the paper is organized as follows. In section 2, we introduce the implementation path of classical models for classification and give a brief overview of the method for direct estimation of conditional probabilities described in VSVM. In section 3, we give a theoretical framework for estimating conditional probabilities with the empirical distribution function, which includes a rigorous derivation and transformation of the integral equation from a probabilistic point of view and a theoretical study of estimating the equation in a statistical sense. In section 4, specific estimation models based on the proposed theoretical framework and an improved evaluation indicator are presented. In section 5, we show experimental results on artificial datasets and a number of publicly available datasets. In section 6, we give some concluding remarks.
2. Preliminaries
2.1. Risk minimization in classification
In the following, we consider the binary classification model of learning from training samples through three components [3]:
1) A generator of random vectors $x \in \mathcal{X}$, drawn independently from a fixed but unknown probability distribution function $F(x)$.
2) A supervisor who returns a binary output value $y_i \in \mathcal{Y} := \{0, 1\}$ to every input vector $x_i$, according to a conditional probability function $P(y=1|x)$, where $P(y=0|x) = 1 - P(y=1|x)$, also fixed but unknown.
3) A learning machine capable of implementing a set of indicator functions (functions which take only two values: zero and one) $\mathcal{F}$.
The problem of learning is that of choosing, from the given set of indicator functions $\mathcal{F}$, the one that best approximates the supervisor's response. The selection of the desired function is based on a training set of $m$ i.i.d. observations
$$(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m), \quad x_i \in \mathcal{X},\ y_i \in \mathcal{Y}, \tag{1}$$
drawn according to $F(x, y) = F(x)P(y|x)$. To find the function $y = f(x)$ (the mapping $f: \mathcal{X} \to \mathcal{Y}$) based on a training set that satisfies the above basic assumptions, we need some measure of how good a function is as a decision function. To this end, we introduce the concept of a loss function. This is a function $L(y, f(x))$ which tells us the cost of classifying instance $x \in \mathcal{X}$ as $y \in \mathcal{Y}$. For example, the natural loss function is the 0-1 loss
$$L_{0/1}(y, f(x)) = I(y, f(x)), \tag{2}$$
where
$$I(u, v) = \begin{cases} 0, & \text{if } u = v, \\ 1, & \text{if } u \neq v. \end{cases}$$
The expectation of function (2) determines the probability that the answer given by the indicator function $f(x)$ differs from the label $y$. We call the case of different answers a classification error. While the loss function measures the error of a function on some individual data sample, the risk of a function is the average loss over data samples generated according to the underlying distribution $F(x, y)$,
$$R(f) = \mathbb{E}_{X,Y}[L(y, f(x))] = \int_{\mathcal{X} \times \mathcal{Y}} L(y, f(x))\, dF(x, y). \tag{3}$$
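As a concrete illustration of (3) (not part of the paper), the following sketch approximates the risk of a fixed classifier by averaging the 0-1 loss over a large sample drawn from a hypothetical toy distribution $F(x, y)$; the distribution and the classifier are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xy(n):
    """Toy joint distribution F(x, y): x ~ N(0, 1), P(y=1|x) = sigmoid(3x)."""
    x = rng.normal(size=n)
    p1 = 1.0 / (1.0 + np.exp(-3.0 * x))
    y = (rng.random(n) < p1).astype(int)
    return x, y

def f(x):
    """A fixed indicator function: predict 1 when x > 0."""
    return (x > 0).astype(int)

# Monte Carlo approximation of R(f) = E[L_{0/1}(y, f(x))]:
# with many samples, the average 0-1 loss approaches the true risk.
x, y = sample_xy(200_000)
risk = np.mean(y != f(x))
print(f"estimated risk of f: {risk:.3f}")
```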
4
Therefore, the goal of learning is to find the function which minimizes the risk function (3) in the given set of indicator functions $\mathcal{F}$ when the probability measure $F(x, y)$ is unknown but the training data (1) are given.

Since the training data (1) are sampled from the underlying probability distribution $F(x, y)$, we can infer a function $f(x)$ from (1) whose risk is close to the expected risk. In practice, we approximate the unknown cumulative distribution function $F(x, y)$ by its joint empirical cumulative distribution function
$$F_m(x, y) = \frac{1}{m}\sum_{i=1}^{m} \theta(x - x_i)\,\theta(y - y_i), \tag{4}$$
where the one-dimensional step function is defined as
$$\theta(x - x_i) = \begin{cases} 1, & \text{if } x \geq x_i, \\ 0, & \text{otherwise}, \end{cases} \tag{5}$$
and the multi-dimensional step function for $x = (x^1, \dots, x^d)$ is defined as
$$\theta(x - x_i) = \prod_{k=1}^{d} \theta\left(x^k - x_i^k\right).$$
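To make definitions (4)-(5) concrete, here is a minimal sketch (not from the paper) that evaluates the multivariate empirical cumulative distribution function at a query point, assuming the 1/m normalization used above; names such as empirical_cdf are illustrative.

```python
import numpy as np

def empirical_cdf(X, y, x_query, y_query):
    """Evaluate F_m(x, y) of (4): the fraction of training pairs (x_i, y_i)
    with x_i <= x_query coordinate-wise and y_i <= y_query."""
    X = np.asarray(X, dtype=float)          # shape (m, d)
    y = np.asarray(y, dtype=float)          # shape (m,)
    step_x = np.all(X <= x_query, axis=1)   # multi-dimensional step function
    step_y = (y <= y_query)                 # one-dimensional step on labels
    return np.mean(step_x & step_y)

# Usage on a tiny 2-D sample with binary labels: two of the three points
# lie below the query point in every coordinate, so F_m = 2/3.
X = np.array([[0.2, 0.1], [0.5, 0.9], [0.8, 0.3]])
y = np.array([0, 1, 1])
print(empirical_cdf(X, y, x_query=np.array([0.6, 1.0]), y_query=1))
```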
Using approximation (4) in the expected loss function (3), we obtain the following empirical loss function
$$R_m(f) = \frac{1}{m}\sum_{i=1}^{m} L(y_i, f(x_i)). \tag{6}$$
That is, given some training data (1), a loss function $L$, and a function space $\mathcal{F}$ to work with, we define the classifier $f_m$ as
$$f_m := \arg\min_{f \in \mathcal{F}} R_m(f).$$
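The sketch below (an illustration, not the paper's method) carries out this minimization for a deliberately simple function class, one-dimensional threshold classifiers $f_t(x) = 1\{x > t\}$, by picking the threshold that minimizes the empirical 0-1 risk (6); the choice of threshold classifiers and the grid of candidates are assumptions made for the example.

```python
import numpy as np

def erm_threshold(x_train, y_train):
    """Minimize the empirical 0-1 risk R_m over threshold classifiers
    f_t(x) = 1{x > t}; return the best threshold and its empirical risk."""
    # Candidate thresholds: midpoints between sorted training inputs,
    # plus one threshold below and one above all the data.
    xs = np.sort(x_train)
    candidates = np.concatenate(([xs[0] - 1.0],
                                 (xs[:-1] + xs[1:]) / 2.0,
                                 [xs[-1] + 1.0]))
    risks = [np.mean(y_train != (x_train > t).astype(int)) for t in candidates]
    best = int(np.argmin(risks))
    return candidates[best], risks[best]

# Usage on a tiny 1-D training set.
x_train = np.array([-1.5, -0.7, -0.2, 0.1, 0.6, 1.3])
y_train = np.array([0, 0, 1, 1, 1, 1])
t, r = erm_threshold(x_train, y_train)
print(f"chosen threshold t = {t:.2f}, empirical risk R_m = {r:.3f}")
```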
This approach is called the empirical risk minimization (ERM) induction principle. The ERM principle is justified by the law of large numbers. However, when the sample size $m$ is relatively small, ERM may be inconsistent with the expected risk minimization. To address this problem, Vapnik [3] considers the effect of the set of functions $\mathcal{F}$ on the expected risk and restricts the set of admissible functions to render empirical risk minimization consistent, obtaining the following two lemmas.