Classification by estimating the cumulative distribution function for small data

Classification by estimating the cumulative distribution
function for small data
Meng-Xian Zhu^a, Yuan-Hai Shao^{b,*}
^a School of Economics, Hainan University, Haikou, 570228, P.R. China
^b School of Management, Hainan University, Haikou, 570228, P.R. China
Abstract
In this paper, we study the classification problem by estimating the conditional probability function of the given data. Different from the traditional expected risk estimation theory on empirical data, we compute the probability via a Fredholm equation, which leads to estimating the distribution of the data. Based on the Fredholm equation, a new expected risk estimation theory built on estimating the cumulative distribution function is presented. The main characteristic of the new expected risk estimation is that it measures the risk on the distribution of the input space. The corresponding empirical risk estimation is also presented, and an ε-insensitive L1 cumulative support vector machine (ε-L1VSVM) is proposed by introducing an insensitive loss. It is worth mentioning that the classification models and the classification evaluation indicators based on the new mechanism are different from the traditional ones. Experimental results show the effectiveness of the proposed ε-L1VSVM and the corresponding cumulative distribution function indicator regarding the validity and interpretability of small-data classification.
Keywords: Classification, cumulative distribution function, Fredholm equation,
expected risk estimation, support vector machines.
1. Introduction
Classification [1, 2] is also known as pattern recognition. It is one of the main research problems of machine learning. For binary classification, we are generally given a set of training samples $S = \{(x_i, y_i),\ i = 1, 2, \dots, m\}$, where $x_i \in \mathbb{R}^d$ is the input and $y_i \in \{0, 1\}$ is the corresponding output, $i = 1, \dots, m$.
Corresponding author.
Email address: shaoyuanhai21@163.com (Yuan-Hai Shao)
Preprint submitted to Elsevier October 14, 2022
arXiv:2210.05953v2 [cs.LG] 13 Oct 2022
Suppose the given data are drawn independently from an identical distribution (i.i.d.); our task is to deduce the output $y$ of any input $x$ from the given data set. In fact, the classification problem corresponds to the problem of estimating the conditional probability $P(y|x)$ in statistical inference [3, 4], and it has been studied extensively in both machine learning and statistical inference [5-7].
A common way to estimate the conditional probability in classification is to suppose a function $f(x) \approx P(y|x)$. For the binary classification problem, since $P(y=1|x) + P(y=0|x) = 1$, without loss of generality, we describe the problem of estimating the conditional probability as estimating $P(y=1|x)$ from now on. One of the most important theories for classification is the expected risk minimization theory [8], whose core idea is to minimize the error between the estimation function $f(x)$ and the real output in the expected space. The advantage of the expected risk minimization theory is that it does not require many assumptions about the estimation function. Most discriminant classifiers are based on the expected risk minimization theory, such as least squares classifiers (LSC) [9], neural networks (NN) [10], support vector machines (SVM) [11, 12], and ensemble learning (EL) [13]. In fact, when estimating the expected risk, it is necessary to estimate the distribution structure $F(x, y)$ of the data. For simplicity of calculation, most models assume that the data are uniformly distributed, so there is no operation to estimate the data distribution.
Recently, Vapnik and Izmailov [14] further studied the classification problem by estimating the cumulative distribution function and the conditional probability, and proposed a new way to estimate the conditional probability $P(y=1|x)$ by directly solving the integral equation based on the definitions of the conditional probability and the density function. To solve the integral equation, [14, 15] presented a least squares implementation called VSVM and compared it with the traditional least squares and support vector machine classifiers. The advantage of using the integral equation is that $f(x)$ can be solved by estimating the distribution function in the input space, and the integral equation has wider convergence properties. Therefore, this method has attracted attention in different machine learning fields [16-19]. However, the general paradigm of inferring $f(x)$ by using the integral equation has not been discussed, and the corresponding expected risk minimization theory and empirical estimation theory are not yet clear.
In this paper, different from the traditional expected risk estimation theory on empirical data, we establish the original theoretical model for directly estimating the conditional probability within the expected risk framework via the Fredholm equation, and present the corresponding empirical model based on cumulative distribution function estimation. An ε-insensitive empirical risk estimation is established and an ε-insensitive L1-loss cumulative support vector machine (ε-L1VSVM) is presented. It is worth mentioning that the classification evaluation indicator obtained by estimating the Fredholm equation is different from the traditional one. Experimental results show the effectiveness of the proposed ε-L1VSVM and the corresponding cumulative distribution function indicator on the validity and interpretability of classification, especially when the training data is insufficient. The main contributions are summarized as follows:
i) For estimating $P(y=1|x)$ by using the cumulative distribution function, we perform an objective transformation of the Fredholm integral equation to estimate conditional probabilities from a probabilistic perspective. The main characteristic of using the Fredholm integral equation is that the function to be estimated and the real output are equal under the cumulative distribution function.
ii) For estimating $P(y=1|x)$ by using statistical estimation, the expected risk minimization and empirical risk minimization theories based on the Fredholm integral equation are presented. Since the cumulative distribution function must be estimated, the loss function and the expected risk defined by the Fredholm integral equation differ from the traditional ones.
iii) Different from the traditional 0-1 loss, an insensitive way of estimating the empirical risk using the ε-insensitive loss is proposed; it yields sparsity and higher efficiency than the least squares loss (a sketch of this loss is given after this list). An ε-L1VSVM is then proposed based on the ε-insensitive loss. A new evaluation criterion with distribution information is also proposed: an out-of-sample estimate of the $v$ value of a testing sample.
iv) Experimental results show the effectiveness of the proposed ε-L1VSVM and
the corresponding cumulative distribution function indicator on validity and
interpretability of classification, especially when the training data is insufficient.
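As a small illustration of the loss mentioned in contribution iii), the sketch below implements the standard ε-insensitive L1 loss, $L_\varepsilon(u) = \max(0, |u| - \varepsilon)$; the full ε-L1VSVM model built on it is developed later in the paper, so the function and parameter names here (eps_insensitive_loss, eps) are illustrative assumptions only.

```python
import numpy as np

def eps_insensitive_loss(residual, eps=0.1):
    """Standard epsilon-insensitive L1 loss: zero inside the eps-tube,
    linear (absolute) outside it. `residual` may be a scalar or array."""
    return np.maximum(0.0, np.abs(residual) - eps)

# Example: residuals inside the tube contribute no loss, which is the
# source of the sparsity mentioned above.
r = np.array([-0.05, 0.0, 0.08, 0.5, -1.2])
print(eps_insensitive_loss(r, eps=0.1))   # [0.  0.  0.  0.4 1.1]
```

Compared with the squared loss, residuals smaller than ε are ignored entirely, which is what produces a sparse set of support vectors.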
The rest of the paper is organized as follows. In section 2, we introduce the implementation path of classical models for classification and give a brief overview of the method for direct estimation of conditional probabilities described in VSVM. In section 3, we give a theoretical framework for estimating conditional probabilities with the empirical distribution function, which includes a rigorous derivation and transformation of the integral equation from a probabilistic point of view and a theoretical study of estimating the equation in a statistical sense. In section 4, specific estimation models based on the proposed theoretical framework and an improved evaluation indicator are presented. In section 5, we show experimental results on artificial datasets and a number of publicly available datasets. In section 6, we give some concluding remarks.
2. Preliminaries
2.1. Risk minimization in classification
In the following, we consider the binary classification model of learning from training samples through three components [3]:
1) A generator of random vectors $x \in \mathcal{X}$, drawn independently from a fixed but unknown probability distribution function $F(x)$.
2) A supervisor who returns a binary output value $y_i \in \mathcal{Y} := \{0, 1\}$ to every input vector $x_i$, according to a conditional probability function $P(y=1|x)$, where $P(y=0|x) = 1 - P(y=1|x)$, also fixed but unknown.
3) A learning machine capable of implementing a set of indicator functions (functions which take only two values: zero and one) $\mathcal{F}$.
The problem of learning is that of choosing, from the given set of indicator functions $\mathcal{F}$, the one that best approximates the supervisor's response. The selection of the desired function is based on a training set of $m$ i.i.d. observations
$$(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m), \quad x_i \in \mathcal{X},\ y_i \in \mathcal{Y}, \tag{1}$$
drawn according to $F(x, y) = F(x)P(y|x)$. To find the function $y = f(x)$ (the mapping $f: \mathcal{X} \to \mathcal{Y}$) based on a training set that satisfies the above basic assumptions, we need some measure of how good a function is as a decision function. To this end, we introduce the concept of a loss function. This is a function $L(y, f(x))$ which tells us the cost of classifying instance $x \in \mathcal{X}$ as $y \in \mathcal{Y}$. For example, the natural loss function is the 0-1 loss
$$L_{0/1}(y, f(x)) = I(y, f(x)), \tag{2}$$
where
$$I(u, v) = \begin{cases} 0, & \text{if } u = v, \\ 1, & \text{if } u \neq v. \end{cases}$$
The expectation of function (2) determines the probability that the answer given by the indicator function $f(x)$ differs from the label $y$. We call the case of different answers a classification error. While the loss function measures the error of a function on some individual data sample, the risk of a function is the average loss over data samples generated according to the underlying distribution $F(x, y)$,
$$R(f) = \mathbb{E}_{X,Y}[L(y, f(x))] = \int_{\mathcal{X} \times \mathcal{Y}} L(y, f(x))\, dF(x, y). \tag{3}$$
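As a concrete illustration of (3) (not part of the paper), the following sketch approximates the risk of a fixed classifier by averaging the 0-1 loss over a large sample drawn from a hypothetical toy distribution $F(x, y)$; the distribution and the classifier are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xy(n):
    """Toy joint distribution F(x, y): x ~ N(0, 1), P(y=1|x) = sigmoid(3x)."""
    x = rng.normal(size=n)
    p1 = 1.0 / (1.0 + np.exp(-3.0 * x))
    y = (rng.random(n) < p1).astype(int)
    return x, y

def f(x):
    """A fixed indicator function: predict 1 when x > 0."""
    return (x > 0).astype(int)

# Monte Carlo approximation of R(f) = E[L_{0/1}(y, f(x))]:
# with many samples, the average 0-1 loss approaches the true risk.
x, y = sample_xy(200_000)
risk = np.mean(y != f(x))
print(f"estimated risk of f: {risk:.3f}")
```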
4
Therefore, the goal of learning is to find the function which minimizes the risk function (3) in the given set of indicator functions $\mathcal{F}$ when the probability measure $F(x, y)$ is unknown but the training data (1) are given.

Since the training data (1) are sampled from the underlying probability distribution $F(x, y)$, we can infer a function $f(x)$ from (1) whose risk is close to the expected risk. In practice, we approximate the unknown cumulative distribution function $F(x, y)$ by its joint empirical cumulative distribution function
$$F_m(x, y) = \frac{1}{m}\sum_{i=1}^{m} \theta(x - x_i)\,\theta(y - y_i), \tag{4}$$
where the one-dimensional step function is defined as
$$\theta(x - x_i) = \begin{cases} 1, & \text{if } x \geq x_i, \\ 0, & \text{otherwise}, \end{cases} \tag{5}$$
and the multi-dimensional step function for $x = (x^1, \dots, x^d)$ is defined as
$$\theta(x - x_i) = \prod_{k=1}^{d} \theta\left(x^k - x_i^k\right).$$
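To make definitions (4)-(5) concrete, here is a minimal sketch (not from the paper) that evaluates the multivariate empirical cumulative distribution function at a query point, assuming the 1/m normalization used above; names such as empirical_cdf are illustrative.

```python
import numpy as np

def empirical_cdf(X, y, x_query, y_query):
    """Evaluate F_m(x, y) of (4): the fraction of training pairs (x_i, y_i)
    with x_i <= x_query coordinate-wise and y_i <= y_query."""
    X = np.asarray(X, dtype=float)          # shape (m, d)
    y = np.asarray(y, dtype=float)          # shape (m,)
    step_x = np.all(X <= x_query, axis=1)   # multi-dimensional step function
    step_y = (y <= y_query)                 # one-dimensional step on labels
    return np.mean(step_x & step_y)

# Usage on a tiny 2-D sample with binary labels: two of the three points
# lie below the query point in every coordinate, so F_m = 2/3.
X = np.array([[0.2, 0.1], [0.5, 0.9], [0.8, 0.3]])
y = np.array([0, 1, 1])
print(empirical_cdf(X, y, x_query=np.array([0.6, 1.0]), y_query=1))
```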
Using approximation (4) in the expected loss function (3), we obtain the following empirical loss function
$$R_m(f) = \frac{1}{m}\sum_{i=1}^{m} L(y_i, f(x_i)). \tag{6}$$
That is, given some training data (1), a loss function $L$, and a function space $\mathcal{F}$ to work with, we define the classifier $f_m$ as
$$f_m := \arg\min_{f \in \mathcal{F}} R_m(f).$$
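The sketch below (an illustration, not the paper's method) carries out this minimization for a deliberately simple function class, one-dimensional threshold classifiers $f_t(x) = 1\{x > t\}$, by picking the threshold that minimizes the empirical 0-1 risk (6); the choice of threshold classifiers and the grid of candidates are assumptions made for the example.

```python
import numpy as np

def erm_threshold(x_train, y_train):
    """Minimize the empirical 0-1 risk R_m over threshold classifiers
    f_t(x) = 1{x > t}; return the best threshold and its empirical risk."""
    # Candidate thresholds: midpoints between sorted training inputs,
    # plus one threshold below and one above all the data.
    xs = np.sort(x_train)
    candidates = np.concatenate(([xs[0] - 1.0],
                                 (xs[:-1] + xs[1:]) / 2.0,
                                 [xs[-1] + 1.0]))
    risks = [np.mean(y_train != (x_train > t).astype(int)) for t in candidates]
    best = int(np.argmin(risks))
    return candidates[best], risks[best]

# Usage on a tiny 1-D training set.
x_train = np.array([-1.5, -0.7, -0.2, 0.1, 0.6, 1.3])
y_train = np.array([0, 0, 1, 1, 1, 1])
t, r = erm_threshold(x_train, y_train)
print(f"chosen threshold t = {t:.2f}, empirical risk R_m = {r:.3f}")
```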
This approach is called the empirical risk minimization (ERM) induction principle. The ERM principle is justified by the law of large numbers. However, when the sample size $m$ is relatively small, ERM may be inconsistent with the expected risk minimization. To address this problem, Vapnik [3] considers the effect of the set of functions $\mathcal{F}$ on the expected risk and restricts the set of admissible functions to render empirical risk minimization consistent, obtaining the following two lemmas.