Robust self-healing prediction model for high dimensional data Anirudha Rayasam


Robust self-healing prediction model for high dimensional data

Anirudha Rayasam

Samsung Research and Development Institute, Bangalore

aniriudha_r.c@samsung.com

Nagamma Patil

National Institute of Technology Karnataka, Surathkal

nagammapatil@nitk.ac.in

ABSTRACT

Owing to the increased accuracy and the potential to detect unseen patterns that data mining techniques provide, they have been widely adopted for standard classification problems. They have often been used for high-precision disease prediction in the medical field, and several hybrid prediction models capable of achieving high accuracies have been proposed. Even so, most of the previous models fail to efficiently address the recurring issue of bad data quality, which plagues most high dimensional data and proves especially troublesome in highly sensitive medical data. This work proposes a robust self-healing (RSH) hybrid prediction model which uses the data in its entirety, removing errors and inconsistencies from it rather than discarding any data. Initial processing involves data preparation followed by cleansing or scrubbing through context-dependent attribute correction, which ensures that there is no significant loss of relevant information before the feature selection and prediction phases. An ensemble of heterogeneous classifiers, subjected to local boosting, is used to build the prediction model, and a genetic algorithm based wrapper feature selection technique, wrapped on the respective classifiers, is employed to select the corresponding optimal set of features, which warrants higher accuracy. The proposed method is compared with some of the existing high-performing models and the results are analyzed.

Keywords

Data cleansing or scrubbing, Attribute correction, Ensemble classifier, Genetic algorithm based wrapper feature selection

1. INTRODUCTION

The wide usage of data mining techniques can be attributed to the vast functionality they offer for performing a variety of useful tasks. They can effectively help to extract useful knowledge from the enormous amounts of data that are readily available these days, thus adding value and enabling other related tasks to be carried out efficiently. Classification and prediction are two types of data analysis that can be used to create classifier models and predict future data trends. Clustering is a process by which data is grouped into clusters according to their similarities or dissimilarities. Feature selection is yet another data mining task, used in prediction models to find an optimal subset of the given features. Data cleaning is the necessary precursor of knowledge discovery and data warehouse building. As data collected from a variety of sources can be missing, erroneous, duplicated, or structurally and semantically heterogeneous, data cleansing plays a crucial role in today's world in ensuring the accurate performance of other complementary mining techniques. By combining these basic techniques, powerful hybrid models capable of extensive applications can be developed.

High dimensional datasets such as medical datasets and biological sequences have a large set of features, and data mining techniques prove particularly helpful in their analysis and for disease prediction. One such application is the prediction of Type-2 diabetes, a disease caused by insulin deficiency which could prove fatal if left unaddressed. Several high-accuracy systems which employ data mining techniques to varying degrees have previously been proposed for the early prediction of diabetes on the Pima Indians diabetes data. Most of the prevailing solutions do not strongly address the bad quality of the data and simply discard the tuples which seem incomplete or damaged. By doing so, a large quantity of information is lost and not considered in building the prediction model, which affects the robustness of the system: when instances similar to the excluded tuples occur at a later point in time, the model would fail to classify them correctly, as it has not been allowed to learn any relevant features for instances of this type. The ignored tuples are also not included during the evaluation of the system, so the accuracies showcased tend to be inflated. Moreover, these missing data may often greatly influence the feature selection process, resulting in crucial attributes being overlooked. Therefore, it is imperative that we develop mechanisms to effectively incorporate all or most of the data in the development of the prediction models, and that measures be taken to improve data quality so that this inclusion does not adversely affect the system.

In this work a robust self-healing (RSH) hybrid prediction model is proposed, which is comprehensive and provides improved accuracies for disease prediction. Initial preprocessing of the data is performed, consisting of data normalization and grouping, followed by attribute correction. The crux of the prediction model comprises an ensemble of heterogeneous classifiers, each of which is trained with a specific set of optimal features chosen via a genetic algorithm based wrapper feature selection technique. The model is evaluated on the Type-2 diabetes Pima Indians dataset and a comparative analysis is performed with the existing methods.
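To make the idea of genetic algorithm based wrapper feature selection concrete, the following is a minimal sketch under stated assumptions, not the configuration used in this work. Each chromosome is a bit mask over the features, and the fitness function is meant to wrap a classifier (for instance, returning its cross-validated accuracy on the selected features). The names ga_feature_selection, toy_fitness and USEFUL, the operators (tournament selection, one-point crossover, bit-flip mutation, elitism) and all parameter values are illustrative assumptions; toy_fitness merely stands in for a real classifier-based score.

```python
import random
from typing import Callable, List, Sequence, Tuple

Mask = Tuple[int, ...]  # 1 = feature kept, 0 = feature dropped


def ga_feature_selection(
    n_features: int,                    # must be at least 2
    fitness: Callable[[Mask], float],   # wrapper score of a feature subset
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.05,
    seed: int = 0,
) -> Mask:
    """Return the best feature mask found by a simple generational GA."""
    rng = random.Random(seed)

    def random_mask() -> Mask:
        return tuple(rng.randint(0, 1) for _ in range(n_features))

    def tournament(pop: Sequence[Mask]) -> Mask:
        # Keep the fitter of two randomly drawn individuals.
        a, b = rng.choice(pop), rng.choice(pop)
        return a if fitness(a) >= fitness(b) else b

    population: List[Mask] = [random_mask() for _ in range(pop_size)]
    best = max(population, key=fitness)

    for _ in range(generations):
        children: List[Mask] = [best]  # elitism: the best mask survives unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(population), tournament(population)
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation, applied independently to every gene.
            child = tuple(
                bit ^ 1 if rng.random() < mutation_rate else bit for bit in child
            )
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best


USEFUL = {0, 3}  # hypothetical indices of the informative features


def toy_fitness(mask: Mask) -> float:
    # Stand-in for classifier accuracy: reward informative features,
    # lightly penalise the size of the subset.
    return sum(mask[i] for i in USEFUL) - 0.1 * sum(mask)


best_mask = ga_feature_selection(n_features=8, fitness=toy_fitness)
print(best_mask)
```

In a wrapper setting the fitness call dominates the cost, since every evaluation trains and validates a classifier, so caching the scores of masks already seen would be a natural refinement.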

The rest of the paper is organized as follows: an overview of the related work is given in Section 2. Section 3 details the proposed model, while Section 4 provides a comparative analysis of the results obtained against the previous models. The conclusion and the future scope of the work are stated in Section 5.

2. RELATED WORK

Data mining techniques have long been incorporated in the medical domain for the prediction of diseases, and various models have been proposed over time. Initially, several traditional classification techniques were used for the purpose, yielding average accuracies. The successive models resorted to clustering [8][3][21][14] and feature selection [20][6][16] techniques to enhance the achievable accuracies of the classifier models. The advantages of the ensemble idea in supervised learning have long encouraged its usage, and boosting has been widely used to improve the accuracy of ensemble models. An overview of several ensemble models and their applications is provided in [17][19].

Recent work by researchers has shown that hybrid models have been very successful in enhancing the accuracies in disease prediction. By using several data mining techniques in collaboration with each other, it has been possible to improve the efficiency of the systems. A hybrid model that uses a multi-objective local search to balance local and genetic searches has been described in the work of Ishibuchi et al. [11]. The model proposed by Vafaie et al. in [1] uses genetic search techniques, in comparison to greedy search, and gleans information about an initially unknown search space to bias the subsequent search into promising subspaces. Several fuzzy hybrid models have also been proposed and have proven to perform well. These include the work of Carlos et al. [15], which uses a fuzzy-genetic approach adopting an evolutionary model to enable classification; Fan et al. [7], which combines soft computing techniques with decision tree tools to diagnose and classify breast cancer and liver disorder; and Amit et al. [4], which discusses a fuzzy system developed by heuristically learning from neural networks. The model proposed in [12] also incorporates a hybrid neural network combining an Artificial Neural Network (ANN) and a Fuzzy Neural Network.

More recent work on hybrid models by B.M. Patil et al. [9] uses a clustering algorithm as a preprocessing step before the classification process to eliminate the incorrectly clustered tuples. The cleansed data is used to build the classifier model, which is then tested on the same cleansed data by k-fold cross validation. The proposed technique results in high accuracies for disease prediction. The models proposed in [18][2] extend the previous model by B.M. Patil et al. through the usage of more efficient clustering techniques for the elimination of outliers and genetic algorithm based wrapper feature selection to select an optimal subset of attributes from the dataset, resulting in a further increase in prediction accuracy. Though these models achieve high accuracies, these are attained at the cost of the robustness of the system, and the models are not reliable. As an alternative to removing tuples completely, data quality can be improved through data scrubbing and attribute correction techniques. An overview of the types and classification of data quality issues, various design suggestions, model frameworks and common techniques, including clustering and association rule mining approaches for data cleansing and attribute correction, is given in [5][24][23]. A fuzzy data mining technique to mine association rules from quantitative data, which we have adopted in this paper, is detailed in [10].

3. PROPOSED ROBUST SELF-HEALING HYBRID PREDICTION MODEL

In this model the dataset is cleansed in the preprocessing phase, which involves attribute correction to logically predict the missing values and fix any inconsistencies or spurious tuples, followed by data normalization, which prepares the data to be modeled in the classification phase. An ensemble model consisting of heterogeneous base classifiers is used for the prediction. To further enhance the prediction accuracy, boosting is performed individually on each classifier with varying numbers of iterations, and the resultant outputs are then combined through maximum vote. Each of the base classifiers uses only a fraction of the attributes for modeling, chosen through a feature selection process. An overview of the system architecture is provided in Figure 1. The working of each of the components is detailed in the following sections.
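The combination step described above can be sketched as follows. This is an illustrative sketch only: the Member class, the ensemble_predict function, the threshold rules and the feature indices are all hypothetical, and each predict callable merely stands in for an individually boosted base classifier. Every member sees only its own subset of attributes, and the final label is the one receiving the most votes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Hashable, List, Sequence


@dataclass
class Member:
    """One base classifier plus the indices of the features it was trained on."""
    feature_idx: Sequence[int]
    predict: Callable[[Sequence[float]], Hashable]  # stand-in for a boosted classifier


def ensemble_predict(members: List[Member], record: Sequence[float]) -> Hashable:
    """Project the record onto each member's features and return the majority label."""
    votes = [m.predict([record[i] for i in m.feature_idx]) for m in members]
    # On a tie, Counter keeps insertion order, so the label voted first wins.
    return Counter(votes).most_common(1)[0][0]


# Hypothetical members: simple threshold rules over three numeric attributes.
members = [
    Member([0], lambda x: int(x[0] > 120)),
    Member([1, 2], lambda x: int(x[0] > 30 and x[1] > 40)),
    Member([0, 1], lambda x: int(x[0] > 140 or x[1] > 35)),
]

print(ensemble_predict(members, [130.0, 33.0, 29.0]))
```

Using an odd number of members for a two-class problem avoids ties altogether, which keeps the vote independent of the order in which the members are listed.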

3.1 Data preprocessing

The raw dataset obtained can have data quality issues at the instance level or at the record/schema level. Data mis-entry, redundancy and inconsistent aggregation are some of the instance-level issues, while the schema-level issues include uniqueness, referential integrity and naming or structural conflicts. In order to handle these, context-dependent attribute correction is incorporated. Context-dependent implies that the correction is achieved not only with reference to the data values an attribute value is similar to, but also depends on the values of the other attributes within the same record, unlike context-independent correction, in which all record attributes are cleaned in isolation.
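The distinction can be illustrated with a toy contrast, which is an assumption-laden sketch rather than the correction procedure of this work. Here a missing value is filled with the most frequent value of the attribute: once over all records (context-independent) and once only over records that agree with the target record on its other attributes (context-dependent). The function names, the attribute names and the sample records are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def fill_context_independent(records: List[Record], attr: str) -> Optional[str]:
    """Most frequent known value of `attr`, ignoring every other attribute."""
    counts = Counter(r[attr] for r in records if r[attr] is not None)
    return counts.most_common(1)[0][0] if counts else None


def fill_context_dependent(
    records: List[Record], target: Record, attr: str
) -> Optional[str]:
    """Most frequent value of `attr` among records that agree with `target`
    on all of its other known attributes."""
    context = {k: v for k, v in target.items() if k != attr and v is not None}
    matching = [
        r for r in records
        if r[attr] is not None and all(r.get(k) == v for k, v in context.items())
    ]
    return fill_context_independent(matching, attr)


# Hypothetical reference records and a target record with a missing value.
records: List[Record] = (
    [{"state": "Karnataka", "city": "Bangalore"}] * 3
    + [{"state": "Kerala", "city": "Kochi"}] * 2
)
target: Record = {"state": "Kerala", "city": None}

print(fill_context_independent(records, "city"))
print(fill_context_dependent(records, target, "city"))
```

The context-independent fill follows the global majority regardless of the record at hand, whereas the context-dependent fill is constrained by the rest of the record, which is the property the attribute correction step relies on.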

3.1.1 Non-numeric attribute correction

The numeric and the non-numeric attributes are handled differently. For the non-numeric attributes, all the frequent sets are generated using the Apriori algorithm [22]. The association rules are generated for these sets and may have 1, 2 or 3 predecessors and a single successor. These rules form the set of validation rules. For each tuple having attribute values which deviate from a validation rule, a check is performed against all successors of the rule. If the resulting normalized Levenshtein distance to any of them is lower than a particular distance threshold, then the attribute value of the corresponding tuple is altered. The normalized Levenshtein distance between two strings a and b is calculated as below:

\[
\mathrm{NormLev}(a, b) = \frac{1}{2}\left(\frac{\mathrm{LevDist}(a, b)}{|a|} + \frac{\mathrm{LevDist}(a, b)}{|b|}\right) \tag{1}
\]
