Robust self-healing prediction model for high dimensional
data
Anirudha Rayasam
Samsung Research and Development Institute,
Bangalore
aniriudha_r.c@samsung.com
Nagamma Patil
National Institute of Technology Karnataka,
Surathkal
nagammapatil@nitk.ac.in
ABSTRACT
Data mining techniques offer increased accuracy and the
potential to detect unseen patterns, and have therefore been
widely adopted for standard classification problems. They
are often used for high precision disease prediction in the
medical field, and several hybrid prediction models capable
of achieving high accuracies have been proposed. Nevertheless,
most previous models fail to efficiently address the recurring
issue of bad data quality, which plagues most high dimensional
data and proves especially troublesome in highly sensitive
medical data. This work proposes a robust self-healing (RSH)
hybrid prediction model that uses the data in its entirety,
removing errors and inconsistencies from it rather than
discarding any data. Initial processing involves data
preparation followed by cleansing, or scrubbing, through
context-dependent attribute correction, which ensures that
there is no significant loss of relevant information before
the feature selection and prediction phases. An ensemble of
heterogeneous classifiers, subjected to local boosting, is
used to build the prediction model, and a genetic algorithm
based wrapper feature selection technique, wrapped on the
respective classifiers, is employed to select the
corresponding optimal set of features, which yields higher
accuracy. The proposed method is compared with some of the
existing high performing models and the results are analyzed.
Keywords
Data cleansing or scrubbing, Attribute correction, Ensemble
classifier, Genetic Algorithm based wrapper feature selection
1. INTRODUCTION
The wide usage of data mining techniques can be attributed
to the vast functionality they offer for performing a variety
of useful tasks. They help to extract useful knowledge from
the enormous amounts of data that are readily available these
days, giving that data value and supporting other related
tasks. Classification and prediction are two types of data
analysis that can be used to create classifier models and
predict future data trends. Clustering is a process by which
data is grouped into clusters according to similarities or
dissimilarities. Feature selection is yet another data mining
task, used in prediction models to find an optimal subset of
the given features. Data cleaning is the necessary precursor
of knowledge discovery and data warehouse building. Because
data collected from a variety of sources can be missing,
erroneous, duplicated, or structurally and semantically
heterogeneous, data cleansing plays a crucial role in ensuring
that the other, complementary mining techniques perform
accurately. By combining these basic techniques, powerful
hybrid models capable of extensive applications can be
developed.
High dimensional datasets such as medical datasets and
biological sequences have a large set of features, and data
mining techniques prove particularly helpful in their analysis
and in disease prediction. One such application is the
prediction of Type-2 diabetes, a disease caused by insulin
deficiency which could prove fatal if left unaddressed.
Several high accuracy systems employing data mining techniques
to varying degrees have previously been proposed for the early
prediction of diabetes on the Pima Indians diabetes data. Most
of the prevailing solutions do not adequately address the bad
quality of the data and simply discard the tuples which seem
incomplete or damaged. By doing so, a large quantity of
information is lost and not considered in building the
prediction model, which affects the robustness of the system:
when instances similar to the excluded tuples occur at a later
point in time, the model would fail to classify them
correctly, as it has not been allowed to learn any relevant
features for instances of this type. Moreover, the ignored
tuples are not included during the evaluation of the system,
so the reported accuracies tend to be inflated. The missing
data may also greatly influence the feature selection process,
resulting in crucial attributes being left unconsidered.
Therefore, it is imperative that we develop mechanisms to
effectively incorporate all or most of the data in the
development of the prediction models and, to ensure that this
inclusion does not adversely affect the system, that measures
be taken to improve data quality.
In this work a robust self-healing (RSH) hybrid prediction
model is proposed, which is comprehensive and provides
improved accuracies for disease prediction. The data is first
preprocessed through data normalization and grouping, followed
by attribute correction. The crux of the prediction model
comprises an ensemble of heterogeneous classifiers, each of
which is trained with a specific set of optimal features
chosen via a genetic algorithm based wrapper feature selection
technique. The model is evaluated on the Type-2 diabetes Pima
Indians dataset and a comparative analysis is performed with
the existing methods.
The rest of the paper is organized as follows: Section 2
gives an overview of the related work. Section 3 details the
proposed model, while Section 4 provides a comparative
analysis of the results obtained with the previous models.
The conclusion and the future scope of the work are stated in
Section 5.
2. RELATED WORK
Data mining techniques have long been used in the medical
domain for the prediction of diseases, and various models
have been proposed over time. Initially, several traditional
classification techniques were used for the purpose, yielding
average accuracies. Successive models resorted to clustering
[8][3][21][14] and feature selection [20][6][16] techniques to
enhance the achievable accuracies of the classifier models.
The advantages of the ensemble idea in supervised learning
have long encouraged its usage, and boosting has been widely
used to improve the accuracy of ensemble models. An overview
of several ensemble models and their applications is provided
in [17][19].
Recent work has shown that hybrid models have been very
successful in enhancing the accuracy of disease prediction.
By using several data mining techniques in collaboration with
each other, it has been possible to amplify the efficiency of
the systems. A hybrid model that uses a multi-objective local
search to balance between local and genetic searches is
described in the work of Ishibuchi et al. [11]. The model
proposed by Vafaie et al. in [1] uses genetic search
techniques in comparison to greedy search, and explores an
initially unknown search space to bias the subsequent search
into promising subspaces. Several fuzzy hybrid models have
also been proposed and have proven to perform well. Examples
include the work of Carlos et al. [15], which uses a
fuzzy-genetic approach adopting an evolutionary model to
enable classification; Fan et al. [7], which combines soft
computing techniques with decision tree tools to diagnose and
classify breast cancer and liver disorders; and Amit et al.
[4], which discusses a fuzzy system developed by heuristically
learning from neural networks. The model proposed in [12] also
incorporates a hybrid neural network combining an Artificial
Neural Network (ANN) and a Fuzzy Neural Network.
More recent work on hybrid models by B.M. Patil et al. [9]
uses a clustering algorithm as a preprocessing step before
the classification process to eliminate the incorrectly
clustered tuples. The cleansed data is used to build the
classifier model, which is then tested on the same cleansed
data by k-fold cross validation. Their technique results in
high accuracies for disease prediction. The models proposed in
[18][2] extend the previous model by B.M. Patil et al. through
the use of more efficient clustering techniques for the
elimination of outliers and genetic algorithm based wrapper
feature selection to select an optimal subset of attributes
from the dataset, resulting in a further increase in
prediction accuracy. Though these models achieve high
accuracies, they do so at the cost of the robustness of the
system and are not reliable. As an alternative to removing
tuples completely, data quality can be improved through data
scrubbing and attribute correction techniques. The types and
classification of data quality issues, various design
suggestions, model frameworks, and common techniques for data
cleansing and attribute correction, including clustering and
association rule mining approaches, are detailed in
[5][24][23]. A fuzzy data mining technique to mine association
rules from quantitative data, which we have adopted in this
paper, is detailed in [10].
3. PROPOSED ROBUST SELF-HEALING HYBRID PREDICTION MODEL
In this model the dataset is cleansed in the preprocessing
phase, which involves attribute correction to logically
predict the missing values and fix any inconsistencies or
spurious tuples, followed by data normalization, which
prepares the data for the classification phase. An ensemble
model consisting of heterogeneous base classifiers is used for
the prediction. To further enhance the prediction accuracy,
boosting is performed individually on each classifier with
varying numbers of iterations, and the resulting outputs are
then combined through maximum vote. Each of the base
classifiers uses only a fraction of the attributes for
modeling, chosen through a feature selection process. An
overview of the system architecture is provided in Figure 1.
The working of each of the components is detailed in the
following sections.
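The combination step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy threshold rules, the attribute indices and the sample row are all hypothetical, and each `predict` function merely stands in for a base classifier that has already been boosted and trained on its own selected feature subset.

```python
from collections import Counter


def project(row, feature_idx):
    """Keep only the attributes selected for one base classifier."""
    return [row[i] for i in feature_idx]


def ensemble_predict(row, members):
    """Combine base classifier outputs by maximum vote.

    members is a list of (predict_fn, feature_idx) pairs. Each predict_fn
    is a hypothetical stand-in for an already boosted base classifier.
    """
    votes = [predict(project(row, idx)) for predict, idx in members]
    counts = Counter(votes)
    top = max(counts.values())
    # Ties are broken in favour of the earliest member holding a top label.
    for label in votes:
        if counts[label] == top:
            return label


# Toy stand-ins: three threshold rules, each seeing a single attribute.
members = [
    (lambda x: int(x[0] > 120), [1]),   # assumed index of a glucose-like attribute
    (lambda x: int(x[0] > 30.0), [5]),  # assumed index of a BMI-like attribute
    (lambda x: int(x[0] > 50), [7]),    # assumed index of an age-like attribute
]
row = [2, 140, 70, 30, 0, 28.0, 0.4, 55]
label = ensemble_predict(row, members)
```

Here two of the three stand-in classifiers vote for class 1, so the ensemble returns 1. Passing the feature indices alongside each classifier keeps the per-classifier feature subsets explicit at prediction time.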
3.1 Data preprocessing
The raw dataset can have data quality issues at the instance
level or at the record/schema level. Data mis-entry,
redundancy, and inconsistent aggregation are some of the
instance level issues, while schema level issues include
uniqueness, referential integrity, and naming or structural
conflicts. To handle these, context-dependent attribute
correction is incorporated. Context-dependent implies that the
correction of a value is made not only with reference to the
data values it is similar to, but also depends on the values
of the other attributes within the same record, unlike
context-independent correction, in which all record attributes
are cleaned in isolation.
3.1.1 Non-numeric attribute correction
The numeric and the non-numeric attributes are handled
differently. For the non-numeric attributes, all the frequent
sets are generated using the apriori algorithm [22].
Association rules are generated for these sets and may have
1, 2 or 3 predecessors and a single successor. These rules
form the set of validation rules. For each tuple whose
attribute values deviate from a validation rule, a check is
performed against all successors of the rule. If the resulting
normalized Levenshtein distance to any of them is lower than a
particular distance threshold, then the attribute value of the
corresponding tuple is altered. The normalized Levenshtein
distance between two strings a and b is calculated as below:
$$\mathrm{NormLev}(a, b) = \frac{1}{2}\left(\frac{\mathrm{LevDist}(a, b)}{|a|} + \frac{\mathrm{LevDist}(a, b)}{|b|}\right) \qquad (1)$$
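As a rough illustration of Equation (1) and of the threshold check it feeds, the sketch below computes the normalized Levenshtein distance and replaces a value with the nearest rule successor when the distance falls below a threshold. The function names, the threshold of 0.3 and the example strings are assumptions made for this sketch; the paper does not state its threshold value here.

```python
def lev_dist(a: str, b: str) -> int:
    """Levenshtein edit distance with unit cost per edit."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def norm_lev(a: str, b: str) -> float:
    """Equation (1): edit distance averaged over both string lengths."""
    if not a or not b:
        raise ValueError("Equation (1) is undefined for empty strings")
    d = lev_dist(a, b)
    return 0.5 * (d / len(a) + d / len(b))


def correct_value(value: str, successors: list, threshold: float = 0.3) -> str:
    """Return the closest rule successor if it lies under the threshold.

    The threshold default is an assumed value, not one taken from the paper.
    """
    if not value or not successors:
        return value
    best = min(successors, key=lambda s: norm_lev(value, s))
    return best if norm_lev(value, best) < threshold else value


fixed = correct_value("diabtes", ["diabetes", "healthy"])
unchanged = correct_value("xyz", ["diabetes"])
```

In this example "diabtes" is one edit away from "diabetes", giving a normalized distance of about 0.13, so it is corrected, while "xyz" is far from every successor and is left as it is.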