
ods.
The rest of the paper is organized as follows: Section 2 briefly reviews related
work. Section 3 details the proposed model, while Section 4
provides a comparative analysis of the results obtained
against previous models. The conclusion and future scope
of the work are stated in Section 5.
2. RELATED WORK
Data mining techniques have long been incorporated into the medical
domain for the prediction of diseases,
and various models have been proposed over time. Initially,
several traditional classification techniques were used for this
purpose, yielding average accuracies. Successive models
resorted to clustering [8][3][21][14] and feature selection
[20][6][16] techniques to enhance the achievable accuracies of
the classifier models. The advantages of the ensemble
idea in supervised learning have encouraged its usage for a
long time, and boosting has been widely used to improve the
accuracy of ensemble models. The works [17][19] provide
an overview of several ensemble models and their applications.
Recent work has shown that hybrid models
have been very successful in enhancing accuracies in
disease prediction. By using several data mining techniques
in collaboration with each other, it has been possible
to amplify the efficiency of these systems. A hybrid model
that uses a multi-objective local search to balance
between local and genetic searches has been described in the
work of Ishibuchi et al. [11]. The model proposed by Vafaie
et al. in [1] uses genetic search techniques rather
than greedy search and skilfully gleans an initially unknown
search space to bias the successive search into promising subspaces.
Several fuzzy hybrid models have also been proposed
and have proven to perform well. Examples include the work
of Carlos et al. [15], which uses a fuzzy-genetic approach
adopting an evolutionary model to enable classification; Fan
et al. [7], which combines soft computing techniques with
decision tree tools to diagnose and classify breast cancer and
liver disorders; and the work of Amit et al. [4], which discusses a
fuzzy system developed by heuristically learning from neural
networks. The model proposed in [12] also incorporates
a hybrid neural network combining an Artificial Neural Network (ANN)
and a Fuzzy Neural Network.
More recent work on hybrid models by B.M. Patil et al.
[9] uses a clustering algorithm as a preprocessing step
before the classification process to eliminate incorrectly
clustered tuples. The cleansed data is used to build the
classifier model, which is then tested on the same cleansed
data by k-fold cross validation. This technique
results in high accuracies for disease prediction. The models
proposed in [18][2] extend the previous model by B.M.
Patil et al. through the use of more efficient clustering
techniques for the elimination of outliers and genetic-algorithm-based
wrapper feature selection to select an optimal
subset of attributes from the dataset, resulting in a further
increase in prediction accuracy. Though these models achieve
high accuracies, they are attained at the cost of the robustness
of the system and are not reliable. As an alternative to
removing tuples completely, data quality can be improved
through data scrubbing and attribute correction techniques.
An overview of the types and classification of data
quality issues, various design suggestions, model frameworks,
and common techniques, including the approaches of clustering
and association rule mining for data cleansing and
attribute correction, is given in [5][24][23]. A fuzzy data
mining technique to mine association rules from quantitative
data, which we have adopted in this paper, is detailed in [10].
3. PROPOSED ROBUST SELF-HEALING HYBRID PREDICTION MODEL
In this model the dataset is cleansed in the preprocessing
phase, which involves attribute correction to logically
predict missing values and fix any inconsistencies or spurious
tuples, followed by data normalization, which prepares
the data to be modeled in the classification phase. An
ensemble model consisting of heterogeneous base classifiers is
used for the prediction. To further enhance the prediction
accuracy, boosting is performed individually on each classifier
with varying numbers of iterations, and the resultant
outputs are combined through maximum vote. Each of the
base classifiers utilizes only a fraction of the attributes for
modeling, chosen through a feature selection process. An
overview of the system architecture is provided in Figure
1. The working of each component is detailed in the
following sections.
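The classification stage described above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the particular base learners, boosting iteration counts, and the number of selected features (k) are illustrative assumptions, not the exact configuration of the proposed model.

```python
# Sketch: heterogeneous base classifiers, each trained on a feature-selected
# subset of attributes, boosted individually with its own iteration count,
# and combined by maximum (hard) vote. All concrete choices are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; in the paper this would be the cleansed, normalized data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def boosted(base, rounds):
    # Each base classifier sees only a fraction of the attributes (feature
    # selection) and is boosted individually with a varying iteration count.
    return make_pipeline(SelectKBest(f_classif, k=5),
                         AdaBoostClassifier(base, n_estimators=rounds))

ensemble = VotingClassifier(
    estimators=[
        ("tree", boosted(DecisionTreeClassifier(max_depth=1), 50)),
        ("nb", boosted(GaussianNB(), 20)),
        ("lr", boosted(LogisticRegression(max_iter=1000), 30)),
    ],
    voting="hard",  # hard voting = "maximum vote" over the boosted outputs
)
ensemble.fit(X, y)
print(ensemble.score(X, y))
```

Hard voting keeps the combination rule simple and transparent: each boosted pipeline casts one class vote per tuple, and the majority class is output.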
3.1 Data preprocessing
The raw dataset obtained can have data quality issues at the
instance level or at the record/schema level. Data mis-entry,
redundancy, inconsistent aggregation, etc. are some of the
instance-level issues, while issues at the schema level
include uniqueness, referential integrity, and naming or
structural conflicts. To handle these, context-dependent
attribute correction is incorporated. Context-dependent
implies that the correction is achieved not only with reference
to the data values an attribute is similar to, but also depends on the
values of the other attributes within the native record,
unlike context-independent correction, in which all record
attributes are cleaned in isolation.
3.1.1 Non-numeric attribute correction
The numeric and the non-numeric attributes are handled
differently. For the non-numeric attributes, all frequent
sets are generated using the apriori algorithm [22]. The
association rules are generated for these sets and may have
either 1, 2, or 3 predecessors and a single successor. These
rules form the set of validation rules. For each of the
tuples having attribute values which violate a validation
rule, a check is performed against all successors of the rule.
If the resulting normalized Levenshtein distance with any
of them is lower than a particular distance threshold, then
the attribute value of the corresponding tuple is altered. The
normalized Levenshtein distance between two strings a and
b is calculated as below:
NormLev(a, b) = (1/2) * ( LevDist(a, b) / |a| + LevDist(a, b) / |b| )    (1)
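The correction step built on Equation (1) can be sketched as below. The function and variable names, and the 0.4 distance threshold, are illustrative assumptions; in practice the successor candidates would come from the apriori-generated validation rules, and the threshold would be tuned on the dataset.

```python
# Sketch of the non-numeric attribute correction step. Names and the
# threshold value are illustrative; rule successors would be produced by
# the apriori-based validation rules described in the text.

def lev_dist(a, b):
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def norm_lev(a, b):
    """Equation (1): edit distance averaged relative to both string lengths."""
    d = lev_dist(a, b)
    return 0.5 * (d / len(a) + d / len(b))

def correct(value, successors, threshold=0.4):
    """Replace value with the closest rule successor if within the threshold."""
    best = min(successors, key=lambda s: norm_lev(value, s))
    return best if norm_lev(value, best) < threshold else value

print(correct("femle", ["female", "male"]))  # prints "female"
```

Normalizing by both string lengths keeps the measure in a comparable range for short and long attribute values, so a single threshold can govern all corrections.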