In this study, to develop a model that can
predict heart disease cases from measurements taken
during transthoracic echocardiography examination,
we used the Knowledge Discovery in Databases (KDD)
methodology as described in reference 2.
To define the problem and determine the
medical goals, I held detailed discussions with
medical practitioners, particularly cardiologists at PGI,
Chandigarh. Since the knowledge gained from the
experts is a high-level description of the
problem from the medical point of view, a literature
review was also carried out, covering relevant work on
data mining and heart disease, to gain more
knowledge about the domain.
Furthermore, a real-time observation
of the system was performed to understand the
business process of the hospital. A key sub-goal in
this step is the determination of the data mining goals and
their success criteria. These goals are obtained by
translating the medical goals into data mining goals.
I used data collected from PGI,
Chandigarh, comprising the transthoracic
echocardiography reports of 7,008 patients from
2008 to the first quarter of 2010. The data
came from measurements taken during the
echocardiography examination and covered
20 variables. To reduce the number of variables,
I turned to a domain expert for assistance. The expert selected
the 15 most important variables for inclusion in
the dataset.
The hospital keeps the record of each
patient in a separate hard-copy file, and each such file was
converted into a separate Microsoft Word file. To
integrate the data, it was necessary to create
a database with the variables of interest and record the
values of each variable in it. After
recording, the new database contained 7,339
instances, each instance corresponding to a single file.
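The integration step can be illustrated with a minimal sketch; it is not the procedure actually used at the hospital. It assumes a SQLite database, and the table name `echo_records` and the three variable names (`age`, `lvef`, `la_diameter`) are hypothetical stand-ins, since the 15 selected variables are not listed here. Each transcribed patient file becomes one row, that is, one instance.

```python
import sqlite3

# Hypothetical variable names; the study's actual 15 variables are not listed.
VARIABLES = ("age", "lvef", "la_diameter")


def create_database(path=":memory:"):
    """Create a table with one row per patient file and one column per variable."""
    conn = sqlite3.connect(path)
    columns = ", ".join(f"{name} REAL" for name in VARIABLES)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS echo_records "
        f"(file_id TEXT PRIMARY KEY, {columns})"
    )
    return conn


def add_record(conn, file_id, values):
    """Insert one transcribed file; variables absent from `values` become NULL."""
    row = [file_id] + [values.get(name) for name in VARIABLES]
    placeholders = ", ".join("?" for _ in row)
    conn.execute(f"INSERT INTO echo_records VALUES ({placeholders})", row)
    conn.commit()


def count_instances(conn):
    """Return the number of instances (rows) recorded so far."""
    return conn.execute("SELECT COUNT(*) FROM echo_records").fetchone()[0]


conn = create_database()
add_record(conn, "file-0001", {"age": 54, "lvef": 60.0, "la_diameter": 3.8})
add_record(conn, "file-0002", {"age": 61, "lvef": 45.0})  # la_diameter not recorded
print(count_instances(conn))
```

Storing an unrecorded measurement as NULL, rather than as a placeholder number, keeps missing values distinguishable from real readings when the data is later checked.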
The selected data was checked for noise,
inconsistency and missing values using frequency
distributions, while outlier detection was done using
box plots. Noise and inconsistencies identified in
the dataset were corrected manually, missing
values were replaced with the most probable value
as determined by regression, and outliers were
replaced with the mean value of the attribute. All
the data cleaning was performed after addressing