3. Predictive Analysis System Architecture
The architecture of the predictive analysis system includes several phases: data collection, data warehousing,
predictive analysis, and processing of the analyzed reports. Figure 1 shows the complete architecture of the proposed method.
3.1 Data Collection
The raw diabetic big data set is given as input to the system. The unstructured, voluminous input data can
be obtained from various Electronic Health Records (EHR) / Patient Health Records (PHR), clinical systems and
external sources (government sources, laboratories, pharmacies, insurance companies, etc.), in various formats (flat
files, .csv, tables, ASCII/text, etc.) and residing at various locations [8].
3.2 Data Warehousing
In this phase the massive unstructured data is warehoused into a single unit, in which data from various sources is
cleansed, accumulated and made ready for further processing. Integration of various EHRs can help in identifying
the patterns for the diabetes prediction system.
3.3 Predictive Analysis
Predictive analysis can help healthcare providers accurately anticipate and respond to patient needs. It provides
the ability to make financial and clinical decisions based on predictions made by the system. This system uses a
predictive analysis algorithm in the Hadoop/MapReduce environment to predict and classify the type of DM, the
complications associated with it and the type of treatment to be provided.
Hadoop:
Hadoop is an open-source distributed data processing platform from Apache. Hadoop can serve the twin roles of
data organizer and analytics tool [8]. Hadoop has the potential to process extremely large amounts of health data,
mainly by allocating partitioned data sets across clusters of servers, each of which solves a different part of
the larger problem; the partial results are then integrated into the final result. Hadoop uses two main components to do its job:
Map/Reduce and Hadoop Distributed File System.
• Map/Reduce: Hadoop’s implementation of Map/Reduce is based on a programming model that processes large
datasets by dividing them into small blocks of tasks. Map/Reduce uses distributed algorithms, running on a group of
computers in a cluster, to process large datasets. It consists of two functions:
◦ The Map() function resides on the master node; it divides the input data or task into smaller
subtasks and distributes them to worker nodes, which process the smaller tasks and pass the answers
back to the master node. The subtasks are run in parallel on multiple computers.
◦ The Reduce() function collects the results of all the subtasks and combines them to produce an aggregated
final result, which it returns as the answer to the original big query.
• Hadoop Distributed File System (HDFS): HDFS replicates data blocks onto other computers in the data
center (to ensure reliability) and manages the transfer of data to the various parts of the distributed
system.
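The Map/Reduce flow described above can be illustrated with a minimal in-memory sketch. This is not Hadoop itself, only the two-function pattern it implements; the patient records and the task of counting DM types are hypothetical examples introduced here for illustration:

```python
from collections import defaultdict

# Hypothetical patient records: (patient_id, diagnosed DM type).
records = [
    ("p1", "type2"), ("p2", "type1"), ("p3", "type2"),
    ("p4", "gestational"), ("p5", "type2"), ("p6", "type1"),
]

def map_fn(record):
    """Map step: emit a (key, 1) pair for each record's DM type."""
    _, dm_type = record
    return (dm_type, 1)

def reduce_fn(mapped_pairs):
    """Reduce step: aggregate the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in mapped_pairs:
        totals[key] += count
    return dict(totals)

# In Hadoop the map calls run in parallel on worker nodes and the
# framework shuffles the pairs to reducers; here both run locally.
counts = reduce_fn(map_fn(r) for r in records)
print(counts)  # {'type2': 3, 'type1': 2, 'gestational': 1}
```

In a real cluster, the shuffle between the two functions (grouping all pairs with the same key onto one reducer) is what Hadoop provides; the sketch folds that into the single `reduce_fn` call.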
Pattern discovery:
For diabetic treatment it is necessary to examine attributes such as plasma glucose concentration, serum insulin,
diastolic blood pressure, diabetes pedigree function, Body Mass Index (BMI), age, and number of times pregnant.
The pattern discovery of predictive analysis must include the following [14]:
• Association rule mining - association between diabetic type and pages viewed (e.g. laboratory results)
• Clustering - clustering of similar usage patterns, etc.
• Classification - classification of health risk by the level of the patient's health condition.
• Usage of statistics
• Application of pre-defined deductive rules across data
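The classification and pre-defined-rule items above can be sketched together as a small rule-based risk classifier over three of the attributes listed earlier. The thresholds and risk labels below are hypothetical illustrations, not clinical guidance and not part of the proposed system:

```python
# Illustrative rule-based classification of a patient's health-risk level.
# The cut-offs below are example values chosen for this sketch only.

def classify_risk(glucose_mg_dl, bmi, age):
    """Assign a coarse risk label from plasma glucose, BMI and age."""
    score = 0
    if glucose_mg_dl >= 126:    # example cut-off for the diabetic range
        score += 2
    elif glucose_mg_dl >= 100:  # example cut-off for the pre-diabetic range
        score += 1
    if bmi >= 30:               # obese
        score += 1
    if age >= 45:
        score += 1
    return "high" if score >= 3 else "medium" if score >= 1 else "low"

print(classify_risk(140, 32, 50))  # high
print(classify_risk(105, 24, 30))  # medium
print(classify_risk(90, 22, 25))   # low
```

A production system would learn such rules from the warehoused EHR data (e.g. via the classification and association-rule-mining steps above) rather than hard-coding them.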