3. Big data management
Basically, data processing is seen as the gathering, processing and
management of data in order to produce “new” information for
end users [3]. Historically, the key challenges have been related to the storage,
transportation and processing of high-throughput data. Big Data adds
further challenges, namely ambiguity, uncertainty and variety [3].
Consequently, these requirements imply an additional step in which data are cleaned,
tagged, classified and formatted [3,14]. Karmasphere5 currently
splits Big Data analysis into four steps: Acquisition or
Access, Assembly or Organization, Analyze, and Action or Decision.
These steps are thus referred to as the “4 A’s”. The Computing
Community Consortium [14], similarly to [3], divides the organization
step into an Extraction/Cleaning step and an Integration
step.
3 http://www.gartner.com/newsroom/id/1731916.
4 http://www.emc.com/collateral/analyst-reports/idcextracting-value-from-chaos-ar.pdf.
5 http://www.reuters.com/article/2011/09/21/idUS132142+21-Sep-2011+BW20110921.
Acquisition. A Big Data architecture has to acquire high-speed
data from a variety of sources (web, DBMS (OLTP),
NoSQL, HDFS) and has to deal with diverse access protocols. This
is where a filter can be established so that only data likely to be
helpful, or “raw” data with a lower degree of uncertainty, are stored
[14]. In some applications the conditions under which data are generated
are important, so it can be useful for further
analysis to capture these metadata and store them with the
corresponding data [14].
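As a purely illustrative example, the following is a minimal sketch, in Python, of such an acquisition filter; the record fields (payload, uncertainty), the threshold value and the sink object are hypothetical assumptions, not elements prescribed by [14]. It keeps only records whose estimated uncertainty is acceptable and stores the conditions of capture as metadata alongside the data.

import json
import time

UNCERTAINTY_THRESHOLD = 0.3  # illustrative cut-off, not prescribed by the literature

def acquire(records, source_name, sink):
    """Filter incoming records and store them together with capture metadata."""
    for record in records:
        # Keep only records whose estimated uncertainty is low enough.
        if record.get("uncertainty", 1.0) > UNCERTAINTY_THRESHOLD:
            continue
        enriched = {
            "data": record["payload"],
            # Metadata describing the conditions under which the data were generated.
            "metadata": {
                "source": source_name,
                "captured_at": time.time(),
                "uncertainty": record.get("uncertainty"),
            },
        }
        sink.write(json.dumps(enriched) + "\n")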
Organization. At this point the architecture has to deal
with various data formats (text formats, compressed files,
various delimiters, etc.) and must be able to parse them and
extract the actual information, such as named entities and the relations
between them [14]. This is also the point where data
have to be cleaned, put into a computable form, structured or
semi-structured, integrated and stored in the right location
(existing data warehouse, data marts, Operational Data Store,
Complex Event Processing engine, NoSQL database) [14].
Thus, a kind of ETL (extract, transform, load) has to be
performed. Successful cleaning in a Big Data architecture is not
entirely guaranteed; in fact “the volume, velocity, variety, and
variability of Big Data may preclude us from taking the time
to cleanse it all thoroughly”.6
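To make this ETL idea concrete, the sketch below assumes two hypothetical input formats (a delimited CSV file and compressed JSON lines) and a target store exposing an insert method; the field handling and the store interface are illustrative assumptions only.

import csv
import gzip
import json

def extract(path):
    """Parse heterogeneous input files into plain dictionaries."""
    if path.endswith(".csv"):
        with open(path, newline="") as f:
            yield from csv.DictReader(f)            # delimited text
    elif path.endswith(".json.gz"):
        with gzip.open(path, "rt") as f:
            for line in f:
                yield json.loads(line)              # compressed JSON lines

def transform(record):
    """Clean a record and put it into a computable, normalized form."""
    cleaned = {k.strip().lower(): v.strip() if isinstance(v, str) else v
               for k, v in record.items()
               if v not in (None, "", "NA")}
    return cleaned or None                          # drop records that become empty

def load(records, store):
    """Store cleaned records in the chosen location (warehouse, NoSQL, ...)."""
    for record in records:
        if record is not None:
            store.insert(record)                    # hypothetical store interface

def etl(paths, store):
    for path in paths:
        load((transform(r) for r in extract(path)), store)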
Analyze. Here queries are run, models are fitted and algorithms are
built to find new insights. Mining requires integrated,
cleaned, trustworthy data; at the same time, data
mining itself can also be used to help improve the quality and
trustworthiness of the data, understand its semantics, and
provide intelligent querying functions [14].
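As a small illustration of analysis being used to improve data quality, the sketch below flags records whose value in a given numeric field deviates strongly from the mean; the z-score rule, the threshold and the suspect flag are illustrative assumptions rather than a method advocated in [14].

import statistics

def flag_suspect_records(records, field, z_threshold=3.0):
    """Flag records whose value of `field` is a statistical outlier."""
    values = [r[field] for r in records if field in r]
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values) or 1.0        # avoid division by zero
    for r in records:
        if field in r:
            z = abs(r[field] - mean) / stdev
            # Suspect records can be routed back to the cleaning step.
            r["suspect"] = z > z_threshold
    return records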
Decision. Being able to take valuable decisions means being able to
interpret the results of the analysis efficiently. Consequently it is very important
for the user to “understand and verify” the outputs [14].
Furthermore, the provenance of the data (supplementary information
that explains how each result was derived) should be provided
to help the user understand what he obtains.
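A minimal sketch of attaching provenance to a derived result could look as follows; the aggregation performed and the structure of the provenance record are hypothetical, chosen only to illustrate the idea of explaining how a result was derived.

def derive_with_provenance(records, field):
    """Compute an aggregate and record how the result was derived."""
    values = [r[field] for r in records]
    result = sum(values) / len(values)
    provenance = {
        "operation": "mean",                              # how the result was obtained
        "field": field,
        "source_ids": [r.get("id") for r in records],     # which inputs were used
        "input_count": len(values),
    }
    return {"result": result, "provenance": provenance}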
While it is easy to see how volume, velocity, veracity and variety
influence the pipeline of a Big Data architecture, there is
another important aspect of the data to handle in such an architecture:
privacy. R. Hillard7 considers it very important
that privacy features prominently in his definition of
Big Data. Privacy can cause problems at the creation of data
(someone may want to hide some piece of information); at
the analysis of data [1], because aggregating or correlating data
may require access to private data; and at the purging of the database,
where it can lead to inconsistencies.
Indeed, if we delete all of the individual-level data, we can end up with incoherences
with respect to the aggregate data.
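As a purely illustrative sketch of this purging problem, assume an aggregate computed and stored before an individual record is deleted for privacy reasons; once the record is gone, the stored aggregate no longer matches what can be recomputed from the remaining data.

records = [{"id": 1, "value": 10}, {"id": 2, "value": 20}, {"id": 3, "value": 30}]
stored_total = sum(r["value"] for r in records)      # aggregate computed earlier: 60

# An individual requests deletion of their data (privacy purge).
records = [r for r in records if r["id"] != 2]

recomputed_total = sum(r["value"] for r in records)  # 40: incoherent with the stored 60
assert recomputed_total != stored_total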
To sum up, handling Big Data implies having an infrastructure
that is linearly scalable, able to handle high-throughput multi-formatted
data, fault tolerant, auto-recoverable, highly parallel
and based on distributed data processing [3]. It is important
to note that, in this management, integrating data (i.e.
“access, parse, normalize, standardize, integrate, cleanse,
extract, match, classify, mask, and deliver data.” [4, chap. 21])
represents 80% of a Big Data project. This aspect is discussed in depth
in Section 3.3.