Broadly, data processing is seen as the gathering, processing, and management of data to produce "new" information for end users [3]. Traditionally, the key challenges have been related to the storage, transportation, and processing of high-throughput data. Big Data differs in that ambiguity, uncertainty, and variety have to be added to these challenges [3]. Consequently, these requirements imply an additional step in which data are cleaned, tagged, classified, and formatted [3,14]. Karmasphere5 currently splits Big Data analysis into four steps: Acquisition or Access, Assembly or Organization, Analyze, and Action or Decision; these steps are referred to as the "4 A's". The Computing Community Consortium [14], similarly to [3], divides the organization step into an Extraction/Cleaning step and an Integration step.
Acquisition. A Big Data architecture has to acquire high-speed data from a variety of sources (web, DBMS (OLTP), NoSQL, HDFS) and has to deal with diverse access protocols. This is the stage where a filter can be established to store only data that could be helpful, or "raw" data with a lower degree of uncertainty [14]. In some applications the conditions under which the data were generated are important, so it can be useful for further analysis to capture this metadata and store it with the corresponding data [14].
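As a minimal sketch of this idea (the record structure, threshold, and confidence score are assumptions for illustration, not part of any cited system), an acquisition-time filter could discard overly uncertain records and enrich the rest with metadata about their generation conditions:

import time

# Hypothetical acquisition-time filter: keep only records whose estimated
# uncertainty is below a threshold, and attach metadata describing the
# conditions under which each record was generated.
UNCERTAINTY_THRESHOLD = 0.3

def acquire(record, source, confidence):
    """Return an enriched record to store, or None to discard it."""
    uncertainty = 1.0 - confidence
    if uncertainty > UNCERTAINTY_THRESHOLD:
        return None  # too uncertain to be worth storing
    return {
        "payload": record,
        "metadata": {               # generation conditions kept for later analysis
            "source": source,       # e.g. "web", "OLTP", "sensor-42"
            "acquired_at": time.time(),
            "confidence": confidence,
        },
    }

stored = [r for r in (
    acquire({"temp": 21.4}, "sensor-42", confidence=0.95),
    acquire({"temp": -999}, "sensor-42", confidence=0.10),
) if r is not None]
print(stored)  # only the high-confidence reading is kept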
Organization. At this point the architecture has to deal with various data formats (text formats, compressed files, variously delimited files, etc.) and must be able to parse them and extract the actual information, such as named entities and the relations between them [14]. This is also the point where data have to be cleaned, put into a computable form (structured or semi-structured), integrated, and stored in the right location (existing data warehouse, data marts, Operational Data Store, Complex Event Processing engine, NoSQL database) [14]. Thus, a kind of ETL (extract, transform, load) has to be performed. Successful cleaning in a Big Data architecture is not entirely guaranteed; in fact, "the volume, velocity, variety, and variability of Big Data may preclude us from taking the time to cleanse it all thoroughly".6
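The following sketch illustrates, under assumed field names and delimiters, the kind of ETL the organization step implies: extracting variously delimited text records, transforming them into a clean, typed form, and loading them into a structured store (an in-memory SQLite table here, standing in for the target locations listed above):

import csv
import io
import sqlite3

def extract(raw_text, delimiter):
    """Parse delimited text into dictionaries (one per record)."""
    return list(csv.DictReader(io.StringIO(raw_text), delimiter=delimiter))

def transform(record):
    """Clean and normalize a record into a computable form."""
    return {
        "name": record["name"].strip().title(),
        "amount": float(record["amount"]),   # enforce a numeric type
    }

def load(conn, records):
    """Store the cleaned records in a structured location."""
    conn.executemany(
        "INSERT INTO sales(name, amount) VALUES (:name, :amount)", records)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales(name TEXT, amount REAL)")

# Two sources, each with its own delimiter and formatting quirks.
load(conn, [transform(r) for r in extract("name,amount\n alice ,10.5\n", ",")])
load(conn, [transform(r) for r in extract("name|amount\nBOB|3\n", "|")])

print(conn.execute("SELECT name, amount FROM sales").fetchall())
# [('Alice', 10.5), ('Bob', 3.0)]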
Analyze. Here queries are run, models are built, and algorithms are developed to find new insights. Mining requires integrated, cleaned, trustworthy data; at the same time, data mining itself can also be used to help improve the quality and trustworthiness of the data, understand its semantics, and provide intelligent querying functions [14].

Decision. Being able to take valuable decisions means being able to efficiently interpret the results of the analysis. Consequently, it is very important for the user to "understand and verify" the outputs [14]. Furthermore, the provenance of the data (supplementary information that explains how each result was derived) should be provided to help users understand what they obtain.
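A small sketch of how provenance might accompany an analysis result is given below; the record layout and field names are purely illustrative, not taken from any cited system:

# Illustrative sketch: attach provenance (which inputs produced a result and
# how) to an aggregate so the user can verify and interpret the output.
def average_with_provenance(records, field):
    used = [r for r in records if r.get(field) is not None]
    value = sum(r[field] for r in used) / len(used)
    return {
        "result": value,
        "provenance": {
            "operation": f"mean of '{field}'",
            "inputs_used": [r["id"] for r in used],      # trace back to sources
            "inputs_skipped": [r["id"] for r in records if r not in used],
        },
    }

records = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 14.0},
    {"id": 3, "amount": None},   # missing value: excluded but reported
]
print(average_with_provenance(records, "amount"))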
While it is easy to see how volume, velocity, veracity, and variety influence the pipeline of a Big Data architecture, there is another important aspect of data to handle in a Big Data architecture: privacy. R. Hillard7 considers privacy important enough to give it a prominent place in his definition of Big Data. Privacy can cause problems at the creation of data (someone may want to hide some piece of information); during analysis [1], because aggregating or correlating data may require access to private data; and when purging the database, where it can introduce inconsistencies: if we delete all of an individual's data, we can end up with incoherences with respect to the aggregate data.
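The purging problem can be made concrete with a small sketch (names and values are invented for illustration): if an individual's records are deleted but a precomputed aggregate is kept, the stored aggregate no longer matches what can be recomputed from the remaining data:

# Individual-level data and a precomputed aggregate stored separately.
individuals = {"alice": 120, "bob": 80, "carol": 100}
stored_total = sum(individuals.values())          # 300, persisted as an aggregate

# Privacy-driven purge: remove all of Bob's data.
del individuals["bob"]

recomputed_total = sum(individuals.values())      # 220
print(stored_total == recomputed_total)           # False: aggregate is now incoherent
# Either the aggregate must be recomputed or adjusted at purge time,
# or the inconsistency must be tolerated and documented.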
To sum up, handling Big Data implies having an infrastructure that is linearly scalable, able to handle high-throughput multi-formatted data, fault tolerant, auto-recoverable, and that offers a high degree of parallelism and distributed data processing [3]. It is important to note that, in this management, integrating data (i.e., "access, parse, normalize, standardize, integrate, cleanse, extract, match, classify, mask, and deliver data" [4, chap. 21]) represents 80% of a Big Data project. This aspect is discussed in depth in Section 3.3.