While it is easy to see how volume, velocity, veracity and variety influence the pipeline of a Big Data architecture, there is another important aspect of data to handle in Big Data architecture: privacy. R. Hillard7 considers it essential that privacy occupies a prominent place in his definition of Big Data. Privacy can cause problems at the creation of data (someone may want to hide a piece of information), at the analysis of data [1], since aggregating or correlating data may require access to private data, and at the purging of the database. Indeed, if we delete all of an individual's data, we may end up with inconsistencies with respect to previously computed aggregates, as the sketch below illustrates.
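
As a minimal illustration of this purging problem, the following Python sketch (record names and values are hypothetical) shows how deleting an individual's records leaves a previously persisted aggregate incoherent:

    from statistics import mean

    # Hypothetical individual-level records and a precomputed aggregate
    # persisted at report time.
    salaries = {"alice": 48_000, "bob": 52_000, "carol": 65_000}
    stored_avg_salary = mean(salaries.values())  # 55_000, stored in a report

    # Purging one individual's data (e.g. a right-to-erasure request)
    # does not update aggregates that were derived from it.
    del salaries["carol"]

    recomputed_avg = mean(salaries.values())     # 50_000
    assert stored_avg_salary != recomputed_avg   # stored aggregate is now incoherent

Resolving this incoherence requires either recomputing aggregates after each purge or accepting that historical aggregates no longer match the individual data they were derived from.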
To sum up, handling Big Data implies having an infrastructure that is linearly scalable, fault tolerant and auto-recoverable, able to handle high-throughput multi-formatted data, and that offers a high degree of parallelism and distributed data processing [3]. It is important to note that, in this management, integrating data (i.e., "access, parse, normalize, standardize, integrate, cleanse, extract, match, classify, mask, and deliver data" [4, chap. 21]) represents 80% of a Big Data project; a sketch of such a chain of steps is given below. This aspect is discussed in depth in Section 3.3.
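
To make the chaining of these integration operations concrete, the following minimal Python sketch composes a few of the steps quoted above (the record format and the parsing, cleansing and masking rules are hypothetical simplifications, not the method of [4]):

    import re

    def parse(raw: str) -> dict:
        """Parse a raw 'name;email' record into fields."""
        name, email = raw.split(";")
        return {"name": name, "email": email}

    def standardize(record: dict) -> dict:
        """Normalize whitespace and casing across all fields."""
        return {k: v.strip().lower() for k, v in record.items()}

    def cleanse(record: dict) -> dict:
        """Reject records whose email address is malformed."""
        if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record["email"]):
            raise ValueError("invalid email")
        return record

    def mask(record: dict) -> dict:
        """Mask the private part of the email before delivery."""
        user, domain = record["email"].split("@")
        return {**record, "email": f"{user[0]}***@{domain}"}

    raw_records = ["  Alice Smith ; Alice.Smith@example.org ", "Bob;not-an-email"]
    delivered = []
    for raw in raw_records:
        try:
            delivered.append(mask(cleanse(standardize(parse(raw)))))
        except ValueError:
            pass  # cleansing drops malformed records

    print(delivered)  # [{'name': 'alice smith', 'email': 'a***@example.org'}]

Even in this toy form, most of the code is devoted to integration steps rather than to analysis, which reflects the 80% figure cited above.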