2. What is big data?
Manyika et al. [10, page 1] define Big Data as “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze”. Similarly, Davis and Patterson [1, page 4] state that “Big data is data too big to be handled and analyzed by traditional database protocols such as SQL”, a view shared by [11,3,4], among others. Both groups of authors, however, go beyond the size of data alone when defining Big Data. Edd Dumbill [12, page 3] explicitly conveys this multi-dimensionality when he adds that “the data is too big, moves too fast, or doesn’t fit the strictures of your database architectures”. This quotation shows that characteristics beyond sheer size must be present before a large dataset can be considered Big Data, or big size data as it is often called throughout the literature [2].
It is now accepted that size is not the only feature of Big Data. Many authors [1,12,11,9,13,4] explicitly use the Three V’s (Volume, Variety and Velocity) to characterize Big Data. While the three V’s dominate the literature, several authors [10,13] and institutes such as IEEE also stress Value, Veracity and Visualization. This last “V” underlines how important it is to provide good tools for making sense of the data and of the results of analysis.
Volume (Data at rest). The benefit gained from the ability to process large amounts of information is the main attraction of Big Data analytics: having more data beats having better models [12]. As a consequence, many companies now tend to store vast amounts of data of many sorts: social network data, health care data, financial data, biochemistry and genetic data, astronomical data, etc.
Variety (Data in many forms). These data do not have a fixed structure and rarely present themselves in a perfectly ordered form, ready for processing [12]. Indeed, such data can be highly structured (data from relational databases), semi-structured (web logs, social media feeds, raw feeds directly from a sensor source, email, etc.) or unstructured (video, still images, audio, clicks) [12]. Another “V”, for Variability, can be added to Variety to emphasize semantics, or the variability of meaning in language and communication protocols.
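To make the three degrees of structure concrete, the sketch below shows one hypothetical record of each kind; the field names and values are invented purely for illustration.

```python
import json

# Structured: a fixed-schema row, as it would come from a relational table.
structured_row = (42, "Alice", "2015-06-01", 199.90)   # (id, name, date, amount)

# Semi-structured: a self-describing JSON document, e.g. a social media feed
# item; fields may differ from one record to the next.
semi_structured = json.loads(
    '{"user": "alice", "text": "hello", "tags": ["big", "data"]}'
)

# Unstructured: raw bytes (here the first bytes of a JPEG image); any structure
# must be imposed later by the analysis itself.
unstructured = b"\xff\xd8\xff\xe0\x00\x10JFIF"

print(type(structured_row), type(semi_structured), type(unstructured))
```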
Velocity (Data in motion). Velocity involves streams of data, the creation of structured records, and availability for access and delivery. It is not just the velocity of the incoming data that is the issue: it is possible, for example, to stream fast-moving data into bulk storage for later batch processing. What matters is the speed of the feedback loop, taking data from input through to decision [12].
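The sketch below contrasts the two options just mentioned, using an in-memory queue as a stand-in for a real streaming source; the event fields, the archive list and the decision threshold are all hypothetical.

```python
from collections import deque

# Stand-in for a fast-moving stream of sensor events.
stream = deque({"sensor": "s1", "value": v} for v in (17, 93, 41))

archive = []      # stand-in for bulk storage, to be batch-processed later
THRESHOLD = 50    # assumed decision rule for the feedback loop

while stream:
    event = stream.popleft()
    archive.append(event)              # option 1: store now, analyze in batch later
    if event["value"] > THRESHOLD:     # option 2: act immediately on data in motion
        print("alert:", event)         # the feedback loop: input -> decision
```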
Value (Data in highlight). This feature is the very purpose of Big Data technology. The view is well expressed by the International Data Corporation, which states that Big Data architectures are “designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis”. This value falls into two categories: analytical use (replacing or supporting human decisions, discovering needs, segmenting populations to customize actions) and enabling new business models, products and services [12,10].
Veracity (Data in doubt). Veracity is conformity with truth or fact; in short, accuracy, certainty, precision. Uncertainty can be caused by inconsistencies, model approximations, ambiguities, deception, fraud, duplication, incompleteness, spam and latency. Because of this uncertainty, results derived from Big Data cannot be proven; they can only be assigned a probability.
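As a purely illustrative sketch of assigning a probability rather than a proof, the fragment below aggregates the same fact as reported by several hypothetical, partly incomplete sources and attaches a crude confidence to the majority value.

```python
from collections import Counter

# Hypothetical reports of the same fact from several sources; None marks an
# incomplete record.
reports = ["open", "open", "closed", "open", None]

observed = [r for r in reports if r is not None]   # drop incomplete entries
value, votes = Counter(observed).most_common(1)[0]

confidence = votes / len(observed)                 # fraction of sources that agree
print(f"most likely value: {value!r}, confidence: {confidence:.2f}")
```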
To conclude, dealing effectively with Big Data requires one to create value against the volume, variety and veracity of data while it is still in motion (velocity), not just after it is at rest [11]. Finally, as recommended by [13], scientists must jointly tackle Big Data with all of its features.