• Data Integration Capability
– Apache Sqoop: a tool designed for transferring data from a relational database directly into HDFS or into Hive [12,18]. After analyzing the tables of the schema, it automatically generates the classes needed to import the data into HDFS; the contents of the tables are then read by a parallel MapReduce job.
– Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It is designed to import streaming data flows [12,27].
Visualization techniques
Making valuable decisions is the ultimate goal of Big Data analysis, and achieving this goal requires good visualization of Big Data content. For this reason, there is real interest in the field of visualization [4,3], i.e., “techniques and technologies used for creating images, diagrams, or animations to communicate, understand, and improve the results of big data analyses” [10]. Note that visualization in the Big Data context is static: data are not stored in a relational way, and real-time updates require processing large amounts of data. This problem has, however, started to be addressed [3]. Here we present some techniques for Big Data visualization.
• Tag Cloud. It is a method for visualizing and linking the concepts of a specific domain or web site. These concepts are rendered using text properties such as font size, weight, or color.
• Clustergram. M. Schonlau [28] defines the clustergram as a visualization technique used for cluster analysis that displays how individual members of a dataset are assigned to clusters as the number of clusters increases. As in every clustering process, the number of clusters is important, and the clustergram has the advantage of making it easy to perceive how this number influences the partitioning results.
• History Flow. F.B. Viégas, M. Wattenberg and K. Dave [29] present history flow as a visualization technique designed to show the evolution of a document efficiently with respect to the contributions of its different authors. The horizontal axis of a history flow carries time and the vertical axis the names of the authors. A color code is assigned to each author, and the vertical length of a bar indicates the amount of text written by each author.
• Spatial information flow. This technique represents flows of information between geographic locations. It is mostly rendered as a lighting graph whose edges connect sites located on a map.
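To make the tag cloud idea concrete, the following minimal sketch maps each concept to a font size. It assumes that the text property being varied is font size and that size grows linearly with word frequency, which is a common convention but is not prescribed by the description above. The function name `tag_cloud_sizes` and its parameters (`min_pt`, `max_pt`, `top_n`) are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def tag_cloud_sizes(text, min_pt=10, max_pt=40, top_n=20):
    """Return a {word: font size in points} mapping for the most frequent words.

    Assumption: font size is a linear function of word frequency.
    """
    # Tokenize into lowercase alphabetic words.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}

    freqs = [count for _, count in counts]
    lowest, highest = min(freqs), max(freqs)
    span = highest - lowest

    sizes = {}
    for word, count in counts:
        # Linear interpolation between min_pt and max_pt; if every word
        # has the same frequency, all words get the maximum size.
        scale = (count - lowest) / span if span else 1.0
        sizes[word] = round(min_pt + scale * (max_pt - min_pt))
    return sizes


if __name__ == "__main__":
    sample = "data data data hadoop hadoop hive"
    for word, size in tag_cloud_sizes(sample).items():
        print(f"{word}: {size}pt")
```

A rendering layer would then draw each word at its computed size; weight or color could be derived from the same frequency scale in a similar way.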
Visualization can also be used to solve Big Data problems. For a brief review on this topic, see [30].