• Crowdsourcing, used to collect data, features and metadata that enrich the current semantics of the data.
• Text analytics, which aims to analyze large text collections (emails, web pages, etc.) to extract information; it is used for topic modeling, question answering, etc. (a minimal topic-modeling sketch follows this list).
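As an illustration of the text-analytics use case mentioned above, the following Python sketch extracts topics from a small document collection with scikit-learn's CountVectorizer and LatentDirichletAllocation. The corpus, the number of topics and the other parameter values are assumptions chosen only to keep the example self-contained.

# Minimal topic-modeling sketch (hypothetical corpus and parameters).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "big data analytics on large text collections",
    "question answering over web pages and email archives",
    "scalable machine learning for high dimensional data",
]

# Bag-of-words representation of the collection.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Fit an LDA model with a (hypothetical) number of topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words of each discovered topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-3:][::-1]]
    print(f"topic {k}: {', '.join(top)}")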
Some proposals emphasize that these techniques rely on a generalized picture of the underlying knowledge: by design, they fail to capture the subtleties of the processes that produce the data [33,34]. Moreover, these techniques sometimes scale poorly to very large datasets. This is the case, for example, for learning-based techniques, where the size of the training data can exceed the available memory or a fast-growing number of features can lead to prohibitive execution times. Sengamedu [35] presents scalable methods that can be applied to machine learning (Random Projections, Stochastic Gradient Descent and MinClosed sequences). Trends in big data analytics are summarized in [31]; they mainly concern the visualization of multi-form, multi-source and real-time data. Moreover, the sheer size of the data limits in-memory processing.
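As a sketch of how two of the scalable methods cited above can be combined, and not of the method of [35] itself, the following Python example projects a high-dimensional stream of mini-batches into a smaller space with scikit-learn's SparseRandomProjection and trains an SGDClassifier out of core with partial_fit, so that no mini-batch ever has to fit the whole dataset in memory. The synthetic data generator, dimensions and batch sizes are assumptions.

# Out-of-core learning sketch: Random Projections + Stochastic Gradient Descent.
import numpy as np
from sklearn.random_projection import SparseRandomProjection
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
n_features, n_batches, batch_size = 10_000, 20, 500

projector = SparseRandomProjection(n_components=256, random_state=0)
clf = SGDClassifier()  # linear model trained incrementally by SGD

for b in range(n_batches):
    # Simulate reading one mini-batch that fits in memory (synthetic data).
    X = rng.random((batch_size, n_features))
    y = (X[:, 0] > 0.5).astype(int)
    # Fit the projection on the first batch only, then reuse it.
    X_small = projector.fit_transform(X) if b == 0 else projector.transform(X)
    # Incremental (out-of-core) update of the classifier.
    clf.partial_fit(X_small, y, classes=[0, 1])

print("accuracy on the last mini-batch:", clf.score(X_small, y))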
3.3. Adding Big Data capability to an existing information system
A whole book could be written on this topic; this is what [3] does by studying data warehousing in the age of Big Data. A number of integration strategies are presented in Table 1. The first step of the integration concerns data acquisition: since traditional databases deal with structured data, the existing ecosystem needs to be extended to cover all data types and domains. Then, the data integration capability needs to cope with velocity and frequency. The challenge here also lies in the ever-growing volume and, because many technologies leverage Hadoop, in using technologies that interact with Hadoop in a bidirectional manner: loading and storing data (HDFS) and processing and reusing the output (MapReduce) for further processing (a minimal sketch of this pattern follows below).
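To make this bidirectional interaction concrete, the sketch below shows the classic Hadoop Streaming word-count pattern in Python: the mapper and reducer read from stdin and write tab-separated key/value pairs to stdout, so the job can read its input from HDFS and write its output back to HDFS for further processing. The file names, paths and submission command are illustrative assumptions, not taken from [3].

# Hadoop Streaming sketch (word count). Submission is typically of the form
# (paths and jar name are illustrative):
#   hadoop jar hadoop-streaming.jar \
#     -input /data/raw -output /data/wordcount \
#     -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
#     -file wordcount.py
import sys

def mapper():
    # Emit "<word>\t1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # Select the role from the command line, e.g. "python wordcount.py map".
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    mapper() if role == "map" else reducer()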
[14, page 12] reminds us that the main challenge is not to build something “that is ideally suited for all processing tasks” but to have an underlying architecture flexible enough to allow the processes built on top of it to work at their full potential. There is certainly no commonly agreed solution: an infrastructure is intimately tied to the purpose of the organization in which it is used, and consequently to the kind of integration (real-time or batch). Other important questions also have to be answered, for instance: are Big Data stored in a timely manner or not [4]?