Recent studies show that a multiple-layer architecture is one option for dealing with big data. A distributed parallel architecture distributes data across multiple processing units, and parallel execution across those units improves processing speeds. This type of architecture inserts data into a parallel DBMS and makes use of the MapReduce and Hadoop frameworks. Such frameworks aim to make the processing power transparent to the end user by using a front-end application server.[33]
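The MapReduce model used by such frameworks can be shown with a minimal sketch in plain Python, simulating the map, shuffle and reduce phases in a single process rather than on a Hadoop cluster; the function names are illustrative and not part of any Hadoop API.

```python
from collections import defaultdict
from itertools import chain

# Minimal, single-process illustration of the MapReduce programming model.
# A real Hadoop job distributes map and reduce tasks across many nodes;
# here the three phases are simply simulated in memory.

def map_phase(document):
    """Map: emit a (word, 1) pair for each word in a document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the list of values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data needs parallel processing",
             "parallel processing speeds up big data"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)  # {'big': 2, 'data': 2, 'needs': 1, ...}
```

Because each map call and each per-key reduction is independent, a framework such as Hadoop can schedule them across many nodes, which is what allows the model to scale.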
Big data analytics for manufacturing applications can be based on a 5C architecture (connection, conversion, cyber, cognition, and configuration); see http://www.imscenter.net/cyber-physical-platform . At the connection level, devices can be designed to self-connect and self-sense their own behaviour. At the conversion level, data from self-connected devices and sensors are converted into features that measure critical issues; with this self-aware capability, machines can use the information to predict their own potential issues. At the cyber level, each machine creates its own "twin" from these extracted features and further characterises its health pattern based on a "time-machine" methodology. The established twin in cyberspace can perform peer-to-peer self-comparison for further synthesis. At the cognition level, the outcomes of self-assessment and self-evaluation are presented to users via infographics that show the content and context of the potential issues. At the configuration level, the machine or production system can be reconfigured based on priority and risk criteria to achieve resilient performance.[34]
The 5C level architecture can be described as follows:
Smart connection: Acquiring accurate and reliable data from machines and their components is the first step in developing a cyber-physical system application. The data might be measured directly by sensors or obtained from controllers or from enterprise manufacturing systems such as ERP, MES, SCM and CMM. Two important factors have to be considered at this level. First, given the various types of data, a seamless and tether-free method of managing the data-acquisition procedure and transferring the data to a central server is required, for which specific protocols such as MTConnect are effective. Second, selecting proper sensors (type and specification) is the other important consideration at this level.
Data-to-information conversion: Meaningful information has to be inferred from the data. Several tools and methodologies are currently available for this conversion, and in recent years extensive effort has gone into developing such algorithms specifically for prognostics and health management applications. By calculating health values, estimated remaining useful life and similar indicators, the second level of the CPS architecture brings self-awareness to machines (a minimal sketch of such a conversion follows this list).
Cyber: The cyber level acts as the central information hub in this architecture. Information is pushed to it from every connected machine to form a network of machines. With this mass of information gathered, specific analytics have to be used to extract additional information that provides better insight into the status of individual machines within the fleet. These analytics give machines a self-comparison ability: the performance of a single machine can be compared with and rated among the fleet, and similarities between a machine's performance and previous assets (historical information) can be measured to predict its future behaviour.
Cognition: Implementing CPS at this level generates thorough knowledge of the monitored system. Proper presentation of the acquired knowledge to expert users supports correct decision-making. Since comparative information as well as individual machine status is available, decisions on the priority of tasks to optimise the maintenance process can be made. At this level, proper infographics are necessary to completely transfer the acquired knowledge to the users.
Configuration: The configuration level is the feedback from cyberspace to physical space and acts as supervisory control to make machines self-configuring and self-adaptive. This stage acts as a resilience control system (RCS) to apply the corrective and preventive decisions made at the cognition level to the monitored system.[24][25]
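As a rough illustration of the data-to-information conversion level, the sketch below turns raw vibration samples into a simple normalised health value. The RMS feature and the linear health score are assumptions chosen purely for illustration; they are not the specific methodology of the cited references.

```python
import numpy as np

# Illustrative sketch of data-to-information conversion: turning raw sensor
# samples into a simple health indicator. The RMS feature and the linear
# health score are assumptions for illustration, not the cited methodology.

def rms(signal):
    """Root-mean-square of a vibration signal segment."""
    signal = np.asarray(signal, dtype=float)
    return np.sqrt(np.mean(signal ** 2))

def health_value(current_rms, baseline_rms, failure_rms):
    """Map the RMS feature to a health score in [0, 1] (1 = healthy)."""
    score = 1.0 - (current_rms - baseline_rms) / (failure_rms - baseline_rms)
    return float(np.clip(score, 0.0, 1.0))

baseline = rms(np.random.normal(0, 0.5, 1024))  # healthy reference segment
current = rms(np.random.normal(0, 0.8, 1024))   # newly acquired segment
print(health_value(current, baseline_rms=baseline, failure_rms=3 * baseline))
```

A health value trending towards zero would then feed the cyber and cognition levels, where it can be compared across the fleet and presented to users.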
Technologies
Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report[35] suggests suitable technologies include A/B testing, crowdsourcing, data fusion and integration, genetic algorithms, machine learning, natural language processing, signal processing, simulation, time series analysis and visualisation. Multidimensional big data can also be represented as tensors, which can be handled more efficiently by tensor-based computation,[36] such as multilinear subspace learning.[37] Additional technologies being applied to big data include massively parallel processing (MPP) databases, search-based applications, data mining, distributed file systems, distributed databases, cloud-based infrastructure (applications, storage and computing resources) and the Internet.[citation needed]
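The tensor representation mentioned above can be illustrated with a short sketch that stores a small multidimensional data set as a third-order tensor and computes its mode-n unfoldings, a basic operation underlying multilinear subspace learning; the shapes and data here are arbitrary examples, not from any particular system.

```python
import numpy as np

# Represent multidimensional data as a tensor and compute mode-n unfoldings,
# the basic matrix views used by multilinear subspace learning methods.
# Shapes and data are arbitrary illustrative examples.

def unfold(tensor, mode):
    """Mode-n unfolding: arrange the mode-`mode` fibres as matrix columns."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# e.g. ten 4x5-pixel video frames stored as a third-order tensor
data = np.random.rand(10, 4, 5)

print(unfold(data, 0).shape)  # (10, 20): one row per frame
print(unfold(data, 1).shape)  # (4, 50): one row per image row index
print(unfold(data, 2).shape)  # (5, 40): one row per image column index
```

Multilinear subspace learning methods then factor such unfoldings mode by mode instead of flattening the data into a single long vector.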
Some but not all MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.[38]
DARPA’s Topological Data Analysis program seeks the fundamental structure of massive data sets and in 2008 the technology went public with the launch of a company called Ayasdi.[39]
The practitioners of big data analytics processes are generally hostile to slower shared storage,[40] preferring direct-attached storage (DAS) in its various forms, from solid-state drives (SSD) to high-capacity SATA disks buried inside parallel processing nodes. The perception of shared storage architectures, storage area network (SAN) and network-attached storage (NAS), is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.
Real or near-real-time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good; data on a spinning disk at the other end of an FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is very much higher than that of other storage techniques.
There are advantages as well as disadvantages to shared storage in big data analytics, but big data analytics practitioners as of 2011 did not favour it.[41]