Oracle’s Big Data Solution
Oracle is the first vendor to offer a complete and integrated solution to address the full spectrum of enterprise big data requirements. Oracle’s big data strategy is centered on the idea that you can extend your current enterprise information architecture to incorporate big data. New big data technologies, such as Hadoop and Oracle NoSQL database, run alongside your Oracle data warehouse to deliver business value and address your big data requirements.
Figure 2 Oracle’s Big Data Solutions
Oracle Big Data Appliance
Oracle Big Data Appliance is an engineered system that combines optimized hardware with a comprehensive big data software stack to deliver a complete, easy-to-deploy solution for acquiring and organizing big data.
Oracle Big Data Appliance comes in a full rack configuration with 18 Sun servers for a total storage capacity of 648TB. Every server in the rack has 2 CPUs, each with 8 cores, for a total of 288 cores per full rack. Each server has 64GB of memory (upgradeable to a maximum of 512GB per node) for a total of 1152GB of memory per full rack.
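The rack totals quoted above follow directly from the per-server figures. The short Python sketch below reproduces the arithmetic; the constant and function names are illustrative only, and the per-server storage figure is simply derived by dividing the quoted rack capacity by the number of servers.

```python
# Figures quoted in the text for one full rack.
SERVERS_PER_RACK = 18
CPUS_PER_SERVER = 2
CORES_PER_CPU = 8
MEMORY_GB_PER_SERVER = 64
MAX_MEMORY_GB_PER_SERVER = 512  # upgrade ceiling per node
RACK_STORAGE_TB = 648


def rack_totals(memory_gb_per_server=MEMORY_GB_PER_SERVER):
    """Return total cores, total memory (GB), and storage per server (TB)."""
    return {
        "cores": SERVERS_PER_RACK * CPUS_PER_SERVER * CORES_PER_CPU,
        "memory_gb": SERVERS_PER_RACK * memory_gb_per_server,
        "storage_tb_per_server": RACK_STORAGE_TB / SERVERS_PER_RACK,
    }


print(rack_totals())
# Memory per full rack if every node is upgraded to the maximum.
print(rack_totals(MAX_MEMORY_GB_PER_SERVER)["memory_gb"])
```

With the base configuration this gives 288 cores and 1152GB of memory, matching the figures above.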
Oracle White Paper—Big Data for the Enterprise
Figure 3 High-level overview of software on Big Data Appliance
Oracle Big Data Appliance includes a combination of open source software and specialized software developed by Oracle to address enterprise big data requirements.
The Oracle Big Data Appliance software includes:
Full distribution of Cloudera’s Distribution including Apache Hadoop (CDH4)
Oracle Big Data Appliance Plug-In for Enterprise Manager
Cloudera Manager to administer all aspects of Cloudera CDH
Oracle distribution of the statistical package R
Oracle NoSQL Database Community Edition (Oracle NoSQL Database Enterprise Edition is available for Oracle Big Data Appliance as a separately licensed component)
Oracle Enterprise Linux operating system and Oracle Java VM
Oracle NoSQL Database
Oracle NoSQL Database is a distributed, highly scalable, key-value database based on Oracle Berkeley DB. It delivers a general-purpose, enterprise-class key-value store by adding an intelligent driver on top of distributed Berkeley DB. This intelligent driver keeps track of the underlying storage topology, shards the data, and knows where data can be placed with the lowest latency. Unlike competitive solutions, Oracle NoSQL Database is easy to install, configure and manage, supports a broad set of workloads, and delivers enterprise-class reliability backed by enterprise-class Oracle support.
Figure 4 NoSQL Database Architecture
The primary use cases for Oracle NoSQL Database are low-latency data capture and fast querying of that data, typically by key lookup. Oracle NoSQL Database comes with an easy-to-use Java API and a management framework. The product is available in both an open source community edition and in a priced enterprise edition for large distributed data centers. The community edition is installed as part of the Big Data Appliance integrated software.
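To make the idea of a driver that shards data and serves key lookups more concrete, the following Python sketch routes each key to one of several in-memory shards using a stable hash. It is an illustration of the general technique only: the class and method names are hypothetical, it does not use the Oracle NoSQL Database Java API, and it does not claim to reflect the hashing or topology logic of the actual driver.

```python
import hashlib


class ShardedKeyValueStore:
    """Toy client-side driver: routes each key to one of N in-memory shards."""

    def __init__(self, num_shards):
        if num_shards < 1:
            raise ValueError("num_shards must be at least 1")
        self._shards = [dict() for _ in range(num_shards)]

    def shard_for(self, key):
        # Stable hash so the same key always maps to the same shard.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self._shards)

    def put(self, key, value):
        self._shards[self.shard_for(key)][key] = value

    def get(self, key):
        # Key lookup touches exactly one shard.
        return self._shards[self.shard_for(key)].get(key)


store = ShardedKeyValueStore(num_shards=3)
store.put("user/42/profile", {"name": "Ada"})
print(store.get("user/42/profile"))
```

Because the client computes the target shard itself, a read or write goes to a single shard without an intermediate routing hop, which is the property that keeps key lookups low latency.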
Oracle Big Data Connectors
Where Oracle Big Data Appliance makes it easy for organizations to acquire and organize new types of data, Oracle Big Data Connectors tightly integrate the big data environment with Oracle Exadata and Oracle Database, so that you can analyze all of your data together with extreme performance. The Oracle Big Data Connectors consist of four components:
Oracle Loader for Hadoop
Oracle Loader for Hadoop (OLH) enables users to use Hadoop MapReduce processing to create optimized data sets for efficient loading and analysis in Oracle Database 11g. Unlike other Hadoop loaders, it generates Oracle internal formats to load data faster and use fewer database system resources. OLH is added as the last step in the MapReduce transformations as a separate map – partition – reduce step. This last step uses the CPUs in the Hadoop cluster to format the data into Oracle’s internal database formats, allowing for lower CPU utilization and higher data ingest rates on the Oracle Database platform. Once loaded, the data is permanently available in the database, providing very fast access to this data for general database users leveraging SQL or business intelligence tools.
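The shape of a map – partition – reduce step can be sketched in a few lines of Python. This is a simplified, single-process illustration and not Oracle Loader for Hadoop itself: the input layout (region, customer id, amount), the function names, and the choice of region as the partition key are all assumptions, and the sketch emits sorted Python tuples rather than Oracle internal formats.

```python
from collections import defaultdict


def map_record(line):
    """Map step: parse a delimited line into (partition_key, row)."""
    region, customer_id, amount = line.split(",")
    return region, (int(customer_id), float(amount))


def partition(pairs):
    """Partition step: group rows by the target table partition."""
    groups = defaultdict(list)
    for key, row in pairs:
        groups[key].append(row)
    return groups


def reduce_partition(rows):
    """Reduce step: sort rows so each partition is emitted load-ready."""
    return sorted(rows)


lines = ["EMEA,7,19.5", "APAC,3,4.0", "EMEA,2,8.25"]
batches = {key: reduce_partition(rows)
           for key, rows in partition(map(map_record, lines)).items()}
print(batches)
```

The point of the pattern is that parsing, grouping and sorting are done on the Hadoop side, so the database receives data that is already organized per partition.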
Oracle SQL Connector for Hadoop Distributed File System
Oracle SQL Connector for Hadoop Distributed File System (HDFS) is a high speed connector for accessing data on HDFS directly from Oracle Database. Oracle SQL Connector for HDFS gives users the flexibility of querying data from HDFS at any time, as needed by their application.
It allows the creation of an external table in Oracle Database, enabling direct SQL access on data stored in HDFS. The data stored in HDFS can then be queried via SQL, joined with data stored in Oracle Database, or loaded into the Oracle Database. Access to the data on HDFS is optimized for fast data movement and parallelized, with automatic load balancing. Data on HDFS can be in delimited files or in Oracle data pump files created by Oracle Loader for Hadoop.
Oracle Data Integrator Application Adapter for Hadoop
Oracle Data Integrator Application Adapter for Hadoop simplifies data integration between Hadoop and Oracle Database through Oracle Data Integrator’s easy-to-use interface. Once the data is accessible in the database, end users can use SQL and Oracle BI Enterprise Edition to access data.
Enterprises that are already using a Hadoop solution, and don’t need an integrated offering like Oracle Big Data Appliance, can integrate data from HDFS using Big Data Connectors as a stand-alone software solution.
Oracle R Connector for Hadoop
Oracle R Connector for Hadoop is an R package that provides transparent access to Hadoop and to data stored in HDFS.
R Connector for Hadoop provides users of the open-source statistical environment R with the ability to analyze data stored in HDFS, and to scalably run R models against large volumes of data leveraging MapReduce processing – without requiring R users to learn yet another API or language. End users can leverage over 3500 open source R packages to analyze data stored in HDFS, while administrators do not need to learn R to schedule R MapReduce models in production environments.
R Connector for Hadoop can optionally be used together with the Oracle Advanced Analytics Option for Oracle Database. The Oracle Advanced Analytics Option enables R users to work transparently with database-resident data without having to learn SQL or database concepts, while R computations execute directly in-database.
In-Database Analytics
Once data has been loaded from Oracle Big Data Appliance into Oracle Database or Oracle Exadata, end users can use one of the following easy-to-use tools for in-database, advanced analytics:
Oracle R Enterprise – Oracle’s version of the widely used Project R statistical environment enables statisticians to use R on very large data sets without any modifications to the end user experience. Examples of R usage include predicting airline delays at particular airports and the submission of clinical trial analysis and results.
In-Database Data Mining – the ability to create complex models and deploy these on very large data volumes to drive predictive analytics. End-users can leverage the results of these predictive models in their BI tools without the need to know how to build the models. For example, regression models can be used to predict customer age based on purchasing behavior and demographic data.
In-Database Text Mining – the ability to mine text from micro blogs, CRM system comment fields and review sites combining Oracle Text and Oracle Data Mining. An example of text mining is sentiment analysis based on comments. Sentiment analysis tries to show how customers feel about certain companies, products or activities.
In-Database Graph Analysis – the ability to create graphs and connections between various data points and data sets. Graph analysis creates, for example, networks of relationships determining the value of a customer’s circle of friends. When looking at customer churn, customer value is then based on the value of the customer’s network, rather than on just the value of the individual customer.
In-Database Spatial – the ability to add a spatial dimension to data and show data plotted on a map. This ability enables end users to understand geospatial relationships and trends much more efficiently. For example, spatial data can visualize a network of people and their geographical proximity. Customers who are in close proximity can readily influence each other’s purchasing behavior, an opportunity which can be easily missed if spatial visualization is left out.
In-Database MapReduce – the ability to write procedural logic and seamlessly leverage Oracle Database parallel execution. In-database MapReduce allows data scientists to create high-performance routines with complex logic. In-database MapReduce can be exposed via SQL. Examples of leveraging in-database MapReduce are sessionization of weblogs or organization of Call Detail Records (CDRs).
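Sessionization of weblogs, mentioned above as an example of procedural logic, can be sketched as follows. This Python version only illustrates the logic and does not run in the database: the event layout (user, timestamp in seconds), the function name, and the 30-minute inactivity timeout are assumptions chosen for the example.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Assign a per-user session number to (user, timestamp) events.

    A new session starts whenever the gap since the user's previous
    event exceeds `gap` seconds.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    result = []
    for user, stamps in by_user.items():
        session = 0
        previous = None
        for ts in sorted(stamps):
            if previous is not None and ts - previous > gap:
                session += 1
            result.append((user, ts, session))
            previous = ts
    return sorted(result)


events = [("alice", 0), ("alice", 600), ("bob", 100), ("alice", 4000)]
print(sessionize(events))
```

Each user's events can be processed independently of every other user's, which is why this kind of routine lends itself to parallel execution.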
Every one of the analytical components in Oracle Database is valuable. Combining these components creates even more value to the business. Leveraging SQL or a BI Tool to expose the results of these analytics to end users gives an organization an edge over others who do not leverage the full potential of analytics in Oracle Database.
Connections between Oracle Big Data Appliance and Oracle Exadata are via InfiniBand, enabling high-speed data transfer for batch or query workloads. Oracle Exadata provides outstanding performance in hosting data warehouses and transaction processing databases.
Now that the data is in mass-consumption format, Oracle Exalytics can be used to deliver the wealth of information to the business analyst. Oracle Exalytic