The State of the Art in Hadoop
It is possible to blend operational and analytical workloads together in Hadoop today, and in fact we see many of our customers doing it.
The pieces you need are already in Hadoop:
Apache HBase is the NoSQL database for Hadoop and is great at fast updates and low-latency data access.
Apache Phoenix (pioneered by Salesforce) is a SQL skin for data in HBase. The Phoenix community is already investigating integration with transaction managers such as Tephra (from Cask).
Apache Hive is the de facto SQL engine for Hadoop, providing the deepest SQL analytics and supporting both batch and interactive query patterns. See our recent Stinger.Next post for advances such as Hive LLAP.
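To make the "SQL skin" idea concrete, here is a minimal Phoenix sketch. The web_events table and its columns are hypothetical, invented for illustration; the UPSERT syntax and date functions are standard Phoenix:

    -- Phoenix DDL: the table is backed by an HBase table of the same name.
    CREATE TABLE IF NOT EXISTS web_events (
        event_id   BIGINT NOT NULL,
        event_time DATE   NOT NULL,
        user_id    VARCHAR,
        url        VARCHAR,
        CONSTRAINT pk PRIMARY KEY (event_id, event_time)
    );

    -- Phoenix uses UPSERT rather than INSERT; writes become HBase puts.
    UPSERT INTO web_events (event_id, event_time, user_id, url)
    VALUES (1, CURRENT_DATE(), 'u-42', '/checkout');

    -- Low-latency operational query served directly from HBase.
    SELECT user_id, COUNT(*) AS hits
    FROM web_events
    WHERE event_time > CURRENT_DATE() - 1  -- date arithmetic is in days
    GROUP BY user_id;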
We see many of our customers using these pieces together today to build applications with deep analytics. For example, a very common pattern includes:
Using HBase as the online operational data store for fast updates on hot data, such as the current hour's or day's partition.
Executing operational queries directly against HBase using Apache Phoenix.
Aging data out of HBase into Hive tables using standard ETL patterns (a sketch follows this list).
Performing deep SQL analytics using Hive.
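As a sketch of the aging step, Hive can mount the live HBase table through its HBaseStorageHandler and copy closed partitions into an ORC-backed Hive table. The table and column names continue the hypothetical example above; the storage handler class and its properties are the standard Hive-HBase integration:

    -- Hive external table mapped onto the live HBase table.
    CREATE EXTERNAL TABLE hbase_web_events (
        rowkey  STRING,
        user_id STRING,
        url     STRING
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:user_id,d:url')
    TBLPROPERTIES ('hbase.table.name' = 'WEB_EVENTS');

    -- ORC-backed Hive table that holds the aged history.
    CREATE TABLE IF NOT EXISTS web_events_history (
        rowkey  STRING,
        user_id STRING,
        url     STRING
    )
    PARTITIONED BY (ds STRING)
    STORED AS ORC;

    -- Age a closed day out of HBase into Hive, then analyze it there.
    INSERT INTO TABLE web_events_history PARTITION (ds = '2015-04-01')
    SELECT rowkey, user_id, url
    FROM hbase_web_events;

Note that Hive cannot delete the aged rows from HBase through the storage handler; purging the old data from HBase typically runs as a separate job, which is part of the ETL complexity discussed below.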
This works, but it creates a number of complexities for developers. For example:
Which SQL interface do I use, and when? Do I use Hive, which offers deep SQL but low TPS? Or do I use Phoenix, with high TPS but basic SQL? Or do I use both?
If I use both, how do I share data between Hive and HBase? (The sketch after this list shows the same query written against both.)
How do I tune my cluster so that I can successfully co-locate HBase and Hive while meeting my SLAs?
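The first two questions are easy to see side by side. Below, the same hypothetical count is written twice, once in Phoenix SQL against the hot data in HBase and once in HiveQL against the aged history table; an application that needs both hot and historical results must issue both queries and stitch the answers together itself:

    -- Phoenix: operational count over the last day's hot data in HBase.
    SELECT COUNT(*) FROM web_events
    WHERE event_time > CURRENT_DATE() - 1;

    -- Hive: the same count over an aged partition stored as ORC.
    SELECT COUNT(*) FROM web_events_history
    WHERE ds = '2015-04-01';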
These questions suggest that deeper integration is needed to simplify building applications with deep analytics on Hadoop.