As workloads evolve from one-dimensional applications such as search to much more complex workflows, second-generation big data systems must provide not only scalability,
resiliency, and usability, but also structures that support
multiple analytic methods on varied
data types, as well as the ability to respond in near real time.
Figure 2a illustrates a logical view
of the big data system stack. At the
bottom is a common data layer that
allows each analytics engine access to
the required data; this layer also facilitates data sharing among the analytics engines while providing resilient
persistent storage. In the middle is a
resource scheduler that efficiently divides and distributes workload tasks
among available infrastructure resources. At the top are the analytics
engines, the frameworks that
translate the user's analytics code
into consumable tasks for the resource
scheduler to distribute.
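To make this layered view concrete, the following minimal Python sketch models the three layers and how a task flows through them. Every class, method, and dataset name here is a hypothetical illustration for this discussion, not the API of any particular system.

    # Minimal sketch of the three-layer stack in Figure 2a.
    # All names are hypothetical; no real big data API is implied.

    from dataclasses import dataclass
    from typing import Callable, Dict, List


    class DataLayer:
        """Common data layer: persistent storage shared by all engines."""

        def __init__(self) -> None:
            self._store: Dict[str, List[int]] = {}

        def write(self, key: str, records: List[int]) -> None:
            self._store[key] = list(records)  # "persisted" in memory here

        def read(self, key: str) -> List[int]:
            return self._store[key]


    @dataclass
    class Task:
        """A unit of work an engine hands to the scheduler."""
        fn: Callable[[List[int]], int]
        input_key: str


    class ResourceScheduler:
        """Divides tasks among resources; this sketch runs them serially."""

        def __init__(self, data: DataLayer) -> None:
            self.data = data

        def run(self, tasks: List[Task]) -> List[int]:
            # A real scheduler would place tasks on cluster nodes;
            # here each task simply executes against the shared data layer.
            return [t.fn(self.data.read(t.input_key)) for t in tasks]


    class BatchEngine:
        """An analytics engine: translates user code into scheduler tasks."""

        def __init__(self, scheduler: ResourceScheduler) -> None:
            self.scheduler = scheduler

        def sum_of(self, key: str) -> int:
            return self.scheduler.run([Task(fn=sum, input_key=key)])[0]


    if __name__ == "__main__":
        data = DataLayer()
        data.write("clicks", [3, 1, 4, 1, 5])
        engine = BatchEngine(ResourceScheduler(data))
        print(engine.sum_of("clicks"))  # -> 14

Note that in the sketch the engine never touches storage directly; it expresses work as tasks over named datasets, and the scheduler resolves those names against the shared data layer. That separation is what lets multiple engines operate on, and share, the same data.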
Figure 2b shows the first-generation embodiment of this logical view.
First-generation big data systems were
designed to provide capability based
mainly on scalable and resilient batch
analytics; workflow as an aspect of big
data, illustrated in Figure A of the
sidebar, was given little attention.
Now, equally enabled by technological trends and driven by workload
demands, a new paradigm is rapidly
emerging. Figure 2c outlines this
second-generation embodiment for
big data systems.