In the big data era, it is very hard for a single machine to undertake machine learning tasks because of
limitations in CPU, memory, or disk space. More and more learning algorithms therefore resort to distributed
big-data platforms such as Hadoop, Spark, and Hazelcast [1]. All of these platforms involve coordination
between machines to accomplish a common computation task. This coordination comes down to cooperation
between machines with respect to computation, storage, algorithms, knowledge, and data, and it is far from a
simple job.
Hadoop uses the map-reduce model to decompose tasks: a central machine takes charge, assigns tasks, and
invokes other machines to execute them. Hazelcast is a data grid system and distributed computing platform.
When a client launches an execution through an "ExecutorService", Hazelcast chooses a node to accomplish the
task. In both Hadoop and Hazelcast, tasks are usually coordinated by a central machine. If the central node goes
down, it is hard, if not impossible, to switch control to other nodes seamlessly.
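To make this coordination pattern concrete, the following minimal sketch (not taken from [1]; the cluster setup, class names, and the squaring task are illustrative assumptions, and package names vary slightly across Hazelcast versions) shows how a client might submit a task to a Hazelcast distributed executor, leaving it to Hazelcast to choose the member that runs it.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;

import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

public class DistributedTaskDemo {

    // The task must be Serializable so Hazelcast can ship it to whichever member it picks.
    static class SquareTask implements Callable<Long>, Serializable {
        private final long value;

        SquareTask(long value) { this.value = value; }

        @Override
        public Long call() {
            // Runs on the member chosen by Hazelcast, not necessarily the submitting node.
            return value * value;
        }
    }

    public static void main(String[] args) throws Exception {
        // Join (or start) a cluster; a real deployment would typically connect as a client to remote members.
        HazelcastInstance instance = Hazelcast.newHazelcastInstance();

        // Obtain a named distributed executor; Hazelcast decides which member executes each submitted task.
        IExecutorService executor = instance.getExecutorService("demo-executor");

        Future<Long> result = executor.submit(new SquareTask(7));
        System.out.println("7^2 computed on some cluster member: " + result.get());

        instance.shutdown();
    }
}
```

The sketch illustrates the point made above: the node that submits the task acts as a coordinator, and if it fails before collecting results, the computation cannot simply continue elsewhere.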