Results
Although our implementation of Spark is still at an early stage, we relate the results of three experiments that show its promise as a cluster computing framework.
Logistic Regression: We compared the performance of the logistic regression job in Section 3.2 to an implementation of logistic regression for Hadoop, using a 29 GB dataset on 20 "m1.xlarge" EC2 nodes with 4 cores each. The results are shown in Figure 2. With Hadoop, each iteration takes 127s, because it runs as an independent MapReduce job. With Spark, the first iteration takes 174s (likely due to using Scala instead of Java), but subsequent iterations take only 6s each, because they reuse cached data. This allows the job to run up to 10x faster.

Figure 2: Logistic regression performance in Hadoop and Spark.
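For reference, the iterative pattern that benefits from caching has roughly the shape of the Section 3.2 code. The sketch below assumes a parsePoint helper and a simple Vector class with a dot product and elementwise arithmetic; both are stand-ins, not part of this section:

    // Sketch of the cached logistic regression loop (cf. Section 3.2).
    // parsePoint and Vector are assumed helpers.
    val points = spark.textFile("hdfs://...").map(parsePoint).cache()
    var w = Vector.random(D)  // random initial separating plane
    for (i <- 1 to ITERATIONS) {
      // After the first iteration, points is served from each node's
      // in-memory cache instead of being rescanned from HDFS.
      val gradient = points.map { p =>
        (1 / (1 + math.exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      }.reduce(_ + _)
      w -= gradient
    }

Only the first pass pays the cost of reading and parsing the 29 GB input; every later pass runs against the cached points, which is what drives the per-iteration time from 127s down to 6s.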
We have also tried crashing a node while the job was running. In the 10-iteration case, this slows the job down by 50s (21%) on average. The data partitions on the lost node are recomputed and cached in parallel on other nodes, but the recovery time was rather high in the current experiment because we used a high HDFS block size (128 MB), so there were only 12 blocks per node and the recovery process could not utilize all cores in the cluster. Smaller block sizes would yield faster recovery times.
Alternating Least Squares: We have implemented the alternating least squares job in Section 3.3 to measure the benefit of broadcast variables for iterative jobs that copy a shared dataset to multiple nodes. We found that without using broadcast variables, the time to resend the ratings matrix R on each iteration dominated the job's running time. Furthermore, with a naïve implementation of broadcast (using HDFS or NFS), the broadcast time grew linearly with the number of nodes, limiting the scalability of the job. We implemented an application-level multicast system to mitigate this. However, even with fast broadcast, resending R on each iteration is costly. Caching R in memory on the workers using a broadcast variable improved performance by 2.8x in an experiment with 5000 movies and 15000 users on a 30-node EC2 cluster.
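Concretely, the change amounts to wrapping R in a broadcast variable, so it is shipped to each worker once and then read from the local cache on every iteration. A minimal sketch in the style of the Section 3.3 code follows; updateUser, updateMovie, and the initialization of M are hypothetical helpers:

    // R is sent to each worker once, then reused from cache.
    // updateUser and updateMovie are assumed per-row update functions.
    val Rb = spark.broadcast(R)
    var M = initialMovieVectors()   // hypothetical random initialization
    var U: Array[Vector] = null
    for (i <- 1 to ITERATIONS) {
      U = spark.parallelize(0 until u)
               .map(j => updateUser(j, Rb.value, M))
               .collect()
      M = spark.parallelize(0 until m)
               .map(j => updateMovie(j, Rb.value, U))
               .collect()
    }

Without the broadcast variable, R would be serialized into the closure and resent with every map on every iteration, which is the cost that dominated the naïve version's running time.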
Interactive Spark: We used the Spark interpreter to load a 39 GB dump of Wikipedia in memory across 15 "m1.xlarge" EC2 machines and query it interactively. The first time the dataset is queried, it takes roughly 35 seconds, comparable to running a Hadoop job on it. However, subsequent queries take only 0.5 to 1 seconds, even if they scan all the data. This provides a qualitatively different experience, comparable to working with local data.
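A session of this kind looks roughly like the following; the dataset path and query predicates are illustrative, not the exact ones we used:

    scala> val pages = spark.textFile("hdfs://.../wikipedia").cache()
    scala> pages.filter(_.contains("Berkeley")).count()
    // first full scan: ~35s, and populates each node's cache
    scala> pages.filter(_.contains("Stanford")).count()
    // subsequent scans: 0.5-1s, served entirely from memory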