Alternating Least Squares: We have implemented the alternating least squares job in Section 3.3 to measure the benefit of broadcast variables for iterative jobs that copy a shared dataset to multiple nodes. We found that without using broadcast variables, the time to resend the ratings matrix R on each iteration dominated the job’s running time. Furthermore, with a na ̈ıve implementation of broad- cast (using HDFS or NFS), the broadcast time grew linearly with the number of nodes, limiting the scalability of the job. We implemented an application-level multicast system to mitigate this. However, even with fast broad- cast, resending R on each iteration is costly. Caching R in memory on the workers using a broadcast variable im- proved performance by 2.8x in an experiment with 5000 movies and 15000 users on a 30-node EC2 cluster.
Interactive Spark: We used the Spark interpreter to load a 39 GB dump of Wikipedia in memory across 15 “m1.xlarge” EC2 machines and query it interactively. The first time the dataset is queried, it takes roughly 35 sec- onds, comparable to running a Hadoop job on it. How- ever, subsequent queries take only 0.5 to 1 seconds, even if they scan all the data. This provides a qualitatively dif- ferent experience, comparable to working with local data.