Spatial Hadoop is gaining popularity as a big data platform for processing large volumes of spatial data. However, many real-world applications encounter performance and scalability problems when ported to Spatial Hadoop. One of the main reasons is the lack of an automatic partitioning and distribution mechanism for spatial data on HDFS. Ideas from previous research on shared-nothing models of parallel computing can be applied to current Hadoop systems to improve the performance and throughput of spatial applications that use HDFS storage. Since inter-node network latency differs across Hadoop deployments (from InfiniBand to Ethernet), these techniques need to be adapted to the underlying system.
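To make the declustering idea concrete, the following is a minimal sketch of grid-based declustering for point data in the shared-nothing spirit: a uniform grid maps each record to a cell, and cells are spread across nodes so that a range query touching several cells can be served by several nodes in parallel. This is an illustration, not Spatial Hadoop's actual mechanism; the bounding box, grid resolution, and all names (Point, cellFor, nodeFor) are assumptions made for the example.

```java
/**
 * Minimal sketch of grid-based declustering for point data.
 * All constants and names here are illustrative assumptions,
 * not part of the Spatial Hadoop API.
 */
public class GridDecluster {

    // Assumed data bounds and grid resolution for the sketch.
    static final double MIN_X = 0, MIN_Y = 0, MAX_X = 100, MAX_Y = 100;
    static final int GRID_COLS = 8, GRID_ROWS = 8;

    record Point(double x, double y) {}

    /** Map a point to its cell on a uniform grid over the bounding box. */
    static int cellFor(Point p) {
        int col = Math.min(GRID_COLS - 1,
                (int) ((p.x() - MIN_X) / (MAX_X - MIN_X) * GRID_COLS));
        int row = Math.min(GRID_ROWS - 1,
                (int) ((p.y() - MIN_Y) / (MAX_Y - MIN_Y) * GRID_ROWS));
        return row * GRID_COLS + col;
    }

    /**
     * Decluster: assign neighboring cells to different nodes so a range
     * query spanning several cells is answered in parallel.
     */
    static int nodeFor(Point p, int numNodes) {
        return cellFor(p) % numNodes;
    }

    public static void main(String[] args) {
        int numNodes = 4;
        Point[] pts = { new Point(3.2, 7.9), new Point(55.0, 42.1), new Point(99.9, 0.5) };
        for (Point p : pts)
            System.out.printf("point %s -> cell %d -> node %d%n",
                    p, cellFor(p), nodeFor(p, numNodes));
    }
}
```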
Current spatial applications deal with a wide variety of data, including raster, vector, and real-time sensor data. The appropriate storage model depends on the type of data and the nature of the application. This talk will present some of the common problems encountered in storing spatial data on HDFS and offer some initial thoughts on clustering and declustering of both raster and vector data on HDFS. The goal of this talk is to start a discussion on storage models for Hadoop so that storage can be handled automatically by Spatial Hadoop systems, leaving application developers free to focus on their analysis algorithms.
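As one example of a clustering scheme applicable to both raster tiles and vector records, the sketch below orders grid cells by a Z-order (Morton) space-filling curve, a common way to keep spatially nearby data adjacent on disk, for instance within the same HDFS block. This is a hedged illustration of one possible technique, not a Spatial Hadoop API; the ZOrderCluster, Cell, morton, and spread names are assumptions for the example.

```java
import java.util.Arrays;
import java.util.Comparator;

/**
 * Sketch of clustering spatial records with a Z-order (Morton) curve so
 * that nearby cells are stored near each other on disk. Illustrative only.
 */
public class ZOrderCluster {

    /** Interleave the low 16 bits of x and y into a Morton code. */
    static long morton(int x, int y) {
        return (spread(x) << 1) | spread(y);
    }

    /** Spread the low 16 bits of v so each bit is followed by a zero bit. */
    static long spread(int v) {
        long x = v & 0xFFFFL;
        x = (x | (x << 8)) & 0x00FF00FFL;
        x = (x | (x << 4)) & 0x0F0F0F0FL;
        x = (x | (x << 2)) & 0x33333333L;
        x = (x | (x << 1)) & 0x55555555L;
        return x;
    }

    record Cell(int x, int y) {}

    public static void main(String[] args) {
        Cell[] cells = { new Cell(5, 9), new Cell(4, 8), new Cell(60, 1), new Cell(5, 8) };
        // Sorting by Morton code groups spatially close cells together,
        // which is what a clustered on-disk layout wants.
        Arrays.sort(cells, Comparator.comparingLong((Cell c) -> morton(c.x(), c.y())));
        for (Cell c : cells)
            System.out.printf("cell (%d,%d) -> z=%d%n", c.x(), c.y(), morton(c.x(), c.y()));
    }
}
```

Sorting by the Morton code before writing tends to place neighboring cells in the same HDFS block, so a spatial query reads fewer blocks; this is the dual of the declustering sketch above, which deliberately spreads cells across nodes for parallelism.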