This study investigates the feasibility and performance of running global climate models in a cloud computing environment.
We create an AWS EC2 virtual cluster using the StarCluster software
package and carry out CESM simulations to benchmark the running time
and parallelization efficiency. The CESM model can be run in parallel
on such a cloud-based virtual cluster with a minimal amount of effort
for packaging and compiling the code and transferring the input data
sets to the cloud environment. We test the parallelization efficiency
of the CESM model on the AWS EC2 virtual cluster and find that, up to
64 cores, AWS EC2 delivers a parallelization efficiency comparable to
(or even better than) that of a traditional Linux cluster with an
InfiniBand interconnect. For the case we test on the AWS virtual
cluster, the communication overhead between virtual EC2 nodes
outweighs the savings from distributed computing when the number of
cores exceeds 64 (i.e., 4 nodes). This differs from the local HPC
cluster, where the running time continues to decrease as the number
of cores increases from 64 to 112. The difference arises because the
network latency on the AWS virtual cluster (10 Gigabit Ethernet, ~80
μs for a 1-byte message) is far higher than that on the local HPC
cluster (40 Gigabit InfiniBand, ~1.7 μs for a 1-byte message).
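The way latency shifts the scaling limit can be illustrated with a simple analytical model: per-step wall time is the compute time divided by the core count, plus a communication term that grows with core count and is proportional to network latency. The compute time, message count, and candidate core counts below are hypothetical illustrations, not measurements from this study.

```python
# Illustrative latency-bound scaling model (hypothetical numbers, not
# measurements from the study): T(p) = t_compute / p + latency * n_msgs * log2(p),
# where the log2(p) factor stands in for tree-structured collective exchanges.
from math import log2

def step_time(p, latency_s, t_compute=1.0, n_msgs=128):
    """Modeled wall time per model step on p cores.

    p          : number of cores
    latency_s  : one-way network latency in seconds
    t_compute  : single-core compute time per step (hypothetical)
    n_msgs     : small messages exchanged per collective (hypothetical)
    """
    compute = t_compute / p                      # work divides evenly
    comm = latency_s * n_msgs * log2(p)          # latency-dominated exchanges
    return compute + comm

def best_core_count(latency_s, cores=(16, 32, 48, 64, 80, 96, 112)):
    """Core count that minimizes the modeled step time."""
    return min(cores, key=lambda p: step_time(p, latency_s))
```

Under these assumed parameters, the ~80 μs Ethernet latency puts the modeled minimum near 64 cores, while the ~1.7 μs InfiniBand latency leaves the step time still decreasing at 112 cores, qualitatively matching the behavior described above.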
Because 112 cores is the maximum available to us on the local HPC
cluster, we do not know the maximum scalability of the CESM model on
the local HPC cluster's 40 Gigabit InfiniBand network with Intel Xeon
processors. However, Worley et al. (2011) showed that the simulation
performance of the CESM model can scale by a factor of 8 as the core
count increases from 64 to 2048 on a Cray XT5 system (57.6 Gigabit
three-dimensional torus network with a latency of ~1 μs). All of
these results confirm that the CESM model is latency sensitive,
likely owing to the extensive exchange of information among modules,
and within each module, at each time step of the numerical
integration. This raises a question worthy of future investigation:
the current parallelization of the CESM (and likely of other climate
models) is optimized for traditional supercomputing facilities; can
the code and the parallelization be optimized for the cloud computing
environment?
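For reference, the parallelization efficiency benchmarked throughout this study is the standard ratio of achieved speedup to ideal linear speedup. A minimal sketch, with hypothetical timings (not results from this study):

```python
# Standard strong-scaling metrics used in parallel benchmarks.
# The example timings are hypothetical, not from the study.

def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_cores):
    """Fraction of ideal linear speedup achieved on n_cores (1.0 = perfect)."""
    return speedup(t_serial, t_parallel) / n_cores

# Example: a hypothetical run taking 100 h on 1 core and 2 h on 64 cores
# yields a 50x speedup, i.e. an efficiency of 50/64 ≈ 0.78.
```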