Consider the typical single-program,
multiple-data (SPMD) parallel-programming
model or the bulk-synchronous parallel (BSP) model,
where application data is partitioned
and distributed across the individual
memories or disks of the computation
nodes, and the nodes share data via network
message passing. In turn, the application
code on each node manages the
local, multilevel computation hierarchy—
typically multiple, multithreaded, possibly
heterogeneous cores, and (often) a GPU
accelerator—and coordinates I/O, manages
application checkpointing, and oversees
power budgets and thermal dissipation.
This daunting complexity, together with
the detailed configuration and tuning it demands,
makes developing robust applications
an arcane art accessible only to a dedicated
and capable few.
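
To make the model described above concrete, the following is a minimal sketch of one bulk-synchronous superstep, simulated in a single Python process. It is an illustration under stated assumptions, not how a production code would be written: real applications would use a message-passing library across separate nodes, and the names `partition`, `node_program`, and `superstep` are hypothetical, introduced only for this example. The per-node work (a sum of squares) is likewise an arbitrary stand-in.

```python
# Toy, single-process simulation of one bulk-synchronous superstep.
# Assumption: "nodes" are just list indices and "messages" are list appends;
# nothing here is a real message-passing API.

def partition(data, num_nodes):
    """Split data into num_nodes contiguous, nearly equal blocks."""
    base, extra = divmod(len(data), num_nodes)
    blocks, start = [], 0
    for rank in range(num_nodes):
        size = base + (1 if rank < extra else 0)
        blocks.append(data[start:start + size])
        start += size
    return blocks

def node_program(rank, local_data):
    """The single program that every node runs on its own partition."""
    return sum(x * x for x in local_data)

def superstep(data, num_nodes):
    blocks = partition(data, num_nodes)

    # Phase 1: local computation on each node's private data.
    partials = [node_program(rank, block) for rank, block in enumerate(blocks)]

    # Phase 2: communication. Each node posts its partial result
    # to the inbox of every node (an all-to-all exchange).
    inboxes = [[] for _ in range(num_nodes)]
    for sender, value in enumerate(partials):
        for receiver in range(num_nodes):
            inboxes[receiver].append((sender, value))

    # Phase 3: barrier. In this sequential toy, reaching this line
    # means every message has been delivered.
    return [sum(value for _, value in inbox) for inbox in inboxes]

if __name__ == "__main__":
    print(superstep(list(range(10)), num_nodes=3))
```

Even this toy omits everything the text lists as the hard part: multithreaded and heterogeneous cores within a node, accelerators, I/O, checkpointing, and power management all sit beneath or alongside the three phases shown here.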
Ideally, future software design, development,
and deployment will raise
the abstraction level and keep performance
and correctness in mind from the
outset rather than ex situ.
Beyond more performance-aware design and development
of applications based on integrated
performance and correctness
models, the supporting tools must be integrated
with compilers and runtime systems,
provide more support for heterogeneous
hardware and mixed programming
models, and provide more sophisticated
data processing and analysis.