environment in parallel while subsequent measurements
are taken at the instrument and other instruments are
sending data.
Once the data file is stored within an HPC environment,
the next stage of the workflow is conversion to a data
model suitable for HPC-based analysis, generally the
Hierarchical Data Format version 5 (HDF5).
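A minimal sketch of such a conversion step, written in Python with h5py, is shown below; the file names, array shape, chunk layout, and dataset path are illustrative assumptions rather than part of the workflow described here:

```python
import numpy as np
import h5py

# Illustrative names; a real workflow would derive these from instrument metadata.
RAW_FILE = "scan_0001.raw"
H5_FILE = "scan_0001.h5"

# Assume the raw measurement is a flat float32 stream that reshapes into a
# hyperspectral cube: rows x cols x spectral channels.
raw = np.fromfile(RAW_FILE, dtype=np.float32).reshape(256, 256, 1024)

with h5py.File(H5_FILE, "w") as f:
    dset = f.create_dataset(
        "measurement/main",        # hierarchical path within the file
        data=raw,
        chunks=(64, 64, 1024),     # chunking aligned with per-pixel spectra
        compression="gzip",
    )
    # Attach provenance as attributes for later stages of the workflow.
    dset.attrs["instrument"] = "example-microscope"
    dset.attrs["units"] = "counts"
```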
With the data set now converted and resident on a parallel
file system, the next stage of the workflow, analysis
via scalable methods, can be executed. At this juncture,
an analysis algorithm is selected based on the instrument,
the measurement, the material composition, and other
user-specified criteria. Once selected, the analysis is
executed on an HPC system.
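Scalable execution can take many forms; one common pattern on HPC systems is to partition the converted data set across MPI ranks, compute partial statistics in parallel, and reduce them to a single result. The sketch below uses mpi4py and h5py and assumes the file produced by the conversion step above; the per-block statistic (a summed spectrum) is purely illustrative:

```python
# Run with, e.g.: mpiexec -n 4 python analyze.py
import h5py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

with h5py.File("scan_0001.h5", "r") as f:
    dset = f["measurement/main"]
    n_rows, n_cols, n_chan = dset.shape
    # Partition image rows across ranks; each rank reads one contiguous hyperslab.
    my_rows = np.array_split(np.arange(n_rows), size)[rank]
    block = dset[my_rows[0]:my_rows[-1] + 1, :, :]

# Partial statistic on each rank: the summed spectrum over its block of pixels.
local_sum = block.sum(axis=(0, 1))
total = comm.reduce(local_sum, op=MPI.SUM, root=0)

if rank == 0:
    mean_spectrum = total / (n_rows * n_cols)  # data-set-wide mean spectrum
    print(mean_spectrum[:5])
```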
The resultant data and statistics are then made available
to the user for inspection and further analysis. Initial
experimentation with this concept has shown that analysis
can be completed in seconds, allowing near real-time
feedback from the measurement. Upon completion of the
analysis, the data is organized for possible archival.
Once data movement and analysis are complete, interactive
visual analysis is made available for further inspection
of the data.
Scalable analytics
It is important to note that the difficulties surrounding
scalable analytics in the context of the imaging methods
discussed thus far extend far beyond the need for task-based
and data-based parallelism. In particular, one of
the primary challenges expected to impede further progress
is the application of statistical methods in extremely
high dimensions. Owing to the structure of these analysis
problems in computational settings, the complexity of the
problem space manifests as a high-dimensional analysis
problem, where dimensionality is most often associated
with the number of measurements being considered
simultaneously. The curse of dimensionality is a persistent
phenomenon in modern statistics because we can now measure
at rates and scales unheard of until the modern era [94].
However, there are many strategies to mitigate the
statistical consequences of high dimensionality.
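One common such strategy is linear dimensionality reduction, e.g., projecting the measurements onto a small number of leading principal components before clustering or regression. A minimal NumPy sketch follows; the data matrix and the number of retained components are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 1024))   # 5000 observations in 1024 dimensions

# PCA via the SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 10                              # retain the 10 leading components
scores = Xc @ Vt[:k].T              # low-dimensional representation
explained = (S[:k] ** 2).sum() / (S ** 2).sum()  # fraction of variance kept
```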
While some of the methods noted earlier in this paper
are computationally scalable, in many cases they are not
appropriate for other reasons. For example, although
PCA, ICA, k-means, and backpropagation for neural
networks all fit the Statistical Query Model, and thus
belong to a known class of problems that scale essentially
linearly, this does not necessarily resolve the issues
raised by high-dimensional analysis [95]. In particular,
in high-dimensional spaces nearest neighbors become
nearly equidistant [96]. This is especially problematic
for clustering algorithms but also has significant
consequences for other dimensionality reduction techniques.
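This near-equidistance is straightforward to demonstrate empirically. The sketch below (sample size and dimensions are arbitrary) draws points uniformly from the unit hypercube and tracks the relative contrast between a query point's nearest and farthest neighbors; as the dimension grows, the contrast collapses toward zero, i.e., all points become nearly equidistant from the query:

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000, 10000):
    X = rng.uniform(size=(1000, d))   # 1000 points in the unit hypercube
    q = rng.uniform(size=d)           # a query point
    dists = np.linalg.norm(X - q, axis=1)
    # Relative contrast between farthest and nearest neighbor; this ratio
    # tends toward 0 as d grows, so nearest-neighbor distinctions lose meaning.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:>5}  relative contrast = {contrast:.3f}")
```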