Inexpensive storage and the ability to generate
and collect information have together outpaced any reasonable
expectation of interpreting this information without automatic or at least semi-automatic techniques.
In bioinformatics, e.g. genomics, the volume
and variety of information produced by
high-throughput sequencing of proteins¹
are growing faster than
even Moore’s law. Bioinformatics faces two significant
kinds of challenge: operations and computation.
Operations, loosely defined, is the design
and implementation of information systems that allow
general search; provide grid services via the web
for deploying software and data sets; provide web portals
where scientists focused on various aspects of bioinformatics
can submit and post new findings to public
repositories of bioinformatic information that are
shared throughout the bioinformatics community; and provide
a suitably structured environment for in silico
science, that is, computation. See [14] for an excellent
overview. The other challenge is computation, including
the development of models, algorithms, and data
mining². One of the primary tasks of bioinformaticians
is to make sense of the sequenced proteins, i.e. to determine
their function. A protein’s function, it is now
widely believed, is determined by its three-dimensional
structure, which is, in turn, determined directly by the
linear sequence of amino acids making up the protein.
Crystallization is currently the only means of directly
determining structure, but it is labor intensive and very
difficult. High-throughput techniques have matured to
the point that it is far easier to sequence many thousands
of proteins than to crystallize a few. In
treating proteins as collections of strings, bioinformaticians
realized that proteins whose strings “align” well
would likely share similar 3D structure and, therefore,
function. As an example we present an alignment
of about the last 30 residues of human alpha globulin