The wealth of genomic data and the expanding number of
three-dimensional protein structures have enhanced our understanding
of the evolution of protein structure and function relationships.
Nature’s ability to capitalize on and exploit solutions that
work is highlighted by the current numbers: more than 60 million
protein sequences group into 16,295 protein families that use
1400 different three-dimensional folds. What do these numbers
tell us and how can we use that information?
The limited number of three-dimensional folds is a consequence
of physical chemistry and the energetics of protein folding.
Gene/protein sequence changes are typically deleterious to function
and/or organism survival (Wilson et al., 1977, 1987;
Creighton, 1992). Therefore, the possible value of mutations for
the evolution of new function or organism fitness is balanced by
the thermodynamics of protein folding and stability, as well as biochemical
function. We can use a variety of classification schemes (e.g., SCOP
and CATH) to define protein ‘fold’ families, but there is no algorithm
that can assign specific function from structure alone. Structural
similarity is valuable for helping deduce general function. For
example, whether sequence ‘X’ is a kinase or a dehydrogenase can be
determined, but identifying its specific substrates requires functional
studies (i.e., old-fashioned biochemistry). The variety of protein
families indicates that functional diversity is more complex
than structural variation. Knowing all of the protein folds is insufficient
information to assign function, as certain folds are used in
different sequence families. Moreover, annotation of protein function
based solely on sequence similarity can lead to incorrect
assignments because subtle changes in key residues found in active
sites (or interaction sites) can lead to new specificities.
Comparative structure/function studies and sequence analysis
can provide evolutionary context that defines relevant features of
proteins. Knowledge of three-dimensional structure combined
with sequence information can pinpoint regions required for biochemical
function. For example, the conservation of catalytic residues,
amino acids in cofactor binding sites, or allosteric sites can
reveal mechanistic details of molecular function and lead to the
generation of testable hypotheses about how a given protein operates.
In addition, structural information and bioinformatics can
provide context for focusing on sequence changes in regions
known to influence specificity or mode of action. Understanding
where to look in the vast amount of sequence information may also
help in assessing whether changes in protein sequence and/or structure
are relevant for safety assessments of new commercial products.
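
As a rough illustration of how sequence conservation can point to candidate functional residues, the following minimal sketch scores each column of a multiple sequence alignment by normalized Shannon entropy and reports the most conserved positions. It is not a method taken from the studies cited here: the function names, the 0.9 threshold, and the toy alignment are all hypothetical, and a real analysis would start from a curated alignment and map the conserved positions onto a three-dimensional structure.

```python
from collections import Counter
from math import log2


def column_conservation(alignment):
    """Score each alignment column from 0 (variable) to 1 (invariant).

    `alignment` is a list of equal-length aligned sequences; '-' marks a gap.
    The score is 1 - H / log2(20), where H is the Shannon entropy of the
    residues observed in the column (gaps are ignored).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = log2(20)  # 20 standard amino acids
    scores = []
    for i in range(length):
        residues = [seq[i] for seq in alignment if seq[i] != "-"]
        if not residues:
            # Column contains only gaps: treat as uninformative.
            scores.append(0.0)
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


def conserved_positions(alignment, threshold=0.9):
    """Return 0-based column indices whose score meets the threshold.

    The default threshold is an arbitrary choice for illustration.
    """
    scores = column_conservation(alignment)
    return [i for i, score in enumerate(scores) if score >= threshold]


# Hypothetical toy alignment, invented for illustration only.
toy_alignment = [
    "GKSTLAD",
    "GKSALVD",
    "GKTSLID",
    "GKSTMAD",
]

if __name__ == "__main__":
    print(conserved_positions(toy_alignment))
```

Invariant columns in such an alignment are only candidates: whether a conserved residue is catalytic, binds a cofactor, or simply stabilizes the fold still has to be judged against the structure and tested biochemically.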