Predicting phenotypes (e.g., traits and disease risks) from
biomarkers such as the genome is, in principle, a
supervised machine learning problem. The input is a
stretch of DNA sequence (the genotype) relevant to the
underlying biology, and the outputs are the phenotypes.
This approach [Fig. 2(a)] is not ideal for most complex
phenotypes and diseases, for two reasons. The first is the sheer
complexity of the relationship between a full genotype and
its phenotype. Even within a single cell, the genome
directs the state of the cell through many layers of intricate
and interconnected biophysical processes and control
mechanisms that have been shaped ad hoc by evolution.
Attempting to infer the outcomes of these complex
regulatory processes by observing only genomes and
phenotypes is rather like trying to learn how computer
chess-playing programs work by examining their binary code
and their wins and losses, while ignoring which moves were
made. Second, even if one could infer such models (those
that are predictive of disease risks), it is likely that the
hidden variables of these models would not correspond to
biological mechanisms that can be acted upon. Insight into
disease mechanisms is important for the purpose of