Predicting binding sites
Understanding the principles that govern where and when genes are expressed is essential for deciphering how genome information is turned into the molecular and cellular phenomena that underlie the biology of complex organisms. Gene expression programs are determined through the recognition of specific promoter and enhancer sequences within the DNA by regulatory transcription-factor proteins. Many transcription-factor-binding sites (TFBSs; see the 'Background' box) have been painstakingly elucidated over the years using experimental procedures such as DNase footprinting and electrophoretic mobility shift assays (EMSAs). TFBSs tend to be short, often less than 10 base pairs long, and are thus likely to occur many times within a genome simply by chance. In addition, each transcription factor appears to tolerate a wide range of variations from its consensus sequence, which enlarges the set of sequences that count as a match and makes it extremely difficult to predict binding sites simply by searching a genome sequence for consensus motifs.
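The scale of the chance-occurrence problem can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from any published method: it assumes a random genome in which all four bases are equally frequent and independent, and it models tolerated variation crudely as a fixed number of allowed mismatches. The function names and the 3,000,000,000 bp genome length are hypothetical choices made only for illustration.

```python
from math import comb


def neighbourhood_size(motif_length, max_mismatches):
    """Count the distinct DNA sequences within max_mismatches substitutions of one motif."""
    # Choose which m positions differ, then one of 3 alternative bases at each.
    return sum(comb(motif_length, m) * 3 ** m for m in range(max_mismatches + 1))


def expected_chance_matches(motif_length, genome_length,
                            max_mismatches=0, both_strands=True):
    """Expected number of chance matches to one motif in a random sequence.

    Assumption (illustrative only): bases are independent and uniformly
    distributed, which real genomes are not. Scanning both strands simply
    doubles the count, which overcounts palindromic motifs.
    """
    positions = genome_length - motif_length + 1
    if positions <= 0:
        return 0.0
    strands = 2 if both_strands else 1
    # Probability that a random window falls within the mismatch neighbourhood.
    p_match = neighbourhood_size(motif_length, max_mismatches) * 0.25 ** motif_length
    return strands * positions * p_match


if __name__ == "__main__":
    genome_length = 3_000_000_000  # hypothetical, roughly mammalian-sized genome
    for k in (6, 8, 10):
        for d in (0, 1):
            n = expected_chance_matches(k, genome_length, max_mismatches=d)
            print(f"length {k}, up to {d} mismatch(es): about {n:,.0f} chance matches")
```

Under these simplifying assumptions, even an exact 10 bp motif is expected to match several thousand times by chance in a genome of this size, and allowing a single mismatch multiplies that figure by 31 (the one exact sequence plus 30 single-substitution variants). Real base composition and position-specific tolerance would change the numbers, but not the conclusion that consensus matching alone yields far more candidate sites than functional ones.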