"Characterization of the promoter regions of eukaryotic genes remains one of the most elusive problems in computational genome analysis," says Roderic Guigó (Institut Municipal d'Investigació Mèdica, Barcelona, Spain). To address these challenges, bioinformaticians have developed approaches using position weight matrices (PWMs) that take into account the observed frequency of tolerated sequence variations at each nucleotide position within a consensus TFBS and give a quantitative score that reflects the actual binding specificity of the factor. Extensive investigation of transcriptional regulation has provided insights into how gene expression is finely regulated by the sequence and distribution of multiple TFBSs within cis-regulatory regions upstream of each gene. Combinations of TFBSs for different factors can form cis-regulatory modules, with complex functional synergy, that drive the transcriptional machinery.
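To make the PWM idea concrete, the following is a minimal sketch of how per-position nucleotide frequencies from a set of aligned binding sites can be turned into a scoring matrix. The aligned sites, the pseudocount of 1, the uniform background of 0.25 per base, and the function names `build_pwm` and `score_site` are all illustrative assumptions, not the matrices or the method used by the groups described here.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites.

    The pseudocount and uniform background are assumptions made for this sketch.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")

    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        weights = {}
        for base in BASES:
            # Observed frequency of this base at this position, smoothed
            # so that unseen bases do not produce log(0).
            freq = (column.count(base) + pseudocount) / total
            weights[base] = math.log2(freq / background)
        pwm.append(weights)
    return pwm

def score_site(pwm, site):
    """Sum the per-position weights for a candidate site of matching length."""
    return sum(column[base] for column, base in zip(pwm, site))

# Hypothetical aligned binding sites, invented for illustration only.
toy_sites = ["TATAAA", "TATAAT", "TATATA", "TACAAA"]
pwm = build_pwm(toy_sites)

consensus_score = score_site(pwm, "TATAAA")
unrelated_score = score_site(pwm, "GGGCCC")
```

A candidate that matches the frequently observed bases scores higher than an unrelated sequence, and a position where several bases are tolerated penalizes a mismatch less than a strictly conserved position does.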
The first thing that Wyeth Wasserman's group did was build a library of high-quality PWMs. The quality of these matrices is critical for accurate site prediction. The best way to build a PWM is to plunge into the published literature and pull out relevant information from papers describing in vitro and in vivo experiments on individual transcription factors. "The collection of binding profiles, collectively termed the JASPAR database, was produced by the pure determination of Albin Sandelin for his thesis project studying the binding similarities of transcription factors in the same structural families," says Wasserman. (See the 'Behind the scenes' box for further discussion of the motivation for the work.) The team constructed over a hundred binding-profile matrices for different transcription factors. Any DNA sequence can be screened using these matrices to locate potential TFBSs. A certain number of potential sites will be identified just by chance, however, and finding a potential site doesn't guarantee that the cognate factor actually binds there or that the site is of biological relevance.
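Screening a sequence with such a matrix amounts to sliding a window along the DNA and scoring each position. The sketch below illustrates this under stated assumptions: `TOY_PWM` is an invented four-position log-odds matrix (not a JASPAR profile), the threshold of 4.0 is arbitrary, and the function name `scan` is hypothetical. Both strands are checked by scoring the reverse complement of each window.

```python
# Hypothetical 4-position log-odds matrix favoring the motif TGAC.
TOY_PWM = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.3},
    {"A": -1.0, "C": -1.0, "G": 1.3, "T": -1.0},
    {"A": 1.3, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.3, "G": -1.0, "T": -1.0},
]

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def scan(sequence, pwm, threshold):
    """Return (start, strand, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Score the window as read on the forward strand and on the reverse strand.
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = sum(column[base] for column, base in zip(pwm, site))
            if score >= threshold:
                hits.append((start, strand, round(score, 2)))
    return hits

# Invented test sequence containing the motif once on each strand.
hits = scan("AATGACGGGTCATT", TOY_PWM, threshold=4.0)
```

Lowering the threshold returns more candidate sites, including chance matches. The score alone says nothing about whether the factor binds there in vivo, which is the limitation noted above.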