When the draft of the human genome sequence was published in 2001, David Baltimore wrote the following in an accompanying commentary [2]: "Gene-regulatory sequences are now there for all to see, but initial attempts to find them were also disappointing. This is where the genomic sequences of other species - in which the regulatory sequences, but not the functionally insignificant DNA, are likely to be much the same - will open up a cornucopia". This is the basis of the method of 'phylogenetic footprinting'. The idea is that important regulatory modules are under selective pressure during evolution and that comparing two (or more) genomes will identify the conserved sequences that are most likely to be biologically relevant [3]. "Having multiple orthologous genes available provides a tremendous amount of information about what the most important features of the sequences are. It is the most valuable of 'sequence only' data," says computational biologist Gary Stormo (Washington University School of Medicine, St Louis, USA). Guigó adds "in fact, we can say that without the genomes of other species, it will be impossible to fully understand the human genome."
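As a rough illustration of the idea, conservation between two already-aligned orthologous sequences can be scored in sliding windows. The sketch below is a minimal, assumed implementation and not the method used by any tool named in this article: the function name `conserved_windows`, the window size, and the identity cutoff are all hypothetical choices, and the input is taken to be a pairwise alignment with `-` marking gaps.

```python
def conserved_windows(aligned_a, aligned_b, window=10, min_identity=0.8):
    """Return (start, end, identity) for each alignment window whose
    fraction of identical, non-gap columns is at least min_identity.

    Hypothetical helper for illustration; both inputs are rows of the
    same pairwise alignment, so they must have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    hits = []
    for start in range(len(aligned_a) - window + 1):
        end = start + window
        # Count columns where both species carry the same base (gaps never match).
        matches = sum(
            1
            for x, y in zip(aligned_a[start:end], aligned_b[start:end])
            if x == y and x != "-"
        )
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, end, identity))
    return hits


# Toy alignment (invented data): the first 10 columns are identical,
# the last 10 have diverged completely.
human = "ACGTACGTAC" + "TTTTTTTTTT"
mouse = "ACGTACGTAC" + "GGGGGGGGGG"
fully_conserved = conserved_windows(human, mouse, window=10, min_identity=1.0)
```

Windows that pass the cutoff are candidate regulatory regions; real analyses work on much longer alignments, and cutoffs are typically tuned per locus.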
With the JASPAR database assembled, the second feature of the Wasserman team's approach was to create tools for aligning long stretches of genomic DNA. "The alignment algorithm by Luis Mendoza (originally called DPB and now re-engineered and named ORCA) is part of a bioinformatics system termed OrthoSeq that is undergoing final revisions," says Wasserman. Phylogenetic footprinting approaches have proved powerful in previous studies of particular genomic loci but have rarely been applied on a genome-wide scale [4-7].
The final challenge was to combine the genome-alignment tools with the PWMs to create a system that was easy to use. "The third component, the computer methods, were the focus of a project by Boris Lenhard to create a suite of computer programming resources for researchers engaged in the study of regulatory sequences. This system, the TFBS Perl module, has been available for about a year and is already being broadly used in the field," says Wasserman.
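To make the role of the PWMs concrete, the following sketch shows one common way a position weight matrix can be built from base counts and used to scan a sequence for candidate binding sites. It is an assumption-laden illustration in Python, not the TFBS Perl module's interface: the function names `pwm_from_counts` and `scan`, the pseudocount, the uniform background of 0.25, the threshold, and the count matrix itself are all invented for the example and are not taken from JASPAR.

```python
import math

BASES = "ACGT"


def pwm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds weights.

    counts maps each base to a list of counts, one entry per motif position.
    A pseudocount keeps unseen bases from producing log(0).
    """
    width = len(counts["A"])
    pwm = []
    for i in range(width):
        total = sum(counts[b][i] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((counts[b][i] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, site, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        # Skip windows containing gaps or ambiguous bases.
        if any(base not in BASES for base in site):
            continue
        score = sum(pwm[i][base] for i, base in enumerate(site))
        if score >= threshold:
            hits.append((start, site, score))
    return hits


# Invented counts for a TATA-like motif (illustrative only).
toy_counts = {
    "A": [0, 10, 0, 10],
    "C": [0, 0, 0, 0],
    "G": [0, 0, 0, 0],
    "T": [10, 0, 10, 0],
}
toy_pwm = pwm_from_counts(toy_counts)
hits = scan("GGTATAGG", toy_pwm, threshold=5.0)
```

In a phylogenetic footprinting workflow, hits like these would then be kept only where they fall in regions conserved between the aligned species, which is what cuts down the many spurious matches a matrix produces on its own.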
When these three elements were combined, ConSite was born [8]. The authors are eager for their tools to be widely used and have done their best to make them accessible and user-friendly. "This collection is a resource for the global bioinformatics community," says Wasserman. "As opposed to commercial databases of transcription-factor information, we make our data available without restriction to academic research groups. Consistent with the philosophy of Journal of Biology and the Public Library of Science [9], we believe in open data access."