We present an algorithm that uses phylogenetic footprinting to identify potential TFBSs. The approach to identifying regulatory elements presented here yields greater specificity than previous approaches that were based purely on profile searches of single genomic sequences. In short, using phylogenetic footprinting to filter the computational predictions significantly reduces noise at the price of a slight decrease in sensitivity. The web application we present enables researchers to utilize this approach in a straightforward manner. With the culmination of the human and mouse genome sequencing efforts [34,35], we believe this new algorithm will be of significant use in the ongoing efforts to ascribe function to non-coding sequences.
Materials and methods
Genomic sequence alignment
As a result of the low overall similarity of non-coding regions across moderate evolutionary distances (for example, between human and mouse), many alignment algorithms will fail to produce biologically meaningful alignments or will require an arduous process to tune the algorithm parameters. In order to obtain high-quality global alignments, we utilized the DPB algorithm (L.M. and W.W., unpublished; see [23]), which is optimized for the global alignment of long genomic sequences containing short, colinear segments of similarity.
Measurement of local similarity in global alignments
The most common approach used to measure local similarity between two globally aligned orthologous sequences utilizes a fixed-size sliding window to scan an alignment and identify segments containing a minimum number of identical nucleotides. The difficulties that arise with sliding-window approaches are related to the treatment of edges and gaps in the alignment. Sliding a window along the alignment itself will assign a low identity score to short regions of high identity flanked by long regions of greater variation (for example, a large gap or insertion in one of the sequences). We elected to collapse the gaps in the alignment (that is, to remove the positions containing gaps in the sequence in question) and to calculate a separate conservation profile for each orthologous sequence.
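The gap-collapsing step can be illustrated with a minimal sketch. It is not the ConSite implementation: the function name `conservation_profile`, the default window size, and the use of "-" as the gap character are assumptions made for illustration only. The sketch drops every alignment column in which the reference sequence has a gap and then slides a fixed-size window over the remaining columns, reporting percent identity per window.

```python
def conservation_profile(ref, other, window=50):
    """Percent identity per window, in ungapped coordinates of `ref`.

    `ref` and `other` are the two rows of a pairwise global alignment,
    with "-" marking gaps (an assumption of this sketch).
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")

    # Collapse gaps: keep only columns where the reference has a nucleotide.
    # A column where `other` has a gap counts as a mismatch.
    matches = [
        1 if r.upper() == o.upper() else 0
        for r, o in zip(ref, other)
        if r != "-"
    ]
    if len(matches) < window:
        return []

    # Running sum so each window is updated in constant time.
    running = sum(matches[:window])
    profile = [100.0 * running / window]
    for i in range(window, len(matches)):
        running += matches[i] - matches[i - window]
        profile.append(100.0 * running / window)
    return profile


# One profile per orthologous sequence, as described in the text.
seq_a = "ACGT--ACGT"
seq_b = "ACGTTTACGA"
profile_a = conservation_profile(seq_a, seq_b, window=4)
profile_b = conservation_profile(seq_b, seq_a, window=4)
```

Calling the function twice with the roles of the two sequences swapped gives the two separate profiles; the insertion in the second sequence lowers only that sequence's own profile.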
Classification of motif-match conservation within aligned genomic sequences
Within the conserved segments, conserved sites are detected by, firstly, scanning each of the two orthologous sequences with position-specific weight matrices [1] for the TFs of interest, and secondly, retaining only those predicted sites (for each given TF model) that are in equivalent positions in the alignment. The scores for matches to the position-specific weight matrix models must exceed the user-defined relative matrix score threshold.
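The second filtering step can be sketched as follows. This is a hedged illustration, not the published code: the helper names `column_map` and `conserved_sites`, the hit format `(tf_name, start, strand)`, and the interpretation of "equivalent positions" as identical start columns on the same strand are all assumptions. The hits are presumed to have already passed the relative matrix score threshold.

```python
def column_map(aligned):
    """Map each ungapped sequence position to its alignment column."""
    return [col for col, ch in enumerate(aligned) if ch != "-"]


def conserved_sites(aln1, aln2, hits1, hits2):
    """Keep hits in sequence 1 that have a same-TF hit in sequence 2
    starting at the same alignment column.

    Each hit is a (tf_name, start, strand) tuple, with `start` given as a
    0-based position in the ungapped sequence (hypothetical format).
    """
    cols1, cols2 = column_map(aln1), column_map(aln2)

    # Index the hits of sequence 2 by TF, alignment column and strand.
    in_seq2 = {(tf, cols2[start], strand) for tf, start, strand in hits2}

    return [
        (tf, start, strand)
        for tf, start, strand in hits1
        if (tf, cols1[start], strand) in in_seq2
    ]


# Toy alignment and hypothetical hits for two placeholder TF models.
aln1 = "ACGT--ACGT"
aln2 = "AC--TTACGA"
hits1 = [("TF_A", 4, "+"), ("TF_A", 2, "+")]
hits2 = [("TF_A", 4, "+"), ("TF_B", 2, "+")]
kept = conserved_sites(aln1, aln2, hits1, hits2)
```

In this toy case only the first hit is retained: it maps to the same alignment column in both sequences, whereas the second hit has no counterpart for the same TF model in the other sequence.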
Collection and annotation of binding models
All profiles are derived from published collections of experimentally defined TFBSs for multicellular eukaryotes. The database, named JASPAR [15], represents a curated collection of target sequences. The motif-detection program ANN-Spec [36] was used to align each binding site set. The ANN-Spec alignments were performed with a range of motif widths, using three random seeds and 80,000 iterations. The profile matrices and associated information are stored in a relational database (MySQL); a flat file representation of the data is available for academic use [22]. Users may also submit their own profiles for private use within the ConSite system.
Identification of relative matrix score thresholds
Candidate TFBSs in individual sequences have a score as determined by the position weight matrix for the given sequence, which has been reviewed elsewhere [1]. The score ranges are unique for each binding model, so it is advantageous to convert the score range to a common, relative unit scale as given by