those generated with the same human genes filtered by comparative analysis with orthologous mouse gene sequences (Table 1). The sequence pairs ranged in length between 680 and 2,900 base-pairs (bp), but all included the region -500 to +100 relative to the transcription start site. Within the 14 paired sequences are 40 experimentally defined TFBSs (Table 1) for 13 distinct TFs within the set of available matrices. For clarity, these binding sites were not utilized in the construction of the matrix models. A conservation cutoff was set to 70% for all tests, while the window size for conservation analysis was set to 50 bp.
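The conservation settings above (70% cutoff, 50 bp window) can be illustrated with a minimal sketch of a sliding-window identity filter. It is an illustration under stated assumptions, not the implementation used in this study: the function names are hypothetical, the window is assumed to slide one alignment column at a time, gapped columns are assumed to count as mismatches, and the cutoff is assumed to be inclusive.

```python
from typing import List


def window_identity(aligned_a: str, aligned_b: str, window: int = 50) -> List[float]:
    """Percent identity for every full window of alignment columns.

    Assumption: a column counts as identical only if both sequences carry
    the same base and neither has a gap ('-').
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [
        int(a == b and a != "-")
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    ]
    return [
        100.0 * sum(matches[i:i + window]) / window
        for i in range(len(matches) - window + 1)
    ]


def conserved_windows(aligned_a: str, aligned_b: str,
                      window: int = 50, cutoff: float = 70.0) -> List[int]:
    """Start columns of windows whose identity meets the cutoff (assumed inclusive)."""
    return [
        start
        for start, pid in enumerate(window_identity(aligned_a, aligned_b, window))
        if pid >= cutoff
    ]


# Toy alignment (not real data): identical first half, fully divergent second half.
human = "A" * 100
mouse = "A" * 50 + "C" * 50
print(conserved_windows(human, mouse))
```

In this toy alignment, only windows that lie mostly within the identical half pass the 70% cutoff.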
Selectivity
Insufficient experimental data are available to confidently classify predictions as false, because many functional sites remain to be discovered. As the population of true TFBSs within a genomic sequence is anticipated to be small, we define the false-positive rate as the total number of predictions from all models divided by the length of the query sequence. The number of predicted TFBSs was determined for incrementally increasing relative matrix score thresholds (described in the Materials and methods section) between 65% and 90% for both single sequences and the corresponding orthologous pairs:
$$\mathrm{Sel}(c) = \frac{\sum_{m \in M} P_{m,c}}{L}$$
where $M$ is the set of 108 models, $P_{m,c}$ is the number of predicted sites using model $m$ and relative matrix score threshold $c$, and $L$ is the length of the analyzed sequence in base-pairs (Figure 2a).
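The selectivity measure defined above can be computed directly from per-model prediction counts. The following is a minimal sketch; the function name, the model labels, and the counts are hypothetical and serve only to illustrate the arithmetic.

```python
from typing import Mapping


def selectivity(predictions_per_model: Mapping[str, int], sequence_length: int) -> float:
    """Sel(c): predicted sites summed over all models, divided by sequence length in bp.

    `predictions_per_model` maps each model m to P_{m,c}, the number of sites
    it predicts at one relative matrix score threshold c.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return sum(predictions_per_model.values()) / sequence_length


# Hypothetical counts for three models on a 1,200 bp sequence (illustrative only).
counts = {"model_A": 4, "model_B": 0, "model_C": 2}
sel = selectivity(counts, 1200)
predictions_per_100_bp = 100 * sel
print(sel, predictions_per_100_bp)
```

Multiplying Sel(c) by 100 gives the number of predictions per 100 bp, the unit used when reporting selectivity.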
Predictive selectivity (measured by the average number of predicted TFBSs per 100 bp of promoter sequence when scanning with all models) improved by 85% when phylogenetic footprinting was applied (average ratio: 0.15; that is, on average the footprinting analysis returned 15% as many predictions per 100 bp as single-sequence analysis). The ratios of the selectivity scores observed with phylogenetic footprinting to those obtained with single-sequence analysis are shown in Figure 2c.
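The relation between the average ratio and the reported improvement can be made explicit with a short sketch. The per-sequence selectivity values below are invented for illustration and are not the study's data; averaging per-sequence ratios (rather than pooling counts) is an assumption about how the average was formed.

```python
from statistics import mean
from typing import Sequence


def average_ratio(footprinting: Sequence[float], single: Sequence[float]) -> float:
    """Mean of per-sequence ratios: footprinting selectivity / single-sequence selectivity."""
    if len(footprinting) != len(single):
        raise ValueError("both inputs need one value per sequence")
    return mean(f / s for f, s in zip(footprinting, single))


def percent_improvement(ratio: float) -> float:
    """Reduction in predictions per bp implied by a ratio, e.g. 0.15 gives about 85%."""
    return 100.0 * (1.0 - ratio)


# Hypothetical predictions per 100 bp for three sequence pairs (illustrative only).
single_sequence = [4.0, 5.0, 2.0]
with_footprinting = [0.4, 1.0, 0.3]

ratio = average_ratio(with_footprinting, single_sequence)
print(round(ratio, 3), round(percent_improvement(ratio), 1))
```

A ratio below 1 means fewer predictions per base-pair under the conservation requirement; the improvement is the complement of that ratio.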
Sensitivity
Sensitivity measures the ability to correctly detect known sites, given a corresponding transcription-factor binding-profile model. A known site is counted as detected when a prediction and an annotated TFBS overlap by at least 50% of the width of the narrowest pattern. Analyses were performed with incrementally increasing relative matrix score thresholds between 65% and 90%. The overall sensitivity (the fraction of known sites detected) was reduced slightly under the conservation requirement: 65.5% of sites were detected with phylogenetic footprinting (settings of 75% relative matrix score threshold, 70% identity cutoff, 50 bp window), compared with 72.5% when analyzing single sequences (Figure 2b). The failure to detect a few sites under the stringent requirements for both regional sequence conservation and specific-site conservation can be attributed to multiple causes. For instance, a TFBS may not be conserved, or it may be present but not detected by the profile at the chosen thresholds. We conclude that most