We introduce a general probabilistic model of the gene structure of
human genomic sequences which incorporates descriptions of the basic
transcriptional, translational and splicing signals, as well as length distributions
and compositional features of exons, introns and intergenic
regions. Distinct sets of model parameters are derived to account for the
many substantial differences in gene density and structure observed in
distinct C + G compositional regions of the human genome. In addition,
new models of the donor and acceptor splice signals are described which
capture potentially important dependencies between signal positions. The
model is applied to the problem of gene identification in a computer program,
GENSCAN, which identifies complete exon/intron structures of
genes in genomic DNA. Novel features of the program include the capacity
to predict multiple genes in a sequence, to deal with partial as
well as complete genes, and to predict consistent sets of genes occurring
on either or both DNA strands. GENSCAN is shown to have substantially
higher accuracy than existing methods when tested on standardized
sets of human and vertebrate genes, with 75 to 80% of exons identified
exactly. The program can also indicate, fairly accurately, the reliability
of each predicted exon. Consistently high levels of accuracy are
observed for sequences of differing C + G content and for distinct groups
of vertebrates.
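
As an illustration of what "exons identified exactly" could mean in practice, the following minimal sketch counts a predicted exon as correct only when both of its boundaries and its strand match an annotated exon. The function name, the tuple layout (start, end, strand) and the toy coordinates are assumptions made for this example; they are not taken from GENSCAN or from the test sets mentioned above.

```python
def exact_exon_accuracy(annotated, predicted):
    """Exon-level accuracy under an exact-match criterion (illustrative only).

    Each exon is a hypothetical (start, end, strand) tuple. A predicted exon
    counts as correct only if an annotated exon has identical coordinates
    and strand.
    Returns (sensitivity, precision):
      sensitivity = exactly matched exons / annotated exons
      precision   = exactly matched exons / predicted exons
    """
    annotated = set(annotated)
    predicted = set(predicted)
    exact = annotated & predicted  # exons with both boundaries correct
    sensitivity = len(exact) / len(annotated) if annotated else 0.0
    precision = len(exact) / len(predicted) if predicted else 0.0
    return sensitivity, precision


# Toy data (invented for illustration): one prediction misses a boundary
# by 10 bp and one prediction has no annotated counterpart.
annotated = [(100, 250, "+"), (400, 520, "+"), (800, 950, "+"), (1200, 1300, "+")]
predicted = [(100, 250, "+"), (400, 520, "+"), (810, 950, "+"),
             (1200, 1300, "+"), (1500, 1600, "+")]

sensitivity, precision = exact_exon_accuracy(annotated, predicted)
print(f"sensitivity={sensitivity:.2f} precision={precision:.2f}")
```

Under this criterion a prediction that overlaps a true exon but misses one boundary earns no credit, which is why exact-match figures are a stricter measure than nucleotide-level overlap.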