The NCBI RefSeq pipeline uses a combination of homology
searching and ab initio modeling. First, cDNAs and ESTs were
aligned to the genomic sequences using Splign [98] and proteins
were aligned to the genomic sequences using ProSplign [99].
The best scoring coding sequence was identified for all cDNA
alignments using the same scoring system used by Gnomon
[100], the NCBI ab initio prediction tool. All cDNAs with a
coding sequence scoring above a certain threshold were marked
as coding cDNAs, and all others were marked as UTRs. Coding
sequences that lack a translation initiation or termination signal
were categorized as incomplete. Protein alignments were scored
the same way, and coding sequences that did not satisfy the
threshold criterion for a valid coding sequence were removed.
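
The classification rule described above can be sketched as follows. This is only an illustrative sketch, not the pipeline's actual code: the `CdsCandidate` type, the `classify_cdna` function, the label strings, and the numeric threshold are all hypothetical, since the text does not give the scoring details or the threshold value.

```python
from dataclasses import dataclass


@dataclass
class CdsCandidate:
    """Best-scoring coding sequence found in one cDNA alignment (hypothetical type)."""
    score: float     # stand-in for a Gnomon-style coding score
    has_start: bool  # translation initiation signal present
    has_stop: bool   # translation termination signal present


def classify_cdna(best: CdsCandidate, threshold: float) -> str:
    """Label an alignment as UTR, complete coding, or incomplete coding."""
    # Alignments whose best CDS does not score above the threshold are UTRs.
    if best.score <= threshold:
        return "UTR"
    # A coding sequence missing either signal is categorized as incomplete.
    if best.has_start and best.has_stop:
        return "coding_complete"
    return "coding_incomplete"


# The threshold value here is arbitrary and chosen only for illustration.
print(classify_cdna(CdsCandidate(score=5.0, has_start=True, has_stop=False), 3.0))
```
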
After determining the UTR/CDS nature of each alignment, the
alignments were assembled using a modification of the Maximal
Transcript Alignment algorithm [101], accounting for not only
exon-intron structure compatibility but also the compatibility of
the reading frames. Two coding alignments were connected
only if they both had open and compatible coding sequences.
UTRs were connected to coding alignments only if the
necessary translation initiation or termination signals were
present. There were no restrictions on the connection of UTRs
other than the exon-intron structure compatibility. All assembled models with a complete coding sequence, including the
translation initiation and termination signals, were combined
into alternatively spliced isoform groups. Incomplete or partially
supported models were directed to Gnomon [100] for extension
by ab initio prediction. Models containing a debilitating mutation
such as a frameshift or nonsense mutation were categorized as
either transcribed or non-transcribed pseudogenes. A subset of
pseudogenes are likely to be functional genes that have errors in
the Acyr_1.0 assembly and may be reclassified as protein-coding
genes with subsequent improvements to the assembly and
annotation. Gnomon [3] was also used to predict pure ab initio
models in regions of the genome that lacked any cDNA, EST, or
protein alignments.
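
The rules for connecting alignments during assembly can be summarized in a small sketch. It is an interpretation under stated assumptions, not the modified Maximal Transcript Alignment algorithm itself: the `Alignment` type and `can_connect` function are hypothetical, exon-intron and reading-frame compatibility are passed in as precomputed booleans, and an "open" coding sequence is assumed to mean one that lacks the signal at the end being joined.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One scored alignment, already labeled coding or UTR (hypothetical type)."""
    is_coding: bool
    has_start: bool = False  # translation initiation signal present
    has_stop: bool = False   # translation termination signal present


def can_connect(up: Alignment, down: Alignment,
                introns_compatible: bool,
                frames_compatible: bool = False) -> bool:
    """Decide whether an upstream and a downstream alignment may be joined."""
    # Exon-intron structure compatibility is required in every case.
    if not introns_compatible:
        return False
    if up.is_coding and down.is_coding:
        # Assumption: "open" means the upstream CDS has no stop signal and
        # the downstream CDS has no start signal at the junction.
        both_open = not up.has_stop and not down.has_start
        return both_open and frames_compatible
    if up.is_coding:
        # Coding alignment followed by a 3' UTR needs a termination signal.
        return up.has_stop
    if down.is_coding:
        # 5' UTR followed by a coding alignment needs an initiation signal.
        return down.has_start
    # UTR to UTR: no restriction beyond exon-intron compatibility.
    return True
```

For example, under these assumptions `can_connect(Alignment(False), Alignment(True, has_start=True), introns_compatible=True)` allows a 5' UTR to be attached, whereas the same call with `has_start=False` does not.
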