of gene expression (CAGE) [5], systematic empirical annotation of a
set of transcript products by 5′ rapid amplification of cDNA ends
(RACE) and high-density resolution tiling arrays [6]. However, these
methods are experimentally labor-intensive and have not been applied
as widely as the standard expressed sequence tag (EST)
approach for fast characterization of cDNAs [7,8].
We previously used individual EST-based gene model refinement
by classic in silico sequence analysis to revise the mRNA sequence
of 109 human chromosome 21 protein-coding genes [1]. The success
of this approach encouraged us to develop software
(“5′_ORF_Extender”) to automate the steps that
were previously performed manually, and to apply it to the Danio rerio
(zebrafish) genome [9].
The aim of this work was to perform a systematic identification of
coding regions at the 5′ end of all known human mRNAs. However, it
proved difficult to simply transfer the method used for D. rerio to
Homo sapiens, owing to the much larger size and complexity of the RNA and
EST sequence databases, as well as of the BLAST (Basic Local Alignment
Search Tool) sequence analysis results file. To overcome these
problems, a fully revised computational biology strategy was adopted,
which allowed the task to be completed for human mRNAs. We have
thus been able to compile a database containing 477 loci, out of a total
of 18,665 investigated (2.6%), where an extension of the RNA 5′ coding
region has been identified. Proof-of-concept confirmation has been
obtained by actual in vitro cloning and sequencing for the GNB2L1, QARS
and TDP2 genes. The availability of the database with the results of the
whole analysis should further help to reduce the incidence of 5′ end
mRNA artifacts when studying human gene structure and function in
biomedical research.