Book of Abstracts: Albany 2003

Conversation 13
June 17-21 2003

Accurate Exon-intron Sequence Classification by Neural Networks using 3-base Frame-dependent Nucleotide Frequencies to Represent Sequences

Computational methods for automated gene recognition are important for making sense of the voluminous genomic sequence data presently being generated. A predictor of the coding potential of a given window of DNA sequence is the foundation of all "ab initio" gene-finding methods, which search for content and signal characterizations of coding sequences. A majority of these methods use frame-dependent hexamer (dicodon) frequencies as the statistical measure in the search-by-content approach.

We investigated frame-dependent nucleotide, dinucleotide, and trinucleotide frequency representations of exon and intron sequence classes as content measures for their classification (1). Using feed-forward neural networks and support vector machines as classifiers on these content measures, we showed that the 3-base frame-dependent nucleotide, dinucleotide, and trinucleotide frequencies (but not those of other base frames) yielded good classification results for various sequence lengths, with up to 92% accuracy, on three standard data sets (2-4). These results are better than the reported classification results obtained with the dicodon measure. The success of the 3-base frame-dependent frequencies in differentiating coding from noncoding regions can be explained by the triplet codon constraints on genomic sequences. The 3-base frame frequency measurements we describe here reduce the size of the coding measure vector by over an order of magnitude relative to the hexamer measurement used in many gene-finding programs. Thus, the 3-base frame representation makes it easier to apply a range of classification algorithms and reduces the amount of training data required for coding/noncoding classification.
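As an illustration of the representation described above, the following Python sketch counts k-mers separately for each of the three frame positions and concatenates the normalized counts into one feature vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `frame_kmer_frequencies` and `vector_size`, the per-frame normalization, the skipping of ambiguous bases, and the ordering of the vector are all choices made here for illustration, since the abstract does not specify them.

```python
from itertools import product

BASES = "ACGT"


def frame_kmer_frequencies(seq, k=1, period=3):
    """Frame-dependent k-mer frequencies as one flat feature vector.

    Assumption (not from the abstract): a k-mer starting at index i is
    assigned to frame i % period, and counts are normalized within each
    frame so that every non-empty frame sums to 1.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    index = {kmer: j for j, kmer in enumerate(kmers)}
    counts = [[0] * len(kmers) for _ in range(period)]

    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing ambiguous bases such as N
            counts[i % period][j] += 1

    features = []
    for frame_counts in counts:
        total = sum(frame_counts)
        features.extend(c / total if total else 0.0 for c in frame_counts)
    return features


def vector_size(k, period=3):
    """Length of the feature vector: one block of 4**k entries per frame."""
    return period * 4 ** k


if __name__ == "__main__":
    # Toy sequence of three ATG codons: each frame holds a single base.
    print(frame_kmer_frequencies("ATGATGATG", k=1))
    # Vector sizes for nucleotides, dinucleotides, trinucleotides, hexamers.
    print([vector_size(k) for k in (1, 2, 3, 6)])  # [12, 48, 192, 12288]
```

Under these assumptions the nucleotide, dinucleotide, and trinucleotide vectors have 12, 48, and 192 entries, whereas a hexamer table has 4^6 = 4096 entries per frame, which is consistent with the reduction of over an order of magnitude mentioned above.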

The authors acknowledge a CFCI/RA grant from the University of Massachusetts Lowell.

Dang D. Long1
Patrick Hoffman1,2
Ivo Grosse3
Kenneth A. Marx1,*
Georges Grinstein4

1Center for Intelligent Biomaterials
Dept. of Chemistry
University of Massachusetts Lowell
Lowell, MA 01854, USA
2Anvil Inc.
Burlington, MA 01803, USA
3Cold Spring Harbor Laboratory
Cold Spring Harbor, NY 11724, USA
4Institute for Visualization and Perception Research
Dept. of Computer Science
University of Massachusetts Lowell
Lowell, MA 01854, USA
*Phone/Fax: 978-934-3658/3013

References and Footnotes
  1. Hoffman, P., Grinstein, G., Marx, K., Grosse, I., Stanley, E. Proc. IEEE Visualization 97, 437-441 (1997).
  2. Fickett, J. W., Tung, C. S. Nucleic Acids Res. 20, 6441-6450 (1992).
  3. Burset, M., Guigo, R. Genomics 34, 353-367 (1996).
  4. Reese, M. G., Kulp, D., Tammana, H., Haussler, D. Genome Res. 10, 529-538 (2000).