Book of Abstracts: Albany 2003
June 17-21 2003
Accurate Exon-intron Sequence Classification by Neural Networks using 3-base Frame-dependent Nucleotide Frequencies to Represent Sequences
Computational methods for automated gene recognition are important for making sense of the voluminous genomic sequence data presently being generated. A predictor of the coding potential of a given window of DNA sequence is the foundation of all ab initio gene finding methods, which search for content and signal characterizations of coding sequences. A majority of these methods use frame-dependent hexamer (dicodon) frequencies as the statistical measure in the search-by-content approach.
We investigated frame-dependent nucleotide, dinucleotide, and trinucleotide frequency representations of exon and intron sequence classes as content measures for their classification (1). Using feed-forward neural networks and support vector machines as classifiers on these content measures, we showed that the 3-base frame-dependent nucleotide, dinucleotide, and trinucleotide frequencies (but not those of other base frames) yielded good classification results for various sequence lengths, with up to 92% accuracy, on three standard data sets (2-4). These results are better than the reported classification results obtained with the dicodon measure. The success of the 3-base frame-dependent frequencies in differentiating coding from noncoding regions can be explained by the triplet codon constraints on genomic sequences. The 3-base frame frequency measurements we describe here reduce the size of the coding measure vector by over an order of magnitude relative to the hexamer measurement used in many gene finding programs. Thus, the 3-base frame representation makes it easier to apply a range of classification algorithms and reduces the amount of training data required for coding/noncoding classification.
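As an illustration of the representation described above, the following is a minimal sketch of how 3-base frame-dependent k-mer frequencies might be computed for a sequence window. It is not the authors' implementation: the function name `frame_frequencies`, the per-frame normalization, the overlapping k-mer counting, and the handling of ambiguous symbols are all assumptions made for this example.

```python
from itertools import product

BASES = "ACGT"


def frame_frequencies(seq, k=1, period=3):
    """Return frame-dependent k-mer frequencies as a flat feature list.

    Hypothetical helper: a k-mer starting at position i is assigned to
    frame i % period, and counts are normalized separately within each
    frame (an assumed convention, not taken from the abstract).
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    index = {kmer: j for j, kmer in enumerate(kmers)}

    # One row of k-mer counts per frame.
    counts = [[0] * len(kmers) for _ in range(period)]
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing symbols such as N
            counts[i % period][j] += 1

    # Concatenate the per-frame frequency vectors.
    features = []
    for row in counts:
        total = sum(row)
        features.extend(c / total if total else 0.0 for c in row)
    return features


# Mononucleotide example: each frame of "ATGATGATG" holds a single base.
vector = frame_frequencies("ATGATGATG", k=1)
```

Under these assumptions the feature vector has 3 x 4 = 12 entries for nucleotides, 3 x 16 = 48 for dinucleotides, and 3 x 64 = 192 for trinucleotides, whereas a frame-dependent hexamer vector built the same way would have 3 x 4096 = 12288 entries. Such a vector could then be passed to any standard classifier.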
The authors acknowledge a CFCI/RA grant from the University of Massachusetts Lowell.
Dang D. Long1
1Center for Intelligent Biomaterials