Book of Abstracts: Albany 2005
Multiple Open Reading Frames and Codon Bias in All Genomes
Examination of over 1.08 million genes in the SWISS-PROT/TrEMBL data bank revealed that 15% of them have one or more "stop"-free frames in addition to the putative coding frame. By far the most prevalent multiple open reading frame (MORF) is the sense/antisense in-frame double open reading frame (DORF). DORFs occur in 74,124 of the genes in the gene bank (6.8%). Other highly represented MORFs include a triple open reading frame (TORF) composed of the sense/antisense open reading frames (SASORFs) and the +1 nucleotide frame shift in the sense frame (16,294 observations). To test the statistical significance of these observations, we generated one million "stop"-free hypothetical genes of totally random composition (GC=50%). The frequency of occurrence of MORFs in these one million random-composition genes is 1.6%. Only the five possible DORFs are found in the hypothetical genes; none of the other 24 types of MORFs that are observed in nature is ever generated. In addition, we have found a remarkable pattern of GC-rich codon/nucleotide triple bias in MORFs in some species and protein families, including the short chain oxidoreductase (SCOR) enzymes. In Streptomyces coelicolor, 34% of all genes have SASORFs, 27% have TORFs, and 2.7% have four open reading frames. Across all of the S. coelicolor genes having MORFs, 85% of the codons in the reading frame and of the nucleotide triples in the genes are GC-rich (two out of three nucleotides are G or C). It is important to note that 14% of the genes in the human genome also have MORFs. Each of the 29 possible types of MORFs is represented by 25 or more human genes, but the vast majority are the five possible DORFs. Distinct differences in MORF patterns are observed among bacterial, eukaryotic, archaeal, and viral genomes. SCOR enzyme genes exhibit a GC-rich triple bias similar to that of GC-rich bacteria, suggesting that the primordial SCOR enzyme evolved in a GC-rich species. AT-rich species have very few MORFs.
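The frame survey and the random-gene control described above can be illustrated with a minimal Python sketch. It is not the authors' software. The frame labels (`+0`, `+1`, `+2` on the sense strand and `-0`, `-1`, `-2` on the reverse complement), the function names, and the fixed gene length `n_codons` are all assumptions made for illustration; the abstract does not state the length distribution of the hypothetical genes, so the fraction returned by `morf_fraction` depends on the length chosen and should not be expected to reproduce the reported 1.6%.

```python
import random

# Stop codons of the standard genetic code.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the antisense strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def is_stop_free(seq, offset):
    """True if no complete triple read from `offset` is a stop codon."""
    return all(seq[i:i + 3] not in STOPS
               for i in range(offset, len(seq) - 2, 3))


def open_frames(gene):
    """Return the labels of all stop-free frames among the six frames.

    Assumed convention: '+k' is the sense strand read from offset k,
    '-k' is the reverse complement read from offset k.
    """
    rc = reverse_complement(gene)
    found = set()
    for k in range(3):
        if is_stop_free(gene, k):
            found.add(f"+{k}")
        if is_stop_free(rc, k):
            found.add(f"-{k}")
    return found


def random_stop_free_gene(n_codons, rng):
    """Draw uniform random codons (GC = 50% on average), rejecting stops."""
    codons = []
    while len(codons) < n_codons:
        codon = "".join(rng.choice("ACGT") for _ in range(3))
        if codon not in STOPS:
            codons.append(codon)
    return "".join(codons)


def morf_fraction(n_genes, n_codons, seed=0):
    """Fraction of random genes with at least one extra stop-free frame."""
    rng = random.Random(seed)
    hits = sum(
        len(open_frames(random_stop_free_gene(n_codons, rng))) > 1
        for _ in range(n_genes)
    )
    return hits / n_genes
```

For example, `open_frames("ATGCTAGCA")` reports which of the six frames lack a stop codon, and `morf_fraction(10000, 300)` estimates how often a random stop-free gene of 300 codons (a hypothetical length) carries a second open frame.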
In Fusobacterium nucleatum (GC=27%), only 1.3% of the 2000 genes have MORFs. Because we speculate that the most primitive genetic code was GC-rich and have noted that AT-only codons are rarely used in GC-rich species, we screened the SWISS-PROT/TrEMBL database for genes that contain no AT-rich triples and examined their MORF content. We found 4899 genes without AT-rich triples, 94% of which have MORFs, including 1156 with TORFs and 831 with SASORFs. Closer examination of these genes revealed a correlation between the absence of AT-rich triples and the absence of the triples TAG and TTG and of the triples corresponding to the codons for cysteine (TGT and TGC). These data are consistent with our hypothesis that the most primitive genetic code was GC-rich and that cysteine may have been the last of the classic 20 amino acids to enter protein composition. The observed patterns support our contention that it is possible to reliably trace the evolution of the genetic code and the amino acid composition of proteins in MORF-rich genes. Supported in part by NIH Grant DK26546.
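The triple classification used in this screen can be sketched as follows. This is an illustrative reading, not the authors' procedure. It takes "GC-rich" to mean at least two of the three nucleotides are G or C, and it treats "AT-rich" as the mirror case by default; the abstract does not define AT-rich explicitly, so the `max_gc` parameter is a hypothetical knob (set it to 0 to screen for AT-only triples instead). Whether triples are read in frame (`step=3`) or at every position (`step=1`) is likewise an assumption exposed as a parameter.

```python
def gc_count(triple):
    """Number of G or C nucleotides in a triple."""
    return sum(base in "GC" for base in triple)


def triples(seq, offset=0, step=3):
    """Complete triples of `seq`, starting at `offset`.

    step=3 yields in-frame codons; step=1 yields every overlapping triple.
    """
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, step)]


def gc_rich_fraction(seq, step=3):
    """Fraction of triples with at least two G/C nucleotides."""
    ts = triples(seq, step=step)
    if not ts:
        return 0.0
    return sum(gc_count(t) >= 2 for t in ts) / len(ts)


def has_no_at_rich_triples(seq, step=1, max_gc=1):
    """True if no triple has `max_gc` or fewer G/C nucleotides.

    max_gc=1 (assumed default): AT-rich means two or three of A/T.
    max_gc=0: only AT-only triples count as AT-rich.
    """
    return all(gc_count(t) > max_gc for t in triples(seq, step=step))
```

Under the default definition, TAG, TTG, and TGT each contain one G or C and so count as AT-rich, whereas TGC contains two and does not; the reported absence of TGC is therefore a separate observation that this filter alone would not produce.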
Hauptman-Woodward Medical Research Inst., Inc.