Book of Abstracts: Albany 2003
June 17-21 2003
Low Complexity Sequences in Proteins: New Perspectives for Sequence and Structure from Comparative Genomics
Ever since Margaret Dayhoff published the Atlas of Protein Sequence and Structure, several investigators have analyzed protein sequences with respect to structure and function. Russell Doolittle observed redundancies at the dipeptide and tripeptide levels (1). Subsequently, Wootton developed computational algorithms to assess complexity in protein sequences (2). It was observed that most sequences with available crystal structures were of high complexity, and that a single protein could contain segments of differing sequence complexity. Low complexity sequences (sequences with repeats) usually had non-globular structures, in contrast to high complexity sequences, which had globular shapes. In routine database searches for prediction of protein function, low complexity sequences are filtered out to reveal true sequence similarities for unambiguous predictions (3). Perhaps because of this practice, low complexity sequences did not receive detailed attention for quite some time.
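As an illustration of how sequence complexity can be scored computationally, the following is a minimal sketch that assumes a sliding-window Shannon entropy over residue composition. It is not the algorithm of (2) nor the measure used in (9); the function names, the window length of 12 and the threshold of 2.2 bits are hypothetical choices made only for this example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_windows(sequence, window=12, threshold=2.2):
    """Return (start, entropy) for each window whose entropy falls below the threshold.

    The window and threshold defaults are illustrative assumptions, not published values.
    """
    hits = []
    for start in range(len(sequence) - window + 1):
        h = shannon_entropy(sequence[start:start + window])
        if h < threshold:
            hits.append((start, h))
    return hits


if __name__ == "__main__":
    # Hypothetical test sequence: a glutamine run followed by a compositionally diverse stretch.
    seq = "Q" * 12 + "ACDEFGHIKLMN"
    for start, h in low_complexity_windows(seq):
        print(start, round(h, 3))
```

A window made of a single residue type scores 0 bits, whereas a window of 12 distinct residues scores log2(12) bits, so lower values flag repetitive, compositionally biased segments.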
The initiation of genome projects has uncovered thousands of new sequences, allowing us to re-examine the role of low complexity sequences with respect to protein sequence, structure and evolution. Meanwhile, it has become apparent from several recent reports that low complexity sequences in several proteins may lack a rigid structure (that is, be unstructured) but adopt a defined structure upon binding a target (4, 5). Further, the analysis of low complexity sequences has recently received attention from several groups, including ours (6-9).
Analysis of newly available bacterial protein sequences, using a novel complexity measure developed for comparative genomics, has revealed that low complexity sequences constitute a minor fraction of bacterial genomes (9). This indicates that they are generally selected against in bacterial evolution. A few bacteria, such as Mycobacterium tuberculosis, Pseudomonas aeruginosa and Deinococcus radiodurans, have a high proportion of low complexity sequences in their genomes. It was also observed that the top-ranking amino acids in these sequences were those that have been classified as having evolved early by a variety of criteria (10, 11). These observations indicate that low complexity sequences evolved and expanded in bacteria using the early codons (9). In pathogenic organisms, low complexity sequences are thought to provide a repertoire for variation (12), whereas in organisms such as D. radiodurans they are thought to aid rapid recombination-mediated repair (13). In eukaryotes, many low complexity regions are of the extreme type (X)n; for example, X = N (asparagine) in Plasmodium and X = Q (glutamine) in humans.
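Extreme regions of the single-residue repeat type, such as runs of N or Q, can be located with a simple scan for consecutive identical residues. The sketch below is only an illustration: the function name and the minimum run length of 10 are hypothetical, and it is not the complexity measure described in (9).

```python
import re


def homopolymer_runs(sequence, min_length=10):
    """Return (residue, start, length) for each run of one residue of at least min_length.

    min_length must be 1 or greater; the default of 10 is an arbitrary assumption.
    """
    # A captured residue followed by at least (min_length - 1) further copies of itself.
    pattern = re.compile(r"(.)\1{%d,}" % (min_length - 1))
    return [(m.group(1), m.start(), len(m.group(0))) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical sequence containing a 12-residue asparagine run.
    seq = "MK" + "N" * 12 + "A" + "Q" * 3
    print(homopolymer_runs(seq))
```

Start positions are 0-based, and runs shorter than the chosen minimum (here the three glutamines) are ignored.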
The study of the structural aspects of low complexity sequences is entering an interesting stage. It has been shown that polyglutamine sequences adopt a beta-sheet structure. In proteins such as myosin, low complexity sequences (in the form of alternating hydrophobic and hydrophilic residues) adopt a helical conformation. In several other proteins they are unstructured but adopt a defined conformation upon binding to a target. With thousands of new sequences available, it would be of interest to probe the structural aspects of low complexity sequences. To this end, we have identified several such sequences as peptide models for structural investigations (9).