Book of Abstracts: Albany 2007
June 19-23 2007
Estimating the number of protein superfamilies without known structure
In the past decade, the number of known protein sequences has been increasing steadily owing to the progress of genome sequencing. Structural Genomics (SG) projects were designed to counter this sequence explosion by high-throughput structural characterization of all protein sequence families. Although SG projects have been providing novel protein structures at an accelerated pace, it is not clear how far we are from the ultimate goal of obtaining at least one structure for each protein family.
We approached this problem by building 22126 profile Hidden Markov Models (HMMs) of all protein families in the CD, COG, KOG, PFAM, TIGRFAM and SMART databases. These profiles were compared to each other and to 14100 profile HMMs derived from protein structures in PDB clustered at 95% identity. Significant scores from ~720 million profile-profile comparisons were stored in a matrix and used to cluster protein families and PDB structures. Of ~7400 distinct superfamily clusters, more than 4600 had no associated PDB structure. Of these, ~4000 superfamilies contain soluble proteins, while ~600 are predicted to have at least three trans-membrane regions.
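The clustering step above can be illustrated with a minimal sketch. The abstract does not state which clustering algorithm was used, so single-linkage clustering via union-find is an assumption here, and all profile names and hit pairs are hypothetical toy data. Each significant profile-profile score is treated as a link, and a cluster counts as structurally uncharacterized when it contains no PDB-derived profile.

```python
from collections import defaultdict


def cluster_profiles(profiles, significant_pairs):
    """Group profiles into clusters by single linkage (assumed method).

    profiles: iterable of profile identifiers (family or PDB-derived).
    significant_pairs: iterable of (a, b) pairs with a significant score.
    """
    parent = {p: p for p in profiles}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in significant_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for p in profiles:
        clusters[find(p)].add(p)
    return list(clusters.values())


def clusters_without_structure(clusters, pdb_profiles):
    """Return clusters that contain no PDB-derived profile."""
    return [c for c in clusters if not (c & pdb_profiles)]


# Hypothetical toy input: five family profiles and one PDB-derived profile.
profiles = ["PF_A", "COG_B", "PDB_1", "PF_C", "SM_D", "PF_E"]
pairs = [("PF_A", "COG_B"), ("COG_B", "PDB_1"), ("PF_C", "SM_D")]
pdb = {"PDB_1"}

clusters = cluster_profiles(profiles, pairs)
orphans = clusters_without_structure(clusters, pdb)
print(len(clusters), "clusters,", len(orphans), "without a PDB structure")
```

Single linkage is the most permissive choice: one significant hit is enough to merge two clusters, so a stricter linkage rule would yield more, smaller clusters.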
Taken at face value, the results above suggest that fewer than 5000 carefully chosen protein structures are needed to create a set of structural representatives for all protein families. However, this is likely an overestimate, as even the most sensitive profile-profile methods tend to miss genuine structural relationships between protein families whose sequences have diverged too far to reflect them. Furthermore, when considering only those clusters with a large number of proteins (arbitrarily set at >50), the number of protein families without structural relatives in PDB drops to ~1200 (900 families with soluble proteins and 300 predicted to be trans-membrane).
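The size filter described above amounts to a simple tally, sketched below. The record fields (`n_proteins`, `has_pdb`, `tm_regions`) and the toy values are hypothetical; only the >50 protein cutoff and the three-region trans-membrane criterion come from the text.

```python
from collections import Counter


def tally_unsolved(clusters, min_proteins=50, min_tm_regions=3):
    """Count large clusters lacking a PDB structure, split by predicted type.

    A cluster is kept only if it has more than min_proteins members and no
    PDB structure; it is labelled "membrane" when at least min_tm_regions
    trans-membrane regions are predicted, otherwise "soluble".
    """
    counts = Counter()
    for c in clusters:
        if c["has_pdb"] or c["n_proteins"] <= min_proteins:
            continue
        kind = "membrane" if c["tm_regions"] >= min_tm_regions else "soluble"
        counts[kind] += 1
    return counts


# Hypothetical toy records, one per superfamily cluster.
toy_clusters = [
    {"n_proteins": 120, "has_pdb": False, "tm_regions": 0},
    {"n_proteins": 75, "has_pdb": False, "tm_regions": 5},
    {"n_proteins": 50, "has_pdb": False, "tm_regions": 0},   # not above cutoff
    {"n_proteins": 300, "has_pdb": True, "tm_regions": 0},   # already solved
    {"n_proteins": 60, "has_pdb": False, "tm_regions": 2},
]

tally = tally_unsolved(toy_clusters)
print(dict(tally))
```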
We created a top-10 list of the most abundant protein families, which includes 8 enzyme folds (dehydrogenases, ATPases, methyltransferases, GTPases, kinases, α/β hydrolases, TIM-barrel and aminotransferases) and 2 non-enzymes (IG and WD folds). In addition, we offer a top-10 list of "most wanted" families, which contain the largest number of proteins without a PDB structure.
Dept of Microbiology,