Issue October 2010, No. 2 (133-288), ISSN 0739-1102
Open Access

A Stoichiometry Driven Universal Spatial Organization of Backbones of Folded Proteins: Are there Chargaff’s Rules for Protein Folding? (133-142)

Protein folding is at least a six-decade-old problem, dating from the times of Pauling and Anfinsen. However, the rules of protein folding remain elusive to date. In this work, rigorous analyses of several thousand crystal structures of folded proteins reveal a surprisingly simple unifying principle of backbone organization in protein folding. We find that protein folding is a direct consequence of a narrow band of stoichiometric occurrences of amino-acids in primary sequences, regardless of the size and the fold of a protein. We observe that “preferential interactions” between amino-acids do not drive protein folding, contrary to all prevalent views. We dedicate our discovery to the seminal contribution of Chargaff, which was one of the major keys to elucidation of the stoichiometry-driven, spatially organized double helical structure of DNA.
Aditya Mittal1#*
1School of Biological Sciences, Indian Institute of Technology Delhi, New Delhi, 110016, India

Introduction
Protein folding remains a grand challenge to date. Since the elegant work of Pauling (1) and Anfinsen (2, 3), and the formulation of Levinthal’s paradox (4, 5), several views on protein folding have emerged over the last several decades (6-14). With a remarkable increase in available computational hardware and tools in recent times, attempts at determination of protein structure and function by modeling have become somewhat routine (15-26). However, research in the field has primarily focused on the possible variety of different interactions that can lead to specific features in functional protein structures. This has led to a strong prevalence of somewhat diverse views on how a protein folds (e.g., via electrostatics/hydrogen bonding vs. hydrophobic interactions). Therefore, protein folding still remains the biggest unsolved problem in modern biology. Thus we asked the question: is there a unifying theme or concept underlying the magnificent diversity of folded protein structures?

The understanding of DNA structure came from a careful analysis of the backbone from X-ray diffraction data of a single molecular species, fibers of B-DNA (27). Inspired by this, we decided to investigate the backbones of, not one or two, but several thousands of folded proteins from their published crystal structures in the protein data bank, PDB (28). If protein folding resulted from specific amino-acid interactions, the backbones of folded proteins would be organized within the constraints of defined “neighborhoods” for Cα atoms of each amino-acid. For example, if two amino-acids were to interact with each other (e.g., via side-chains), their respective Cα atoms would be expected to occur in fixed neighborhoods relative to each other, regardless of their actual position in the protein. Further, the Cα of an amino-acid occurring mostly in the “center” of folded proteins would always be expected to be surrounded by a higher number of Cα atoms of other amino-acids.
Figure 1 shows our approach for analyzing the protein backbones in terms of neighborhoods of every Cα atom in several thousands of folded proteins from their crystal structures.

Methods

Coordinates of all atoms in crystal structures of 4000+ proteins were taken from the Protein Data Bank. After specifically extracting the Cα coordinates for all the amino-acids (i.e., the backbone of the folded protein) from a given PDB file, neighborhood analysis was done as described in Figure 1. Of the total crystal structures, 3718 were finally analyzed in detail (see legend to Figure 1). For each protein, considering each of the amino-acids individually yielded a 20 × 20 matrix of the number of “neighbors” within a defined neighborhood distance. Thus, the total number of 20 × 20 matrices was equal to the total number of the defined neighborhood distances. Data of all the 20 × 20 matrices were analyzed in MATLAB (Mathworks Inc., USA). The PDB ids of proteins classified as “unstructured” were taken from the database at www.ebi.ac.uk/interpro (29).

Figure 1: Analyzing backbones of 3718 folded proteins – Cα backbone of the protein Mitochondrial Serine Protease HtrA2 (PDB: 1LCY). (A) shows how the neighborhood of the Cα shown in grey was investigated. If the Cα of any amino-acid was found to occur within a particular distance (represented by blue circles in 2-D) in the 3-D crystal structure, it was scored as a neighbor. Cα atoms of the peptide-bonded partners, i.e., the residues adjacent to the amino-acid being investigated along the primary sequence (shown in red), were not scored as neighbors. (B) shows the neighborhood analysis for another Cα in the same protein. This analysis was done for all the Cα atoms in the protein, with the neighborhood distances (represented by radii of the blue circles) fixed at 0-9 Å, with increments of 1 Å, and 10-90 Å, with increments of 5 Å. Distances of 0-3 Å were chosen as an internal check (since zero neighbors were expected at these distances).
Beginning with neighborhood analysis of 4000+ crystal structures, we finally analyzed 3718 total crystal structures (see supplementary Table S1) by including only those proteins with 50 or more residues and removing those structures that did not pass the internal check.

Results and Discussion

To investigate the presence of preferential neighborhoods expected to arise out of the four well established non-covalent interactions (hydrogen bonds, electrostatic, hydrophobic and van der Waals), we plotted the number of times a specific amino-acid appears as a neighbor of a given amino-acid. For example, Figure 2A shows 20 data sets, each representing the number of times each of the 20 amino-acids appears as a neighbor for leucine within a defined neighborhood distance, in 3718 folded protein crystal structures. A clear sigmoidal trend is observed regardless of the identity of the neighbor. Similar sigmoidal trends were observed for neighborhoods of all the 20 amino-acids, as shown (as examples) for glycine, tryptophan and asparagine in Figures 2B-2D respectively.

These sigmoidal trends, surprisingly independent of the nature of amino-acids (e.g., polar vs. non-polar or big vs. small), could imply the existence of several spatial distributions, each uniquely defining neighborhoods of each amino-acid based on its presumed preferences. Alternatively, they could imply a single underlying spatial distribution of neighborhoods for all amino-acids, regardless of their conventional classification. If the former were true, one would require different equations to fit the different sigmoids. If the latter were true, one would need only one equation to fit all of the sigmoids for all the amino-acid neighborhoods. Our first extraordinary result was that a generalized, single, sigmoidal equation fits all the sigmoids (see legend of Figure 2).
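The counts that make up each data set come from the neighborhood scoring described in the Methods and the Figure 1 legend. A minimal Python sketch of that scoring for one protein at one distance is given here for illustration only; the original analysis was done in MATLAB. The function name `count_neighbors`, the input layout, the use of list position as sequence position (a single chain without gaps), and the choice of "less than or equal to" for "within" are all assumptions of this sketch.

```python
import math
from collections import defaultdict

def count_neighbors(residues, radius):
    """Count Calpha-Calpha neighbor pairs within `radius` (in angstroms) for one protein.

    `residues` is a list of (amino_acid, (x, y, z)) tuples in sequence order;
    this layout is an assumption of the sketch, not the authors' data format.
    Returns a dict mapping (reference_aa, neighbor_aa) to a count, i.e. a
    sparse form of the 20 x 20 matrix for this neighborhood distance.
    """
    counts = defaultdict(int)
    for i, (aa_i, xyz_i) in enumerate(residues):
        for j, (aa_j, xyz_j) in enumerate(residues):
            # Skip the residue itself and its peptide-bonded partners (i - 1, i + 1).
            if abs(i - j) <= 1:
                continue
            # "Within" is taken as <= radius here (assumption).
            if math.dist(xyz_i, xyz_j) <= radius:
                counts[(aa_i, aa_j)] += 1
    return dict(counts)

# Neighborhood distances from the Figure 1 legend:
# 0-9 angstroms in steps of 1, then 10-90 angstroms in steps of 5.
RADII = list(range(0, 10)) + list(range(10, 95, 5))

# Toy backbone (hypothetical data): four Calpha atoms on a line, 3.8 angstroms apart.
toy = [("LEU", (0.0, 0.0, 0.0)), ("GLY", (3.8, 0.0, 0.0)),
       ("TRP", (7.6, 0.0, 0.0)), ("ASN", (11.4, 0.0, 0.0))]
matrix_at_8 = count_neighbors(toy, 8.0)
```

Running `count_neighbors` once per entry of `RADII` and summing the resulting matrices over all proteins would give, for each pair of amino-acids, one count per distance, which is the kind of data set plotted in Figure 2. Each pair is scored twice in this sketch (once from each residue's point of view); whether the original analysis did the same is not stated.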
This single fit reflected the existence of an underlying (single) spatial distribution of neighborhoods of amino-acids, contrary to all prevailing views of the role of amino-acids in folded proteins. Visual inspection of the sigmoids appeared to show “clustering” or “groups” of specific amino-acids within the neighborhoods of a given amino-acid, based on their asymptotic values. These asymptotes of the sigmoids, reflecting the maximum possible total number of contacts between two amino-acids, were clearly different and pointed to our obvious next step.

If an amino-acid occurs most frequently in a primary sequence, it would be expected to be found most often as a neighbor of all 20 amino-acids (including itself), assuming no preferential interactions between amino-acids. Thus, the total number of contacts for the most frequently occurring amino-acid in a protein, reflected by the sum of all asymptotes of the 20 sigmoids for that amino-acid, would be expected to be the highest. Therefore, the total number of contacts for each amino-acid would be expected to be directly correlated to the frequency of occurrence of each amino-acid. If this were the case, it would imply that neighborhoods of amino-acids in folded proteins are simply governed by their individual frequencies of occurrence (stoichiometries) rather than by any preferential interactions with other partners. Alternatively, if amino-acids prefer certain neighborhoods due to preferential interactions (e.g., hydrophobic, hydrogen bonding, electrostatics), one would not be able to predict a direct relationship between the total number of contacts of a given amino-acid and its frequency of occurrence in folded proteins.
For example, if an amino-acid occurs most frequently but does not prefer to interact with many partners, the total number of contacts for this amino-acid would be expected to be much lower than that of an amino-acid occurring fewer times but interacting preferentially with several other amino-acids. To test which of the above two hypotheses was true, we plotted the sum of the 20 asymptotes for each amino-acid against the average percentage occurrence of that particular amino-acid in our 3718 folded proteins. To our surprise, the total number of contacts made by an amino-acid was correlated excellently with the average occurrence of that amino-acid (stoichiometry) in folded proteins, as shown by Figure 2E. This strongly supported our first hypothesis and directly implied an “absence” of any preferential interactions between amino-acids.

Now, asymptotic distances essentially define neighborhoods of the amino-acids only in terms of possible long range interactions. Thus, in case of a complete absence of any long range preferential interactions between amino-acids, their total number of contacts would directly reflect their frequencies of occurrence, as observed above. Therefore we carried out a closer inspection of the sigmoids to investigate the presence of short and medium range interactions. We utilized the fact that the sigmoids are characterized by two other parameters, namely “n” and “k” (see legend of Figure 2), which signify neighborhoods in much closer proximity to the reference amino-acid. “n” reflects the distance at which the lift-off of the sigmoid occurs (i.e., less than 5-12 Å for all sigmoids) and “k” reflects the intermediate neighborhoods (i.e., between 20-30 Å at the inflection point in all sigmoids). Thus, if there were any preferential neighborhoods, they would be reflected particularly in “n” (representing “close” interactions) and possibly in “k” (medium range interactions).
Figure 2F shows that (i) both n and k have very similar values regardless of the amino-acid, and (ii) both n and k are independent of the frequency of occurrence of any amino-acid.
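To make the roles of “n” and “k” concrete, the sigmoid from the Figure 2 legend, Y = YMax(1 − e^(−kX))^n, can be evaluated directly. For this functional form, setting the second derivative to zero gives an inflection point at X = ln(n)/k (for n > 1), where Y/YMax = (1 − 1/n)^n. The short Python sketch below uses the parameter values quoted in the text for Figure 4B (YMax = 1.462 × 10⁶, k = 6.87 × 10⁻², n = 5); the function names are illustrative and are not taken from the original MATLAB analysis.

```python
import math

def sigmoid(x, y_max, k, n):
    """Y = YMax * (1 - exp(-k * X)) ** n, the form given in the Figure 2 legend."""
    return y_max * (1.0 - math.exp(-k * x)) ** n

def inflection_distance(k, n):
    """Distance at which the curve is steepest, from Y'' = 0 (valid for n > 1)."""
    return math.log(n) / k

# Parameter values quoted in the text for Figure 4B.
Y_MAX, K, N = 1.462e6, 6.87e-2, 5

x_infl = inflection_distance(K, N)         # about 23.4 angstroms
frac_at_infl = sigmoid(x_infl, 1.0, K, N)  # equals (1 - 1/n) ** n, about 0.33 for n = 5
```

With these values the inflection point falls inside the 20-30 Å range described above. Note that in this form the inflection distance depends on both n and k, so the two parameters are not fully separable into “short” and “medium” range effects.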
Figure 2: Cα neighborhood analysis from crystal structures of 3718 proteins reveals a single, amino-acid independent spatial distribution – (A) Number of Cα atoms of each amino-acid in a given neighborhood distance for a leucine Cα, from crystal structures of 3718 proteins. If the Cα of an amino-acid is found within the fixed neighborhood distance from any leucine Cα, it is scored as a contact with leucine Cα. By doing so, 20 “sigmoidal” data-sets are observed. Each point on a sigmoid corresponds to the number of leucine-X Cα pairs, with X corresponding to one of the 20 amino-acids. The “sigmoidal” behavior is parameterized by $Y = Y_{Max}\,(1 - e^{-kX})^{n}$, shown by smooth lines. (B), (C), (D) Neighborhood contacts of Cα pairs for glycine-X, tryptophan-X and asparagine-X respectively. (E) Sum of all 20 YMax values for any amino-acid, indicating the total possible contacts for the Cα of a given amino-acid, correlates excellently (r² = 0.99) with percentage occurrence of that particular amino-acid in 3718 proteins. (F) Average values of “n” (▲) and “k” (☐) are independent of the percentage occurrence of amino-acids.

Figure 3: The single, amino-acid independent, spatial distribution is observed even for “zoomed in” neighborhoods within 10 Å – (A) Number of Cα atoms of each amino-acid within 10 Å for a leucine Cα, from crystal structures of 3718 proteins. The single, amino-acid independent, spatial distribution (parameterized in Figure 2) is shown by smooth lines. Inset shows that all the sigmoids collapse to almost a single sigmoid even at distances of 10 Å and lower, when each of the sigmoids is normalized w.r.t. the number of times a given amino-acid appears as a neighbor of leucine at 10 Å. (B), (C), (D) show the same results as in (A) for glycine, tryptophan and asparagine neighborhoods respectively.
(E) Sum of all 20 neighborhood values at 10 Å for leucine, glycine, tryptophan and asparagine correlates excellently (r² = 0.99) with percentage occurrence of each of the four amino-acids in 3718 proteins. (F) Average values of “n” (▲) and “k” (☐) are independent of the percentage occurrence of leucine, glycine, tryptophan and asparagine.

To re-iterate the significance of the above findings, especially Figure 2F in terms of short range interactions, we “zoomed” into the sub-10 Å region of the sigmoids shown in Figure 2. Figure 3 demonstrates that (a) the absence of preferential interactions between amino-acids and (b) the dependence of amino-acid neighborhoods primarily on their respective percentage occurrences in folded proteins both hold at sub-10 Å distances as well. Having observed this for the four examples of individual amino-acids in Figure 3, we next explored the universal spatial distribution in greater detail to be able to test our findings on the remaining sixteen amino-acids.

Figure 4A shows sigmoids varying only in their “n” values, with fixed values of YMax (= 1.462 × 10⁶) and k (= 6.87 × 10⁻²). Clearly, the lift-off points of the sigmoids, shown by arrows, are strongly dependent on “n” (= 1, 2, 3, 5, 7, 10). Therefore, if the neighborhood sigmoids of the amino-acids (e.g., those shown in Figures 2A-2D) were different due to preferential short range contacts (i.e., < 10 Å) with specific neighbors, one would expect very different “n” values from individual sigmoids. Clearly, this is not the case, as seen previously in Figure 2F: the “n” values for all the neighborhood sigmoids are very similar (4.35 – 4.87). On similar lines, Figure 4B shows sigmoids varying only in their “k” values (= 6.87 × 10⁻², 1.5 × 6.87 × 10⁻², 2 × 6.87 × 10⁻², 3 × 6.87 × 10⁻², 5 × 6.87 × 10⁻²), with fixed values of YMax (= 1.462 × 10⁶) and n (= 5). Firstly, the lift-off points of the sigmoids, shown by arrows, are very similar (in contrast to Figure 4A).
Secondly, if the neighborhood sigmoids of amino-acids were different due to any preferential medium range interactions with specific neighbors, one would expect very different “k” values for all the sigmoids. Clearly, this is also not the case: the “k” values for all the neighborhood sigmoids are very similar (6.43 × 10⁻² – 7.36 × 10⁻²), as seen previously in Figure 2F. Thus, data from crystal structures of 3718 folded proteins demonstrate the absence of any long, medium or even short range preferential interactions between amino-acids. The data also show that protein folding is simply governed by frequencies of occurrence (stoichiometries) of individual amino-acids.

To re-examine this, we re-plotted the data of Figure 2E for each amino-acid. Figure 4C shows that, regardless of the amino-acid, the percentage occurrence overlaps with the total number of contacts. In fact, this holds true regardless of the size of the protein, as shown in Figure 4D. Here, it is important to note that extremely meticulous inspection of Figures 3B and 3D might suggest minor deviations of the fits from the data that are completely absent/invisible in Figures 2B and 2D. Thus, we also re-plotted the data of Figure 2F for each amino-acid. Figure 5A shows that n and k are independent of the amino-acid. A clear independence of “n” from the nature of the amino-acid shows the absence of any preferential interactions even at the closest possible distance ranges. This is so because the minor deviations of the fits from the data at a very few individual points in Figures 3B and 3D are clearly only part of the noise in the data. Still, it can be argued from Figure 5A that some residues like C and H may have minor differences in “n” values compared to other amino-acids. These differences, if they indeed exist, are (a) very different in nature from the prevalent views of the role of amino-acid interactions in protein folding, and (b) open to further investigation.
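The stoichiometry check behind Figures 2E and 4C compares two numbers per amino-acid: its average percentage occurrence and its total number of contacts. A minimal Python sketch of both quantities follows. It assumes that “average percentage occurrence” means the mean of per-protein percentages (the text does not say whether the average was taken per protein or over pooled residues) and that r² is the squared Pearson correlation coefficient. The function names and toy sequences are hypothetical.

```python
from collections import Counter

def percent_occurrence(sequences):
    """Mean per-protein percentage occurrence of each residue (assumed definition)."""
    totals = Counter()
    for seq in sequences:
        for aa, count in Counter(seq).items():
            totals[aa] += 100.0 * count / len(seq)
    return {aa: total / len(sequences) for aa, total in totals.items()}

def r_squared(xs, ys):
    """Square of the Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

# Toy input (hypothetical one-letter sequences, not data from the paper).
occurrence = percent_occurrence(["AAGL", "AGGL"])
```

In the paper's setting, `xs` would be the 20 percentage occurrences and `ys` the 20 sums of YMax values, so an r² close to 1 corresponds to the overlap reported in Figure 4C.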
In summary, crystal structures of 3718 folded proteins show that protein folding is primarily dictated by the frequencies of occurrence of amino-acids in the primary sequence, i.e., its stoichiometry, regardless of the length/size. One essential prediction from our results is that the stoichiometry of “unstructured” proteins (as listed in the PDB; see Methods for the source reference and supplementary Table S2 for PDB ids) would deviate from the frequencies of occurrence of amino-acids in folded proteins. Figure 5B shows this to be the case indeed.

Conclusions

The foundations of all diverse and prevalent views on protein folding currently lie in “classifications” of constituent amino-acids (e.g., polar, non-polar). We have found that the crystal structures of 3718 folded proteins do not support this conventional view.

Figure 4: Cα neighborhoods are a direct consequence of frequency of occurrence of amino-acids, rather than any preferential interactions, in folded proteins – (A) For fixed values of “YMax” and “k” (see legend to Figure 2), sigmoids that are different only in “n” values show a clear separation at their “lift-off” points as indicated by arrows (shown for n = 2, 3, 5, 7 and 10 respectively). (B) For fixed values of “YMax” and “n”, sigmoids that are different only in “k” values do not show much separation at “lift-off” points as indicated by arrows, but are substantially separated at the half-maximum point. (C) Data in Figure 2E is re-plotted for each amino-acid. Total number of contacts (●, left Y-axis) overlaps with the percentage occurrence (●, right Y-axis) of each amino-acid in folded proteins. (D) Subsets of data from (C) for different sizes of proteins. Total number of contacts made by individual amino-acids in folded proteins of sizes 101-150 (■), 151-200 (◆), 201-250 (▲), 251-300 (+) and 301-350 (●) are plotted. All subsets show exactly the same trend and vary only in the actual numbers.
Thus, any of the subsets can be scaled to give exactly the same result as shown in (C).

Figure 5: Cα neighborhoods do not show any preferential interactions at either short (~10 Å) or medium distance (~25 Å) ranges in folded proteins – (A) Data in Figure 2F is re-plotted for each amino-acid. The sigmoidal parameters, “n” (●, right Y-axis) and “k” (●, left Y-axis), show similar values for all amino-acids. (B) Average percentage of occurrence of amino-acids in 212 “unstructured” proteins is plotted against average percentage occurrence of the corresponding amino-acids in the 3718 folded proteins investigated in this work. The correlation is weak (r² = 0.83) compared to that observed in Figure 2E.

We have shown that protein folding is directly correlated to the stoichiometry of amino-acids in the primary sequence (i.e., frequency of occurrence). Table I shows the percentage occurrence of each amino-acid in 3718 folded proteins. These percentage occurrences are essentially the “Chargaff’s Rules”, dedicated to the seminal contribution of Chargaff for DNA structure (30), applicable to the primary sequence compositions that result in folded proteins. This means that to achieve a folded protein, the stoichiometric ratios of individual amino-acids have to be [...] that most important parameters for protein folding (and protein engineering) are (i) exclusion by water and (ii) shape characteristics of individual amino-acids along the sequence that would minimize the surface-to-volume ratio. One can visualize protein folding like fitting “Lego Blocks” tied with a thread and packed into the lowest surface-to-volume ratio.

Supplementary Material

Supplementary Table S1 lists PDB ids of all 3718 folded proteins, with the length (no. of amino-acids) of each protein, the crystal structures of which have been analyzed in this work. Supplementary Table S2 lists PDB ids of 212 unstructured proteins (23), with the length (no. of amino-acids) of each protein, the sequence compositions of which have been analyzed for Figure 5B. The supplementary material can be downloaded free of charge from:

Acknowledgements

BJ acknowledges the funding support from the Department of Biotechnology, and Department of Information Technology, Govt. of India. BJ and AM are grateful to Prof. D. L. Beveridge for his critical reading of the manuscript. SS is grateful to the Department of Science and Technology (WOS scheme), Govt. of India. AM and BJ are particularly grateful to Prof. S. Prasad, Director, IIT Delhi, and Prof. B. N. Jain, ex-Dy. Director (Faculty), IIT Delhi, for providing the right platform for carrying out this work. We are grateful to our four anonymous reviewers for their constructively critical assessment and positive response to our work. We especially acknowledge “referee 1” for pushing us to prepare a much better manuscript. And finally, we are very grateful to the Chief Editor of J. Biomol. Struct. Dyn. for having faith in our ability to address the queries of the reviewers in a timely manner.

Author Contributions

AM and BJ designed the study, analyzed the data and wrote the manuscript. SS and TSB collected the data.

Competing Financial Interests

There are no competing financial interests.

References