Issue December 2006

category image Volume 24
No. 3 (p 203-302)
December 2006
ISSN 0739-110
Open Access

Using Bagging Classifier to Predict Protein Domain Structural Class (p. 239-242)

Classification and prediction of protein domain structural class is one of the important topics in the molecular biology. We introduce the Bagging (Bootstrap aggregating), one of the bootstrap methods, for classifying and predicting protein structural classes. By a bootstrap aggregating procedure, the Bagging can improve a weak classifier, for instance the random tree method, to a significant step towards optimality. In this research, it is demonstrated that the Bagging performed at least as well as LogitBoost and Support vector machines in predicting the structural classes for a given protein domain dataset by 10 cross-validation test, which indicate that the Bagging method is promising and anticipated that it could be potentially further improved on predicting protein structural classes as well as other bio-macromolecular attributes, if the bagging method and other existing methods can be effectively complemented with each other.

Key words: Protein structure classification; Bagging; Support vector machines; LogitBoost; and Amino acid composition.

Liuhuan Dong1
Yuan Yuan2
Yudong Cai1,*

1Dept of Combinatorics and Geometry
CAS-MPG Partner Institute for Computational Biology
Shanghai Institutes for Biological Sciences
Chinese Academy of Sciences
China 200031
2Department of Statistics
Jiaxing College
Zhejiang, China 314000
*cyd@picb.ac.cn

Open Access Article
The authors, the publisher, and the right holders grant the right to use, reproduce, and disseminate the work in digital form to all users.


Click here to download PDF.

Citations of this Paper

Mizianty, M.J., Kurgan, L. Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences (2009) BMC Bioinformatics, 10, art. no. 414, .

Dagliyan, O., Halil Kavakli, I., Turkay, M. Classification of cytochrome p450 inhibitors with respect to binding free energy and pIC50 using common molecular descriptors (2009) Journal of Chemical Information and Modeling, 49 (10), pp. 2403-2411.

Costantini, S., Facchiano, A.M. Prediction of the protein structural class by specific peptide frequencies (2009) Biochimie, 91 (2), pp. 226-229.

Kurgan, L.A., Zhang, T., Zhang, H., Shen, S., Ruan, J. Secondary structure-based assignment of the protein structural classes (2008) Amino Acids, 35 (3), pp. 551-564.

Chen, K.E., Kurgan, L.A., Ruan, J. Prediction of protein structural class using novel evolutionary collocation-based sequence representation (2008) Journal of Computational Chemistry, 29 (10), pp. 1596-1604.

Niu, B., Jin, Y.-H., Feng, K.-Y., Liu, L., Lu, W.-C., Cai, Y.-D., Li, G.-Z. Predicting membrane protein types with bagging learner (2008) Protein and Peptide Letters, 15 (6), pp. 590-594.

Kurgan, L., Cios, K., Chen, K. SCPRED: Accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences (2008) BMC Bioinformatics, 9, art. no. 226, .
Introduction

Classification and prediction of protein structure is one of the important topics in the protein science. According to Nikashima work (1), it implied that it is correlated between protein structural information and amino acid composition. Klein (2) and Chou (3) also showed that by the amino acid composition, scientists could discover some folding patterns information of protein structure, while avoiding to the complicated and irregular three-dimensional structure. Usually Proteins are classified into four folding classes is all-α, all-β, α/β, and α + β. The α, β folding information of protein secondary structure becomes important for characterizing the overall structure of a protein or its domain. A lot of effort is made to predict the protein structural classes based on the amino acid composition, [see, e.g., (4, 5, 6, 7)]. Recently with the developing of computer science, mathematics and statistics, novel methods, such as support vector machines (SVMs) and LogitBoost classifiers, were introduced to this field (8, 9).

In this paper, we would like to present a approach named ?Bagging? (10), which is the abbreviation of ?Bootstrap aggregating?, to predict the protein structural classes. We will give a detailed description of Bagging in the following.

How Bagging Works

The Bagging predictor was firstly introduced by Breiman (1996), which is similar to the boosting algorithm. It is a method for using independent bootstrap procedure on the replicate learning set and then to get an aggregated predictor with plurality voting. Tests and theoretical analysis of Breiman?s work shows that the Bagging predictors can push an unstable weak classifier to a significant step towards optimality (10). The bagging method has been used to various classification and prediction problems in biology and social sciences [see, e.g., (11, 12, 13, 14)].

For Learning set:

L : = {(yn, xn), n = 1, 2, ?, N},....................[1]

where xn ∈ X is the input and yn isin; Y is the response from a predictor denoted by φ(X,L). When consider k learning sets {Lk} of N iid (independent identical distribution) observation from the same distribution of L, it is obvious to use the average to get a better estimate of the classification:

φA(X) = ELφ(X,L),....................[2]

where the EL denotes the expectation over L. Unfortunately, we usually only have one learning set L, so, one resolvable way is to take bootstrap samples {L(B)} of N(B) cases, randomly, but with replacement from L, and form predictor set {φ(X, L(B))}. Then the predictor φ, which get most voting score of test set is the best classification. The algorithm of Bagging method is showed below Table I:


Choice of Weak Classifier

The program is downloaded from WEKA (16). The random tree algorithm for constructing a tree that considers one randomly chosen attribute at each node is used in the Bagging procedure.

Implementation and Dataset

The working dataset was taken from (17) that contains 204 protein chains. Fifty-two of the 204 are all-α protein domain, 61 are all-β, 45 are α/β, and 46 are α + β. The average sequence similarity is 21% for all-α, 30% for all-β, 15% for α/β, and 14% for α + β. Therefore, there is no similarity for each amino acid sequence within the same class. Twenty amino acid compositions of these 204 protein chains are, thus, taken as the input feature of 20-dimension vectors.

Results and Discussion

The demonstrations are conducted by two different approaches, the re-substitution test and the 10 cross-validation test, with comparing against other two popular methods: SVMs and LogitBoost. The results show that the Bagging predictor works at least as well as the two others.

Re-substitution and 10 Cross-validation Test

All the methods (SVMs, LogitBoost, Bagging) get 100% accuracy on the re-substitution test, which indicates that these methods grasp the complicated relationship between the amino acid composition and the structural class, and reflects good self-consistency of these predictors. However, a cross-validation test is needed to really reflect the power of the predictor. Leave-one-out test usually works well for estimating generalization error for continuous error functions, but it probably perform poorly for discontinuous data set such as the number of misclassified cases (18), like 204 protein structural dataset in this paper. Breiman (19) and Kohavi (20) found that 10 cross-validation test works better than leave-one-out test on regression and decision trees. Since the target of our work is to obtain as few as misclassified samples by the Bagging approach with random feature decision tree method mentioned above, the 10 cross-validation is preferred in our test.

Comparison with SVMs and LogitBoost

The classifier based on SVMs (21, 22, 23) and LogitBoost (24) are applied to the same problem as a comparison. The main advantage of SVMs classifier is less overfitting. It tries to find out a hyperplane, by which it can optimally separate the data into classes on the term of maximizing the distance between the closest vectors to the hyperplane. This is because right now, people just can believe the results of linear regression. So the separating hyperplan can be trusted. However, it is quite difficult to decide the sophisticated parameters. In this paper the polynomial kernel is chosen. The complexity parameter C and the exponent of the polynomial kernel are adapted to 10.0 and 3.0. The LogitBoost is based on the boosting method, which, like Bagging, can also improve an unstable and weak classifier into a better one. It iteratively makes the wrong classified sample more weighted and concentrated to improve the classification accuracy. Because it focuses on adjusting the weights for the wrong classified samples, it may lose some accuracy on classifying other samples. For some datasets, it will also cause overfitting problem. While for the method of Bagging, as introduced above: it focus on the global accuracy, that is why it can obtain less overfitting and try to get the averagely classification results. For the LogitBoost and Bagging algorithm, some parameter setting is adapted empirically. We try two classifiers: stumps (default) and M5 trees (25) for LogitBoost to obtain a better performance, it seems that M5 trees turn out to be much better than stumps, because they are able to generate several unconnected regions for a category. For the same reason, the number of iterations of both LogitBoost is set to be 100. With the setting of 99% out-of-bag error being calculated, random trees for classifier and 96 times iterations, the Bagging obtain a local optimal result.

All the success rates are showed in Table II. On the term of overall success rate, SVMs is a little better than the Bagging and LogitBoost (M5 trees) methods. But if we take a look at the α + β class, we found most algorithm don?t do well on this structural class prediction, which is considered to be difficultly predicted. Among these approaches, Bagging is the best predictor for the α + β class, while the SVMs gets much worse prediction. This implies the advantage of Bagging: the independent aggregating method can make an averagely good classification inferred from the above Equation [2].


All the tests run on a 2 GHz Intel Core Duo Macintosh, which has two processes. Thus, another advantage of Bagging that can run in parallel mode by independent bootstrap procedure cause a significant prevail over LogitBoost (M5 trees). The run time of these algorithms of building prediction model is showed in Table III. Although the run time is not an essential problem of algorithm, however, it means that the Bagging method has potential to run on other larger scale bio-dataset, if needed.


Acknowledgement

The authors would like to thank Prof. Dr. K. C. Chou for providing the 204 protein structural dataset, and Dr. S. Grunewald for checking the grammar.

References and Footnotes
  1. Nakashima, H., Nishikawa, K., and Ooi, T. J. Biochem. 99, 152-162 (1986).
  2. Klein, P. Biophys. Acta. 876, 205-275 (1986).
  3. Chou, K. C. and Maggiora, G. M. Protein Engineering 11, 523-538 (1998).
  4. Bahar, I., Atilgan, A. R., Jernigan, R. L., Erman, B. PROTEINS: Struct. Funct. Genet. 29, 172-185 (1997).
  5. Chou, J. J., Zhang, C. T. J. Theor. Biol. 161, 251-262 (1993).
  6. Chou, K. C. Proteins:Struct. Funct. Genet. 21, 319-344 (1995).
  7. Cai, Y. D., Zhou, G. P. Biochimie 82, 783-785 (2000).
  8. Chou, K. C., Cai, Y. D. J. Biol. Chem. 277, 45765-45769 (2002).
  9. Cai, Y. D., Feng, K. Y., Lu, W. C., and Chou, K. C. J. Theoretical Biology 238, 172-176 (2006).
  10. Breiman, L. Machine Learning 24, 123-140 (1996).
  11. Huang, Y., Cai, J., Ji, L., Li, Y. Comput Biol Chem. 28, 275-280 (2004).
  12. Prasad, A. M., Iverson, L. R., Liaw, A. Ecosystems 9, 181-199 (2006).
  13. Mardin, C. Y., Hothorn, T., Peters, A., Junemann, A. G., Nguyen, N. X., Lausen, B. J Glaucoma. 12, 340-346 (2003).
  14. Dudoit, S., Fridlyand, J. Bioinformatics 19, 1090-1099 (2003).
  15. Breiman, L., Friedman, J., Olshen, R., and Stone, C. Classification and Re-gression Trees. Wadsworth (1984).
  16. Witten, I. H. and Frank E. Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. Morgan Kaufmann, San Francisco (2005).
  17. Chou, K. C. Biochem. Biophys. Res. Comm. 264, 216-224 (1999).
  18. Shao, J. J of the American Statistical Association 88, 486-494 (1993).
  19. Breiman, L. and Spector, P. International Statistical Review 60, 291-319 (1992).
  20. Kohavi, R. International Joint Conference on Artificial Intelligence (1995).
  21. Vapnik, V. Statistical Learning Theory. Wiley-Interscience, New York (1998).
  22. Platt, J. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. Advances in Kernel Methods-Support Vector Learning. Eds., B. Schoelkopf, C. Burges, and A. Smola. MIT Press (1998).
  23. Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., and Murthy, K. R. K. Neural Computation 13, 637-649 (2001).
  24. Friedman, J., Hastie, T., and Tibshirani, R. Ann. Stat. 337-407 (2000).
  25. Quinlan J. R. Learning with Continuous Classes. Proceedings of the Australian Joint Conference on Artificial Intelligence, pp. 343-348. World Scientific, Singapore (1992).