Book of Abstracts: Albany 2011
June 14-18 2011
©Adenine Press (2010)
Protein Structure Modelling on the Indo-US Cancer Research Grid
The importance of protein structures follows from the fact that the function of any protein is directly correlated with its structure (1). The three-dimensional structure of a protein directs its function within the cellular environment. A mutation in the protein sequence can change its structure, which in turn may render the protein non-functional or even confer adverse functions (2-6), leading to diseases such as cancer. Over the decades cancer has become one of the most prevalent diseases, with deaths projected to exceed 12 million in 2030 according to the World Health Organization. Proteins encoded by almost 1% of the human genome have been implicated in oncogenesis (7). Because experimentally resolved structures are scarce (the RCSB database holds 65847 resolved protein structures, against 525207 sequence entries in UniProtKB), one has to resort to computational techniques to obtain the 3D structures of proteins in order to properly understand their functions. The Bioinformatics Group at the Centre for Development of Advanced Computing (C-DAC), in collaboration with the cancer Biomedical Informatics Grid (caBIG®), has developed a grid-enabled, web-based automated pipeline (Figure 1) for ab initio prediction of protein structures, with an emphasis on cancer-related proteins. The pipeline has been deployed on the Bioinformatics Resources & Applications Facility (BRAF) hosted at C-DAC, Pune, India. The upstream component of the pipeline retrieves a protein sequence (according to user input) from the gridPIR service of caBIG, which provides a data resource of high-quality annotated information on all protein sequences supported by UniProtKB. The retrieved sequence, in FASTA format, is then fed to the prediction pipeline. At its core the pipeline uses the ROSETTA prediction algorithm (8) to determine the 3D structures.
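The hand-off of a FASTA-formatted sequence to the prediction step can be illustrated with a minimal Java sketch. It is not the pipeline's own code: the class `FastaSketch`, its `Record` type and the toy sequence are hypothetical names introduced here only to show how a FASTA record (a `>` header line followed by residue lines) might be read before submission.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not taken from the pipeline itself.
class FastaSketch {

    /** One FASTA record: header text (without '>') and the joined residue string. */
    static final class Record {
        final String header;
        final String sequence;

        Record(String header, String sequence) {
            this.header = header;
            this.sequence = sequence;
        }
    }

    /** Splits FASTA text into records; residue lines are concatenated per record. */
    static List<Record> parse(String text) {
        List<Record> records = new ArrayList<>();
        String header = null;
        StringBuilder seq = new StringBuilder();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith(">")) {
                // A new header closes the previous record, if there was one.
                if (header != null) {
                    records.add(new Record(header, seq.toString()));
                }
                header = line.substring(1).trim();
                seq.setLength(0);
            } else if (header != null) {
                seq.append(line);
            }
        }
        if (header != null) {
            records.add(new Record(header, seq.toString()));
        }
        return records;
    }

    public static void main(String[] args) {
        // Toy input, not a real protein entry.
        String fasta = ">toy1 illustrative record\nMKTAYI\nAKQR\n";
        for (Record r : parse(fasta)) {
            System.out.println(r.header + " length=" + r.sequence.length());
        }
    }
}
```

Reading the whole record into memory is adequate for single protein sequences; a streaming reader would be the more natural choice for multi-megabyte sequence files.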
The graphical user interface of the pipeline lets the user choose various control parameters, such as which secondary structure prediction algorithms to use, the number of iterations and the number of output structures, and upload NMR constraint files. Once submitted, the jobs are distributed as multiple threads over multiple processors of the Biogene supercomputing system at BRAF, which substantially reduces the prediction time. The output consists of predicted structures in PDB format and parsed energy log files, which the user can download. All file transfers over the network are secured by SFTP. Jmol has been integrated into the pipeline for visual inspection of the predicted models. Test cases have been run through the pipeline with a few cancer-related proteins, and their results will be discussed. The pipeline thus provides a hassle-free, high-throughput structure prediction platform. The entire pipeline is coded in Java, using the Struts, AJAX and Hibernate frameworks. The upstream gridPIR search module parses XML results with a SAX parser, while the GUI is built using JSP.
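The SAX-based parsing of XML search results mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the gridPIR result schema is not given here, so the `<sequence>` element name, the sample document and the class `SaxResultSketch` are hypothetical. The parser calls themselves (`SAXParserFactory`, `DefaultHandler`) are the standard JAXP API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; element names are assumptions, not the gridPIR schema.
class SaxResultSketch {

    /** Collects the text of every <sequence> element, with whitespace removed. */
    static List<String> extractSequences(String xml) {
        final List<String> sequences = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inSequence = false;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("sequence".equals(qName)) {
                    inSequence = true;
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver element text in several chunks, so accumulate.
                if (inSequence) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("sequence".equals(qName)) {
                    sequences.add(text.toString().replaceAll("\\s+", ""));
                    inSequence = false;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML result", e);
        }
        return sequences;
    }

    public static void main(String[] args) {
        // Toy document standing in for a search result.
        String xml = "<results><entry><sequence>MKTAYI AKQR</sequence></entry></results>";
        System.out.println(extractSequences(xml));
    }
}
```

An event-driven SAX handler avoids building the whole document tree in memory, which is a common reason to prefer it over DOM when result sets may be large.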
1Centre for Development of Advanced Computing, Pune – 411007, India