Issue No. 5 (p. 453-572), April 2008, ISSN 0739-110
Open Access ISFOLD: Structure Prediction of Base Pairs in Non-Helical RNA Motifs from Isostericity Signatures in Their Sequence Alignments (p. 467-472)
The existence and identity of non-Watson-Crick base pairs (bps) within RNA bulges, internal loops, and hairpin loops cannot reliably be predicted by existing algorithms. We have developed the Isfold (Isosteric Folding) program as a tool to examine patterns of nucleotide substitutions from sequence alignments or mutation experiments and identify plausible bp interactions. We infer these interactions based on the observation that each non-Watson-Crick bp has a signature pattern of isosteric substitutions where mutations can be made that preserve the 3D structure. Isfold produces a dynamic representation of predicted bps within defined motifs in order of their probabilities. The software was developed under Windows XP, and is capable of running on PC and Mac with Matlab 7.1 (SP3) or higher. A PC stand-alone version that does not require Matlab also is available. This software and a user manual are freely available at www.ucsf.edu/frankel/isfold.
Key words: RNA 3D structure; Structure prediction; Non-Watson-Crick base pair; Base pair isostericity; and Non-helical motif. Ali Mokdad* Department of Biochemistry and Biophysics
Introduction
Many computational efforts to determine the structures of RNA molecules from their sequences have been aimed at determining their secondary structures, i.e., the set of cis Watson-Crick base pairs (WC bps) that make up the stacked helical skeleton and establish the overall architecture of the molecules. However, depending on context, as many as 20-50% of edge-to-edge bps in non-coding structured RNAs are of non-WC types, and these fall into about a dozen bp types (1, 2, 3). These non-WC bps are found within internal loops, hairpin loops, and other non-helical elements such as pseudoknots. They can form long range inter- and intra-molecular interactions, and they often are dynamic in nature, allowing for transient formation and breaking of contacts that permit the molecule to change its shape as needed (4, 5, 6, 7). Such bps typically are found in function-determining parts of RNA molecules, so modeling their exact structures or range of possible structures is of considerable value. Computational and experimental approaches exist to determine secondary structures with good accuracy but provide much less information about the structures of non-helical regions. Dynamic programming algorithms such as Mfold (8) and RNAstructure (9) are relatively accurate in predicting RNA secondary structure in helical areas based on stacking free energies between consecutive WC bps. Furthermore, they indicate the approximate locations of non-WC regions but without details of their structures. Other programs such as RNAfold (10), Pfold (11), and Sfold (12) use related approaches but still are limited in their ability to detect non-WC bps (13). Another major class of computational approaches is based on comparative sequence analysis (CSA) (14, 15, 16, 17), where compensatory mutations (covariations) observed in sequence alignments are used to predict secondary structure.
As with the free energy approaches, CSA methods also do not generally detect the positions and types of all possible bp interactions, in part because they are not represented by simple (two base WC) covariations but rather by more complex types of sequence signatures. For example, C:G, G:C, and G:G all can be equivalent substitutions of the cHH (cis Hoogsteen/Hoogsteen) bp family, and A:A, A:C, A:G, and A:U of the tSS (trans Sugar edge/Sugar edge) bp family (Figure 1). These signatures are not detected by classical CSA because the compensating mutations do not necessarily affect both positions that form the bp. Other methods to predict structure are based on primary sequence alone (18) or utilize graph grammar to detect structure from primary sequence alone or from sequence alignments (19, 20, 21), but these also have shown limited success in detecting non-WC bps. To date, no reliable method has been dedicated to determining non-WC bps (22), in part because the rules that establish such complex relationships have not yet been systematically defined. Here, we describe an approach that focuses solely on predicting plausible non-WC bps based on their degree of structural similarity or isomorphism (3, 23).
Figure 1: The 12 bp families and their isosteric subfamilies (23), demarcated by colors that show the unique patterns of acceptable and unacceptable substitutions for each interaction. Within one bp family, boxes with the same color represent isosteric combinations, boxes with "similar" colors (those grouped within single ovals at the bottom) near-isosteric, boxes with different and non-similar colors heterosteric, and gray boxes implausible (structurally incompatible). Three-letter names of the bp families are: c, cis; t, trans; W, Watson-Crick edge; H, Hoogsteen edge; S, sugar edge; according to previous nomenclature (2).
Consequently, cWW designates a cis Watson-Crick/Watson-Crick interaction, tHS designates a trans Hoogsteen/Sugar edge interaction, and so on.
Materials and Methods
A few years have passed since all possible bp types or families were categorized according to their structural similarities (physical dimensions and bond orientations), resulting in isostericity matrices (IMs) (23). According to this classification, each bp family is organized into isosteric subfamilies that yield distinctive patterns of acceptable sequence variations (Figure 1). These patterns in principle can identify the bp type amongst aligned sequences, but this is complicated because IMs within bp families often are not equally populated (24). IMs have been used to improve sequence alignments based on known 3D structures (25, 26), but few efforts have been made to predict bp types beginning with aligned sequences or from mutational data. One such effort resulted in the manual prediction of a loop E motif in potato spindle tuber viroid based on patterns of viable and lethal mutations (27, 28); here, we describe a more automated procedure.
We have created the Isfold program, which uses isostericity patterns observed in sequence alignments or in experimental mutational data to suggest plausible bp configurations, particularly within RNA internal loops and hairpin loops. Isfold compares substitution patterns, which may or may not be considered standard two-base covariations, at every pair of nucleotide positions that may potentially form a bp, to the known isostericity patterns of all bp families. An "isostericity compliance score" for each potential bp is calculated, based on the adherence of its sequence variation to isostericity rules. The formula for calculating this score takes into consideration the number of isosteric, near-isosteric, heterosteric, and forbidden substitutions that are observed when comparing the sequence alignment or experimental mutation patterns to the IM of each bp type.
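As a rough illustration of how a score of this kind could be computed, the following Python sketch classifies each observed substitution at two alignment columns against a toy IM and averages a weight per class. It is not the Isfold formula, which is documented in the user manual: the weights, the choice of the first sequence as reference, and the names `TOY_IMS`, `TOY_NEAR`, `classify`, `compliance_score`, and `rank_families` are all assumptions made for this example. The toy IMs contain only the cHH and tSS substitution sets quoted in the Introduction, and any pair absent from them is treated as forbidden only because the toy data are incomplete.

```python
# Hypothetical weights for the four substitution classes; NOT Isfold's formula.
WEIGHTS = {"isosteric": 1.0, "near": 0.5, "heterosteric": -0.5, "forbidden": -1.0}

# Toy isostericity matrices mapping base pair -> subfamily label. Only the two
# example sets quoted in the text are included; the real IMs (23) are larger.
TOY_IMS = {
    "cHH": {"CG": 1, "GC": 1, "GG": 1},
    "tSS": {"AA": 1, "AC": 1, "AG": 1, "AU": 1},
}
# Pairs of subfamily labels treated as near-isosteric (empty in this toy data).
TOY_NEAR = {"cHH": set(), "tSS": set()}


def classify(family, ref_pair, pair):
    """Classify the substitution ref_pair -> pair under one bp family."""
    im = TOY_IMS[family]
    if ref_pair not in im or pair not in im:
        return "forbidden"  # assumption: unlisted pairs count as forbidden
    a, b = im[ref_pair], im[pair]
    if a == b:
        return "isosteric"
    if frozenset((a, b)) in TOY_NEAR[family]:
        return "near"
    return "heterosteric"


def compliance_score(family, col_i, col_j):
    """Mean class weight over substitutions relative to the first sequence."""
    pairs = [a + b for a, b in zip(col_i, col_j)]
    ref = pairs[0]
    subs = [p for p in pairs[1:] if p != ref]
    if not subs:
        return 0.0  # no sequence variation, so no evidence either way
    return sum(WEIGHTS[classify(family, ref, p)] for p in subs) / len(subs)


def rank_families(col_i, col_j):
    """Score every toy bp family for one column pair, best score first."""
    scores = [(fam, compliance_score(fam, col_i, col_j)) for fam in TOY_IMS]
    return sorted(scores, key=lambda fs: fs[1], reverse=True)


if __name__ == "__main__":
    # Columns i and j of a four-sequence alignment: pairs A:A, A:C, A:G, A:U.
    print(rank_families("AAAA", "ACGU"))
```

With these toy data, the column pair A:A, A:C, A:G, A:U scores 1.0 under tSS (every substitution is isosteric) and -1.0 under cHH, even though only one of the two positions varies, which is the kind of signature a two-sided covariation test would miss.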
The more favorable (isosteric and near-isosteric) substitutions produce higher scores, and the more unfavorable (heterosteric and forbidden) substitutions lower scores. Users also may modify scoring functions as detailed in the online user manual. After all possible bps are scored against all possible IMs, scores are sorted and bp predictions are displayed in text and graphical form (Figure 2). The user can systematically examine possible base pairing within a user-defined motif, in order of these scores. Because some bp families share similarities in their isostericity patterns, and because not all allowed substitutions may be fully populated, several bp types may be satisfied equally by the sequence alignment or mutational data. In such cases, Isfold sorts the equally satisfied bp families based on their observed rate of occurrence in the particular structural context (Table I).
Figure 2: Screenshot of Isfold Results Screen, which is a dynamic Graphical User Interface (GUI) depicting the predicted structures. The user can browse through plausible bp schemes, one Results Screen at a time, in order of their likelihood based on the calculated scores. The bp symbols used are the Leontis/Westhof representation (2): circle represents Watson-Crick edge, square represents Hoogsteen edge, and triangle represents sugar edge. Solid or filled symbols of any color indicate cis orientation, and open symbols indicate trans orientation. For example, the symbol "open left triangle-open square" indicates a tSH or trans Sugar edge/Hoogsteen interaction and "solid circle-solid right triangle" indicates a cWS or cis Watson-Crick/Sugar edge interaction. For simplicity, when both nucleotides in the bp use the same edge, only one such edge symbol is drawn. Thus, "solid square" indicates a cHH or cis Hoogsteen/Hoogsteen interaction. Values and colors of the arrows represent the isostericity compliance scores.
The bottom section displays warnings concerning the quality of predictions, such as inadequate data in sequence alignment or prediction of incompatible interactions. Besides this dynamic graphical output, results also are provided in text format.
Results and Discussion
Isfold provides a text output and a graphical interface to display bp possibilities based on isostericity rules. Figure 3 outlines the basic procedure with an example. Isfold was applied to 5S rRNA beginning with a high quality alignment (29) that was refined manually based on its known 3D structure. The quality of the input sequence alignment is the single most critical element to the success of the method. In 65% of the bps examined, the first prediction by Isfold was the correct one as observed in the crystal structure (30), in 15% the second prediction was correct, and in 15% the third prediction was correct. In most cases where predictions were erroneous, there was insufficient sequence variation to distinguish between possible bps.
Figure 3: Flowchart outlining the procedure followed by Isfold, together with an interpretation of the results by Ribostral (25). The alignment (step a) is a starting point for other programs to predict secondary structure (step b) and for Isfold to predict non-WC base pairs (step c). For simplicity, panel (c) shows only four of the possible base pair types between nucleotides 8 and 28. The first two predictions shown have equal and perfect scores, because all substitutions are isosteric according to the cHS (cis Hoogsteen/Sugar edge) and tWS (trans Watson-Crick/Sugar edge) bp types, as seen in Ribostral screenshots (step d). Panel (e) explains how mutation experiments can be designed to differentiate between such similarly scoring base pairing possibilities.
This method has several limitations. As mentioned, the predictions inherently rely on the data used to generate sequence alignments, for which many inaccuracies often exist.
While all sequence comparative methods rely on good alignments, the search for non-WC bps is especially sensitive because there typically are more limited phylogenetic data and thus the sequence signatures can be quite subtle. Mutational data can substantially improve the signal-to-noise of sequence variation by providing more diversity or by constraining the space of possible interactions. Nevertheless, even with limited sequence variation, Isfold can help differentiate between plausible and highly improbable bp configurations. The ones that are most reasonable can be used to guide mutation experiments to further restrict the possibilities, and may be combined with other programs such as Ribostral (25) to interpret sequence variations by superposing them on IMs (Figure 3). Using both programs together can help iteratively improve structure prediction and refine the alignment.
The analysis is further complicated in that not every internal or hairpin loop adopts a unique structure, as some motifs are dynamic in nature and can form different structures, for example in response to ligand or protein binding or substrate positioning (4, 5, 6, 7). In such interesting cases, molecular dynamics (MD) simulations may help establish alternative conformers (6) and can provide a useful and feasible additional step to discriminate between structures proposed by isostericity rules alone (31, 32). Isfold is limited further by the established patterns of isostericity (23) and thus cannot predict intermediate or novel types of bps not represented in the IMs (33). The discovery of such interactions by experimental, modeling (3, 23), or quantum mechanical (34, 35, 36) methods can be used later to revise Isfold and the bp classification schemes.
Conclusions and Future Improvements
Isfold represents a new computational approach to help predict bps in internal loops, hairpin loops, and other RNA motifs that play important roles in RNA folding and function.
When the target motif is well localized within an alignment, a task typically well-achieved by available secondary structure prediction algorithms, Isfold can provide plausible WC and non-WC arrangements within the motif. The results of mutation experiments can be used to further narrow the possibilities that are most consistent with isomorphic combinations. Currently, Isfold assesses bps independently of each other. However, RNA motifs often have specific architectures involving bp stacking and other discrete arrangements. As databases of such motifs become more complete, it will be possible to incorporate these contextual aspects into Isfold scoring schemes. In the interim, Isfold should provide a useful tool to help evaluate plausible base pairings that conform to basic rules of structural isomorphism.
Acknowledgements
The authors thank Matt Daugherty and Jason Fernandes for valuable discussions and contributions. This work was supported by NIH grant GM47478.
References and Footnotes