Issue August 2010. No. 1 (1-132), August 2010. ISSN 0739-110
Open Access
Single-base Resolution Nucleosome Mapping on DNA Sequences (107-122)

The nucleosome DNA bendability pattern extracted from a large nucleosome DNA database of C. elegans is used to construct a full-length (116 dinucleotide positions) nucleosome DNA bendability matrix. The matrix can be used for sequence-directed mapping of nucleosomes on DNA sequences. Several alternative positions for a given nucleosome are typically predicted, separated by multiples of the nucleosome DNA period. The corresponding computer program was successfully tested on the best-known experimental examples of accurately positioned nucleosomes. The uncertainty of the computational mapping is ±1 base. The procedure is available on a publicly accessible server and can be applied to any DNA sequence of interest.
Key words: Chromatin code; DNA bendability; Nucleosome DNA; Nucleosome positioning; Nucleosome structure.

I. Gabdank1,*
1Department of Computer Science, Ben Gurion University of the Negev, P.O.B 653 Be’er Sheva 84105, Israel

Introduction
The nucleosome is a natural device for packing DNA in eukaryotic cells. Essentially, it is a macromolecular complex of eight histones and 147 base pairs of DNA, with distinct two-fold symmetry (1-3). Nucleosome positioning along genomic sequences is believed to be guided primarily by the deformational properties of DNA (4, 5). Although the literature on the deformability of individual dinucleotide stacks is quite sizeable (e.g. (6-8) and references therein), nucleosome mapping directly on the basis of the deformability data (see, e.g. (6, 9)) has not yet come into common use. Mapping tools based exclusively on the sequences are more popular (10-13). These are advanced versions of the very first sequence-based programs (14-16). Such mapping is technically very simple. It is based on the 10-11 base periodicity of some dinucleotide elements along the nucleosome DNA sequence (17). However, there is no clear consensus on the phase relationships between the various elements within the period. The matter is rather complicated, as different species seem to have different dinucleotide usage in the nucleosome DNA sequences (18, 19).

In our previous studies (20, 21) we derived the sequence structure of the basic 10.4 base repeat of the nucleosome DNA of C. elegans. The derivation of the matrix of nucleosome DNA bendability became possible due to the availability of the nucleosome core database of over 160,000 DNA sequences (22). The phases (positions within the period) were determined for eight dinucleotides (AA, TT, GA, TC, GG, CC, AT and CG), the strongest contributors to the pattern characteristic of all six chromosomes of C. elegans. The eight other dinucleotides (AC, AG, CA, CT, GC, GT, TA and TG) did not show as strong a positional affinity (21) and, thus, were not included in the basic repeat. It could well be that the excluded dinucleotides show appreciable positional affinity in some other species (23). The final conclusion on the general applicability of the C. elegans pattern can only be made after the matrices of bendability are calculated from large nucleosome DNA databases for other species. There are, however, several reasons to consider the matrix applicable to other eukaryotic species as well (see below). Thus, in this work we used the C. elegans database to calculate the complete matrix of nucleosome DNA bendability, of 116 × 8 elements (116-matrix). Earlier bendability matrices were calculated on the basis of only about 120, 200 and 1300 nucleosome DNA sequences (16, 10, 24), or directly from genomic sequences (15, 25). The experimental ensemble from C. elegans, larger by more than two orders of magnitude, allows one to clarify all necessary details of the nucleosome DNA pattern and gives more confidence in the results of the calculations. The purpose of this study is to adapt the recently acquired basic knowledge of the nucleosome positioning pattern (21, 26) to the needs of actual mapping of nucleosomes along sequences.

Results

Size of DNA involved in the interaction with histone octamers

Alignment of about 160,000 147-base sequences by MNase cut sites has revealed that the highest periodicity of, primarily, AA and TT dinucleotides is observed at the ends of the 147 base long sequences (22). That is, as this sequence bias indicates, the DNA deformability is highest at the ends of the 147 base-pair long nucleosome core DNA. It is known, however, that in crystals of the nucleosome core particles the ends, about 10 base pairs each, are not involved in contact with the histone octamers (3) and, thus, have no reason to show any sequence bias. On the other hand, in the nucleosomes there are only 12 points of contact between phosphates of the minor grooves of DNA and positively charged arginine residues of the histone octamers (27, 28). In other words, only about 116 base pairs of DNA (11 structural 10.4 base periods of the nucleosome DNA) are involved, from the 1st to the 12th contact.
The sequence bias, which leads to the anisotropic DNA bendability, should thus span, essentially, only these 116 base pairs, rather than the 147 base pairs of the core DNA. The actual sequence bias is likely to extend to ~125 base pairs, as the periodicity of phosphate-to-phosphate distances would suggest (29, 30). Where exactly within the 147 base pairs the contacting DNA segment is positioned, and why the periodicity starts at the very ends, remains to be found (see below).

Positional distribution of the matches to bendability matrix along the 147 base sequences

In our previous work the basic sequence repeat unit of the nucleosome DNA was derived (21) in the form of the nucleosome DNA bendability matrix (10 × 8 matrix, or simply 10-matrix), which describes at which positions within the ~10 base repeat various dinucleotides are preferentially located to, presumably, ensure better DNA bendability in the nucleosome. Its simplest one-line form is …CGGAAATTTCCG…, with CG dinucleotides at the ends and the AT dinucleotide in the middle. The value of the structural period of the nucleosome DNA is still a matter of debate. We have chosen the period of 10.4 bases, which is calculated (29) from the coordinates of phosphates in crystallized nucleosomes. This period is also confirmed by genome-wide analysis of sequence periodicities (29, 23).

In order to reconstruct the whole picture rather than one period only, i.e. to find out how the sequences matching the bendability 10-matrix are distributed along the nucleosome DNA, we started with the overall positional occurrences of the matches in the sequences of the nucleosome core DNA database (22). The distribution of the matches with the threshold 0.43 is shown in Figure 1. As in previous observations with dinucleotides and higher oligonucleotides (22, 31), the histogram displays clear periodicity starting at the aligned left ends of the 146 base long core DNA sequences (positions 61 to 206).
The highest peak is observed at position 63, followed by five smaller peaks. Their distances from the major peak fit fairly well to the 10.4xn series despite significant background noise in the histogram. Notably, there is no such high peak of the match counts at the right end of the 116 base-pair long DNA segment (position 176) which is in contact with the histone octamers. No high match is seen at the right end of the 146 base nucleosome core DNA either. Several small peaks of unknown origin at 143, 166, 174, 185 and 195 are not in phase with the group of maxima on the left.

If the 147 base-pair fragments in the database were of exactly the same length, the periodicity could be observed at both ends of the aligned fragments. Since in the database only the left ends are synchronized, the corresponding plots demonstrate the periodicities (22, 31) at the left ends. The right ends are smeared by ±8 base pairs (22). When the same sequences are synchronized by the right-end positions matching the 10-matrix of bendability, the strong periodicity becomes obvious at the right ends as well (Figure 2).

Periodicity all along nucleosome DNA

As Figure 1 may suggest, the periodical match to the matrix of bendability is confined to the left ends of the 147 base long nucleosome DNA sequences of the C. elegans database. Indeed, it spans only about 6 periods starting from the left end (position 61 of the 266 base long database sequences). Comparable periodicity at the right end is not seen. The simplest, though questionable, explanation would be that all the nucleosomes are asymmetric and synchronized in the database by the periodical ends. There is, however, a more realistic interpretation of the apparent asymmetry. The DNA may be stretched or compressed by 1-2 base pairs here and there along the nucleosome DNA (32). The positions where the match to the matrix of bendability is observed would, respectively, shift so that expected strict periodicity
Figure 1: Distribution of similarity to the 10-matrix in nucleosome DNA database sequences, aligned as in (22). The distribution of the nucleosome DNA bendability matrix matches above the threshold 0.43 along over 160,000 sequences of the database. The resulting histogram was smoothed by a running window of 3 bases. The span of the nucleosome core DNA is indicated (positions 61 to 206, as in (22)).

As Figure 3A demonstrates, the periodicity apparently spans the whole length of the 116 base nucleosome DNA, although the periodical pattern is compromised by significant noise. Among the noise components of the histogram the most conspicuous are the peaks at positions 35 and 70. These peaks correspond to the (35)n repeating sequences of C. elegans (33). After removing from the dataset all those nucleosomes that contain the (35)n and other related satellites, and filtering the histogram by smoothing it with a running window of 3 bases (see Methods), the whole-length periodicity of the 116 base nucleosome DNA becomes obvious (Figure 3B). Despite the remaining noise, all the peaks observed are within 1-2 bases of the 10.4xn positions: 10 (10.4), 21 (20.8), 32 (31.2), 42 (41.6), 53 (52), 64 (62.4), 74 (72.8), 85 (83.2) and 95 (93.6). Interestingly, there is no sign of a significant difference in the middle section of the distribution that would be expected from previous observations (34, 10, 25, 19). Indeed, the nucleosome DNA sequences of C. elegans show only a small decrease of the periodicity in the middle sections, as demonstrated by the distribution of various tetranucleotides (31).

This result is, of course, statistical; that is, not every individual nucleosome DNA sequence displays the periodicity. Rather, every sequence has points of moderate match to the 10-matrix here and there along the sequence, separated by distances close to multiples of 10.4 bases.
Construction of the 116 position dinucleotide matrix for the nucleosome mapping

Anisotropic bendability of DNA (4) is a broadly accepted physical model of nucleosome positioning. Every segment of the nucleosome DNA is smoothly bent around the histones, and the physics of the deformation is the same all along the molecule. If in a given bent 10 base-pair segment a certain dinucleotide stack is oriented in the optimal way to favor the bending, then in any other such segment the optimal orientation of the stack would be the same. The distance between such stacks, equally oriented relative to the surface of the histone octamer, has to be equal or close to an integer number of nucleosome DNA periods, 10.4xn base pairs. In accordance with this simple scheme, the respective dinucleotides should be separated by the same distances along the sequence. Thus, positional autocorrelation analysis of the nucleosome DNA sequences would show the preferred periodical distances, which, indeed, is the case (17, 25, 31). Two different dinucleotide stacks within the same segment of DNA would have different optimal orientations and, hence, different positions within the bent segment. The preferred distances between different dinucleotides in the one-period-long bent segment are described by the matrix of nucleosome DNA bendability (4, 15, 21).
Figure 2: Distribution of similarity to the 10-matrix in nucleosome core DNA sequences aligned by matches to the 10-matrix located near the right ends (database sequence positions 186-226). The calculations with the aligned sequences were similar to those in Figure 1. Match threshold 0.43. The nucleosome core DNA span is indicated. The raw histogram was smoothed by a running window of 3 bases.
Figure 3: Oscillations of the match to the 10-matrix along DNA in contact with the octamer, with both ends of the DNA matching the matrix. (A) The distribution of the matches in 10,125 sequences containing matches above the threshold 0.43. No filtration of repeating sequences was applied. The histogram is smoothed by a running window of three bases. (B) The same after removal of the entries containing repeating sequences (see Methods). 7,034 entries survived the filtering.

DNA makes 12 contacts of its inward-oriented minor grooves with arginines
of the histones (27, 28). That is, the nucleosome DNA engaged in the contacts
consists of 11 period-long segments, and the full length nucleosome DNA
bendability matrix should, therefore, consist of sequentially connected eleven
10.4 base long matrices of bendability. The elementary matrix of bendability
(for 16 dinucleotides in 10 positions) has been calculated as the matrix with nearest integer number of columns (21). Simple repetition of the 10-matrix 11 times would result in severe phase-shifts at the ends of the 11-period long matrix. The 10-matrix, thus, has to be “extended” to the size 10.4. In addition, according to crystal data, at the dyad axis of the nucleosome (position 0) a base is positioned, rather than an interbase (35). That means that the central element of the 10-matrix, the dinucleotide AT (21) can not be centered at the axis, but rather at symmetrical positions 0.5 and –0.5, i. e. A0T1 and A-1T0. In other words, the central AT element of the complete matrix should be equally often found in both these positions
(–1,0 and 0,1) though, of course, in the actual sequence only one of these two can be realized. This situation is repeated at two other bases of the nucleosome
DNA located exactly at the local dyad, at position 26 = 2.5 × 10.4 and at 52 =
5 × 10.4. Position 26 is half-period off the positions for AT, and should be, therefore, occupied by CG. Similar to A0T1 and A-1T0 above, the CG dinucleotide has to be placed at two locations of the complete matrix: C25G26 and C26G27. Finally, the AT should be placed at 51,52 and 52,53. The left half of the nucleosome
DNA matrix, with negative numbering, should be filled accordingly. The resulting complete matrix should contain, therefore, five extra columns, to accommodate the double steps as described above. Adding also the 0-th column (identical to column 116), for symmetry, we end with 110 + 5 + 1 = 116 dinucleotide positions. The distance from the base 0 at the major axis to the base 52 at the local dyad five periods away is exactly 5 times the 10.4 base repeat. That is, due to the duplicate columns at base-centered positions the “extension” to eleven 10.4 base periods is fully accomplished. The remaining positions, between the duplicate ones, are filled by respective elements of the 10-matrix. The full length 116 position matrix (116-matrix) is shown in the Figure 4. It is known that the nucleosome DNA may locally stretch/shrink allowing for 1-2 base shift (32). This property is especially important when local sequence-dependent twist of DNA is somewhat different. Also, if some dinucleotides happened to be off the ideal position within the periodical pattern, they may well still partake in the positioning by the fine tuning of the local twist. To take this possibility into account the matrix in the Figure 4 is shifted by one base left and right, and three matrices (shifts –1, 0, +1) are combined and symmetrized (see Methods). The resulting matrix, which may be considered as final at this stage, is shown in the Figure 5. By matching sequence of interest to this matrix (summing the scores at all 116 matrix positions) one can evaluate how good a given sequence is for the DNA bending in the nucleosome. The “strength” of the nucleosome can be measured from 0 to 1 as ratio of actual score to the maximal possible score for an ideal nucleosome sequence, that fits fully to the 116-matrix. The eight dinucleotides that have been initially excluded from the derived DNA bendability pattern (21), of course, do appear in the sequences. 
However, there are no elements in the full length matrix that correspond to these dinucleotides, and their possible (weak) contributions are not scored.

Choice of the test cases

The mapping procedure calculates how well a given sequence of length 117 base pairs (116 dinucleotides) matches the standard bendability pattern. In longer sequences the procedure would select those fragments (117 bases long) that match best. Since the position of a nucleosome in a long sequence may be influenced by other nucleosomes on that sequence (36), for testing purposes only comparatively short sequences should be used, which accommodate one nucleosome only, directed by the underlying sequence only. In order to evaluate predictions of the mapping by the full length bendability matrix, one has to use, ideally, highly accurate (±1 base) experimental mapping data. The use of less accurate mapping is meaningless, as the bendability-based algorithm has potentially single-base resolution. Indeed, a 1 base error in the position of the nucleosome DNA corresponds to a ~34° change in the rotational setting of the DNA. Such a change would significantly influence interactions of external agents with exposed segments of DNA. For example, DNaseI cuts in the nucleosome DNA are usually single bond cuts (37, 38), very sensitive to the orientation of the bonds (39).

Of the experimental techniques the most accurate, of course, is atomic resolution x-ray diffraction from crystallized nucleosomes with unique DNA sequences. In the resolved structures the middle bases, at the dyad axis of the nucleosomes, are unequivocally determined. There are 5 such structures known: 1AOI (3), 1EQZ (28), 1KX4 and 1KX5 (35), and 2NZD (32). They involve artificially designed complementarily symmetrical sequences made of halves of the primate alpha-satellite repeats. Four of them are slightly different versions of the same sequence that involves the left half of the satellite: 1AOI, 1EQZ, 1KX5 and 2NZD.
One is assembled on the symmetrized right half of the satellite – 1KX4.
Figure 4: Initial full length matrix of bendability (116-matrix). The 116 dinucleotide positions matrix was constructed from 10-base matrices (21) accommodated to the 10.4 base periodicity, as described in the text and in the Methods. The amplitudes shown correspond to affinities of the respective dinucleotides to various positions within the 10-base pattern (actual scores divided by their average in 10 positions), as in the original work (ibid). Only the eight major contributing dinucleotides of the complete 16 × 116 matrix are shown.

Of the less demanding traditional experimental mapping techniques, perhaps the most accurate (±1 base) are sequence gel resolution DNase digestion of the DNA within the nucleosomes (38), and hydroxyl radical scission of the nucleosome DNA at points of its crosslinking with the histones (40). Yet, even in these experiments some instrumental or human errors may happen, so that for confidence it is desirable to verify the results of the mapping by an independent method. Such double-checked mapping data may be taken as test cases, in addition to the crystal data. To this category belongs the nucleosome reconstituted on the 145 base fragment of prokaryotic origin (38). It is mapped by DNaseI and by DNaseII digestions. The DNaseI cut near the nucleosome DNA dyad is located between positions 92 and 93 of the sequence. Since the stagger between the DNaseI cuts in the Watson and Crick strands is 4 bases, with the 3'-end protruding (41), the nucleosome dyad according to these experiments would be located between positions 90 and 91. Since the nucleosome DNA is base-centered, either of these two positions can be taken as central.

Another reconstituted nucleosome involves a 5S rRNA gene fragment. It is mapped by DNaseI digestion (42) and partially confirmed by the hydroxyl radical technique (40). The DNaseI data suggest that the nucleosome DNA axis passes through the dinucleotide GA (positions –1, +1 according to the numbering of the 5S rDNA sequence in (42, 40)).
Nearby alternative positions for the local dyads are suggested by the data as well: CA (positions –11, –10) and CC (positions 11, 12). From the Flaus et al. data, on the other hand, it follows that the local axes pass through C at position –11, AA at –2, –3 and C at position 8. The base C at –11 would be the axis position consistent with both experiments.
Figure 5: Final 116 position nucleosome probe matrix. To incorporate possible stretching/compression of DNA in the nucleosome, the initial matrix shown in Figure 4 was shifted by one base pair to the left and to the right, and the resulting matrices were combined and symmetrized (see Methods).

There are two types of errors in the nucleosome positioning. One is due to
displacements of the nucleosome DNA by the integers of its period, 10.4 bases. Such displacement keeps the direction of DNA bending in the nucleosome,
rotational positioning (43). The alternative positions are frequently observed in reconstitution experiments (40, 44) as they are physically legitimate alternatives with only small difference in the stability of the complexes coexisting in equilibrium (44). Every sensitive sequence-dependent nucleosome positioning should predict such alternative positions. Second type of the positioning error is inaccuracy in determination of the central base pair in the exposed minor groove at the dyad axis of the nucleosome. The central base pair defines which of the remaining positions in the molecule are exposed as well - those that are 10.4xn bases away from the center - crucial information when interaction of the nucleosomes with various transcription factors and other externally approaching molecules is considered. The accurate (±1 base) nucleosome positioning tools are, thus, in demand.
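The relation between the central base pair and the other positions sharing its rotational setting can be illustrated with a short calculation. The following is a minimal sketch, not the published mapping program: the function name, the rounding to the nearest base, and the default `half_span` of 58 bases (roughly half of the 116 base-pair contact region) are assumptions made for illustration only.

```python
PERIOD = 10.4  # structural period of nucleosome DNA in bases, as used in the text


def in_phase_positions(center, half_span=58, period=PERIOD):
    """Return positions that lie n * period bases away from `center`.

    Offsets are rounded to the nearest base. `half_span` is an assumed
    limit on how far from the center the list extends.
    """
    n_max = int(half_span // period)  # number of whole periods on each side
    return [center + round(n * period) for n in range(-n_max, n_max + 1)]


# Rotation of the DNA per base step implied by the period (about 34.6 degrees).
DEGREES_PER_BASE = 360.0 / PERIOD

# Example: positions in phase with a nucleosome center at base 100.
print(in_phase_positions(100))
```

The same list serves both purposes mentioned in the text: it gives the candidate alternative nucleosome centers (displacements by whole periods) and the positions expected to share the rotational setting of the central base pair.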
Testing the mapping by the 116 position bendability matrix

Figures 6 and 7 show nucleosome positioning results obtained with the 116 position bendability matrix described above, as applied to the seven most accurate test cases listed in the previous section. As the plots demonstrate, the sequence-directed predictions of the nucleosome positions are an exact fit in four cases, and in three cases the misfit is only one base. The calculations, thus, predict the nucleosome centers with ±1 base accuracy. As the central element of the 116-matrix is the dinucleotide AT, the fit of the matrix to the accurate experimental data means that the AT dinucleotides are preferentially located at the minor grooves oriented outwards, while CG dinucleotides are positioned at the minor grooves contacting the octamers (26).
Figure 6: Comparison of calculated nucleosome positions with atomic resolution experimental data. Predicted nucleosome center positions (red dots) compared with nucleosome dyad bases (in red, see the sequences on the top of the plots) determined by x-ray diffraction from the crystallized nucleosome structures (A-D). Sequence coordinates on the plots are the same as in the original works. (A) 1AOI (3), (B) 1EQZ (28), (C) 1KX4 (35), (D) 1KX5 (35).

In all plots the alternative positions appear as additional maxima, 10-11 bases away from the central one. Their amplitudes may somewhat exceed the central peak, especially in the case of crystallized nucleosomes. This can be explained by the strong preference of the nucleosomes within the crystals for symmetrical positioning of the respective DNA fragments (32, 44).

As the examples of computational mapping above demonstrate, the whole-length 116-position periodical pattern with no modulation (no increased periodicity at the ends) performs very well. We have, thus, incorporated the mapping program into the server http://www.cs.bgu.ac.il/~nucleom/ for public use (45). We have not yet succeeded in our attempts to outline the expected modulation. The respective, presumably small, changes to the mapping program will be introduced at some later stage.

Discussion

Marginal “strength” of natural nucleosomes (C. elegans)

The 116-matrix of bendability can now be used to calculate the match between each sequence of the database and the matrix, to find out how strong the natural nucleosomes are compared to those that could form on randomized sequences. The 116-matrix for nucleosome mapping is the first of its kind, and it may well need some modifications in the future. Besides, there are other, non-sequence factors that influence the nucleosome positioning. The intended comparison of natural versus random nucleosome strengths has to be considered, therefore, with the above reservation.
For the purpose of the comparison we scanned the central 186 base regions of the database sequences with the 116-matrix, taking into account the ±20 base uncertainty in the determination of the ends of the 147 base core sequences (22). The histogram of the resulting match values for the database sequences is shown in Figure 8.
Figure 7: Comparison of calculated nucleosome positions with single-base accuracy experimental data. (A) Predicted nucleosome center positions compared with nucleosome dyad bases determined by x-ray diffraction from the crystallized nucleosome structure, 2NZD (32). (B) Comparison of the computational prediction with the position experimentally mapped by DNaseI and DNaseII digestions (38). (C) Comparison of the calculated prediction with the nucleosome position mapped by DNaseI digestion (42) and confirmed by the hydroxyl radical cleavage technique (40). The source sequence coordinates are indicated on the top.

The natural sequences of moderate strength (score) 0.23-0.33 are somewhat
less frequent than corresponding matches within shuffled sequences of the
same dinucleotide composition. On the stronger side natural sequences are slightly more frequent. Only about 20% of the natural sequences are strong by the above criterion (nearly as many are strong “random” nucleosomes). Majority (~80%) of the nucleosomes are as weak as for random sequences. Similar result - majority of the nucleosomes of C. elegans have weak positional preferences - has been obtained also experimentally (31). The dominance of the weak nucleosomes means that bulk of the genomic sequences has no special bias towards similarity to the nucleosome pattern. In that respect the sequences are random. Some segments of the “random” sequences may, indeed, have some similarity to the pattern, in which case the segments would (weakly) attract the histone octamers, and no additional sequence bias would be needed. On the other hand, such bias has to be introduced when the “random” nucleosomes are not strong enough. This explains the shift of the natural distribution to the stronger side. The “random” natural nucleosomes in the above sense are, of course, not truly random. They have their affinity, no matter how small, to their unique positions, and are in accord with requirements of local chromatin structure and respective sequence functions. In other words, the sequence may well be random, while the choice of the binding sites by the histone octamers is not random at all. It is directed by the local weak similarities of the random sequence to the ideal nucleosome mapping pattern. Cooperatively, the weak and strong nucleosomes are likely to be organized in a unique sequence positioning and 3D chromatin organization.
One important practical conclusion from the observed moderate strength of natural nucleosomes is that it would not make sense to search for the hidden nucleosome positioning pattern in the 80% majority of the weak natural nucleosomes. In our calculations above and in previous work (21) only the strongest (5-20%) nucleosome DNA sequences have been considered. The calculated nucleosome positioning sequence pattern (Figure 5) describes to which DNA sequences the histone octamers would ideally bind. As Figure 8 shows, no such ideal sequences (correlation 1) have been detected in the genome of C. elegans. Correlations (“strengths”) of 0.4-0.5 are occasionally seen, while typical nucleosome strengths are of the order of 0.23-0.33.
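The comparison with randomized sequences relies on shuffling that preserves the dinucleotide composition (see Methods). The sketch below is one possible reading of that procedure, not the authors' code: for each dinucleotide in turn, the blocks delimited by its non-overlapping occurrences are permuted. Because every block starts with the chosen dinucleotide and is followed by it, all dinucleotide counts stay unchanged. The function names and the `seed` parameter are illustrative assumptions.

```python
import random
from collections import Counter

DINUCLEOTIDES = [a + b for a in "ACGT" for b in "ACGT"]


def dinucleotide_counts(seq):
    """Count all overlapping dinucleotides in `seq`."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def shuffle_round(seq, dinuc, rng):
    """Permute the blocks delimited by non-overlapping occurrences of `dinuc`."""
    starts = []
    i = seq.find(dinuc)
    while i != -1:
        starts.append(i)
        i = seq.find(dinuc, i + 2)  # step by 2 so occurrences do not overlap
    if len(starts) < 3:
        return seq  # fewer than two blocks, nothing to permute
    # Each block begins with `dinuc` and is immediately followed by `dinuc`,
    # so reordering the blocks leaves every junction dinucleotide unchanged.
    blocks = [seq[s:e] for s, e in zip(starts, starts[1:])]
    rng.shuffle(blocks)
    return seq[:starts[0]] + "".join(blocks) + seq[starts[-1]:]


def shuffle_preserving_dinucleotides(seq, seed=None):
    """Apply one shuffling round per dinucleotide, 16 rounds in total."""
    rng = random.Random(seed)
    for dinuc in DINUCLEOTIDES:
        seq = shuffle_round(seq, dinuc, rng)
    return seq
```

A quick check of the invariant is to compare `dinucleotide_counts` before and after shuffling; the two counters should be identical for any input sequence.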
Figure 8: “Strength” of nucleosomes on natural and randomized sequences. The histograms (bin width 0.001) of maximal values of match to the nucleosome mapping 116-matrix within the central 186 base regions of the nucleosome core sequences in the database (22). Black curve – natural sequences. Gray curve – the same sequences randomized (see Methods).

Universality of the DNA Bendability Pattern

The nucleosome DNA bendability matrix can be presented in simplified linear form as the repeating motif CGGAAATTTCCG with CG and AT elements at local dyad positions of the one-period long segment of the nucleosome DNA on the surface of the histone octamer. It was calculated by Gabdank et al. (21) from a very large database of nucleosome core DNA sequences from C. elegans (22). No such bendability pattern has yet been derived for other genomes, though the respective calculations are in progress (I. Ioshikhes, personal communication). It is already clear, however, that the motif above is universal, applicable to eukaryotic genomes in general. An identical motif is theoretically predicted by optimization of unstacking deformations of DNA wrapped around histone octamers (26). And the same motif is directly derived (46) from two known binary presentations of the nucleosome DNA sequence pattern, (R5Y5)n (16) and (S5W5)n (42). The universal motif (GGAAATTTCC)n, thus, appears to be the final expression of 30 years of world-wide efforts to establish the chromatin code of nucleosome positioning.

Perspectives for detailed studies of chromatin structure

The availability of single-base resolution mapping of the nucleosomes brings chromatin studies to a new level, from correlations and low-resolution nucleosome occupancies to exact distances between the nucleosomes and their arrangement in 3D space (47, 48).
Of special interest would be the 3D structure of chromatin in promoter regions, the distribution of transcription factor binding sites on the nucleosomes and their possible involvement in spatial interactions between the nucleosomes, details of the action of remodeling factors, and the exact relation of the nucleosomes to exons and their ends. The general architecture of chromatin organization (typical 3D combinations of the nucleosomes, higher order structure) is also a hot subject. High-resolution mapping of the nucleosomes may become a major instrument in these studies. Finally, high resolution details of chromatin rearrangements due to transposition, deletions, gene transfer and other molecular recombination events now become accessible.

Materials and methods

Construction of initial 116-matrix

To accommodate the 10-matrix (21) to the 10.4 base periodicity the dinucleotide AT is placed at symmetrical positions 0.5 and –0.5, i.e. A0T1 and A-1T0. Similarly, at the symmetrical local dyad base positions –52, –26, +26 and +52 the duplicate columns A-51T-52, A-52T-53, C-25G-26, C-26G-27, C25G26, C26G27, A51T52 and A52T53 are placed. Positions between the duplicate columns and beyond are filled by the respective elements of the periodically repeating 10-matrix. The resulting complete matrix involves 116 dinucleotide positions.

Construction of the final 116-matrix

From the original matrix $M_{0}$ (Figure 4) the shifted matrices $M_{-1}$ and $M_{+1}$ were calculated (shifts –1 and +1). Then these matrices were combined with the original one according to the formula $M_{\mathrm{comb}}(i) = \max[M_{-1}(i), M_{0}(i), M_{+1}(i)]$. Since the original 10-matrix is not strictly symmetrical, the resulting 116-matrix was symmetrized in the following way.
Values for complementary dinucleotides (AA/TT, GA/TC and GG/CC) were taken from the i-th position (for any given dinucleotide XY) and the (116-i)-th position (for its complementary dinucleotide) and averaged; the resulting values were put back in the i-th and (116-i)-th positions, for XY and its complementary dinucleotide, respectively. For the self-complementary dinucleotides AT and CG the symmetrized values were calculated by averaging the values at the i-th and (116-i)-th positions for the given dinucleotide.

Sequence shuffling preserving dinucleotide composition

For the first round of shuffling one of the 16 dinucleotides was chosen (e.g. AA). All non-overlapping subsequences of a given sequence that start and end with this dinucleotide were detected and shuffled. The resulting sequence was then subjected to a second round of shuffling, with another dinucleotide. The procedure was repeated for all 16 dinucleotides.

Distribution of matches to the bendability 10-matrix in database sequences

The match of any sequence of interest that is 10 dinucleotides (11 bases) long to the 10-matrix is calculated by aligning the matrix with the sequence and summing the elements of the matrix that match the respective dinucleotides of the sequence. The score is normalized to the maximal possible match and ranges from 0 to 1.0. Since the maximal match to the 10-matrix is almost never encountered, we had to resort to more moderate matches. The choice of the match threshold is dictated by the need for a sufficient signal-to-noise ratio in the oscillating pattern. The empirically found range of suitable thresholds is between 0.3 and 0.6. Each sequence was scanned by the bendability 10-matrix (21), and positions with a match >0.43 were tabulated. The resulting positional distributions of the matches (as in Figure 1) were not very sensitive to the choice of the threshold.
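The scoring and thresholding described above can be illustrated by a minimal Python sketch. It is not the authors' program: the function names, the column-wise dictionary layout of the matrix and, in particular, the three-column TOY_MATRIX with its weights are assumptions made purely for illustration (the actual 10-matrix values are given in (21) and are not reproduced here). Dinucleotides absent from a column, including any that contain N, are assumed to score zero.

```python
from typing import Dict, List, Tuple

# One dict per dinucleotide position: dinucleotide -> weight.
Matrix = List[Dict[str, float]]

# HYPOTHETICAL 3-column matrix for illustration only; not the 10-matrix of (21).
TOY_MATRIX: Matrix = [
    {"CG": 1.0, "GG": 0.5},
    {"GA": 1.0},
    {"AA": 2.0, "AT": 1.0},
]


def match_score(window: str, matrix: Matrix) -> float:
    """Sum the matrix elements matching the window, normalized to the maximal possible match."""
    if len(window) != len(matrix) + 1:
        raise ValueError("window must be one base longer than the matrix")
    best = sum(max(col.values(), default=0.0) for col in matrix)
    if best <= 0:
        raise ValueError("matrix has no positive weights")
    raw = sum(col.get(window[i:i + 2], 0.0) for i, col in enumerate(matrix))
    return raw / best


def scan(sequence: str, matrix: Matrix, threshold: float) -> List[Tuple[int, float]]:
    """Return (start, score) for every window whose score exceeds the threshold."""
    width = len(matrix) + 1
    hits = []
    for start in range(len(sequence) - width + 1):
        score = match_score(sequence[start:start + width], matrix)
        if score > threshold:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    print(match_score("GGAT", TOY_MATRIX))
    print(scan("CGAATTTT", TOY_MATRIX, 0.43))
```

With a full-size matrix the same `scan` call, with threshold 0.43, would tabulate the matching positions used for the positional distributions; only the matrix contents change.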
Right-end synchronization of the database sequences

Every sequence in the database was searched for a match with the initial bendability matrix (10-matrix) within ±20 bases (threshold 0.43) at the right end of the 147-base-long nucleosome core sequences. The 70210 selected sequences were aligned with respect to the matching position. The positional distribution of other matches in the aligned sequences was calculated (Figure 2).

Calculation of nucleosome center positions in test sequences

The final 116-matrix of dinucleotides was aligned to every position in the test sequences, and matching scores normalized to the maximal possible value were calculated. The maxima in the plots (nucleosome maps) were compared to the nucleosome center positions determined experimentally (3, 28, 32, 35, 38, 40, 42) for the test sequences. In one case the 145-base sequence is asymmetrically located on the histone octamer (38), so that not all DNA-histone contact points are engaged. At one end only 54 bases of this DNA fragment are involved. The rest of the sequence, however, occupies a unique position on the octamer. To determine this position by our algorithm, which requires a sequence at least 117 bases long, we extended the original sequence by 15 non-scoring bases (N15). As a result the predicted position (local maximum) was found within the plot, rather than beyond it (Figure 7B).

Removal of the entries with repeating sequences

There are several families of tandem repeats with lengths from 19 to 35 bases (33), all containing the common 17-base motif AAATTTCCGGCAAATCG. Entries of the nucleosome DNA database were screened for a match with this motif. The entries containing matches higher than 10/17 were removed. This procedure was applied to the sub-sequences having both ends of the DNA matching the 10-matrix (Figure 3).
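The repeat screening can be sketched in Python as follows. This is only one plausible reading of the procedure, under stated assumptions: the match is counted as the number of identical bases in an ungapped alignment of the full 17-base motif at each offset of the entry, only the given strand is examined, and the function names are hypothetical. The motif and the 10/17 cutoff are taken from the text.

```python
from typing import Iterable, List

# Common 17-base motif of the tandem repeat families (33).
MOTIF = "AAATTTCCGGCAAATCG"


def best_motif_match(sequence: str, motif: str = MOTIF) -> int:
    """Largest number of identical bases over all ungapped full-length alignments.

    Assumption: gaps and the reverse-complement strand are not considered.
    """
    seq = sequence.upper()
    best = 0
    for start in range(len(seq) - len(motif) + 1):
        window = seq[start:start + len(motif)]
        matches = sum(a == b for a, b in zip(window, motif))
        best = max(best, matches)
    return best


def remove_repeat_entries(entries: Iterable[str],
                          motif: str = MOTIF,
                          max_matches: int = 10) -> List[str]:
    """Keep only entries whose best match does not exceed max_matches (10 of 17)."""
    return [e for e in entries if best_motif_match(e, motif) <= max_matches]


if __name__ == "__main__":
    entries = ["GG" + MOTIF + "TT", "C" * 40]
    print([best_motif_match(e) for e in entries])
    print(remove_repeat_entries(entries))
```

Entries shorter than the motif yield a best match of 0 in this sketch and are therefore kept; whether the original screening treated such cases differently is not stated.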
Acknowledgements

The work has been supported by the Israel Science Foundation (grant 222/09), by the Czech Ministry of Education (grant MSM0021622415), and by a Pratt fellowship and the Lynn and William Frankel Center for Computer Sciences at Ben Gurion University.

References