Albany 2015: Book of Abstracts
June 9-13 2015
©Adenine Press (2012)
Analysis of SNP containing sites in human genome using text complexity estimates
Owing to breakthroughs in DNA sequencing technologies, the amount of available genomic data, including information on natural genome variability, grows exponentially each year. The genomic context of single nucleotide polymorphisms (SNPs) represented in the dbSNP database (http://www.ncbi.nlm.nih.gov/projects/SNP/) is therefore of great interest. An association between SNP positions in the human genome and mononucleotide repeats was shown earlier. We studied context dependencies on a broader scale in the human, mouse and rat genomes using several complexity measures. Text complexity is an important mathematical feature for exploring contextual dependencies in nucleotide sequences (Orlov et al., 2006). Different complexity measures capture different features of the nucleotide text: linguistic complexity relates to oligonucleotide vocabularies, complexity estimated by Lempel-Ziv compression relates to the structure of repeats in the text, and Shannon entropy reflects the variation in nucleotide composition. These algorithms, previously used in the "Complexity" software developed at the Institute of Cytology and Genetics in Novosibirsk (http://wwwmgs.bionet.nsc.ru/mgs/programs/lzcomposer/), have been re-implemented in a new computer program and supplemented with weight complexity measures and measures of the rotation of monomers. We analyzed nucleotide sequences containing SNPs in the human, mouse and rat genomes using this in-house program (written in C++), which calculates averaged text complexity profiles. More than 2.7 million SNP-containing sites (+/-20 nt around the SNP) in the human genome were analyzed, taken from the UCSC Genome Browser tables and from the "1000 Genomes" project (http://www.1000genomes.org/data). The presence of low-complexity sites in the regions flanking SNPs in the human genome was demonstrated statistically. The same effect was confirmed for samples of SNPs in the mouse and rat genomes.
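The three complexity measures named above can be illustrated with a minimal Python sketch. This is not the in-house C++ program or the "Complexity" software; the function names, the default `max_k`, and the particular Lempel-Ziv parsing variant are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the mononucleotide composition."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def linguistic_complexity(seq: str, max_k: int = 3, alphabet_size: int = 4) -> float:
    """Product over k = 1..max_k of (distinct k-mers observed / maximum possible).

    The maximum possible vocabulary is limited both by the alphabet
    (alphabet_size ** k) and by the number of k-mer positions in the sequence.
    max_k = 3 is an illustrative default, not a value from the abstract.
    """
    n = len(seq)
    lc = 1.0
    for k in range(1, min(max_k, n) + 1):
        observed = len({seq[i:i + k] for i in range(n - k + 1)})
        possible = min(alphabet_size ** k, n - k + 1)
        lc *= observed / possible
    return lc


def lz_phrase_count(seq: str) -> int:
    """Number of phrases in a simple Lempel-Ziv (1976-style) parsing.

    Each phrase is the shortest substring starting at the current position
    that has not occurred earlier in the text (overlaps allowed).
    Fewer phrases means more repeat structure, i.e. lower complexity.
    """
    i, phrases, n = 0, 0, len(seq)
    while i < n:
        length = 1
        # Extend the phrase while it can still be copied from preceding text.
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases
```

All three functions give lower values for a mononucleotide run such as `"AAAAAAAA"` than for a mixed sequence of the same length, which is the kind of low-complexity signal discussed here.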
The effect of mononucleotide repeats adjacent to SNP positions (Siddle et al., 2011) was confirmed on new data, including model mammalian genomes. Note that low-complexity profiles carry more information than measures of mononucleotide tracts alone. This effect was also found in model genomes (Matushkin et al., 2013; Levitsky et al., 2014). Irregularities in the distribution of mutation hot spots in the genome were shown earlier on limited data. The observed lowering of text complexity in the flanks of SNP positions can be explained by an increased frequency of breaks in the DNA double helix at the flanking positions.
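An averaged complexity profile over SNP-centered sites can be sketched as follows. This is a hedged illustration only: the sliding-window Shannon entropy, the window size of 8, the function names, and the toy 41-nt site are all assumptions, not the parameters or data of the in-house program.

```python
import math
from collections import Counter
from typing import List


def window_entropy(window: str) -> float:
    """Shannon entropy (bits per symbol) of one sequence window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def averaged_profile(sites: List[str], window: int = 8) -> List[float]:
    """Mean sliding-window entropy at each window start over equal-length sites.

    Each site is assumed to be a sequence centered on a SNP
    (e.g. 41 nt for +/-20 nt flanks).
    """
    if not sites:
        return []
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    n_windows = length - window + 1
    totals = [0.0] * n_windows
    for site in sites:
        for start in range(n_windows):
            totals[start] += window_entropy(site[start:start + window])
    return [t / len(sites) for t in totals]


# Hypothetical toy site (not real data): the SNP at index 20 is flanked
# by mononucleotide runs, with mixed sequence further away.
toy_site = "ACGTACGTACGT" + "A" * 8 + "G" + "T" * 8 + "ACGTACGTACGT"
profile = averaged_profile([toy_site], window=8)
```

For this toy site the profile is high in the outer flanks and drops to zero over the mononucleotide runs next to the central position. With real data, the average would be taken over many sites rather than one.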
The research has been supported by ICG SB RAS budget project VI.61.1.2 and RFBR 14-04-01906.
Matushkin, Y. G., Levitsky, V. G., Orlov, Y. L., Likhoshvai, V. A., Kolchanov, N. A. (2013) Translation efficiency in yeasts correlates with nucleosome formation in promoters. J Biomol Struct Dyn. 31(1), 96-102.
Orlov, Y. L., Te Boekhorst, R., Abnizova, I. I. (2006) Statistical measures of the structure of genomic sequences: entropy, complexity, and position information. J Bioinform Comput Biol. 4, 523-36.
Siddle, K. J., Goodship, J. A., Keavney, B., Santibanez-Koref, M. F. (2011) Bases adjacent to mononucleotide repeats show an increased single nucleotide polymorphism frequency in the human genome. Bioinformatics. 27(7), 895-8.
Nataly S. Safronova
Institute of Cytology and Genetics SB RAS