
Google Matrix Analysis of DNA Sequences
Vivek Kandiah and Dima Shepelyansky (Quantware group, CNRS, Toulouse)
Laboratoire de Physique Théorique, IRSAMC, UMR 5152 du CNRS, Université Paul Sabatier, Toulouse
Supported by EC FET Open project NADINE
21 June 2014

Introduction: motivation
Large and accurate genomic datasets are available for several species (1).
Interest in detecting specific or rare patterns in a given sequence.
New viewpoint: the sequence seen as a directed network.
Google matrix: $G_{ij} = \alpha S_{ij} + (1 - \alpha)/N$ with $S_{ij} = T_{ij} / \sum_{i'} T_{i'j}$ (each column of $S$ normalized to one), where $T$ describes the transitions between nearby words.
(1) http://www.ensembl.org/
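
The construction of $G$ from the transition counts can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `google_matrix` is hypothetical, the column normalization assumes that `T[i, j]` counts transitions from word j to word i, and the uniform replacement of empty columns is an assumption the slide does not mention. The slide gives no value for $\alpha$, so it is left as a required argument.

```python
import numpy as np

def google_matrix(T, alpha):
    """Return G = alpha * S + (1 - alpha) / N for a count matrix T.

    Assumed convention: T[i, j] counts transitions from word j to word i,
    so every column of S is normalized to sum to one.
    """
    T = np.asarray(T, dtype=float)
    N = T.shape[0]
    col_sums = T.sum(axis=0)
    # Assumption: a column with no outgoing transitions becomes uniform (1/N).
    S = np.full((N, N), 1.0 / N)
    has_links = col_sums > 0
    S[:, has_links] = T[:, has_links] / col_sums[has_links]
    return alpha * S + (1.0 - alpha) / N

# Toy 2-word example; alpha = 0.85 is only the common PageRank choice.
G = google_matrix([[0, 2], [1, 2]], alpha=0.85)
```

Every column of the result sums to one, which is what makes $G$ a valid Markov transition matrix.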

Introduction: from DNA sequence to network
Species analysed: Bos Taurus (Bull, $L \approx 2.9 \cdot 10^{9}$ bp); Canis Familiaris (Dog, $L \approx 2.5 \cdot 10^{9}$ bp); Loxodonta Africana (Elephant, $L \approx 3.1 \cdot 10^{9}$ bp); Homo Sapiens (Human, $L \approx 1.5 \cdot 10^{10}$ bp) and Danio Rerio (Zebrafish, $L \approx 1.4 \cdot 10^{9}$ bp).
The sequence ... TCG ATAT CTGG TAAC CTA ... is cut into consecutive words $W_{k-1}$, $W_k$, $W_{k+1}$, which form the chain $W_{k-1} \to W_k \to W_{k+1}$.
$T_{ij} \to T_{ij} + 1$ whenever word $j$ points to word $i$.
Full matrix limit: $L/(mN^2) \approx 10$ to $100$ transitions per element at $m = 6$.
For comparison, webpages have about 10 links per node on average, with $N \approx 2 \cdot 10^{5}$.
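
The counting rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the helper names `word_index` and `transition_counts` are hypothetical, words are taken as non-overlapping blocks of m letters (suggested by the $L/(mN^2)$ estimate), $N = 4^m$ is the number of possible words over the alphabet A, C, G, T, and words containing any other symbol are skipped.

```python
import numpy as np

LETTERS = "ACGT"

def word_index(word):
    """Map a word over A, C, G, T to an integer in [0, 4**len(word))."""
    idx = 0
    for ch in word:
        idx = 4 * idx + LETTERS.index(ch)
    return idx

def transition_counts(sequence, m):
    """Count transitions between consecutive non-overlapping m-letter words.

    T[i, j] is incremented whenever word j is followed by word i.
    """
    N = 4 ** m
    T = np.zeros((N, N), dtype=np.int64)
    words = [sequence[k:k + m] for k in range(0, len(sequence) - m + 1, m)]
    for prev, nxt in zip(words, words[1:]):
        # Assumption: skip words with symbols outside ACGT (e.g. 'N').
        if set(prev) <= set(LETTERS) and set(nxt) <= set(LETTERS):
            T[word_index(nxt), word_index(prev)] += 1
    return T

# Toy example with m = 2: words AC, GT, AC give AC -> GT and GT -> AC.
T = transition_counts("ACGTAC", m=2)
```

For $m = 6$ the matrix has $N = 4^6 = 4096$ rows, and with $L \approx 2.9 \cdot 10^{9}$ the estimate $L/(mN^2)$ evaluates to roughly 29, inside the quoted range of 10 to 100.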

Statistics of Google matrix elements
[Figure, two panels: $\log_{10}(N_g/N^2)$ versus $\log_{10} g$.]
Integrated fraction $N_g/N^2$ of Google matrix elements with $G_{ij} > g$ as a function of $g$.
Left panel: various species with 6-letter word length: bull BT (magenta), dog CF (red), elephant LA (green), Homo sapiens HS (blue) and zebrafish DR (black).
Right panel: data for the HS sequence with words of length $m = 5$ (brown), 6 (blue), 7 (red).
For comparison, black dashed and dotted curves show the same distribution for the WWW networks of the Universities of Cambridge and Oxford in 2006, respectively.
Oscillations, but a universal decay law $N_g \propto 1/g^{\nu - 1}$ with $\nu \approx 2.5$ (range $-5.5 < \log_{10} g < -0.5$).
The distribution of outgoing links in WWW networks decays with $\tilde{\nu} \approx 2.7$.
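
The integrated fraction and the exponent of its decay law can be estimated along these lines. The function names `integrated_fraction` and `decay_exponent` are hypothetical, and the plain least-squares fit in log-log coordinates is an assumption; the slide does not say how $\nu$ was fitted.

```python
import numpy as np

def integrated_fraction(values, thresholds):
    """Fraction of entries of `values` strictly greater than each threshold."""
    flat = np.sort(np.ravel(values))
    # side="right" counts the entries that are <= threshold.
    counts_le = np.searchsorted(flat, thresholds, side="right")
    return (flat.size - counts_le) / flat.size

def decay_exponent(thresholds, fractions):
    """Fit log10(fraction) against log10(g) with a straight line.

    For N_g proportional to 1 / g**(nu - 1) the slope is -(nu - 1),
    so the function returns nu = 1 - slope.
    """
    thresholds = np.asarray(thresholds, dtype=float)
    fractions = np.asarray(fractions, dtype=float)
    mask = fractions > 0
    slope, _ = np.polyfit(np.log10(thresholds[mask]),
                          np.log10(fractions[mask]), 1)
    return 1.0 - slope

# Toy check on a 2 x 2 matrix.
fractions = integrated_fraction([[0.1, 0.2], [0.3, 0.4]],
                                np.array([0.05, 0.25, 0.4]))
```

Applied to all $N^2$ entries of $G$, `integrated_fraction` gives $N_g/N^2$. Applied to the row sums of $G$, the same function gives an integrated fraction over the $N$ nodes instead.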

Statistics of Google matrix elements
[Figure, two panels: $\log_{10}(N_s/N)$ versus $\log_{10} g_s$.]
Integrated fraction $N_s/N$ of the sum of ingoing matrix elements with $\sum_{j=1}^{N} G_{ij} \geq g_s$.
Left and right panels show the same cases as above, in the same colors. The dashed and dotted curves are shifted along the x-axis by one unit to the left to fit the figure scale.
Visible differences between species, but close to a universal decay curve $N_s \propto 1/g_s^{\mu - 1}$ with $\mu \approx 5$.
The distribution of ingoing links in WWW networks decays with $\tilde{\mu} \approx 2.1$.

Spectrum and PageRank
[Figure, five panels: eigenvalue spectrum at $m = 6$ of a) Bos Taurus, b) Canis Familiaris, c) Loxodonta Africana, d) Homo Sapiens and e) Danio Rerio.]
Presence of a large gap.
HS ∼ CF, and strong differences between mammalian and non-mammalian sequences.
The spectra of $G$ and $G^*$ are identical.

Spectrum and PageRank
[Figure, two panels: $\log_{10} P$ versus $\log_{10} K$.]
PageRank probability decay of several species at $m = 6$ (left) and of Homo Sapiens at $m = 5$, $m = 6$ and $m = 7$ (right).
PageRank ∼ frequency of words.
$P(K) \sim 1/K^{\beta}$ with $\beta = 1/(\mu - 1)$.
At $m = 6$: $\beta = 0.273 \pm 0.005$ (BT), $0.340 \pm 0.005$ (CF), $0.281 \pm 0.005$ (LA), $0.308 \pm 0.005$ (HS), $0.426 \pm 0.008$ (DR) in the range $1 \leq \log_{10} K \leq 3.3$.
Small variation between mammalian species, stable with word length.
Top five PageRank entries of DNA sequences:
BT      CF      LA      HS      DR
TTTTTT  TTTTTT  AAAAAA  TTTTTT  ATATAT
AAAAAA  AAAAAA  TTTTTT  AAAAAA  TATATA
ATTTTT  AATAAA  ATTTTT  ATTTTT  AAAAAA
AAAAAT  TTTATT  AAAAAT  AAAAAT  TTTTTT
TTCTTT  AAATAA  AGAAAA  TATTTT  AATAAA
Last five PageRank entries of DNA sequences:
BT      CF      LA      HS      DR
CGCGTA  TACGCG  CGCGTA  TACGCG  CCGACG
TACGCG  CGCGTA  TACGCG  CGCGTA  CGTCGG
CGTACG  TCGCGA  ATCGCG  CGTACG  CGTCGA
CGATCG  CGTACG  TCGCGA  TCGACG  TCGACG
ATCGCG  CGATCG  CGCGAT  CGTCGA  TCGTCG
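
The PageRank vector $P$ and the rank index $K$ used on this slide can be obtained, for instance, by power iteration. This is a generic sketch rather than the method used for the talk: the names `pagerank` and `rank_index`, the tolerance and the iteration cap are all assumptions, and $G$ is taken to be column-stochastic.

```python
import numpy as np

def pagerank(G, tol=1e-12, max_iter=10_000):
    """Stationary vector of a column-stochastic matrix G by power iteration."""
    G = np.asarray(G, dtype=float)
    N = G.shape[0]
    P = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        P_next = G @ P
        P_next /= P_next.sum()  # guard against rounding drift
        if np.abs(P_next - P).sum() < tol:
            return P_next
        P = P_next
    return P

def rank_index(P):
    """K[i] = rank of node i, with K = 1 for the largest PageRank value."""
    order = np.argsort(-P, kind="stable")
    K = np.empty(P.size, dtype=np.int64)
    K[order] = np.arange(1, P.size + 1)
    return K

# Toy 2-node example.
G = np.array([[0.075, 0.5], [0.925, 0.5]])
P = pagerank(G)
K = rank_index(P)
```

Sorting the words by $K$ gives tables like the ones above, and a straight-line fit of $\log_{10} P$ against $\log_{10} K$ over the chosen range gives an estimate of $\beta$.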

Spectrum and PageRank
[Figure, three panels: $\log_{10} P$ versus $\log_{10} K$; $\log_{10}(N_g/N^2)$ versus $\log_{10} g$; eigenvalue spectrum in the complex plane.]
Random matrix model with a distribution of elements corresponding to HS at $m = 6$.

Statistical proximity
[Figure, four panels: $K_{bt}$, $K_{cf}$, $K_{la}$ and $K_{dr}$ versus $K_{hs}$.]
PageRank proximity $K$-$K$ plane diagrams for different species in comparison with Homo Sapiens.
$\zeta(s_1, s_2) = \sqrt{\sum_{i=1}^{N} (K_{s_1}(i) - K_{s_2}(i))^2 / N} \,/\, \sigma_{rnd}$
$\zeta(HS, CF) = 0.206$, $\zeta(HS, LA) = 0.238$, $\zeta(HS, BT) = 0.246$, $\zeta(LA, CF) = 0.303$, $\zeta(CF, BT) = 0.308$, $\zeta(LA, BT) = 0.324$, $\zeta(DR, HS) = 0.375$, $\zeta(DR, CF) = 0.414$, $\zeta(DR, LA) = 0.422$, $\zeta(DR, BT) = 0.425$
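
The proximity measure $\zeta$ can be computed from two rank vectors as sketched below. The slide does not define $\sigma_{rnd}$; this sketch assumes it is the root-mean-square rank difference expected for two independent random rankings of $N$ words, $\sqrt{(N^2 - 1)/6}$, so the absolute values it returns depend on that assumption. The function name `zeta` is hypothetical.

```python
import numpy as np

def zeta(K1, K2):
    """Normalized rms distance between two rankings of the same N words."""
    K1 = np.asarray(K1, dtype=float)
    K2 = np.asarray(K2, dtype=float)
    N = K1.size
    sigma = np.sqrt(np.mean((K1 - K2) ** 2))
    # Assumption: sigma_rnd is the rms difference of two independent
    # random permutations of 1..N, i.e. sqrt((N**2 - 1) / 6).
    sigma_rnd = np.sqrt((N ** 2 - 1) / 6.0)
    return sigma / sigma_rnd

same = zeta([1, 2, 3, 4], [1, 2, 3, 4])      # identical rankings
reversed_ = zeta([1, 2, 3, 4], [4, 3, 2, 1]) # fully reversed rankings
```

With this normalization, identical rankings give 0, unrelated rankings give values near 1, and a fully reversed ranking gives $\sqrt{2}$.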

Statistical proximity
[Figure, two panels: $K_{hs2}$ versus $K_{hs1}$.]
PageRank proximity $K$-$K$ plane diagrams between two Homo Sapiens individuals.
$\zeta = 0.031$

Conclusion and Perspectives
Complex spectrum of the Google matrix, with a large gap.
Structural differences and similarities of DNA with WWW through $G_{ij}$.
DNA sequences: $\mu \approx 5$ → slow PageRank decay, $\beta \approx 0.25$ (for WWW, $\beta \approx 0.9$).
PageRank correlations show the statistical similarity between species from a Markov chain point of view.
A random matrix model reproduces the spectrum.
Other eigenmodes may highlight relatively long-lived relaxation modes and might be localized on a particular set of words.
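
The two decay exponents quoted in the conclusion follow from the relation $\beta = 1/(\mu - 1)$ given earlier in the talk, using the exponents for ingoing elements and links from the statistics slides. As a worked check:

```latex
\beta = \frac{1}{\mu - 1}, \qquad
\beta_{\mathrm{DNA}} \approx \frac{1}{5 - 1} = 0.25, \qquad
\beta_{\mathrm{WWW}} \approx \frac{1}{2.1 - 1} \approx 0.91 .
```

The fitted values at $m = 6$ (0.273 to 0.426) are somewhat above 0.25 but remain far below the WWW value.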

