Overlap Graph and Clumps. Mireille Régnier, LIX and INRIA

  1. Overlap Graph and Clumps
     Mireille Régnier, LIX and INRIA
     Mireille.Regnier@inria.fr
     Web page: algo.inria.fr/regnier
     October 9th, 2008
     AlBio08, An Optimized Counting Graph

  2. Outline
     1. Introduction and principles
     2. Overlap Graph
     3. Combinatorics of clumps
     4. Open problems

  3. Cis-regulation

  4. Cis-regulation changes

  6. Example: the caudal motif in early developmental enhancers from Drosophila (Papatsenko et al., 2002)

     (a) Aligned motifs

         GCTTTTTTATGGTCGGC
         TCGCTTTTATGGCCCAA
         CAGTTTTTATGTCTTTA
         CCGTTTTGATGGCGGTG
         AAATTTTTAGGGAACCA
         GCCCGTTTATGGTTCCC
         GACACTTTATGTGACAA
         TCGGATTTATGACACAA
         ATGTCTTTATGATTATT
         GCAACTTTTGGGCCATA
         CCCTTTTGTTGGCCAAA

     (b) Counts

         A |  2  3  2  2  1  0  0  0  9  0  0  2  1  3  3  4  7
         C |  3  7  3  2  3  0  0  0  0  0  0  0  6  4  5  2  2
         G |  4  0  5  1  1  0  0  2  0  2 11  7  1  1  2  1  1
         T |  2  1  1  6  6 11 11  9  2  9  0  2  3  3  1  4  1

     (c) Position-specific scoring matrix

         A | -0.22  0.06 -0.22 -0.22 -0.62 -1.32 -1.32 -1.32  0.98 -1.32 -1.32 -0.22 -0.62  0.06  0.06  0.28  0.75
         C |  0.06  0.75  0.06 -0.22  0.06 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32  0.62  0.28  0.47 -0.22 -0
         G |  0.28 -1.32  0.47 -0.62 -0.62 -1.32 -1.32 -0.22 -1.32 -0.22  1.16  0.75 -0.62 -0.62 -0.22 -0.62 -0
         T | -0.22 -0.62 -0.62  0.62  0.62  1.16  1.16  0.98 -0.22  0.98 -1.32 -0.22  0.06  0.06 -0.62  0.28 -0
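
The slide does not say how the scores in (c) are obtained from the counts in (b). The printed values are consistent with a natural-log odds ratio using a pseudocount of 1 and a uniform background of 0.25, that is ln(((c + 1) / (N + 4)) / 0.25) with N = 11 sites, truncated to two decimals. That formula is an inference from the numbers, not something the source states. A minimal Python sketch under that assumption (the function names `count_matrix` and `log_odds` are illustrative):

```python
import math

ALPHABET = "ACGT"

# The eleven aligned caudal sites listed in panel (a).
SITES = [
    "GCTTTTTTATGGTCGGC", "TCGCTTTTATGGCCCAA", "CAGTTTTTATGTCTTTA",
    "CCGTTTTGATGGCGGTG", "AAATTTTTAGGGAACCA", "GCCCGTTTATGGTTCCC",
    "GACACTTTATGTGACAA", "TCGGATTTATGACACAA", "ATGTCTTTATGATTATT",
    "GCAACTTTTGGGCCATA", "CCCTTTTGTTGGCCAAA",
]


def count_matrix(sites):
    """Return one {letter: count} dictionary per alignment column."""
    length = len(sites[0])
    return [
        {a: sum(1 for s in sites if s[i] == a) for a in ALPHABET}
        for i in range(length)
    ]


def log_odds(count, n_sites, pseudocount=1.0, background=0.25):
    """Assumed scoring rule: ln of smoothed frequency over background."""
    freq = (count + pseudocount) / (n_sites + len(ALPHABET) * pseudocount)
    return math.log(freq / background)


counts = count_matrix(SITES)
pssm = [{a: log_odds(col[a], len(SITES)) for a in ALPHABET} for col in counts]
```

With these definitions, `counts[0]` reproduces the first column of panel (b), and a count of 11 or 0 maps to roughly 1.16 or -1.32, matching the extreme entries of panel (c).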

  9. Probability Weight Matrices
     Probability function
       - Threshold $s$: a word (site) $w$ is similar iff $\mathrm{score}(w) > s$.
       - P-value: $\mathrm{Prob}_n(\exists H;\ \mathrm{score}(H) > s)$.
     Algorithms and data structures
       - candidate-motif extraction
     Model accuracy
       - improve PWM with structural information
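
To illustrate how a matrix and a threshold define a set of "similar" words, the sketch below enumerates every word of the matrix length and keeps those with score above $s$. The three-column matrix and the threshold 1.5 are made-up toy values, not data from the talk, and the additive score (sum of one matrix entry per position) is the usual convention, assumed here.

```python
from itertools import product

ALPHABET = "ACGT"

# Hypothetical toy matrix: one {letter: score} dictionary per position.
PSSM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": 0.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
    {"A": 1.0, "C": -1.0, "G": 0.0, "T": -1.0},
]


def score(word, pssm=PSSM):
    """Additive score: sum the matrix entry of each letter at its position."""
    return sum(column[letter] for column, letter in zip(pssm, word))


def similar_words(pssm, threshold):
    """All words of the matrix length whose score exceeds the threshold."""
    return sorted(
        "".join(letters)
        for letters in product(ALPHABET, repeat=len(pssm))
        if score(letters, pssm) > threshold
    )


motif_set = similar_words(PSSM, threshold=1.5)
# Probability that one fixed position is a site under a uniform letter model.
site_probability = len(motif_set) / len(ALPHABET) ** len(PSSM)
```

Exhaustive enumeration is only practical for short matrices; it is used here because it makes the definition of the word set explicit.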

  10. Principles
      Biological function
        - overrepresented words
        - underrepresented words
      Statistical software
        - candidate-motif extraction
        - statistical significance

  13. Probability Computation
      "Classic" methods vs graphs
        - induction [GuOd81]
        - languages [ReSz98]; automata [NiFlSa00]
      Space/time complexity
        - Exact (all $n$) → AhoPro (NIIGenetika, Inria)
        - $O(n \times |\Sigma|)$; $n$: text size; $\Sigma$: data structure
      Drawbacks
        - dependency on $n$
        - numerical precision

  16. Probability Computation
      "Classic" methods vs graphs
        - induction [GuOd81]
        - languages [ReSz98]; automata [NiFlSa00]
      Space/time complexity
        - Approximation → RSA-tools, Spatt, AhoSoft (NIIGenetika, Inria)
        - $O(1 \times |\Sigma|)$
      Drawbacks
        - size of the data structure
        - tightness

  17. Aho-Corasick searching automaton
      [Figure: the automaton drawn as a tree with edges labelled a, c, g, t and nodes numbered 1 to 8]

  18. Aho-Corasick automaton: searching and computing
      - Step $n$: $w_n$ = largest prefix found = ATA.
      - Step $n+1$: character $x$ found.
          - $x = G$: $wx = ATAG \in \mathrm{Graph}$, so $w_{n+1} = ATAG$.
          - $x = A, C, T$: $wx \notin \mathrm{Graph}$.
              * $x = C$: $w = A \cdot TA$, $w_{n+1} = TAC \in \mathrm{Graph}$
              * $x = T$: $w = AT \cdot A$, $w_{n+1} = AT \in \mathrm{Graph}$
              * $x = A$: $AA, TAA \notin \mathrm{Graph}$, $w_{n+1} = \mathrm{root}$
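
The transition rule above (keep the longest suffix of $wx$ that is still a node of the graph) can be sketched naively as follows. The word set is hypothetical, chosen only so that ATAG, TAC and AT are nodes as on the slide; the talk's actual word set is not recoverable from the transcript. The sketch recomputes suffixes directly instead of using precomputed failure links, which is slower than a real Aho-Corasick automaton but yields the same states.

```python
# Hypothetical word set: every prefix of a word is a state of the automaton.
WORDS = ["ATAGATA", "TACACA", "AGACACA", "CACA"]


def prefixes(words):
    """All prefixes of the words, including the empty string (the root)."""
    return {w[:i] for w in words for i in range(len(w) + 1)}


def next_state(state, letter, states):
    """Longest suffix of state + letter that is a state (root if none)."""
    candidate = state + letter
    while candidate not in states:
        candidate = candidate[1:]  # drop the leftmost letter and retry
    return candidate


def scan(text, words):
    """State reached after each letter of the text, starting at the root."""
    states = prefixes(words)
    state, trace = "", []
    for letter in text:
        state = next_state(state, letter, states)
        trace.append(state)
    return trace


STATES = prefixes(WORDS)
```

From state ATA, `next_state` gives ATAG on G, TAC on C and AT on T, as on the slide. In this uncompacted trie the single letter A is itself a state, so reading A from ATA leads to state A here.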

  19. AhoPro: probability computation
      Step $n$: $(p_n(w))_{w \in \mathrm{Graph}}$, where $p_n(w) = \mathrm{Prob}(\text{largest prefix ending at } n \text{ is } w)$.
      Induction:
      $$p_{n+1}(ATAG) = p_n(ATA)\, p(G)$$
      $$p_{n+1}(AT) = p_n(ATA)\, p(T) + p_n(AGA)\, p(T) + p_n(CA)\, p(T) + p_n(TA)\, p(T)$$
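
The induction above can be sketched as a dynamic program that pushes probability mass along automaton transitions. This is an illustration under stated assumptions, not the AhoPro implementation: letters are independent with fixed probabilities, states that end with a word are made absorbing so that the result is the probability of at least one occurrence in $n$ letters, and all function names are invented for the example.

```python
from itertools import product

ALPHABET = "ACGT"


def build_states(words):
    """Every prefix of every word is a state; the empty string is the root."""
    return {w[:i] for w in words for i in range(len(w) + 1)}


def next_state(state, letter, states):
    """Longest suffix of state + letter that is a state."""
    candidate = state + letter
    while candidate not in states:
        candidate = candidate[1:]
    return candidate


def occurrence_probability(words, n, letter_prob):
    """P(at least one word occurs in a random text of length n)."""
    states = build_states(words)
    p = {"": 1.0}  # p_0: all mass on the root
    found = 0.0
    for _ in range(n):
        nxt = {}
        for state, mass in p.items():
            for x in ALPHABET:
                target = next_state(state, x, states)
                moved = mass * letter_prob[x]
                if any(target.endswith(w) for w in words):
                    found += moved  # marked node: first occurrence ends here
                else:
                    nxt[target] = nxt.get(target, 0.0) + moved
        p = nxt
    return found


def brute_force(words, n, letter_prob):
    """Same probability by enumerating all texts; only for tiny n."""
    total = 0.0
    for letters in product(ALPHABET, repeat=n):
        text = "".join(letters)
        if any(w in text for w in words):
            weight = 1.0
            for x in text:
                weight *= letter_prob[x]
            total += weight
    return total
```

The cost of `occurrence_probability` grows linearly with $n$ times the number of states, which is the text-size dependency the talk lists as a drawback of exact methods; `brute_force` is included only as a cross-check on short texts.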

  21. Aho-Corasick automaton: searching and computing
      Left relation: $H_1 \,\mathcal{R}_L\, H_2 \Leftrightarrow \mathrm{Father}_{LOG}(H_1) = \mathrm{Father}_{LOG}(H_2)$
        - Class $\widetilde{ATA} = \{ATACACA, ATAGATA\}$
        - ATA: largest prefix of ATACACA that is a suffix in $\mathcal{H}$
      Right relation: $H_1 \,\mathcal{R}_R\, H_2 \Leftrightarrow \mathrm{Mother}_{ROG}(H_1) = \mathrm{Mother}_{ROG}(H_2)$
        - Class $\overline{ACA} = \{ATACACA, ATACACA\} \cup \{AGACACA,\ \}$
        - ACA: largest suffix of ATACACA that is a prefix in $\mathcal{H}$
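
Reading the two glosses literally, the left and right classes can be computed by grouping words on their largest overlap. The sketch below assumes "largest" means the longest proper prefix (or suffix), which the slide does not spell out. The word ACAT is a hypothetical addition, included only so that ACA is a prefix of some word in the set; the other three words appear on the slide. The names `father_log`, `mother_rog` and `classes` are illustrative.

```python
from collections import defaultdict

# ATACACA, ATAGATA and AGACACA come from the slide; ACAT is hypothetical.
WORDS = ["ATACACA", "ATAGATA", "AGACACA", "ACAT"]


def father_log(word, words):
    """Longest proper prefix of word that is a suffix of some word."""
    for k in range(len(word) - 1, 0, -1):
        if any(h.endswith(word[:k]) for h in words):
            return word[:k]
    return ""


def mother_rog(word, words):
    """Longest proper suffix of word that is a prefix of some word."""
    for k in range(len(word) - 1, 0, -1):
        if any(h.startswith(word[-k:]) for h in words):
            return word[-k:]
    return ""


def classes(words, key):
    """Group words that share the same father (or mother)."""
    groups = defaultdict(list)
    for w in words:
        groups[key(w, words)].append(w)
    return dict(groups)


left_classes = classes(WORDS, father_log)
right_classes = classes(WORDS, mother_rog)
```

On this word set, ATACACA and ATAGATA share the father ATA, and ATACACA and AGACACA share the mother ACA.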

  22. Computation on Graph: induction

  24. Aho-Corasick automaton: searching and computing
      First occurrence at position $n = 18$:
          GGGGGGGG | ATACACA |
          no H ∈ ℋ | ···     | n
      AND NOT
          GGGGCATT | ATACACA |
          GGGGACAT | ATACACA |
          GGACATAT | ATACACA |
          GGAGACAC | ATACACA |
          ···
      All marked nodes in AhoGraph

  26. Overlap graph: probability computation
      Compute $(p_n(H))_{H \in \mathcal{H}}$ using LOG, ROG.
        - LOG: dependency on the past
        - ROG: information to transfer (memory)
      Graph traversals...

  27. Clump counts
      - First occurrence: "small" $n$.
      - $k$ occurrences: large $n$.
      ⇒ approximation ⇒ generating functions ⇒ clumps

  29. Clump counts
      With $H_1 = AACGGAA$ and $H_2 = GAATCA$:
          AACGGAACGGAACGGAATCACGGAA
      A $k$-decomposition is counted with coefficient $(-1)^k$ [BoClReVa05].
      Contribution: $(-1)^7 = -1$.
      With $AACAACAACAA = AA(CAA)^3$:
          [Figure: decompositions of a text built from AACAACAACAA extended by repeated CAA blocks]
      No contribution: even = odd.
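
The slide does not define a clump explicitly. A common reading, assumed here, is a maximal run of occurrences in which each one shares at least one position with the run built so far. The sketch below locates the occurrences of $H_1$ and $H_2$ in the slide's example text and merges them into clumps; it also lists the shifts at which a word overlaps itself, which is what makes AACAACAACAA decomposable in several ways. It does not compute the signed $k$-decomposition counts of [BoClReVa05]; the function names are illustrative.

```python
def occurrences(text, words):
    """All (start, end, word) occurrences, overlapping ones included."""
    hits = []
    for w in words:
        start = text.find(w)
        while start != -1:
            hits.append((start, start + len(w), w))
            start = text.find(w, start + 1)
    return sorted(hits)


def clumps(text, words):
    """Merge occurrences that overlap into maximal clumps."""
    result = []
    for start, end, word in occurrences(text, words):
        if result and start < result[-1]["end"]:
            # Overlaps the current clump: extend it.
            result[-1]["end"] = max(result[-1]["end"], end)
            result[-1]["hits"].append((start, word))
        else:
            result.append({"start": start, "end": end, "hits": [(start, word)]})
    return result


def self_overlap_shifts(word):
    """Shifts p at which the word overlaps a copy of itself."""
    return [p for p in range(1, len(word)) if word[p:] == word[: len(word) - p]]


TEXT = "AACGGAACGGAACGGAATCACGGAA"
found = clumps(TEXT, ["AACGGAA", "GAATCA"])
```

On the example text this yields a single clump covering `TEXT[0:20]`, made of three occurrences of AACGGAA followed by one of GAATCA.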

  31. Open problems: Frameshift and riboswitches
      Boxes: $(w_1, w_2, \tilde{w}_1, \tilde{w}_2)$
      With: P. Nicodeme.
