Overlap Graph and Clumps. Mireille Régnier, LIX and INRIA



SLIDE 1

Overlap Graph and Clumps

Mireille Régnier

LIX and INRIA
Mireille.Regnier@inria.fr
web page: algo.inria.fr/regnier

October 9, 2008

AlBio08 An Optimized Counting Graph 1

SLIDE 2

Outline

1. Introduction and principles
2. Overlap Graph
3. Combinatorics of clumps
4. Open problems

SLIDE 3

Cis-regulation

SLIDE 4

Cis-regulation changes

SLIDE 6

Example: the caudal motif in early developmental enhancers from Drosophila

GCTTTTTTATGGTCGGC
TCGCTTTTATGGCCCAA
CAGTTTTTATGTCTTTA
CCGTTTTGATGGCGGTG
AAATTTTTAGGGAACCA
GCCCGTTTATGGTTCCC
GACACTTTATGTGACAA
TCGGATTTATGACACAA
ATGTCTTTATGATTATT
GCAACTTTTGGGCCATA
CCCTTTTGTTGGCCAAA

(a) Aligned motifs (Papatsenko et al., 2002)

A|  2  3  2  2  1  0  0  0  9  0  0  2  1  3  3  4  7
C|  3  7  3  2  3  0  0  0  0  0  0  0  6  4  5  2  2
G|  4  0  5  1  1  0  0  2  0  2 11  7  1  1  2  1  1
T|  2  1  1  6  6 11 11  9  2  9  0  2  3  3  1  4  1

(b) Counts

A| -0.22  0.06 -0.22 -0.22 -0.62 -1.32 -1.32 -1.32  0.98 -1.32 -1.32 -0.22 -0.62  0.06  0.06  0.28  0.75
C|  0.06  0.75  0.06 -0.22  0.06 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32  0.62  0.28  0.47 -0.22 -0.22
G|  0.28 -1.32  0.47 -0.62 -0.62 -1.32 -1.32 -0.22 -1.32 -0.22  1.16  0.75 -0.62 -0.62 -0.22 -0.62 -0.62
T| -0.22 -0.62 -0.62  0.62  0.62  1.16  1.16  0.98 -0.22  0.98 -1.32 -0.22  0.06  0.06 -0.62  0.28 -0.62

(c) Position-specific scoring matrix
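
The slide does not state how panel (c) is derived from panel (b). The printed values look consistent with a log-odds score using an add-one pseudocount and a uniform background, score = ln(((c + 1) / (N + 4)) / 0.25) with N = 11 sequences. That formula is inferred from the numbers, not taken from the talk, and the function names below are illustrative only. A minimal sketch under that assumption:

```python
import math

# The 11 aligned caudal sites shown in panel (a).
MOTIFS = [
    "GCTTTTTTATGGTCGGC", "TCGCTTTTATGGCCCAA", "CAGTTTTTATGTCTTTA",
    "CCGTTTTGATGGCGGTG", "AAATTTTTAGGGAACCA", "GCCCGTTTATGGTTCCC",
    "GACACTTTATGTGACAA", "TCGGATTTATGACACAA", "ATGTCTTTATGATTATT",
    "GCAACTTTTGGGCCATA", "CCCTTTTGTTGGCCAAA",
]
ALPHABET = "ACGT"


def count_matrix(motifs):
    """Count each base at each column of the alignment."""
    length = len(motifs[0])
    return {
        b: [sum(m[i] == b for m in motifs) for i in range(length)]
        for b in ALPHABET
    }


def pssm(counts, n_seqs, pseudo=1.0, background=0.25):
    """Log-odds scores; pseudocount and background are ASSUMED values."""
    denom = n_seqs + pseudo * len(ALPHABET)
    return {
        b: [math.log((c + pseudo) / denom / background) for c in row]
        for b, row in counts.items()
    }


counts = count_matrix(MOTIFS)
scores = pssm(counts, len(MOTIFS))
for b in ALPHABET:
    print(b + "|", " ".join(f"{s:6.2f}" for s in scores[b]))
```

The printed scores are rounded, so a few entries can differ from the slide by 0.01 (the slide appears to truncate rather than round).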

SLIDE 9

Probability Weight Matrices

Probability function
- Threshold s: a word (site) w is similar iff score(w) > s.
- P-value: Prob_n(∃H; score(H) > s).

Algorithms and data structures
- Candidate-motif extraction

Model accuracy
- Improve PWM with structural information
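
To make the threshold rule concrete, the sketch below scores words against a small matrix and lists every word with score(w) > s, then scans a text for such sites. The 3-column matrix, the threshold and the function names are hypothetical values chosen for illustration; they are not from the talk. Exhaustive enumeration costs 4^m for words of length m, so this is only meant for tiny examples.

```python
from itertools import product

# HYPOTHETICAL 3-column scoring matrix (not the caudal matrix).
PSSM = {
    "A": [1.0, -1.0, 0.5],
    "C": [-1.0, 0.5, -1.0],
    "G": [-0.5, 1.0, -0.5],
    "T": [0.0, -0.5, 1.0],
}


def motif_length(pssm):
    return len(next(iter(pssm.values())))


def score(word, pssm=PSSM):
    """Sum of the per-position scores of the word's letters."""
    return sum(pssm[b][i] for i, b in enumerate(word))


def similar_words(s, pssm=PSSM):
    """All words of the motif length whose score is strictly above s."""
    m = motif_length(pssm)
    words = ("".join(t) for t in product("ACGT", repeat=m))
    return sorted(w for w in words if score(w, pssm) > s)


def sites(text, s, pssm=PSSM):
    """Start positions (0-based) of windows scoring strictly above s."""
    m = motif_length(pssm)
    return [i for i in range(len(text) - m + 1)
            if score(text[i:i + m], pssm) > s]


print(similar_words(2.0))
print(sites("TTAGTACT", 2.0))
```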

SLIDE 10

Principles

Biological function
- Overrepresented words
- Underrepresented words

Statistical software
- Candidate-motif extraction
- Statistical significance

SLIDE 13

Probability Computation

“Classic” methods vs. graphs
- Induction [GuOd81]
- Languages [ReSz98]; automata [NiFlSa00]

Space/time complexity
- Exact (all n) → AhoPro (NIIGenetika, Inria)
- O(n × |Σ|); n: text size; Σ: data structure

Drawbacks
- Dependency on n
- Numerical precision

SLIDE 16

Probability Computation

“Classic” methods vs. graphs
- Induction [GuOd81]
- Languages [ReSz98]; automata [NiFlSa00]

Space/time complexity
- Approximation → RSA-tools, Spatt, AhoSoft (NIIGenetika, Inria)
- O(1 × |Σ|)

Drawbacks
- Size of the data structure
- Tightness

SLIDE 17

Aho-Corasick searching automaton

[Figure: Aho-Corasick automaton drawn as a trie; edges labeled a, c, g, t; branches numbered 1 to 8. The layout is not recoverable from the extracted text.]

SLIDE 18

Aho-Corasick automaton: searching and computing

- Step n: w_n = largest prefix found = ATA.
- Step n + 1: character x is read.
  - x = G: wx = ATAG ∈ Graph, so w_{n+1} = ATAG.
  - x = A, C, T: wx ∉ Graph, so fall back to the largest suffix of wx that is in the graph:
    * x = C: w = A · TA, w_{n+1} = TAC ∈ Graph
    * x = T: w = AT · A, w_{n+1} = AT ∈ Graph
    * x = A: AA, TAA ∉ Graph, w_{n+1} = root

[Figure: Aho-Corasick automaton, as on the previous slide.]
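
The transition rule on this slide (read x, then keep the largest suffix of wx that is still a node) can be sketched in a few lines. The word set below is a hypothetical stand-in, since the slide's full word set is not recoverable, and the brute-force suffix search replaces the failure links a real Aho-Corasick implementation would precompute.

```python
def build_states(words):
    """All prefixes of the words; the empty string plays the role of the root."""
    return {w[:i] for w in words for i in range(len(w) + 1)}


def step(states, w, x):
    """Largest suffix of w + x that is a state (always terminates at the root)."""
    candidate = w + x
    while candidate not in states:
        candidate = candidate[1:]
    return candidate


# HYPOTHETICAL word set, chosen so that ATA, ATAG, TAC and AT are states.
WORDS = ["ATAG", "TACG"]
STATES = build_states(WORDS)

for x in "GCTA":
    print(f"ATA --{x}--> {step(STATES, 'ATA', x) or 'root'}")
```

In this sketch every prefix is a state, including the single letter A, so reading A from ATA leads to A rather than to the root as on the slide; the slide's graph presumably uses a different node set.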

SLIDE 19

AhoPro: probability computation

Step n: (p_n(w))_{w ∈ Graph}, where p_n(w) = Prob(largest prefix ending at n is w).

Induction:
p_{n+1}(ATAG) = p_n(ATA) · p(G)
p_{n+1}(AT) = p_n(ATA) · p(T) + p_n(AGA) · p(T) + p_n(CA) · p(T) + p_n(TA) · p(T)
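
The induction above is a forward pass over the automaton: each p_{n+1}(w') collects p_n(w) · p(x) over all pairs (w, x) that lead to w'. The sketch below applies it to compute the probability that a text of length n contains at least one word of the set. It assumes independent, identically distributed letters (as the factors p(G), p(T) suggest) and makes nodes where a word ends absorbing; the function names are illustrative, and this is not the AhoPro code.

```python
def build_automaton(words, alphabet):
    """Prefix states, transition table, and the set of nodes where a word ends."""
    states = {w[:i] for w in words for i in range(len(w) + 1)}

    def longest_suffix_state(s):
        while s not in states:
            s = s[1:]
        return s

    delta = {(w, x): longest_suffix_state(w + x)
             for w in states for x in alphabet}
    marked = {w for w in states if any(w.endswith(h) for h in words)}
    return states, delta, marked


def prob_at_least_one(words, n, letter_prob):
    """P(a random text of length n contains some word), i.i.d. letters ASSUMED."""
    states, delta, marked = build_automaton(words, list(letter_prob))
    p = dict.fromkeys(states, 0.0)
    p[""] = 1.0                      # before reading anything we sit at the root
    hit = 0.0                        # probability mass already absorbed
    for _ in range(n):
        nxt = dict.fromkeys(states, 0.0)
        for w, mass in p.items():
            if mass == 0.0:
                continue
            for x, px in letter_prob.items():
                target = delta[(w, x)]
                if target in marked:
                    hit += mass * px          # first occurrence ends here
                else:
                    nxt[target] += mass * px  # p_{n+1}(w') += p_n(w) * p(x)
        p = nxt
    return hit


UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(prob_at_least_one(["ATA", "TAC"], 6, UNIFORM))
```

The cost is one pass over all (state, letter) pairs per text position, which matches the dependency on n listed among the drawbacks of the exact method.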

SLIDE 21

Aho-Corasick automaton: searching and computing

Left relation
H1 R_L H2 ⇔ Father_LOG(H1) = Father_LOG(H2)
{ATACACA, ATAGATA} ~ ATA
ATA: largest prefix of ATACACA that is a suffix in H

Right relation
H1 R_R H2 ⇔ Mother_ROG(H1) = Mother_ROG(H2)
{ATACACA, ATACACA} ¯ ACA ∪ {AGACACA, }
ACA: largest suffix of ATACACA that is a prefix in H
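
The two relations can be tried out on a small word set. The sketch below reads "father" as the largest proper prefix of a word that is a suffix of some word in the set, and "mother" as the largest proper suffix that is a prefix of some word in the set, following the wording on the slide. The fourth word, ACATATA, is a hypothetical addition so that ACA is a prefix in the set; the slide's full set H is not given.

```python
def father(h, words):
    """Largest proper prefix of h that is a suffix of some word in the set."""
    for k in range(len(h) - 1, 0, -1):
        if any(w.endswith(h[:k]) for w in words):
            return h[:k]
    return ""


def mother(h, words):
    """Largest proper suffix of h that is a prefix of some word in the set."""
    for k in range(len(h) - 1, 0, -1):
        if any(w.startswith(h[-k:]) for w in words):
            return h[-k:]
    return ""


def classes(words, parent):
    """Group words that share the same father (or mother)."""
    groups = {}
    for h in words:
        groups.setdefault(parent(h, words), []).append(h)
    return groups


# ACATATA is HYPOTHETICAL; the other three words appear on the slides.
WORD_SET = ["ATACACA", "ATAGATA", "AGACACA", "ACATATA"]

print("left classes :", classes(WORD_SET, father))
print("right classes:", classes(WORD_SET, mother))
```

With this set, ATACACA and ATAGATA share the father ATA, and ATACACA and AGACACA share the mother ACA, in line with the classes shown on the slide.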

SLIDE 22

Computation on graph: induction

SLIDE 24

Aho-Corasick automaton: searching and computing

First occurrence at position n = 18:
GGGGGGGG | ATACACA |
| no H ∈ H | · · · | n

AND NOT
GGGGCATT | ATACACA |
GGGGACAT | ATACACA |
GGACATAT | ATACACA |
GGAGACAC | ATACACA |
· · ·

All marked nodes in AhoGraph

SLIDE 26

Overlap graph: probability computation

Compute (p_n(H))_{H ∈ H} using LOG and ROG.
- LOG: dependency on the past
- ROG: information to transfer (memory)

Graph traversals...

SLIDE 27

Clump counts

First occurrence: “small” n.
k occurrences: large n.
⇒ approximation
⇒ generating functions
⇒ clumps

SLIDE 29

Clump counts

With H1 = AACGGAA and H2 = GAATCA:
AACGGAACGGAACGGAATCACGGAA
k-decomposition counted with coefficient (−1)^k [BoClReVa05].
Contribution: (−1)^7 = −1

With AACAACAACAA = AA(CAA)^3:
AACAACAACAA·CAA·CAA·CAA·CAA·CAA·CAA·CAA·CAA·CAA·CAA·CAA·CAA·ACAACAACAA·
No contribution: even = odd
AACAACAACAA·CAA·CAA·CAA·CAA·CAA·CAA·CAA·CAA·CAA·CAA·CAA·ACAACAACAA·
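
For the first example, the occurrences of H1 and H2 and the clump they form can be located directly. The sketch below takes a clump to be a maximal run of occurrences in which each one shares at least one position with the run so far; that reading of "clump" is an assumption, and the code only finds clumps, it does not compute the (−1)^k weights of the k-decompositions.

```python
def occurrences(text, words):
    """All (start, end, word) occurrences, end exclusive, sorted by start."""
    found = []
    for w in words:
        start = text.find(w)
        while start != -1:
            found.append((start, start + len(w), w))
            start = text.find(w, start + 1)   # allow overlapping matches
    return sorted(found)


def clumps(text, words):
    """Merge overlapping occurrences into maximal clumps."""
    result = []
    for start, end, w in occurrences(text, words):
        if result and start < result[-1]["end"]:
            # Overlaps the current clump: extend it.
            result[-1]["end"] = max(result[-1]["end"], end)
            result[-1]["words"].append(w)
        else:
            result.append({"start": start, "end": end, "words": [w]})
    return result


H1, H2 = "AACGGAA", "GAATCA"
TEXT = "AACGGAACGGAACGGAATCACGGAA"

for c in clumps(TEXT, [H1, H2]):
    print(TEXT[c["start"]:c["end"]], c["words"])
```

On this text the three occurrences of H1 and the single occurrence of H2 chain into one clump covering the first 20 letters; the trailing CGGAA is outside it.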

SLIDE 31

Open problems: frameshift and riboswitches

Boxes: (w1, w2, w̃1, w̃2); with P. Nicodeme.
