Arbres digitaux et suites d'ADN (Digital trees and DNA sequences) - Brigitte CHAUVIN (Versailles) - PowerPoint PPT Presentation



SLIDE 1

Arbres digitaux et suites d’ADN

Brigitte CHAUVIN (Versailles), in collaboration with Peggy CÉNAC (Univ. Bourgogne), Eric FEKETE, Stéphane GINOUILLAC, Nicolas POUYANNE (Versailles)

INRIA, 26 mai 2008

SLIDE 2

Outline

◮ Introduction
◮ Tree representation
◮ Where randomness is
◮ What is known
◮ Results
◮ Methods

SLIDE 3

Introduction

◮ A DNA sequence is an infinite word

U = u1u2 . . . un . . . ∀i, ui ∈ {A, C, G, T}.

SLIDE 4

Introduction

◮ A DNA sequence is an infinite word

U = u1u2 . . . un . . . ∀i, ui ∈ {A, C, G, T}.

◮ To be seen on a representation:

◮ repetition of patterns
◮ missing patterns
◮ distribution of the different possible patterns
◮ comparison of different sequences

SLIDE 5

Introduction

◮ A DNA sequence is an infinite word

U = u1u2 . . . un . . . ∀i, ui ∈ {A, C, G, T}.

◮ To be seen on a representation:

◮ repetition of patterns
◮ missing patterns
◮ distribution of the different possible patterns
◮ comparison of different sequences

◮ Can we identify some characteristics

◮ easy to study on the representation
◮ different from one species to another?

◮ objectives: a distance between species, stat.

SLIDE 6

Tree representation

U = u1u2 . . . un . . .

Prefixes: u1, u1u2, u1u2u3, . . .
Reversed prefixes: u1, u2u1, u3u2u1, . . .
Suffixes: u1u2u3u4 . . . , u2u3u4 . . . , u3u4 . . . , . . .

◮ suffix trie
◮ DST of reversed prefixes
◮ trie of reversed prefixes
◮ suffix DST

SLIDES 7-14

• Example. Suffix trie. U = 1001011001110 . . .

The suffixes are inserted one at a time:

S1 = U = 1001011001110 . . .
S2 = 001011001110 . . .
S3 = 01011001110 . . .
S4 = 1011001110 . . .
S5 = 011001110 . . .
S6 = 11001110 . . .
S7 = 1001110 . . .

[trie diagrams, one per slide: after each insertion of Sk, the binary tree containing S1, . . . , Sk, with edges labelled 0/1]

The shape of the tree is closely related to the repetitions of patterns.
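The incremental construction on these slides can be sketched in code. This is our own minimal illustration (not code from the talk): each suffix of U is inserted into the trie and slides down until it reaches a node where it is alone; the names `insert` and `leaf_depths` are ours.

```python
# Suffix-trie sketch for the example U = 1001011001110...
# A node is a dict: letter -> child subtree, plus an optional "leaf" entry
# (label, word) when a previously inserted suffix currently sits there.

U = "1001011001110"  # a finite prefix standing in for the infinite word

def insert(root, word, label):
    """Insert `word` into the trie; each suffix settles at the first node
    where it is alone, pushing any resident suffix one level down."""
    node, depth = root, 0
    while True:
        if "leaf" in node:
            # Another suffix occupies this node: push it one level down.
            other_label, other_word = node.pop("leaf")
            node.setdefault(other_word[depth], {})["leaf"] = (other_label, other_word)
        if not node:
            node["leaf"] = (label, word)  # empty spot: the suffix settles here
            return
        node = node.setdefault(word[depth], {})  # follow the next letter
        depth += 1

def leaf_depths(node, depth=0):
    """Branch length of each inserted suffix (depth of its leaf)."""
    out = {}
    if "leaf" in node:
        out[node["leaf"][0]] = depth
    for letter, child in node.items():
        if letter != "leaf":
            out.update(leaf_depths(child, depth + 1))
    return out

trie = {}
for i in range(7):  # insert S1, ..., S7 as on the slides
    insert(trie, U[i:], f"S{i+1}")

print(leaf_depths(trie))
# → {'S1': 5, 'S2': 2, 'S4': 3, 'S7': 5, 'S3': 3, 'S5': 3, 'S6': 2}
```

The branch lengths make the slide's point visible: S1 and S7 share the prefix 1001 and are pushed deep, while S6 = 11. . . separates after two letters.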

SLIDE 15

Where is the randomness?

It comes from the production of the letters: {0, 1} or {A, C, G, T}, or any finite alphabet. For a given word U = u1u2 . . . un . . . , the tree process (Tn)n≥0 is nonrandom.

SLIDE 16

Where is the randomness?

It comes from the production of the letters: {0, 1} or {A, C, G, T}, or any finite alphabet. For a given word U = u1u2 . . . un . . . , the tree process (Tn)n≥0 is nonrandom.

Different kinds of sources:

◮ Memoryless: Bernoulli (symmetric) or asymmetric i.i.d.
◮ Markov
◮ General probabilistic source:

◮ choose an infinite word U = u1u2 . . . un . . . with distribution µ
◮ call T the shift,
◮ add mixing assumptions (later).

The inserted words (suffixes or reversed prefixes) are NOT independent.
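As a small illustration of one kind of source (our own sketch; the transition matrix P and the start state are arbitrary choices, not from the talk), a Markov source draws each letter from a distribution depending on the previous letter. Consecutive suffixes of the resulting word U share almost all their letters, which is why the inserted words cannot be independent.

```python
# A Markov source over the DNA alphabet {A, C, G, T}: the next letter is
# drawn from a row of the transition matrix indexed by the current letter.

import random

def markov_source(transition, start, rng):
    """Generate letters u1, u2, ... of an infinite word, one at a time."""
    state = start
    while True:
        yield state
        letters, weights = zip(*transition[state].items())
        state = rng.choices(letters, weights)[0]

# Hypothetical transition probabilities, chosen only for the demo.
P = {"A": {"A": 0.5, "C": 0.2, "G": 0.2, "T": 0.1},
     "C": {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},
     "G": {"A": 0.2, "C": 0.1, "G": 0.5, "T": 0.2},
     "T": {"A": 0.2, "C": 0.2, "G": 0.1, "T": 0.5}}

rng = random.Random(0)
src = markov_source(P, "A", rng)
U = "".join(next(src) for _ in range(20))
print(U)  # one sample path u1 u2 ... u20; note the runs from the 0.5 diagonal
```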

SLIDE 17

What is known

DST for independent words, Bernoulli source:

◮ height, insertion depth, profile: cf. Mahmoud (92)
◮ Hn − log2 n → 0 in probability: Aldous-Shields (98)
◮ concentration of the height: Drmota (02)

i.i.d. asymmetric, Markov source:

◮ insertion depth, height: Pittel (85)

Strong convergences from an infinite word:

◮ i.i.d. or Markov source: Cénac et al. (07)

SLIDE 18

What is known

Suffix tries

◮ height: Devroye, Szpankowski (92) (i.i.d. source)
◮ depth, fill-up level, height: Jacquet, Szpankowski (93) (general source + mixing)
◮ average size and total path length: Fayolle (06) (i.i.d. asymmetric, Markov)
◮ fill-up level: Cénac, Fekete (general source + not too strong mixing) (in progress)

SLIDE 19

Two families of methods:

(1) analytic combinatorics: generating functions, Mellin transform
    ↓
    precise asymptotics on the average of additive characteristics, on the distribution of the height

(2) probability
    ↓
    a.s. convergences

Common to both: correlations, overlapping of words.

SLIDE 20

Some notations to write the results

◮ The probability that the source produces a sequence of symbols starting with the pattern m is

  pm = ∫_{Im} f(t) dt.

◮ s = s1s2 . . . sn . . . denotes an infinite deterministic sequence.
◮ s(n) = s1s2 . . . sn.

SLIDE 21

Some notations to write the results

pm = ∫_{Im} f(t) dt

◮ s = s1s2 . . . sn . . . denotes an infinite deterministic sequence.
◮ s(n) = s1s2 . . . sn.
◮ Entropies:

  h+ = lim_{n→+∞} (1/n) max_{s(n)} ln(1/p_{s(n)}),

  h− = lim_{n→+∞} (1/n) min_{s(n)} ln(1/p_{s(n)}),

  h = lim_{n→+∞} (1/n) E[ln(1/p_{U(n)})].
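For a memoryless source these three entropies have closed forms: h+ = ln(1/min p), h− = ln(1/max p), and h = −Σ p ln p; moreover, for i.i.d. letters the defining expressions are already exact at every finite n. A small check of this (our own sketch, with an arbitrarily chosen asymmetric Bernoulli source):

```python
# Compare the closed forms of h+, h-, h with the defining expressions
# evaluated exhaustively on all words of length n, for an i.i.d. source.

from itertools import product
from math import log

p = {"0": 0.3, "1": 0.7}  # assumed asymmetric Bernoulli source

h_plus = log(1 / min(p.values()))          # (1/n) max_{s(n)} ln(1/p_{s(n)})
h_minus = log(1 / max(p.values()))         # (1/n) min_{s(n)} ln(1/p_{s(n)})
h = -sum(q * log(q) for q in p.values())   # (1/n) E[ln(1/p_{U(n)})]

n = 8
probs = {}
for w in product(p, repeat=n):
    q = 1.0
    for letter in w:
        q *= p[letter]          # p_{s(n)} = product of letter probabilities
    probs["".join(w)] = q

max_term = max(log(1 / q) for q in probs.values()) / n
min_term = min(log(1 / q) for q in probs.values()) / n
mean_term = sum(q * log(1 / q) for q in probs.values()) / n

print(max_term, h_plus)    # identical for an i.i.d. source, at any finite n
print(min_term, h_minus)
print(mean_term, h)
```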
SLIDE 22

Some notations to write the results

◮ s = s1s2 . . . sn . . . denotes an infinite deterministic sequence; s(n) = s1s2 . . . sn.

  h+ = lim_{n→+∞} (1/n) max_{s(n)} ln(1/p_{s(n)}),

  h− = lim_{n→+∞} (1/n) min_{s(n)} ln(1/p_{s(n)}),

  h = lim_{n→+∞} (1/n) E[ln(1/p_{U(n)})].

◮ ℓ̃n = length of the shortest branch of the tree = fill-up level = ℓn
  Ln = length of the longest branch of the tree.
  Dn = insertion depth.

SLIDE 23

Results

ℓn = fill-up level
Ln = length of the longest branch of the tree.
Dn = insertion depth

Theorem (Cénac et al. (07))
For the DST, for a memoryless source or a Markovian source,

  ℓn / ln n → 1/h+ a.s. and Ln / ln n → 1/h− a.s. as n → ∞.

SLIDE 24

Results

ℓn = fill-up level
Ln = length of the longest branch of the tree.
Dn = insertion depth

Theorem
For the DST, for a memoryless source or a Markovian source,

  ℓn / ln n → 1/h+ a.s., Ln / ln n → 1/h− a.s., and Dn / ln n → 1/h in probability, as n → ∞.
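The theorem can be watched numerically. The sketch below (ours; the source parameters, sample size, and the particular fill-up convention are assumptions for the demo) builds a DST from the suffixes of a word emitted by an asymmetric memoryless source and compares ℓn/ln n and Ln/ln n with 1/h+ and 1/h−; convergence is slow, so only rough agreement should be expected.

```python
# DST from the suffixes of a Bernoulli(p0) word; fill-up level and height.

import random
from math import log

rng = random.Random(1)
p0 = 0.3          # P(letter 0); h+ = ln(1/0.3), h- = ln(1/0.7)
n = 2000
U = "".join("0" if rng.random() < p0 else "1" for _ in range(n + 500))

def dst_insert(root, word):
    """DST insertion: follow the letters of `word` down from the root and
    occupy the first empty slot (the root acts as an occupied sentinel)."""
    node, depth = root, 0
    while word[depth] in node:
        node = node[word[depth]]
        depth += 1
    node[word[depth]] = {}
    return depth + 1

def fill_up_level(root, alphabet="01"):
    """Depth of the last complete level (one convention for the fill-up level)."""
    level, k = [root], 0
    while level and all(a in node for node in level for a in alphabet):
        level = [node[a] for node in level for a in alphabet]
        k += 1
    return k

tree = {}
depths = [dst_insert(tree, U[i:]) for i in range(n)]
ell_n, L_n = fill_up_level(tree), max(depths)

print(ell_n / log(n), 1 / log(1 / p0))        # shortest branch vs 1/h+
print(L_n / log(n), 1 / log(1 / (1 - p0)))    # longest branch vs 1/h-
```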

SLIDE 25

In progress

ℓn = fill-up level
Ln = length of the longest branch of the tree.
Dn = insertion depth

Theorem
For the suffix trie, for a general source with mixing conditions,

  ℓn / ln n → 1/h+ a.s. as n → ∞.

SLIDE 26

Methods - 1 - Runs well

(works for the DST and for the suffix trie)

◮ s = s1s2 . . . sn . . . denotes an infinite deterministic sequence.
◮ s(n) = s1s2 . . . sn
◮ Xn(s) := length of the branch corresponding to s in the tree Tn.

  ℓn = min_s Xn(s) and Ln = max_s Xn(s).

SLIDE 27

Methods - 1 - Runs well

(works for the DST and for the suffix trie)

◮ s = s1s2 . . . sn . . . denotes an infinite deterministic sequence.
◮ s(n) = s1s2 . . . sn
◮ Tk(s) := size of the first tree in which s(k) is inserted,
  Xn(s) := length of the branch corresponding to s in Tn.
  ℓn = min_s Xn(s) and Ln = max_s Xn(s).
◮ Xn and Tk are in duality:

  {Xn(s) ≥ k} = {Tk(s) ≤ n},  so  P(ℓn ≤ k − 1) ≤ . . .

SLIDE 28

Methods - 1 - Runs well

(works for the DST and for the suffix trie)

◮ s = s1s2 . . . sn . . . denotes an infinite deterministic sequence.
◮ s(n) = s1s2 . . . sn
◮ Tk(s) := size of the first tree in which s(k) is inserted,
  Xn(s) := length of the branch corresponding to s in Tn.
  ℓn = min_s Xn(s) and Ln = max_s Xn(s).
◮ Xn and Tk are in duality:

  {Xn(s) ≥ k} = {Tk(s) ≤ n},  so  P(ℓn ≤ k − 1) ≤ Σ_{s(k)} P(Tk(s) > n).
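The duality is, at bottom, the monotonicity of n ↦ Xn(s): the branch along s can only grow as words are inserted, so "the branch has reached length k by time n" is the same event as "the first tree containing s(k) has size at most n". A numerical check of this (our sketch; the word length, the choice of s, and the DST variant are assumptions for the demo):

```python
# Check {Xn(s) >= k} = {Tk(s) <= n} on a DST built from suffixes of a
# random word, for a fixed deterministic sequence s.

import random

rng = random.Random(7)
U = "".join(rng.choice("01") for _ in range(400))
s = "011010011010110"  # a finite stand-in for an infinite deterministic sequence

def dst_insert(root, word):
    node, depth = root, 0
    while word[depth] in node:
        node = node[word[depth]]
        depth += 1
    node[word[depth]] = {}

def branch_length(root, s):
    """X_n(s): how far the path along the letters of s exists in the tree."""
    node, k = root, 0
    while k < len(s) and s[k] in node:
        node = node[s[k]]
        k += 1
    return k

tree, X = {}, []
for i in range(200):
    dst_insert(tree, U[i:])
    X.append(branch_length(tree, s))   # X_n(s) for n = i + 1

# T_k(s): size of the first tree whose branch along s has length >= k.
T = {k: next(n for n, x in enumerate(X, start=1) if x >= k)
     for k in range(1, max(X) + 1)}

for k, t in T.items():
    for m in range(1, 201):
        assert (X[m - 1] >= k) == (t <= m)   # the duality, event by event
print("duality holds for k = 1 ..", max(X))
```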

SLIDE 29

Methods - 1 - Runs well

(works for the DST)

  P(ℓn ≤ k − 1) ≤ Σ_{s(k)} P(Tk(s) > n)

◮ s = s1s2 . . . sn . . . denotes an infinite deterministic sequence.
◮ s(n) = s1s2 . . . sn
◮ Tk(s) := size of the first tree in which s(k) is inserted,

  Tk(s) = Σ_{r=1}^{k} (Tr(s) − Tr−1(s)) = Σ_{r=1}^{k} Zr(s),

where Zr(s) = waiting time of the first occurrence of s(r) in U after Tr−1.

Markov hypothesis ⇒ the r.v. Zr(s) are independent.

SLIDE 30

A bit of probability

(works for the DST)

Markov hypothesis ⇒ the r.v. Zr(s) are independent.

  Tk(s) = Σ_{r=1}^{k} (Tr(s) − Tr−1(s)) = Σ_{r=1}^{k} Zr(s)
        = Σ_{r=1}^{k} (Zr(s) − E Zr(s)) + Σ_{r=1}^{k} E Zr(s)
        = Σ_{r=1}^{k} εr(s) + E Tk(s)
        = martingale Mk(s) + E Tk(s).

SLIDE 31

  Tk(s) = Σ_{r=1}^{k} Zr(s) = Σ_{r=1}^{k} (Zr(s) − E Zr(s)) + Σ_{r=1}^{k} E Zr(s) = martingale Mk(s) + E Tk(s)

  log Tk(s) = log E Tk(s) + log(1 + Mk(s)/E Tk(s)),

where log E Tk(s) ∼ k h(s) and, ∀α > 0, Mk(s)/E Tk(s) = o(k^{(1+α)/2}); hence

  log Tk(s) / k → h(s) a.s. as k → ∞.

SLIDE 32

A bit of probability

(works for the DST)

Markov hypothesis ⇒ the r.v. Zr(s) are independent.

  P(ℓn ≤ k − 1) ≤ Σ_{s(k)} P(Tk(s) > n) ≤ Σ_{s(k)} t^{−n} E(t^{Tk(s)}),  t > 1.

  Tk(s) = Σ_{r=1}^{k} Zr(s)  ⇒  E(t^{Tk(s)}) = ∏_{r=1}^{k} E(t^{Zr(s)})   (Daudin-Robin (99))

SLIDE 33

(for the suffix trie)

◮ s = s1s2 . . . sn . . . denotes an infinite deterministic sequence.
◮ s(n) = s1s2 . . . sn
◮ Tk(s) := size of the first tree in which s(k) is inserted,
  Xn(s) := length of the branch corresponding to s in Tn.
  ℓn = min_s Xn(s) and Ln = max_s Xn(s).
◮ Xn and Tk are in duality:

  {Xn(s) ≥ k} = {Tk(s) ≤ n},  so  P(ℓn ≤ k − 1) ≤ Σ_{s(k)} P(Tk(s) > n) = Σ_{s(k)} P(t⁰_{s(k)} + t¹_{s(k)} > n).

SLIDE 34

Methods - 1 - Runs well

(works for the suffix trie)

◮ Tk(s) := size of the first tree in which s(k) is inserted, ℓn = min_s Xn(s).

  P(ℓn ≤ k − 1) ≤ Σ_{s(k)} P(Tk(s) > n) = Σ_{s(k)} P(t⁰_{s(k)} + t¹_{s(k)} > n),

where t⁰_m = hitting time of pattern m and t¹_m = return time of pattern m.

◮ It is sufficient that

  Σ_{s(k)} P(t⁰_{s(k)} > n/2) is the general term of a convergent series, and
  Σ_{s(k)} P(t¹_{s(k)} > n/2) is the general term of a convergent series.
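Hitting and return times are easy to observe empirically. The sketch below (ours; the word length, seed, pattern, and the particular time conventions are assumptions) measures them for one pattern over a symmetric Bernoulli source, and checks Kac's lemma: under the stationary measure, the mean return time of a pattern m is 1/p_m, here 2^5 = 32.

```python
# Hitting time t0_m and return times t1_m of a pattern in a random binary word.

import random

rng = random.Random(3)
U = "".join(rng.choice("01") for _ in range(100000))
m = "10011"

# All (possibly overlapping) occurrence positions of m in U.
occ = [i for i in range(len(U) - len(m) + 1) if U.startswith(m, i)]

t0 = occ[0] + len(m)   # hitting time (one convention: step at which m completes)
gaps = [b - a for a, b in zip(occ, occ[1:])]   # successive return times
mean_return = sum(gaps) / len(gaps)

print(t0, gaps[0], round(mean_return, 2))
# Kac's lemma: mean return time = 1/p_m = 2**5 = 32 for a fair binary source.
```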

SLIDE 35

Methods - 1 - Runs well

t⁰_m = hitting time of pattern m
t¹_m = return time of pattern m.

It is sufficient to prove that

  Σ_{s(k)} P(t⁰_{s(k)} > n/2) is the general term of a convergent series, and
  Σ_{s(k)} P(t¹_{s(k)} > n/2) is the general term of a convergent series.

SLIDE 36

Methods - 1 - Runs well

t⁰_m = hitting time of pattern m
t¹_m = return time of pattern m.

To prove:

  Σ_{s(k)} P(t⁰_{s(k)} > n/2) is the general term of a convergent series, and
  Σ_{s(k)} P(t¹_{s(k)} > n/2) is the general term of a convergent series.

This follows from an estimate for a single pattern m:

  |P(t¹_m > t) − C e^{−ξm t}| ≤ C′ t^β   (after Galves-Schmidt (97))

SLIDE 37

Methods - 2 - Less easy

The more auto-correlated a word is, the more easily it may reappear and the smaller its return time is.
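This claim can be made concrete with the auto-correlation set of a word: the shifts k at which the length-k prefix equals the length-k suffix. For a fair binary source, a classical consequence (think of the coin-flip waiting-time problem) is that the expected hitting time of m equals the sum of 2^k over the auto-correlation set, so self-overlapping patterns take longer to appear for the first time and then reappear in clumps. The code below (our sketch; the two patterns and trial count are arbitrary) checks this prediction by simulation.

```python
# Auto-correlation set of a pattern and its effect on the expected hitting time.

import random

def correlation_set(m):
    """All k >= 1 such that the length-k prefix of m equals its length-k suffix."""
    return [k for k in range(1, len(m) + 1) if m[:k] == m[-k:]]

def mean_hitting_time(m, trials, rng):
    """Average number of letters drawn until m first appears."""
    total = 0
    for _ in range(trials):
        window = ""
        while not window.endswith(m):
            window += rng.choice("01")
        total += len(window)
    return total / trials

rng = random.Random(5)
results = {}
for m in ("1111", "0111"):
    predicted = sum(2 ** k for k in correlation_set(m))
    results[m] = (predicted, mean_hitting_time(m, 4000, rng))
print(results)
# "1111" overlaps itself at every shift, so E[t0] = 2+4+8+16 = 30,
# while "0111" only matches itself fully, so E[t0] = 2**4 = 16.
```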

SLIDE 38

Methods - 2 - Less easy

The more auto-correlated a word is, the more easily it may reappear and the smaller its return time is. To handle this:

(1) work on the assumptions: add independence (Bernoulli, Markov, dynamical source + mixing assumptions);
(2) tools: auto-correlation polynomials.

SLIDE 39

Meaning of such mixing conditions: when two parts of a word w = . . . w0|w1w2 . . . wn|wn+1 . . . are far apart (more than n letters), these two parts are "almost" independent.

SLIDE 40

µ, a stationary ergodic measure, is the distribution of the words.
T is the shift (or the transformation in a dynamical system).
A is an event depending on the first m letters; B is an event depending on the suffix after m + n.

mixing:  lim_{n→∞} [µ(A ∩ T^{−n}B) − µ(A)µ(B)] = 0.

↑ φ-mixing (Paccaut (99)): ∃φ → 0 s.t. |µ(A ∩ T^{−n}B) − µ(A)µ(B)| ≤ φ(n)µ(B).

↑ ψ-mixing (Szpankowski (93), Galves-Schmidt (97)): ∃ψ decreasing, positive, tending to 0, s.t. |µ(A ∩ T^{−n}B) − µ(A)µ(B)| ≤ ψ(n)µ(A)µ(B).

SLIDE 41

Questions

Does the particular case of dynamical systems bring anything to the use of mixing conditions? What happens with no, or only weak, mixing hypotheses?