SLIDE 1

Local Maximal Stack Scores with General Loop Penalty Function

EVA 2005, Gothenburg Niels Richard Hansen

This talk is based on two papers:

Asymptotics for Local Maximal Stack Scores with General Loop Penalty Function. To be submitted shortly.

The Maximum of a Random Walk Reflected at a General Barrier. To appear in Ann. Appl. Probab.

SLIDE 8

RNA-structures

RNA molecules are sequences of nucleotides, some forming functionally important structures.

An RNA molecule is represented as a sequence, $X_1 \ldots X_n$, of letters from the alphabet $\{A, C, G, U\}$.

Its (secondary) structure is a graph with vertex set $\{1, \ldots, n\}$.

The graph is a partial matching: a vertex can enter in at most one edge, and there are no self-loops.

Typically, edges between near neighbours (sharp turns) are not allowed.

Typically, pseudo-knots are not allowed: pairs of edges of the form $\{i_1, j_1\}$ and $\{i_2, j_2\}$ with $i_1 < i_2 < j_1 < j_2$ are not allowed.

An edge represents a hydrogen bond between nucleotides.
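The constraints above can be checked mechanically. The following is a minimal sketch, assuming a structure is given as a list of 1-based index pairs. The function name `is_valid_structure` and the parameter `min_loop` (how many vertices must lie strictly between the two ends of an edge, used here to rule out sharp turns) are illustrative choices and do not come from the talk.

```python
def is_valid_structure(n, edges, min_loop=3):
    """Check that `edges` is a secondary structure on vertices 1..n.

    edges    -- iterable of (i, j) pairs, 1-based
    min_loop -- hypothetical threshold: an edge {i, j} needs at least
                this many vertices strictly between i and j
    """
    pairs = [tuple(sorted(e)) for e in edges]
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= n):      # out of range, or a self-loop
            return False
        if j - i - 1 < min_loop:       # sharp turn
            return False
        if i in used or j in used:     # vertex in more than one edge
            return False
        used.update((i, j))
    # Pseudo-knot: two edges with i1 < i2 < j1 < j2.
    for i1, j1 in pairs:
        for i2, j2 in pairs:
            if i1 < i2 < j1 < j2:
                return False
    return True
```

For instance, the nested edges `[(1, 10), (2, 9), (3, 8)]` on 10 vertices pass the check, while the crossing edges `[(1, 6), (3, 9)]` are rejected as a pseudo-knot.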

SLIDE 10

RNA-structures

[Figure: an example RNA molecule from the nematode C. elegans, drawn as two strands with base pairs marked by vertical bars.]

Xiong and Waterman (1997) show strong limit results for the maximum of (minus) the free energy score of RNA-structures. The free energy score is an additive score of the hydrogen-bonded nucleotides (edges) plus linear penalties on the length of the loops (unpaired vertices). The score depends on a parameter vector $\alpha$.

SLIDE 13

Strong Limits

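Let $X_1, \ldots, X_n$ be an iid RNA-sequence. Let $T_{i,j}$ denote the maximal structure score for $X_i, \ldots, X_j$ for $i < j$, and

$$M_n = \max\Big\{ \max_{1 \le i < j \le n} T_{i,j},\; 0 \Big\}.$$

Relying on subadditive techniques, Xiong and Waterman show that

$$\lim_{n \to \infty} \frac{1}{n} T_{1,n} = a(\alpha) \quad \text{a.s.}$$

If $a(\alpha) > 0$,

$$\lim_{n \to \infty} \frac{1}{n} M_n = a(\alpha) \quad \text{a.s.}$$

and if $a(\alpha) < 0$,

$$\lim_{n \to \infty} \frac{1}{\log n} M_n = b(\alpha) \quad \text{a.s.}$$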

SLIDE 15

A Conjecture

In the logarithmic phase, $a(\alpha) < 0$, Xiong and Waterman conjecture that

$$P(M_n > t) \simeq 1 - \exp\big(-K(\alpha)\, n \exp(-t/b(\alpha))\big) \qquad (1)$$

for suitably large $n$ and $t$.

For a (quite restrictive) class of stack/hairpin-loop structures we show such a result. Our result contains situations corresponding to $a(\alpha) = 0$ but where (1) holds.

SLIDE 18

Local scores

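We proceed as follows: Choose functions $f : \{A, C, G, U\}^2 \to \mathbb{R}$ (non-lattice) and $g : \mathbb{N}_0 \to (-\infty, 0]$. For $1 \le i < j \le n$ define

$$T_{i,j} = \max_{-2 \le 2\delta < j-i} \Big\{ \sum_{k=0}^{\delta} f(X_{i+k}, X_{j-k}) + g(j - i - 2\delta - 1) \Big\}.$$

The sequence splits into a stack, a hairpin-loop and a stack, with the lengths shown under the braces:

$$X_1 \ldots X_{i-1}\; \underbrace{X_i \ldots X_{i+\delta}}_{\text{stack: } \delta+1}\; \underbrace{X_{i+\delta+1} \ldots X_{j-\delta-1}}_{\text{hairpin-loop: } j-i-2\delta-1}\; \underbrace{X_{j-\delta} \ldots X_j}_{\text{stack: } \delta+1}\; X_{j+1} \ldots X_n.$$

Let $M_n = \max_{1 \le i < j \le n} T_{i,j}$.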
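The definition of $T_{i,j}$ can be evaluated directly by looping over $\delta$. The sketch below does that under stated assumptions: the pair score `f` (+1 for a Watson-Crick pair, -2 otherwise) and the linear penalty `g` are hypothetical examples chosen for readability (this `f` is lattice-valued, so it does not meet the non-lattice condition of the talk), and the function names are illustrative.

```python
# Hypothetical pair score: +1 for a Watson-Crick pair, -2 otherwise.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def f(a, b):
    return 1.0 if (a, b) in PAIRS else -2.0

def g(m):
    # Hypothetical loop penalty, linear in the loop length m >= 0.
    return -0.5 * m

def stack_score(x, i, j, f, g):
    """T_{i,j} for 1-based positions i < j of the string x."""
    best = g(j - i + 1)      # delta = -1: no pairs, whole segment is loop
    stack = 0.0
    delta = 0
    while 2 * delta < j - i:
        # Add the pair (X_{i+delta}, X_{j-delta}); x itself is 0-based.
        stack += f(x[i + delta - 1], x[j - delta - 1])
        best = max(best, stack + g(j - i - 2 * delta - 1))
        delta += 1
    return best

def local_max_score(x, f, g):
    """M_n = max over 1 <= i < j <= n of T_{i,j}."""
    n = len(x)
    return max(stack_score(x, i, j, f, g)
               for i in range(1, n + 1)
               for j in range(i + 1, n + 1))
```

For the toy sequence `"GGGAAACCC"`, the best choice for `stack_score(x, 1, 9, f, g)` is three G-C pairs around a loop of length 3. This direct evaluation costs cubic time in the sequence length, which is fine for checking small cases.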

SLIDE 23

The Recursion

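The scores $T_{i,j}$ fulfill the recursion

$$T_{i,j} = \max\{T_{i+1,j-1} + f(X_i, X_j),\; g(j - i + 1)\}.$$

Each entry $T_{i,j}$ above the diagonal is computed from its lower-left neighbour $T_{i+1,j-1}$ (the arrows ↗ on the original slide):

$$\begin{array}{c|ccccc}
 & X_1 & X_2 & X_3 & X_4 & X_5 \\ \hline
X_1 & g(1) & T_{1,2} & T_{1,3} & T_{1,4} & T_{1,5} \\
X_2 & & g(1) & T_{2,3} & T_{2,4} & T_{2,5} \\
X_3 & & & g(1) & T_{3,4} & T_{3,5} \\
X_4 & & & & g(1) & T_{4,5} \\
X_5 & & & & & g(1)
\end{array}$$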
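The recursion can be turned into a table-filling routine that handles one diagonal at a time. This is a minimal sketch under two assumptions: the example functions `f` and `g` are hypothetical, and the empty segment just below the diagonal is given the value `g(0)`, which is what the definition of $T_{i,j}$ yields for adjacent positions (it equals 0 whenever $g(0) = 0$).

```python
# Hypothetical pair score: +1 for a Watson-Crick pair, -2 otherwise.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def f(a, b):
    return 1.0 if (a, b) in PAIRS else -2.0

def g(m):
    # Hypothetical loop penalty, linear in the loop length m >= 0.
    return -0.5 * m

def score_table(x, f, g):
    """Return a dict T with T[(i, j)] for 1 <= i <= j <= len(x)."""
    n = len(x)
    T = {}
    for i in range(1, n + 1):
        T[(i, i)] = g(1)         # diagonal entries, as in the table
        T[(i + 1, i)] = g(0)     # empty segment (assumed boundary value)
    for d in range(1, n):        # d = j - i, one diagonal per pass
        for i in range(1, n - d + 1):
            j = i + d
            T[(i, j)] = max(T[(i + 1, j - 1)] + f(x[i - 1], x[j - 1]),
                            g(j - i + 1))
    return T
```

Filling the table this way takes quadratic time, and the maximum over the entries with `i < j` gives $M_n$.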

SLIDE 25

The Diagonals

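Suppose $(X_k)_{k \in \mathbb{Z}}$ is a doubly infinite sequence of iid variables. Define recursively

$$T^0_k = \max\{T^0_{k-1} + f(X_{-k}, X_k),\; g(2k)\}, \qquad T^0_0 = 0,$$

and

$$T^1_k = \max\{T^1_{k-1} + f(X_{-k}, X_k),\; g(2k + 1)\}, \qquad T^1_0 = g(1).$$

Then

$$T_{i,j} \overset{\mathcal{D}}{=} \begin{cases} T^0_{(j-i+1)/2} & \text{if } j - i \text{ is odd,} \\ T^1_{(j-i)/2} & \text{if } j - i \text{ is even.} \end{cases}$$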

SLIDE 26

Reflected Random Walks

The processes $(T^i_k)_{k \ge 0}$, $i = 0, 1$, are random walks reflected at $g$.

[Figure: three panels showing reflected random walks for $g(n) = 0$, $g(n) = -15 \log(n)$ and $g(n) = -n$.]
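Paths of this kind can be simulated in a few lines. The sketch below is an illustration under stated assumptions: letters are drawn uniformly from {A, C, G, U}, the pair score `f` is a hypothetical example, and the walk uses the barrier $g(2k)$ for $i = 0$ and $g(2k+1)$ for $i = 1$, each walk being driven by its own previous value. The three barriers are the ones named in the panels above.

```python
import math
import random

# Hypothetical pair score: +1 for a Watson-Crick pair, -2 otherwise.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def f(a, b):
    return 1.0 if (a, b) in PAIRS else -2.0

def reflected_walk(steps, g, parity, rng):
    """Path T^parity_0, ..., T^parity_steps for iid uniform letters."""
    path = [0.0 if parity == 0 else g(1)]
    for k in range(1, steps + 1):
        a, b = rng.choice("ACGU"), rng.choice("ACGU")   # X_{-k}, X_k
        path.append(max(path[-1] + f(a, b), g(2 * k + parity)))
    return path

barriers = {
    "g(n) = 0": lambda n: 0.0,
    "g(n) = -15 log(n)": lambda n: -15.0 * math.log(n),
    "g(n) = -n": lambda n: -float(n),
}
rng = random.Random(2005)   # fixed seed, so reruns give the same paths
paths = {name: reflected_walk(200, g, 0, rng) for name, g in barriers.items()}
```

With this `f` the step mean is negative, so each path drifts downwards until it meets its barrier and is pushed back up to it.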

SLIDE 28

Reflected Random Walks

If $M^i := \sup_{k \ge 0} T^i_k < \infty$ a.s. and $\theta^* > 0$ solves

$$E \exp(\theta f(X_{-1}, X_1)) = 1,$$

then

$$P(M^i > x) \sim K^*_i \exp(-\theta^* x)$$

for $x \to \infty$.

It is necessary that $\mu := E f(X_{-1}, X_1) < 0$, in which case

$$\sum_{k=1}^{\infty} \exp(\theta^* g(k)) < \infty$$

is sufficient for $M^i < \infty$ a.s.
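The exponent $\theta^*$ is the positive root of $E \exp(\theta f(X_{-1}, X_1)) = 1$ and can be found numerically. A minimal sketch, assuming the distribution of $f(X_{-1}, X_1)$ is supplied as a list of `(value, probability)` pairs; the function names and the bisection approach are illustrative choices, not taken from the talk.

```python
import math

def mgf(theta, dist):
    """E exp(theta * Y) for a finite distribution [(value, prob), ...]."""
    return sum(p * math.exp(theta * v) for v, p in dist)

def solve_theta_star(dist, tol=1e-12):
    """Positive root of mgf(theta) = 1, found by bisection."""
    mean = sum(p * v for v, p in dist)
    if mean >= 0 or all(v <= 0 for v, p in dist if p > 0):
        raise ValueError("need a negative mean and a positive value "
                         "with positive probability")
    hi = 1.0
    while mgf(hi, dist) <= 1.0:     # move right until the mgf exceeds 1
        hi *= 2.0
    lo = hi
    while mgf(lo, dist) >= 1.0:     # move left into the region mgf < 1
        lo /= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mgf(mid, dist) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As a hypothetical example, take a score of +1 with probability 1/4 and -2 with probability 3/4. Substituting $y = e^\theta$ gives $y^3 - 4y^2 + 3 = 0$, whose relevant root is $y = (3 + \sqrt{21})/2$, and `solve_theta_star([(1.0, 0.25), (-2.0, 0.75)])` should agree with $\log y$ up to the tolerance.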

SLIDE 29

The Main Result

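Define

$$C(t) = \sum_{i=1}^{n} 1(\exists \delta : T_{i-\delta, i+\delta} > t) + 1(\exists \delta : T_{i-\delta, i+1+\delta} > t).$$

Theorem: With

$$t_n = \frac{\log(K^*_0 + K^*_1) + \log n + x}{\theta^*}$$

for $x \in \mathbb{R}$, then

$$\|\mathcal{D}(C(t_n)) - \mathrm{Poi}(\exp(-x))\|_{tv} \to 0$$

for $n \to \infty$. In particular,

$$P(M_n \le t_n) \to \exp(-\exp(-x))$$

for $n \to \infty$.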

SLIDE 31

A consequence of the theorem is that

$$\frac{1}{\log n} M_n \xrightarrow{P} \frac{1}{\theta^*}.$$

The “parameters” involved are the functions $f$ and $g$, and $b(f, g) = 1/\theta^*$, where $\theta^* > 0$, solving $E \exp(\theta f(X_{-1}, X_1)) = 1$, does not depend upon $g$.

Moreover, for suitable $n$ and $t$,

$$P(M_n > t) \simeq 1 - \exp\big(-(K^*_0 + K^*_1)\, n \exp(-\theta^* t)\big).$$
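The approximation above translates directly into a significance calculation. A minimal sketch: `k_sum` stands for $K^*_0 + K^*_1$ and `theta_star` for $\theta^*$. The talk gives no numerical values for these constants, so any numbers passed in are placeholders supplied by the user, and the function names are illustrative.

```python
import math

def tail_approx(t, n, theta_star, k_sum):
    """Approximate P(M_n > t) by 1 - exp(-k_sum * n * exp(-theta_star * t))."""
    # -expm1(-y) equals 1 - exp(-y) and stays accurate when y is tiny.
    return -math.expm1(-k_sum * n * math.exp(-theta_star * t))

def threshold(x, n, theta_star, k_sum):
    """t_n = (log(k_sum) + log(n) + x) / theta_star, as in the theorem."""
    return (math.log(k_sum) + math.log(n) + x) / theta_star
```

The two functions are consistent with each other: plugging `threshold(x, n, theta_star, k_sum)` into `tail_approx` gives $1 - \exp(-\exp(-x))$ whatever the values of the constants, which is the Gumbel limit from the theorem.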

SLIDE 34

The Proof

Apply Arratia et al. (1989), “Two moments suffice for Poisson approximations: the Chen-Stein method”. It involves:

Localisation of dependencies by band-limitation: consider only $T_{i,j}$ with $j - i \le h(n)$, where

$$\lim_{n \to \infty} h(n)^{-1} \log n = \lim_{n \to \infty} n^{-\epsilon} h(n) = 0.$$

Handling of the tail behavior of partial maxima of reflected random walks due to band-limitation.

Bounding probabilities of the form $P(T_{i,j} > t,\, T_{i',j'} > t)$ by the Azuma-Hoeffding inequality and exponential change of measure.

SLIDE 36

Back to Xiong and Waterman

The variables $T_{1,n}$ do not form a subadditive sequence. By other means one can sometimes establish that

$$\lim_{n \to \infty} \frac{1}{n} T_{1,n} = a(f, g).$$

Using that $g \le 0$ and $\mu < 0$,

$$\limsup_{n \to \infty} \frac{1}{n} T_{1,n} \le 0.$$

If $g$ is sublinear, $g(n)/n \to 0$, then

$$\frac{1}{n} T_{1,n} \ge \frac{g(n)}{n} \to 0,$$

hence $a(f, g) = 0$.

SLIDE 38

Example I

Let $g(n) = \rho n$ for $\rho < 0$. Then

$$\sum_{k=1}^{\infty} \exp(\theta^* \rho k) < \infty$$

and $M^i < \infty$ a.s.

If $\rho < \mu$, then $a(f, g) = \mu$, and if $\rho > \mu$, then $a(f, g) = \rho$.

SLIDE 41

Example II

Let $g(n) = \rho \log n$ for $\rho < 0$. Then

$$\sum_{k=1}^{\infty} \exp(\theta^* \rho \log k) = \sum_{k=1}^{\infty} k^{\theta^* \rho} < \infty$$

iff $\rho < -1/\theta^*$, and $a(f, g) = 0$.

It is possible to show that for $\rho > -1/\theta^*$, $M^i = \infty$ a.s. for $i = 0, 1$. What happens here is an open question.

The case $g \equiv 0$ (the limiting case $\rho \to 0$) is understood, and

$$\lim_{n \to \infty} \frac{1}{\log n} M_n = \frac{2}{\theta^*} \quad \text{a.s.}$$

with a corresponding asymptotic extreme value distribution of $M_n$.

SLIDE 45

Concluding Remarks

Extreme value distributions are widely used in local sequence alignment to evaluate the significance of the alignment score (BLAST), with a theoretical justification for special cases.

We have provided a result for sequence structure, where one finds that the structure score asymptotically follows an extreme value distribution.

Our result is particularly useful when searching large sequences for local parts containing “a lot of structure”.

The result confirms to some extent the conjecture by Xiong and Waterman, and extends the conjecture in one direction.