Pattern matching and common structure inference in RNA (secondary) structures



SLIDE 1

Pattern matching and common structure inference in RNA (secondary) structures

Stéphane Vialette
Stephane.Vialette@lri.fr
Laboratoire de Recherche en Informatique (LRI), bât. 490, Univ. Paris-Sud XI, 91405 Orsay cedex, France
http://www.lri.fr/~vialette
September 19, 2007, Wuhan, China

Stéphane Vialette, Pattern matching and common structure inference in RNA (secondary) structures

SLIDE 2

Outline

SLIDE 3

RNA secondary structures

Definition: RNA molecules fold back on themselves via Watson-Crick base pairing between the bases (A with U, and G with C or U), leading to double-stranded helices interrupted by single-stranded regions in internal loops or hairpin loops.

SLIDE 4

RNA secondary structures

Possible representations: linear representation

SLIDE 5

RNA secondary structures

Possible representations: bracket representation

(((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
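
To make the bracket representation concrete, here is a minimal sketch of how such a string can be turned back into a list of base pairs. It assumes the usual convention that `(` and `)` mark the two ends of a pair and `.` marks an unpaired base; the function name `parse_dot_bracket` and the 0-based positions are choices made for this illustration, not part of the talk.

```python
def parse_dot_bracket(structure):
    """Return the base pairs (i, j), i < j, 0-based, encoded by a dot-bracket string."""
    stack = []   # positions of '(' still waiting for their partner
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    print(parse_dot_bracket("((..))."))
```

Because every `)` closes the most recently opened `(`, a plain bracket string of this kind can only describe nested (crossing-free) structures.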

SLIDE 6

RNA secondary structures

Possible representations: tree representation

SLIDE 7

RNA secondary structures

Possible representations: circle representation

SLIDE 8

RNA secondary structures

Possible representations: mountain representation

SLIDE 9

RNA tertiary structure

Definition: In the next level of organization, the tertiary structure, the secondary structure elements are associated through numerous contacts, specific hydrogen bonds via the formation of a small number of additional Watson-Crick pairs and/or unusual pairs involving hairpin loops or internal bulges.

SLIDE 10

RNA tertiary structure

David W. Staple and Samuel E. Butcher, Pseudoknots: RNA Structures with Diverse Functions, PLoS Biology 3(6): e213, 2005.

SLIDE 11

A crash course in algorithmic complexity theory

Fact: Most problems of practical interest are not known to be solvable to optimality in reasonable (polynomial) running time. Many of them are NP-complete.

SLIDE 12

A crash course in algorithmic complexity theory

The class NP (Non-deterministic Polynomial time): The class NP is composed of all decision problems for which "yes" answers can be checked by an algorithm whose running time is polynomial in the size of the input. Note that this doesn't require or imply that an answer can be found quickly, only that any claimed solution can be verified quickly.
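
As an illustration of "verified quickly", here is a minimal sketch of a polynomial-time checker for a claimed solution. Boolean satisfiability is used only because it is a standard textbook member of NP; it is not discussed in the talk. The clause encoding (signed integers, as in the DIMACS convention) and the name `satisfies` are assumptions made for this example.

```python
def satisfies(clauses, assignment):
    """Check a claimed truth assignment against a CNF formula.

    clauses: list of clauses, each a list of non-zero ints
             (v means variable v, -v means its negation).
    assignment: dict mapping each variable number to True or False.
    Runs in time linear in the total size of the formula.
    """
    return all(
        any(assignment[abs(literal)] == (literal > 0) for literal in clause)
        for clause in clauses
    )


if __name__ == "__main__":
    formula = [[1, -2], [2, 3], [-1, -3]]
    print(satisfies(formula, {1: True, 2: True, 3: False}))
```

Checking one assignment takes a single pass over the formula, whereas no polynomial-time method is known for finding a satisfying assignment in general.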

SLIDE 13

A crash course in algorithmic complexity theory

NP-hard problems: A problem Π is NP-hard if an algorithm for solving it can be translated, in polynomial time, into one for solving any problem in NP (non-deterministic polynomial time). NP-hard therefore means "at least as hard as any problem in NP", although Π might, in fact, be harder.

SLIDE 14

A crash course in algorithmic complexity theory

NP-complete problems: A problem Π is NP-complete if Π is in NP (claimed solutions are verifiable in polynomial time), and Π is NP-hard (any problem in NP can be translated into this problem).

SLIDE 15

A crash course in algorithmic complexity theory

(Figure: diagram relating the classes P, NP and NPC.)

SLIDE 16

A crash course in algorithmic complexity theory

Proving a problem Π to be NP-complete

1. Prove that problem Π is in NP.
2. Choose any known NP-complete problem Π′ and prove that Π′ reduces to Π.

SLIDE 17

A crash course in algorithmic complexity theory

Coping with hardness

OK. So what is the next step?

  • Approximation algorithms.
  • Parameterized algorithms.
  • Heuristic algorithms.
  • ...

The choice of the direction to follow is application-dependent.

SLIDE 18

Approximation algorithms

Definition: An algorithm for an optimization problem that runs in time polynomial in the length of the input and outputs a solution that is guaranteed to be close to the optimal solution. "Close" has a well-defined sense, called the performance guarantee.
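
A minimal sketch of what a performance guarantee looks like in practice, using the classical factor-2 approximation for minimum vertex cover. Vertex cover is chosen here purely as a familiar illustration and is not one of the RNA problems of the talk; the function name is hypothetical.

```python
def approx_vertex_cover(edges):
    """Greedy matching-based 2-approximation for minimum vertex cover.

    edges: iterable of (u, v) pairs.
    Whenever an edge is still uncovered, both of its endpoints are taken.
    The picked edges form a matching, and any cover needs at least one
    endpoint per matching edge, so the result is at most twice the optimum.
    """
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover


if __name__ == "__main__":
    path = [(1, 2), (2, 3), (3, 4)]
    print(approx_vertex_cover(path))
```

On the path 1-2-3-4 the sketch returns all four vertices while an optimal cover has two, which meets the factor-2 guarantee exactly.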

SLIDE 19

Parameterized algorithms

Definition: An algorithm for an optimization problem whose running time is polynomial in the length of the input but exponential in a parameter, and that outputs a solution guaranteed to be optimal. The choice of a parameter makes parameterized algorithms well-suited for practical problems.
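
A minimal sketch of the idea, again on vertex cover as a stand-in problem that is not taken from the talk. The parameter is the allowed cover size `k`: the search tree has depth at most `k` and branches two ways, so the running time is roughly O(2^k · m) for m edges, exponential only in `k`. Function names are hypothetical.

```python
def vertex_cover_at_most_k(edges, k):
    """Return a vertex cover with at most k vertices, or None if none exists.

    Bounded search tree: some endpoint of the first remaining edge must be
    in the cover, so branch on both choices and recurse with budget k - 1.
    """
    if not edges:
        return set()
    if k == 0:
        return None
    u, v = edges[0]
    for chosen in (u, v):
        remaining = [edge for edge in edges if chosen not in edge]
        partial = vertex_cover_at_most_k(remaining, k - 1)
        if partial is not None:
            return partial | {chosen}
    return None


def minimum_vertex_cover(edges):
    """Exact optimum, found by trying budgets k = 0, 1, 2, ... in turn."""
    k = 0
    while True:
        cover = vertex_cover_at_most_k(list(edges), k)
        if cover is not None:
            return cover
        k += 1
```

Unlike the approximation setting, the answer is exact; the price is an exponential dependence that is confined to the parameter.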

SLIDE 20

Heuristic algorithms

Definition: An algorithm that usually, but not always, works, or that gives nearly the right answer. The running time of the algorithm might be prohibitive ... but not always.

SLIDE 21

A crash course in algorithmic complexity theory

More or less a fact: Most RNA structure problems cannot be solved to optimality in reasonable (polynomial) running time for crossing structures, i.e., pseudo-knotted structures.

  • Dynamic programming. Dynamic programming can deal with reasonable pseudo-knotted structures.
  • Approximation algorithms.
  • Parameterized algorithms.

SLIDE 22

New (not so simple) RNA representations

  • Sets of 2-intervals
  • Linear graphs
  • Arc-annotated sequences

Example (arcs not shown): a a a g g t t c a

SLIDE 23

Outline

SLIDE 24

General problem

Definition: Given two (secondary) structures S and T, decide whether or not S "occurs" in T.

  • Parsing RNA structure databases.
  • Comparing RNA structures.

The exact problem depends on

  • the structure of S and T, and
  • what it means for a structure to occur in another one.

SLIDE 25

The ARC-PRESERVING SUBSEQUENCE problem

Definition: Given two arc-annotated sequences S and T, decide whether or not S occurs in T as an arc-preserving subsequence.

Example (arcs not shown): a a a g g t t c a a c u a g t c c

SLIDE 26

The ARC-PRESERVING SUBSEQUENCE problem

Definition: Given two arc-annotated sequences S and T, decide whether or not S occurs in T as an arc-preserving subsequence.

Example (arcs and mapping not shown): a a a g g t t c a a c u a g t c c
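
The following sketch spells out what "occurs as an arc-preserving subsequence" means, under the usual reading of the definition: S is obtained from T by deleting bases together with the arcs attached to them, so a pair of kept positions is joined by an arc in T exactly when the corresponding pair is joined in S. The representation (a string plus a collection of 0-based index pairs), the function names, and the brute-force search are illustrative assumptions, not the algorithms of the talk.

```python
from itertools import combinations


def is_arc_preserving_embedding(s, s_arcs, t, t_arcs, positions):
    """Check that sending index i of s to positions[i] in t is a valid occurrence.

    This check runs in polynomial time, which is why the problem is in NP.
    """
    if len(positions) != len(s):
        return False
    if any(not 0 <= p < len(t) for p in positions):
        return False
    if any(a >= b for a, b in zip(positions, positions[1:])):
        return False          # the mapping must preserve left-to-right order
    if any(s[i] != t[p] for i, p in enumerate(positions)):
        return False          # mapped bases must be equal
    s_set = {frozenset(arc) for arc in s_arcs}
    t_set = {frozenset(arc) for arc in t_arcs}
    for i, j in combinations(range(len(s)), 2):
        in_s = frozenset((i, j)) in s_set
        in_t = frozenset((positions[i], positions[j])) in t_set
        if in_s != in_t:
            return False      # an arc is created or destroyed by the mapping
    return True


def occurs_as_aps(s, s_arcs, t, t_arcs):
    """Exhaustive search (exponential time): return one valid mapping or None."""
    for positions in combinations(range(len(t)), len(s)):
        if is_arc_preserving_embedding(s, s_arcs, t, t_arcs, positions):
            return positions
    return None
```

The exhaustive search is only usable on toy inputs; it is included to pin down the definition, while the checker shows that a proposed mapping is easy to verify.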

SLIDE 27

The ARC-PRESERVING SUBSEQUENCE problem

Levels of arc structure, each illustrated on the sequence a c t g t g (arcs not shown):

  • Unlimited
  • Crossing
  • Nested
  • Chain
  • Plain
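
Since the figures for these five levels are lost, here is a minimal sketch of how a set of arcs could be classified. It assumes the definitions commonly used for arc-annotated sequences: PLAIN has no arcs; CHAIN has arcs that neither share endpoints, cross, nor nest; NESTED additionally allows nesting; CROSSING additionally allows crossing; UNLIMITED places no restriction (in particular a base may carry several arcs). The function name and the representation of arcs as index pairs are hypothetical.

```python
from itertools import combinations


def structure_level(arcs):
    """Return the most restrictive level that a collection of arcs belongs to."""
    arcs = sorted(tuple(sorted(arc)) for arc in arcs)
    if not arcs:
        return "PLAIN"
    endpoints = [position for arc in arcs for position in arc]
    if len(endpoints) != len(set(endpoints)):
        return "UNLIMITED"    # some base is an endpoint of more than one arc
    crossing = nesting = False
    # Arcs are sorted and endpoints are distinct, so a < c in every pair below.
    for (a, b), (c, d) in combinations(arcs, 2):
        if c < b < d:
            crossing = True   # a < c < b < d: the two arcs cross
        elif d < b:
            nesting = True    # a < c < d < b: the second arc is nested in the first
    if crossing:
        return "CROSSING"
    if nesting:
        return "NESTED"
    return "CHAIN"
```

Each level contains the ones listed after it on the slide, so the function reports the smallest class that fits.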

SLIDE 28

The ARC-PRESERVING SUBSEQUENCE problem

Complexity issues: APS by structure level (columns: CROSSING, NESTED, CHAIN, PLAIN)

  • CROSSING row: NP-complete, NP-complete, NP-complete
  • NESTED row: O(nm)
  • CHAIN row: O(nm), O(n + m)
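
To give a feel for the easy end of this spectrum: when neither sequence carries any arc, the arc-preservation condition is vacuous and the problem is ordinary subsequence testing, which one left-to-right scan solves in O(n + m) time. The sketch below covers only that arc-free case; the function name is hypothetical, and no claim is made that this is the algorithm behind any particular table entry.

```python
def is_plain_subsequence(s, t):
    """Two-pointer scan: does s occur in t as a subsequence (no arcs involved)?"""
    matched = 0                      # number of characters of s matched so far
    for base in t:
        if matched < len(s) and s[matched] == base:
            matched += 1
    return matched == len(s)


if __name__ == "__main__":
    print(is_plain_subsequence("agc", "aaggtc"))
```

Matching each character of s at its earliest possible position in t is safe here because, without arcs, a later choice is never better than an earlier one.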

SLIDE 29

Outline

SLIDE 30

General problem

Definition: Given n (secondary) structures S1, S2, ..., Sn, find the largest (secondary) structure T that occurs in each input structure.

  • Parsing RNA structure databases.
  • Comparing RNA structures.

The exact problem depends on

  • n,
  • the input structures and the structure of T, and
  • what it means for a structure to occur in another one.

SLIDE 31

Common structure inference

Remarks: Variants of the problem exist for 2-interval sets, linear graphs, and arc-annotated sequences. The choice of structure to focus on here is (mostly) algorithm-dependent: the simpler the structure, the simpler the algorithmic problem.

SLIDE 32

The LONGEST ARC-PRESERVING COMMON SUBSEQUENCE problem

Definition: Given n arc-annotated sequences S1, S2, ..., Sn, find the largest arc-annotated sequence that occurs in each Si, 1 ≤ i ≤ n, as an arc-preserving subsequence.

Remarks: The complexity of the problem depends on the structure of each input arc-annotated sequence. Bad news: the problem is hard to solve to optimality even if the structures are crossing-free.
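
The definition can be turned directly into an exhaustive search, sketched below for toy inputs only. It assumes each input is a pair (sequence, set of arcs given as 0-based pairs (i, j) with i < j) and measures "largest" by the number of bases. Every candidate is read off the first input by keeping a subset of its positions along with the arcs between them, then looked for in the other inputs. All names are hypothetical, and the running time is exponential, in line with the hardness remark above.

```python
from itertools import combinations


def induced(sequence, arcs, keep):
    """Arc-annotated sequence left after deleting every position not in `keep`.

    `keep` is an increasing tuple of positions; surviving arcs are re-indexed.
    """
    new_index = {old: new for new, old in enumerate(keep)}
    kept_arcs = {(new_index[i], new_index[j])
                 for i, j in arcs if i in new_index and j in new_index}
    return "".join(sequence[p] for p in keep), kept_arcs


def occurs(pattern, pattern_arcs, sequence, arcs):
    """True if the pattern is an arc-preserving subsequence of the sequence."""
    return any(
        induced(sequence, arcs, keep) == (pattern, pattern_arcs)
        for keep in combinations(range(len(sequence)), len(pattern))
    )


def lapcs_brute_force(inputs):
    """Return a longest arc-preserving common subsequence as (string, arcs)."""
    first_sequence, first_arcs = inputs[0]
    for size in range(len(first_sequence), -1, -1):
        for keep in combinations(range(len(first_sequence)), size):
            candidate = induced(first_sequence, first_arcs, keep)
            if all(occurs(*candidate, seq, arcs) for seq, arcs in inputs[1:]):
                return candidate
    return "", set()   # not reached: the empty sequence occurs in every input
```

Sizes are tried from largest to smallest, so the first candidate found in all inputs is an optimal answer.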

SLIDE 33

The LONGEST ARC-PRESERVING COMMON SUBSEQUENCE problem

Example (arcs not shown)

S1: a a a g g t t c a c g u
S2: a a g t c c g
S3: a g a a c t c g c g

SLIDE 34

The LONGEST ARC-PRESERVING COMMON SUBSEQUENCE problem

Example (arcs not shown)

S1: a a a g g t t c a c g u
S2: a a g t c c g
S3: a g a a t t c g c g

SLIDE 35

The LONGEST ARC-PRESERVING COMMON SUBSEQUENCE problem

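Complexity by structure levels of the two inputs (column headers: Chain over Chain; Nest over Chain, Nest; Cross over Chain, Nest, Cross)

  • EDIT: O(nm), O(nm^3), NPC, APX-hard
  • LAPCS: O(nm), O(nm^3), NPC
  • MLG: O(nm), O(n^2 m), O(n^2 m^2), O(n^4 log^3 n), NPC

With Unlim (column headers: Chain, Nest, Cross, Unlim)

  • EDIT: APX-hard
  • LAPCS: NPC
  • MLG: O(n^4 log^3 n), NPC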

SLIDE 36

Occurrences in linear graphs

What is an occurrence of a pattern T in a linear graph S? (Figure: linear graphs S and T.)


SLIDE 39

Common structure inference

(Figure: input structures S1, S2 and S3.)


SLIDE 41

Common secondary structure inference

Theorem: The problem of finding the maximum size common secondary structure in a set of k linear graphs is solvable in O(m^{2k} log^{k-2} m^k log log m^k) time, where m is the maximum size of an input linear graph.

Remarks: The result holds true even if the input structures are crossing. We still don't have an efficient implementation of the algorithm.

SLIDE 42

Outline

SLIDE 43

Tertiary but planar

SLIDE 44

RNA bi-secondary structures

  • Include most real, practical RNA structures.
  • Several heuristic algorithms do exist.
  • Well-known combinatorial structure in graph theory (2-page linear graphs).
  • Very little is known about algorithmic issues.
