A Method for Aligning RNA Secondary Structures (Jason T. L. Wang)



SLIDE 1

A Method for Aligning RNA Secondary Structures

Jason T. L. Wang New Jersey Institute of Technology

J Liu, JTL Wang, J Hu and B Tian, BMC Bioinformatics, 2005

SLIDE 2

Outline

  • Introduction
  • Structural alignment of RNA (preliminaries, RSmatch algorithm, software)
  • Experiments (RNA motif detection)
  • Multiple structural alignment (RMulti)
  • Combining RSmatch with RNAView
  • Conclusion and future work

SLIDE 3

Molecule building blocks

  • Protein building blocks:
    – 20 types of amino acids
  • RNA building blocks:
    – Purine: Adenine, Guanine
    – Pyrimidine: Cytosine, Uracil

SLIDE 4

RNA structure elements

  • The RNA sequence folds to form secondary/tertiary structure
  • The majority of base connections involve two bases
    – Watson-Crick: AU or CG
    – Non-canonical: UG or AG
  • Basic structure elements of RNA

SLIDE 5

Definition of structural components

  • Given an RNA sequence:
    – 5’ to 3’: r1 r2 r3 … rn
  • Two types of structural components [1]:
    – Single bases (blue)
    – Bonded base pairs (red)

[Figure: example secondary structure, drawn 5’ to 3’, with single bases in blue and bonded base pairs in red]

[1] Zuker, M. (1989) Science

SLIDE 6

Secondary structure constraint (1)

  • No base can be shared by any two pairs [2].
    – Bad: “G” is shared by two pairs, A-G and G-C

[Figure: (a) GOOD structure; (b) BAD structure, with the shared base marked "Prohibited!"]

[2] Hofacker, I.L. (2003) NAR

SLIDE 7

Secondary structure constraint (2)

  • A hairpin element must have at least 3 bases in the loop part [3].
    – Bad: only two bases (A and U) are present in the loop

[Figure: (a) GOOD hairpin; (b) BAD hairpin, marked "Prohibited!"]

[3] Zuker, M. (1991) NAR

SLIDE 8

Secondary structure constraint (3)

  • Pseudoknots are not included [4]

[Figure: (a) BAD (pseudoknot, marked "Prohibited!"); (b) GOOD (nested structure); (c) GOOD (branching)]

[4] Mathews, D.H. (1999) JMB
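The three constraints can be checked mechanically once a structure is given as a list of base pairs. The following is a minimal sketch, not code from RSmatch: the function name `check_secondary_structure`, the 0-based `(i, j)` pair encoding and the `min_loop` parameter are assumptions made for illustration.

```python
def check_secondary_structure(pairs, min_loop=3):
    """Return a list of violations of the three constraints.

    pairs: iterable of (i, j) base-pair positions with i < j (0-based).
    min_loop: minimum number of unpaired bases in a hairpin loop.
    """
    problems = []
    pairs = sorted(pairs)

    # Constraint (1): no base may take part in two pairs.
    seen = set()
    for i, j in pairs:
        for k in (i, j):
            if k in seen:
                problems.append(f"base {k} is shared by two pairs")
            seen.add(k)

    # Constraint (2): a hairpin (a pair enclosing no other pair)
    # needs at least min_loop bases in its loop.
    for i, j in pairs:
        encloses_pair = any(i < a and b < j for a, b in pairs)
        if not encloses_pair and j - i - 1 < min_loop:
            problems.append(f"hairpin ({i}, {j}) has only {j - i - 1} loop bases")

    # Constraint (3): no pseudoknots, i.e. no crossing pairs i < a < j < b.
    for i, j in pairs:
        for a, b in pairs:
            if i < a < j < b:
                problems.append(f"pairs ({i}, {j}) and ({a}, {b}) cross")
    return problems


print(check_secondary_structure([(0, 8), (1, 7)]))   # nested pairs: no violations
print(check_secondary_structure([(0, 6), (3, 9)]))   # crossing pairs: one violation
```

The quadratic scans keep the sketch short; a production check would use a stack or a sort to detect crossings in near-linear time.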

SLIDE 9

RNA secondary structure representation schemes

  a. Bond annotation [5]
  b. Arc representation [6]
  c. Tree representation [7]
  d. Nested parenthesis representation [8]

[5] Shapiro, B. (1990) CABIOS
[6] Zhang, K. (1999) CPM
[7] Ma, B. (2002) TCS
[8] Hofacker, I.L. (2002) JMB
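As an illustration of how two of these schemes relate, the nested parenthesis (dot-bracket) form can be converted into a list of arcs with a single stack pass. This is a generic sketch; the function name `parse_dot_bracket` is an assumption, and the example string is one of the structures used later in the deck.

```python
def parse_dot_bracket(structure):
    """Convert a nested-parenthesis string into sorted (i, j) base pairs."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)              # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(parse_dot_bracket("..(.....)."))   # [(2, 8)]
```

Because the notation is properly nested, it cannot express pseudoknots, which matches constraint (3).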

SLIDE 10

Outline

  • Introduction
  • Structural alignment of RNA (preliminaries, RSmatch algorithm, software)
  • Experiments (RNA motif detection)
  • Multiple structural alignment (RMulti)
  • Combining RSmatch with RNAView
  • Conclusion and future work

SLIDE 11

Extended circle model

Circle model[9] :

  • circle 0: G, C, A, G, A, A
  • circle 1: A, A, U, G
  • circle 7: C, C, G, C, G
  • circle 8: G, U, A, U, U, U, C

Sequential order between components: G > C > A-U > U > C-G > A-G

[Figure: example structure, drawn 5’ to 3’, partitioned into circles 0 through 8]

[9] Liu, J. (2005) BMC Bioinformatics

SLIDE 12

Hierarchical organization

  • Circles are organized in a tree-like hierarchy

[Figure: the example structure with circles 0 through 8, shown next to the tree formed by the same circles]

SLIDE 13

Hierarchical relationship between two structural components

  (1) The same circle: e.g. each pair from G, C, G, A-U, G-C, G, A-U
  (2) Descendant/ancestor circles: e.g. pair (G, A-U)
  (3) Cousin circles: e.g. pairs (U, C), (A-U, G-C) and (U, G-C)

[Figure: three copies of the example structure illustrating cases (1), (2) and (3)]
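The three relationships can be sketched on a dot-bracket string with a generic loop decomposition. This is only an approximation of the circle model: the loop numbering, the choice to assign a base pair to the loop it closes, and the names `build_loops`, `ancestors` and `relation` are assumptions of this sketch and need not match the circle numbering on the slides.

```python
def build_loops(structure):
    """Map every position of a balanced dot-bracket string to a loop id.

    Loop 0 is the exterior loop. Every '(' opens a new loop whose parent is
    the loop open at that point; a base pair is assigned to the loop it closes.
    """
    loop_of, parent, stack = {}, {0: None}, [0]
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            new_loop = len(parent)
            parent[new_loop] = stack[-1]
            stack.append(new_loop)
            loop_of[pos] = new_loop
        elif symbol == ")":
            loop_of[pos] = stack.pop()
        else:
            loop_of[pos] = stack[-1]
    return loop_of, parent


def ancestors(loop, parent):
    """Return the chain of loops above `loop`, nearest first."""
    chain = []
    while parent[loop] is not None:
        loop = parent[loop]
        chain.append(loop)
    return chain


def relation(pos_a, pos_b, structure):
    """Classify two positions as same circle, ancestor/descendant or cousin."""
    loop_of, parent = build_loops(structure)
    a, b = loop_of[pos_a], loop_of[pos_b]
    if a == b:
        return "same circle"
    if a in ancestors(b, parent) or b in ancestors(a, parent):
        return "ancestor/descendant"
    return "cousin"


s = "(.(..).(..))"
print(relation(1, 6, s))   # same circle
print(relation(1, 3, s))   # ancestor/descendant
print(relation(3, 8, s))   # cousin
```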

SLIDE 14

Partial structure induced by a structural component

[Figure: an example structure and the partial structures induced by one of its components, labelled "parent structure" and "child structure"; positions 10 and 30 are marked]

SLIDE 15

Structural alignment rules (1)

  • A1 precedes A2 iff B1 precedes B2, where A1, A2, B1, B2 are structural components.
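A small sketch of how this order-preservation rule could be tested on a proposed set of matches. It assumes that each match is written as `(a, b)`, where `a` is the rank of a component in RNA 1 and `b` the rank of its aligned partner in RNA 2; this encoding and the name `preserves_order` are illustrative, not taken from RSmatch.

```python
def preserves_order(matches):
    """True if sorting the matches by RNA 1 rank also sorts them by RNA 2 rank.

    matches: list of (a, b) integer ranks of aligned components.
    A component matched twice (a repeated rank) also fails the check.
    """
    ordered = sorted(matches)
    # Checking consecutive matches is enough because '<' is transitive.
    return all(a1 < a2 and b1 < b2
               for (a1, b1), (a2, b2) in zip(ordered, ordered[1:]))


print(preserves_order([(0, 0), (2, 1), (5, 4)]))   # True
print(preserves_order([(0, 3), (2, 1)]))           # False
```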

SLIDE 16

Structural alignment rules (2)

[Figure: RNA 1 and RNA 2, panels (a), (b) and (c)]

  (a) Same-loop relationship preserved: A1 is in the same loop as A2 iff B1 is in the same loop as B2
  (b) Ancestor/descendant relationship preserved: A1 is an ancestor of A2 iff B1 is an ancestor of B2
  (c) Cousin relationship preserved: A1 is a cousin of A2 iff B1 is a cousin of B2

SLIDE 17

Example alignment

  • All structural alignment rules must be satisfied for a valid alignment
  • In addition, a single base cannot be aligned with a base pair

[Figure: the first RNA and the second RNA drawn as secondary structures]

First RNA:
..((...(((......)))((.(.....))).))..
GUACGCAGUAAGUCGAUACGCCGUAUUUCGCGGUAA

Second RNA:
..((..((......))(((.......))).))..
GUUCGAUUUCUCUAAAGAGUAGCUUUCUCGGAAA

Alignment result:
..((...(((......)))((.(..  ...))).))..
GUACGCAGUAAGUCGAUACGCCGUA--UUUCGCGGUAA
|| ||   |   || | | |  |||  |||| ||| ||
GUUCGA-UU-UCUCUA-AAGA-GUAGCUUUCUCGGAAA
..((.. (( ...... ))(( (.......))).))..
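The row of vertical bars in the alignment result marks columns where the two aligned sequences carry the same base. A short sketch that recomputes it from the two aligned rows follows; the two-character gap run in the first row is written here as two hyphens, and the function name `match_line` is an assumption.

```python
def match_line(row_a, row_b):
    """Return a string with '|' under every column where both rows agree."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return "".join("|" if x == y and x != "-" else " "
                   for x, y in zip(row_a, row_b))


row_a = "GUACGCAGUAAGUCGAUACGCCGUA--UUUCGCGGUAA"
row_b = "GUUCGA-UU-UCUCUA-AAGA-GUAGCUUUCUCGGAAA"
bars = match_line(row_a, row_b)
print(row_a)
print(bars)
print(row_b)
print(bars.count("|"), "identical columns out of", len(row_a))
```

For this example the function reports 22 identical columns out of 38, grouped exactly as the bar runs shown on the slide.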

SLIDE 18

Dynamic programming algorithm: overview

[Figure: first structure, second structure and the DP scoring table, whose rows and columns are labelled by structural components (single bases and base pairs such as A-U and C-G)]

  • The best alignment between partial structures of U and A-U

SLIDE 19

Case 1

[Figure: the two partial structures for Case 1, each drawn 5’ to 3’]

SLIDE 20

Case 2

[Figure: the two partial structures for Case 2, each drawn 5’ to 3’]

SLIDE 21

Case 3

[Figure: the two partial structures for Case 3, each drawn 5’ to 3’]

SLIDE 22

Case 4.1

[Figure: the two partial structures for Case 4.1, each drawn 5’ to 3’]

SLIDE 23

Case 4.2

[Figure: the two partial structures for Case 4.2, each drawn 5’ to 3’]

SLIDE 24

Example of matching score function

  • Score function for matching two equal-length structural components C_a and C_b, i.e.

    g(C_a, C_b) = \begin{cases} 1, & \text{if both } C_a \text{ and } C_b \text{ are single bases and } C_a = C_b \\ 2, & \text{if both } C_a \text{ and } C_b \text{ are base pairs and } C_a = C_b \\ 0, & \text{otherwise} \end{cases}

  • Gap penalty equals 0
  • Extending g to the whole set of matched component pairs,

    f(R_1, R_2) = \sum_i g(C_{a_i}, C_{b_i})

  • Our goal is to maximize f(R_1, R_2)
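The example score function is simple enough to write down directly. In this sketch a single base is represented as a one-character string and a base pair as a 2-tuple of bases; that encoding is an assumption made for illustration, and mismatches are scored 0, consistent with the zero gap penalty of this example.

```python
def g(comp_a, comp_b):
    """Score for matching two structural components.

    A single base is a string such as "A"; a base pair is a tuple such as ("A", "U").
    """
    both_single = isinstance(comp_a, str) and isinstance(comp_b, str)
    both_pairs = isinstance(comp_a, tuple) and isinstance(comp_b, tuple)
    if both_single and comp_a == comp_b:
        return 1
    if both_pairs and comp_a == comp_b:
        return 2
    return 0


def f(matched):
    """Total score of an alignment given as a list of matched component pairs."""
    return sum(g(a, b) for a, b in matched)


matched = [("A", "A"), (("C", "G"), ("C", "G")), ("U", "G")]
print(f(matched))   # 1 + 2 + 0 = 3
```

With a gap penalty of 0, components left unmatched simply do not appear in `matched`, so maximizing f amounts to maximizing the weighted number of identical matched components.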

SLIDE 25

Cell type 1: single base vs. single base

[Figure: the two structures, each drawn 5’ to 3’]

Partial structures to be aligned (?):

AUACAUGUUC      ..(.....).
UCAUACAGGUUA    ....(.....).

(A)
  ..(.....).
--AUACAUGUUC
UCAUACAGGUUA
....(.....).

(B)
  ..(.....) .
--AUACAUGUU-C
UCAUACAGGUUA-
....(.....).

(C)
  ..(.....).
--AUACAUGUUC-
UCAUACAGGUU-A
....(.....) .

SLIDE 26

Cell type 2: base pair vs. single base

[Figure: the two structures, each drawn 5’ to 3’; the cell is computed from a first score and a second score, each marked "?"]

SLIDE 27

Cell type 2: base pair vs. single base (first score)

[Figure: the two structures, each drawn 5’ to 3’]

Partial structures to be aligned (?):

ACAUGUU         (.....)
UCAUACAGGUUA    ....(.....).

Candidate alignments:

    (.....)
----ACAUGUU-
UCAUACAGGUUA
....(.....).

(     .....  )
A-----CAUGU--U
-UCAUACAGGUUA-
 ....(.....).

SLIDE 28

Cell type 2: base pair vs. single base (second score)

[Figure: the two structures, each drawn 5’ to 3’]

Partial structures to be aligned (?):

AUACAUGUU       ..(.....)
UCAUACAGGUUA    ....(.....).

(A)
  ..(.....)
--AUACAUGUU-
UCAUACAGGUUA
....(.....).

(B)
..    (.....)
AU----ACAUGUU-
--UCAUACAGGUUA
  ....(.....).

(C)
  ..        (.....)
--AU--------ACAUGUU
UCAUACAGGUUA-------
....(.....).

SLIDE 29

Cell type 3: base pair vs. base pair

Partial structures to be aligned (?):

AUACAUGUU      ..(.....)
UCAUACAGGUU    ....(.....)

[Figure: the two structures, each drawn 5’ to 3’, with candidate cases (A), (B), (C), (b1) and (b2), each marked "?"]

SLIDE 30

Cell type 3: base pair vs. base pair (first score)

[Figure: the two structures, each drawn 5’ to 3’]

Partial structures to be aligned (?):

ACAUGUU    (.....)
ACAGGUU    (.....)

(A)
(.....)
ACAUGUU
ACAGGUU
(.....)

(B)
( ..... )
A-CAUGU-U
-ACAGGUU-
 (.....)

(C)
 (.....)
-ACAUGUU-
A-CAGGU-U
( ..... )

SLIDE 31

Cell type 3: base pair vs. base pair (2nd & 3rd score)

[Figure: the two structures, each drawn 5’ to 3’]

Partial structures to be aligned (?):

ACAUGUU        (.....)
UCAUACAGGUU    ....(.....)

Candidate alignments:

(     ..... )
A-----CAUGU-U
-UCAUACAGGUU-
 ....(.....)

    (.....)
----ACAUGUU
UCAUACAGGUU
....(.....)

(.....)
ACAUGUU-------
--UCAU-ACAGGUU
  .... (.....)

Partial structures to be aligned (?):

AUACAUGUU    ..(.....)
ACAGGUU      (.....)

SLIDE 32

Cell type 3: base pair vs. base pair (final score)

[Figure: the two structures, each drawn 5’ to 3’]

Partial structures to be aligned (?):

AUACAUGUU      ..(.....)
UCAUACAGGUU    ....(.....)

Candidate alignments (the slide labels its panels (A) to (D)):

  ..(.....)
--AUACAUGUU
UCAUACAGGUU
....(.....)

..    (.....)
AU----ACAUGUU
--UCAUACAGGUU
  ....(.....)

    ..(.....)
----AUACAUGUU
UCAU--ACAGGUU
....  (.....)

  ..       (.....)
--AU-------ACAUGUU
UCAUACAGGUU-------
....(.....)

..( .....)
AUA-CAUGUU-------
---UCAU---ACAGGUU
   ....   (.....)

SLIDE 33

Analysis of algorithm

  • Time and space complexity
    – Each score is calculated only once.
    – Time is bounded by the number of score calculations needed to fill up the table.
    – Each base pair will contribute to two or four score calculations.
    – Single bases: N_s; base pairs: N_p
    – Total number of score calculations: N_s^2 + 4 N_s N_p + 4 N_p^2 = O(N^2)
      • N_s^2 score calculations are contributed by two single bases
      • 4 N_s N_p score calculations are contributed by one single base and one base pair
      • 4 N_p^2 score calculations are contributed by two base pairs
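One way to see why the total is quadratic: the expression is a perfect square, N_s^2 + 4 N_s N_p + 4 N_p^2 = (N_s + 2 N_p)^2, and N_s + 2 N_p is the number of bases, since every base is either single or part of exactly one pair. The sketch below checks this numerically; it assumes, as the single N_s and N_p on the slide suggest, that both structures have the same component counts, and the function name is hypothetical.

```python
def score_calculations(n_single, n_pairs):
    """Number of score calculations for n_single single bases and n_pairs base pairs."""
    total = n_single**2 + 4 * n_single * n_pairs + 4 * n_pairs**2
    # The same count written as a perfect square of the sequence length.
    assert total == (n_single + 2 * n_pairs) ** 2
    return total


# The structure "..(.....)." has 8 single bases, 1 base pair and 10 bases in total.
print(score_calculations(8, 1))   # 100
```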

SLIDE 34

Software RSmatch

  • http://aria.njit.edu/rnacenter/RSmatch/

SLIDE 35

Outline

  • Introduction
  • Structural alignment of RNA (preliminaries, RSmatch algorithm, software)
  • Experiments (RNA motif detection)
  • Multiple structural alignment (RMulti)
  • Combining RSmatch with RNAView
  • Conclusion and future work

SLIDE 36

Motif example: detection/instantiation

  • Motif structure is known
  • IUB ambiguity symbols:
    – N: A U C G
    – W: A U
    – H: not G
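Matching a motif symbol against a concrete base reduces to a set lookup. The sketch below covers only the three ambiguity symbols listed on this slide plus the four plain bases; the table name `IUB`, the function name `symbol_matches` and the treatment of T as U are assumptions of the sketch.

```python
# Allowed bases for each motif symbol (only the symbols shown on the slide).
IUB = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "N": "ACGU",   # any base
    "W": "AU",     # A or U
    "H": "ACU",    # not G
}


def symbol_matches(symbol, base):
    """True if the concrete base is allowed by the (possibly ambiguous) motif symbol."""
    base = base.upper().replace("T", "U")   # treat DNA-style T as U
    return base in IUB[symbol.upper()]


print(symbol_matches("H", "G"))   # False
print(symbol_matches("W", "U"))   # True
```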

SLIDE 37

Gap Penalty Example

[Figure: a motif structure aligned with a subject structure]

SLIDE 38

Position independent scoring matrices

  • Two scoring matrices
  • Gap penalty: -3 for each single base and -6 for each base pair involved in the gap
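As a worked example of the gap penalty, the sketch below scores a gap from the number of single bases and base pairs it contains. The constants are the values quoted on the slide; the function and constant names are hypothetical.

```python
SINGLE_BASE_GAP = -3   # penalty per single base in the gap (from the slide)
BASE_PAIR_GAP = -6     # penalty per base pair in the gap (from the slide)


def gap_penalty(n_single_bases, n_base_pairs):
    """Total penalty for a gap covering the given numbers of components."""
    return SINGLE_BASE_GAP * n_single_bases + BASE_PAIR_GAP * n_base_pairs


# A gap spanning two single bases and one base pair: 2 * (-3) + 1 * (-6) = -12
print(gap_penalty(2, 1))
```

Since a base pair covers two bases, these values charge the same amount per base whether it is paired or not.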

SLIDE 39

Motifs used in the experiments

  • HSL3 has a typical stem-loop structure with two flanking tails
  • IRE has a specific stem-loop structure for gene regulation related to cell iron metabolism
  • Wildcard “n” is allowed to match 0 or 1 nucleotide
  • IUB code:
    – M: A, C
    – Y: C, T/U
    – H: not G
    – R: A, G
    – W: A, T/U

[Figure: (a) HSL3 motif; (b) IRE motif]

SLIDE 40

Experiments

  • Performance measurements: sensitivity (recall) and specificity (precision)
  • 19,986 human RefSeq mRNA sequences were obtained from NCBI; 39,972 UTR regions were extracted
  • Each UTR sequence was chopped and folded into secondary structures using the Vienna RNA package, yielding ~575,000 structures
  • RSmatch was compared with PatSearch [10]

[10] Pesole G. (2000) Bioinformatics
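For reference, the two performance measures can be computed from hit counts as follows. Note that the slide uses "specificity" for the quantity computed here as precision. The counts in the usage lines are made up purely for illustration and are not results from the experiments.

```python
def recall(true_positives, false_negatives):
    """Sensitivity: fraction of real motif instances that were found."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives, false_positives):
    """Fraction of reported hits that are real motif instances."""
    return true_positives / (true_positives + false_positives)


# Hypothetical counts: 9 real hits found, 1 real instance missed, 3 false hits.
print(recall(9, 1))      # 0.9
print(precision(9, 3))   # 0.75
```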

SLIDE 41

Chop and fold UTR sequences

[Figure: an mRNA with its 5’UTR, ORF, 3’UTR and poly(A) tail (AAAAAAAAAAAAA); the 5’UTR and the 3’UTR each carry tick marks at 50, 100, 150 and 200]

ORF: Open Reading Frame
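The chopping step can be pictured as cutting each UTR into overlapping windows before folding. The sketch below shows only the chopping; the window length of 100 and step of 50 are hypothetical values chosen for illustration, not parameters stated on the slides, and folding each window with the Vienna RNA package is not shown.

```python
def chop(sequence, window=100, step=50):
    """Cut a sequence into overlapping windows of fixed length.

    Sequences no longer than one window are returned whole. A trailing
    remainder shorter than `step` is not covered by its own window.
    """
    if len(sequence) <= window:
        return [sequence]
    return [sequence[start:start + window]
            for start in range(0, len(sequence) - window + 1, step)]


utr = "A" * 250                       # stand-in for a 250-base UTR
pieces = chop(utr)
print(len(pieces))                    # 4 windows starting at 0, 50, 100, 150
print([len(p) for p in pieces])       # [100, 100, 100, 100]
```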

SLIDE 42

Detecting HSL3 motif

  • PatSearch: specificity (98.2%), sensitivity (87.1%).
  • Several histone genes (e.g. NM_003542, NM_003548) were found by RSmatch, but not by PatSearch.

SLIDE 43

Detecting IRE motif

  • PatSearch was used to search the 39,972 UTR sequences for the IRE motif, giving 27 hit structures belonging to 18 UTR sequences
  • The 18 UTR sequences were chopped and folded into 1,196 structures
  • RSmatch, Rsearch [11] and stemloc [12] were compared.
  • A well-known IRE-containing structure (NM_000032) was used as the query (it has no wildcard or ambiguity symbols, since Rsearch and stemloc cannot handle them)

[11] Klein, R.J. (2003) BMC Bioinformatics
[12] Holmes, I. (2002) PSB

SLIDE 44

Experimental results for IRE motif

SLIDE 45

Dealing with complex structures

SLIDE 46

Outline

  • Introduction
  • Structural alignment of RNA (preliminaries, RSmatch algorithm, software)
  • Experiments (RNA motif detection)
  • Multiple structural alignment (RMulti)
  • Combining RSmatch with RNAView
  • Conclusion and future work

SLIDE 47

Extension to multiple structural alignment

[Flowchart with the boxes: pairwise match; search small database; seed alignment; profile; expand seed alignment; expand best alignment. Decision: score(best alignment) < δ OR non-expandable, with YES and NO branches]

SLIDE 48

Example

[Figure: example with two "expand" steps]

SLIDE 49

RMulti Webserver

  • http://aria.njit.edu/rnacenter/multi.html

SLIDE 50

Outline

  • Introduction
  • Structural alignment of RNA (preliminaries, RSmatch algorithm, software)
  • Experiments (RNA motif detection)
  • Multiple structural alignment (RMulti)
  • Combining RSmatch with RNAView
  • Conclusion and future work

SLIDE 51

SLIDE 52

SLIDE 53

Outline

  • Introduction
  • Structural alignment of RNA (preliminaries, RSmatch algorithm, software)
  • Experiments (RNA motif detection)
  • Multiple structural alignment (RMulti)
  • Combining RSmatch with RNAView
  • Conclusion and future work

SLIDE 54

Conclusion

  • An efficient algorithm, RSmatch, to align and analyze RNA secondary structures
  • A multiple structural alignment tool, RMulti
  • A visualization tool combining RSmatch with RNAView

SLIDE 55

Future Work

  • Extending RSmatch to handle pseudoknots
  • Large-scale genome-wide motif mining
  • Indexing very large RNA structure databases
  • Improved multiple structural alignment of RNA sequences
  • RNA classification and clustering
  • RNA-RNA interactions and protein-RNA interactions

SLIDE 56