day 3
play

Day 3 ! " 5 slide synopsis of last lecture A: Sequence + - PDF document

Day 3 ! " 5 slide synopsis of last lecture A: Sequence + structure " Covariance Models (CMs) represent conserved RNA sequence/structure motifs " B: the CM guide tree " They allow accurate search " C: probabilities


  1. Day 3 ! " 5 slide synopsis of last lecture A: Sequence + structure " Covariance Models (CMs) represent conserved RNA sequence/structure motifs " B: the CM “guide tree” " They allow accurate search " C: probabilities of letters/ pairs & of indels " But " Think of each branch " a) search is slow " being an HMM emitting " b) model construction is laborious " both sides of a helix (but 3’ side emitted in reverse order) " 56 57 Example: CM Viterbi Alignment ! searching for (the “inside” algorithm) " tRNAs y = max " log P ( x ij generated starting in state y via path " ) S ij % z y max z [ S i + 1, j # 1 + log T yz + log E x i , x j ] match pair ' y ] z max z [ S i + 1, j + log T yz + log E x i match/insert left ' ' y = y ] S ij z & max z [ S i , j # 1 + log T yz + log E x j match/insert right ' z max z [ S i , j + log T yz ] delete ' y right ] y left + S k + 1, j ' max i < k $ j [ S i , k bifurcation ( Time O(qn 3 ), q states, seq len n Time O(qn 3 ), q states, seq len n compare: O(qn) for profile HMM 58 59 Example Rfam Family " Today’s Goals " IRE (partial seed alignment): ! Input (hand-curated): " Faster Search " Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC MSA “seed alignment” " Infernal & RaveNnA " Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC SS_cons " Automated Model-building " Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Score Thresh T " Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU CMfinder " Window Len W " Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG Output: " Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC CM " Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC scan results & “full Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC alignment” " Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC SS_cons <<<<<...<<<<<......>>>>>.>>>>> 60 61

  2. Impact of RNA homology search " (Barrick, et al., 2004) � Homology search " glycine operon � riboswitch � Sequence-based " B. subtilis � Smith-Waterman " L. innocua � FASTA " BLAST " A. tumefaciens � V. cholera � Sharp decline in sensitivity at ~60-70% identity " M. tuberculosis � So, use structure, too " (and 19 more species) � 64 65 Impact of RNA homology search " (Barrick, et al., 2004) � (Mandal, et al., 2004) � glycine operon � riboswitch � Faster Genome Annotation ! B. subtilis � of Non-coding RNAs ! L. innocua � Without Loss of Accuracy " A. tumefaciens � Zasha Weinberg " V. cholera � & W.L. Ruzzo " M. tuberculosis � Recomb ‘04, ISMB ‘04, Bioinfo ‘06 " (and 42 more species) � (and 19 more species) � BLAST-based CM-based 66 RaveNnA: Genome Scale ! CM’s are good, but slow ! RNA Search " Rfam Reality Our Work Rfam Goal Typically 100x speedup over raw CM, w/ no loss in accuracy: " EMBL EMBL EMBL " " Drop structure from CM to create a (faster) HMM " " Use that to pre-filter sequence; " BLAST Ravenna " Discard parts where, provably, CM score < threshold; " " Actually run CM on the rest (the promising parts) " " Assignment of HMM transition/emission scores is key " CM CM CM " " (a large convex optimization problem) " junk junk hits hits hits 1 month, ~2 months, 10 years, Weinberg & Ruzzo, Bioinformatics , 2004, 2006 1000 computers 69 1000 computers 1000 computers 71

  3. Covariance ! Oversimplified CM ! (for pedagogical purposes only) " Model " Key difference of CM vs HMM: A A C C Pair states emit paired symbols, G G corresponding to base-paired U U – – nucleotides; 16 emission probabilities here. A A C C G G U U – – 72 73 CM to HMM " Key Issue: 25 scores " 10 " CM HMM CM HMM A A A A A A A A C C C C C C C C P L R G G G G G G G G U U U U U U U U – – – – – – – – A A A A C C C C G G G G Need: log Viterbi scores CM ! HMM " U U U U – – – – 25 emisions per state 5 emissions per state, 2x states 74 75 P AA ! L A + R A P AC ! L A + R C Key Issue: 25 scores " 10 " Rigorous Filtering " P AG ! L A + R G P AU ! L A + R U CM HMM P A– ! L A + R – … Any scores satisfying the linear inequalities A A A A C C C C P L R give rigorous filtering ! G G G G U U U U – – – – NB: HMM not a prob. model Proof: ! Need: log Viterbi scores CM ! HMM " CM Viterbi path score ! ! “corresponding” HMM path score ! P AA ! L A + R A P CA ! L C + R A … P AC ! L A + R C P CC ! L C + R C … ! Viterbi HMM path score ! P AG ! L A + R G P CG ! L C + R G … (even if it does not correspond to any CM path) " P AU ! L A + R U P CU ! L C + R U … P C– ! L C + R – … P A– ! L A + R – 77 78

  4. Some scores filter better " Assignment of scores/ “probabilities” " P UA = 1 ! L U + R A " Convex optimization problem " P UG = 4 ! L U + R G " Constraints: enforce rigorous property " Objective function: filter as aggressively as " " " " " " Assuming ACGU # 25% " possible " Option 1: " " " " Opt 1: " Problem sizes: " L U = R A = R G = 2 " " L U + (R A + R G )/2 = 4 " 1000-10000 variables " Option 2: " " " " Opt 2: " 10000-100000 inequality constraints " L U = 0, R A = 1, R G = 4 " L U + (R A + R G )/2 = 2.5 " 79 83 Estimated Filtering Efficiency ! “Convex” Optimization " (139 Rfam 4.0 families) " Filtering # families # families Convex: ! Nonconvex: ! fraction " (compact) " (expanded) " local max = global max; " can be many local maxima, ≪ global max; ! simple “hill climbing” works " < 10 -4 " 105 " 110 " # break even ~100x “hill-climbing” fails " 10 -4 - 10 -2 " 8 " 17 " speedup .01 - .10 " 11 " 3 " .10 - .25 " 2 " 2 " .25 - .99 " 6 " 4 " .99 - 1.0 " 7 " 3 " 84 85 Averages 283 times faster than CM � Results: new ncRNAs (?) " Heuristic Filters " Name " # Known " # New " Rigorous filters optimized for worst case " (BLAST + CM) " (rigorous filter + CM) " Pyrococcus snoRNA " 57 " 123 " Possible to trade improved speed for small Iron response element " 201 " 121 " loss in sensitivity? " Histone 3’ element " 1004 " 102* " Yes – profile HMMs as before, but optimized Retron msr " 11 " 48 " Hammerhead I " 167 " 26 " for average case " Hammerhead III " 251 " 13 " Often 10x faster, modest loss in sensitivity " U6 snRNA " 1462 " 2 " U7 snRNA " 312 " 1 " cobalamin riboswitch " 170 " 7 " 13 other families " 5-1107 " 0 " 87 102

  5. Software " CM Search Summary " Ravenna implements both rigorous and Still slower than we might like, but dramatic heuristic filters " speedup over raw CM is possible with: " Infernal (engine behind Rfam) implements No loss in sensitivity (provably), or " heuristic filters and some other accelerations " Even faster with modest (and estimable) loss in sensitivity " E,g., dynamic “banding” of dynamic programming matrix based on the insight that large deviations from consensus length must have low scores. " 107 108 RNA Motif Discovery " RNA Motif Discovery " CM’s are great, but where do they come from? " Typical problem: given a 10-20 unaligned An approach: comparative genomics " sequences of 1-10kb, most of which contain Search for motifs with common secondary structure in a instances of one RNA motif of 100-200bp set of functionally related sequences. " -- find it. " Challenges " Example: 5’ UTRs of orthologous glycine Three related tasks " Locate the motif regions. " cleavage genes from $ -proteobacteria " Align the motif instances. " Example: corresponding introns of Predict the consensus secondary structure. " Motif search space is huge! " orthogolous vertebrate genes " Motif location space, alignment space, structure space. " 115 117 Pitfall for sequence-alignment- Approaches " first approach " Align-First: Align sequences, then look for Structural conservation ≠ Sequence conservation " Alignment without structure information is unreliable " common structure " CLUSTALW alignment of SECIS elements with flanking regions Fold-First: Predict structures, then try to align them " Joint: Do both together " same-colored boxes should be aligned 118 120

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend