Efficient Implementation of a Generalized Pair HMM for Efficient - PowerPoint PPT Presentation

Efficient Implementation of a Generalized Pair HMM for Efficient Implementation of a Generalized Pair HMM for Comparative Gene Finding Comparative Gene Finding B. Majoros M. Pertea S.L. Salzberg

ab initio (optional) gene finder genome 1 MUMmer ROSE Whole-genome Region-Of-Synteny alignment Extractor ) l a n o i t p o ab initio ( gene finder OASIS Generalized Pair HMM genome 2 predictions for predictions for genome 1 genome 2

What is MUMmer? • Uses suffix trees to efficiently align whole genomes • Can align in nucleotide space (NUCmer) or amino acid space (PROmer) • Described in: Kurtz,S., Phillippy,A., Delcher,A.L., Smoot,M., Shumway,M., Antonescu,C. and Salzberg,S.L. (2004) Versatile and open software for comparing large genomes. Genome Biol. , 5 , R12.

Sample MUMmer Output

ROSE “Region-Of-Synteny Extractor” • Sole purpose is to identify (and preprocess) likely orthologous regions so they can be fed to OASIS Species 1 syntenic Feed these to regions the GPHMM Species 2

How ROSE Works • ROSE performs maximal-length clustering of consistent MUMmer hits • If ab initio predictions are available, ROSE uses them to extend the boundaries of clusters to avoid interrupting probable genes

Three Types of Inconsistency + + + + + + + + Consistent Inconsistent – mapping to different locations + + - - + + + - Inconsistent – mapping Inconsistent – mapping to inconsistent strands to different contigs

Resolving the Inconsistencies When ROSE determines that two MUMmer hits are not consistent, it does the following: 1)It refuses to cluster them together 2)If the difference in alignment scores for the two regions is not great, ROSE supplies both regions to OASIS via separate OASIS runs (even if the regions overlap) 3)If the difference in alignment scores is large, ROSE discards the MUMmer hit with the lower score

Model Topology intron x2 x 3 x 3 donor acceptor x2 x2 site site internal x2 exon final initial single x2 x2 x2 exon exon exon start stop x2 x2 x2 3’-UTR codon codon x2 5’-UTR poly-A x2 promoter signal x2 + strand intergenic - strand x2

Conservation of Exon Structure Our formulation of GPHMM operation assumes that orthologues have equal numbers of exons. This assumption does not hold in many cases. Many Aspergillus orthologs have unequal numbers of exons: This is a shortcoming that we hope to address in the near future.

Background: GHMM Decoding Finding the optimal parse, φ max : φ P ( , S ) arg max arg max φ = φ = P ( | S ) φ φ max P ( S ) arg max arg max = φ = φ φ P ( , S ) P ( S | ) P ( ) φ φ − n 1 arg max ∏ = P ( S | q , d ) P ( q | q ) P ( d | q ) φ − e i i i t i i 1 d i i = i 1 emission transition duration

Generalizing to Pairs of Features The emission probability P e ( S i | q i ,d i ) can be replaced by a paired emission probability, ψ e ( S i, 1 ,S i, 2 |q i ,d i, 1 ,d i, 2 ). The duration probability can likewise be replaced by a paired duration probability, ψ d ( d i, 1 ,d i, 2 |q i ): argmax ∏ ψ ⋅ ⋅ ψ ( S , S | q , d , d ) P ( q | q ) ( d , d | q ) φ − e i , 1 i , 2 i i , 1 i , 2 t i i 1 d i , 1 i , 2 i ∈ φ q i paired emission transition paired duration alignment Markov chain where: ψ e ( S i, 1 ,S i, 2 |q i ,d i, 1 ,d i, 2 ) = P e ( S i, 1 |q i ,d i, 1 ) ⋅ P cond ( S i, 2 | S i, 1 ,q i ,d i, 2 ) (if we ignore any dependence of S i, 1 on d i, 2 and of S i, 2 on d i, 1 ). P e ( S i, 1 |q i ,d i, 1 ) can be evaluated using the standard GHMM methods. ψ d ( d i, 1 ,d i, 2 |q i ) can be estimated using an empirical distribution of | d i, 1 - d i, 2 | values, or more roughly using P d ( d i,1 ) or P d ( d i,2 ). What remains is to evaluate P cond ( S i, 2 | S i, 1 ,q i ,d i, 2 )...

Efficiently Estimating P cond P cond ( S i, 2 | S i, 1 ,q i ,d i, 2 ) can be (very roughly) estimated from the approximate alignment score, P ident : ( ) ( ) − P L ( 1 P ) L = P ( S | S , q , d ) P P ident ident e i , 2 i , 1 i i , 2 match mismatch where L is the alignment length, P match is a parameter to the GPHMM (probability of a match), and P mismatch =1 - P match . For putative noncoding features, P ident is the percent identity estimated from the global NUCmer alignment. For coding features, P ident is the percent similarity (counting a BLOSUM score>0 as a similar pair of residues) estimated from the nearest PROmer (amino acid) alignment.

OASIS Implementation region 1 build parse graph sparse Align parse “guide” alignment graphs alignments matrix region 2 build parse graph Extract best alignment inputs from ROSE predictions for predictions for genome 1 genome 2

What is a Parse Graph? • represents all high-scoring ORFs • each vertex is a signal and each edge is a feature such as an exon or intron • a complete path through the graph from left to right gives a single gene prediction • can be used to explore sub-optimal gene models • when our GHMM’s prediction is not exactly correct, the true gene model is often one of the top few sub-optimal parses.

Building Parse Graphs Our parse graphs are built using a special GHMM decoding algorithm that uses very little memory while also pruning away unpromising subgraphs: coding noncoding coding noncoding coding absolute 012012012012012012012012012012012012012 frame: ATG GT AG GT AG TAG sequence: 01201 201 2012012 gene phase: 0 ······································· model phase 1 ······································· 2 ······································· · · · · · · accum ∞ ∞ ∞ · · · ∞ ∞ ∞ · · · ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ · · · · · prop ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ · · · · · · · · ∆ =2 ∆ =0 ∆ =1

GHMM Memory Requirements Our GHMM’s memory requirements increase linearly, as do Genscan’s, but with a much smaller constant factor: Memory Time Genscan 445 Mb 2:57 Our GHMM 29 Mb 1:28 Aspergillus contig: 922,000 bases

Building Parse Graphs Parse graphs for the two genomes are built independently. The graphs are weighted: edges are scored by Markov chains and vertices are scored by weight matrices.

Aligning Parse Graphs The two parse graphs are aligned using a global alignment algorithm. The optimal alignment corresponds to the chosen pair of syntenic gene predictions. The alignment is constrained by the topologies of the two parse graphs: 1.Only like signals can align, and 2.Two signals can align only if they have neighbors which also align 3.Standard phase constraints apply

Sparse Alignment Matrix “Guide” alignments provided by MUMmer via ROSE dictate which portions of the alignment matrix to allocate. A user- specified ∆ influences the sparseness of the matrix. Both coding (PROmer) and noncoding (NUCmer) guide alignments are used. Coding HSPs are extended to include maximal ORFs.

The Alignment Trellis An alignment “trellis” is formed by taking the cartesian product of edges in the parse graphs leading into corresponding vertices.

Scoring the Trellis (in log space) OASIS matrix A ATCGATGCTTAT S 1 A inductive = argmax B [ B inductive + P transition + P duration + P emission (S 1 ) + P cond (S 2 |S 1 ) ] B GATCGCGATATCTAGCT S 2

Approximating P ident from Alignment The “guide” alignment (red) is superimposed onto the OASIS matrix. P ident = (µ z -µ y )/( λ z - λ y + δ A + δ B ), The alignment cells store µ=cumulative #matches prefix sums of alignment λ =cumulative length scores. Jumps to and from the guide alignment incur indel penalties δ A and δ B . The remaining portion of the alignment is scored in constant time by simple subtraction.

Approx. vs. Exact Alignment Correlation between Needleman-Wunsch score (%) approximate alignment scores (x) and full Needleman-Wunsch alignment scores (y), using percent identity as the alignment score in 2 9 . 0 both cases. = n o i t a l e Correlation coefficient r r o c was 0.92 for points above (50%,50%). However, there is much variability, and scores are underestimated at the low end (slope<1). Approximate alignment score (%)

Efficient Implementation of a Generalized Pair HMM for Efficient - PowerPoint PPT Presentation

Efficient Implementation of a Generalized Pair HMM for Efficient Implementation of a Generalized Pair HMM for Comparative Gene Finding Comparative Gene Finding B. Majoros M. Pertea S.L. Salzberg ab initio (optional) gene finder genome 1

Cell implementation HMM (HMM hidden Markov model) Authors: Jakub Hork Ji Hona

Introduction to Hmm Introduction to Hmm Joe Wu Nov 4 th 2011 Agenda The applications of HMM.

Using HMM to Blur the Lines between CPU and GPU Programming John Hubbard, May 10, 2017

ROUNDERS (1998) CASINO ROYALE (2006) HAND RANKINGS HIGH CARD HAND RANKINGS PAIR HIGH CARD

Closest Pair of Points Cormen et.al 33.4 Closest Pair of Points Closest pair. Given n points in

A Talk on Protein Homology Detection by HMM-HMM comparisons[1] Sding, J Qing Ye Department of

The Hidden Markov The Hidden Markov Model (HMM) Model (HMM) 1 Lecture Outline Lecture Outline

Fast TwoLevel Fast TwoLevel HMM Decodi HMM Decoding ng Algor gorithm for thm for Large

Global Robot Ego-Localization C Combining Image Retrieval and HMM- bi i I R i l d HMM

ANLP Lecture 9: Algorithms for HMMs Sharon Goldwater 4 Oct 2019 Recap: HMM Elements of HMM:

Generalized MPLS Signaling draft-ietf-mpls-generalized-signaling-05.txt

Overview of logistic regression Richard Erickson Instructor DataCamp Generalized Linear Models

On Accelerating Pair-HMM Computations in Programmable Hardware Contributions Design and

On the Complexity of Closest Pair via Polar-Pair of Point-Sets Bundit Laekhanukit

Scala Generics Type Parameterization 1 / 16 Scala Generics Consider: 1 class Pair[T, S](val

Pair HMMs and Pairwise Sequence Alignment COMP 571 Luay Nakhleh, Rice University Pair HMMs

Against an Increasingly User-Hostile Web We're quietly replacing an open web that connects and

Partnership Working: Creating the Possible (Circles Network, 2001) A Parent / Carers

Welcome Conference on supporting people with long term health conditions 16 th November 2017

Anatomy is Not Destiny: Creating Eyeglasses for the Mind Gerhard Fischer Center for LifeLong

Nutritional immunity and fungal infections S.J. Park et al., J. Immunol ., 2006 3 Y.-S. Sun et

Statistical mechanics of fitness landscapes Joachim Krug Institute for Theoretical Physics,

Pictures: all pictures: Hajo Seng except: p. 4: Leif Ekblad, Aspie Quiz,

SELF-AWARENESS OR, LOVING THE ALIEN IDENTITY Self- awareness encompasses perceptions,