
CSCE 471/871 Lecture 4: Profile Hidden Markov Models

Stephen D. Scott

1

Introduction

  • Designed to model (profile) a multiple alignment of a protein family (e.g. p. 102)

  • Gives a probabilistic model of the proteins in the family

  • Useful for searching databases for more homologues and for aligning strings to the family

2

Outline

  • Organization of a profile HMM

– Ungapped regions
– Insert and delete states

  • Building a model
  • Searching with HMMs

3

Organization of a Profile HMM

  • Start with a trivial HMM M (not really hidden at this point)

[Figure: a trivial chain B → M_1 → M_2 → · · · → M_L → E, with every transition probability equal to 1]

  • Each match state has its own set of emission probs, so we can compute the prob of a new sequence x being part of this family:

P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)

  • Can, as usual, convert probabilities to log-odds score

4
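The product above is easiest to work with in log space. A minimal Python sketch of the ungapped log-odds score (the emission and background tables are toy values, not from the slides):

```python
import math

def ungapped_log_odds(x, emissions, background):
    """Log-odds score of x under the trivial ungapped profile above.

    emissions[i][a] : e_i(a), emission prob of residue a at match state i
    background[a]   : background frequency q_a
    """
    assert len(x) == len(emissions)  # one match state per residue
    return sum(math.log(emissions[i][a] / background[a])
               for i, a in enumerate(x))

# Toy two-state profile over a two-letter alphabet
e = [{"A": 0.9, "C": 0.1}, {"A": 0.2, "C": 0.8}]
q = {"A": 0.5, "C": 0.5}
print(round(ungapped_log_odds("AC", e, q), 3))  # -> 1.058
```

Summing logs instead of multiplying probabilities avoids numerical underflow for long sequences.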

Organization of a Profile HMM (cont’d)

  • But this assumes ungapped alignments!
  • To handle gaps, consider insertions and deletions

– Insertion: part of x that doesn’t match anything in the multiple alignment (use insert states)

[Figure: insert state I_j with a self-loop, attached between match states M_j and M_{j+1}]

5

Organization of a Profile HMM (cont’d)

  • Deletion: parts of the multiple alignment not matched by any residue in x (use silent delete states)

[Figure: silent delete state D_j providing a path that bypasses match state M_j]

6


General Profile HMM Structure

[Figure: full profile HMM with match, insert, and delete states at each position, between begin state B and end state E]

7

Handling non-Global Alignments

  • Original profile HMMs model entire sequence
  • Add flanking model states (or free insertion modules) to generate non-local residues

[Figure: profile HMM with flanking insert states between B/E and the core model]

8

Outline

  • Organization of a profile HMM
  • Building a model

– Structure
– Estimating probabilities

  • Searching with HMMs

9

Building a Model

  • Given a multiple alignment, how to build an HMM?

– General structure defined, but how many match states?

... V - - - - N V D E V ...
... V E A - - D V A G H ...
... V K G - - - - - - D ...
... V Y S - - T Y E T S ...
... F N A - - N I P K H ...
... I A G A D N G A G V ...
... V G A - - H A G E Y ...

10

Building a Model (cont’d)

  • Given a multiple alignment, how to build an HMM?

– General structure defined, but how many match states?
– Heuristic: if more than half of the characters in a column are non-gaps, include a match state for that column

... V - - - - N V D E V ...
... V E A - - D V A G H ...
... V K G - - - - - - D ...
... V Y S - - T Y E T S ...
... F N A - - N I P K H ...
... I A G A D N G A G V ...
... V G A - - H A G E Y ...

11
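The more-than-half-non-gaps heuristic is a one-pass column scan. A minimal Python sketch, applied to the alignment fragment shown above (function name is my own):

```python
def match_columns(alignment):
    """Apply the heuristic above: a column becomes a match state when
    more than half of its characters are non-gaps ('-' marks a gap)."""
    nrows = len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        non_gaps = sum(row[j] != "-" for row in alignment)
        if non_gaps > nrows / 2:
            cols.append(j)
    return cols

# The alignment fragment from the slide (only the 10 shown columns)
aln = ["V----NVDEV", "VEA--DVAGH", "VKG------D", "VYS--TYETS",
       "FNA--NIPKH", "IAGADNGAGV", "VGA--HAGEY"]
print(match_columns(aln))  # -> [0, 1, 2, 5, 6, 7, 8, 9]
```

Columns 3 and 4 (mostly gaps) become insert states; the other eight columns become match states.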

Building a Model (cont’d)

  • Now, find parameters
  • Multiple alignment + HMM structure → state sequence

– Non-gap in match column → match state
– Gap in match column → delete state
– Non-gap in insert column → insert state
– Gap in insert column → ignore
– Durbin Fig 5.4, p. 109

... V - - - - N V D E V ...
... V E A - - D V A G H ...
... V K G - - - - - - D ...
... V Y S - - T Y E T S ...
... F N A - - N I P K H ...
... I A G A D N G A G V ...
... V G A - - H A G E Y ...

[Figure: the alignment above annotated with states such as M1, D3, and I3]

12
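These four assignment rules can be sketched directly in Python (function name is my own; the match-column set comes from the >50%-non-gap heuristic above):

```python
def state_path(row, match_cols):
    """Assign HMM states to one aligned row using the rules above:
    match columns give M_j or D_j, insert columns give I_j or nothing."""
    path, j = [], 0  # j = index of the most recent match state
    for col, c in enumerate(row):
        if col in match_cols:
            j += 1
            path.append(f"M{j}" if c != "-" else f"D{j}")
        elif c != "-":          # non-gap in an insert column
            path.append(f"I{j}")
        # a gap in an insert column maps to no state at all
    return path

match_cols = {0, 1, 2, 5, 6, 7, 8, 9}  # from the >50%-non-gap heuristic
print(state_path("IAGADNGAGV", match_cols))
print(state_path("VKG------D", match_cols))
```

The first row visits I3 twice (columns 3 and 4); the second row passes through delete states D4 through D7 for its long gap.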


Building a Model (cont’d)

  • Count the number of transitions and emissions and compute:

a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}

  • Still need to beware of some counts = 0

13
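The maximum-likelihood formula above is just count normalization. A minimal sketch with made-up transition counts (the state names are hypothetical):

```python
def normalize_counts(counts):
    """Maximum-likelihood estimate from raw counts, as in the formula
    above: each probability is its count divided by the row total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Toy transition counts out of state M1 (hypothetical names)
A = {"M1->M2": 5, "M1->I1": 1, "M1->D2": 0}
probs = normalize_counts(A)
print(probs["M1->D2"])  # -> 0.0: the zero-count problem noted above
```

An unobserved transition gets probability exactly zero, which would make every sequence using it impossible; that is the problem pseudocounts address.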

Weighted Pseudocounts

  • Let c_{ja} = observed count of residue a in position j of the multiple alignment:

e_{M_j}(a) = \frac{c_{ja} + A q_a}{\sum_{a'} c_{ja'} + A}

  • q_a = background probability of a, A = weight placed on pseudocounts (sometimes use A ≈ 20)

  • Background probabilities are also called a prior distribution

14
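The weighted-pseudocount estimate is a one-liner per residue. A minimal sketch with toy counts over a uniform DNA background (the function name and numbers are my own):

```python
def emission_with_pseudocounts(col_counts, background, A=20.0):
    """Weighted-pseudocount emission estimate from the formula above:
    e_Mj(a) = (c_ja + A*q_a) / (sum_a' c_ja' + A)."""
    total = sum(col_counts.get(a, 0) for a in background) + A
    return {a: (col_counts.get(a, 0) + A * background[a]) / total
            for a in background}

# Column with counts A=6, G=2 over a uniform 4-letter background
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
e = emission_with_pseudocounts({"A": 6, "G": 2}, q, A=4.0)
print(e["C"] > 0)  # unseen residues still get nonzero probability -> True
```

With A = 4 and q uniform, each residue effectively starts with one pseudo-observation, so nothing is assigned probability zero.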

Dirichlet Mixtures

  • Can be thought of as a mixture of pseudocounts
  • The mixture has different components, each representing a different context of a protein sequence

– E.g. in parts of a sequence folded near the protein’s surface, more weight (higher q_a) can be given to hydrophilic residues
– But in other regions, may want to give more weight to hydrophobic residues

  • Will find a different mixture for each position of the alignment, based on the distribution of residues in that column

15

Dirichlet Mixtures (cont’d)

  • Each component k consists of a vector of pseudocounts \vec{\alpha}_k (so \alpha_{ka} corresponds to A q_a) and a mixture coefficient (m_k, for now) that is the probability that component k is selected

  • Pseudocount model k is the “correct” one with probability m_k

  • We’ll set the mixture coefficients for each column based on which vectors best fit the residues in that column

– E.g. the first column of the alignment on slide 10 is dominated by V, so any vector \vec{\alpha}_k that favors V will get a higher m_k

16

Dirichlet Mixtures (cont’d)

  • Let \vec{c}_j be the vector of counts in column j:

e_{M_j}(a) = \sum_k P(k \mid \vec{c}_j) \, \frac{c_{ja} + \alpha_{ka}}{\sum_{a'} \left( c_{ja'} + \alpha_{ka'} \right)}

  • The P(k \mid \vec{c}_j) are the posterior mixture coefficients, which are easily computed [Sjölander et al. 1996], yielding:

e_{M_j}(a) = \frac{X_a}{\sum_{a'} X_{a'}} , \quad where

X_a = \sum_k m_k^0 \exp\left( \ln B\left( \vec{\alpha}_k + \vec{c}_j \right) - \ln B\left( \vec{\alpha}_k \right) \right) \frac{c_{ja} + \alpha_{ka}}{\sum_{a'} \left( c_{ja'} + \alpha_{ka'} \right)}

\ln B(\vec{x}) = \sum_i \ln \Gamma(x_i) - \ln \Gamma\left( \sum_i x_i \right)

17
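The X_a formula above maps directly onto code via lgamma. A minimal Python sketch, assuming a made-up two-component mixture (the vectors and priors are illustrative numbers, not Sjölander's trained components):

```python
from math import lgamma, exp

def ln_B(v):
    """ln B(x) = sum_i ln Gamma(x_i) - ln Gamma(sum_i x_i), as above."""
    return sum(lgamma(x) for x in v) - lgamma(sum(v))

def mixture_emissions(counts, alphas, priors):
    """Posterior-weighted Dirichlet-mixture emissions for one column.

    counts : count vector c_j for the column
    alphas : pseudocount vectors alpha_k, one per mixture component
    priors : prior mixture coefficients m_k^0
    """
    X = [0.0] * len(counts)
    for alpha, m in zip(alphas, priors):
        # unnormalized posterior weight for component k
        w = m * exp(ln_B([a + c for a, c in zip(alpha, counts)]) - ln_B(alpha))
        denom = sum(alpha) + sum(counts)
        for i in range(len(counts)):
            X[i] += w * (counts[i] + alpha[i]) / denom
    Z = sum(X)
    return [x / Z for x in X]

# Toy 2-component mixture over a 3-letter alphabet (illustrative numbers)
e = mixture_emissions([5, 0, 1],
                      alphas=[[2.0, 0.5, 0.5], [0.5, 2.0, 0.5]],
                      priors=[0.5, 0.5])
print([round(p, 3) for p in e])
```

The normalization by Z implements the division by the sum of X_{a'}, so the returned values form a probability distribution dominated by the heavily observed first residue.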

Dirichlet Mixtures (cont’d)

  • Γ is the gamma function, and ln Γ is computed via lgamma and related functions in C

  • m_k^0 is the prior probability of component k (= q in Sjölander Table 1):

. . .

18


Outline

  • Organization of a profile HMM
  • Building a model
  • Searching with HMMs

19

Searching for Homologues

  • Score a candidate match x by using log-odds:

– P(x, π* | M) is the probability that x came from model M via the most likely path π* ⇒ find using Viterbi
– P(x | M) is the probability that x came from model M, summed over all possible paths ⇒ find using the forward algorithm
– score(x) = log( P(x | M) / P(x | null) )
  ∗ “null” is a null model, often the distribution of amino acids in the training set, or the AA distribution over each individual column
  ∗ If x matches M much better than the null model, then the score is large and positive
20

Viterbi Equations

  • V^M_j(i) = log-odds score of the best path matching x_1 ... x_i to the model, where x_i is emitted by state M_j (similarly define V^I_j(i) and V^D_j(i))

  • Rename B as M_0 with V^M_0(0) = 0, and rename E as M_{L+1} (V^M_{L+1} gives the final score)

V^M_j(i) = \log\left( \frac{e_{M_j}(x_i)}{q_{x_i}} \right) + \max \begin{cases} V^M_{j-1}(i-1) + \log a_{M_{j-1} M_j} \\ V^I_{j-1}(i-1) + \log a_{I_{j-1} M_j} \\ V^D_{j-1}(i-1) + \log a_{D_{j-1} M_j} \end{cases}

V^I_j(i) = \log\left( \frac{e_{I_j}(x_i)}{q_{x_i}} \right) + \max \begin{cases} V^M_j(i-1) + \log a_{M_j I_j} \\ V^I_j(i-1) + \log a_{I_j I_j} \\ V^D_j(i-1) + \log a_{D_j I_j} \end{cases}

V^D_j(i) = \max \begin{cases} V^M_{j-1}(i) + \log a_{M_{j-1} D_j} \\ V^I_{j-1}(i) + \log a_{I_{j-1} D_j} \\ V^D_{j-1}(i) + \log a_{D_{j-1} D_j} \end{cases}

  • Similar to Chapter 2’s gapped alignment, but with a position-specific scoring scheme

21
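A minimal Python sketch of this recursion, assuming hypothetical log-odds emission tables (em_M, em_I) and a nested dict of log transition probabilities keyed by state names like "M0", "I1", "D2" (these conventions are my own, not from the slides):

```python
NEG = float("-inf")

def profile_viterbi(x, L, em_M, em_I, a):
    """Viterbi recursion above for a length-L profile HMM.

    em_M[j][c], em_I[j][c] : log(e(c)/q_c) for states M_j / I_j
    a[s][t]                : log transition probability, e.g. a["M0"]["M1"]
    B is renamed M_0 (V^M_0(0) = 0) and E is renamed M_{L+1}.
    """
    n = len(x)
    VM = [[NEG] * (n + 1) for _ in range(L + 1)]
    VI = [[NEG] * (n + 1) for _ in range(L + 1)]
    VD = [[NEG] * (n + 1) for _ in range(L + 1)]
    VM[0][0] = 0.0

    def tr(s, t):  # log a_{s,t}; -inf if the transition is absent
        return a.get(s, {}).get(t, NEG)

    for i in range(n + 1):
        if i >= 1:  # insert states: depend only on column i-1
            for j in range(L + 1):
                VI[j][i] = em_I[j][x[i - 1]] + max(
                    VM[j][i - 1] + tr(f"M{j}", f"I{j}"),
                    VI[j][i - 1] + tr(f"I{j}", f"I{j}"),
                    VD[j][i - 1] + tr(f"D{j}", f"I{j}"))
        for j in range(1, L + 1):  # match/delete states: sweep j upward
            if i >= 1:
                VM[j][i] = em_M[j][x[i - 1]] + max(
                    VM[j - 1][i - 1] + tr(f"M{j-1}", f"M{j}"),
                    VI[j - 1][i - 1] + tr(f"I{j-1}", f"M{j}"),
                    VD[j - 1][i - 1] + tr(f"D{j-1}", f"M{j}"))
            VD[j][i] = max(
                VM[j - 1][i] + tr(f"M{j-1}", f"D{j}"),
                VI[j - 1][i] + tr(f"I{j-1}", f"D{j}"),
                VD[j - 1][i] + tr(f"D{j-1}", f"D{j}"))
    # silent end state M_{L+1}: entered after the last column
    return max(VM[L][n] + tr(f"M{L}", f"M{L+1}"),
               VI[L][n] + tr(f"I{L}", f"M{L+1}"),
               VD[L][n] + tr(f"D{L}", f"M{L+1}"))

# Toy model: one match state emitting "A" with log-odds 0.5, forced path
em_M = [None, {"A": 0.5}]
em_I = [{"A": -1.0}, {"A": -1.0}]
a = {"M0": {"M1": 0.0}, "M1": {"M2": 0.0}}
print(profile_viterbi("A", 1, em_M, em_I, a))  # -> 0.5
```

Insert states are updated first within each column because they depend only on the previous column i-1, while delete states need same-column values at smaller j.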

Forward Equations

F^M_j(i) = \log\left( \frac{e_{M_j}(x_i)}{q_{x_i}} \right) + \log\left[ a_{M_{j-1} M_j} \exp\left( F^M_{j-1}(i-1) \right) + a_{I_{j-1} M_j} \exp\left( F^I_{j-1}(i-1) \right) + a_{D_{j-1} M_j} \exp\left( F^D_{j-1}(i-1) \right) \right]

F^I_j(i) = \log\left( \frac{e_{I_j}(x_i)}{q_{x_i}} \right) + \log\left[ a_{M_j I_j} \exp\left( F^M_j(i-1) \right) + a_{I_j I_j} \exp\left( F^I_j(i-1) \right) + a_{D_j I_j} \exp\left( F^D_j(i-1) \right) \right]

F^D_j(i) = \log\left[ a_{M_{j-1} D_j} \exp\left( F^M_{j-1}(i) \right) + a_{I_{j-1} D_j} \exp\left( F^I_{j-1}(i) \right) + a_{D_{j-1} D_j} \exp\left( F^D_{j-1}(i) \right) \right]

  • exp(·) needed to use sums and logs (can still be fast; see p. 78)

22
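The log-of-sum-of-exps pattern in each forward equation is the standard log-sum-exp operation; done naively it underflows, so the usual trick is to factor out the maximum first. A small sketch (this helper is my own, not from the slides):

```python
from math import log, exp, inf

def log_sum_exp(vals):
    """Numerically stable log(sum_i exp(v_i)): the operation that replaces
    max(...) when turning the Viterbi equations into the forward ones."""
    m = max(vals)
    if m == -inf:          # all incoming paths impossible
        return -inf
    return m + log(sum(exp(v - m) for v in vals))

# Three path probabilities that sum to 1, combined entirely in log space:
print(log_sum_exp([log(0.5), log(0.25), log(0.25)]))  # ~ 0.0 (= log 1)
```

Replacing each max(...) in the Viterbi recursion with log_sum_exp(...) converts the best-path score into the all-paths (forward) score.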

Aligning a Sequence with a Model (Multiple Alignment)

  • Given a string x, use Viterbi to find the most likely path π* and use the state sequence as the alignment

  • More detail in Durbin, Section 6.5

– Also discusses building an initial multiple alignment and HMM simultaneously via Baum-Welch

23

Topic summary due in 1 week!

24