CSCE 471/871 Lecture 2: Pairwise Alignments

Stephen Scott
sscott@cse.unl.edu

Outline
What is a sequence alignment?
Why should we care?
How do we do it?
  Scoring matrices
  Algorithms for finding optimal alignments
  Statistically validating alignments

What is a Sequence Alignment?
Given two nucleotide or amino acid sequences, determine if they are related (descended from a common ancestor)
Technically, we can align any two sequences, but not always in a meaningful way
In this lecture, we'll focus on AA sequences, but the same alignment principles hold for DNA sequences

What is a Sequence Alignment? (cont'd)
HIGHLY RELATED:
HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
            G+ +VK+HGKKV  A+++++AH+D++ +++++LS+LH  KL
HBB_HUMAN   GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
RELATED:
HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
            ++ ++++H+ KV   + +A ++           +L+ L+++H+ K
LGB2_LUPLU  NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG
SPURIOUS ALIGNMENT:
HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
            GS+ + G +   +D L  ++ H+ D+  A +AL D    ++AH+
F11G11.2    GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE
How to filter out the last one and pick up the second?
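To see why the closing question is not trivial, the sketch below counts identical columns in the three alignments shown on the slide. The function name `count_identities` and the dictionary layout are illustrative choices, not part of the lecture; the sequences are copied from the slide.

```python
def count_identities(row_a: str, row_b: str) -> int:
    """Count aligned columns where both rows hold the same residue (gap columns never count)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")


# Aligned rows copied from the slide; '-' marks a gap.
ALIGNMENTS = {
    "highly related": (
        "GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL",
        "GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL",
    ),
    "related": (
        "GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL",
        "NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG",
    ),
    "spurious": (
        "GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL",
        "GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE",
    ),
}

IDENTITIES = {name: count_identities(a, b) for name, (a, b) in ALIGNMENTS.items()}

for name, n in IDENTITIES.items():
    print(f"{name}: {n} identical columns")
```

With these rows the spurious alignment has more identical columns than the truly related one, so a plain identity count cannot separate them. That is the motivation for the evolution-based scoring and statistical validation covered in the rest of the lecture.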

Why Should We Care?
Fragment assembly in DNA sequencing
  Experimental determination of nucleotide sequences is only reliable up to about 500-800 base pairs (bp) at a time
  But a genome can be millions of bp long!
  If fragments overlap, they can be assembled:
    ...AAGTACAATCA
    CAATTACTCGGA...
  Need to align to detect overlap

Why Should We Care? (cont'd)
Finding homologous proteins and genes
  I.e., evolutionarily related (common ancestor)
  Structure and function are often similar, but this is reliable only if they are evolutionarily related
  Thus want to avoid the spurious alignment of Slide 4 (HBA_HUMAN vs. F11G11.2)

How do we do it?
Choose a scoring scheme
Choose an algorithm to find the optimal alignment with respect to the scoring scheme
Statistically validate the alignment

Scoring Schemes
Since the goal is to find related sequences, want an evolution-based scoring scheme
Mutations occur often at the genomic level, but their rates of acceptance by natural selection vary depending on the mutation
  E.g., changing an AA to one with similar properties is more likely to be accepted
Assume that all changes occur independently of each other and are Markovian
  ⇒ Changes occurring now are independent of those in the past
  ⇒ Makes working with probabilities easier

Scoring Schemes (cont'd)
If AA $a_i$ is aligned with $a_j$, then $a_j$ was substituted for $a_i$
  ...KALM...
  ...KVLM...
Was this due to an accepted mutation or simply by chance?
If A or V is likely in general, then there is less evidence that this is a mutation
Want the score $s_{ij}$ to be higher if mutation is more likely
  Take the ratio of the mutation prob. to the prob. of the AA appearing at random
Generally, if $a_j$ is similar to $a_i$ in property, then an accepted mutation is more likely and $s_{ij}$ is higher

Scoring Schemes (cont'd)
Only consider immediate mutations $a_i \to a_j$, not $a_i \to a_k \to a_j$
Mutations are undirected
  ⇒ scoring matrix is symmetric

The PAM Transition Matrices
Dayhoff et al. started with several hundred manual alignments between very closely related proteins (≥ 85% similar in sequence), and manually-generated evolutionary trees
Computed the frequency with which each AA is changed into each other AA over a short evolutionary distance (short enough that only 1% of AAs change)
1 PAM = 1% point accepted mutation
  Becomes our measure of evolutionary "time"

The PAM Transition Matrices (cont'd)
Estimate $p_i$ with the frequency of AA $a_i$ over both sequences, i.e., number of $a_i$'s / number of AAs
Let $f_{ij} = f_{ji}$ = number of $a_i \leftrightarrow a_j$ changes in the data set, $f_i = \sum_{j \ne i} f_{ij}$ = number of changes involving $a_i$, and $f = \sum_i f_i$ = number of changes
Define the scale to be the amount of evolution needed to change 1 in 100 AAs (on average) [1 PAM dist]
Relative mutability of $a_i$ is the ratio of the number of mutations to the total exposure to mutation:
$$m_i = \frac{f_i}{100 \, f \, p_i}$$
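As a minimal sketch of the definitions above, the following computes $f_i$, $f$, and the relative mutabilities $m_i$ for a three-letter toy alphabet. The frequencies in `P` and the counts in `PAIR_COUNTS` are made-up numbers chosen only for illustration; real PAM estimates use all 20 amino acids and Dayhoff's curated alignments.

```python
# Toy alphabet and made-up data (assumptions for illustration only).
ALPHABET = ["A", "V", "L"]
P = {"A": 0.5, "V": 0.3, "L": 0.2}                             # p_i: residue frequencies
PAIR_COUNTS = {("A", "V"): 6, ("A", "L"): 2, ("V", "L"): 4}    # f_ij = f_ji, stored once per pair


def changes_involving(residue: str) -> int:
    """f_i: total number of observed changes in which `residue` takes part."""
    return sum(n for pair, n in PAIR_COUNTS.items() if residue in pair)


F_I = {a: changes_involving(a) for a in ALPHABET}   # f_i for every residue
F = sum(F_I.values())                               # f = sum_i f_i, as defined on the slide

# m_i = f_i / (100 * f * p_i)
MUTABILITY = {a: F_I[a] / (100 * F * P[a]) for a in ALPHABET}

for a in ALPHABET:
    print(f"m_{a} = {MUTABILITY[a]:.6f}")
```

The factor of 100 sets the scale: the frequency-weighted average mutability, $\sum_i p_i m_i$, comes out to exactly 1 in 100.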

The PAM Transition Matrices (cont'd)
If $m_i$ is the probability of a mutation for $a_i$, then $M_{ii} = 1 - m_i$ is the prob. of no change
$a_i \to a_j$ if and only if $a_i$ changes and $a_i \to a_j$ given that $a_i$ changes, so
$$M_{ij} = \Pr(a_i \to a_j) = \Pr(a_i \to a_j \mid a_i \text{ changed}) \, \Pr(a_i \text{ changed}) = \frac{f_{ij}}{f_i} \, m_i = \frac{f_{ij}}{100 \, f \, p_i}$$
The 1 PAM transition matrix consists of the $M_{ij}$ and gives the probabilities of mutations from $a_i$ to $a_j$

Properties of PAM Transition Matrices
$$\sum_j M_{ij} = \sum_{j \ne i} M_{ij} + M_{ii} = \frac{1}{100 \, f \, p_i} \sum_{j \ne i} f_{ij} + \left(1 - \frac{f_i}{100 \, f \, p_i}\right) = \frac{f_i}{100 \, f \, p_i} + 1 - \frac{f_i}{100 \, f \, p_i} = 1$$
[sum of probabilities of changes to an AA + prob of no change = 1]
$$\sum_i p_i M_{ii} = \sum_i p_i - \sum_i \frac{f_i}{100 \, f} = 1 - \frac{f}{100 \, f} = 0.99$$
[prob of no change to any AA is 99/100]
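Both properties can be checked numerically. The sketch below builds a 1 PAM matrix from the slide formulas $M_{ij} = f_{ij}/(100 f p_i)$ and $M_{ii} = 1 - f_i/(100 f p_i)$; the three-letter alphabet, the frequencies, and the counts are hypothetical toy values, and `pam1_matrix` is an illustrative helper name.

```python
# Toy data (assumptions for illustration only).
ALPHABET = ("A", "V", "L")
P = {"A": 0.5, "V": 0.3, "L": 0.2}
F_PAIR = {frozenset("AV"): 6, frozenset("AL"): 2, frozenset("VL"): 4}  # undirected counts f_ij


def pam1_matrix(alphabet, p, f_pair):
    """Return the 1 PAM transition matrix as a dict keyed by (from_residue, to_residue)."""
    f_i = {a: sum(f_pair.get(frozenset((a, b)), 0) for b in alphabet if b != a)
           for a in alphabet}
    f = sum(f_i.values())
    m = {}
    for a in alphabet:
        for b in alphabet:
            if a != b:
                m[a, b] = f_pair.get(frozenset((a, b)), 0) / (100 * f * p[a])
        m[a, a] = 1 - f_i[a] / (100 * f * p[a])   # probability of no change
    return m


M = pam1_matrix(ALPHABET, P, F_PAIR)

# Property 1: every row is a probability distribution.
ROW_SUMS = {a: sum(M[a, b] for b in ALPHABET) for a in ALPHABET}
# Property 2: the overall probability of no change is 99/100.
NO_CHANGE = sum(P[a] * M[a, a] for a in ALPHABET)

print(ROW_SUMS)
print(NO_CHANGE)
```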

What About 2 PAM?
How about the probability that $a_i \to a_j$ in two evolutionary steps?
It's the prob. that $a_i \to a_k$ (for any $k$) in step 1, and $a_k \to a_j$ in step 2. This is
$$\sum_k M_{ik} M_{kj} = (M^2)_{ij}$$

k PAM Transition Matrix
In general, the probability that $a_i \to a_j$ in $k$ evolutionary steps is $(M^k)_{ij}$
As $k \to \infty$, the rows of $M^k$ tend to be identical, with the $i$th entry of each row equal to $p_i$
  A result of our Markovian assumption of mutation
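The limiting behaviour can be illustrated with a two-letter toy chain. Assuming frequencies $p = (0.6, 0.4)$ and a single observed change type with $f_{12} = 1$ (so $f_1 = f_2 = 1$ and $f = 2$), the slide formulas give $M_{12} = 1/120$ and $M_{21} = 1/80$. These numbers and the helper names `mat_mul` and `mat_pow` are assumptions for this sketch only.

```python
def mat_mul(x, y):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, k):
    """Compute m**k for an integer k >= 1 by repeated squaring."""
    result = None
    base = m
    while k > 0:
        if k & 1:
            result = base if result is None else mat_mul(result, base)
        base = mat_mul(base, base)
        k >>= 1
    return result


# Toy 1 PAM matrix built from the hypothetical values described above.
P = [0.6, 0.4]
M1 = [[1 - 1 / 120, 1 / 120],
      [1 / 80, 1 - 1 / 80]]

M2 = mat_pow(M1, 2)          # 2 PAM: entry (i, j) is sum_k M_ik * M_kj
M_LARGE = mat_pow(M1, 2048)  # many steps: both rows approach (p_1, p_2)

print(M2)
print(M_LARGE)
```

Repeated squaring is used only to keep the number of multiplications small; multiplying by `M1` in a loop would give the same matrices.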

Building a Scoring Matrix
When aligning different AAs in two sequences, want to differentiate mutations and random events
Thus, interested in the ratio of the transition probability to the prob. of randomly seeing the new AA
$$\frac{M_{ij}}{p_j} = \frac{f_{ij}}{100 \, f \, p_i \, p_j} = \frac{M_{ji}}{p_i} \quad \text{(symmetric)}$$
Ratio > 1 if and only if mutation is more likely than a random event
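A minimal sketch of turning these ratios into scores follows, again on hypothetical three-letter toy data. It takes the natural log of each ratio, which is one possible choice; the base and any scaling or rounding of the logs are assumptions here, not something fixed by the slides so far.

```python
import math

# Toy data (assumptions for illustration only).
ALPHABET = ("A", "V", "L")
P = {"A": 0.5, "V": 0.3, "L": 0.2}
F_PAIR = {frozenset("AV"): 6, frozenset("AL"): 2, frozenset("VL"): 4}


def log_odds_matrix(alphabet, p, f_pair):
    """Scores s_ij = log(M_ij / p_j) for the 1 PAM matrix implied by the counts.

    Assumes every pair of distinct residues has a nonzero count; log(0) would raise.
    """
    f_i = {a: sum(f_pair.get(frozenset((a, b)), 0) for b in alphabet if b != a)
           for a in alphabet}
    f = sum(f_i.values())
    scores = {}
    for a in alphabet:
        for b in alphabet:
            if a == b:
                ratio = (1 - f_i[a] / (100 * f * p[a])) / p[a]             # M_ii / p_i
            else:
                ratio = f_pair[frozenset((a, b))] / (100 * f * p[a] * p[b])  # M_ij / p_j
            scores[a, b] = math.log(ratio)
    return scores


SCORES = log_odds_matrix(ALPHABET, P, F_PAIR)

for a in ALPHABET:
    print(a, [round(SCORES[a, b], 3) for b in ALPHABET])
```

At a distance of 1 PAM almost nothing has changed, so in this toy example the diagonal scores are positive and the off-diagonal scores are strongly negative.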

Building a Scoring Matrix (cont'd)
When aligning multiple AAs, the ratio of probs for the whole alignment = product of the per-column ratios:
$$\begin{matrix} a_i & a_k & a_n & \cdots \\ a_j & a_\ell & a_m & \cdots \end{matrix} \;\longrightarrow\; \left(\frac{M_{ij}}{p_j}\right) \left(\frac{M_{k\ell}}{p_\ell}\right) \left(\frac{M_{nm}}{p_m}\right) \cdots$$
Taking logs will let us use sums rather than products
  ⇒ "Log odds"
  ⇒ Avoid underflow issues
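The underflow point can be demonstrated directly. In the sketch below the odds ratios in `ODDS` are invented values for a two-letter alphabet, and both function names are illustrative; only the contrast between multiplying ratios and summing their logs is the point.

```python
import math

# Hypothetical odds ratios M_ij / p_j for a two-letter toy alphabet (assumption).
ODDS = {("A", "A"): 1.9, ("V", "V"): 3.2, ("A", "V"): 0.02, ("V", "A"): 0.02}


def odds_product(x: str, y: str) -> float:
    """Product of the per-column odds ratios of an ungapped alignment."""
    prod = 1.0
    for a, b in zip(x, y):
        prod *= ODDS[a, b]
    return prod


def log_odds_score(x: str, y: str) -> float:
    """Sum of the per-column log-odds scores of the same alignment."""
    return sum(math.log(ODDS[a, b]) for a, b in zip(x, y))


# Short alignment: both forms agree, since exp(sum of logs) equals the product.
SHORT = ("AVAV", "AVVA")
# Long, poorly matching alignment: the product underflows in double precision.
LONG = ("A" * 500, "V" * 500)

print(odds_product(*SHORT), math.exp(log_odds_score(*SHORT)))
print(odds_product(*LONG), log_odds_score(*LONG))
```

For the long alignment the product collapses to zero while the log-odds sum stays a finite negative number, so alignments can still be compared by score.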
