CSCE 471/871 Lecture 2: Pairwise Alignments

Stephen D. Scott

Outline
• What is a sequence alignment?
• Why should we care?
• How do we do it?
  – Scoring matrices
  – Algorithms for finding optimal alignments
  – Statistically validating alignments

What is a Sequence Alignment?
• Given two nucleotide or amino acid (AA) sequences, determine if they are related (descended from a common ancestor)
• Technically, we can align any two sequences, but not always in a meaningful way
• In this lecture, we’ll focus on AA sequences (more reliable in modeling evolution), but the same alignment principles hold for DNA sequences

What is a Sequence Alignment? (cont’d)
HIGHLY RELATED:
HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
            G+ +VK+HGKKV  A+++++AH+D++ +++++LS+LH  KL
HBB_HUMAN   GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

RELATED:
HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
            ++ ++++H+ KV + +A ++ +L+ L+++H+ K
LGB2_LUPLU  NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG

SPURIOUS ALIGNMENT:
HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
            GS+ + G + +D L ++ H+ D+ A +AL D ++AH+
F11G11.2    GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE

How do we filter out the last one and pick up the second?

Why Should We Care?
• Fragment assembly in DNA sequencing
  – Experimental determination of nucleotide sequences is only reliable up to about 500-800 base pairs (bp) at a time
  – But a genome can be millions of bp long!
  – If fragments overlap, they can be assembled:
      ...AAGTACAATCA
      CAATTACTCGGA...
  – Need to align to detect overlap

Why Should We Care? (cont’d)
• Finding homologous proteins and genes
  – I.e. evolutionarily related (common ancestor)
  – Structure and function are often similar, but this is reliable only if they are evolutionarily related
  – Thus want to avoid spurious alignments such as the HBA_HUMAN / F11G11.2 example shown earlier

How do we do it?
• Choose a scoring scheme
• Choose an algorithm to find optimal alignment wrt scoring scheme
• Statistically validate alignment

Scoring Schemes
• Since goal is to find related sequences, want evolution-based scoring scheme
  – Mutations occur often at the genomic level, but their rates of acceptance by natural selection vary depending on the mutation
  – I.e. changing an AA to one with similar properties is more likely to be accepted
• Assume that all changes occur independently of each other and are Markovian (makes working with probabilities easier): changes occurring now are independent of those in the past

Scoring Schemes (cont’d)
• If AA $a_i$ is aligned with $a_j$, then $a_j$ was substituted for $a_i$
      ...KALM...
      ...KVLM...
• Was this due to an accepted mutation or simply by chance?
  – If A or V is likely in general, then there is less evidence that this is a mutation
• Want the score $s_{ij}$ to be higher if mutations more likely
  – Take ratio of mutation prob. to prob. of AA appearing at random
• Generally, if $a_j$ is similar to $a_i$ in property, then accepted mutation more likely and $s_{ij}$ higher

Scoring Schemes (cont’d)
• Only consider immediate mutations $a_i \to a_j$, not $a_i \to a_k \to a_j$
• Mutations are undirected $\Rightarrow$ scoring matrix is symmetric

The PAM Transition Matrices
• Dayhoff et al. started with several hundred manual alignments between very closely related proteins ($\geq 85\%$ similar in sequence), and manually-generated evolutionary trees
• Computed the frequency with which each AA is changed into each other AA over a short evolutionary distance (short enough where only 1% AAs change)
• 1 PAM = 1% point accepted mutation

The PAM Transition Matrices (cont’d)
• Estimate $p_i$ with the frequency of AA $a_i$ over both sequences, i.e. # of $a_i$’s / number of AAs
• Let $f_{ij} = f_{ji}$ = # of $a_i \leftrightarrow a_j$ changes in data set, $f_i = \sum_{j \neq i} f_{ij}$ and $f = \sum_i f_i$
• Define the scale to be the amount of evolution to change 1 in 100 AAs (on average) [1 PAM dist]
• Relative mutability of $a_i$ is the ratio of number of mutations to total exposure to mutation: $m_i = f_i / (100 f p_i)$
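The definitions of $f_i$, $f$ and $m_i$ can be checked on a small example. The following is a minimal sketch, assuming a hypothetical 3-letter alphabet with made-up counts and frequencies (none of these numbers come from Dayhoff's data); the names `F`, `P`, `change_count` and `relative_mutabilities` are illustrative only.

```python
# Toy 3-letter alphabet; all counts and frequencies below are hypothetical.
ALPHABET = ["A", "V", "L"]

# Symmetric change counts f_ij = f_ji, stored once per unordered pair.
F = {
    ("A", "V"): 30,
    ("A", "L"): 10,
    ("V", "L"): 20,
}

# Background frequencies p_i (must sum to 1).
P = {"A": 0.5, "V": 0.3, "L": 0.2}


def change_count(a, b):
    """Return f_ab, looking the unordered pair up in either order."""
    return F.get((a, b), F.get((b, a), 0))


def relative_mutabilities(alphabet, p):
    """Compute f_i = sum_{j != i} f_ij, f = sum_i f_i and m_i = f_i / (100 f p_i)."""
    f_i = {a: sum(change_count(a, b) for b in alphabet if b != a) for a in alphabet}
    f = sum(f_i.values())
    m = {a: f_i[a] / (100 * f * p[a]) for a in alphabet}
    return f_i, f, m


f_i, f, m = relative_mutabilities(ALPHABET, P)
for a in ALPHABET:
    print(a, f_i[a], round(m[a], 6))

# The expected fraction of AAs that change, sum_i p_i * m_i, comes out as 1 in 100.
print(round(sum(P[a] * m[a] for a in ALPHABET), 12))
```

The factor 100 in the denominator is what fixes the scale: whatever the raw counts are, the average probability of change per residue is 1%.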

The PAM Transition Matrices (cont’d)
• If $m_i$ is probability of a mutation for $a_i$, then $M_{ii} = 1 - m_i$ is prob. of no change
• $a_i \to a_j$ if and only if $a_i$ changes and $a_i \to a_j$ given that $a_i$ changes, so
      $M_{ij} = \Pr(a_i \to a_j) = \Pr(a_i \to a_j \mid a_i \text{ changed}) \Pr(a_i \text{ changed}) = (f_{ij}/f_i)\, m_i = f_{ij} / (100 f p_i)$
• The 1 PAM transition matrix consists of the $M_{ij}$ and gives the probabilities of mutations from $a_i$ to $a_j$

Properties of PAM Transition Matrices
      $\sum_j M_{ij} = \sum_{j \neq i} M_{ij} + M_{ii} = \frac{1}{100 f p_i} \sum_{j \neq i} f_{ij} + \left(1 - \frac{f_i}{100 f p_i}\right) = \frac{f_i}{100 f p_i} + 1 - \frac{f_i}{100 f p_i} = 1$
  [sum of probabilities of changes to an AA + prob of no change = 1]
      $\sum_i p_i M_{ii} = \sum_i p_i - \sum_i \frac{f_i}{100 f} = 1 - \frac{f}{100 f} = 0.99$
  [prob of no change to any AA is 99/100]

What About 2 PAM?
• How about the probability that $a_i \to a_j$ in two evolutionary steps?
• It’s the prob that $a_i \to a_k$ (for any $k$) in step 1, and $a_k \to a_j$ in step 2. This is $\sum_k M_{ik} M_{kj} = (M^2)_{ij}$

k PAM Transition Matrix
• In general, the probability that $a_i \to a_j$ in $k$ evolutionary steps is $(M^k)_{ij}$
• As $k \to \infty$, the rows of $M^k$ tend to be identical with the $i$th entry of each row equal to $p_i$
  – A result of our Markovian assumption of mutation

Building a Scoring Matrix
• When aligning different AAs in two sequences, want to differentiate mutations and random events
• Thus, interested in ratio of transition probability to prob. of randomly seeing new AA:
      $\frac{M_{ij}}{p_j} = \frac{f_{ij}}{100 f p_i p_j} = \frac{M_{ji}}{p_i}$ (symmetric)
• Ratio > 1 if and only if mutation more likely than random event

Building a Scoring Matrix (cont’d)
• When aligning multiple AAs, ratio of probs for multiple alignment = product of ratios:
      $\begin{matrix} a_i & a_k & a_n & \cdots \\ a_j & a_\ell & a_m & \cdots \end{matrix} \;\longrightarrow\; \left(\frac{M_{ij}}{p_j}\right) \left(\frac{M_{k\ell}}{p_\ell}\right) \left(\frac{M_{nm}}{p_m}\right) \cdots$
• Taking logs will let us use sums rather than products
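The construction of the 1 PAM matrix and its stated properties can be verified numerically. The following is a minimal sketch in plain Python, assuming a hypothetical 3-letter alphabet with made-up counts `COUNTS` and frequencies `P`; the helper names `pam1`, `matmul` and `matpow` are illustrative, not from the lecture.

```python
# Toy data: a hypothetical 3-letter alphabet with made-up counts and frequencies.
COUNTS = [
    [0, 30, 10],   # f_ij = f_ji, number of a_i <-> a_j changes (diagonal unused)
    [30, 0, 20],
    [10, 20, 0],
]
P = [0.5, 0.3, 0.2]  # background frequencies p_i, summing to 1


def pam1(counts, p):
    """Build the 1 PAM matrix: M_ij = f_ij / (100 f p_i), M_ii = 1 - m_i."""
    n = len(p)
    f = sum(counts[i][j] for i in range(n) for j in range(n) if i != j)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                m[i][j] = counts[i][j] / (100 * f * p[i])
        m[i][i] = 1.0 - sum(m[i])  # diagonal is still 0 here, so this is 1 - m_i
    return m


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matpow(m, k):
    """Return M^k by repeated squaring (k >= 0)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while k > 0:
        if k & 1:
            result = matmul(result, base)
        base = matmul(base, base)
        k >>= 1
    return result


M = pam1(COUNTS, P)
M2 = matpow(M, 2)          # 2 PAM: (M^2)_ij = sum_k M_ik M_kj
M_LIMIT = matpow(M, 5000)  # large k: each row approaches (p_1, ..., p_n)

print([round(sum(row), 12) for row in M])                  # row sums
print(round(sum(P[i] * M[i][i] for i in range(3)), 12))    # prob. of no change
print([[round(x, 4) for x in row] for row in M_LIMIT])     # rows close to P
```

Repeated squaring is used only to keep the large power cheap; for the small values of $k$ used in practice a simple loop of `matmul` calls would do equally well.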

Building a Scoring Matrix (cont’d)
• Final step: computation faster with integers than with reals, so scale up (to increase precision) and round: $s_{ij} = C \log_2 (M_{ij} / p_j)$
• $C$ is a scaling constant
• For $k$ PAM, use $(M^k)_{ij}$

PAM Scoring Matrix Miscellany
• Pairs of AAs with similar properties (e.g. hydrophobicity) have high pairwise scores, since similar AAs are more likely to be accepted mutations
• In general, low PAM numbers find short, strong local similarities and high PAM numbers find long, weak ones
• Often multiple searches will be run, using e.g. 40 PAM, 120 PAM, 250 PAM
• Altschul (JMB, 219:555–565, 1991) gives discussion of PAM choice

BLOSUM Scoring Matrices
• Based on multiple alignments, not pairwise
• Direct derivation of scores for more distantly related proteins
• Only possible because of new data: multiple alignments of known related proteins

BLOSUM Scoring Matrices (cont’d)
• Started with ungapped alignments from BLOCKS database
• Sequences clustered at $L$% sequence identity
• This time, $f_{ij}$ = # of $a_i \leftrightarrow a_j$ changes between pairs of sequences from different clusters, normalizing by dividing by $(n_1 n_2)$ = product of sizes of clusters 1 and 2
• $f_i = \sum_j f_{ij}$, $f = \sum_i f_i$ (different from PAM)
• Then the scoring matrix entry is
      $s_{ij} = C \log_2 \left( \frac{f_{ij} / f}{p_i p_j} \right)$

BLOSUM 50 Scoring Matrix
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   5  -2  -1  -2  -1  -1  -1   0  -2  -1  -2  -1  -1  -3  -1   1   0  -3  -2   0
R  -2   7  -1  -2  -4   1   0  -3   0  -4  -3   3  -2  -3  -3  -1  -1  -3  -1  -3
N  -1  -1   7   2  -2   0   0   0   1  -3  -4   0  -2  -4  -2   1   0  -4  -2  -3
D  -2  -2   2   8  -4   0   2  -1  -1  -4  -4  -1  -4  -5  -1   0  -1  -5  -3  -4
C  -1  -4  -2  -4  13  -3  -3  -3  -3  -2  -2  -3  -2  -2  -4  -1  -1  -5  -3  -1
Q  -1   1   0   0  -3   7   2  -2   1  -3  -2   2   0  -4  -1   0  -1  -1  -1  -3
E  -1   0   0   2  -3   2   6  -3   0  -4  -3   1  -2  -3  -1  -1  -1  -3  -2  -3
G   0  -3   0  -1  -3  -2  -3   8  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
H  -2   0   1  -1  -3   1   0  -2  10  -4  -3   0  -1  -1  -2  -1  -2  -3   2  -4
I  -1  -4  -3  -4  -2  -3  -4  -4  -4   5   2  -3   2   0  -3  -3  -1  -3  -1   4
L  -2  -3  -4  -4  -2  -2  -3  -4  -3   2   5  -3   3   1  -4  -3  -1  -2  -1   1
K  -1   3   0  -1  -3   2   1  -2   0  -3  -3   6  -2  -4  -1   0  -1  -3  -2  -3
M  -1  -2  -2  -4  -2   0  -2  -3  -1   2   3  -2   7   0  -3  -2  -1  -1   0   1
F  -3  -3  -4  -5  -2  -4  -3  -4  -1   0   1  -4   0   8  -4  -3  -2   1   4  -1
P  -1  -3  -2  -1  -4  -1  -1  -2  -2  -3  -4  -1  -3  -4  10  -1  -1  -4  -3  -3
S   1  -1   1   0  -1   0  -1   0  -1  -3  -3   0  -2  -3  -1   5   2  -4  -2  -2
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   2   5  -3  -2   0
W  -3  -3  -4  -5  -5  -1  -3  -3  -3  -3  -2  -3  -1   1  -4  -4  -3  15   2  -3
Y  -2  -1  -2  -3  -3  -1  -2  -3   2  -1  -1  -2   0   4  -3  -2  -2   2   8  -1
V   0  -3  -3  -4  -1  -3  -3  -4  -4   4   1  -3   1  -1  -3  -2   0  -3  -1   5
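The BLOSUM log-odds entry, and the way a finished matrix is used to score aligned residues, can be sketched as follows. This is an illustration under stated assumptions, not the BLOSUM authors' procedure: the pair counts `COUNTS` and the scaling constant `C` are made up, the cluster weighting step is skipped, and $p_i$ is assumed to be estimated as $f_i / f$ (the slides do not restate how $p_i$ is obtained here). The four BLOSUM 50 entries are copied from the matrix in these slides.

```python
import math

# Hypothetical symmetric pair counts f_ij for a toy alphabet (A, V, L),
# diagonal included; these numbers are made up for illustration.
COUNTS = [
    [60, 15, 5],
    [15, 40, 10],
    [5, 10, 40],
]
C = 2  # hypothetical scaling constant


def blosum_scores(counts, c):
    """Return rounded s_ij = c * log2((f_ij / f) / (p_i * p_j))."""
    n = len(counts)
    f_i = [sum(row) for row in counts]   # f_i = sum_j f_ij
    f = sum(f_i)                         # f = sum_i f_i
    p = [x / f for x in f_i]             # assumption: p_i estimated as f_i / f
    return [[round(c * math.log2((counts[i][j] / f) / (p[i] * p[j])))
             for j in range(n)] for i in range(n)]


# A few BLOSUM 50 entries copied from the matrix in these slides.
BLOSUM50_SUBSET = {("K", "K"): 6, ("A", "V"): 0, ("L", "L"): 5, ("M", "M"): 7}


def pair_score(a, b, table):
    """Look up a symmetric score stored under either ordering of the pair."""
    return table[(a, b)] if (a, b) in table else table[(b, a)]


def ungapped_score(x, y, table):
    """Sum substitution scores column by column for equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b, table) for a, b in zip(x, y))


S = blosum_scores(COUNTS, C)
print(S)
print(ungapped_score("KALM", "KVLM", BLOSUM50_SUBSET))
```

For the aligned fragments ...KALM... and ...KVLM... from the scoring-scheme slides, the column scores are 6 + 0 + 5 + 7 = 18. Because the entries are logarithms of ratios, this sum corresponds to the product of the per-column ratios.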
