More complex scoring functions Until now: Bioinformatics Algorithms - PDF document

More complex scoring functions Until now: Bioinformatics Algorithms • match, mismatch, gap (linear gap functions) (Fundamental Algorithms, module 2) • match, mismatch, gap open, gap extend (a ffi ne gap functions) • i.e. f ( a , b ) depends only on a = b or a 6 = b But: Zsuzsanna Lipt´ ak • For protein sequences, better to di ff erentiate between di ff erent pairs Masters in Medical Bioinformatics of AAs a and b , i.e. depending on how close / how di ff erent they are. academic year 2018/19, II. semester • Reason: homologous proteins often have di ff erent AAs in same position. If only match/mismatch are evaluated, then many Scoring Matrices homologous proteins are not found. So now: • f ( a , b ) depends on a and b • necessarily: f ( a , b ) = f ( b , a ) (symmetry) 2 / 14 Scoring matrices Scoring matrices • Scoring matrix S of dimension 20 ⇥ 20 (for protein), • Scoring matrix S of dimension 20 ⇥ 20 (for protein), also possible: dim. 4 ⇥ 4 (for DNA) also possible: dim. 4 ⇥ 4 (for DNA) • S ab = f ( a , b ) gives the similarity of a and b • S ab = f ( a , b ) gives the similarity of a and b • Similarity could be defined by 1. similarity of codon (DNA-level), e.g. min { dist Hamming ( xyz , uvw ) : xyz codon for a and uvw codon for b } 2. physico-chemical properties (hydrophobicity, size, basic/acidic, . . . ) 3. based on empirical data: How frequently do we observe this change? 3 / 14 3 / 14 Scoring matrices Basic idea: • S ab > 0 : probability that b has mutated into a at this evolutionary distance is greater than chance • Scoring matrix S of dimension 20 ⇥ 20 (for protein), also possible: dim. 4 ⇥ 4 (for DNA) • S ab = 0 : the two probabilities are equal (we cannot say anything) • S ab = f ( a , b ) gives the similarity of a and b • S ab < 0 : probability that b has been aligned to a by chance is greater • Similarity could be defined by than the probability that this is a true mutation 1. similarity of codon (DNA-level), e.g. min { dist Hamming ( xyz , uvw ) : xyz codon for a and uvw codon for b } 2. physico-chemical properties (hydrophobicity, size, basic/acidic, . . . ) 3. based on empirical data: How frequently do we observe this change? • PAM matrices: Scoring matrices based on empirical data (Margret Dayho ff , 1978) • PAM = Point Accepted Mutation (or: Percent Accepted Mutation) 3 / 14 4 / 14

PAM scoring matrices Basic idea: • S ab > 0 : probability that b has mutated into a at this evolutionary distance is greater than chance • S ab = 0 : the two probabilities are equal (we cannot say anything) • family of matrices: PAM k (for any k � 1), common are PAM40, PAM120, PAM250 • S ab < 0 : probability that b has been aligned to a by chance is greater • PAM k : k is the evolutionary distance between the sequences to be than the probability that this is a true mutation scored; needs to be guessed before scoring Meaning of ”by chance”: • higher k : applied to more distant / less closely related sequences / • We are comparing two probabilities species • prob1: that a and b are aligned together because there has been a • the scoring matrix PAM k is not a probability matrix series of mutations changing b into a • it is based on a probability matrix • prob2: that a and b have been aligned together by chance (e.g. if in the database all sequences consist only of a ’s, then the probability that a is there in a random alignment is 1) 4 / 14 5 / 14 Mutation probability at higher distances: M k Mutation probability matrix • How about the probability that b changes into a in 2 steps? • possibilities are: • Dayho ff et al. generated mutation probability matrix M (PAM1 mutation matrix) based on empirical data: a large set of aligned time step 1 time step 2 sequences which are known to be homologous (”trusted alignments”) a unchanged b ! a • M ab = probability that AA b will change into AA a in one time step b unchanged b ! a c 6 = a , b : b ! c c ! a • this probability is only estimated, based on observed data • one time step = 1 PAM unit evolutionary distance = 1 mutation every 100 AAs on average a 2 Σ M ab = 1 (sum over each column equals 1) 1 • P 1 in some areas of maths prob. matrices are defined di ff erently: P a , b = prob. that a turns into b , i.e. the transpose of M ; then the sum over the rows is 1 6 / 14 7 / 14 Mutation probability at higher distances: M k Computation of the scoring matrices • How about the probability that b changes into a in 2 steps? • the PAM scoring matrices are ”log-odds” matrices • possibilities are: • odds: compare two probabilities • log: take the logarithm (product ! sum) time step 1 time step 2 • PAM k scoring matrix: b ! a a unchanged b unchanged b ! a • take M k • M k c 6 = a , b : b ! c ab = Prob( b changed into a in k steps) c ! a • compare to: Prob( a is there by chance) = p a • Prob( b changes into a in 2 steps) p a = relative frequency of a , c 2 Σ M ac M cb = M 2 = M ab · M aa + M bb · M ab + P c 6 = a , b M cb M ac = P ab e.g. if the DB is: { aabc , abca } , then p a = 1 / 2 , p b , p c = 1 / 4 • M 2 ab is just the entry a , b , i.e. row a and column b , of the product • take log (base 10), multiply by 10 (for nicer numbers), round to matrix M 2 = M · M (matrix multiplication)—and not the real number nearest integer: M ab squared! • in general: M k contains the probabilities for k steps, i.e. M k S ab = 10 · log 10 ( M k ab = prob. ab ) rounded to nearest int. that b has mutated into a after k steps p a 7 / 14 8 / 14

Computation of the scoring matrices Computation of the scoring matrices S ab = 10 · log 10 ( M k S ab = 10 · log 10 ( M k p a ) ab p a ) ab 8 8 > 1 if M k ab > p a > 1 if M k ab > p a M k M k > > < < ab ab if M k if M k = 1 ab = p a = 1 ab = p a p a p a > if M k > if M k < 1 ab < p a < 1 ab < p a : : Therefore 8 if M k > 0 ab > p a i.e. if prob1 is greater than prob2 > < if M k S ab = 0 ab = p a i.e. if they are equal > if M k < 0 ab < p a i.e. if prob2 is greater than prob1 : Note that scoring matrices are symmetrical (but not the prob. matrices). 9 / 14 9 / 14 Why use logarithm? We use logarithms for computational reasons: • since log is strictly monotonically increasing, one can replace all x with log x : We have x < y if and only if log x < log y . • products of probs ! sums of log-of-probs • easier to compute sums than products of very small numbers (note that all probabilities are between 0 and 1): reduce rounding errors 10 / 14 11 / 14 Two caveats BLOSUM matrices BLOSUM scoring matrices (Heniko ff and Heniko ff , 1992) • other family of commonly used scoring matrices PAM matrices use two silent assumptions: • remedies second issue: uses no underlying evolutionary model 1. mutations (changes) of AAs happen independently (i.e. independent • same principle as PAM matrices, but: of context): scoring by individual columns • used di ff erent sets of aligned sequences for di ff erent distances 2. uses an evolutionary model: k distance = k identical steps (i.e. with • BLOSUM m : only used sequences that had m % identity or less same probabilites) • higher number ˆ = closer related • common: BLOSUM 45, 62, 80; BLOSUM62 ⇠ PAM120 12 / 14 13 / 14

Summary PAM matrices • allow scoring di ff erent AA pairs according to evolutionary relatedness • di ff erent PAM k acc. to evolutionary distance • all modern AA scoring matrices are based on empirical data: observed frequencies in trusted alignment data • the probabilities are estimated probabilites of AAs (from the data) • mutation probability matrix M (1 step = 1 PAM unit) M k mutation probability matrix for k steps ( k PAM units) PAM k scoring matrix S (log-odds matrix) • higher number ˆ = less related ˆ = more distant • commonly used: PAM40, PAM120, PAM160, PAM250 • k in PAM k needs to be decided before scoring • BLOSUM: similar to PAM but higher number ˆ = more related 14 / 14

More complex scoring functions Until now: Bioinformatics Algorithms - PDF document

More complex scoring functions Until now: Bioinformatics Algorithms match, mismatch, gap (linear gap functions) (Fundamental Algorithms, module 2) match, mismatch, gap open, gap extend (a ffi ne gap functions) i.e. f ( a , b ) depends

Exercise 8: Scoring Exercise 8: Scoring FLUKA Beginners Course Exercise 8: Scoring Aim of the

Mountain High Swim League Scoring Presentation 2018 Scoring Committee 1 MHSL Scoring Training

Exercise 8: Scoring FLUKA Beginners Course Exercise 8: Scoring Aim of the exercise: 1- Add

Complex Numbers Complex Numbers 1 / 19 Complex Numbers Complex numbers ( C ) are an extension of

Welcome to Scoring the ACIRI a Job Aid. 1 This job aid provides a brief review of the scoring

Investment Board April 21, 2014 Agenda UW-IT Portfolio Scoring Process Scoring Results

Mobile Credit Scoring: Powering Consumer Finance in Emerging Markets SUMMARY Credit Scoring

SI Scoring Guide SUBORDINATION INDEX USING SALT Discuss the scoring rules SALT SOFTWARE, LLC

Intermembrane Space H + H + Cyt c Co Q Complex Complex III IV H + ATPase H + Complex

More on Functions Thomas Schwarz, SJ Marquette University Functions of Functions Functions

Learn more Do more Be more Learn more Do more Be more UNITY Learn more Do

Higher order functions Functions that take other functions as parameters Easily supported

An introduction to complex numbers The complex numbers Are the real numbers not sufficient? A

Defect Detection Thomas Zimmermann The First Bug September 9, 1947 More Bugs More Bugs More

Why Transformers Work. More info blablabla More info blablabla More info blablabla More

Using Complex Formulas, Functions, and Tables Objectives Create complex formulas Use

CSCE 471/871 Lecture 2: Pairwise Alignments Why should we care? How do we do it? Stephen

The absolutely neutralizing coalescence theory of mutation Paul de Lacy Rutgers University

Longest Cycle Crossover for Solving the Capacitated Vehicle Routing Problem Depar artment ment

Purity Dependant Markov Models for Microsatellite Mutation Tristan L. Stark University of

Detecting Self-Mutating Malware Using Control-Flow Graph Matching Danilo Bruschi Lorenzo

CSE 331 Mutation and immutability slides created by Marty Stepp based on materials by M. Ernst,

CS 251 Fall 2019 CS 251 Fall 2019 Topics Principles of Programming Languages Principles

In a world where bindings and values Immutability: obstacle or tool? are

More complex scoring functions Until now: Bioinformatics Algorithms - PDF document

More complex scoring functions Until now: Bioinformatics Algorithms match, mismatch, gap (linear gap functions) (Fundamental Algorithms, module 2) match, mismatch, gap open, gap extend (a ffi ne gap functions) i.e. f ( a , b ) depends

Exercise 8: Scoring Exercise 8: Scoring FLUKA Beginners Course Exercise 8: Scoring Aim of the

Mountain High Swim League Scoring Presentation 2018 Scoring Committee 1 MHSL Scoring Training

Exercise 8: Scoring FLUKA Beginners Course Exercise 8: Scoring Aim of the exercise: 1- Add

Complex Numbers Complex Numbers 1 / 19 Complex Numbers Complex numbers ( C ) are an extension of

Welcome to Scoring the ACIRI a Job Aid. 1 This job aid provides a brief review of the scoring

Investment Board April 21, 2014 Agenda UW-IT Portfolio Scoring Process Scoring Results

Mobile Credit Scoring: Powering Consumer Finance in Emerging Markets SUMMARY Credit Scoring

SI Scoring Guide SUBORDINATION INDEX USING SALT Discuss the scoring rules SALT SOFTWARE, LLC

Intermembrane Space H + H + Cyt c Co Q Complex Complex III IV H + ATPase H + Complex

More on Functions Thomas Schwarz, SJ Marquette University Functions of Functions Functions

Learn more Do more Be more Learn more Do more Be more UNITY Learn more Do

Higher order functions Functions that take other functions as parameters Easily supported

An introduction to complex numbers The complex numbers Are the real numbers not sufficient? A

Defect Detection Thomas Zimmermann The First Bug September 9, 1947 More Bugs More Bugs More

Why Transformers Work. *More info blablabla *More info blablabla *More info blablabla *More

Using Complex Formulas, Functions, and Tables Objectives Create complex formulas Use

CSCE 471/871 Lecture 2: Pairwise Alignments Why should we care? How do we do it? Stephen

The absolutely neutralizing coalescence theory of mutation Paul de Lacy Rutgers University

Longest Cycle Crossover for Solving the Capacitated Vehicle Routing Problem Depar artment ment

Purity Dependant Markov Models for Microsatellite Mutation Tristan L. Stark University of

Detecting Self-Mutating Malware Using Control-Flow Graph Matching Danilo Bruschi Lorenzo

CSE 331 Mutation and immutability slides created by Marty Stepp based on materials by M. Ernst,

CS 251 Fall 2019 CS 251 Fall 2019 Topics Principles of Programming Languages Principles

In a world where bindings and values Immutability: obstacle or tool? are

Why Transformers Work. More info blablabla More info blablabla More info blablabla More