CSE182-L6 P-value and E-value Dicitionary matching Pattern - PowerPoint PPT Presentation

CSE182-L6 P-value and E-value Dicitionary matching Pattern matching October 09 CSE182

Why is BLAST fast? • Assume that keyword searching does not consume any time and that alignment computation the expensive step. • Query m=1000, random Db n=10 7 , no TP • SW = O(nm) = 1000*10 7 = 10 10 computations • BLAST, W=11 • E(#11-mer hits)= 1000* (1/4) 11 * 10 7 =2384 • Number of computations = 2384*100*100=2.384*10 7 • Ratio=10 10 /(2.384*10 7 )=420 • Further speed improvements are possible October 09 CSE 182

Keyword Matching • How fast can we match keywords? • Hash table/Db index? What is the size of the AATCA 567 hash table, for m=11 • Suffix trees? What is the size of the suffix trees? • Trie based search. We will do this in class. October 09 CSE 182

Silly Quiz Skin patterns Facial Features October 09 CSE182

Expectation? • Some quantities can be reasonably guessed by taking a statistical sample, others not – Average weight of a group of 100 people – Average height of a group of 100 people – Average grade on a test • Give an example of a quantity that cannot. • When the distribution, and the expectation is known, it is easy to see when you see something significant. • If the distribution is not well understood, or the wrong distribution is chosen, a wrong conclusion can be drawn October 09 CSE 182

P-value computation • BLAST: The matching regions are expanded into alignments, which are scored using SW, and an appropriate scoring matrix. • The results are presented in order of decreasing scores • The score is just a number. • How significant is the top scoring hits if it has a score S? • Expect/E-value (score S)= Number of times we would expect to see a random query generate a score S, or better • How can we compute E-value? October 09 CSE 182

What is a distribution function • Given a collection of numbers (scores) – 1, 2, 8, 3, 5, 3,6, 4, 4,1,5,3,6,7,…. • Plot its distribution as follows: – X-axis =each number – Y-axis (count/frequency/probability) of seeing that number – More generally, the x-axis can be a range to accommodate real numbers October 09 CSE 182

P-value • P-value: probability that a specific value (11) is achieved by chance. • Compute an scores obtained by chance – 1, 2, 8, 3, 5, 3,6,12,4, 4,1,5,3,6,7 • Compute a Distribution • 1-2 XXX • 3-4 XXXX • 5-6 XXXX • 7-8 XX • 9-10 • 11-13 X • 15-17 • Are October 09 CSE 182

P-value computation • A simple empirical method: • Compute a distribution of scores against a random database. • Use an estimate of the area under the curve to get the probability. • OR, fit the distribution to one of the standard distributions. October 09 CSE 182

Z-scores for alignment • Initial assumption was that the scores followed a normal distribution. • Z-score computation: – For any alignment, score S, shuffle one of the sequences many times, and recompute alignment. Get mean and standard deviation Z S = S − µ σ – Look up a table to get a P-value October 09 CSE 182

Blast E-value • Initial (and natural) assumption was that scores followed a Normal distribution • 1990, Karlin and Altschul showed that ungapped local alignment scores follow an exponential distribution • Practical consequence: – Longer tail. – Previously significant hits now not so significant October 09 CSE 182

Altschul Karlin statistics • For simplicity, assume that the database is a binary string, and so is the query. – Let match-score=1, – mismatch score=- ∞ , – indel=- ∞ (No gaps allowed) • What does it mean to get a score k? October 09 CSE 182

Exponential distribution • Random Database, Pr(1) = p • What is the expected number of hits to a sequence of k 1’s   − k ln 1   ( n − k ) p k ≅ ne k ln p = ne p   • Instead, consider a random binary Matrix. Expected # of diagonals of k 1s   − k ln 1   Λ = ( n − k )( m − k ) p k ≅ nme k ln p = nme p   October 09 CSE 182

• As you increase k, the number decreases exponentially. • The number of diagonals of k runs can be approximated by a Poisson process Pr[ u ] = Λ u e −Λ u ! Pr[ u > 0] = 1 − e −Λ • In ungapped alignments, we replace the coin tosses by column scores, but the behaviour does not change (Karlin & Altschul). • As the score increases, the number of alignments that achieve the score decreases exponentially October 09 CSE 182

Blast E-value • Choose a score such that the expected score between a pair of residues < 0 • Expected number of alignments with a score S   − λ S − ln K   E = Kmne − λ S = mn 2  ln 2  Pr( S ≥ x ) = 1 − e − Kmne − λ x For small values, E-value and P-value are the same • October 09 CSE 182

The last step in Blast • We have discussed – Alignments – Db filtering using keywords – Scoring matrices – E-values and P-values • The last step: Database filtering requires us to scan a large sequence fast for matching keywords October 09 CSE 182

Keyword search • Recall: In BLAST, we get a collection of keywords from the query sequence, and identify all db locations with an exact match to the keyword. • Question: Given a collection of strings (keywords), find all occurrences in a database string where they keyword might match. Fa05 CSE 182

• End of lecture 6 October 09 CSE182

CSE182-L6 P-value and E-value Dicitionary matching Pattern - PowerPoint PPT Presentation

CSE182-L6 P-value and E-value Dicitionary matching Pattern matching October 09 CSE182 Why is BLAST fast? Assume that keyword searching does not consume any time and that alignment computation the expensive step. Query m=1000, random

CSE182-L11 Protein sequencing and Mass Spectrometry CSE182 Course Summary Gene finding

CSE182-L7 CSE182-L7 Protein structure Basics Protein structure Basics Protein sequencing via MS

CSE182-L13 Mass Spectrometry Quantitation and other applications CSE182 The forbidden pairs

CSE182-L7 Dicitionary matching Pattern matching October 09 CSE182 Dictionary Matching

CSE182-L12 Mass Spectrometry Peptide identification CSE182 General isotope computation

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu

L14 Mass Spec Quantitation MS applications Microarray analysis CSE182 LC-MS Maps Peptide 2 I

CSE182-L10 Gene Finding November 09 HMM fair-coin example 0.6 0.6 1 0.4 0.4 E F (H)=0.5 E L

CSE182-L9 Protein domain analysis via HMMs Gene finding November 09 QUIZ! Question: Your

CSE182-L8 Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding

CSE 182: Biological Data Analysis Instructor: Vineet Bafna TA: Ryan Kelley www. www.cse cse.

CSE182-L16 Non-coding RNA Biol. Data analysis: Review Assembly Protein Sequence Sequence

CSE182-L12 LW statistics/Assembly Quiz Who are these people, and what is the occasion?

CSE 182-L2:Blast & variants I Dynamic Programming FA08 CSE182 Notes

CSE182-L9 Modeling Protein domains using HMMs Profiles Revisited Note that profiles are a

CSE182-L5: Scoring matrices Dictionary Matching October 09 CSE 182 Expectation? Some

Solving The Words Search Problem Ivan Kazmenko St. Petersburg State University Tuesday, July 5,

Indexing and Searching Indexing and Searching Berlin Chen 2005 References: 1. Modern

Sorting and Searching by Distribution: From Generic Discrimination to Generic Tries Fritz Henglein

Compressed Indexes for Fast Search of Semantic Data Ra ff aele Perego Giulio Ermanno Pibiri

Chapter 9: Text Processing 10/16/2015 3:40 PM Text Processing 1 Outline and Reading Strings

File indexing and searching

Recap CS 525: Advanced Database We have discussed Organization

A Study of Erlang ETS Table Implementations and Performance Or: Judy Arrays Are Amazing Data

CSE182-L6 P-value and E-value Dicitionary matching Pattern - PowerPoint PPT Presentation

CSE182-L6 P-value and E-value Dicitionary matching Pattern matching October 09 CSE182 Why is BLAST fast? Assume that keyword searching does not consume any time and that alignment computation the expensive step. Query m=1000, random

CSE182-L11 Protein sequencing and Mass Spectrometry CSE182 Course Summary Gene finding

CSE182-L7 CSE182-L7 Protein structure Basics Protein structure Basics Protein sequencing via MS

CSE182-L13 Mass Spectrometry Quantitation and other applications CSE182 The forbidden pairs

CSE182-L7 Dicitionary matching Pattern matching October 09 CSE182 Dictionary Matching

CSE182-L12 Mass Spectrometry Peptide identification CSE182 General isotope computation

CSE 182-L2:Blast &amp; variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu

L14 Mass Spec Quantitation MS applications Microarray analysis CSE182 LC-MS Maps Peptide 2 I

CSE182-L10 Gene Finding November 09 HMM fair-coin example 0.6 0.6 1 0.4 0.4 E F (H)=0.5 E L

CSE182-L9 Protein domain analysis via HMMs Gene finding November 09 QUIZ! Question: Your

CSE182-L8 Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding

CSE 182: Biological Data Analysis Instructor: Vineet Bafna TA: Ryan Kelley www. www.cse cse.

CSE182-L16 Non-coding RNA Biol. Data analysis: Review Assembly Protein Sequence Sequence

CSE182-L12 LW statistics/Assembly Quiz Who are these people, and what is the occasion?

CSE 182-L2:Blast &amp; variants I Dynamic Programming FA08 CSE182 Notes

CSE182-L9 Modeling Protein domains using HMMs Profiles Revisited Note that profiles are a

CSE182-L5: Scoring matrices Dictionary Matching October 09 CSE 182 Expectation? Some

Solving The Words Search Problem Ivan Kazmenko St. Petersburg State University Tuesday, July 5,

Indexing and Searching Indexing and Searching Berlin Chen 2005 References: 1. Modern

Sorting and Searching by Distribution: From Generic Discrimination to Generic Tries Fritz Henglein

Compressed Indexes for Fast Search of Semantic Data Ra ff aele Perego Giulio Ermanno Pibiri

Chapter 9: Text Processing 10/16/2015 3:40 PM Text Processing 1 Outline and Reading Strings

File indexing and searching

Recap CS 525: Advanced Database We have discussed Organization

A Study of Erlang ETS Table Implementations and Performance Or: Judy Arrays Are Amazing Data

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu

CSE 182-L2:Blast & variants I Dynamic Programming FA08 CSE182 Notes