CSE 421 Midterm Scores Mean 83 Sigma 11 1 CSE 421 Algorithms - PowerPoint PPT Presentation

CSE 421 Midterm Scores: Mean 83, Sigma 11

CSE 421 Algorithms: Sequence Alignment

Sequence Alignment
Goal: position the characters of two strings so that they “best” line up with one another.
We can do this via dynamic programming.

What is an alignment?
Compare two strings and see how similar they are.
Maximize the number of characters that line up.

ATGTTAT vs ATCGTAC

    A T - G T T A T -
    A T C G T - A - C

Columns that hold the same letter in both rows are matches; all other columns are mismatches.

Why do we align?
Biology
    Among the most widely used computational tools in biology
    New sequences are always compared to databases
    Similar sequences often have similar origin and/or function
Other
    spell check, diff, svn/git/…, plagiarism detection, …

Terminology (example string: T A T A A G)
string: ordered list of letters
prefix: consecutive letters from the front
suffix: consecutive letters from the back
substring: consecutive letters from anywhere
subsequence: any ordered, possibly nonconsecutive letters, e.g. AAA, TAG
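The four terms can be illustrated on the slide's example string. This is a minimal Python sketch; the helper name `is_subsequence` and the particular slices are illustrative choices, not part of the slides.

```python
def is_subsequence(sub: str, s: str) -> bool:
    """Return True if the letters of `sub` occur in `s` in the same order,
    not necessarily next to each other."""
    remaining = iter(s)
    # `ch in remaining` advances the iterator, so order is respected.
    return all(ch in remaining for ch in sub)


s = "TATAAG"
prefix = s[:3]      # consecutive letters from the front
suffix = s[-3:]     # consecutive letters from the back
substring = s[2:5]  # consecutive letters from anywhere
```

Every prefix, suffix, and substring is also a subsequence, but not the other way around: AAA is a subsequence of TATAAG without being a substring.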

Formal definition of an alignment

    S = a c g c t g        S' = a c - - g c t g
    T = c a t g t          T' = - c a t g - t -

An alignment of strings S, T is represented as a pair of strings S', T' with gaps “-” such that
1. |S'| = |T'| (where |S| = “length of S”), and
2. Removing gaps leaves S, T
(Note that this is a definition for a general alignment, not an optimal one.)
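The two conditions of the definition translate directly into a small check. This is a sketch in Python; the function name `is_alignment` and the constant `GAP` are hypothetical names chosen for illustration.

```python
GAP = "-"


def is_alignment(s_aligned: str, t_aligned: str, s: str, t: str) -> bool:
    """Check the two conditions of the definition:
    equal lengths, and removing gaps gives back the original strings."""
    same_length = len(s_aligned) == len(t_aligned)
    restores_s = s_aligned.replace(GAP, "") == s
    restores_t = t_aligned.replace(GAP, "") == t
    return same_length and restores_s and restores_t


ok = is_alignment("ac--gctg", "-catg-t-", "acgctg", "catgt")
```

Note that the check says nothing about quality: any gapped pair that satisfies both conditions is an alignment, good or bad.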

Scoring an arbitrary alignment
We want to determine whether an alignment is “good” or “bad”, so we define a cost function:
    σ(x, y) = score of (mis)aligning chars x & y
            = 2 for a match, -1 for a mismatch
Total value/score of an alignment: Σ_i σ(S'[i], T'[i])
Optimal alignment: the maximum alignment score over all possible alignments

Scoring an arbitrary alignment

    S'    a    c    -    -    g    c    t    g
    T'    -    c    a    t    g    -    t    -
    σ    -1   +2   -1   -1   +2   -1   +2   -1

Score = +1
σ(x, y): match 2, mismatch -1
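The column-by-column sum can be reproduced with a few lines of Python. This is a sketch under the slide's scoring (match 2, mismatch -1, with a letter against a gap counted as a mismatch); the names `sigma` and `alignment_score` are illustrative.

```python
GAP = "-"


def sigma(x: str, y: str, match: int = 2, mismatch: int = -1) -> int:
    """Score one column: `match` for two equal letters, `mismatch` otherwise.
    A letter against a gap counts as a mismatch. Gap-gap columns are not
    expected in an alignment; here they simply get the mismatch score."""
    return match if x == y and x != GAP else mismatch


def alignment_score(s_aligned: str, t_aligned: str) -> int:
    """Sum sigma over the columns of an alignment."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned strings must have equal length")
    return sum(sigma(x, y) for x, y in zip(s_aligned, t_aligned))


score = alignment_score("ac--gctg", "-catg-t-")
```

For the alignment shown above, three matches and five mismatches give 3·2 - 5 = 1.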

Can we use Dynamic Programming?
1. Identify subproblems
   We can reuse the solutions for smaller substrings (prefixes in this case)
2. Argue that we have optimal substructure
   Appending two optimal alignments should also give an optimal alignment (some parts may change at the interface)

Arguing for Optimal Substructure
Assume strings S & T are optimally aligned except for the last character.
3 options for the last character:
1. match: S[i] & T[j] aligned (scored as a mismatch if the two characters differ)
2. mismatch: S[i] & “-” aligned
3. mismatch: T[j] & “-” aligned
* Never align “-” & “-”; i.e. σ(“-”, “-”) << 0

“Recipe” for using DP for problems like this
1. Argue for optimal substructure (✓)
2. Find a recursive relation for subproblem costs
   Using (1), find all subproblems that might contribute to an optimal cost
3. Implement a bottom-up use of (2) to fill in a table of subproblem costs
4. Write a recursive algorithm using the table from (3) to construct actual solutions to subproblems (“traceback”)

Setting up Optimal Alignment in O(n²) via DP
Input: strings S, T with |S| = n, |T| = m
Output: optimal alignment score
→ Generate the score first, then trace backwards to recover the actual alignment

Setting up Optimal Alignment in O(n²) via DP
Compute the optimal alignment of all combinations of prefixes, and store the scores in a table for the future.

            T →
    S ↓      -     A     C     G     T     …     T
     -       0    -1    -2    -3    -4     …    -n
     A      -1     2     1     0    -1
     C      -2     1     4     3
     G      -3     0     3     ★
     T      -4    -1
     …       …
     T      -n

Start at the upper left (UL), with nothing aligned; end at the lower right (LR), with the optimal score.
Move diagonally → align chars
Move vertically/horizontally → introduce a gap
V(i, j) = optimal alignment score of S[1]…S[i] and T[1]…T[j], i.e. of every pair of prefixes of S and T

Computing the table: Base Case
Column: S aligns with nothing in T (all mismatches)
    V(i, 0) = Σ_{k=1..i} σ(S[k], “-”) = i · σ(S[k], “-”)
Row: T aligns with nothing in S (all mismatches)
    V(0, j) = Σ_{k=1..j} σ(“-”, T[k]) = j · σ(“-”, T[k])

Computing the table: General Case
At any given point in computing the table, we can choose whether it’s best to
    Align 2 characters
    Take a gap

Computing the table: General Case

    ★ = V(i, j) = max of
          V(i-1, j-1) + σ(S[i], T[j])      match (or mismatch of two characters)
          V(i-1, j)   + σ(S[i], “-”)       mismatch (S[i] against a gap)
          V(i, j-1)   + σ(“-”, T[j])       mismatch (T[j] against a gap)

In each line, the first term is the cost of the ops so far and the second term is the cost of the next op (match/mismatch).
To determine ★, these 3 positions must already be filled in: V(i-1, j-1), V(i-1, j), and V(i, j-1).
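The base cases and the three-way maximum can be sketched as a bottom-up table fill. This is a minimal Python sketch assuming the slide's scoring (match 2, mismatch -1, gaps scored as mismatches); the names `sigma` and `fill_table` are illustrative, and Python's 0-based strings mean S[i] on the slides is `s[i - 1]` in the code.

```python
GAP = "-"


def sigma(x: str, y: str, match: int = 2, mismatch: int = -1) -> int:
    """Column score: two equal letters match, everything else is a mismatch."""
    return match if x == y and x != GAP else mismatch


def fill_table(s: str, t: str) -> list[list[int]]:
    """Return V where V[i][j] is the optimal score of s[:i] against t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: a prefix aligned against nothing is all gaps.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], GAP)
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + sigma(GAP, t[j - 1])
    # General case: best of diagonal, vertical, and horizontal moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                V[i - 1][j] + sigma(s[i - 1], GAP),
                V[i][j - 1] + sigma(GAP, t[j - 1]),
            )
    return V


V = fill_table("ACGC", "CATGT")
optimal_score = V[-1][-1]
```

Filling row by row guarantees that the three cells each entry depends on are already available when it is computed.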

Example: base case
σ(x, y): match 2, mismatch -1

         j      0     1     2     3     4     5
    i           -     C     A     T     G     T     ← T
    0    -      0    -1    -2    -3    -4    -5
    1    A     -1
    2    C     -2
    3    G     -3
    4    C     -4
         ↑
         S

V(i, 0) = i · σ(S[k], “-”)
V(0, j) = j · σ(“-”, T[k])

Example: general step
σ(x, y): match 2, mismatch -1

         j      0     1     2     3     4     5
    i           -     C     A     T     G     T     ← T
    0    -      0    -1    -2    -3    -4    -5
    1    A     -1    -1     1
    2    C     -2
    3    G     -3
    4    C     -4
         ↑
         S

V(i, j) = max of
    V(i-1, j-1) + σ(S[i], T[j])
    V(i-1, j)   + σ(S[i], “-”)
    V(i, j-1)   + σ(“-”, T[j])

For i = 1, j = 2:
    V(0, 1) + σ(S[1], T[2]) = -1 + 2 =  1    (match)
    V(0, 2) + σ(S[1], “-”)  = -2 - 1 = -3
    V(1, 1) + σ(“-”, T[2])  = -1 - 1 = -2
so V(1, 2) = 1.
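The single step for the cell V(1, 2) can be checked by hand or with a short Python sketch. The three neighbor values are copied from the table; the variable names and the `candidates` dictionary are illustrative only.

```python
MATCH, MISMATCH = 2, -1
S, T = "ACGC", "CATGT"

# Neighbors of V(1, 2) that are already in the table.
V01, V02, V11 = -1, -2, -1

# S[1] and T[2] on the slides are S[0] and T[1] with 0-based indexing.
pair_score = MATCH if S[0] == T[1] else MISMATCH

candidates = {
    "diagonal": V01 + pair_score,  # align S[1] with T[2]
    "up": V02 + MISMATCH,          # align S[1] with a gap
    "left": V11 + MISMATCH,        # align T[2] with a gap
}
best_move = max(candidates, key=candidates.get)
V12 = candidates[best_move]
```

Recording `best_move` alongside the value is one way to make the later traceback cheaper, at the cost of extra storage per cell.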

Example: completed table
σ(x, y): match 2, mismatch -1

         j      0     1     2     3     4     5
    i           -     C     A     T     G     T     ← T
    0    -      0    -1    -2    -3    -4    -5
    1    A     -1    -1     1     0    -1    -2
    2    C     -2     1     0     0    -1    -2
    3    G     -3     0     0    -1     2     1
    4    C     -4    -1    -1    -1     1     1
         ↑
         S

Time = O(mn) = O(|S|·|T|)

How do we find the alignment itself? Traceback
Trace from LR to UL, following the path of moves that produced the optimal score.
From each cell, the path can go diagonally, up, or left, to whichever neighbor produced the cell's value.

         j      0     1     2     3     4     5
    i           -     C     A     T     G     T
    0    -      0    -1    -2    -3    -4    -5
    1    A     -1    -1     1     0    -1    -2
    2    C     -2     1     0     0    -1    -2
    3    G     -3     0     0    -1     2     1
    4    C     -4    -1    -1    -1     1     1

Multiple optimal alignments are possible; we can break ties arbitrarily.
Corresponding alignment:
    CATGT
    -ACGC
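A traceback over the filled table can be sketched as follows. This Python sketch is iterative rather than recursive, and it breaks ties by preferring the diagonal move, then the vertical one; both are choices made here for illustration, as are the names `sigma`, `fill_table`, and `traceback`.

```python
GAP = "-"


def sigma(x: str, y: str, match: int = 2, mismatch: int = -1) -> int:
    return match if x == y and x != GAP else mismatch


def fill_table(s: str, t: str) -> list[list[int]]:
    """V[i][j] = optimal score of s[:i] against t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], GAP)
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + sigma(GAP, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                V[i - 1][j] + sigma(s[i - 1], GAP),
                V[i][j - 1] + sigma(GAP, t[j - 1]),
            )
    return V


def traceback(s: str, t: str, V: list[list[int]]) -> tuple[str, str]:
    """Walk from the lower right to the upper left, at each cell taking a
    move whose score reproduces the cell's value."""
    i, j = len(s), len(t)
    s_cols, t_cols = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
            s_cols.append(s[i - 1]); t_cols.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and V[i][j] == V[i - 1][j] + sigma(s[i - 1], GAP):
            s_cols.append(s[i - 1]); t_cols.append(GAP)
            i -= 1
        else:  # the only remaining option is a gap in s
            s_cols.append(GAP); t_cols.append(t[j - 1])
            j -= 1
    # Columns were collected right to left, so reverse them.
    return "".join(reversed(s_cols)), "".join(reversed(t_cols))


s_aligned, t_aligned = traceback("ACGC", "CATGT", fill_table("ACGC", "CATGT"))
```

With this tie-breaking order the sketch recovers the alignment -ACGC / CATGT; a different order may return a different alignment with the same score.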

Example
Mismatch = -1, Match = 2

         j      0     1     2     3     4     5
    i           -     c     a     t     g     t     ← T
    0    -      0    -1    -2    -3    -4    -5
    1    a     -1    -1     1     0    -1    -2
    2    c     -2     1     0     0    -1    -2
    3    g     -3     0     0    -1     2     1
    4    c     -4    -1    -1    -1     1     1
    5    t     -5    -2    -2     1     0     3
    6    g     -6    -3    -3     0     3     2
         ↑
         S

Complexity Notes
Time = O(mn) (value and alignment)
Space = O(mn)
Easy to get the value in Time = O(mn) and Space = O(min(m, n))
Possible to get both value and alignment in Time = O(mn) and Space = O(min(m, n)) (KT section 6.7)
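The O(min(m, n)) space bound for the value alone follows from the fact that each row of the table depends only on the row above it. A minimal Python sketch, assuming the same scoring as the examples (match 2, mismatch -1, gaps scored as mismatches); the function name is illustrative:

```python
def alignment_score_linear_space(s: str, t: str, match: int = 2, mismatch: int = -1) -> int:
    """Optimal global alignment score, keeping only two rows of the table."""
    # The score is symmetric in s and t under this scoring, so put the
    # shorter string along the row to get O(min(m, n)) space.
    if len(t) > len(s):
        s, t = t, s
    prev = [j * mismatch for j in range(len(t) + 1)]  # row 0: all gaps
    for i in range(1, len(s) + 1):
        curr = [i * mismatch] + [0] * len(t)          # column 0: all gaps
        for j in range(1, len(t) + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + pair,      # align s[i] with t[j]
                prev[j] + mismatch,      # s[i] against a gap
                curr[j - 1] + mismatch,  # t[j] against a gap
            )
        prev = curr
    return prev[len(t)]


value = alignment_score_linear_space("acgctg", "catgt")
```

Discarding earlier rows loses the information the traceback needs, which is why recovering the alignment itself in the same space takes the more involved technique the slide points to.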

Significance of Alignments
Is “42” a good score? Compared to what?
Easier to compare when using standardized scoring functions, esp. for DNA
Usual approach: compare to a specific “null model”, such as “random sequences”
Interesting stats problem; much is known

Variations
Local Alignment
    The preceding gives a global alignment, i.e. over the full length of both strings; it might well miss strong similarity of parts of the strings amidst dissimilar flanks
Gap Penalties
    Should 10 adjacent spaces cost 10 × one space?
Many others
    Similarly fast DP algorithms are often possible
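The slides do not give an algorithm for local alignment. One standard way to get it, shown here as an assumption rather than as the course's method, is the Smith-Waterman modification of the global recurrence: every cell is clamped at 0 (an alignment may start anywhere) and the answer is the largest cell in the table (it may end anywhere). The function name and the separate `gap` parameter are illustrative.

```python
def local_alignment_score(s: str, t: str, match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best score of aligning any substring of s with any substring of t
    (Smith-Waterman style sketch, score only)."""
    best = 0
    prev = [0] * (len(t) + 1)  # row 0: empty alignments score 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a fresh alignment here
                prev[j - 1] + pair,  # extend by aligning two characters
                prev[j] + gap,       # extend with a gap in t
                curr[j - 1] + gap,   # extend with a gap in s
            )
            best = max(best, curr[j])
        prev = curr
    return best


shared_core = local_alignment_score("xxACGTyy", "zzACGTww")
```

In this example the dissimilar flanks would drag a global score down, while the local score reflects only the shared ACGT core.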

Summary: Alignment
Functionally similar proteins/DNA often have recognizably similar sequences, even after eons of divergent evolution.
The ability to find/compare/experiment with the “same” sequence in other organisms is a huge win.
Surprisingly simple scoring works well in practice: score positions separately & add, usually with a fancier gap model such as affine.
Simple dynamic programming algorithms can find optimal alignments under these assumptions in polynomial time (product of the sequence lengths).
This, and heuristic approximations to it like BLAST, are workhorse tools in molecular biology and elsewhere.
