
Lecture 4 Sequence alignment: how to discover similarities between biological sequences



  1. Lecture 4 Sequence alignment: how to discover similarities between biological sequences Chapter 6 in Jones and Pevzner Fall 2019 September 10, 2019

  2. Evolution as a tool for biological insight
     • “Nothing in biology makes sense except in the light of evolution” (Theodosius Dobzhansky).
     • The function of many genes is virtually the same across many organisms, so we can study biology in organisms simpler than ourselves (“model organisms”).

  3. Homology
     • Genes in organisms A and B that have evolved from the same ancestral gene are said to be homologs.
     • Homology between genes typically indicates conserved function.
     • Sequence similarity is used to infer homology.

  4. Sequence Comparison: Early Success Story
     • In 1983 Russell Doolittle and colleagues found similarities between a cancer-causing gene from the simian sarcoma virus and a normal growth factor gene (PDGF).
     • Finding sequence similarities with genes of known function is a common approach to inferring a newly sequenced gene’s function.

  5. The Drosophila “eyeless” gene
     • W. Gehring discovered that turning on the “eyeless” gene in Drosophila leads to the growth of ectopic eyes.
     • “eyeless” is a master control gene for eye formation (it encodes a transcription factor).

  6. A similar gene in humans
     • The aniridia gene in humans has a sequence that is similar to the Drosophila eyeless gene.
     • Eye morphogenesis is under similar genetic control in vertebrates and insects.

  7. PAX6_HUMAN aligned against PAX6_DRO

       5 HSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVS 54
         ||||||||||||.|||||||||||||||||||||||||||||||||||||
      57 HSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVS 106

      55 KILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRD 104
         ||||||||||||||||||||||||||.||||||:||||||||||||||||
     107 KILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRD 156

     105 RLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMGA--------------- 139
         |||.|.|||||||||||||||||||||::|:|...
     157 RLLQENVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISA 206

     155 -----------SWGTR---PGWYPGTSVPGQPTQ---------------- 174
                    ||..|   ..||| ||:...|..
     307 NHQALQQHQQQSWPPRHYSGSWYP-TSLSEIPISSAPNIASVTAYASGPS 355

     175 ------------------------------------DGCQQQE---GGGE 185
                                             ||.|..|   |.||
     356 LAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGE 405

     186 NTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYP 235
         |:|..:||..::::.|.||.|||||||||||||.:||::|||||||||||
     406 NSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYP 455


  8. Sequence alignment

     Two unaligned sequences:
       AGGCTATCACCTGACCTCCAGGCCGATGCCC
       TAGCTATCACGACCGCGGTCGATTTGCCCGAC

     One alignment of them:
       -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
       TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

     Definition: Given two strings $v = v_1 v_2 \ldots v_m$ and $w = w_1 w_2 \ldots w_n$, an alignment is an assignment of gaps to positions $0, \ldots, m$ in $v$ and $0, \ldots, n$ in $w$, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

  9. Mutations at the DNA level
     Sequence edits: deletion, substitution
       …ACGGTGCAGTTACCA…
       …AC----CAGTCCACCA…
     Rearrangements: inversion, translocation, duplication

  10. Scoring an alignment
      • A simple scoring scheme:
        • Penalize mismatches by $-\mu$
        • Penalize indels by $-\sigma$
        • Reward matches with $+1$
      • Resulting score: $\#\text{matches} - \mu \cdot \#\text{mismatches} - \sigma \cdot \#\text{indels}$
      • Objective: find the best scoring alignment
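The scoring scheme can be sketched in a few lines of Python. This is an illustrative helper, not code from the lecture: the name `score_alignment`, the default values `mu=1` and `sigma=1`, and the use of `-` as the gap character are all assumptions.

```python
def score_alignment(row_v, row_w, mu=1, sigma=1, gap="-"):
    """Return #matches - mu * #mismatches - sigma * #indels.

    row_v and row_w are the two rows of an alignment (equal length,
    gaps written as `gap`). The names and defaults are illustrative.
    """
    if len(row_v) != len(row_w):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = indels = 0
    for a, b in zip(row_v, row_w):
        if a == gap and b == gap:
            raise ValueError("a column cannot contain two gaps")
        if a == gap or b == gap:
            indels += 1          # letter aligned with a gap
        elif a == b:
            matches += 1         # identical letters
        else:
            mismatches += 1      # two different letters
    return matches - mu * mismatches - sigma * indels


if __name__ == "__main__":
    # 5 matches and 4 indels with mu = sigma = 1
    print(score_alignment("AT-GTTAT-", "ATCGT-A-C"))
```

With `mu = sigma = 1`, the alignment `AT-GTTAT-` over `ATCGT-A-C` has 5 matches, 0 mismatches and 4 indels, so the function returns 1.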

  11. Number of pairwise alignments
      • Given sequences of length m and n, the number of alignments is:

        $$ \sum_{k=0}^{\min(m,n)} \binom{m}{k} \binom{n}{k} = \binom{n+m}{n} $$

      • For two sequences of length n:

        $$ \binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}} $$

        Derived using Stirling’s approximation: $n! \approx \sqrt{2 \pi n} \left( \frac{n}{e} \right)^n$
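A quick numerical check of the counting formula and the Stirling estimate, as a minimal sketch. The function names `count_alignments` and `stirling_estimate` are made up for this example; `math.comb` is the standard-library binomial coefficient.

```python
from math import comb, pi, sqrt


def count_alignments(m, n):
    """Sum over k of C(m, k) * C(n, k), the left-hand side of the identity."""
    return sum(comb(m, k) * comb(n, k) for k in range(min(m, n) + 1))


def stirling_estimate(n):
    """Approximate C(2n, n) by 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / sqrt(pi * n)


if __name__ == "__main__":
    for n in (5, 10, 20):
        exact = count_alignments(n, n)
        # the sum should agree with the closed form C(2n, n)
        assert exact == comb(2 * n, n)
        print(n, exact, round(stirling_estimate(n)))
```

Even for n = 20 the count is already above 10^11, which is why enumerating all alignments is not a practical way to find the best one.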

  12. Substrings and subsequences
      Definition:
      • A string $x'$ is a substring of a string $x$ if $x = u x' v$ for some prefix string $u$ and suffix string $v$ (that is, $x' = x_i \ldots x_j$ for some $1 \le i \le j \le |x|$).
      • A string $x'$ is a subsequence of a string $x$ if $x'$ can be obtained from $x$ by deleting 0 or more letters (that is, $x' = x_{i_1} \ldots x_{i_k}$ for some $1 \le i_1 < \ldots < i_k \le |x|$).
      Note: a substring is always a subsequence.
      Example: x = abracadabra
        y = cadabr: substring
        z = brcdbr: subsequence, not substring
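The two definitions can be checked directly in Python. This is a small sketch with hypothetical helper names (`is_substring`, `is_subsequence`); the subsequence test scans `x` once, consuming letters from left to right.

```python
def is_substring(xp, x):
    """True if xp occurs in x as a contiguous block."""
    return xp in x


def is_subsequence(xp, x):
    """True if xp can be obtained from x by deleting zero or more letters."""
    remaining = iter(x)
    # `ch in remaining` advances the iterator until ch is found,
    # so letters of xp must appear in x in the same order.
    return all(ch in remaining for ch in xp)


if __name__ == "__main__":
    x = "abracadabra"
    print(is_substring("cadabr", x), is_subsequence("cadabr", x))
    print(is_substring("brcdbr", x), is_subsequence("brcdbr", x))
```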

  13. Encoding alignment as a path in a 2-d grid

      i coords:       0  1  2  2  3  3  4  5  6  7  8
      elements of v:    A  T  -  C  -  T  G  A  T  C
      elements of w:    -  T  G  C  A  T  -  A  -  C
      j coords:       0  0  1  2  3  4  5  5  6  6  7

      (0,0) → (1,0) → (2,1) → (2,2) → (3,3) → (3,4) → (4,5) → (5,5) → (6,6) → (7,6) → (8,7)

      Every alignment is a path in the 2-D grid.
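The encoding is mechanical: each alignment column advances i if it consumes a letter of v, and j if it consumes a letter of w. A minimal sketch, assuming gaps are written as a single `-` and using the made-up name `alignment_to_path`:

```python
def alignment_to_path(row_v, row_w, gap="-"):
    """Convert the two rows of an alignment into a list of (i, j) grid points."""
    if len(row_v) != len(row_w):
        raise ValueError("aligned rows must have equal length")
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(row_v, row_w):
        if a != gap:
            i += 1  # column consumes a letter of v
        if b != gap:
            j += 1  # column consumes a letter of w
        path.append((i, j))
    return path


if __name__ == "__main__":
    print(alignment_to_path("AT-C-TGATC", "-TGCAT-A-C"))
```

For the alignment on this slide the result starts at (0,0) and ends at (8,7), the lengths of v and w.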

  14. Alignment as a path
      [Figure: 2-D grid with the letters A T C T G A T C along the axis labeled j (0 to 8) and the letters T G C A T A C along the axis labeled i (0 to 7)]

  15. Alignment as a Path in the Edit Graph

      i coords:       0  1  2  2  3  4  5  6  7  7
      elements of v:    A  T  -  G  T  T  A  T  -
      elements of w:    A  T  C  G  T  -  A  -  C
      j coords:       0  1  2  3  4  5  5  6  6  7

      Corresponding path:
      (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

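  16. Alignment as a Path in the Edit Graph
      Horizontal and vertical edges represent indels in v and w, with score -1. Diagonal edges represent matches, with score 1. The alignment on the previous slide has 5 matches and 4 indels, so its score is 5 - 4 = 1.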

  17. Alignment as a Path in the Edit Graph Every path in the edit graph corresponds to an alignment:

  18. Alignment algorithms we will cover
      • Global alignment
      • Local alignment
      • Alignment with affine gap penalties
      • Scoring matrices

  19. Our simple scoring scheme
      • Mismatches are penalized by $-\mu$, indels are penalized by $-\sigma$, and matches are rewarded by $+1$, giving the score:
        $\#\text{matches} - \mu \cdot \#\text{mismatches} - \sigma \cdot \#\text{indels}$

  20. Global Alignment: The Needleman-Wunsch algorithm [1]
      Find the best alignment between two strings under our scoring scheme.
      Input: strings v and w and a scoring scheme
      Output: maximum scoring alignment
      Parameters: $\mu$ is the mismatch penalty, $\sigma$ is the indel penalty.

      $$ s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\ s_{i-1,j} - \sigma \\ s_{i,j-1} - \sigma \end{cases} $$

      Here $s_{i,j}$ is the score of the best alignment of the length-$i$ prefix of $v$ and the length-$j$ prefix of $w$.

      [1] A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol. 48(3):443-53, 1970.

  21. Needleman Wunsch (cont) • What about the base case?

  22. NW as a DP algorithm (m = |v|, n = |w|)

      NW(v, w, sigma, mu):
          for i in range(0, m + 1):
              s[i][0] = -sigma * i
          for j in range(0, n + 1):
              s[0][j] = -sigma * j
          for i in range(1, m + 1):
              for j in range(1, n + 1):
                  fill in s[i][j]
          return s[m][n]

      Runtime: O(nm)
      Memory: O(nm)
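A runnable version of the table-filling step might look like the sketch below. It follows the recurrence and base cases from the slides, but the function name `nw_score`, the list-of-lists table and the default penalties `sigma=1`, `mu=1` are assumptions made for illustration.

```python
def nw_score(v, w, sigma=1, mu=1):
    """Return the optimal global alignment score of v and w.

    Matches score +1, mismatches -mu, indels -sigma.
    s[i][j] is the best score for v[:i] against w[:j].
    """
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    # base cases: a prefix aligned against the empty string is all indels
    for i in range(1, m + 1):
        s[i][0] = -sigma * i
    for j in range(1, n + 1):
        s[0][j] = -sigma * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # strings are 0-indexed, so v_i is v[i - 1]
            diag = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(diag, s[i - 1][j] - sigma, s[i][j - 1] - sigma)
    return s[m][n]


if __name__ == "__main__":
    print(nw_score("ACGT", "AGT"))
```

The loops run up to and including m and n because the table has one extra row and column for the empty prefix.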

  23. Now What?
      • The DP algorithm created the alignment grid.
      • To read off the best alignment, follow the pointers back from the sink.
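One way to realize the pointer-following step is to store, for every cell, which of the three moves produced its score, then walk back from the sink (m, n) to the source (0, 0). The sketch below is self-contained and makes several assumptions not stated on the slides: the name `nw_align`, the pointer labels `"diag"`, `"up"` and `"left"`, and breaking ties in favor of the diagonal move.

```python
def nw_align(v, w, sigma=1, mu=1):
    """Return (score, aligned_v, aligned_w) for a best global alignment."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        s[i][0], ptr[i][0] = -sigma * i, "up"
    for j in range(1, n + 1):
        s[0][j], ptr[0][j] = -sigma * j, "left"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            candidates = [
                (diag, "diag"),
                (s[i - 1][j] - sigma, "up"),
                (s[i][j - 1] - sigma, "left"),
            ]
            # max returns the first maximal entry, so ties prefer "diag"
            s[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # follow the pointers from the sink back to the source
    row_v, row_w = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            row_v.append(v[i - 1])
            row_w.append(w[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            row_v.append(v[i - 1])
            row_w.append("-")
            i -= 1
        else:
            row_v.append("-")
            row_w.append(w[j - 1])
            j -= 1
    # columns were collected from the end, so reverse them
    return s[m][n], "".join(reversed(row_v)), "".join(reversed(row_w))


if __name__ == "__main__":
    print(nw_align("ACGT", "AGT"))
```

Other tie-breaking rules give different alignments with the same optimal score.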

  24. Scoring Matrices
      To generalize scoring, we use a scoring matrix $\delta$.
      Size of the matrix:
      • Alignment of DNA sequences: (4+1) x (4+1)
      • Alignment of amino acids: (20+1) x (20+1)
      The additional row/column holds the scores for the gap character “-”.

      $$ s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases} $$
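As a sketch of the generalized recurrence, the scoring matrix can be stored as a dictionary keyed by pairs of characters. Everything named here is hypothetical: `make_dna_delta` builds a matrix equivalent to the simple +1 / -mu / -sigma scheme, and `global_score` applies the recurrence. The base cases, which accumulate gap scores along the first row and column, are an assumption since the slide does not state them.

```python
def make_dna_delta(mu=1, sigma=1, alphabet="ACGT", gap="-"):
    """Build delta as a dict over (letter, letter), (letter, gap), (gap, letter)."""
    delta = {}
    for a in alphabet:
        for b in alphabet:
            delta[(a, b)] = 1 if a == b else -mu
        delta[(a, gap)] = -sigma
        delta[(gap, a)] = -sigma
    return delta


def global_score(v, w, delta, gap="-"):
    """Optimal global alignment score under an arbitrary scoring matrix delta."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    # assumed base cases: accumulate gap scores along the borders
    for i in range(1, m + 1):
        s[i][0] = s[i - 1][0] + delta[(v[i - 1], gap)]
    for j in range(1, n + 1):
        s[0][j] = s[0][j - 1] + delta[(gap, w[j - 1])]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + delta[(v[i - 1], w[j - 1])],
                s[i - 1][j] + delta[(v[i - 1], gap)],
                s[i][j - 1] + delta[(gap, w[j - 1])],
            )
    return s[m][n]


if __name__ == "__main__":
    delta = make_dna_delta()
    print(len(delta), global_score("ACGT", "AGT", delta))
```

The dictionary has 24 entries rather than 25 because the (gap, gap) cell of the 5 x 5 matrix is never used by the recurrence.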
