Sequence Alignment Sequence Alignment - PowerPoint PPT Presentation

Sequence Alignment

Sequence Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- || ||||||| |||| | || ||| ||||| TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Distance from sequences Distance from sequences Mutation events Mutation we need a score for the substitution of the symbol i with j (amino acidic residues, nucleotides, etc.) substitution matrices s(i,j) A: ALASVLIRLITRLYP B: ASAVHLNRLITRLYP The substitution matrix should account for the underlying biological process (conserved structures or functions)

Substitution matrices Substitution matrices The basic idea is to measure the correlation between 2 sequences Given a pair of “correlated” sequences we measure the substitution frequency of a -> b ( assuming symmetry) P ab The null hypothesis (random correlation or independent events) is the product P a P b We then defjne s ( a,b ) = log( P ab / P a P b ) (log is additive so that the probability => sum over pairs) Minimal distance = Maximum score (s)

Substitution matrices Substitution matrices Following this idea several matrices have been derived Their main difgerence is relative to the alignment types used for computing the frequencies PAMx: (Point Accepted Mutation). Number x % of mutation events. The matrix is built as: A 1 ik = P ( k | i ) for sequences with 1% mutations. ik ) n n % mutation events (number of mutations A n ik =(A 1 NOT mutated residues. E.g.: n =2 P( i | k ) = Σ a P( i | a ) P( a | k ) PAMn = Log( A n ij / P i )

PAM10 PAM10 Very stringent matrix, no out of diagonal element is > 0

PAM160 PAM160 Some positive values outside the diagonal appear

PAM250 PAM250 This is one of the most widely used matrix

PAM500 PAM500

Substitution matrices Substitution matrices PAM matrices are computed starting from very similar sequences and more distance scoring matrices are derived by matrix products BLOSUMx: This family is computed directly from blocks of alignments with a defjned sequence identity threshold (> x %). => For very related sequences we must use PAM with low numbers and BLOSUM with large numbers. The opposite holds for distant sequences

BLOSUM62 BLOSUM62 Probably the most widely used

BLOSUM90 BLOSUM90

BLOSUM30 BLOSUM30

Exercise: Exercise: Write your own substitution Matrix: Given e set of aligned sequences compute the P ab =frequency of mutations between a->b (assume symmetry, a->b counts also as b->a). P a =as the marginal probability of P ab Finally, derive the substitution matrix: s ( a,b ) = log( P ab / P a P b ) Use the fjles provided at: http://www.biocomp.unibo.it/piero/P4B/Malignments/ Hints: 1.start with toy.aln to check your algorithm 2. Initialize your P[a][b] matrix with pseudocounts (such as 1 instead of 0)

DotPlots: DotPlots:

DotPlot Exercise: DotPlot Exercise: Write a program dotplot.py That takes as input a fasta fjle with two sequences A scoring matrix and sliding window and the threshold cut-ofg, such as: dotplot.py fasta.in score.mat 11 threshold and prints on the standard oputput the dotplot. PS: you can use matplotlib for a nicer presentation

Distance between sequences Distance between sequences What kind of operations we consider? Mutation Deletion and Insertion (rarely rearrangements ) A: ALASVLIRLIT--YP B: ASAVHL---ITRLYP The negative gap score value depends only on the number of holes σ ( n ) = - nd linear σ ( n ) = - d - ( n-1 ) e affjne ( d : open e : extension) !! Please note that all scores are position- independent along the sequence

Pairwise alignment Pairwise alignment Given 2 sequences what is an alignment with a maximum score? Naïf solution: try every possible alignments and select one with the best score Using the score function :

How may alignments there are? How may alignments there are? Case without internal gaps --AAA -AAA AAA AAA AAA- AAA-- BB--- BB-- BB- -BB --BB ---BB Given 2 sequences of length m e n , the number of shifts is m + n +1

How may alignments there are? How may alignments there are? Case with internal gaps --AAA -AAA -AAA -AAA A-AA BB--- BB-- B-B- B--B BB-- BBAAA BABAA BAABA BAAAB ABBAA AAA AAA AA-A AAA AAA- BB- B-B -BB- -BB --BB ... ABABA ABAAB AABBA AABAB AAABB The number of the alignments is equal at the number of all possible way of mixing 2 sequences keeping track of the original sequence order. Given 2 sequences of lengths n and m, they are  ∑ k=0,min(m,n) (m+n-k)!/k!(n-k)!(m-k)! For n=m=80we get > 10 43 possible alignments !!!!!!!

Basic idea Basic idea We can keep a table of the precomputed substring alignments (dynamic programming) ALSKLASPALSAKDLDSPAL S ALSKIADSLAPIKDLSPASL T ALSKLASPALSAKDLDSPAL -S ALSKIADSLAPIKDLSPASL T- ALSKLASPALSAKDLDSPAL S- ALSKIADSLAPIKDLSPASL- T

Basic idea Basic idea Building step by step Given the 2 sequences ALSKLASPALSAKDLDSPALS, ALSKIADSLAPIKDLSPASLT The best alignment between the two substrings ALSKLASPA ALSKIAD can be computed taking only into consideration ALSKLASP A ALSKLASP A ALSKLASPA - + + + ALSKIA D ALSKIAD - ALSKIA D The best among these 3 possibilities

The Needleman-Wunsch Matrix x 1 ……………………………… x M y N ……………………………… y 1 Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences

The Needleman-Wunsch Algorithm x = AGTA m = 1 y = ATA s = -1 d = -1 F(i,j) i = 0 1 2 3 4 A G T A Optimal Alignment: 0 -1 -2 -3 -4 j = 0 F(4,3) = 2 A -1 1 0 -1 -2 1 AGTA A - TA T -2 0 0 1 0 2 A -3 -1 -1 0 2 3

The Needleman-Wunsch Algorithm • Initialization. • F(0, 0) = 0 • F(0, j) = - j × d • F(i, 0) = - i × d • Main Iteration. Filling-in partial alignments • For each i = 1……M For each j = 1……N F(i-1,j) – d [case 1] F(i, j) = max F(i, j-1) – d [case 2] F(i-1, j-1) + s(x i , y j ) [case 3] UP, if [case 1] Ptr(i,j) = LEFT if [case 2] DIAG if [case 3] • Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

Exercise: Suppose you want only to know the score of a global alignment. => Write a program that given two input sequence (in a single file in fasta format), a gap cost and a similarity matrix computes the score of the global alignment in O(N*M) time and in O(M) space, where M and N are the lengths of the input sequences and M<=N

The Overlap Detection variant x 1 ……………………………… x N Changes: y M ……………………………… y 1 • Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 • Termination max i F(i, N) F OPT = max max j F(M, j)

The local alignment problem Given two strings x = x 1 ……x M , y = y 1 ……y N Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g. x = aaaacccccgggg y = cccgggaaccaacc

Why local alignment • Genes are shuffled between genomes • Portions of proteins (domains) are often conserved

The Smith-Waterman algorithm Idea : Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization : F(0, j) = F(i, 0) = 0 0 Iteration : F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + s(x i , y j )

The Smith-Waterman algorithm Termination : • If we want the best local alignment… F OPT = max i,j F(i, j) • If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back

Scoring the gaps more accurately G(n) Current model: Gap of length n incurs penalty n × d However, gaps usually occur in bunches Concave gap penalty function: G(n) G(n): for all n, G(n + 1) - G(n) ≤ G(n) - G(n – 1)

General gap dynamic programming Initialization: same Iteration: F(i-1, j-1) + s(x i , y j ) F(i, j) = max max k=0…i-1 F(k,j) – γ (i-k) max k=0…j-1 F(i,k) – γ (j-k) Termination: same Running Time: O(N 2 M) (assume N>M) Space: O(NM)

Exercise: Write a program that given two input sequence (in a single file in fasta format), and a choice of a general gap function and scoring matrix computes the alignments of the two sequences and returns one of the possible best alignments. Remember that when you store that the best score is obtained using max k=0…i-1 F(k,j) – g(i-k) max k=0…j-1 F(i,k) – g(j-k) You have to store this information in the corresponding pointer (back-trace) matrix.

Compromise: affine gaps g(n) g(n) = d + (n – 1) × e | | e gap gap d open extend To compute optimal alignment, At position i,j, need to “remember” best score if gap is open best score if gap is not open F(i, j): score of alignment x 1 …x i to y 1 …y j if x i aligns to y j if G(i, j): score if if x i , or y j , aligns to a gap

Needleman-Wunsch with affine gaps Initialization: F(0,0)=0, F(i, 0) = d + (i – 1) × e, F(0, j) = d + (j – 1) × e R(0,j)= -∞ , C(i,0)= -∞ Iteration: F(i – 1, j – 1) + s(x i , y j ) F(i, j) = max R(i , j) C(i , j) F(i – 1, j) – d R(i, j) = max R(i – 1, j) – e F(i , j -1) – d C(i, j) = max C(i , j -1 ) – e Termination: same

Sequence Alignment Sequence Alignment - PowerPoint PPT Presentation

Sequence Alignment Sequence Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- || ||||||| |||| | || ||| ||||| TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Distance from sequences

Sequence Alignment Gerhard Jger ESSLLI 2016 Gerhard Jger Sequence Alignment ESSLLI 2016 1

This week CSE 527 Sequence alignment Computational Biology More sequence alignment

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

Sequence Alignment Mark Voorhies 5/20/2015 Mark Voorhies Sequence Alignment Exercise: Scoring

Sequence Alignment Mark Voorhies 5/29/2013 Mark Voorhies Sequence Alignment Exercise: Scoring

Sequence Alignment Mark Voorhies 4/12/2018 Mark Voorhies Sequence Alignment Exercise: Scoring

Sequence Alignment Mark Voorhies 4/24/2012 Mark Voorhies Sequence Alignment Exercise:

CSE 421 Algorithms Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

CSE 427 Comp Bio Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

CSE 427 Computational Biology Winter 2008 Sequence Alignment; DNA Replication 1 Sequence

CSE 427 Comp Bio Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

CSE421 Algorithms Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

Sequence Alignment (chapter 6) The biological problem l Global alignment l Local alignment l

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Whole genome alignments http://faculty.washington.edu/jht/GS559_2013/ Genome 559: Introduction to

Algoritmi per la Bioinformatica Zsuzsanna Lipt ak Laurea Magistrale Bioinformatica e

Genome 559 Introduction to Statistical and Computational Genomics Winter 2010 Lecture 14a: BLAST

A few BLAST details Julin Maloof April 16, 2019 Slides courtesy of Venkatsean Sundaresan BLAST

Data Mining: Concepts and Techniques Additional Applications and Emerging Topics Li Xiong

String comparison problems, Myers (91) So far our goal was to maximize the alignments

Heuristic searches Genomics Compare DNA sequences to discover similarities/differences

Alignments in Practice BLAST and CLUSTAL Introduction to Bioinformatics Dortmund, 16.-20.07.2007

Sequence Alignment Sequence Alignment - PowerPoint PPT Presentation

Sequence Alignment Sequence Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- || ||||||| |||| | || ||| ||||| TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Distance from sequences

Sequence Alignment Gerhard Jger ESSLLI 2016 Gerhard Jger Sequence Alignment ESSLLI 2016 1

This week CSE 527 Sequence alignment Computational Biology More sequence alignment

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

Sequence Alignment Mark Voorhies 5/20/2015 Mark Voorhies Sequence Alignment Exercise: Scoring

Sequence Alignment Mark Voorhies 5/29/2013 Mark Voorhies Sequence Alignment Exercise: Scoring

Sequence Alignment Mark Voorhies 4/12/2018 Mark Voorhies Sequence Alignment Exercise: Scoring

Sequence Alignment Mark Voorhies 4/24/2012 Mark Voorhies Sequence Alignment Exercise:

CSE 421 Algorithms Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

CSE 427 Comp Bio Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

CSE 427 Computational Biology Winter 2008 Sequence Alignment; DNA Replication 1 Sequence

CSE 427 Comp Bio Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

CSE421 Algorithms Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

Sequence Alignment (chapter 6) The biological problem l Global alignment l Local alignment l

SEQUENCE ANALYSIS The term &quot; sequence analysis &quot; in biology implies subjecting a DNA or

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Whole genome alignments http://faculty.washington.edu/jht/GS559_2013/ Genome 559: Introduction to

Algoritmi per la Bioinformatica Zsuzsanna Lipt ak Laurea Magistrale Bioinformatica e

Genome 559 Introduction to Statistical and Computational Genomics Winter 2010 Lecture 14a: BLAST

A few BLAST details Julin Maloof April 16, 2019 Slides courtesy of Venkatsean Sundaresan BLAST

Data Mining: Concepts and Techniques Additional Applications and Emerging Topics Li Xiong

String comparison problems, Myers (91) So far our goal was to maximize the alignments

Heuristic searches Genomics Compare DNA sequences to discover similarities/differences

Alignments in Practice BLAST and CLUSTAL Introduction to Bioinformatics Dortmund, 16.-20.07.2007

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or