Sequence Alignment Sequence Alignment - - PowerPoint PPT Presentation

sequence alignment sequence alignment
SMART_READER_LITE
LIVE PREVIEW

Sequence Alignment Sequence Alignment - - PowerPoint PPT Presentation

Sequence Alignment Sequence Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- || ||||||| |||| | || ||| ||||| TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Distance from sequences


slide-1
SLIDE 1

Sequence Alignment

slide-2
SLIDE 2

Sequence Alignment

  • AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

|| ||||||| |||| | || ||| ||||| TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC

slide-3
SLIDE 3

Distance from sequences Distance from sequences

Mutation events Mutation we need a score for the substitution of the symbol i with j (amino acidic residues, nucleotides, etc.) substitution matrices s(i,j) A: ALASVLIRLITRLYP B: ASAVHLNRLITRLYP

The substitution matrix should account for the underlying biological process (conserved structures or functions)

slide-4
SLIDE 4

Substitution matrices Substitution matrices

The basic idea is to measure the correlation between 2 sequences Given a pair of “correlated” sequences we measure the substitution frequency of a-> b ( assuming symmetry) Pab The null hypothesis (random correlation or independent events) is the product PaPb We then defjne s (a,b) = log(Pab/PaPb) (log is additive so that the probability => sum over pairs) Minimal distance = Maximum score (s)

slide-5
SLIDE 5

Following this idea several matrices have been derived Their main difgerence is relative to the alignment types used for computing the frequencies PAMx: (Point Accepted Mutation). Number x % of mutation events. The matrix is built as: A1

ik = P(k|i)

for sequences with 1% mutations. An

ik=(A1 ik)n n % mutation events (number of mutations

NOT mutated residues. E.g.: n=2 P(i|k) = Σa P(i|a) P(a|k) PAMn = Log(An

ij /Pi)

Substitution matrices Substitution matrices

slide-6
SLIDE 6

PAM10 PAM10 Very stringent matrix, no out of diagonal element is > 0

slide-7
SLIDE 7

PAM160 PAM160 Some positive values outside the diagonal appear

slide-8
SLIDE 8

PAM250 PAM250 This is one of the most widely used matrix

slide-9
SLIDE 9

PAM500 PAM500

slide-10
SLIDE 10

PAM matrices are computed starting from very similar sequences and more distance scoring matrices are derived by matrix products BLOSUMx: This family is computed directly from blocks

  • f alignments with a defjned sequence identity threshold

(> x%). => For very related sequences we must use PAM with low numbers and BLOSUM with large numbers. The opposite holds for distant sequences

Substitution matrices Substitution matrices

slide-11
SLIDE 11

BLOSUM62 BLOSUM62 Probably the most widely used

slide-12
SLIDE 12

BLOSUM90 BLOSUM90

slide-13
SLIDE 13

BLOSUM30 BLOSUM30

slide-14
SLIDE 14

Exercise: Exercise:

Write your own substitution Matrix: Given e set of aligned sequences compute the Pab=frequency of mutations between a->b (assume symmetry, a->b counts also as b->a). Pa=as the marginal probability of Pab Finally, derive the substitution matrix: s (a,b) = log(Pab/PaPb) Use the fjles provided at: http://www.biocomp.unibo.it/piero/P4B/Malignments/ Hints: 1.start with toy.aln to check your algorithm

  • 2. Initialize your P[a][b] matrix with pseudocounts (such

as 1 instead of 0)

slide-15
SLIDE 15

DotPlots: DotPlots:

slide-16
SLIDE 16

DotPlot Exercise: DotPlot Exercise:

Write a program dotplot.py That takes as input a fasta fjle with two sequences A scoring matrix and sliding window and the threshold cut-ofg, such as: dotplot.py fasta.in score.mat 11 threshold and prints on the standard oputput the dotplot. PS: you can use matplotlib for a nicer presentation

slide-17
SLIDE 17

Distance between sequences Distance between sequences What kind of operations we consider? Mutation Deletion and Insertion (rarely rearrangements ) A: ALASVLIRLIT--YP B: ASAVHL---ITRLYP The negative gap score value depends only on the number of holes σ(n) = -nd linear σ(n) = -d - (n-1)e affjne (d: open e: extension) !! Please note that all scores are position- independent along the sequence

slide-18
SLIDE 18

Pairwise alignment Pairwise alignment Given 2 sequences what is an alignment with a maximum score? Naïf solution: try every possible alignments and select one with the best score Using the score function :

slide-19
SLIDE 19

How may alignments there are? How may alignments there are? Case without internal gaps

  • -AAA
  • AAA AAA AAA AAA- AAA--

BB--- BB-- BB- -BB --BB ---BB Given 2 sequences of length m e n, the number of shifts is m +n +1

slide-20
SLIDE 20

Case with internal gaps

  • -AAA -AAA -AAA -AAA A-AA

BB--- BB-- B-B- B--B BB-- BBAAA BABAA BAABA BAAAB ABBAA AAA AAA AA-A AAA AAA- BB- B-B -BB- -BB --BB ... ABABA ABAAB AABBA AABAB AAABB The number of the alignments is equal at the number of all possible way of mixing 2 sequences keeping track of the

  • riginal sequence order. Given 2 sequences of lengths n and

m, they are ∑k=0,min(m,n)(m+n-k)!/k!(n-k)!(m-k)! For n=m=80we get > 1043 possible alignments !!!!!!!

How may alignments there are? How may alignments there are?

slide-21
SLIDE 21

We can keep a table of the precomputed substring alignments (dynamic programming) ALSKLASPALSAKDLDSPALS ALSKIADSLAPIKDLSPASLT ALSKLASPALSAKDLDSPAL-S ALSKIADSLAPIKDLSPASLT- ALSKLASPALSAKDLDSPALS- ALSKIADSLAPIKDLSPASL-T Basic idea Basic idea

slide-22
SLIDE 22

Building step by step Given the 2 sequences ALSKLASPALSAKDLDSPALS, ALSKIADSLAPIKDLSPASLT The best alignment between the two substrings ALSKLASPA ALSKIAD can be computed taking only into consideration ALSKLASP A ALSKLASP A ALSKLASPA - ALSKIA D ALSKIAD - ALSKIA D The best among these 3 possibilities Basic idea Basic idea + + +

slide-23
SLIDE 23

The Needleman-Wunsch Matrix

x1 ……………………………… xM yN ……………………………… y1 Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment

  • f the two sequences
slide-24
SLIDE 24

The Needleman-Wunsch Algorithm

x = AGTA m = 1 y = ATA s = -1 d = -1

2

  • 1
  • 1
  • 3

A 1

  • 2

T

  • 2
  • 1

1

  • 1

A

  • 4
  • 3
  • 2
  • 1

A T G A

F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

Optimal Alignment: F(4,3) = 2 AGTA A - TA

slide-25
SLIDE 25

The Needleman-Wunsch Algorithm

  • Initialization.
  • F(0, 0)

= 0

  • F(0, j)

= - j × d

  • F(i, 0)

= - i × d

  • Main Iteration. Filling-in partial alignments
  • For each i = 1……M

For each j = 1……N F(i-1,j) – d [case 1] F(i, j) = max F(i, j-1) – d [case 2] F(i-1, j-1) + s(xi, yj) [case 3] UP, if [case 1] Ptr(i,j) = LEFT if [case 2] DIAG if [case 3]

  • Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace

back optimal alignment

slide-26
SLIDE 26

Exercise:

Suppose you want only to know the score of a global alignment. => Write a program that given two input sequence (in a single file in fasta format), a gap cost and a similarity matrix computes the score of the global alignment in O(N*M) time and in O(M) space, where M and N are the lengths of the input sequences and M<=N

slide-27
SLIDE 27

The Overlap Detection variant

Changes:

  • Initialization

For all i, j, F(i, 0) = 0 F(0, j) = 0

  • Termination

maxi F(i, N) FOPT = max maxj F(M, j) x1 ……………………………… xN yM ……………………………… y1

slide-28
SLIDE 28

The local alignment problem

Given two strings x = x1……xM, y = y1……yN Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g. x = aaaacccccgggg y = cccgggaaccaacc

slide-29
SLIDE 29

Why local alignment

  • Genes are shuffled between genomes
  • Portions of proteins (domains) are often conserved
slide-30
SLIDE 30

The Smith-Waterman algorithm

Idea: Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization: F(0, j) = F(i, 0) = 0 Iteration: F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + s(xi, yj)

slide-31
SLIDE 31

The Smith-Waterman algorithm

Termination:

  • If we want the best local alignment…

FOPT = maxi,j F(i, j)

  • If we want all local alignments scoring > t

For all i, j find F(i, j) > t, and trace back

slide-32
SLIDE 32

Scoring the gaps more accurately

Current model: Gap of length n incurs penalty n×d However, gaps usually occur in bunches Concave gap penalty function: G(n): for all n, G(n + 1) - G(n) ≤ G(n) - G(n – 1)

G(n) G(n)

slide-33
SLIDE 33

General gap dynamic programming

Initialization: same Iteration: F(i-1, j-1) + s(xi, yj) F(i, j) = max maxk=0…i-1F(k,j) – γ(i-k) maxk=0…j-1F(i,k) – γ(j-k) Termination: same Running Time: O(N2M) (assume N>M) Space: O(NM)

slide-34
SLIDE 34

Exercise:

Write a program that given two input sequence (in a single file in fasta format), and a choice of a general gap function and scoring matrix computes the alignments of the two sequences and returns one of the possible best alignments. Remember that when you store that the best score is obtained using maxk=0…i-1F(k,j) – g(i-k) maxk=0…j-1F(i,k) – g(j-k) You have to store this information in the corresponding pointer (back-trace) matrix.

slide-35
SLIDE 35

Compromise: affine gaps

g(n) = d + (n – 1)×e | | gap gap

  • pen

extend To compute optimal alignment, At position i,j, need to “remember” best score if gap is open best score if gap is not open F(i, j): score of alignment x1…xi to y1…yj if if xi aligns to yj G(i, j): score if if xi, or yj, aligns to a gap

d e

g(n)

slide-36
SLIDE 36

Needleman-Wunsch with affine gaps

Initialization: F(0,0)=0, F(i, 0) = d + (i – 1)×e, F(0, j) = d + (j – 1)×e R(0,j)= -∞ , C(i,0)= -∞ Iteration: F(i – 1, j – 1) + s(xi, yj) F(i, j) = max R(i , j) C(i , j) F(i – 1, j) – d R(i, j) = max R(i – 1, j) – e F(i , j -1) – d C(i, j) = max C(i , j -1 ) – e Termination: same

slide-37
SLIDE 37

Bounded Dynamic Programming

Assume we know that x and y are very similar Assumption: # gaps(x, y) < k(N) ( say N>M ) xi Then, | implies | i – j | < k(N) yj We can align x and y more efficiently: Time, Space: O(N × k(N)) << O(N2)

slide-38
SLIDE 38

Bounded Dynamic Programming

Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration:

For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ s(xi, yj) F(i, j) = max F(i, j – 1) – d, if j > i – k(N) F(i – 1, j) – d, if j < i + k(N) Termination: same x1 ………………………… xM y1 ………………………… yN

k(N)

Easy to extend to the affine gap case

slide-39
SLIDE 39

Significant alignments

Signifjcance of an alignment Signifjcance of an alignment Given an alignment with score S, is it signifjcant? How are the score of random alignments distributed? 100,000 alignments of unrelated and shuffmed sequences:

Score Occorrenza

slide-40
SLIDE 40

Z=(S-<S>)/σs S= Alignment score <S>= average of the scores on a random set of alignments σs= Standard deviation of the scores on a random set of alignments Signifjcance of the alignment Z<3 not signifjcant 3<Z<10 probably signifjcant Z>10 signifjcant Z-score Z-score

slide-41
SLIDE 41

Write a program that takes in input a fasta with two sequences, and a number N. Compute the score of the global alignment of the two sequence and the Z-score with respect N shuffmed sequences (generated from the fjrst

  • f the fasta) against the original second

sequence of the fasta. Z=(S-<S>)/s S= Alignment score <S>= average of the scores on a random set of alignments s Standard deviation of the scores on a random set of alignments Execise: Z-score Execise: Z-score

slide-42
SLIDE 42

The Z-score of this alignment is 7.5 over 54 residues Sequence identity is as high as 25.9%. The sequences have a completely difgerent structure Citrate synthase (2cts) vs transthyritin (2paba) Is the Z-score reliable? Is the Z-score reliable?

slide-43
SLIDE 43

E-value E-value Expected number

  • f

random alignments

  • btaining a score greater or equal to a given

score (s) Relies on the Extreme Value Distribution E=Kmn e-λs m, n: lengths of the sequences K, λ: “scaling” constants The number of high scoring random alignment increases when the sequence lengths increase and decreases in an exponential way when the score increases.

slide-44
SLIDE 44

Alignment signifjcance The signifjcance of the E-value depends on the length of the considered database. Considering Swiss Prot, E> 10-1 non signifjcant 10-1 > E > 10-3 probably not signifjcant 10 -3 > E > 10-8 probably signifjcant E < 10-8 signifjcant E-value E-value

slide-45
SLIDE 45

P-value P-value Probability for random alignments to obtain a score greater or equal to a given score (s) Given the E-value (expected number of alignments), which statistics do describe the probability of having a number a of random alignments with score ≥ S? Poisson: P(a) = Which is the probability of fjnding at least one random alignment with score ≥ S? P(a ≥ 1) = 1 – P(0) = 1 – exp (-E)

slide-46
SLIDE 46

Similarity search in Data Bases Similarity search in Data Bases Given a sequence, search for similar sequences in huge data sets In principle, the alignments between the query sequence and ALL the sequences in the data sets could be tried T

  • o many sequences!

Heuristic algorithms can be used. They do not assure to fjnd the optimal alignment FASTA BLAST

slide-47
SLIDE 47

FASTA FASTA The query sequence is chopped in words of k-tup

  • characters. Usually k-tup = 2 for proteins, 6 for DNA

ADKLPTLPLRLDPTNMVFGHLRI Words (indexed by position): AD, DK, KL, LP, PT, TL, LP, PR, RL, …,…, 1 2 3 4 5 6 7 8 9 …. The list of indexed words is compiled for each sequence in the data set (subject) The search of the correspondence between the words is very fast. The difgerence between the indexes of the matches in the query and the subject sequences determines the distance from the main diagonal

slide-48
SLIDE 48

FASTA FASTA Query Subject Many matches along the same diagonal correspond to longer identical segments along the sequences

slide-49
SLIDE 49

FASTA FASTA Query Subject The alignment of the longest matched diagonals are evaluated with a score matrix (PAM or BLOSUM)

slide-50
SLIDE 50

FASTA FASTA Query Subject Most similar regions on close diagonals are isolated

slide-51
SLIDE 51

Query Subject FASTA FASTA An exact Smith-Waterman alignment is computed on a narrow band around the diagonal endowed with the highest similarity (a 32-residue band is usually adopted)

slide-52
SLIDE 52

Sequence similarity with FASTA

slide-53
SLIDE 53

BLAST BLAST The sequence data set is indexed as follow: for each possible residue triplet the occurrence and the position along the sequences are stored. (FORMATDB) AAA AAC AAD ACA ... ... ...

slide-54
SLIDE 54

BLAST BLAST The query sequence is chopped in words, W-residue long (usually W=3 for proteins) LSHLPTLPLRLDPTNMVFGHLRI LSH, SHL, HLP, LPT, PTL, TLR, …,…, For each word, all the similar proteins are generated, using the BLOSUM62 matrix and setting a similarity threshold (usually T = 11--13) LSH 16 ISH 14 MSH 14 VSH 13 LAH 13 LTH 13 LNH 13

slide-55
SLIDE 55

BLAST BLAST Each word included in the list of the similar words is retrieved in the sequence of the data set by means of the indexes The match is extended, until the score is higher than a threshold S

slide-56
SLIDE 56

Sequence similarity with BLAST (Basic Local Alignment Search Tool)

slide-57
SLIDE 57

Alignment of all the retrieved sequences Alignment of all the retrieved sequences ATTENTION: It is not a multiple sequence alignment

slide-58
SLIDE 58

1 Y K D Y H S - D K K K G E L - - 2 Y R D Y Q T - D Q K K G D L - - 3 Y R D Y Q S - D H K K G E L - - 4 Y R D Y V S - D H K K G E L - - 5 Y R D Y Q F - D Q K K G S L - - 6 Y K D Y N T - H Q K K N E S - - 7 Y R D Y Q T - D H K K A D L - - 8 G Y G F G - - L I K N T E T T K 9 T K G Y G F G L I K N T E T T K 10 T K G Y G F G L I K N T E T T K A 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 D 0 0 70 0 0 0 0 60 0 0 0 0 20 0 0 0 E 0 0 0 0 0 0 0 0 0 0 0 0 70 0 0 0 F 0 0 0 10 0 33 0 0 0 0 0 0 0 0 0 0 G 10 0 30 0 30 0 100 0 0 0 0 50 0 0 0 0 H 0 0 0 0 10 0 0 10 30 0 0 0 0 0 0 0 K 0 40 0 0 0 0 0 0 10 100 70 0 0 0 0 100 I 0 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0 L 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0 0 M 0 0 0 0 0 0 0 0 0 0 0 0 0 60 0 0 N 0 0 0 0 10 0 0 0 0 0 30 10 0 0 0 0 P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Q 0 0 0 0 40 0 0 0 30 0 0 0 0 0 0 0 R 0 50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 S 0 0 0 0 0 33 0 0 0 0 0 0 10 10 0 0 T 20 0 0 0 0 33 0 0 0 0 0 30 0 30 100 0 V 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 W 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Y 70 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0 Position

Sequence profjle Sequence profjle

slide-59
SLIDE 59

Usefulness of the sequence profjles Usefulness of the sequence profjles Sequence profjles describes the basic features of all the sequence used in the alignment Most conserved regions and most frequent mutations for each position are highlighted Sequence-to-profjle alignment The alignment score are weighted position by position using the profjle. The same mutations in difgerent positions are scored with difgerent values

slide-60
SLIDE 60

Sequence-to-profjle alignment Sequence-to-profjle alignment

Given the position i along a sequence profjle, it is represented by a 20-valued vector Pi = Pi(A) Pi(C) …… Pi (Y) Given the residue in position j along the sequence to align: Sj The score for aligning Sj to the vector Pi is: where M is a matrix score (BLOSUM or PAM)

  • How to score a profjle-profjle alignment?
slide-61
SLIDE 61

PSI-BLAST PSI-BLAST

http://www.ncbi.nlm.nih.gov/BLAST/

Sequence Data Base BLAST Sequence profjle PSI-BLAST Until converges

slide-62
SLIDE 62
  • PSI-BLAST takes as an input a single protein sequence and compares it

to a protein database, using the gapped BLAST program

  • The program constructs a multiple alignment, and then a profjle, from

any signifjcant local alignments found. The original query sequence serves as a template for the multiple alignment and profjle, whose lengths are identical to that of the query. Difgerent numbers of sequences can be aligned in difgerent template positions

  • The profjle is compared to the protein database, again seeking local
  • alignments. After a few minor modifjcations, the BLAST algorithm can

be used for this directly.

  • PSI-BLAST estimates the statistical signifjcance of the local alignments
  • found. Because profjle substitution scores are constructed to a fjxed

scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profjle alignments.

  • Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary

number of times or until convergence.

The design of PSI-BLAST