Pairwise Alignment Mark Voorhies 3/27/2012 Mark Voorhies Pairwise - - PowerPoint PPT Presentation

pairwise alignment
SMART_READER_LITE
LIVE PREVIEW

Pairwise Alignment Mark Voorhies 3/27/2012 Mark Voorhies Pairwise - - PowerPoint PPT Presentation

Pairwise Alignment Mark Voorhies 3/27/2012 Mark Voorhies Pairwise Alignment Review: Tips and tricks Making a file executable: chmod a+x pydotter . py Handling file/directory names with spaces: cd My \ D i r e c t o r y \ with \ Spaces


slide-1
SLIDE 1

Pairwise Alignment

Mark Voorhies 3/27/2012

Mark Voorhies Pairwise Alignment

slide-2
SLIDE 2

Review: Tips and tricks

Making a file executable: chmod ”a+x” pydotter . py Handling file/directory names with spaces: cd My\ D i r e c t o r y \ with \ Spaces

  • r

cd ”My D i r e c t o r y with Spaces ”

Mark Voorhies Pairwise Alignment

slide-3
SLIDE 3

Review: Tips and tricks

Killing a process on OS X: Try ctrl-c If that doesn’t work:

ps −awx | grep name of process First column in ps output is PID (process ID) k i l l PID If that doesn’t work: k i l l −KILL PID

On Linux: ps −e a l f | grep name of process

Mark Voorhies Pairwise Alignment

slide-4
SLIDE 4

Review: Content

FASTA files >Name Free-form annotation MGCLLIMKEGGPGRKHKLIVMLYLDENQ EHELPIMTRAPPEDINADNAMACHINEW NQEDLYMNILKHGPPGEDEDRKHEDEDG

Mark Voorhies Pairwise Alignment

slide-5
SLIDE 5

Review: Content

FASTA files >Name Free-form annotation MGCLLIMKEGGPGRKHKLIVMLYLDENQ EHELPIMTRAPPEDINADNAMACHINEW NQEDLYMNILKHGPPGEDEDRKHEDEDG Dotplots: unbiased plot of all possible ungapped alignments of two sequences.

Mark Voorhies Pairwise Alignment

slide-6
SLIDE 6

Pairwise Alignment

How can we automate our dotplot protocol to find the “best” gapped alignment of our sequences?

Mark Voorhies Pairwise Alignment

slide-7
SLIDE 7

Pairwise Alignment

How can we automate our dotplot protocol to find the “best” gapped alignment of our sequences? What do we mean by best?

Mark Voorhies Pairwise Alignment

slide-8
SLIDE 8

Pairwise Alignment

How can we automate our dotplot protocol to find the “best” gapped alignment of our sequences? What do we mean by best? Residues with equivalent functional roles are paired

Mark Voorhies Pairwise Alignment

slide-9
SLIDE 9

Pairwise Alignment

How can we automate our dotplot protocol to find the “best” gapped alignment of our sequences? What do we mean by best? Residues with equivalent functional roles are paired Residues that derive from the same position in the common ancestor are paired (homology)

Mark Voorhies Pairwise Alignment

slide-10
SLIDE 10

Pairwise Alignment

How can we automate our dotplot protocol to find the “best” gapped alignment of our sequences? What do we mean by best? Residues with equivalent functional roles are paired Residues that derive from the same position in the common ancestor are paired (homology) The sequence alignment maximizes a similarity function

Mark Voorhies Pairwise Alignment

slide-11
SLIDE 11

Deriving scores from alignments

Frequency of residue i: pi

Mark Voorhies Pairwise Alignment

slide-12
SLIDE 12

Deriving scores from alignments

Frequency of residue i: pi Frequency of residue i aligned to residue j: qij

Mark Voorhies Pairwise Alignment

slide-13
SLIDE 13

Deriving scores from alignments

Frequency of residue i: pi Frequency of residue i aligned to residue j: qij Expected frequency if i and j are independent: pipj

Mark Voorhies Pairwise Alignment

slide-14
SLIDE 14

Deriving scores from alignments

Frequency of residue i: pi Frequency of residue i aligned to residue j: qij Expected frequency if i and j are independent: pipj Ratio of observed to expected frequency: qij pipj

Mark Voorhies Pairwise Alignment

slide-15
SLIDE 15

Deriving scores from alignments

Frequency of residue i: pi Frequency of residue i aligned to residue j: qij Expected frequency if i and j are independent: pipj Ratio of observed to expected frequency: qij pipj Log odds (LOD) score: s(i, j) = log qij pipj

Mark Voorhies Pairwise Alignment

slide-16
SLIDE 16

PAM (Dayhoff) and BLOSUM matrices

PAM1 matrix originally calculated from manual alignments of highly conserved sequences (myoglobin, cytochrome C, etc.)

Mark Voorhies Pairwise Alignment

slide-17
SLIDE 17

PAM (Dayhoff) and BLOSUM matrices

PAM1 matrix originally calculated from manual alignments of highly conserved sequences (myoglobin, cytochrome C, etc.) We can think of a PAM matrix as evolving a sequence by one unit of time.

Mark Voorhies Pairwise Alignment

slide-18
SLIDE 18

PAM (Dayhoff) and BLOSUM matrices

PAM1 matrix originally calculated from manual alignments of highly conserved sequences (myoglobin, cytochrome C, etc.) We can think of a PAM matrix as evolving a sequence by one unit of time. If evolution is uniform over time, then PAM matrices for larger evolutionary steps can be generated by multiplying PAM1 by itself (so, higher numbered PAM matrices represent greater evolutionary distances).

Mark Voorhies Pairwise Alignment

slide-19
SLIDE 19

PAM (Dayhoff) and BLOSUM matrices

PAM1 matrix originally calculated from manual alignments of highly conserved sequences (myoglobin, cytochrome C, etc.) We can think of a PAM matrix as evolving a sequence by one unit of time. If evolution is uniform over time, then PAM matrices for larger evolutionary steps can be generated by multiplying PAM1 by itself (so, higher numbered PAM matrices represent greater evolutionary distances). The BLOSUM matrices were determined from automatically generated ungapped alignments. Higher numbered BLOSUM matrices correspond to smaller evolutionary distances. BLOSUM62 is the default matrix for BLAST.

Mark Voorhies Pairwise Alignment

slide-20
SLIDE 20

BLOSUM80

Mark Voorhies Pairwise Alignment

slide-21
SLIDE 21

BLOSUM62

Mark Voorhies Pairwise Alignment

slide-22
SLIDE 22

BLOSUM45

Mark Voorhies Pairwise Alignment

slide-23
SLIDE 23

Fun with logarithms

In log space, multiplication and division become addition and subtraction: log(xy) = log(x) + log(y) log(x/y) = log(x) − log(y) Therefore, exponentiation becomes multiplication: log(xy) = y log(x) Also, we can change of the base of a logarithm like so: logA(x) = log(x)/ log(A)

Mark Voorhies Pairwise Alignment

slide-24
SLIDE 24

Scoring an alignment

Log odds (LOD) score: s(i, j) = log qij pipj

Mark Voorhies Pairwise Alignment

slide-25
SLIDE 25

Scoring an alignment

Log odds (LOD) score: s(i, j) = log qij pipj Multiplying independent probabilities is equivalent to adding independent log probabilities.

Mark Voorhies Pairwise Alignment

slide-26
SLIDE 26

Scoring an alignment

Log odds (LOD) score: s(i, j) = log qij pipj Multiplying independent probabilities is equivalent to adding independent log probabilities. Therefore, for an ungapped alignment can be scored as: S(x, y) = log

N

  • i

qxiyi pxipyi =

N

  • i

s(xi, yi)

Mark Voorhies Pairwise Alignment

slide-27
SLIDE 27

Scoring an alignment

Log odds (LOD) score: s(i, j) = log qij pipj Multiplying independent probabilities is equivalent to adding independent log probabilities. Therefore, for an ungapped alignment can be scored as: S(x, y) = log

N

  • i

qxiyi pxipyi =

N

  • i

s(xi, yi) What about gaps?

Mark Voorhies Pairwise Alignment

slide-28
SLIDE 28

Scoring an alignment

Log odds (LOD) score: s(i, j) = log qij pipj Multiplying independent probabilities is equivalent to adding independent log probabilities. Therefore, for an ungapped alignment can be scored as: S(x, y) = log

N

  • i

qxiyi pxipyi =

N

  • i

s(xi, yi) What about gaps? Probability of an insertion/deletion event (gap opening, G) Length distribution of insertions/deletions (gap extension, E)

Mark Voorhies Pairwise Alignment

slide-29
SLIDE 29

Scoring an alignment

Log odds (LOD) score: s(i, j) = log qij pipj Multiplying independent probabilities is equivalent to adding independent log probabilities. Therefore, for an ungapped alignment can be scored as: S(x, y) = log

N

  • i

qxiyi pxipyi =

N

  • i

s(xi, yi) What about gaps? Probability of an insertion/deletion event (gap opening, G) Length distribution of insertions/deletions (gap extension, E) Sgapped(x, y) = S(x, y) +

gaps

  • i

(G + E ∗ Li)

Mark Voorhies Pairwise Alignment

slide-30
SLIDE 30

Scoring an alignment

Log odds (LOD) score: s(i, j) = log qij pipj Multiplying independent probabilities is equivalent to adding independent log probabilities. Therefore, for an ungapped alignment can be scored as: S(x, y) = log

N

  • i

qxiyi pxipyi =

N

  • i

s(xi, yi) What about gaps? Probability of an insertion/deletion event (gap opening, G) Length distribution of insertions/deletions (gap extension, E) Sgapped(x, y) = S(x, y) +

gaps

  • i

(G + E ∗ Li) We find an optimal alignment by finding x and y that maximize S.

Mark Voorhies Pairwise Alignment

slide-31
SLIDE 31

How many ways can we align two sequences?

Mark Voorhies Pairwise Alignment

slide-32
SLIDE 32

How many ways can we align two sequences?

Mark Voorhies Pairwise Alignment

slide-33
SLIDE 33

How many ways can we align two sequences?

Mark Voorhies Pairwise Alignment

slide-34
SLIDE 34

How many ways can we align two sequences?

Mark Voorhies Pairwise Alignment

slide-35
SLIDE 35

How many ways can we align two sequences?

Binomial formula: k r

  • =

k! (k − r)!r! 2n n

  • = (2n)!

n!n! Stirling’s approximation: x! ≈ √ 2π

  • xx+ 1

2

  • e−x

2n n

  • ≈ 22n

√πn

Mark Voorhies Pairwise Alignment

slide-36
SLIDE 36

Scoring an alignment quickly

22n √πn is too expensive.

Mark Voorhies Pairwise Alignment

slide-37
SLIDE 37

Scoring an alignment quickly

22n √πn is too expensive.

Sgapped(x, y) = S(x, y) +

gaps

  • i

(G + E ∗ Li)

Mark Voorhies Pairwise Alignment

slide-38
SLIDE 38

Scoring an alignment quickly

22n √πn is too expensive.

Sgapped(x, y) = S(x, y) +

gaps

  • i

(G + E ∗ Li) The best alignment of any pair of subsequences is independent of the global alignment.

Mark Voorhies Pairwise Alignment

slide-39
SLIDE 39

Dynamic Programming

Mark Voorhies Pairwise Alignment

slide-40
SLIDE 40

Needleman-Wunsch Global Alignment

Mark Voorhies Pairwise Alignment

slide-41
SLIDE 41

Needleman-Wunsch Global Alignment

Mark Voorhies Pairwise Alignment

slide-42
SLIDE 42

Needleman-Wunsch Global Alignment

Mark Voorhies Pairwise Alignment

slide-43
SLIDE 43

Needleman-Wunsch Global Alignment

Mark Voorhies Pairwise Alignment

slide-44
SLIDE 44

Needleman-Wunsch Global Alignment

Mark Voorhies Pairwise Alignment

slide-45
SLIDE 45

Alignment speeds

DOTTER: O(n2) Exhaustive search:

22n √πn

Dynamic programming: O(n2) to O(n3)

Mark Voorhies Pairwise Alignment

slide-46
SLIDE 46

Setting gap penalties in CLUSTALX

Mark Voorhies Pairwise Alignment

slide-47
SLIDE 47

Annotating features in JALVIEW

Mark Voorhies Pairwise Alignment

slide-48
SLIDE 48

Annotating features in JALVIEW

Mark Voorhies Pairwise Alignment

slide-49
SLIDE 49

Annotating features in JALVIEW

Mark Voorhies Pairwise Alignment

slide-50
SLIDE 50

Annotating features in JALVIEW

Mark Voorhies Pairwise Alignment

slide-51
SLIDE 51

Annotating features in JALVIEW

Mark Voorhies Pairwise Alignment

slide-52
SLIDE 52

Annotating features in JALVIEW

Mark Voorhies Pairwise Alignment

slide-53
SLIDE 53

Annotating features in JALVIEW

Mark Voorhies Pairwise Alignment

slide-54
SLIDE 54

Annotating features in JALVIEW

Mark Voorhies Pairwise Alignment

slide-55
SLIDE 55

Annotating features in JALVIEW

Mark Voorhies Pairwise Alignment

slide-56
SLIDE 56

Comparing files on *NIX

# L i s t a l l d i f f e r e n c e s between two t e x t f i l e s # ( empty output f o r i d e n t i t y ) d i f f HvSs . gap0 . 0 . both . aln HvSs . gap0 . 0 . mult . aln # Report

  • nly

whether the f i l e s d i f f e r # ( empty output f o r i d e n t i t y ) d i f f −q HvSs . gap0 . 0 . both . aln HvSs . gap0 . 0 . mult . aln (*NIX = *BSD, OS X, Solaris, Linux, Windows with Cygwin, ...)

Mark Voorhies Pairwise Alignment

slide-57
SLIDE 57

Homework

Use a text or sequence editor to create a spliced variant of HvPhytepsin that can be aligned to the full HsSaposinC sequence Find the GenBank entries for HvPhytepsin and SsPepsinogen (tip: use the identifiers from the FASTA files) and find the corresponding transcript sequences.

How easy is it to align the proteins vs. the transcripts? Can you tell if you are getting equivalent results from the two alignments?

Try repeating this exercise for a pair of sequences where genomic sequence is available; e.g., A. nidulans VosA (ABQ18268.1), and H. capsulatum Ryp2 (ACB59236.1).

Mark Voorhies Pairwise Alignment