Sequence Comparison: Dynamic Programming



slide-1
SLIDE 1

Sequence Comparison: Dynamic Programming

Genome 373 Genomic Informatics Elhanan Borenstein

slide-2
SLIDE 2

Mission: Find the best alignment between two sequences.

GAATC
CATAC

  • A method for scoring alignments
      – Substitution matrix
      – Gap penalties
  • A “search” algorithm for finding the alignment with the best score
      – Dynamic programming

slide-3
SLIDE 3

Scoring Aligned Bases

  • Substitution matrix:

           A    C    G    T
      A   10   -5    0   -5
      C   -5   10   -5    0
      G    0   -5   10   -5
      T   -5    0   -5   10

  • Gap penalty:
      – Linear gap penalty
      – Affine gap penalty

Example, with a linear gap penalty d = -4:

GAAT-C
CA-TAC

-5 + 10 + -4 + 10 + -4 + 10 = 17
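
The arithmetic above can be checked with a short Python sketch. The names `SCORES`, `GAP` and `score_alignment` are illustrative, not from the slides, and the 0 entries for A/G and C/T are inferred from the worked DP values later in the deck.

```python
# Substitution scores: 10 for a match, 0 for A<->G or C<->T (inferred),
# and -5 for every other mismatch.
SCORES = {
    "A": {"A": 10, "C": -5, "G": 0, "T": -5},
    "C": {"A": -5, "C": 10, "G": -5, "T": 0},
    "G": {"A": 0, "C": -5, "G": 10, "T": -5},
    "T": {"A": -5, "C": 0, "G": -5, "T": 10},
}
GAP = -4  # linear gap penalty d


def score_alignment(top: str, bottom: str) -> int:
    """Sum the column scores of two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot pair two gaps")
        # A column with one gap costs d; otherwise look up the substitution score.
        total += GAP if "-" in (a, b) else SCORES[a][b]
    return total


print(score_alignment("GAAT-C", "CA-TAC"))  # 17
```
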
slide-4
SLIDE 4

Exhaustive search

GAATC     GAATC-    GAAT-C    GAAT-C    -GAAT-C   GA-ATC
CATAC     CA-TAC    C-ATAC    CA-TAC    C-A-TAC   CATA-C

slide-5
SLIDE 5

How many possibilities?

GAATC     GAATC-    GAAT-C    GAAT-C    -GAAT-C   GA-ATC
CATAC     CA-TAC    C-ATAC    CA-TAC    C-A-TAC   CATA-C

  • How many different possible alignments of two sequences of length n exist?

slide-6
SLIDE 6

How many possibilities?

GAATC     GAATC-    GAAT-C    GAAT-C    -GAAT-C   GA-ATC
CATAC     CA-TAC    C-ATAC    CA-TAC    C-A-TAC   CATA-C

  • How many different possible alignments of two sequences of length n exist?

      n     number of alignments
      5     2.5 x 10^2
      10    1.8 x 10^5
      20    1.4 x 10^11
      30    1.2 x 10^17
      40    1.1 x 10^23
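
The slide does not state the formula behind these figures. They agree, to two significant digits, with the central binomial coefficient C(2n, n), so the sketch below assumes that is the intended count; `count_alignments` is an illustrative name.

```python
import math


def count_alignments(n: int) -> int:
    """Central binomial coefficient C(2n, n), assumed to be the slide's count."""
    return math.comb(2 * n, n)


for n in (5, 10, 20, 30, 40):
    # Print in scientific notation with two significant digits, as on the slide.
    print(n, f"{count_alignments(n):.1e}")
```

Whatever the exact counting convention, the growth is exponential in n, which is why exhaustive search is not practical.
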

slide-7
SLIDE 7

The Needleman–Wunsch Algorithm

  • An algorithm for global alignment of two sequences
  • A Dynamic Programming (DP) approach
      – Yes, it’s a weird name.
      – DP is closely related to recursion and to mathematical induction
  • We can prove that the resulting score is optimal.
slide-8
SLIDE 8

DP matrix

            G    A    A    T    C

  C
  A              5
  T
  A
  C

The axes are indexed by i and j; index 0 marks the initial row and column.

The value at (i,j) is the score of the best alignment of the first i characters of one sequence versus the first j characters of the other sequence. For example, the entry 5 is the score of the best alignment of GA with CA:

GA
CA

slide-9
SLIDE 9

DP matrix

            G    A    A    T    C

  C
  A              5    1
  T
  A
  C

Moving horizontally in the matrix introduces a gap in the sequence along the left edge.

GAA
CA-

slide-10
SLIDE 10

DP matrix

            G    A    A    T    C

  C
  A              5
  T              1
  A
  C

Moving vertically in the matrix introduces a gap in the sequence along the top edge.

GA-
CAT

slide-11
SLIDE 11

DP matrix

            G    A    A    T    C

  C
  A              5
  T                   0
  A
  C

Moving diagonally in the matrix aligns two residues.

GAA
CAT
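
The three kinds of move can be sketched in Python as follows. The move letters 'D', 'H', 'V' and the function name `alignment_from_moves` are assumptions made for illustration; the slides only describe the moves in words.

```python
def alignment_from_moves(top: str, left: str, moves: str) -> tuple[str, str]:
    """Build an alignment from a path of moves starting at the top-left cell.

    'D' = diagonal (align two residues)
    'H' = horizontal (gap in the sequence along the left edge)
    'V' = vertical (gap in the sequence along the top edge)
    """
    i = j = 0  # residues consumed so far from the top and left sequences
    top_row, left_row = [], []
    for move in moves:
        if move == "D":
            top_row.append(top[i])
            left_row.append(left[j])
            i += 1
            j += 1
        elif move == "H":
            top_row.append(top[i])
            left_row.append("-")
            i += 1
        elif move == "V":
            top_row.append("-")
            left_row.append(left[j])
            j += 1
        else:
            raise ValueError(f"unknown move {move!r}")
    return "".join(top_row), "".join(left_row)


print(alignment_from_moves("GAATC", "CATAC", "DDH"))  # ('GAA', 'CA-')
print(alignment_from_moves("GAATC", "CATAC", "DDV"))  # ('GA-', 'CAT')
print(alignment_from_moves("GAATC", "CATAC", "DDD"))  # ('GAA', 'CAT')
```
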

slide-12
SLIDE 12

Initialization

            G    A    A    T    C
       0
  C
  A
  T
  A
  C

Start at top left and move progressively

slide-13
SLIDE 13

Introducing a gap

            G    A    A    T    C
       0   -4
  C
  A
  T
  A
  C

G
-

slide-14
SLIDE 14

Introducing a gap

            G    A    A    T    C
       0   -4
  C   -4
  A
  T
  A
  C

-
C

slide-15
SLIDE 15

Complete first row and column

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4
  A   -8
  T  -12
  A  -16
  C  -20

-----
CATAC
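
A minimal sketch of this initialization, assuming the linear gap penalty d = -4 used throughout; `init_matrix` is an illustrative name, and unfilled cells are represented by `None`.

```python
def init_matrix(top: str, left: str, gap: int = -4) -> list[list]:
    """Create the DP matrix and fill only its first row and first column."""
    rows, cols = len(left) + 1, len(top) + 1
    matrix = [[None] * cols for _ in range(rows)]
    for c in range(cols):
        matrix[0][c] = c * gap  # c residues of the top sequence against gaps
    for r in range(rows):
        matrix[r][0] = r * gap  # r residues of the left sequence against gaps
    return matrix


matrix = init_matrix("GAATC", "CATAC")
print(matrix[0])                    # [0, -4, -8, -12, -16, -20]
print([row[0] for row in matrix])   # [0, -4, -8, -12, -16, -20]
```
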
slide-16
SLIDE 16

Three ways to get to i=1, j=1

            G    A    A    T    C
       0   -4
  C        -8
  A
  T
  A
  C

G-
-C

slide-17
SLIDE 17

Three ways to get to i=1, j=1

            G    A    A    T    C
       0
  C   -4   -8
  A
  T
  A
  C

-G
C-

slide-18
SLIDE 18

Three ways to get to i=1, j=1

            G    A    A    T    C
       0
  C        -5
  A
  T
  A
  C

G
C

slide-19
SLIDE 19

Accept the highest scoring of the three

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4   -5
  A   -8
  T  -12
  A  -16
  C  -20

Then simply repeat the same rule progressively across the matrix
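
The rule can be written as a recurrence. The notation is an assumption for illustration and does not appear on the slides: F(i,j) is the matrix entry, x and y are the two sequences, s(a,b) is the substitution score, and d = -4 is the linear gap penalty.

```latex
F(0,0) = 0, \qquad F(i,0) = i\,d, \qquad F(0,j) = j\,d,
\qquad
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i,\,y_j) & \text{diagonal: align two residues} \\
F(i-1,\,j) + d & \text{gap in one sequence} \\
F(i,\,j-1) + d & \text{gap in the other sequence}
\end{cases}
```

For the entry at i=1, j=1 this gives max(0 + s(G,C), -4 + d, -4 + d) = max(-5, -8, -8) = -5, the value accepted above.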

slide-20
SLIDE 20

DP matrix

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4   -5
  A   -8    ?
  T  -12
  A  -16
  C  -20

slide-21
SLIDE 21

DP matrix

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4   -5
  A   -8    ?
  T  -12
  A  -16
  C  -20

Three candidate scores for the “?” entry:

  • Diagonal: -4 + 0 = -4

        -G
        CA

  • From above: -5 + -4 = -9

        G-
        CA

  • From the left: -8 + -4 = -12

        --G
        CA-
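
The three sums for this entry can be reproduced with a small sketch. The function and argument names are illustrative; the numbers are the neighbouring entries read off the matrix, with substitution score s(G, A) = 0 and gap penalty d = -4.

```python
GAP = -4  # linear gap penalty d


def candidate_scores(diagonal_cell, above_cell, left_cell, substitution):
    """Return the three candidate scores for one DP entry."""
    return {
        "diagonal": diagonal_cell + substitution,
        "from above": above_cell + GAP,
        "from the left": left_cell + GAP,
    }


scores = candidate_scores(diagonal_cell=-4, above_cell=-5, left_cell=-8, substitution=0)
best_move = max(scores, key=scores.get)
print(scores)     # {'diagonal': -4, 'from above': -9, 'from the left': -12}
print(best_move)  # diagonal
```
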

slide-22
SLIDE 22

DP matrix

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4   -5
  A   -8   -4
  T  -12
  A  -16
  C  -20

Three candidate scores for the new entry; the highest, -4, is accepted:

  • Diagonal: -4 + 0 = -4

        -G
        CA

  • From above: -5 + -4 = -9

        G-
        CA

  • From the left: -8 + -4 = -12

        --G
        CA-
slide-23
SLIDE 23

DP matrix

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4   -5
  A   -8   -4
  T  -12    ?
  A  -16    ?
  C  -20    ?

slide-24
SLIDE 24

DP matrix

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4   -5
  A   -8   -4
  T  -12   -8
  A  -16  -12
  C  -20  -16

slide-25
SLIDE 25

DP matrix

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4   -5    ?
  A   -8   -4    ?
  T  -12   -8    ?
  A  -16  -12    ?
  C  -20  -16    ?

slide-26
SLIDE 26

Traceback

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4   -5   -9
  A   -8   -4    5
  T  -12   -8    1
  A  -16  -12    2
  C  -20  -16   -2

What is the alignment associated with this entry (the value 2)?

Just follow the arrows back - this is called the traceback

-G-A
CATA
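
A minimal traceback sketch follows. Instead of storing arrows it re-derives, at each entry, which neighbour could have produced the value; that is an implementation choice made here, not something the slides specify. The names `sub_score`, `fill` and `traceback`, and the tie-break order (diagonal, then above, then left), are assumptions.

```python
GAP = -4  # linear gap penalty d


def sub_score(a: str, b: str) -> int:
    """10 for a match, 0 for A<->G or C<->T, -5 otherwise."""
    if a == b:
        return 10
    if {a, b} in ({"A", "G"}, {"C", "T"}):
        return 0
    return -5


def fill(top: str, left: str) -> list[list[int]]:
    """Fill the whole DP matrix (rows follow `left`, columns follow `top`)."""
    F = [[0] * (len(top) + 1) for _ in range(len(left) + 1)]
    for c in range(1, len(top) + 1):
        F[0][c] = c * GAP
    for r in range(1, len(left) + 1):
        F[r][0] = r * GAP
        for c in range(1, len(top) + 1):
            F[r][c] = max(
                F[r - 1][c - 1] + sub_score(top[c - 1], left[r - 1]),
                F[r - 1][c] + GAP,
                F[r][c - 1] + GAP,
            )
    return F


def traceback(F, top: str, left: str, r: int, c: int) -> tuple[str, str]:
    """Follow one best path from entry (r, c) back to the top-left corner."""
    top_row, left_row = [], []
    while r > 0 or c > 0:
        if r > 0 and c > 0 and F[r][c] == F[r - 1][c - 1] + sub_score(top[c - 1], left[r - 1]):
            top_row.append(top[c - 1])   # diagonal: align two residues
            left_row.append(left[r - 1])
            r, c = r - 1, c - 1
        elif r > 0 and F[r][c] == F[r - 1][c] + GAP:
            top_row.append("-")          # vertical: gap in the top sequence
            left_row.append(left[r - 1])
            r -= 1
        else:
            top_row.append(top[c - 1])   # horizontal: gap in the left sequence
            left_row.append("-")
            c -= 1
    return "".join(reversed(top_row)), "".join(reversed(left_row))


F = fill("GAATC", "CATAC")
print(traceback(F, "GAATC", "CATAC", 4, 2))  # ('-G-A', 'CATA')
```
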

slide-27
SLIDE 27

Full Alignment

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4   -5   -9
  A   -8   -4    5
  T  -12   -8    1
  A  -16  -12    2
  C  -20  -16   -2              ?

Continue and find the optimal global alignment, and its score.

slide-28
SLIDE 28

Full Alignment

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4   -5   -9  -13  -12   -6
  A   -8   -4    5    1   -3   -7
  T  -12   -8    1    0   11    7
  A  -16  -12    2   11    7    6
  C  -20  -16   -2    7   11   17
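
The completed matrix can be reproduced with a short fill routine. This is a sketch under the scoring scheme used in these slides (match 10, A/G and C/T 0, other mismatches -5, d = -4); `sub_score` and `fill_matrix` are illustrative names.

```python
def sub_score(a: str, b: str) -> int:
    """10 for a match, 0 for A<->G or C<->T, -5 otherwise."""
    if a == b:
        return 10
    if {a, b} in ({"A", "G"}, {"C", "T"}):
        return 0
    return -5


def fill_matrix(top: str, left: str, gap: int = -4) -> list[list[int]]:
    """Needleman-Wunsch fill: rows follow `left`, columns follow `top`."""
    rows, cols = len(left) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    for c in range(1, cols):
        F[0][c] = c * gap
    for r in range(1, rows):
        F[r][0] = r * gap
        for c in range(1, cols):
            F[r][c] = max(
                F[r - 1][c - 1] + sub_score(top[c - 1], left[r - 1]),  # diagonal
                F[r - 1][c] + gap,                                     # from above
                F[r][c - 1] + gap,                                     # from the left
            )
    return F


matrix = fill_matrix("GAATC", "CATAC")
for label, row in zip(" CATAC", matrix):
    print(label, " ".join(f"{value:4d}" for value in row))
print("best score:", matrix[-1][-1])  # best score: 17
```

Each entry is computed once from three neighbours, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.
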

slide-29
SLIDE 29

Full Alignment

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4   -5   -9  -13  -12   -6
  A   -8   -4    5    1   -3   -7
  T  -12   -8    1    0   11    7
  A  -16  -12    2   11    7    6
  C  -20  -16   -2    7   11   17

Best alignment starts at bottom right and follows traceback arrows to top left

slide-30
SLIDE 30

One best traceback

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4   -5   -9  -13  -12   -6
  A   -8   -4    5    1   -3   -7
  T  -12   -8    1    0   11    7
  A  -16  -12    2   11    7    6
  C  -20  -16   -2    7   11   17

GA-ATC
CATA-C

slide-31
SLIDE 31

Another best traceback

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4   -5   -9  -13  -12   -6
  A   -8   -4    5    1   -3   -7
  T  -12   -8    1    0   11    7
  A  -16  -12    2   11    7    6
  C  -20  -16   -2    7   11   17

GAAT-C
-CATAC
slide-32
SLIDE 32

            G    A    A    T    C
       0   -4   -8  -12  -16  -20
  C   -4   -5   -9  -13  -12   -6
  A   -8   -4    5    1   -3   -7
  T  -12   -8    1    0   11    7
  A  -16  -12    2   11    7    6
  C  -20  -16   -2    7   11   17

GA-ATC    GAAT-C
CATA-C    -CATAC
slide-33
SLIDE 33

Multiple solutions

  • When a program returns a single sequence alignment, it may not be the only best alignment but it is guaranteed to be one of them.
  • In our example, all of the alignments below have equal scores.

GA-ATC    GAAT-C    GAAT-C    GAAT-C
CATA-C    CA-TAC    C-ATAC    -CATAC
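
One way to list every best alignment is to branch at each tie during the traceback. The sketch below does this recursively; it is an illustration under the scoring scheme of these slides, and the names `sub_score` and `all_best_alignments` are assumptions.

```python
def sub_score(a: str, b: str) -> int:
    """10 for a match, 0 for A<->G or C<->T, -5 otherwise."""
    if a == b:
        return 10
    if {a, b} in ({"A", "G"}, {"C", "T"}):
        return 0
    return -5


def all_best_alignments(top: str, left: str, gap: int = -4) -> list[tuple[str, str]]:
    """Return every alignment that achieves the optimal global score."""
    rows, cols = len(left) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    for c in range(1, cols):
        F[0][c] = c * gap
    for r in range(1, rows):
        F[r][0] = r * gap
        for c in range(1, cols):
            F[r][c] = max(
                F[r - 1][c - 1] + sub_score(top[c - 1], left[r - 1]),
                F[r - 1][c] + gap,
                F[r][c - 1] + gap,
            )

    results = []

    def walk(r: int, c: int, top_rev: str, left_rev: str) -> None:
        # top_rev / left_rev hold the alignment built so far, in reverse order.
        if r == 0 and c == 0:
            results.append((top_rev[::-1], left_rev[::-1]))
            return
        if r > 0 and c > 0 and F[r][c] == F[r - 1][c - 1] + sub_score(top[c - 1], left[r - 1]):
            walk(r - 1, c - 1, top_rev + top[c - 1], left_rev + left[r - 1])
        if r > 0 and F[r][c] == F[r - 1][c] + gap:
            walk(r - 1, c, top_rev + "-", left_rev + left[r - 1])
        if c > 0 and F[r][c] == F[r][c - 1] + gap:
            walk(r, c - 1, top_rev + top[c - 1], left_rev + "-")

    walk(len(left), len(top), "", "")
    return results


for top_row, left_row in all_best_alignments("GAATC", "CATAC"):
    print(top_row)
    print(left_row)
    print()
```

The number of tied alignments can grow quickly for longer or more repetitive sequences, which is one reason programs usually report only one of them.
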
slide-34
SLIDE 34

Practice problem:

Find a best pairwise alignment of GAATC and AATTC

       A    C    G    T
  A   10   -5    0   -5
  C   -5   10   -5    0
  G    0   -5   10   -5
  T   -5    0   -5   10

d = -4

            G    A    A    T    C

  A
  A
  T
  T
  C
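
To check a hand-worked answer, only the optimal score is needed, and it can be computed while keeping just two rows of the matrix in memory. This is a space-saving variant of the fill shown in the slides, not the slides' own procedure; `sub_score` and `nw_score` are illustrative names. The result is deliberately not printed here.

```python
def sub_score(a: str, b: str) -> int:
    """10 for a match, 0 for A<->G or C<->T, -5 otherwise."""
    if a == b:
        return 10
    if {a, b} in ({"A", "G"}, {"C", "T"}):
        return 0
    return -5


def nw_score(top: str, left: str, gap: int = -4) -> int:
    """Optimal global alignment score, keeping only the previous row."""
    previous = [c * gap for c in range(len(top) + 1)]  # initial row
    for r, b in enumerate(left, start=1):
        current = [r * gap]  # initial column entry for this row
        for c, a in enumerate(top, start=1):
            current.append(max(
                previous[c - 1] + sub_score(a, b),  # diagonal
                previous[c] + gap,                  # from above
                current[c - 1] + gap,               # from the left
            ))
        previous = current
    return previous[-1]


practice_score = nw_score("GAATC", "AATTC")
# Compare practice_score with the bottom-right entry of the hand-filled matrix.
```
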

slide-35
SLIDE 35