Scoring Alignments Genome 373 Genomic Informatics Elhanan - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Scoring Alignments

Genome 373 Genomic Informatics Elhanan Borenstein

slide-2
SLIDE 2

A quick review

  • Course logistics
  • Genomes (so many genomes)
  • The computational bottleneck
slide-3
SLIDE 3

Informatic Challenges: Examples

  • Sequence comparison:

– Find the best alignment of two sequences
– Find the best match (alignment) of a given sequence in a large dataset of sequences
– Find the best alignment of multiple sequences

  • Motif and gene finding
  • Relationship between sequences

– Phylogeny

  • Clustering and classification
  • Many many many more …
slide-4
SLIDE 4

Motivation

  • Why compare two protein or DNA sequences?
slide-5
SLIDE 5

Motivation

  • Why compare two protein or DNA sequences?

– Determine whether they are descended from a common ancestor (homologous)
– Infer a common function
– Locate functional elements (motifs or domains)
– Infer protein or RNA structure, if the structure of one of the sequences is known
– Analyze sequence evolution
– Infer the species from which a sequence originated

slide-6
SLIDE 6

Informatic Challenges: Examples

  • Sequence comparison:

– Find the best alignment of two sequences
– Find the best match (alignment) of a given sequence in a large dataset of sequences
– Find the best alignment of multiple sequences

  • Motif and gene finding
  • Relationship between sequences

– Phylogeny

  • Clustering and classification
  • Many many many more …
slide-7
SLIDE 7

Informatic Challenges: Examples

  • Sequence comparison:

– Find the best alignment of two sequences
– Find the best match (alignment) of a given sequence in a large dataset of sequences
– Find the best alignment of multiple sequences

  • Motif and gene finding
  • Relationship between sequences

– Phylogeny

  • Clustering and classification
  • Many many many more …
slide-8
SLIDE 8

One of many commonly used tools that depend on sequence alignment.
slide-9
SLIDE 9

Sequence Alignment

slide-10
SLIDE 10

Mission: Find the best alignment between two sequences.

slide-11
SLIDE 11

Mission: Find the best alignment between two sequences.

Find the best alignment of GAATC and CATAC:

GAATC   GAATC-   GAAT-C   GAAT-C   -GAAT-C   GA-ATC
CATAC   CA-TAC   C-ATAC   CA-TAC   C-A-TAC   CATA-C

(some of a very large number of possibilities)

slide-12
SLIDE 12

Mission: Find the best alignment between two sequences.

This is an optimization problem! What do we need to solve this problem?

slide-13
SLIDE 13

Mission: Find the best alignment between two sequences.

A method for scoring alignments
A “search” algorithm for finding the alignment with the best score

slide-14
SLIDE 14

Scoring Principles

  • Score each locus independently.
  • The alignment score will be the sum of the scores in all loci.
  • Perfect matches will get a positive (good) score.
  • What about mismatches?

GAATC
CATAC

slide-15
SLIDE 15

Scoring Principles

  • Score each locus independently.
  • The alignment score will be the sum of the scores in all loci.
  • Perfect matches will get a positive (good) score.
  • What about mismatches?

GAATC
CATAC

(transitions are typically about 2x as frequent as transversions in real sequences)
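
The transition/transversion distinction can be made concrete with a small sketch. It assumes the standard grouping of bases into purines (A, G) and pyrimidines (C, T); the function name `substitution_type` is hypothetical and chosen only for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(a: str, b: str) -> str:
    """Classify an aligned pair of DNA bases (assumes a, b are in ACGT)."""
    if a == b:
        return "match"
    # A transition stays within one chemical class (purine <-> purine or
    # pyrimidine <-> pyrimidine); a transversion crosses between classes.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Classify every column of the ungapped alignment GAATC / CATAC.
column_types = [substitution_type(a, b) for a, b in zip("GAATC", "CATAC")]
```

In this particular pair of sequences every mismatched column is a transversion, so the example alignment never exercises a transition score.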

slide-16
SLIDE 16

Scoring Aligned Bases

  • A reasonable substitution matrix:

        A    C    G    T
   A   10   -5    …   -5
   C   -5   10   -5    …
   G    …   -5   10   -5
   T   -5    …   -5   10

GAATC
CATAC

-5 + 10 + -5 + -5 + 10 = 5

What about gaps?
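
The per-position sum for an ungapped alignment might be sketched as follows. The match score (10) and transversion score (-5) are taken from the example above; the transition score is an assumption (0 is used as a placeholder), and the names `base_score` and `score_ungapped` are hypothetical.

```python
PURINES = {"A", "G"}


def base_score(a: str, b: str, transition: int = 0) -> int:
    """Score one aligned pair of bases.

    ASSUMPTION: `transition` (A<->G, C<->T) defaults to 0 as a placeholder;
    only the match and transversion values come from the worked example.
    """
    if a == b:
        return 10
    # Both purines or both pyrimidines means the mismatch is a transition.
    is_transition = (a in PURINES) == (b in PURINES)
    return transition if is_transition else -5


def score_ungapped(x: str, y: str) -> int:
    """Sum the independent per-position scores of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(base_score(a, b) for a, b in zip(x, y))


total = score_ungapped("GAATC", "CATAC")  # -5 + 10 + -5 + -5 + 10 = 5
```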

slide-17
SLIDE 17

What About Gaps?

  • A reasonable substitution matrix:

        A    C    G    T
   A   10   -5    …   -5
   C   -5   10   -5    …
   G    …   -5   10   -5
   T   -5    …   -5   10

GAAT-C
CA-TAC

-5 + 10 + ? + 10 + ? + 10 = ?

What if gaps have no penalty? What do gaps mean?

slide-18
SLIDE 18

Scoring Gaps?

  • Linear gap penalty: every gap receives a score of d:

GAAT-C    d = -4
CA-TAC

-5 + 10 + -4 + 10 + -4 + 10 = 17
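
A linear gap penalty could be folded into the column-by-column sum roughly like this. It is a minimal sketch: the function names are hypothetical, and the transition score of 0 is an assumption that the example alignment never uses.

```python
PURINES = {"A", "G"}


def pair_score(a: str, b: str, transition: int = 0) -> int:
    """Match = 10, transversion = -5; `transition` is an assumed placeholder."""
    if a == b:
        return 10
    return transition if (a in PURINES) == (b in PURINES) else -5


def score_linear(top: str, bottom: str, d: int = -4) -> int:
    """Score a gapped alignment where every gap column costs the same d."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += d  # each gap character is penalized independently
        else:
            total += pair_score(a, b)
    return total


linear_total = score_linear("GAAT-C", "CA-TAC")  # -5 + 10 - 4 + 10 - 4 + 10 = 17
```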

slide-19
SLIDE 19

Scoring Gaps?

  • Linear gap penalty: every gap receives a score of d:

GAAT-C    d = -4
CA-TAC

-5 + 10 + -4 + 10 + -4 + 10 = 17

  • Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e:

G--AATC   d = -4
CATA--C   e = -1

-5 + -4 + -1 + 10 + -4 + -1 + 10 = 5
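
The affine scheme differs from the linear one only in how consecutive gap characters in the same sequence are charged. A sketch under the same assumptions as the examples above (match 10, transversion -5, an assumed transition score of 0, hypothetical function names):

```python
PURINES = {"A", "G"}


def pair_score(a: str, b: str, transition: int = 0) -> int:
    """Match = 10, transversion = -5; `transition` is an assumed placeholder."""
    if a == b:
        return 10
    return transition if (a in PURINES) == (b in PURINES) else -5


def score_affine(top: str, bottom: str, d: int = -4, e: int = -1) -> int:
    """Score a gapped alignment: d opens a gap, e extends an open gap."""
    total = 0
    prev_gap_row = None  # which row held a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row is an extension; otherwise open.
            total += e if row == prev_gap_row else d
            prev_gap_row = row
        else:
            total += pair_score(a, b)
            prev_gap_row = None
    return total


affine_total = score_affine("G--AATC", "CATA--C")  # -5 - 4 - 1 + 10 - 4 - 1 + 10 = 5
```

With single-character gaps, as in GAAT-C / CA-TAC, every gap is an opening, so the affine and linear scores coincide.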

slide-20
SLIDE 20

Same Method Applies to AA

BLOSUM62 Score Matrix

[Figure: BLOSUM62 matrix; rows and columns cover the regular 20 amino acids plus ambiguity codes and stop]

YMEGDLEIAPDAK
VL--DKELSPDGT

Y mutates to V: receives -1
M mutates to L: receives 2
E gets deleted: receives -10
G gets deleted: receives -10
D matches D: receives 6

Total score for these five columns = -13
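
The arithmetic for the first five columns can be reproduced with a short sketch. Only the substitution scores quoted above are included (a full BLOSUM62 table would be needed for the remaining columns); the dictionary and function names are hypothetical.

```python
# Only the pair scores quoted in the example; NOT the full BLOSUM62 matrix.
QUOTED_SCORES = {("Y", "V"): -1, ("M", "L"): 2, ("D", "D"): 6}
GAP = -10  # per-residue deletion score used in the example


def score_columns(top: str, bottom: str) -> int:
    """Sum column scores using the quoted pair scores and a linear gap cost."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += GAP
        else:
            total += QUOTED_SCORES[(a, b)]  # KeyError for pairs not quoted
    return total


partial = score_columns("YMEGD", "VL--D")  # -1 + 2 - 10 - 10 + 6 = -13
```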

slide-21
SLIDE 21

Mission: Find the best alignment between two sequences.

A method for scoring alignments
A “search” algorithm for finding the alignment with the best score

?

slide-22
SLIDE 22

Exhaustive search

  • Align the two sequences: GAATC and CATAC

Simple (exhaustive search) algorithm
1) Construct all possible alignments
2) Use the substitution matrix and gap penalty to score each alignment
3) Pick the alignment with the best score

GAATC   GAATC-   GAAT-C   GAAT-C   -GAAT-C   GA-ATC
CATAC   CA-TAC   C-ATAC   CA-TAC   C-A-TAC   CATA-C
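
The three steps of the exhaustive search might look like this in code. It is a sketch only: it uses the linear gap penalty d = -4, match 10 and transversion -5 from the earlier examples, assumes a transition score of 0, and all function names are hypothetical. The enumeration is only feasible for very short sequences.

```python
from typing import Iterator, Tuple

PURINES = {"A", "G"}
GAP = -4  # linear gap penalty d


def column_score(a: str, b: str, transition: int = 0) -> int:
    """Score one column; `transition` is an assumed placeholder value."""
    if a == "-" or b == "-":
        return GAP
    if a == b:
        return 10
    return transition if (a in PURINES) == (b in PURINES) else -5


def alignment_score(top: str, bottom: str) -> int:
    return sum(column_score(a, b) for a, b in zip(top, bottom))


def all_alignments(x: str, y: str) -> Iterator[Tuple[str, str]]:
    """Step 1: yield every alignment, built one column at a time."""
    if not x and not y:
        yield "", ""
        return
    if x and y:  # align x[0] with y[0]
        for top, bottom in all_alignments(x[1:], y[1:]):
            yield x[0] + top, y[0] + bottom
    if x:  # align x[0] with a gap
        for top, bottom in all_alignments(x[1:], y):
            yield x[0] + top, "-" + bottom
    if y:  # align a gap with y[0]
        for top, bottom in all_alignments(x, y[1:]):
            yield "-" + top, y[0] + bottom


# Steps 2 and 3: score each alignment and keep the best one.
best_top, best_bottom = max(
    all_alignments("GAATC", "CATAC"), key=lambda pair: alignment_score(*pair)
)
best_score = alignment_score(best_top, best_bottom)
```

Several alignments can tie for the best score, so which one `max` returns depends on enumeration order.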

slide-23
SLIDE 23

How many possibilities?

  • How many different possible alignments of two sequences of length n exist?
  • Align the two sequences: GAATC and CATAC

GAATC   GAATC-   GAAT-C   GAAT-C   -GAAT-C   GA-ATC
CATAC   CA-TAC   C-ATAC   CA-TAC   C-A-TAC   CATA-C

slide-24
SLIDE 24

How many possibilities?

  • How many different possible alignments of two sequences of length n exist?

    n    alignments
    5    2.5x10^2
   10    1.8x10^5
   20    1.4x10^11
   30    1.2x10^17
   40    1.1x10^23

  • Align the two sequences: GAATC and CATAC

GAATC   GAATC-   GAAT-C   GAAT-C   -GAAT-C   GA-ATC
CATAC   CA-TAC   C-ATAC   CA-TAC   C-A-TAC   CATA-C
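
The counts listed for n = 5 through 40 agree with the central binomial coefficient C(2n, n), so the sketch below assumes that is the counting convention intended; other conventions for which alignments count as distinct give different totals. The function name `count_alignments` is hypothetical.

```python
from math import comb


def count_alignments(n: int) -> int:
    """Number of alignments of two length-n sequences, assuming C(2n, n)."""
    return comb(2 * n, n)


# Print the counts in scientific notation for the lengths discussed above.
for n in (5, 10, 20, 30, 40):
    print(f"{n:>3}  {count_alignments(n):.1e}")
```

Growth is roughly a factor of four per added position, which is why scoring every alignment stops being practical almost immediately.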

slide-25
SLIDE 25

Mission: Find the best alignment between two sequences.

A method for scoring alignments
A “search” algorithm for finding the alignment with the best score

  • Needleman–Wunsch Algorithm
  • Dynamic programming
slide-26
SLIDE 26

The Needleman–Wunsch Algorithm

  • An algorithm for global alignment of two sequences
  • A Dynamic Programming (DP) approach

– Yes, it’s a weird name.
– DP is closely related to recursion and to mathematical induction

  • We can prove that the resulting score is optimal.
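
For orientation, here is a minimal sketch of the standard textbook Needleman–Wunsch score computation with a linear gap penalty. It is not taken from these slides, whose own presentation may differ. It reuses match 10, transversion -5 and d = -4 from the earlier examples, assumes a transition score of 0, and the function names are hypothetical.

```python
PURINES = {"A", "G"}


def pair_score(a: str, b: str, transition: int = 0) -> int:
    """Match = 10, transversion = -5; `transition` is an assumed placeholder."""
    if a == b:
        return 10
    return transition if (a in PURINES) == (b in PURINES) else -5


def needleman_wunsch_score(x: str, y: str, d: int = -4) -> int:
    """Return the best global alignment score of x and y (linear gap cost d)."""
    rows, cols = len(x) + 1, len(y) + 1
    # F[i][j] holds the best score for aligning x[:i] with y[:j].
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * d  # x[:i] aligned entirely against gaps
    for j in range(1, cols):
        F[0][j] = j * d  # y[:j] aligned entirely against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1]),  # align pair
                F[i - 1][j] + d,  # gap in y
                F[i][j - 1] + d,  # gap in x
            )
    return F[-1][-1]


nw_score = needleman_wunsch_score("GAATC", "CATAC")
```

Each cell is filled from three already-computed neighbors, so the table needs on the order of n x m steps rather than one step per possible alignment.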
slide-27
SLIDE 27

DP matrix

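[Figure: empty DP matrix whose two axes are labeled with the letters of GAATC and CATAC; cells are indexed by i (1, 2, 3, 4, 5) and j (0, 1, 2, 3, etc.)]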

slide-28
SLIDE 28