Heuristic Alignment and Searching Mark Voorhies 3/28/2012 Mark - - PowerPoint PPT Presentation

heuristic alignment and searching
SMART_READER_LITE
LIVE PREVIEW

Heuristic Alignment and Searching Mark Voorhies 3/28/2012 Mark - - PowerPoint PPT Presentation

Heuristic Alignment and Searching Mark Voorhies 3/28/2012 Mark Voorhies Heuristic Alignment and Searching Types of alignments Global Alignment Each letter of each sequence is aligned to a letter or a gap ( e.g. , Needleman-Wunsch). Local


slide-1
SLIDE 1

Heuristic Alignment and Searching

Mark Voorhies 3/28/2012

Mark Voorhies Heuristic Alignment and Searching

slide-2
SLIDE 2

Types of alignments

Global Alignment Each letter of each sequence is aligned to a letter or a gap (e.g., Needleman-Wunsch). Local Alignment An optimal pair of subsequences is taken from the two sequences and globally aligned (e.g., Smith-Waterman).

Mark Voorhies Heuristic Alignment and Searching

slide-3
SLIDE 3

Types of alignments

Global Alignment Each letter of each sequence is aligned to a letter or a gap (e.g., Needleman-Wunsch). Local Alignment An optimal pair of subsequences is taken from the two sequences and globally aligned (e.g., Smith-Waterman). This tends to be more biologically relevant.

Mark Voorhies Heuristic Alignment and Searching

slide-4
SLIDE 4

Smith-Waterman

The implementation of local alignment is the same as for global alignment, with a few changes to the rules: Initialize edges to 0 (no penalty for starting in the middle of a sequence) The maximum score is never less than 0, and no pointer is recorded unless the score is greater than 0 (note that this implies negative scores for gaps and bad matches) The trace-back starts from the highest score in the matrix and ends at a score of 0 (local, rather than global, alignment) Because the naive implementation is essentially the same, the time and space requirements are also the same.

Mark Voorhies Heuristic Alignment and Searching

slide-5
SLIDE 5

Timing CLUSTALW

Timing CLUSTALW from the command line:

f o r i i n 50 100 150 200 250 300 350 400 450; do head −n $ i −q G217B iron . f a s t a Pb01 iron . f a s t a > temp . f a s t a ; time c l u s t a l w −i n f i l e =temp . f a s t a −type= DNA −a l i g n ; done

The output looks like this:

Sequences (1:2) Aligned. Score: Guide tree file created: [temp.dnd] There are 1 groups Start of Multiple Alignment Aligning... Group 1: Delayed Alignment Score 7238 CLUSTAL-Alignment file created [temp.aln] real 0m3.400s user 0m3.388s sys 0m0.012s Mark Voorhies Heuristic Alignment and Searching

slide-6
SLIDE 6

Timing CLUSTALW

You can copy the timing results into Excel. Here is a Python script that does the same thing:

#!/ usr / bin / env python # Time−stamp : <ParseTimes . py 2011−03−29 21:10:59 Mark Voorhies> ””” Parse w a l l times from a log f i l e

  • n

s t d i n and w r i t e them as a CSV formatted column f o r Excel / OpenOffice / etc

  • n

stdout . I f command l i n e arguments are given , t r e a t them as a second

  • utput

column . ””” from csv import w r i t e r import re t i m e r e = re . compile ( ”ˆ r e a l .∗(?P <minutes >[\d]+)m(?P <seconds >[\d ]+\.[\ d]+) s ” , re .M) i f ( name == ” m a i n ” ) : import s y s args = s y s . argv [ 1 : ]

  • ut = w r i t e r ( s y s . stdout )

i = 0 f o r t i n t i m e r e . f i n d i t e r ( s y s . s t d i n . read ( ) ) : t r y : y = args [ i ] i += 1 except I n d e x E r r o r : y = ””

  • ut . writerow (

( f l o a t ( t . group ( ” minutes ”))∗60+ f l o a t ( t . group ( ” seconds ” ) ) , y ) ) del

  • ut

Mark Voorhies Heuristic Alignment and Searching

slide-7
SLIDE 7

Timing CLUSTALW

You can fit the timing results to a curve in Excel. y = AxB (1) log y = log AxB (2) = log A + B log x (3) = A′ + B log x (4) Here is an R script that does the same thing:

data < − read . csv ( ” t i m i n g s . csv ” , header = FALSE , c o l . names = c ( ” t ” , ”n” )) x < − log ( data $n∗80) y < − log ( data $ t / 60) f < − lm ( y ˜ x ) x0 < − 0:40000 a < − exp ( f $ c o e f f [ 1 ] ) b < − f $ c o e f f [ 2 ] pdf ( ” ClustalwTimings . pdf ” ) p l o t ( data $n∗80 , data$ t / 60 , x la b = ” l e n g t h /bp” , y la b = ” time / minutes ” , main = ”CLUSTALW t i m i n g s

  • n

I n t e l Core2 T7300@2 .00GHz , 32 b i t ” ) p o i n t s ( x0 , a∗x0ˆb , c o l = ” blue ” , type = ” l ” ) legend ( ” t o p l e f t ” , c ( ”y = ( 1 . 8 e−9)x ˆ ( 2 . 0 8 ) ” ) , c o l = ” blue ” , l t y = 1) dev . o f f ( ) Mark Voorhies Heuristic Alignment and Searching

slide-8
SLIDE 8

CLUSTALW takes O(MN) time

  • 5000

10000 15000 20000 25000 30000 35000 1 2 3 4 5

CLUSTALW timings on Intel Core2 T7300@2.00GHz, 32bit

length/bp time/minutes y = (1.8e−9)x^(2.08)

Mark Voorhies Heuristic Alignment and Searching

slide-9
SLIDE 9

O(MN) time is too slow!

Dec 1982 Dec 1990 Dec 1994 Dec 1998 Dec 2002 Dec 2006 Dec 2010 0.0e+00 2.0e+10 4.0e+10 6.0e+10 8.0e+10 1.0e+11 1.2e+11

source: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

Genbank Release Size/bp

Mark Voorhies Heuristic Alignment and Searching

slide-10
SLIDE 10

Basic Local Alignment Search Tool

Why BLAST? Fast, heuristic approximation to a full Smith-Waterman local alignment Developed with a statistical framework to calculate expected number of false positive hits. Heuristics biased towards “biologically relevant” hits.

Mark Voorhies Heuristic Alignment and Searching

slide-11
SLIDE 11

Seeding searches

Most of the magic in a sequence-search tool lives in its indexing scheme

Program Purpose Indexing BLAST Database searching Target indexing, 3aa or 11nt words BLAT mRNA mapping Query indexing BOWTIE RnaSeq Specialized index for low quality, short reads e-PCR Simulated PCR Annealing-oriented index

Mark Voorhies Heuristic Alignment and Searching

slide-12
SLIDE 12

BLAST: A quick overview

Mark Voorhies Heuristic Alignment and Searching

slide-13
SLIDE 13

BLAST: Seed from exact word hits

Mark Voorhies Heuristic Alignment and Searching

slide-14
SLIDE 14

BLAST: Myers and Miller local alignment around seed pairs

Mark Voorhies Heuristic Alignment and Searching

slide-15
SLIDE 15

BLAST: High Scoring Pairs (HSPs)

Mark Voorhies Heuristic Alignment and Searching

slide-16
SLIDE 16

Gapped BLAST: Merge neighboring HSPs

Mark Voorhies Heuristic Alignment and Searching

slide-17
SLIDE 17

How fast is BLAST?

Mark Voorhies Heuristic Alignment and Searching

slide-18
SLIDE 18

How fast is BLAST?

Mark Voorhies Heuristic Alignment and Searching

slide-19
SLIDE 19

How fast is BLAST?

time bl2seq -p blastn -i G217B_iron.fasta -j Pb01_iron.fasta -e 1e-6 > temp.blastn real 0m0.342s user 0m0.080s sys 0m0.032s Mark Voorhies Heuristic Alignment and Searching

slide-20
SLIDE 20

The basic flavors of BLAST

Target Protein DNA Query Protein BLASTP TBLASTN DNA BLASTX BLASTN TBLASTX

Mark Voorhies Heuristic Alignment and Searching

slide-21
SLIDE 21

BLASTX: Nucleotide query vs. Protein Database

Mark Voorhies Heuristic Alignment and Searching

slide-22
SLIDE 22

BLASTX: Nucleotide query vs. Protein Database

Mark Voorhies Heuristic Alignment and Searching

slide-23
SLIDE 23

Sometimes it’s still worth running locally...

Mark Voorhies Heuristic Alignment and Searching

slide-24
SLIDE 24

Karlin-Altschul Statistics

E = kmne−λS (5) S: HSP score E: Expected number of “random” hits in a database of this size scoring at least S. m: Query length n: Database size k: Correction for similar, overlapping hits λ: normalization factor for scoring matrix

Mark Voorhies Heuristic Alignment and Searching

slide-25
SLIDE 25

Karlin-Altschul Statistics

E = kmne−λS (5) S: HSP score E: Expected number of “random” hits in a database of this size scoring at least S. m: Query length n: Database size k: Correction for similar, overlapping hits λ: normalization factor for scoring matrix A variant of this formula is used to generate sum probabilities for combined HSPs.

Mark Voorhies Heuristic Alignment and Searching

slide-26
SLIDE 26

Karlin-Altschul Statistics

E = kmne−λS (5) S: HSP score E: Expected number of “random” hits in a database of this size scoring at least S. m: Query length n: Database size k: Correction for similar, overlapping hits λ: normalization factor for scoring matrix A variant of this formula is used to generate sum probabilities for combined HSPs. p = 1 − e−E (6)

Mark Voorhies Heuristic Alignment and Searching

slide-27
SLIDE 27

Karlin-Altschul Statistics

E = kmne−λS (5) S: HSP score E: Expected number of “random” hits in a database of this size scoring at least S. m: Query length n: Database size k: Correction for similar, overlapping hits λ: normalization factor for scoring matrix A variant of this formula is used to generate sum probabilities for combined HSPs. p = 1 − e−E (6) (If you care about the difference between E and p, you’re already in trouble)

Mark Voorhies Heuristic Alignment and Searching

slide-28
SLIDE 28

Karlin-Altschul Statistics

Important points: Extreme value distribution Assumption of infinite sequence length No rigorous framework for gap statistics (hmmer3 tries to fill this gap)

Mark Voorhies Heuristic Alignment and Searching

slide-29
SLIDE 29

Summary

BLAST is very fast, at the expense of not guaranteeing globally optimal results

Mark Voorhies Heuristic Alignment and Searching

slide-30
SLIDE 30

Summary

BLAST is very fast, at the expense of not guaranteeing globally optimal results But the trade-offs that it makes are biased towards “biologically relevant” results

Mark Voorhies Heuristic Alignment and Searching

slide-31
SLIDE 31

Summary

BLAST is very fast, at the expense of not guaranteeing globally optimal results But the trade-offs that it makes are biased towards “biologically relevant” results And it provides a statistical framework for evaluating its results.

Mark Voorhies Heuristic Alignment and Searching

slide-32
SLIDE 32

Summary

BLAST is very fast, at the expense of not guaranteeing globally optimal results But the trade-offs that it makes are biased towards “biologically relevant” results And it provides a statistical framework for evaluating its results. We can, and should, treat our computer work as we would an experiment:

Document protocols and observations Run positive and negative controls Keep results organized and dated

Mark Voorhies Heuristic Alignment and Searching

slide-33
SLIDE 33

Homework

Search your favorite proteins and collate interesting hits in one FASTA file per query – play with adding informative names and annotations (we will use these FASTA files tomorrow). Play with the BLAST book protocols (chapter 9) on the NCBI website Play with positive and negative controls (including permuted sequences)

Mark Voorhies Heuristic Alignment and Searching