BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol - - PowerPoint PPT Presentation

blast basic local alignment search tool altschul et al j
SMART_READER_LITE
LIVE PREVIEW

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol - - PowerPoint PPT Presentation

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. Hashing A hash function maps a key to a value Hash table Hash table is a data structure: a way to store key-value pairs , and a way to retrieve them Based


slide-1
SLIDE 1

BLAST:
 Basic Local Alignment Search Tool
 Altschul et al. J. Mol Bio. 1990.

slide-2
SLIDE 2

Hashing

A hash function maps a key to a value

slide-3
SLIDE 3
  • Hash table is a data structure: a way to store

key-value pairs, and a way to retrieve them

  • Based on the idea of a hash function. This maps

a key or an object (e.g., a string, or a more complex record) to an integer, the “address”

  • The value of the key is then stored at that

address in memory

Hash table

slide-4
SLIDE 4
  • Key: (AAACGTAT, 1234321)
  • i.e., a 8 bp-string and its location in genome
  • We want to store many such strings and their

locations

  • and later retrieve all locations of a particular

string really quickly

  • Hash function h(AAACGTAT) = 435

Hashing: an example

Key=String Value = Address of where Location(String) is stored

slide-5
SLIDE 5
  • Let’s assume that there are 48 = 64K memory

locations available.

  • The first time we see (AAACGTAT, *), we store it at

address h(AAACGTAT) = 435.

  • The next time we see (AAACGTAT, *), we compute

h(AAACGTAT), go to 435, find it already occupied. A collision!

Hashing: an example

slide-6
SLIDE 6
  • Buckets: Address 435 can store multiple keys/
  • bjects (e.g., as a linked list)
  • Linear probing: If an address is occupied, store the

key/object in next available location

  • Multiple hashing: have an army of hash functions.

If the first one (“h”) led to a collision, try another hash function (“h2”)

How to handle collisions

slide-7
SLIDE 7

Bucketing and Chaining

slide-8
SLIDE 8

Open addressing and linear probing

slide-9
SLIDE 9

Preprocessing and hash

Preprocessing: store exact matches of all short patterns on the text by a hash table

ATC AAA TTC

{1,6,100,2000,5454, …, } {15,21,30,785,3434, …, } {5,164,220,502,943, …, } h h

address1

h

address2 address3

retrieve retrieve retrieve

slide-10
SLIDE 10
  • Given two sequences of same length, the similarity

score of their alignment (without gaps) is the sum of similarity values for each pair of aligned residues

  • Maximal segment pair (MSP): Highest scoring pair of

identical length segments from the two sequences being compared (“query” and “subject”)

  • The similarity score of an MSP is called the MSP score
  • BLAST heuristically aims to find them

BLAST: finding maximal segment pairs

slide-11
SLIDE 11

Maximal segment pairs and High scoring pairs

  • Goal: report database sequences that have MSP score above some

threshold S.

  • Thus, sequences with at least one locally maximal segment pair

that scores above S.

slide-12
SLIDE 12
  • A molecular biologist may be interested in all

conserved regions shared by two proteins, not just their highest scoring pair

  • A segment pair (segments of identical lengths) is

locally maximal if its score cannot be improved by extending or shortening in either direction

  • BLAST attempts to find all locally maximal

segment pairs above some score cutoff.

High scoring pairs (or local maximal segment pairs)

slide-13
SLIDE 13

A quick way to find MSPs

Seed Extend Extend

  • Homologous sequences tend to have very

similar or even identical substrings, also called seeds.

  • From a seed, it is possible to construct a local

HSP/MSP by extending to flanking regions.

slide-14
SLIDE 14

Efficient algorithm?

slide-15
SLIDE 15
  • 1. Break query sequence into words
slide-16
SLIDE 16
  • Find exact matches to query words
  • Can be done in efficiently
  • Hashing
  • Alternatively AC finite state machine

ATC AAA TTC

{1,6,100,2000,5454, …, } {15,21,30,785,3434, …, } {5,164,220,502,943, …, } h h

address1

h

address2 address3

retrieve retrieve retrieve

  • 2. Find database hits
slide-17
SLIDE 17
  • 2. Find database hits
slide-18
SLIDE 18
  • 3. Extend hits
slide-19
SLIDE 19

How to handle possible mismatches in words?

Neighbor words

slide-20
SLIDE 20

How to handle possible mismatches in words?

slide-21
SLIDE 21

How to handle possible mismatches in words?

slide-22
SLIDE 22

How to handle possible mismatches in words?

slide-23
SLIDE 23

Parameters

  • Word length: 3 for protein, 11 for DNA/RNA
  • Thresholds T and S:
  • BLAST minimizes time spent on database

sequences whose similarity with the query has little chance of exceeding this cutoff S.

  • Main strategy: seek only segment pairs (one from

database, one query) that contain a word pair with score >= T

  • Intuition: If the sequence pair has to score above

S, its most well matched word (of some predetermined small length) must score above T

  • Lower T => Fewer false negatives
  • Lower T => More pairs to analyze
slide-24
SLIDE 24
  • BLAST may not find all segment pairs above threshold S
  • Bounds on the error: not hard bounds, but statistical bounds
  • “Highly likely” to find the MSP

Choosing threshold S

slide-25
SLIDE 25
  • Is the score high enough to provide evidence of

homology?

  • Are the scores of alignments of random sequences

higher than this score?

  • What are is the expected number of alignments

between random sequences with score greater than this score?

Choosing threshold S

  • BLAST may not find all segment pairs above threshold S
  • Bounds on the error: not hard bounds, but statistical bounds
  • “Highly likely” to find the MSP
slide-26
SLIDE 26
  • BLAST may not find all segment pairs above threshold S
  • Bounds on the error: not hard bounds, but statistical bounds
  • “Highly likely” to find the MSP

Choosing threshold S

  • Suppose the MSP has been calculated by BLAST (and

suppose this is the true MSP)

  • Suppose this observed MSP with a score S.
  • What are the chances that the MSP score for two

unrelated sequences would be >= S?

  • If the chances are very low, then we can be confident that

the two sequences must not have been unrelated

slide-27
SLIDE 27

Statistics: Question

  • Given two random sequences of lengths m and n
  • What is the probability that they will produce an MSP score
  • f >= S ?
slide-28
SLIDE 28

Statistics: intuition

Given a binary 0/1 sequence and a query string of k consecutive ones

  • Probability in a sequence of length k: 1/2k
  • Probability in a sequence of length k+1?
  • 1 - (1 - 1/2k)2
  • How about the probability in a sequence of length k+n?
  • 1 - (1 - 1/2k)n+1
  • The longer the sequence, the more likely you are going to

get k ones by chance!

slide-29
SLIDE 29

The probability will depend on:

  • How long is are the sequences (the longer the

easier to get a local score above threshold by chance)

  • Scoring matrix
  • Distribution of amino acids in each sequence

Statistics: more intuition

slide-30
SLIDE 30

Statistics: Intuition

slide-31
SLIDE 31

Approach

slide-32
SLIDE 32

How to compute the probability?

slide-33
SLIDE 33

Simulation

  • 1. Generate many random sequence pairs
  • 2. Compute the distribution of the SCOREs

score frequency

SCORE

slide-34
SLIDE 34

How to compute the p-value (probability)?

slide-35
SLIDE 35

Statistical test

Simulation p-value = 0.001 p-value = 0.45

slide-36
SLIDE 36

Is this efficient enough?

slide-37
SLIDE 37

Another observation

slide-38
SLIDE 38

Extreme value distribution

slide-39
SLIDE 39

Extreme value distribution

z

z

slide-40
SLIDE 40

Compute a p-value

slide-41
SLIDE 41

Parameters

z z

slide-42
SLIDE 42

Statistical test

EVD p-value = 0.001 p-value = 0.45

slide-43
SLIDE 43

Significance: P-value and E-value

slide-44
SLIDE 44

Parameters

z z

slide-45
SLIDE 45

E-value

Approximation: if x is very small, then 1-exp(-x) can be approximated by x Therefore, P(Z>=x) So E-value = DatabaseLength * p-value where N is the database size (not the aligned length n)