BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol - - PowerPoint PPT Presentation



slide-1
SLIDE 1

BLAST:
 Basic Local Alignment Search Tool
 Altschul et al. J. Mol. Biol. 1990.

slide-2
SLIDE 2

One of the most highly cited papers in history

slide-3
SLIDE 3

Basic Local Alignment Search Tool

slide-4
SLIDE 4

Why is database search difficult?

slide-5
SLIDE 5

Consider a simpler problem

Goal: Find all occurrences of a pattern in a text

Input: Pattern p = p1…pn and text t = t1…tm

Output: All positions 1 ≤ i ≤ m – n + 1 such that the n-letter substring of t starting at i matches p

Motivation: Searching a database for a known pattern

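The problem statement above can be sketched as a direct scan over every candidate position. This is a minimal illustration, not part of the original slides; the function name `find_occurrences` is hypothetical, and positions are reported 1-based to match the slide's notation.

```python
def find_occurrences(p: str, t: str) -> list[int]:
    """Return all 1-based positions i where the len(p)-letter
    substring of t starting at i equals p."""
    n, m = len(p), len(t)
    positions = []
    for start in range(m - n + 1):       # 0-based start index
        if t[start:start + n] == p:      # compare one window of t with p
            positions.append(start + 1)  # convert to 1-based position
    return positions


print(find_occurrences("ATC", "GATCATCA"))  # [2, 5]
```

Each query costs on the order of m·n character comparisons, which is what makes repeated searches against a genome-sized text expensive.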

slide-6
SLIDE 6

Further simplified version

Text = Human genome (3 billion base pairs)

Pattern = a 3-letter word, p = ATC or AAA or TTC or …

Output: All positions 1 ≤ i ≤ m – n + 1 such that the n-letter substring of t starting at i matches p

slide-7
SLIDE 7

Key idea: preprocessing

Preprocessing: store exact matches of all short patterns on the text

ATC → {1, 6, 100, 2000, 5454, …}

AAA → {15, 21, 30, 785, 3434, …}

TTC → {5, 164, 220, 502, 943, …}

…
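The preprocessing step can be sketched as a single pass over the text that records where every short word starts. This is an illustrative sketch only; the function name `build_index` and the toy text are assumptions, and positions are 1-based as in the slides.

```python
from collections import defaultdict


def build_index(t: str, k: int) -> dict[str, list[int]]:
    """Map every k-letter word of t to the sorted list of
    1-based positions where it occurs."""
    index = defaultdict(list)
    for start in range(len(t) - k + 1):
        index[t[start:start + k]].append(start + 1)
    return dict(index)


index = build_index("GATCATCAAA", 3)
print(index["ATC"])          # [2, 5]
print(index["AAA"])          # [8]
print(index.get("TTC", []))  # []  (word never occurs in this toy text)
```

After the one-time pass, answering a query for any 3-letter word is a single lookup instead of a scan of the whole text.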

slide-8
SLIDE 8

Key idea: preprocessing


what if n is big?

slide-9
SLIDE 9

Hashing

A hash function maps a key to a value

slide-10
SLIDE 10
Hash table

  • A hash table is a data structure: a way to store key-value pairs, and a way to retrieve them
  • It is based on the idea of a hash function, which maps a key or an object (e.g., a string, or a more complex record) to an integer, the “address”
  • The value of the key is then stored at that address in memory

slide-11
SLIDE 11
Hashing: an example

  • Key: (AAACGTAT, 1234321)
  • i.e., an 8 bp string and its location in the genome
  • We want to store many such strings and their locations
  • and later retrieve all locations of a particular string really quickly
  • Hash function h(AAACGTAT) = 435

Key = String; Value = Address of where Location(String) is stored
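The slides do not say how h is defined. One possible choice, shown here purely as an assumption, is to read the string as a base-4 number with A=0, C=1, G=2, T=3; this encoding happens to reproduce the example value h(AAACGTAT) = 435. The name `CODE` is hypothetical.

```python
# Assumed 2-bit encoding of the four bases (not stated in the slides).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def h(kmer: str) -> int:
    """Interpret a DNA string as a base-4 number."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value


print(h("AAACGTAT"))  # 435
print(4 ** 8)         # 65536 possible addresses for 8-letter strings
```

Under this encoding every distinct 8-letter string gets its own integer below 4^8, so an address can only be hit again by another occurrence of the same string.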

slide-12
SLIDE 12
Hashing: an example

  • Let’s assume that there are 4^8 = 64K memory locations available.
  • The first time we see (AAACGTAT, *), we store it at address h(AAACGTAT) = 435.
  • The next time we see (AAACGTAT, *), we compute h(AAACGTAT), go to 435, and find it already occupied. A collision!

slide-13
SLIDE 13
How to handle collisions

  • Buckets: Address 435 can store multiple keys/objects (e.g., as a linked list)
  • Linear probing: If an address is occupied, store the key/object in the next available location
  • Multiple hashing: have an army of hash functions. If the first one (“h”) led to a collision, try another hash function (“h2”)

slide-14
SLIDE 14

Bucketing and Chaining
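The bucket strategy can be sketched as a table in which each address holds a list of entries. This is a minimal illustration under stated assumptions: `toy_hash` and `ChainedTable` are hypothetical names, the hash function is an arbitrary stand-in, and a Python list plays the role of the linked list.

```python
def toy_hash(key: str, size: int) -> int:
    """Arbitrary illustrative hash: polynomial over character codes."""
    value = 0
    for ch in key:
        value = (value * 31 + ord(ch)) % size
    return value


class ChainedTable:
    def __init__(self, size: int = 8):
        self.size = size
        self.buckets = [[] for _ in range(size)]  # one chain per address

    def add(self, key: str, location: int) -> None:
        # A collision just lengthens the chain at that address.
        self.buckets[toy_hash(key, self.size)].append((key, location))

    def locations(self, key: str) -> list[int]:
        # Scan only the one chain the key hashes to.
        bucket = self.buckets[toy_hash(key, self.size)]
        return [loc for k, loc in bucket if k == key]


table = ChainedTable()
table.add("AAACGTAT", 1234321)
table.add("AAACGTAT", 77)
table.add("CCCCGGGG", 5)
print(table.locations("AAACGTAT"))  # [1234321, 77]
```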

slide-15
SLIDE 15

Open addressing and linear probing
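Linear probing can be sketched as a flat array in which an entry that finds its address occupied moves on to the next free slot. This is a simplified illustration: `toy_hash` and `ProbingTable` are hypothetical names, the hash function is an arbitrary stand-in, and deletion is not supported.

```python
def toy_hash(key: str, size: int) -> int:
    """Arbitrary illustrative hash: polynomial over character codes."""
    value = 0
    for ch in key:
        value = (value * 31 + ord(ch)) % size
    return value


class ProbingTable:
    def __init__(self, size: int = 8):
        self.size = size
        self.slots = [None] * size  # each slot holds one (key, location)

    def add(self, key: str, location: int) -> int:
        """Store the entry and return the address actually used."""
        start = toy_hash(key, self.size)
        for step in range(self.size):
            addr = (start + step) % self.size  # wrap around at the end
            if self.slots[addr] is None:
                self.slots[addr] = (key, location)
                return addr
        raise RuntimeError("table is full")

    def locations(self, key: str) -> list[int]:
        start = toy_hash(key, self.size)
        found = []
        for step in range(self.size):
            entry = self.slots[(start + step) % self.size]
            if entry is None:   # an empty slot ends the probe sequence
                break
            if entry[0] == key:
                found.append(entry[1])
        return found


table = ProbingTable()
first = table.add("AAACGTAT", 1234321)
second = table.add("AAACGTAT", 77)       # collides, goes to the next slot
print(second == (first + 1) % table.size)  # True
print(table.locations("AAACGTAT"))         # [1234321, 77]
```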

slide-16
SLIDE 16

Preprocessing and hash

Preprocessing: store exact matches of all short patterns on the text using a hash table

ATC → h → address1 → retrieve {1, 6, 100, 2000, 5454, …}

AAA → h → address2 → retrieve {15, 21, 30, 785, 3434, …}

TTC → h → address3 → retrieve {5, 164, 220, 502, 943, …}

slide-17
SLIDE 17
BLAST: finding maximal segment pairs

  • Given two sequences of the same length, the similarity score of their alignment (without gaps) is the sum of similarity values for each pair of aligned residues
  • Maximal segment pair (MSP): Highest scoring pair of identical-length segments from the two sequences being compared (“query” and “subject”)
  • The similarity score of an MSP is called the MSP score
  • BLAST heuristically aims to find them
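The definitions above can be made concrete with an exhaustive (non-heuristic) search. This sketch is not how BLAST works; it only shows what an MSP is. The scoring values (+5 for a match, -4 for a mismatch) and all function names are assumptions chosen for illustration.

```python
def pair_score(a: str, b: str, match: int = 5, mismatch: int = -4) -> int:
    """Assumed similarity value for one pair of aligned residues."""
    return match if a == b else mismatch


def segment_score(x: str, y: str) -> int:
    """Ungapped alignment score of two equal-length segments."""
    assert len(x) == len(y)
    return sum(pair_score(a, b) for a, b in zip(x, y))


def maximal_segment_pair(query: str, subject: str):
    """Return (score, query_start, subject_start, length), 0-based.
    Scans every diagonal and keeps the best-scoring run on it."""
    best = (0, 0, 0, 0)
    for offset in range(-(len(query) - 1), len(subject)):  # offset = j - i
        i = max(0, -offset)
        j = i + offset
        running, start_i = 0, i
        while i < len(query) and j < len(subject):
            if running <= 0:              # a non-positive prefix never helps
                running, start_i = 0, i
            running += pair_score(query[i], subject[j])
            if running > best[0]:
                best = (running, start_i, start_i + offset, i - start_i + 1)
            i += 1
            j += 1
    return best


print(segment_score("ACGT", "ACGA"))                 # 11
print(maximal_segment_pair("ACGTAC", "TTACGTTT"))    # (20, 0, 2, 4)
```

The exhaustive scan touches every pair of positions, which is why a heuristic is needed for database-scale search.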

slide-18
SLIDE 18

Maximal segment pairs and High scoring pairs

  • Goal: report database sequences that have MSP score above some threshold S.
  • Thus, sequences with at least one locally maximal segment pair that scores above S.

slide-19
SLIDE 19

A quick way to find MSPs

Extend ← Seed → Extend

  • Homologous sequences tend to have very similar or even identical substrings, also called seeds.
  • From a seed, it is possible to construct a local HSP/MSP by extending to flanking regions.
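The seed-and-extend idea can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it uses exact k-letter seeds only, an assumed +5/-4 scoring, and an assumed stopping rule that abandons a direction once the running score falls more than `drop` below the best seen. All names and parameter values are hypothetical.

```python
def pair_score(a: str, b: str, match: int = 5, mismatch: int = -4) -> int:
    """Assumed similarity value for one pair of aligned residues."""
    return match if a == b else mismatch


def find_seeds(query: str, subject: str, k: int) -> list[tuple[int, int]]:
    """Return 0-based (query_pos, subject_pos) of identical k-letter words."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def extend(query: str, subject: str, i: int, j: int, k: int, drop: int = 10):
    """Extend the seed at (i, j) without gaps in both directions.
    Returns (score, query_start, subject_start, length)."""
    best = sum(pair_score(query[i + d], subject[j + d]) for d in range(k))
    run, right, d = best, 0, k
    while i + d < len(query) and j + d < len(subject):   # extend right
        run += pair_score(query[i + d], subject[j + d])
        if run > best:
            best, right = run, d - k + 1
        elif best - run > drop:
            break
        d += 1
    run, left, d = best, 0, 1
    while i - d >= 0 and j - d >= 0:                     # extend left
        run += pair_score(query[i - d], subject[j - d])
        if run > best:
            best, left = run, d
        elif best - run > drop:
            break
        d += 1
    return best, i - left, j - left, k + left + right


query, subject = "ACGTAC", "TTACGTTT"
seeds = find_seeds(query, subject, 3)
print(seeds)                                             # [(0, 2), (1, 3), (3, 1)]
print(max(extend(query, subject, i, j, 3) for i, j in seeds))  # (20, 0, 2, 4)
```

Only positions near a seed are ever scored, so most of the database is skipped after the index lookup.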

slide-20
SLIDE 20

Efficient algorithm?