SLIDE 2 25‐Mar‐15 2
k‐mer searches
- k‐mers are “words” consisting of k nucleotides or amino acids
– For k = 5, the amino acid sequence KAWSADV consists of the k‐mers: KAWSA, AWSAD, and WSADV
- Identical words are easy to identify for a computer
– An index of the sequences can be stored in rapidly accessible memory (RAM)
- However, this is not suitable for identifying sequences at large evolutionary distance (many mutations)
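The k‐mer decomposition and the in‐memory index described above can be sketched in a few lines of Python. This is a minimal illustration, not the data structure of any particular tool; the function names `kmers` and `build_index` and the two toy database sequences are assumptions made for the example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(sequences, k):
    """Map each k-mer to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for offset, word in enumerate(kmers(seq, k)):
            index[word].append((name, offset))
    return index


# Hypothetical toy database; the first entry is the example sequence from the slide.
database = {"seq1": "KAWSADV", "seq2": "MKAWSAD"}
index = build_index(database, 5)

print(kmers("KAWSADV", 5))   # ['KAWSA', 'AWSAD', 'WSADV']
print(index["KAWSA"])        # [('seq1', 0), ('seq2', 1)]
```

Looking up an identical word is a single dictionary access, which is why exact matches are cheap. A word that differs by even one residue is simply absent from the index, which is the limitation noted for distant sequences.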
Degeneracy of the genetic code
- Mutations in the 3rd nucleotide of a codon often translate
into the same amino acid (synonymous mutations)
– Discontiguous Megablast searches with spaced words containing two out of every three nucleotides, allowing variations at the third nucleotide of the codon
– Nucleotide sequences are the least conserved
– Protein sequences are more conserved
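The idea of a spaced word can be illustrated as follows. This is a simplified sketch: the template `110110110110` (keep codon positions 1 and 2, ignore position 3) and the two short coding sequences are assumptions chosen for the example, and the templates actually used by Discontiguous Megablast are not reproduced here.

```python
def spaced_word(seq, start, template):
    """Return the letters of seq at the template positions marked '1'."""
    return "".join(seq[start + i] for i, keep in enumerate(template) if keep == "1")


# Hypothetical template: two out of every three nucleotides are compared.
TEMPLATE = "110" * 4

# Two coding sequences that differ only at third codon positions.
# Both translate to Ala-Gly-Lys-Leu (synonymous mutations).
a = "GCTGGAAAACTT"
b = "GCCGGGAAGCTC"

print(a == b)                                                      # False
print(spaced_word(a, 0, TEMPLATE), spaced_word(b, 0, TEMPLATE))    # GCGGAACT GCGGAACT
```

A contiguous 12‐mer search would miss this pair, while the spaced words are identical, so the match is still found by a fast exact lookup.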
Sensitivity versus speed
- (Almost) exact hits are easy to identify using fast k‐mer searches
– Not suitable for distant homologs
– For example: which genes are expressed in a human cancer?
- Here, the sequences can be matched to the human reference genome
- Highly diverged sequences (distant homologs) require careful, optimal alignment algorithms
– This is slow: many algorithmic steps need to be performed by the central processing unit (CPU) of the computer
– For example: which unknown microbes are associated with coral disease?
- Here, the sequences have to be compared with known microbial genomes (distantly related)
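To make the cost of optimal alignment concrete, here is a minimal sketch of Smith‐Waterman local alignment scoring. The scoring values (match = 2, mismatch = -1, linear gap = -2) are assumptions for illustration; real protein searches use a substitution matrix and usually affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("KAWSADV", "KAWSADV"))   # 14
print(smith_waterman_score("KAWSADV", "KAWTADV"))   # 11
```

Every cell of the table is computed, so the work grows with the product of the two sequence lengths. Repeating that for every database sequence is what makes the exhaustive approach slow compared with a k‐mer lookup.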
Basic Local Alignment Search Tool (BLAST)
- Heuristic search algorithm
– Takes shortcuts that are likely (but not guaranteed) to find the optimal hits
- BLAST finds good potential homologs at reasonable speed
– 10‐50x faster than Smith‐Waterman
– More than 100,000 queries per day on the NCBI BLAST server
- Terminology:
– Query: sequence we search the database with
– Hit or Subject: similar sequence found in the database
- BLAST is the most used bioinformatics program
– The BLAST article has been cited >54,000 times
BLAST input and output
BLAST input (query sequences):
>protein_sequence_A
MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
>protein_sequence_B
MLTETDYSEI QRRLGRDPNR AELGMFGVMN RAELGMFGY
>protein_sequence_C
MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA RPLLRPIKSP FGAWVVIV
[Figure: BLAST output (hits)]
The BLAST search algorithm
- Identifies potentially high‐scoring words (k‐mers) in the query
– W = 3 for protein, W = 11 for DNA
– Based on substitution scores
- Quickly finds similar words in the database
– All words in the database are indexed and stored in RAM, linked to similar “neighborhood words”
- Extends seeds in both directions to find HSPs between query and hit
– HSP: region that can be aligned with a score above a certain threshold
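The seed‐and‐extend idea can be sketched as below. This is a simplified illustration under stated assumptions, not the BLAST implementation: seeds are exact word matches only (no neighborhood words), scoring is identity based (match = 1, mismatch = -1) rather than a substitution matrix, extension is ungapped, and the stopping rule `x_drop = 2` is a hypothetical value. The names `find_seeds` and `extend` and the two short sequences are invented for the example.

```python
def find_seeds(query, subject, w):
    """Yield (query offset, subject offset) pairs sharing an identical word of length w."""
    positions = {}
    for j in range(len(subject) - w + 1):
        positions.setdefault(subject[j:j + w], []).append(j)
    for i in range(len(query) - w + 1):
        for j in positions.get(query[i:i + w], []):
            yield i, j


def extend(query, subject, i, j, w, match=1, mismatch=-1, x_drop=2):
    """Extend a seed without gaps in both directions.

    Extension on each side stops at a sequence end, or once the running
    score falls x_drop below the best score seen on that side.
    Returns (score, query start, query end) with an exclusive end.
    """
    def one_side(qi, sj, step):
        score = best = best_len = n = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score >= x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right, right_len = one_side(i + w, j + w, 1)
    left, left_len = one_side(i - 1, j - 1, -1)
    return w * match + left + right, i - left_len, i + w + right_len


query = "MTQSSHAVAAFDLG"
subject = "MHAVAAFDLGAALR"
seeds = list(find_seeds(query, subject, 3))
score, start, end = extend(query, subject, 5, 1, 3)   # seed "HAV"
print(score, query[start:end])                         # 9 HAVAAFDLG
```

A segment like the one returned here would count as an HSP if its score exceeds the chosen threshold. Only regions around seeds are ever examined, which is the shortcut that makes the search fast and also the reason an optimal hit without a seed can be missed.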