Heuristic searches Genomics Compare DNA sequences to discover - - PDF document

heuristic searches
SMART_READER_LITE
LIVE PREVIEW

Heuristic searches Genomics Compare DNA sequences to discover - - PDF document

25 Mar 15 Omics Heuristic searches Genomics Compare DNA sequences to discover similarities/differences between genomes Transcriptomics Compare RNA sequences to discover similarities/differences in Compare RNA sequences to


slide-1
SLIDE 1

25‐Mar‐15 1

Heuristic searches

Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 26th 2015

Omics

  • Genomics

– Compare DNA sequences to discover similarities/differences between genomes

  • Transcriptomics

Compare RNA sequences to discover similarities/differences in – Compare RNA sequences to discover similarities/differences in which genes are expressed

  • Proteomics

– Compare protein sequences to discover similarities/differences in protein content

Transcriptomics by RNA sequencing

  • Used to get an overview of all RNA sequences in a sample

– Extract total RNA from a sample – Random pieces are sequenced – The sequencing reads are aligned to the genome to see which genes are transcribed

Astronomical Biological numbers

  • Stars in the universe:

~70,000,000,000,000,000,000,000

  • Bacteria on earth:

~5,000,000,000,000,000,000,000,000,000,000

  • Viruses on earth:

~50,000,000,000,000,000,000,000,000,000,000

Metagenomics

Database

Known Unknown

Need for fast similarity search algorithms

  • Find potential homologs for these sequences fast
  • Find potential homologs for these sequences

– Make all‐against‐all Smith‐Waterman alignments?

  • Sometimes billions of DNA sequences from thousands of

different bacteria or viruses

  • Trade‐off between sensitivity and speed

– Sensitivity: ability of an algorithm to detect distant homologs in a database – Speed: time the program needs to search a database

, fast

slide-2
SLIDE 2

25‐Mar‐15 2

k‐mer searches

  • k‐mers are “words” consisting of k nucleotides or amino acids

– For k = 5, the amino acid sequence KAWSADV consists of the k‐mers: KAWSA, AWSAD, and WSADV

  • Identical words are easy to identify for a computer

– An index of the sequences can be stored in rapidly accessible memory (RAM)

  • However, this is not suitable to identify sequences at large

, y q g evolutionary distance (many mutations)

Degeneracy of the genetic code

  • Mutations in the 3rd nucleotide of a codon often translate

into the same amino acid (synonymous mutations)

– Discontiguous Megablast searches with spaced words containing two out of every three nucleotides, allowing variations at the third nucleotide of the codon

  • Rule of thumb:

– Nucleotide sequences are the least conserved – Protein sequences are more conserved

Sensitivity versus speed

  • (Almost) exact hits are easy to identify using fast k‐mer searches

– Not suitable for distant homologs – For example: which genes are expressed in a human cancer?

  • Here, the sequences can be matched to the human reference genome
  • Highly diverged sequences (distant homologs) require careful,
  • ptimal alignment algorithms

– This is slow: many algorithmic steps need to be performed by the central processing unit (CPU) of the computer – For example: which unknown microbes are associated to coral disease?

  • Here, the sequences have to be compared with known microbial genomes (distantly related)

Basic Local Alignment Search Tool (BLAST)

  • Heuristic search algorithm

– Makes shortcuts that are likely (but not guaranteed) to find the

  • ptimal hits
  • BLAST finds good potential homologs at reasonable speed

– 10‐50x faster than Smith‐Waterman – More than 100,000 queries per day

  • n the NCBI BLAST server
  • n the NCBI BLAST server
  • Terminology:

– Query: sequence we search the database with – Hit or Subject: similar sequence found in the database

  • BLAST is the most used bioinformatics program

– The BLAST article has been cited >54,000 times

BLAST input and output

>p >pro rote tein in_s _seq eque uence_A MTQSSHAVAA FD FDLGA LGAAL ALRQ RQE GL E GLTET TETDY DYSE SEI QR I QRDP DPNR NRAEL AELG TF TFGV GV >p >pro rote tein in_s _seq eque uence_B MLTETDYSEI QR QRRLG RLGRD RDPN PNR AE R AELGM LGMFG FGVM VMN RA N RAEL ELGM GMFGY FGY >p >pro rote tein in_s _seq eque uence_C MHAVAAFDLG AA AALRQ LRQEG EGLT LTE TD E TDYSE YSEIQ IQRR RRL GR L GRAM AMFG FGVMW VMWS EH EHCC CCYR YRNDD NDDA RP RPLL LLRP RPIK IKSP SP F FGAWVVIV

BLAST input (query sequences) BLAST output (hits)

  • Identifies potentially high‐scoring words (k‐mers) in the query

– W = 3 for protein, W = 11 for DNA – Based on substitution scores

  • Quickly finds similar words in the database

– All words in the database are indexed and stored in RAM, linked to similar “neighborhood words”

The BLAST search algorithm

  • Extends seeds in both directions to find HSPs between query and hit

– HSP: region that can be aligned with a score above a certain threshold

slide-3
SLIDE 3

25‐Mar‐15 3

BLAST flavors

  • Nucleotide‐nucleotide searches

– Nucleotide database, nucleotide query – blastn (default: W = 11 nucleotides)

  • Find homologous genes in different species

– Megablast (default: W = 28 nucleotides)

  • Designed to efficiently find longer alignments between very similar nucleotide sequences
  • Best tool to find highly identical hits for a query sequence
  • For example: find sequences from the same species

– Discontiguous Megablast Discontiguous Megablast

  • Uses discontiguous words (e.g. W = 11 nucleotides: AT-GT-AC-CG-CG-T)
  • For example, this can focus the search on codons (the third nucleotide of codons is less

conserved due to the degeneracy of the genec code → next slide)

  • Best tool to find nucleotide‐nucleotide hits at larger evolutionary distances for protein‐

coding query sequences

  • Protein‐protein searches

– Protein database, protein query sequences – blastp (default: W = 3 amino acids)

  • Find homologous proteins in different species

BLAST flavors: translated searches

  • We can exploit the higher

conservation of protein sequences when aligning DNA sequences, by using translated searches

  • This allows for more sensitive searches that detect

homology at greater evolutionary distances

– For example: homologous genes in distantly related species – For example: homologous genes in distantly related species

  • blastx and tblastx first translate the nucleotide query into

protein before identifying high‐scoring words

  • tblastn and tblastx use a database of translated nucleotide

sequences stored as proteins

The alignment bit‐score

  • For a given query, we are mostly interested in finding good

hits (highly similar, likely true homologs)

  • We could estimate this based on a score derived only from

the alignment like the bit‐score or percent identity

– … but the chance of finding a hit with a high score by random chance increases if you use a larger database – … so we have to correct for that … so we have to correct for that

Expect value (E‐value)

  • E‐value: how many times would you expect a hit this good,

by random chance

– Of course, this depends on the alignment score (S), the length of the query sequence (m), and the size of the database (n): – K: constant for search space scaling λ f b i i i i S

Kmne E

 

– λ: constant for substitution matrix correction

Low E‐values are often given as exponents

  • In the search below, we expect 10‐149 hits with a score of

≥436 by random chance

– Given the database size and query sequence length we expect

0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001

hits by chance (this is not much, so this is a good hit)

Low E‐values are often given as exponents

  • In the search below, we expect 9.3 hits with a score of

≥38.9 by random chance

– This is a lot, so this is a bad hit

slide-4
SLIDE 4

25‐Mar‐15 4

Distribution of the number of hits found for by random chance in a Large DB

E‐value differs for different databases

  • Because the E‐value depends on the size of the DB, the E‐

value for a given query/hit pair is different when the search is done (i.e. the hit is found) in a different DB

More hits expected by random chance in Large DB

5=

Fewer hits expected by random chance in Small DB

Distribution of the number of hits found by random chance in a Small DB

What is a good E‐value?

  • This is very difficult to say!
  • But as a rule of thumb:

– E‐value <10‐6 for nucleotide blast (blastn, megablast) are good – E‐value <10‐3 for protein blast (blastp, blastx) are good

  • If you want to be very sure that your query and hit

sequences are homologs, you should only trust extremely low E‐values

  • … but sometimes you really have no other information

about a protein, except a distant homolog with a very bad (= high, like 0.1) E‐value

Weighing the conservation of a position in a search

  • If you want to sensitively detect

distant homologs in a search, it helps

...atG AA AT TT C ac... ...ccG AA GT TT C tg... ...agG AA AA TT C aa... tG AA AT TT C

to give conserved positions more weight than non‐conserved positions

  • A sequence profile contains this

information

– But how to get a sequence profile when you start with only one query sequence?

...gtG AA AT TT C cg... ...caG AA AT TT C tc... ...tgG AA AT TT C gt...

A 2 1 0 6 6 5 1 0 0 0 2 1 C 2 1 0 0 0 0 0 0 0 6 1 2 G 1 2 6 0 0 1 0 0 0 0 1 2 T 1 2 0 0 0 0 5 6 6 0 2 1

BLAST flavors: profile searches

  • Position‐Specific Iterated BLAST (PSI‐BLAST) algorithm

1. Perform initial BLAST search with a query sequence 2. Use the database hits to construct a sequence profile

  • (This is also called a position specific scoring matrix, or PSSM)

3. Search database with this sequence profile instead of with the

  • riginal query sequence

4. Iterate steps 2‐3

Iterate

Low‐complexity regions

  • Low‐complexity regions occur in DNA and protein sequences

– In DNA they can be cause and effect of recombination errors – In proteins they may be functional – In some cases they are used to allow for rapid evolution

  • For example in the shematrin protein in pearl oyster shells (see Figure)
  • They make sequence alignments and database searches difficult!

– Before performing a similarity search, low‐complexity regions are often masked in the query so that they cannot contribute to the alignment score – This prevents identifying spurious hits

Low‐complexity regions in the output

  • Lower case letters in the output show low‐complexity

sequence fragments that were ignored in the search

slide-5
SLIDE 5

25‐Mar‐15 5

Which database to use?

  • General databases

– nr: the Genbank non‐redundant protein database

  • Non‐redundant refers to the identifiers, not to the sequences

– nt: the Genbank non‐redundant nucleotide database

  • Non‐redundant refers to the identifiers, not to the sequences

– RefSeq: NCBI’s Reference Sequences

  • Curated database, contains only non‐redundant sequences
  • Includes genomic DNA transcripts (RNA) and proteins

Includes genomic DNA, transcripts (RNA), and proteins

  • Specific databases

– SILVA: ribosomal RNA sequences

  • SSU rRNA is a taxonomic marker gene
  • Universal in all cellular organisms (… so not in viruses!)

– KEGG: Kyoto Encyclopedia of Genes and Genomes

  • Database of enzymes in metabolic pathways