Algoritmi per la Bioinformatica
Zsuzsanna Lipták
Laurea Magistrale Bioinformatica e Biotecnologie Mediche (LM9), academic year 2014/15, spring term

Database search with BLAST

Database search
• Until now: compare two sequences
  • how similar/different are they? (score/value)
  • where are the similarities/differences? (alignment)
• Now: compare one sequence to a database (i.e. to many sequences)

Database search
Goal: Identify sequences in the DB which have high local similarity with the query.
• We know how to do this: the Smith-Waterman DP algorithm.
• But: too slow!
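To make the cost of the exact approach concrete, here is a minimal sketch of a Smith-Waterman score computation applied to every sequence of a database. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function names are illustrative assumptions, not taken from the slides.

```python
def smith_waterman_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of s and t (toy scores, linear gap penalty)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Local alignment: a cell never drops below 0 (start a new alignment).
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search_database(query, database):
    """Score the query against every DB sequence: one full DP table per sequence."""
    scores = [(smith_waterman_score(query, seq), name)
              for name, seq in database.items()]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    db = {"seq_a": "GGACGTGG", "seq_b": "CCCCCC"}  # hypothetical toy database
    print(search_database("TTACGTTT", db))
```

Each call fills a table of size |query| · |sequence|, so scanning r sequences of length n costs on the order of r · n² steps.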

Say all sequences have length n (the query t and all DB sequences), and there are r sequences in the DB.
• exact solution (Smith-Waterman): $O(r \cdot n^2)$

Example
• UniProt/SwissProt (protein database): 548 454 sequences, 195 409 447 aa's (avg. length 350 aa's), version 29/04/15
• NCBI Genbank (nucleotide database): 182 188 746 sequences, 189 739 230 107 nucleotides (avg. length 1041 nucl.), April 2015, no WGS

For UniProt we would get something like 350 · 350 · 548 454 = 67 185 615 000, i.e. about 67 billion ($67 \cdot 10^9$) steps, which takes more than 18 hours on a computer that performs 1 million operations per second. For Genbank we get 1041 · 1041 · 182 188 746 = 197 434 482 454 026 ($\approx 1.97 \cdot 10^{14}$) steps, which takes about 6 years. On a computer performing 1 billion operations per second, this is still about 1 minute for UniProt and more than 2 days for Genbank.

And this is for one query only!
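The estimate can be reproduced in a few lines. This is a sketch under the slide's own simplifying assumption that the query and every database sequence have the average length; the function names are illustrative.

```python
def sw_steps(num_sequences, avg_length):
    """Approximate DP cell count r * n^2, with query length = average DB length."""
    return num_sequences * avg_length * avg_length


def seconds_needed(steps, ops_per_second):
    """Running time if every DP cell costs one operation."""
    return steps / ops_per_second


uniprot_steps = sw_steps(548_454, 350)
genbank_steps = sw_steps(182_188_746, 1041)

if __name__ == "__main__":
    for name, steps in [("UniProt/SwissProt", uniprot_steps),
                        ("Genbank", genbank_steps)]:
        for speed in (10**6, 10**9):
            hours = seconds_needed(steps, speed) / 3600
            print(f"{name}: {steps:.3e} steps at {speed:.0e} ops/s -> {hours:.2f} hours")
```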

BLAST: Basic Local Alignment Search Tool
• Altschul et al. 1990, 1997
• looks for sequences in a database with high local similarity to the query
• heuristic algorithm
• solid mathematical foundations (Karlin-Altschul statistics)
• extremely successful, now the standard database search tool ("to blast a sequence against a database")
• NCBI Blast at: http://blast.ncbi.nlm.nih.gov/Blast.cgi (NCBI = National Center for Biotechnology Information)

Basic idea
If there is a good local alignment between two sequences, then this local alignment is likely to contain a pair of short substrings (one from each sequence) with high score when aligned without gaps.

Basic steps of BLAST
1. create a list of words that score highly against words of the query
2. scan the DB for these words (called seeds)
3. extend the seeds in both directions to form good local alignments (these are called MSPs = maximal segment pairs)

BLAST then assigns a significance score to the MSPs and retains only those above a certain threshold.
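The three steps can be sketched on a toy scale as follows. This is a simplified illustration, not the actual BLAST implementation: it assumes a DNA alphabet, match/mismatch scores of +1/-1 instead of a substitution matrix, a hypothetical word length w = 3, and a simple drop-off rule (x_drop) to stop the ungapped extension. All function names are made up for this example.

```python
from itertools import product

ALPHABET = "ACGT"
MATCH, MISMATCH = 1, -1  # toy scores; protein BLAST uses a substitution matrix


def word_score(a, b):
    """Ungapped score of two equal-length words."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def build_word_list(query, w=3, threshold=3):
    """Step 1: words scoring >= threshold against a query word, with query positions."""
    words = {}
    for cand in map("".join, product(ALPHABET, repeat=w)):
        for i in range(len(query) - w + 1):
            if word_score(cand, query[i:i + w]) >= threshold:
                words.setdefault(cand, []).append(i)
    return words


def find_seeds(words, subject, w=3):
    """Step 2: scan a DB sequence for listed words; return (query_pos, subject_pos)."""
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in words.get(subject[j:j + w], [])]


def extend_seed(query, subject, i, j, w=3, x_drop=2):
    """Step 3: ungapped extension to the right, then to the left.

    Extension stops once the score has fallen more than x_drop below the best.
    Returns (best_score, query_interval, subject_interval), intervals half-open.
    """
    score = best = word_score(query[i:i + w], subject[j:j + w])
    right = k = 0
    while i + w + k < len(query) and j + w + k < len(subject) and best - score <= x_drop:
        score += MATCH if query[i + w + k] == subject[j + w + k] else MISMATCH
        k += 1
        if score > best:
            best, right = score, k
    score = best
    left = k = 0
    while i - 1 - k >= 0 and j - 1 - k >= 0 and best - score <= x_drop:
        score += MATCH if query[i - 1 - k] == subject[j - 1 - k] else MISMATCH
        k += 1
        if score > best:
            best, left = score, k
    return best, (i - left, i + w + right), (j - left, j + w + right)


def blast_like_search(query, subject, w=3, threshold=3, x_drop=2):
    """Run steps 1-3 and return the best extended segment pair (or None)."""
    words = build_word_list(query, w, threshold)
    hits = {extend_seed(query, subject, i, j, w, x_drop)
            for i, j in find_seeds(words, subject, w)}
    return max(hits, default=None)


if __name__ == "__main__":
    print(blast_like_search("TTACGTACGG", "CCACGTACCA"))
```

With threshold equal to w, only exact query words are listed. Lowering the threshold admits similar words as well, which is the point of scoring words rather than requiring identity. The significance computation mentioned on the slide is not part of this sketch.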

BLAST2
Some innovations of BLAST2 (Altschul et al. 1997)
• start with two seeds instead of one, not too far apart
• gapped alignments
• extension of statistical theory to HSPs (high-scoring segment pairs)
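The two-seed requirement can be sketched as a filter over word hits. This sketch assumes the usual reading of the two-hit idea: an extension is triggered only by two non-overlapping hits on the same diagonal (same difference between DB position and query position) within a fixed window. The function name, the word length and the window size of 40 are illustrative assumptions.

```python
from collections import defaultdict


def two_hit_seeds(hits, w=3, window=40):
    """Return pairs of non-overlapping hits on the same diagonal within `window`.

    `hits` is a list of (query_pos, subject_pos) word hits of length w.
    """
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    pairs = []
    for diag, positions in by_diagonal.items():
        positions.sort()
        last = None  # most recent hit kept on this diagonal
        for q_pos in positions:
            if last is None:
                last = q_pos
                continue
            dist = q_pos - last
            if dist < w:
                continue  # overlaps the previous hit: ignore it
            if dist <= window:
                pairs.append(((last, last + diag), (q_pos, q_pos + diag)))
            last = q_pos
    return pairs


if __name__ == "__main__":
    toy_hits = [(0, 10), (5, 15), (1, 11), (20, 3), (100, 110)]
    print(two_hit_seeds(toy_hits))
```

Requiring two nearby hits lets a search use a more permissive word list while still triggering far fewer extensions than a single-hit rule would.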

The NCBI BLAST website
• Different versions of BLAST, depending on the task (nucl-nucl: blastn, megablast, ...; prot-prot: blastp, psi-blast; nucl-prot: blastx; prot-nucl: tblastn; ...)
• Different databases (nucl vs. prot, different organisms, different types of db, different levels of assembly, ...)
• Very good explanations and help pages!
• If you haven't done so yet, you should try it and play around! E.g. download a sequence from Genbank or Swissprot, modify it and blast it!
