Approximate matching. Ben Langmead, Department of Computer Science

You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or email me (ben.langmead@gmail.com) and tell me briefly how you’re using them. For original Keynote files, email me.

Read alignment requires approximate matching

Read: CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
Reference: [slide figure: a long excerpt of reference sequence]

Sequence differences occur because of...
1. Sequencing error
2. Natural variation

Approximate string matching

Looking for places where P matches T with up to a certain number of mismatches or edits. Each such place is an approximate match.

A mismatch is a single-character substitution:

    T: GGAAAAAGAGGTAGCGGCGTTTAACAGTAG
                 ||| |||||
    P:           GTAACGGCG

An edit is a single-character substitution or gap (insertion or deletion):

    T: GGAAAAAGAGGTAGCGGCGTTTAACAGTAG
                 ||| |||||
    P:           GTAACGGCG

Gap in T:

    T: GGAAAAAGAGGTAGC-GCGTTTAACAGTAG
                 ||||| |||
    P:           GTAGCGGCG

Gap in P:

    T: GGAAAAAGAGGTAGCGGCGTTTAACAGTAG
                 || ||||||
    P:           GT-GCGGCG

Hamming and edit distance

For two same-length strings X and Y, Hamming distance is the minimum number of single-character substitutions needed to turn one into the other:

    X: GAGGTAGCGGCGTTTAAC
       | |||| ||| |||||||
    Y: GTGGTAACGGGGTTTAAC

    Hamming distance = 3

Edit distance (Levenshtein distance): minimum number of edits required to turn one into the other:

    X: TGGCCGCGCAAAAACAGC
    Y: TGACCGCGCAAAACAGC

    X: TGGCCGCGCAAAAACAGC
       || |||||||||| ||||
    Y: TGACCGCGCAAAA-CAGC

    Edit distance = 2

    X: GCGTATGCGGCTAACGC
    Y: GCTATGCGGCTATACGC

    X: GCGTATGCGGCTA-ACGC
       || |||||||||| ||||
    Y: GC-TATGCGGCTATACGC

    Edit distance = 2
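The two distances can be computed directly. The following is a minimal sketch, not code from the slides: the function names `hamming_distance` and `edit_distance` are illustrative, and the edit-distance routine uses the standard row-by-row dynamic programming formulation, which the slides do not describe here.

```python
def hamming_distance(x, y):
    """Number of positions at which equal-length strings x and y differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(1 for a, b in zip(x, y) if a != b)


def edit_distance(x, y):
    """Levenshtein distance, filled in one dynamic-programming row at a time."""
    # prev[j] = edit distance between the empty prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        cur = [i]  # turning x[:i] into the empty string takes i deletions
        for j, b in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,              # delete a from x
                           cur[j - 1] + 1,           # insert b into x
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]


# The examples from the slide
print(hamming_distance("GAGGTAGCGGCGTTTAAC", "GTGGTAACGGGGTTTAAC"))  # 3
print(edit_distance("TGGCCGCGCAAAAACAGC", "TGACCGCGCAAAACAGC"))      # 2
print(edit_distance("GCGTATGCGGCTAACGC", "GCTATGCGGCTATACGC"))       # 2
```

The third pair has equal lengths but needs one gap in each string, so its Hamming distance is much larger than its edit distance.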

Approximate string matching

Adapting the naive algorithm to do approximate string matching within configurable Hamming distance:

    def naiveApproximate(p, t, maxHammingDistance=1):
        occurrences = []
        for i in range(0, len(t) - len(p) + 1):  # for all alignments
            nmm = 0
            for j in range(0, len(p)):           # for all characters
                if t[i+j] != p[j]:               # does it match?
                    nmm += 1                     # mismatch
                    if nmm > maxHammingDistance:
                        break                    # exceeded maximum distance
            if nmm <= maxHammingDistance:
                # approximate match; return pair where first element is the
                # offset of the match and second is the Hamming distance
                occurrences.append((i, nmm))
        return occurrences

Instead of stopping upon first mismatch, stop when maximum distance is exceeded.

Python example: http://bit.ly/CG_NaiveApprox

Approximate string matching

How can we make Boyer-Moore and index-assisted exact matching approximate?

Helpful fact: Split P into non-empty, non-overlapping substrings u and v. If P occurs in T with 1 edit, either u or v must match exactly.

P = u v: either the edit goes in u, or it goes in v. It can’t go anywhere else!

More generally: Let p_1, p_2, ..., p_{k+1} be a partitioning of P into k+1 non-overlapping, non-empty substrings. If P occurs in T with up to k edits, then at least one of p_1, p_2, ..., p_{k+1} must match exactly.

P = p_1 p_2 p_3 p_4 ... p_{k+1}: ≤ k edits can affect as many as k of these, but not all.
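The partitioning argument can be checked on a small case. This is a sketch under stated assumptions: `partition` and `intact_pieces` are hypothetical helper names, the pieces are cut to near-equal lengths (the slide allows any non-empty partition), and the check covers substitutions only, so the window compared against P has the same length as P.

```python
def partition(p, k):
    """Split p into k+1 non-empty, non-overlapping pieces of near-equal length.

    Returns a list of (offset_in_p, piece) pairs.
    """
    n = k + 1
    if len(p) < n:
        raise ValueError("pattern too short to split into k+1 non-empty pieces")
    base, extra = divmod(len(p), n)
    pieces, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)  # first `extra` pieces get one more char
        pieces.append((start, p[start:end]))
        start = end
    return pieces


def intact_pieces(p, window, k):
    """Pieces of p that match an equal-length window of t exactly, in place."""
    return [piece for off, piece in partition(p, k)
            if window[off:off + len(piece)] == piece]


# P from the mismatch example, and the stretch of T it aligns to (one substitution)
print(partition("GTAACGGCG", 1))                   # [(0, 'GTAAC'), (5, 'GGCG')]
print(intact_pieces("GTAACGGCG", "GTAGCGGCG", 1))  # ['GGCG']
```

The single substitution lands in the first piece, so the second piece still matches exactly, as the helpful fact predicts.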

Approximate string matching

These rules provide a bridge between the exact-matching methods we’ve studied so far and approximate string matching.

P = p_1 p_2 p_3 p_4 ... p_{k+1}: ≤ k edits can overlap as many as k of these, but not all.

Use an exact matching algorithm to find exact matches for p_1, p_2, ..., p_{k+1}. Then look for a longer approximate match in the vicinity of each exact match.

[Slide figure: P split into p_1 ... p_5; one piece has an exact match in T, and the regions of T on either side of it are checked.]
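The search-then-check procedure might look like the sketch below. It makes several assumptions that are not in the slides: the function name `approximate_match` is illustrative, Python's built-in `str.find` stands in for Boyer-Moore or an index as the exact-matching step, and the check in the vicinity of each hit counts mismatches only (Hamming distance), so gaps are not handled.

```python
def approximate_match(p, t, k):
    """Offsets in t where p occurs with at most k mismatches (substitutions only)."""
    n = k + 1
    if len(p) < n:
        raise ValueError("pattern too short to split into k+1 non-empty pieces")
    # Boundaries of k+1 non-empty, non-overlapping pieces of p
    bounds = [len(p) * i // n for i in range(n + 1)]
    hits = set()
    for start, end in zip(bounds, bounds[1:]):
        piece = p[start:end]
        pos = t.find(piece)            # exact-matching step (stand-in for Boyer-Moore or an index)
        while pos != -1:
            offset = pos - start       # where all of p would begin in t
            if 0 <= offset <= len(t) - len(p):
                # Check the full-length window around the exact hit
                window = t[offset:offset + len(p)]
                mismatches = sum(1 for a, b in zip(p, window) if a != b)
                if mismatches <= k:
                    hits.add(offset)   # a set, since several pieces can report the same offset
            pos = t.find(piece, pos + 1)
    return sorted(hits)


t = "GGAAAAAGAGGTAGCGGCGTTTAACAGTAG"
print(approximate_match("GTAACGGCG", t, 1))  # [10]
print(approximate_match("GTAACGGCG", t, 0))  # []
```

Because at least one piece must match exactly wherever P occurs with up to k mismatches, every such occurrence is reached through some exact hit, and only the neighborhoods of those hits need the slower character-by-character check.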
