
Least Random Suffix/Prefix Matches in Output-Sensitive Time
Niko Välimäki
Department of Computer Science
University of Helsinki
nvalimak@cs.helsinki.fi
23rd Annual Symposium on Combinatorial Pattern Matching

Suffix/Prefix Matching Problem
Input: A set of r strings of total length n.
Output: Longest non-zero length suffix/prefix match for each string-pair.
A suffix/prefix match (overlap):
    VÄLIMÄKI
        ||||
        MÄKINEN
Motivation: Approximating the shortest common superstring.
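A minimal brute-force sketch of the exact problem for one ordered string-pair, included only to make the definition concrete. The function name `longest_overlap` is hypothetical, and this quadratic check is not one of the algorithms discussed in the talk.

```python
def longest_overlap(a: str, b: str) -> int:
    """Length of the longest non-empty suffix of `a` that is a prefix of `b`.

    Returns 0 when no non-empty overlap exists. Brute force: tries every
    candidate length from the longest possible down to 1.
    """
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


if __name__ == "__main__":
    # The overlap from the slide: the suffix "MÄKI" is a prefix of "MÄKINEN".
    print(longest_overlap("VÄLIMÄKI", "MÄKINEN"))
```

Running this for all r² ordered pairs costs far more than the O(n + output) bounds quoted in the talk; it serves only as a reference definition.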

Longest Exact Overlaps
Optimal-time by [Gusfield & Landau & Schieber, 1992]
• $O(n + \text{output})$ time, $O(n)$ words,
• where $\text{output} \le r^2$.
Space-efficient variant by [Ohlebusch & Gog, 2010]
• $O(n + \text{output})$ time, $8n$ bytes.
Finding irreducible overlaps [Simpson & Durbin, 2010]
• $O(n + \text{output})$ time, $2nH_k + o(n) + r \log r$ bits.

Approximate Overlaps
Output the “best overlap” (of length ≥ t) s.t.
k-errors: suffix/prefix edit distance is ≤ k,
ε-errors: suffix/prefix edit distance is ≤ ⌈εℓ⌉, where ℓ is the length of the suffix.
Overlaps for k = 1:
    VÄLIMÄKI       VÄLIMÄKI       VÄLIMÄKI-
        ||||          |||||           |||||
        MÄKINEN       -MÄKINEN        MÄKINEN
How to define the best overlap when indels are allowed?
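The k-errors condition can be checked for a single string-pair with a textbook dynamic program in which the alignment may start at any position of the first string. The sketch below is a plain quadratic-time illustration, not the method proposed in the talk. The names `suffix_prefix_edit_distances` and `overlaps_within_k` are hypothetical, and applying the length threshold t to the prefix length is an assumption.

```python
def suffix_prefix_edit_distances(a: str, b: str) -> list[int]:
    """d[j] = minimum edit distance between some suffix of `a` and b[:j]."""
    # Row 0: the empty suffix of `a` against b[:j] costs j insertions.
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        # Column 0 is always 0: the suffix may start at any position (free start).
        cur = [0]
        for j in range(1, len(b) + 1):
            cur.append(min(
                prev[j - 1] + (a[i - 1] != b[j - 1]),  # match or substitution
                prev[j] + 1,                           # symbol of `a` unmatched
                cur[j - 1] + 1,                        # symbol of `b` unmatched
            ))
        prev = cur
    # The last row forces the alignment to end at the end of `a`.
    return prev


def overlaps_within_k(a: str, b: str, k: int, t: int) -> list[int]:
    """Prefix lengths j >= t of `b` that some suffix of `a` matches with <= k errors."""
    d = suffix_prefix_edit_distances(a, b)
    return [j for j in range(t, len(b) + 1) if d[j] <= k]


if __name__ == "__main__":
    print(overlaps_within_k("VÄLIMÄKI", "MÄKINEN", k=1, t=3))
```

With k = 1 several prefix lengths qualify for the same pair, which is exactly why a criterion for the "best" overlap is needed.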

Least Random Overlaps
Let A[1..a] and B[1..b] denote two random strings from a Bernoulli source.
[Kececioglu & Myers, 1995] precomputed a table $\Pr_\sigma(l, d)$,
• i.e. the probability that A and B align with d indels and $l = (a + b - d)/2$ matching symbols.
• The best overlap minimizes $\Pr_\sigma(l, d)$.
• $O(\epsilon n^2)$ time, where ε > 0 denotes the error rate.
[Landau & Myers & Schmidt, 1998] generalized the likelihood:
• k-errors in $O(k|T_j|)$ time for a string-pair $T_i$ and $T_j$.
• Over all string-pairs in $O(knr)$ time.
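The expression for l follows from counting symbols, reading l as the number of aligned (non-indel) columns: each such column consumes one symbol from A and one from B, while each indel consumes a single symbol. The numbers in the worked instance are an illustrative assumption: the suffix MÄKI (a = 4) aligned against the prefix MÄKIN (b = 5) with one indel.

```latex
a + b = 2l + d
\quad\Longrightarrow\quad
l = \frac{a + b - d}{2}

% Illustrative instance: a = 4, b = 5, d = 1
l = \frac{4 + 5 - 1}{2} = 4
```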

In Practice: Sequence Assembly
Biological sequences have sequencing errors, SNPs...
Heuristic methods for overlap-layout-consensus assembly:
• ARACHNE [Batzoglou et al. 2002],
• Atlas [Havlak et al. 2004],
• Celera [Myers et al. 2000],
• Phrap [Green, 1994],
• UMD Overlapper [Roberts et al. 2004].
Filter-based methods with $\Omega(n^2)$ worst-case:
• q-gram filters [Rasmussen & Stoye & Myers, 2006]
• suffix filters [Välimäki & Ladra & Mäkinen, 2010 & 2012]

Outline of Our Contributions
Method for short strings
• Adapt [Gusfield & Landau & Schieber, 1992] for least random overlaps.
Method for long strings
• Utilizes approximate dictionary matching [Cole et al. 2004],
• Query time: $O\left(m + \frac{(c_2 \log r)^k}{k!} \log\log r + \text{output}\right)$ (terms annotated "Prepr." and "Time per suffix").
Mixed length strings
• $O((n + \text{output})\,\mathrm{polylog}(n))$ time, $O(n)$ space (for constant k)

Short Strings: Preprocessing Step
Assume strings of length ≤ β.
1. Build a generalized suffix tree for $T_1, T_2, \ldots, T_r$.
Green leaf nodes: r leaves, one for each j, each spelling out the whole $T_j$.

Short Strings: Search Step
2. Approximate search for each $T_i$.
(Figure labels: add($T_i$); ignore depth < t.)
Search in backward manner to cover all suffixes of $T_i$.
Blue nodes: $O(|T_i|^{k+1} \sigma^k)$ nodes whose upward path is within k-errors of one or more suffixes of $T_i$.
Searching all strings yields $O(n \beta^k \sigma^k)$ marks.

Short Strings: Search Step
All strings marked here (at depth ℓ on the path to the leaf of $T_j$) match $T_j[1..\ell]$.

Short Strings: Traversal Step
3. Depth-first traversal. Use r stacks to collect marks [Gusfield & Landau & Schieber, 1992].
Blue nodes: Push list items to corresponding stacks.
Green leaves: Output top-most stack-values.
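The stack-based traversal can be sketched for the exact case (k = 0), where the marks are simply the nodes at which a suffix of some string ends. This is a simplified illustration under stated assumptions: an uncompressed trie stands in for the generalized suffix tree, so space is quadratic rather than linear. The reporting loop also scans all r stacks at every whole-string node, so it is not output-sensitive. The function name `all_pairs_longest_overlaps` is hypothetical.

```python
def all_pairs_longest_overlaps(strings: list[str]) -> dict[tuple[int, int], int]:
    """Longest exact suffix/prefix overlap for every ordered pair (i, j), i != j.

    Pairs with no non-empty overlap are absent from the result.
    """
    def new_node():
        # marks: strings with a suffix ending here; whole: strings spelled out here.
        return {"children": {}, "marks": [], "whole": []}

    # Naive stand-in for the generalized suffix tree: insert every suffix.
    root = new_node()
    for i, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                if ch not in node["children"]:
                    node["children"][ch] = new_node()
                node = node["children"][ch]
            node["marks"].append(i)
            if start == 0:
                node["whole"].append(i)

    r = len(strings)
    stacks = [[] for _ in range(r)]  # one stack of depths per string
    best = {}

    def dfs(node, depth):
        for i in node["marks"]:      # push marks on the way down
            stacks[i].append(depth)
        for j in node["whole"]:      # node spells out the whole T_j
            for i in range(r):
                if i != j and stacks[i]:
                    best[(i, j)] = stacks[i][-1]  # deepest mark = longest overlap
        for child in node["children"].values():
            dfs(child, depth + 1)
        for i in node["marks"]:      # pop marks on the way back up
            stacks[i].pop()

    dfs(root, 0)
    return best


if __name__ == "__main__":
    print(all_pairs_longest_overlaps(["VÄLIMÄKI", "MÄKINEN"]))
```

The top of stack i at the node spelling out T_j is the deepest ancestor marked by T_i, that is, the longest suffix of T_i that is also a prefix of T_j.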

Short Strings: Linear Space
Linear space for marks (in blue nodes):
Step 2: Search $\lceil n / (\beta^{k+1} \sigma^k) \rceil$ strings at a time.
Step 3: Need to repeat the traversal over disjoint sets of marks.
$O(n)$ words, time complexity is retained.
$nH_k(T) + \Theta(n)$ bits, time increases by a $\log n$ factor.

Summary “Open problem: longest approximate overlaps”

Summary
Earlier methods:
• $\Omega(r^2)$ time regardless of the output size.
• $O(knr)$ time [Landau & Myers & Schmidt, 1998]
We propose:
• First output-sensitive algorithms for least random overlaps:
O ( n log k n + output ) β ≤ log n log n k < σ k log log n √ k β ≥ ε log k r O ( c k log r k ! nr ) k < log log r Any β . O (( n + output ) polylog ( n )) k = O ( 1 )
Thank you!
