Full Compressed Affix Tree Representations
L.I.R.M.M., Université de Montpellier, Institut Biologie Computationnelle
Introduction Basic Concepts A Classification Asynchronous Approaches Synchronous Approaches Results Conclusions & Future Work
Motivation
Bidirectional Search Example: Hairpins
Suffix Tree
Suffix Tree Operations
Suffix Arrays and Suffix Tree
Burrows and Wheeler Transform (BWT)
BWT: backward search
backwardSearch(c, [i, j]):
    i′ ← C[c] + Occ(c, i − 1) + 1
    j′ ← C[c] + Occ(c, j)
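The update rule above can be tried out with a small, unoptimized sketch. It assumes the usual conventions, which the slide does not spell out: the text ends with a sentinel `$` smaller than every other character, intervals are 1-based and inclusive, `C[c]` counts the characters of the text strictly smaller than `c`, and `Occ(c, i)` counts occurrences of `c` in the first `i` characters of the BWT. The helper names (`build_index`, `occ`, `count`) are illustrative only, and the naive sorting and counting stand in for the compressed structures a real index would use.

```python
from collections import Counter


def build_index(text):
    """Naively build the BWT and the C array of a text that ends with '$'."""
    assert text.endswith("$")
    sa = sorted(range(len(text)), key=lambda k: text[k:])  # suffix array
    bwt = "".join(text[k - 1] for k in sa)  # text[-1] is '$' when k == 0
    C, total = {}, 0
    for ch, cnt in sorted(Counter(text).items()):
        C[ch] = total  # number of characters strictly smaller than ch
        total += cnt
    return bwt, C


def occ(bwt, c, i):
    """Occ(c, i): occurrences of c in BWT[1..i] (1-based, inclusive)."""
    return bwt[:i].count(c)


def backward_search(bwt, C, c, interval):
    """One backward-search step; returns the new interval or None if empty."""
    if c not in C:
        return None
    i, j = interval
    i2 = C[c] + occ(bwt, c, i - 1) + 1
    j2 = C[c] + occ(bwt, c, j)
    return (i2, j2) if i2 <= j2 else None


def count(text, pattern):
    """Count occurrences of pattern by processing it right to left."""
    bwt, C = build_index(text)
    interval = (1, len(bwt))  # interval of the empty pattern
    for c in reversed(pattern):
        interval = backward_search(bwt, C, c, interval)
        if interval is None:
            return 0
    i, j = interval
    return j - i + 1
```

For example, `count("mississippi$", "ssi")` walks the pattern as `i`, `s`, `s` and ends with an interval of size 2.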
Affix Tree
◮ Combines the Suffix Tree of T with the Suffix Tree of T^r (the reverse of T)
◮ Introduced by Stoye (2000) and Maaß (2003)
◮ Problem: the proposed structures are complex and use about 45n bytes
Asynchronous vs Synchronous
◮ Forward Structure (FOS) and the Backward Structure (BAS)
Affix Array (AfA)
◮ Proposed by Strothmann (2007)
◮ Suffix Trees are stored using Suffix Arrays together with extra data
◮ Connections between the trees are also stored (Affix links)
◮ Does not support all tree operations
◮ Total: around 18–22n bytes
Compressed Affix Tree (ACAT)
◮ Compressed Suffix Trees data structure
◮ Supports all tree operations
◮ Connections between the trees are also stored (Affix links)
Affix Link
ALink(v) = Child(ALink(SLink(v)), c)
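A toy sketch may help in reading this recurrence. It rests on two assumptions that the slide does not state: `c` is taken to be the first character of the string spelled by `v`, and the trees are replaced by uncompacted suffix tries, so that every substring has an explicit node and `Child` never lands in the middle of an edge. The `Node` class and the helper names are hypothetical, and suffix links are found by walking from the root rather than stored.

```python
class Node:
    """Trie node; label is the string spelled from the root to this node."""

    def __init__(self, label):
        self.label = label
        self.children = {}


def build_suffix_trie(text):
    """Uncompacted suffix trie: one explicit node per distinct substring."""
    root = Node("")
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node.children:
                node.children[ch] = Node(node.label + ch)
            node = node.children[ch]
    return root


def locate(root, s):
    """Walk down from the root along s (s must be a substring of the text)."""
    node = root
    for ch in s:
        node = node.children[ch]
    return node


def slink(root, v):
    """Suffix link: the node spelling v's string without its first character."""
    return locate(root, v.label[1:])


def alink(fwd_root, rev_root, v):
    """ALink(v) = Child(ALink(SLink(v)), c), with c the first character of v."""
    if v.label == "":
        return rev_root  # the two roots are linked to each other
    c = v.label[0]
    return alink(fwd_root, rev_root, slink(fwd_root, v)).children[c]
```

With `fwd_root = build_suffix_trie("banana")` and `rev_root = build_suffix_trie("ananab")`, the affix link of the node spelling `ban` is the node spelling `nab` in the reverse trie. In compacted trees the target may lie inside an edge, which this sketch deliberately avoids.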
Sampled Affix Link
Compressed Affix Tree Sampled (ACATS)
◮ Compressed Suffix Trees data structure
◮ Sampled Affix links
Compressed Affix Tree Non-Sampled
◮ Extreme case of ACATS: no Affix links are stored
◮ Albrecht and Heun (2012): optimal computation of Affix links using binary search
◮ Gog et al. (2014): faster solution (ACATN)
ACATN
RACATN
Bidirectional Wavelet Tree (BidWT)
◮ Proposed by Schnattinger et al. (2010–2012) and Lam et al. (2009)
◮ Uses a backward index for the input text T and another for T^r
◮ Easy transition between the two data structures
◮ Reduces space by a factor of 23 compared to the Affix Array
◮ Main operation: extend the current pattern by one character
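The "extend by one character" operation can be sketched as follows, under assumptions that go beyond the slide. The sketch keeps two synchronized 1-based intervals, one in the suffix array of `T$` for the pattern and one in the suffix array of `T^r$` for the reversed pattern. A backward-search step on one index updates its own interval, and the other interval is narrowed by counting the characters smaller than `c` inside the BWT interval. Plain strings and linear scans stand in for the wavelet trees of the actual structure, and the class and method names are hypothetical.

```python
from collections import Counter


def bwt_and_c(text):
    """Naive BWT and C array of a text that already ends with '$'."""
    sa = sorted(range(len(text)), key=lambda k: text[k:])
    bwt = "".join(text[k - 1] for k in sa)
    C, total = {}, 0
    for ch, cnt in sorted(Counter(text).items()):
        C[ch] = total
        total += cnt
    return bwt, C


class BidirectionalIndex:
    def __init__(self, text):
        # text is given without the sentinel; '$' is appended to T and T^r.
        self.fwd_bwt, self.C = bwt_and_c(text + "$")
        self.rev_bwt, _ = bwt_and_c(text[::-1] + "$")  # same C: same characters
        self.n = len(text) + 1

    def start(self):
        """Intervals of the empty pattern in both indexes."""
        return (1, self.n), (1, self.n)

    def _extend(self, bwt, c, main, other):
        if c not in self.C:
            return None
        i, j = main
        i2 = self.C[c] + bwt[: i - 1].count(c) + 1
        j2 = self.C[c] + bwt[:j].count(c)
        if i2 > j2:
            return None
        # Characters smaller than c inside BWT[i..j] shift the other interval.
        smaller = sum(1 for ch in bwt[i - 1 : j] if ch < c)
        start = other[0] + smaller
        return (i2, j2), (start, start + (j2 - i2))

    def extend_left(self, c, state):
        """Go from pattern P to cP; state is (forward, reverse) intervals."""
        fwd, rev = state
        return self._extend(self.fwd_bwt, c, fwd, rev)

    def extend_right(self, c, state):
        """Go from pattern P to Pc by searching backward in the index of T^r."""
        fwd, rev = state
        res = self._extend(self.rev_bwt, c, rev, fwd)
        return None if res is None else (res[1], res[0])
```

Starting from `idx.start()` on `BidirectionalIndex("mississippi")`, calling `extend_right("s", ...)`, then `extend_left("i", ...)`, then `extend_right("s", ...)` grows the pattern as `s`, `is`, `iss` while both intervals stay the same size. This alternation is what a search that grows outward from a seed, such as the hairpin example, relies on.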
Bidirectional Wavelet Tree
SCAT
Summary
Approach | Category     | Full Tree Operations | Description                            | Space
---------+--------------+----------------------+----------------------------------------+---------------------------------------
AfA      | Asynchronous | No                   | Strothmann's Affix Array               | 2 · (SA + LCP + child tables + ALink)
ACAT     | Asynchronous | Yes                  | Asynchronous Affix Tree implementation | 2 · (CST + ALink)
ACATS    | Asynchronous | Yes                  | Asynchronous Affix Tree implementation | 2 · (CST + sampled ALink)
ACATN    | Asynchronous | Yes                  | Gog et al. Affix Tree                  | 2 · (CST + rminq + rmaxq)
RACATN   | Asynchronous | Yes                  | Reduced version of ACATN               | 2 · (CST + rminq)
BidWT    | Synchronous  | No                   | Bidirectional BWT                      | 2 · (FM-Index)
SCAT     | Synchronous  | Yes                  | Synchronous Affix Tree implementation  | 2 · (CST)
Table: Compressed Affix Tree approaches studied in this work.
Construction
[Figure: two panels, DNA-50MB and ENGLISH-50MB. Time in milliseconds against number of bytes per character for AFA, ACAT, ACATS, ACATN, RACATN, BidWT and SCAT.]
Forward-Backward
[Figure: two panels, DNA-50MB and ENGLISH-50MB. Time in microseconds against number of bytes per character for AFA, ACAT, ACATS, ACATN, RACATN, BidWT and SCAT.]
Suffix-Children
[Figure: two panels, DNA-50MB and ENGLISH-50MB. Time in microseconds against number of bytes per character for AFA, ACAT, ACATS, ACATN, RACATN, BidWT and SCAT.]
Slink
[Figure: two panels, DNA-50MB and ENGLISH-50MB. Time in microseconds against number of bytes per character for ACAT, ACATS, ACATN, RACATN and SCAT.]