String indexing in the Word RAM model, part 4 Pawe Gawrychowski - PowerPoint PPT Presentation

String indexing in the Word RAM model, part 4 Paweł Gawrychowski University of Wrocław & Max-Planck-Institut für Informatik Paweł Gawrychowski String indexing in the Word RAM model IV 1 / 32

We consider a fundamental data structure question: how to represent a tree? (Compacted) Trie A trie is simply a tree with edges labeled by single characters. A compacted trie is created by replacing maximal chains of unary vertices with single edges labeled by (possibly long) words. Navigation queries Given a pattern p , we want to traverse the edges of a compacted trie to find the node corresponding to p . If there is no such node, we would like to compute its longest prefix for which the corresponding node does exist. Paweł Gawrychowski String indexing in the Word RAM model IV 2 / 32

Consider p = wewpxcwrehyzrt and the following compacted trie. qoidkbasdk wewpxc w t r r q e w qtkjdknewnbog povmnxd tovndfed hyugfecvbx n b o g khjkdjd d d f n v m c Paweł Gawrychowski String indexing in the Word RAM model IV 3 / 32

Consider p = wewpxcwrehyzrt and the following compacted trie. qoidkbasdk wewpxc w t r r q e w qtkjdknewnbog povmnxd tovndfed hy n b ugfecvbx o g khjkdjd d d f n v m c Paweł Gawrychowski String indexing in the Word RAM model IV 3 / 32

Splitting an edge Given an edge, we want to split it into two parts by (possibly) creating a node, and adding a new edge outgoing from this middle node. abrakadabra Notice that this covers adding a new edge outgoing from an existing node. Paweł Gawrychowski String indexing in the Word RAM model IV 4 / 32

Splitting an edge Given an edge, we want to split it into two parts by (possibly) creating a node, and adding a new edge outgoing from this middle node. abrakadabra z y x Notice that this covers adding a new edge outgoing from an existing node. Paweł Gawrychowski String indexing in the Word RAM model IV 4 / 32

Static case (yesterday) Given a compacted trie, can we quickly construct a small structure which allows us to execute navigation queries efficiently ? Dynamic case Can we maintain a compacted trie so that: the resulting structure is small , 1 we can execute navigation queries efficiently , 2 we can split any edge efficiently ? 3 There are clearly three parameters: the number of nodes in the compacted trie n , the size of the alphabet σ , and the length of the pattern m . We aim to achieve good bounds in terms of those n , σ, m . Paweł Gawrychowski String indexing in the Word RAM model IV 5 / 32

It seems reasonable to consider the scenario where σ is non-constant, yet (significantly) smaller than n . Hence we get the following question: what are the best possible time bounds in terms of σ ? Gawrychowski and Fischer There exists a deterministic linear-size structure supporting navigation log log log σ ) time and splitting edges in O ( log 2 log σ log 2 log σ in O ( m + log log log σ ) . To make the above result useful, we develop a suffix tree oracle which can be used to locate the edge which should be split after prepending log 2 log σ a letter to the current text in O ( log log n + log log log σ ) time. Paweł Gawrychowski String indexing in the Word RAM model IV 6 / 32

Let us consider the dynamic case, and assume that n = O ( σ ) . Here instead of the simple two-level scheme used in the static case we need to partition the nodes into more groups. Levels of nodes Let f ( ℓ ) = 2 ( 3 2 ) ℓ . We say that a node v is of level ℓ when the number of leaves in its subtree belongs to [ f ( ℓ ) , 2 f ( ℓ + 1 )] . We will maintain an invariant that a level of v doesn’t exceed the level of its parent. A fragment is a part of the tree consisting of nodes at the same level. Paweł Gawrychowski String indexing in the Word RAM model IV 7 / 32

Now, we classify the edges into two types: from a node to a node of the same level, 1 from a node to a node of a smaller level, 2 Paweł Gawrychowski String indexing in the Word RAM model IV 8 / 32

Now, we classify the edges into two types: from a node to a node of the same level, 1 from a node to a node of a smaller level, 2 Those edges are stored in a static dictionary with a constant access time. We already know that such dictionary can be constructed in close-to-linear time, and this turns out to be enough because of the way we defined the levels. More precisely, it cannot happen too often that a level of a node increases. Paweł Gawrychowski String indexing in the Word RAM model IV 8 / 32

Now, we classify the edges into two types: from a node to a node of the same level, 1 from a node to a node of a smaller level, 2 Those edges are stored in a dynamic dictionary structure. For this we develop a weighted variant of the exponential search trees of Andersson and Thorup, which we call the wexponential search trees. Andersson and Thorup 2002 An exponential search tree is a dynamic predecessor structure storing a subset of [ 1 , U ] with O ( log 2 log U log log log U ) time for insertions and predecessor queries. Paweł Gawrychowski String indexing in the Word RAM model IV 8 / 32

Even without the modification, the query complexity is fairly decent, log 3 log σ namely O ( m + log log log σ ) . This is because there are at most t = Θ( log log σ ) edges of type (2) on any path descending from the root. w i ∈ [ f ( i ) , 2 f ( i + 1)] w t w t − 1 w t − 2 w t − 3 Paweł Gawrychowski String indexing in the Word RAM model IV 9 / 32

We want to be faster though. The subsequent accesses to the dynamic dictionary structures are not completely independent, so there is hope! Wexponential search trees There exists a linear-size dynamic structure storing a collection of n weighted elements from [ 1 , U ] with the following bounds: predecessor search takes O ( log log W log log U log log log U ) , where W is the 1 log w current total weight, and w is the weight of the predecessor, inserting a new element of weight 1 takes O ( log log W ) , 2 increasing a weight of an element of weight w by 1 takes 3 O ( log log W log w ) . Paweł Gawrychowski String indexing in the Word RAM model IV 10 / 32

String indexing in the Word RAM model, part 4 Pawe Gawrychowski - PowerPoint PPT Presentation

String indexing in the Word RAM model, part 4 Pawe Gawrychowski University of Wrocaw & Max-Planck-Institut fr Informatik Pawe Gawrychowski String indexing in the Word RAM model IV 1 / 32 We consider a fundamental data structure

CACHING BEYOND RAM CACHING BEYOND RAM memcached.org/blog @dormando WHY RAM? WHY RAM?

String indexing in the Word RAM model, part 3 Pawe Gawrychowski University of Wrocaw &

String indexing in the Word RAM model, part 2 Pawe Gawrychowski University of Wrocaw &

The String Class Trace Code Constructing a String String s = "Java"; String

Cross Ram Support Set Ram accessories 1 Cross Ram Support Set Set composition The Cross

1 2 3+4 2 type Parser = String Tree type Parser = String ( Tree, String) type Parser =

String Matching Inge Li Grtz CLRS 32 String Matching String matching problem: string

String Matching String matching problem: string T (text) and string P (pattern) over an

Memory Memory Decoders M bits M bits RWM NVRWM ROM S 0 S 0 Word 0 Word 0 S 1 Word 1 Word

Distributed Indexing Indexing, session 8 CS6200: Information Retrieval Slides by: Jesse Anderton

Indexing Multimedia Multimedia Databases Databases Indexing Indexing Multimedia Databases

String Objectives Discuss string handling System.String class

EMS RAM PUMPS EMS RAM PUMPS INDUSTRIES LTD INDUSTRIES LTD Press ENTER to continue EMS

Random Access Memory (RAM) Key features RAM is traditionally packaged as a chip.

String Indexing for Patterns with Wildcards Philip Bille 1 , Inge Li Grtz 1 , Hjalte Wedel

HashMap Friday Four Square Today! Outside Gates at 4:15PM Not All Data is Linear

Compression: Other Lossless Compression Algorithms Greg Plaxton Theory in Programming Practice,

Thing Descriptions TD Serialization Carsten Bormann 2017-07-11 1 Objectives

Caribou: Intelligent Distributed Storage Zsolt Istvn, David Sidler, Gustavo Alonso Systems

Multimedia Systems WS 2010/2011 31.01.2011 M. Rahamatullah Khondoker (Room # 36/410 )

Pattern Matching on Compressed T exts II Shunsuke Inenaga Kyushu University, Japan Agenda

SALZA: Algorithmic Information Theory and Universal Classification for Sequences SeqBio 2018,

Inmarsat BGAN Vlad Galu <vlad.galu@inmarsat.com> Oct 9 th 2012 What is BGAN? Worldwide

Characterizing Mote Performance: A Vector-Based Methodology Martin Leopold, Marcus Chang, and

String indexing in the Word RAM model, part 4 Pawe Gawrychowski - PowerPoint PPT Presentation

String indexing in the Word RAM model, part 4 Pawe Gawrychowski University of Wrocaw & Max-Planck-Institut fr Informatik Pawe Gawrychowski String indexing in the Word RAM model IV 1 / 32 We consider a fundamental data structure

CACHING BEYOND RAM CACHING BEYOND RAM memcached.org/blog @dormando WHY RAM? WHY RAM?

String indexing in the Word RAM model, part 3 Pawe Gawrychowski University of Wrocaw &amp;

String indexing in the Word RAM model, part 2 Pawe Gawrychowski University of Wrocaw &amp;

The String Class Trace Code Constructing a String String s = &quot;Java&quot;; String

Cross Ram Support Set Ram accessories 1 Cross Ram Support Set Set composition The Cross

1 2 3+4 2 type Parser = String Tree type Parser = String ( Tree, String) type Parser =

String Matching Inge Li Grtz CLRS 32 String Matching String matching problem: string

String Matching String matching problem: string T (text) and string P (pattern) over an

Memory Memory Decoders M bits M bits RWM NVRWM ROM S 0 S 0 Word 0 Word 0 S 1 Word 1 Word

Distributed Indexing Indexing, session 8 CS6200: Information Retrieval Slides by: Jesse Anderton

Indexing Multimedia Multimedia Databases Databases Indexing Indexing Multimedia Databases

String Objectives Discuss string handling System.String class

EMS RAM PUMPS EMS RAM PUMPS INDUSTRIES LTD INDUSTRIES LTD Press ENTER to continue EMS

Random Access Memory (RAM) Key features RAM is traditionally packaged as a chip.

String Indexing for Patterns with Wildcards Philip Bille 1 , Inge Li Grtz 1 , Hjalte Wedel

HashMap Friday Four Square Today! Outside Gates at 4:15PM Not All Data is Linear

Compression: Other Lossless Compression Algorithms Greg Plaxton Theory in Programming Practice,

Thing Descriptions TD Serialization Carsten Bormann 2017-07-11 1 Objectives

Caribou: Intelligent Distributed Storage Zsolt Istvn, David Sidler, Gustavo Alonso Systems

Multimedia Systems WS 2010/2011 31.01.2011 M. Rahamatullah Khondoker (Room # 36/410 )

Pattern Matching on Compressed T exts II Shunsuke Inenaga Kyushu University, Japan Agenda

SALZA: Algorithmic Information Theory and Universal Classification for Sequences SeqBio 2018,

Inmarsat BGAN Vlad Galu &lt;vlad.galu@inmarsat.com&gt; Oct 9 th 2012 What is BGAN? Worldwide

Characterizing Mote Performance: A Vector-Based Methodology Martin Leopold, Marcus Chang, and

String indexing in the Word RAM model, part 3 Pawe Gawrychowski University of Wrocaw &

String indexing in the Word RAM model, part 2 Pawe Gawrychowski University of Wrocaw &

The String Class Trace Code Constructing a String String s = "Java"; String

Inmarsat BGAN Vlad Galu <vlad.galu@inmarsat.com> Oct 9 th 2012 What is BGAN? Worldwide