Review: summary of the performance of symbol-table implementations - - PowerPoint PPT Presentation


BBM 202 - ALGORITHMS. DEPT. OF COMPUTER ENGINEERING. ERKUT ERDEM. TRIES. Apr. 21, 2015. Acknowledgement: The course slides are adapted from the slides prepared by R. Sedgewick and K. Wayne of Princeton University.


slide-1
SLIDE 1
  • Apr. 21, 2015

BBM 202 - ALGORITHMS

TRIES

  • DEPT. OF COMPUTER ENGINEERING


 ERKUT ERDEM

Acknowledgement: The course slides are adapted from the slides prepared by R. Sedgewick and K. Wayne of Princeton University.

slide-2
SLIDE 2

TODAY

  • Tries
  • R-way tries
  • Ternary search tries
  • Character-based operations
slide-3
SLIDE 3

Review: summary of the performance of symbol-table implementations

Order of growth of the frequency of operations.

                   typical case
implementation     search   insert   delete   ordered operations   operations on keys
red-black BST      log N    log N    log N    yes                  compareTo()
hash table         1 †      1 †      1 †      no                   equals(), hashCode()

  • † under uniform hashing assumption

  • Q. Can we do better?

A. Yes, if we can avoid examining the entire key, as with string sorting.
slide-4
SLIDE 4

String symbol table. Symbol table specialized to string keys.

String symbol table basic API

public class StringST<Value>
          StringST()                     create an empty symbol table
   void   put(String key, Value val)     put key-value pair into the symbol table
   Value  get(String key)                return value paired with given key
   void   delete(String key)             delete key and corresponding value

  • Goal. Faster than hashing, more flexible than BSTs.

slide-5
SLIDE 5

String symbol table implementations cost summary

                            character accesses (typical case)                              dedup
implementation              search hit    search miss   insert     space (references)     moby.txt   actors.txt
red-black BST               L + c lg² N   c lg² N       c lg² N    4N                     1.4        97.4
hashing (linear probing)    L             L             L          4N to 16N              0.76       40.6

Parameters

  • N = number of strings
  • L = length of string
  • R = radix

file         size     words    distinct
moby.txt     1.2 MB   210 K    32 K
actors.txt   82 MB    11.4 M   900 K

  • Challenge. Efficient performance for string keys.

slide-6
SLIDE 6
  • R-way tries
  • Ternary search tries
  • Character-based operations

TRIES

slide-7
SLIDE 7
  • Tries. [from retrieval, but pronounced "try"]
  • Store characters in nodes (not keys).
  • Each node has R children, one for each possible character.
  • Store values in nodes corresponding to last characters in keys.

Tries

[Figure: trie for the keys in the table below. Annotations: root; link to trie for all keys that start with s; link to trie for all keys that start with she; value for she in node corresponding to last key character; for now, we do not draw null links.]

key      value
by       4
sea      6
sells    1
she      0
shells   3
shore    7
the      5
slide-8
SLIDE 8

Follow links corresponding to each character in the key.

  • Search hit: node where search ends has a non-null value.
  • Search miss: reach a null link or node where search ends has null value.

Search in a trie

[Figure sequence: four searches in the trie holding by, sea, sells, she, shells, shore, the.]

  • get("shells"): return value associated with last key character (return 3).
  • get("she"): search may terminate at an intermediate node (return 0).
  • get("shell"): no value associated with last key character (return null).
  • get("shelter"): no link to 't' (return null).

slide-12
SLIDE 12

Follow links corresponding to each character in the key.

  • Encounter a null link: create new node.
  • Encounter the last character of the key: set value in that node.

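Insertion into a trie

[Figure: put("shore", 7) in the trie holding by, sea, sells, she, shells, the. New nodes o, r, e are created below the node for sh, and the value 7 is set in the last one.]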

slide-13
SLIDE 13

Trie construction demo

[Figure sequence: an initially empty trie, redrawn after each insertion.]

  • put("she", 0): value is in node corresponding to last character; key is sequence of characters from root to value.
  • put("sells", 1)
  • put("sea", 2)
  • put("shells", 3)
  • put("by", 4)
  • put("the", 5)
  • put("sea", 6): overwrite old value with new value.
  • put("shore", 7)

slide-33
SLIDE 33

Trie representation: Java implementation

  • Node. A value, plus references to R nodes.

private static class Node
{
   private Object val;                  // use Object instead of Value since no generic array creation in Java
   private Node[] next = new Node[R];   // characters are implicitly defined by link index
}

[Figure: trie representation. Each node has an array of links and a value; neither keys nor characters are explicitly stored.]

slide-34
SLIDE 34

R-way trie: Java implementation

public class TrieST<Value>
{
   private static final int R = 256;   // extended ASCII
   private Node root;

   private static class Node
   { /* see previous slide */ }

   public void put(String key, Value val)
   { root = put(root, key, val, 0); }

   private Node put(Node x, String key, Value val, int d)
   {
      if (x == null) x = new Node();
      if (d == key.length()) { x.val = val; return x; }
      char c = key.charAt(d);
      x.next[c] = put(x.next[c], key, val, d+1);
      return x;
   }
   ⋮

slide-35
SLIDE 35

R-way trie: Java implementation (continued)

   ⋮
   public boolean contains(String key)
   { return get(key) != null; }

   public Value get(String key)
   {
      Node x = get(root, key, 0);
      if (x == null) return null;
      return (Value) x.val;   // cast needed
   }

   private Node get(Node x, String key, int d)
   {
      if (x == null) return null;
      if (d == key.length()) return x;
      char c = key.charAt(d);
      return get(x.next[c], key, d+1);
   }
}

slide-36
SLIDE 36

Trie performance

Search hit. Need to examine all L characters for equality.

Search miss.

  • Could have mismatch on first character.
  • Typical case: examine only a few characters (sublinear).

Space. R null links at each leaf (but sublinear space possible if many short strings share common prefixes).

Bottom line. Fast search hit and even faster search miss, but wastes space.

slide-37
SLIDE 37

To delete a key-value pair:

  • Find the node corresponding to key and set value to null.
  • If that node has all null links, remove that node (and recur).

Deletion in an R-way trie

[Figure: delete("shells") in the trie holding by, sea, sells, she, shells, shore, the. Annotations: set value to null; null value and links (delete node).]
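The two deletion steps above can be sketched in Java as follows. This is a minimal sketch, not the course's implementation: `put` follows the slide code, while `get` is an iterative variant, and `delete`, the class name `TrieDeleteSketch`, and the helper `nodeCount()` are assumptions introduced here for illustration. Keys are assumed to contain only extended-ASCII characters (code below 256).

```java
// Sketch of deletion in an R-way trie (class name and nodeCount() are hypothetical).
class TrieDeleteSketch<Value> {
    private static final int R = 256;   // extended ASCII, as on the slides
    private Node root;

    private static class Node {
        Object val;                     // Object because Java has no generic array creation
        Node[] next = new Node[R];
    }

    public void put(String key, Value val) { root = put(root, key, val, 0); }

    private Node put(Node x, String key, Value val, int d) {
        if (x == null) x = new Node();
        if (d == key.length()) { x.val = val; return x; }
        char c = key.charAt(d);
        x.next[c] = put(x.next[c], key, val, d + 1);
        return x;
    }

    @SuppressWarnings("unchecked")
    public Value get(String key) {
        Node x = root;
        for (int d = 0; x != null && d < key.length(); d++) x = x.next[key.charAt(d)];
        return x == null ? null : (Value) x.val;
    }

    public void delete(String key) { root = delete(root, key, 0); }

    // Returns the subtrie rooted at x with key removed, or null if it became empty.
    private Node delete(Node x, String key, int d) {
        if (x == null) return null;
        if (d == key.length()) {
            x.val = null;                                   // step 1: set value to null
        } else {
            char c = key.charAt(d);
            x.next[c] = delete(x.next[c], key, d + 1);      // recur, then re-link
        }
        if (x.val != null) return x;                        // node still ends a key
        for (Node child : x.next)
            if (child != null) return x;                    // node still leads to a key
        return null;                                        // step 2: null value, all null links
    }

    // Hypothetical helper for the demo: number of nodes reachable from the root.
    public int nodeCount() { return nodeCount(root); }

    private int nodeCount(Node x) {
        if (x == null) return 0;
        int n = 1;
        for (Node child : x.next) n += nodeCount(child);
        return n;
    }

    public static void main(String[] args) {
        TrieDeleteSketch<Integer> st = new TrieDeleteSketch<>();
        String[] keys = { "she", "sells", "sea", "shells", "by", "the", "sea", "shore" };
        for (int i = 0; i < keys.length; i++) st.put(keys[i], i);
        int before = st.nodeCount();
        st.delete("shells");
        System.out.println(st.get("shells") + " " + st.get("she"));
        System.out.println("nodes removed: " + (before - st.nodeCount()));
    }
}
```

Deleting "shells" removes only the three nodes below "she" (l, l, s): the recursion stops pruning at the first ancestor that still holds a value or a non-null link.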

slide-38
SLIDE 38

String symbol table implementations cost summary

                            character accesses (typical case)                              dedup
implementation              search hit    search miss   insert     space (references)     moby.txt   actors.txt
red-black BST               L + c lg² N   c lg² N       c lg² N    4N                     1.4        97.4
hashing (linear probing)    L             L             L          4N to 16N              0.76       40.6
R-way trie                  L             log_R N       L          (R+1) N                1.12       out of memory

R-way trie.

  • Method of choice for small R.
  • Too much memory for large R.

  • Challenge. Use less memory, e.g., 65,536-way trie for Unicode!

slide-39
SLIDE 39

Digression: out of memory?

“640 K ought to be enough for anybody.”
 - (mis)attributed to Bill Gates, 1981 (commenting on the amount of RAM in personal computers)

“64 MB of RAM may limit performance of some Windows XP features; therefore, 128 MB or higher is recommended for best performance.”
 - Windows XP manual, 2002

“64 bit is coming to desktops, there is no doubt about that. But apart from Photoshop, I can't think of desktop applications where you would need more than 4GB of physical memory, which is what you have to have in order to benefit from this technology. Right now, it is costly.”
 - Bill Gates, 2003

slide-40
SLIDE 40

Digression: out of memory?

A short (approximate) history.

machine     year     address bits   addressable memory   typical actual memory   cost
PDP-8       1960s    12             6 KB                 6 KB                    $16K
PDP-10      1970s    18             256 KB               256 KB                  $1M
IBM S/360   1970s    24             4 MB                 512 KB                  $1M
VAX         1980s    32             4 GB                 1 MB                    $1M
Pentium     1990s    32             4 GB                 1 GB                    $1K
Xeon        2000s    64             enough               4 GB                    $100
??          future   128+           enough               enough                  $1

“512-bit words ought to be enough for anybody.”
 - Kevin Wayne, 1995

slide-41
SLIDE 41
  • R-way tries
  • Ternary search tries
  • Character-based operations

TRIES

slide-42
SLIDE 42

Ternary search tries

  • Store characters and values in nodes (not keys).
  • Each node has three children: smaller (left), equal (middle), larger (right).

[Figure: first page of the paper "Fast Algorithms for Sorting and Searching Strings" by Jon L. Bentley (Bell Labs, Lucent Technologies) and Robert Sedgewick (Princeton University). Per its abstract, the searching algorithm blends tries and binary search trees and is faster than hashing and other commonly used search methods.]

slide-43
SLIDE 43
  • Store characters and values in nodes (not keys).
  • Each node has three children: smaller (left), equal (middle), larger (right).

Ternary search tries

[Figure: TST representation of a trie. Each node has three links. Annotations: link to TST for all keys that start with a letter before s; link to TST for all keys that start with s.]

slide-44
SLIDE 44

Follow links corresponding to each character in the key.

  • If less, take left link; if greater, take right link.
  • If equal, take the middle link and move to the next key character.

Search hit. Node where search ends has a non-null value. Search miss. Reach a null link or node where search ends has null value.

Search in a TST

[Figure: get("sea") in a TST. Annotations: match: take middle link, move to next char; mismatch: take left or right link, do not move to next char; return value associated with last key character.]

slide-45
SLIDE 45

Search in a TST

[Figure: get("sea") in the TST holding by, sea, sells, she, shells, shore, the.]

  • get("sea"): return value associated with last key character (return 6).

slide-46
SLIDE 46

Search in a TST

[Figure: get("shelter") in the TST holding by, sea, sells, she, shells, shore, the.]

  • get("shelter"): no link to 't' (return null).

slide-47
SLIDE 47

Ternary search trie insertion demo

[Figure sequence: an initially empty ternary search trie, redrawn after each insertion.]

  • put("she", 0): value is in node corresponding to last character; key is sequence of characters from root to value using middle links.
  • put("sells", 1)
  • put("sea", 2)
  • put("shells", 3)
  • put("by", 4)
  • put("the", 5)
  • put("sea", 6): overwrite old value with new value.
  • put("shore", 7)

slide-65
SLIDE 65

26-way trie. 26 null links in each leaf.

  • TST. 3 null links in each leaf.

26-way trie vs. TST

[Figure: 26-way trie (1035 null links, not shown) and TST (155 null links) built from the words below.]

now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad caw cue fee tap ago tar jam dug and

slide-66
SLIDE 66

A TST node is five fields:

  • A value.
  • A character c.
  • A reference to a left TST.
  • A reference to a middle TST.
  • A reference to a right TST.

TST representation in Java

private class Node
{
   private Value val;
   private char c;
   private Node left, mid, right;
}

[Figure: trie node representations, standard array of links (R = 26) versus ternary search tree (TST). Annotations: link for keys that start with s; link for keys that start with su.]

slide-67
SLIDE 67

TST: Java implementation

public class TST<Value>
{
   private Node root;

   private class Node
   { /* see previous slide */ }

   public void put(String key, Value val)
   { root = put(root, key, val, 0); }

   private Node put(Node x, String key, Value val, int d)
   {
      char c = key.charAt(d);
      if (x == null) { x = new Node(); x.c = c; }
      if      (c < x.c)              x.left  = put(x.left,  key, val, d);
      else if (c > x.c)              x.right = put(x.right, key, val, d);
      else if (d < key.length() - 1) x.mid   = put(x.mid,   key, val, d+1);
      else                           x.val   = val;
      return x;
   }
   ⋮

slide-68
SLIDE 68

TST: Java implementation (continued)

   ⋮
   public boolean contains(String key)
   { return get(key) != null; }

   public Value get(String key)
   {
      Node x = get(root, key, 0);
      if (x == null) return null;
      return x.val;
   }

   private Node get(Node x, String key, int d)
   {
      if (x == null) return null;
      char c = key.charAt(d);
      if      (c < x.c)              return get(x.left,  key, d);
      else if (c > x.c)              return get(x.right, key, d);
      else if (d < key.length() - 1) return get(x.mid,   key, d+1);
      else                           return x;
   }
}
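The TST code on these two slides calls key.charAt(d) before any length check, so an empty key would throw. The sketch below is a self-contained variant for experimenting with the running example: `put` matches the slide logic, `get` is rewritten as a loop, and the empty-key handling (reject on put, miss on get) is an assumption made here, not something the slides specify. The class name `TSTSketch` is hypothetical.

```java
// Self-contained TST sketch (hypothetical class name; empty-key policy is an assumption).
class TSTSketch<Value> {
    private Node root;

    private class Node {
        Value val;
        char c;
        Node left, mid, right;
    }

    public void put(String key, Value val) {
        if (key.isEmpty()) throw new IllegalArgumentException("key must be non-empty");
        root = put(root, key, val, 0);
    }

    private Node put(Node x, String key, Value val, int d) {
        char c = key.charAt(d);
        if (x == null) { x = new Node(); x.c = c; }
        if      (c < x.c)              x.left  = put(x.left,  key, val, d);
        else if (c > x.c)              x.right = put(x.right, key, val, d);
        else if (d < key.length() - 1) x.mid   = put(x.mid,   key, val, d + 1);
        else                           x.val   = val;
        return x;
    }

    public Value get(String key) {
        if (key.isEmpty()) return null;          // assumption: empty key is always a miss
        Node x = root;
        int d = 0;
        while (x != null) {
            char c = key.charAt(d);
            if      (c < x.c) x = x.left;        // mismatch: stay on the same character
            else if (c > x.c) x = x.right;
            else if (d < key.length() - 1) { x = x.mid; d++; }   // match: next character
            else return x.val;                   // last character matched
        }
        return null;                             // reached a null link
    }

    public static void main(String[] args) {
        TSTSketch<Integer> st = new TSTSketch<>();
        String[] keys = { "she", "sells", "sea", "shells", "by", "the", "sea", "shore" };
        for (int i = 0; i < keys.length; i++) st.put(keys[i], i);
        System.out.println(st.get("sea"));       // 6
        System.out.println(st.get("shelter"));   // null
    }
}
```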

slide-69
SLIDE 69

String symbol table implementation cost summary

                            character accesses (typical case)                              dedup
implementation              search hit    search miss   insert     space (references)     moby.txt   actors.txt
red-black BST               L + c lg² N   c lg² N       c lg² N    4N                     1.4        97.4
hashing (linear probing)    L             L             L          4N to 16N              0.76       40.6
R-way trie                  L             log_R N       L          (R+1) N                1.12       out of memory
TST                         L + ln N      ln N          L + ln N   4N                     0.72       38.7

  • Remark. Can build balanced TSTs via rotations to achieve L + log N worst-case guarantees.

Bottom line. TST is as fast as hashing (for string keys), space efficient.

slide-70
SLIDE 70

TST vs. hashing

Hashing.

  • Need to examine entire key.
  • Search hits and misses cost about the same.
  • Performance relies on hash function.
  • Does not support ordered symbol table operations.

TSTs.

  • Works only for strings (or digital keys).
  • Examines just enough key characters.
  • Search miss may involve only a few characters.
  • Supports ordered symbol table operations (plus others!).

Bottom line. TSTs are:

  • Faster than hashing (especially for search misses).
  • More flexible than red-black BSTs. [stay tuned]

slide-71
SLIDE 71

Tries.

  • Store characters in nodes (not keys).
  • Each node has R children, one for each possible character.
  • Store values in nodes corresponding to last characters in keys.

Ternary Search Trees (TSTs)

  • Store characters and values in nodes (not keys).
  • Each node has three children: smaller (left), equal (middle), larger (right).

Previously on BBM202..

[Figure: TST representation of a trie. Each node has three links. Annotations: link to TST for all keys that start with a letter before s; link to TST for all keys that start with s.]

slide-72
SLIDE 72
  • R-way tries
  • Ternary search tries
  • Character-based operations

TRIES

slide-73
SLIDE 73

Character-based operations. The string symbol table API supports several useful character-based operations.

key      value
by       4
sea      6
sells    1
she      0
shells   3
shore    7
the      5

Prefix match. Keys with prefix "sh": "she", "shells", and "shore".
Wildcard match. Keys that match ".he": "she" and "the".
Longest prefix. Key that is the longest prefix of "shellsort": "shells".

String symbol table API

slide-74
SLIDE 74
String symbol table API

public class StringST<Value>
                     StringST()                       create a symbol table with string keys
   void              put(String key, Value val)       put key-value pair into the symbol table
   Value             get(String key)                  value paired with key
   void              delete(String key)               delete key and corresponding value

   Iterable<String>  keys()                           all keys
   Iterable<String>  keysWithPrefix(String s)         keys having s as a prefix
   Iterable<String>  keysThatMatch(String s)          keys that match s (where . is a wildcard)
   String            longestPrefixOf(String s)        longest key that is a prefix of s

  • Remark. Can also add other ordered ST methods, e.g., floor() and rank().

slide-75
SLIDE 75

To iterate through all keys in sorted order:

  • Do inorder traversal of trie; add keys encountered to a queue.
  • Maintain sequence of characters on path from root to node.

Warmup: ordered iteration

[Figure: trace of keysWithPrefix(""); on the trie holding by 4, sea 6, sells 1, she 0, shells 3, shore 7, the 5.]

Prefixes visited (key): b by s se sea sel sell sells sh she shell shells sho shor shore t th the

Queue contents (q) after each key is found:
   by
   by sea
   by sea sells
   by sea sells she
   by sea sells she shells
   by sea sells she shells shore
   by sea sells she shells shore the

slide-76
SLIDE 76

To iterate through all keys in sorted order:

  • Do inorder traversal of trie; add keys encountered to a queue.
  • Maintain sequence of characters on path from root to node.

Ordered iteration: Java implementation

public Iterable<String> keys()
{
   Queue<String> queue = new Queue<String>();
   collect(root, "", queue);
   return queue;
}

// prefix is the sequence of characters on the path from root to x
private void collect(Node x, String prefix, Queue<String> q)
{
   if (x == null) return;
   if (x.val != null) q.enqueue(prefix);
   for (char c = 0; c < R; c++)
      collect(x.next[c], prefix + c, q);
}
slide-77
SLIDE 77

Find all keys in symbol table starting with a given prefix.

  • Ex. Autocomplete in a cell phone, search bar, text editor, or shell.
  • User types characters one at a time.
  • System reports all matching strings.


Prefix matches

slide-78
SLIDE 78

Find all keys in symbol table starting with a given prefix.

Prefix matches

[Figure: prefix match in a trie, keysWithPrefix("sh"); on the trie holding by 4, sea 6, sells 1, she 0, shells 3, shore 7, the 5. Step 1: find subtrie for all keys beginning with "sh". Step 2: collect keys in that subtrie.]

Prefixes visited (key): sh she shel shell shells sho shor shore

Queue contents (q) after each key is found:
   she
   she shells
   she shells shore

public Iterable<String> keysWithPrefix(String prefix)
{
   Queue<String> queue = new Queue<String>();
   Node x = get(root, prefix, 0);   // root of subtrie for all strings beginning with given prefix
   collect(x, prefix, queue);
   return queue;
}

slide-79
SLIDE 79

Use wildcard . to match any character in alphabet.

Wildcard matches

co....er: coalizer coberger codifier cofaster cofather cognizer cohelper colander coleader ... compiler ... composer computer cowkeper

.c...c.: acresce acroach acuracy octarch science scranch scratch scrauch screich scrinch scritch scrunch scudick scutock

slide-80
SLIDE 80

Search as usual if character is not a period;
 go down all R branches if query character is a period.

Wildcard matches

public Iterable<String> keysThatMatch(String pat)
{
   Queue<String> queue = new Queue<String>();
   collect(root, "", pat, queue);
   return queue;
}

private void collect(Node x, String prefix, String pat, Queue<String> q)
{
   if (x == null) return;
   int d = prefix.length();
   if (d == pat.length() && x.val != null) q.enqueue(prefix);
   if (d == pat.length()) return;
   char next = pat.charAt(d);
   for (char c = 0; c < R; c++)
      if (next == '.' || next == c)
         collect(x.next[c], prefix + c, pat, q);
}
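To try ordered iteration, prefix match, and wildcard match on the running example, the slide methods can be combined into one runnable class. This is a sketch under stated assumptions: `java.util.List` stands in for the course's Queue class, values are fixed to Integer, and the class name `TrieKeysSketch` is hypothetical. Keys are assumed to be extended ASCII.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: keys(), keysWithPrefix() and keysThatMatch() on an R-way trie.
// java.util.List replaces the slides' Queue type (an assumption for self-containment).
class TrieKeysSketch {
    private static final int R = 256;
    private Node root;

    private static class Node {
        Integer val;
        Node[] next = new Node[R];
    }

    public void put(String key, int val) { root = put(root, key, val, 0); }

    private Node put(Node x, String key, int val, int d) {
        if (x == null) x = new Node();
        if (d == key.length()) { x.val = val; return x; }
        char c = key.charAt(d);
        x.next[c] = put(x.next[c], key, val, d + 1);
        return x;
    }

    private Node get(Node x, String key, int d) {
        if (x == null) return null;
        if (d == key.length()) return x;
        return get(x.next[key.charAt(d)], key, d + 1);
    }

    public List<String> keys() { return keysWithPrefix(""); }

    public List<String> keysWithPrefix(String prefix) {
        List<String> q = new ArrayList<>();
        collect(get(root, prefix, 0), prefix, q);   // start at the subtrie for prefix
        return q;
    }

    // prefix is the sequence of characters on the path from the root to x
    private void collect(Node x, String prefix, List<String> q) {
        if (x == null) return;
        if (x.val != null) q.add(prefix);
        for (char c = 0; c < R; c++) collect(x.next[c], prefix + c, q);
    }

    public List<String> keysThatMatch(String pat) {
        List<String> q = new ArrayList<>();
        collect(root, "", pat, q);
        return q;
    }

    private void collect(Node x, String prefix, String pat, List<String> q) {
        if (x == null) return;
        int d = prefix.length();
        if (d == pat.length()) { if (x.val != null) q.add(prefix); return; }
        char next = pat.charAt(d);
        for (char c = 0; c < R; c++)
            if (next == '.' || next == c) collect(x.next[c], prefix + c, pat, q);
    }

    public static void main(String[] args) {
        TrieKeysSketch st = new TrieKeysSketch();
        String[] keys = { "she", "sells", "sea", "shells", "by", "the", "sea", "shore" };
        for (int i = 0; i < keys.length; i++) st.put(keys[i], i);
        System.out.println(st.keys());                 // [by, sea, sells, she, shells, shore, the]
        System.out.println(st.keysWithPrefix("sh"));   // [she, shells, shore]
        System.out.println(st.keysThatMatch(".he"));   // [she, the]
    }
}
```

Because links are visited in increasing character order, all three methods return keys in sorted order without any explicit sort.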

slide-81
SLIDE 81

Longest prefix

Find longest key in symbol table that is a prefix of query string.

  • Ex. To send packet toward destination IP address, router chooses IP address in routing table that is longest prefix match. (Addresses are represented as 32-bit binary numbers for IPv4, instead of strings.)

Routing table keys:
   "128"
   "128.112"
   "128.112.055"
   "128.112.055.15"
   "128.112.136"
   "128.112.155.11"
   "128.112.155.13"
   "128.222"
   "128.222.136"

longestPrefixOf("128.112.136.11") = "128.112.136"
longestPrefixOf("128.112.100.16") = "128.112"
longestPrefixOf("128.166.123.45") = "128"

  • Note. Not the same as floor: floor("128.112.100.16") = "128.112.055.15"

slide-82
SLIDE 82

Longest prefix

Find longest key in symbol table that is a prefix of query string.

  • Search for query string.
  • Keep track of longest key encountered.

Possibilities for longestPrefixOf() [Figure: trie holding she 0, sells 1, sea 2, shells 3]:

  • "she": search ends at end of string, value is not null; return she.
  • "shell": search ends at end of string, value is null; return she (last key on path).
  • "shellsort": search ends at null link; return shells (last key on path).

slide-83
SLIDE 83

Longest prefix: Java implementation

Find longest key in symbol table that is a prefix of query string.

  • Search for query string.
  • Keep track of longest key encountered.

public String longestPrefixOf(String query)
{
   int length = search(root, query, 0, 0);
   return query.substring(0, length);
}

private int search(Node x, String query, int d, int length)
{
   if (x == null) return length;
   if (x.val != null) length = d;
   if (d == query.length()) return length;
   char c = query.charAt(d);
   return search(x.next[c], query, d+1, length);
}
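The longest-prefix search can be checked against the routing-table examples with a small self-contained sketch. Assumptions made here for brevity: the trie stores a set of keys (a boolean flag instead of a value), the search is written as a loop rather than the recursive form on the slide, and addresses are treated as plain strings rather than 32-bit numbers. The class name `LongestPrefixSketch` is hypothetical.

```java
// Sketch: longest-prefix match on a trie of string keys (hypothetical class name).
class LongestPrefixSketch {
    private static final int R = 256;
    private Node root;

    private static class Node {
        boolean isKey;                 // assumption: a key flag instead of a stored value
        Node[] next = new Node[R];
    }

    public void add(String key) { root = add(root, key, 0); }

    private Node add(Node x, String key, int d) {
        if (x == null) x = new Node();
        if (d == key.length()) { x.isKey = true; return x; }
        char c = key.charAt(d);
        x.next[c] = add(x.next[c], key, d + 1);
        return x;
    }

    public String longestPrefixOf(String query) {
        int length = 0;                            // length of the longest key seen so far
        Node x = root;
        for (int d = 0; x != null; d++) {
            if (x.isKey) length = d;               // the first d characters form a key
            if (d == query.length()) break;        // search ends at end of string
            x = x.next[query.charAt(d)];           // may become null: search ends at null link
        }
        return query.substring(0, length);
    }

    public static void main(String[] args) {
        LongestPrefixSketch table = new LongestPrefixSketch();
        String[] routes = { "128", "128.112", "128.112.055", "128.112.055.15", "128.112.136",
                            "128.112.155.11", "128.112.155.13", "128.222", "128.222.136" };
        for (String r : routes) table.add(r);
        System.out.println(table.longestPrefixOf("128.112.136.11"));   // 128.112.136
        System.out.println(table.longestPrefixOf("128.112.100.16"));   // 128.112
        System.out.println(table.longestPrefixOf("128.166.123.45"));   // 128
    }
}
```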

slide-84
SLIDE 84

T9 texting

  • Goal. Type text messages on a phone keypad.

Multi-tap input. Enter a letter by repeatedly pressing a key until the desired letter appears.

T9 text input.

  • Find all words that correspond to given sequence of numbers.
  • Press 0 to see all completion options.

  • Ex. hello
  • Multi-tap: 4 4 3 3 5 5 5 5 5 5 6 6 6
  • T9: 4 3 5 5 6

www.t9.com "a much faster and more fun way to enter text"
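One way to find all words for a digit sequence is a wildcard-style trie search in which each digit stands for the three or four letters on its key. The sketch below is an illustration under assumptions, not the actual T9 algorithm: it uses a 26-way trie over lowercase a-z, the standard keypad letter layout, digits 2 through 9 only, and a tiny made-up word list. The names `T9Sketch`, `wordsFor`, and `digitsFor` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: T9-style lookup as a constrained trie search (hypothetical names throughout).
class T9Sketch {
    // Letters on keys 2..9 of a standard phone keypad.
    private static final String[] KEYPAD =
        { "abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz" };

    private static class Node {
        boolean isWord;
        Node[] next = new Node[26];    // assumption: lowercase a-z only
    }

    private final Node root = new Node();

    public void add(String word) {
        Node x = root;
        for (char ch : word.toCharArray()) {
            int i = ch - 'a';
            if (x.next[i] == null) x.next[i] = new Node();
            x = x.next[i];
        }
        x.isWord = true;
    }

    // All stored words whose letters correspond to the given digits (each in '2'..'9').
    public List<String> wordsFor(String digits) {
        List<String> out = new ArrayList<>();
        collect(root, "", digits, out);
        return out;
    }

    private void collect(Node x, String prefix, String digits, List<String> out) {
        if (x == null) return;
        int d = prefix.length();
        if (d == digits.length()) { if (x.isWord) out.add(prefix); return; }
        // Follow only the links for letters printed on the current digit's key.
        for (char ch : KEYPAD[digits.charAt(d) - '2'].toCharArray())
            collect(x.next[ch - 'a'], prefix + ch, digits, out);
    }

    // Digit sequence a user would press to enter the word with T9.
    public static String digitsFor(String word) {
        StringBuilder sb = new StringBuilder();
        for (char ch : word.toCharArray())
            for (int k = 0; k < KEYPAD.length; k++)
                if (KEYPAD[k].indexOf(ch) >= 0) sb.append((char) ('2' + k));
        return sb.toString();
    }

    public static void main(String[] args) {
        T9Sketch t9 = new T9Sketch();
        for (String w : new String[] { "hello", "good", "home", "gone", "hood" }) t9.add(w);
        System.out.println(digitsFor("hello"));    // 43556
        System.out.println(t9.wordsFor("43556"));  // [hello]
        System.out.println(t9.wordsFor("4663"));   // [gone, good, home, hood]
    }
}
```

The sequence 4663 shows why T9 needs a way to cycle through completion options: several dictionary words can share one digit sequence.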

slide-85
SLIDE 85

A world without “s” ??

To: "'Kevin Wayne'" <wayne@CS.Princeton.EDU>
Date: Tue, 25 Oct 2005 12:44:42 -0700

Thank you Kevin. I am glad that you find T9 o valuable for your cla. I had not noticed thi before. Thank for writing in and letting u know.

Take care,
Brooke nyder
OEM Dev upport
AOL/Tegic Communication
1000 Dexter Ave N. uite 300
eattle, WA 98109

ALL INFORMATION CONTAINED IN THIS EMAIL IS CONSIDERED CONFIDENTIAL AND PROPERTY OF AOL/TEGIC COMMUNICATIONS

slide-86
SLIDE 86

Patricia trie

Patricia trie. [Practical Algorithm to Retrieve Information Coded in Alphanumeric]

  • Remove one-way branching.
  • Each node represents a sequence of characters.
  • Implementation: one step beyond this course.

Applications.

  • Database search.
  • P2P network search.
  • IP routing tables: find longest prefix match.
  • Compressed quad-tree for N-body simulation.
  • Efficiently storing and querying XML documents.

Also known as: crit-bit tree, radix tree.

[Figure: put("shells", 1); put("shellfish", 2); shown as a standard trie (with internal one-way branching and external one-way branching) and as a trie with no one-way branching, whose nodes hold the sequences shell, s, and fish.]
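The idea of removing one-way branching can be illustrated with a small radix-tree sketch in which every link carries a string label instead of a single character. This is only one possible design, written for the shells/shellfish example above; it is not the classic bit-level Patricia structure, and the class name `RadixSketch`, its map-based child links, and the helper `nodeCount()` are all assumptions introduced here.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: trie without one-way branching (hypothetical design, string-labeled nodes).
class RadixSketch {
    private static class Node {
        String label = "";                         // characters on the link into this node
        Integer val;
        Map<Character, Node> next = new HashMap<>();
    }

    private final Node root = new Node();

    public void put(String key, int val) {
        Node x = root;
        int d = 0;
        while (d < key.length()) {
            Node child = x.next.get(key.charAt(d));
            if (child == null) {                   // no link: one node holds the whole remainder
                child = new Node();
                child.label = key.substring(d);
                x.next.put(key.charAt(d), child);
                x = child;
                d = key.length();
            } else {
                int k = commonPrefix(child.label, key, d);
                if (k < child.label.length()) {    // key diverges inside the label: split the node
                    Node mid = new Node();
                    mid.label = child.label.substring(0, k);
                    child.label = child.label.substring(k);
                    mid.next.put(child.label.charAt(0), child);
                    x.next.put(mid.label.charAt(0), mid);
                    child = mid;
                }
                x = child;
                d += k;
            }
        }
        x.val = val;
    }

    public Integer get(String key) {
        Node x = root;
        int d = 0;
        while (d < key.length()) {
            Node child = x.next.get(key.charAt(d));
            if (child == null || !key.startsWith(child.label, d)) return null;
            d += child.label.length();
            x = child;
        }
        return x.val;
    }

    private static int commonPrefix(String label, String key, int d) {
        int k = 0;
        while (k < label.length() && d + k < key.length()
                && label.charAt(k) == key.charAt(d + k)) k++;
        return k;
    }

    // Hypothetical helper: number of nodes, including the root.
    public int nodeCount() { return nodeCount(root); }

    private int nodeCount(Node x) {
        int n = 1;
        for (Node child : x.next.values()) n += nodeCount(child);
        return n;
    }

    public static void main(String[] args) {
        RadixSketch st = new RadixSketch();
        st.put("shells", 1);
        st.put("shellfish", 2);
        System.out.println(st.get("shells") + " " + st.get("shellfish") + " " + st.get("shell"));
        System.out.println(st.nodeCount());   // root, shell, s, fish
    }
}
```

For these two keys the sketch builds four nodes (root, shell, s, fish), whereas a standard trie needs one node per distinct prefix.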

slide-87
SLIDE 87

Suffix tree

Suffix tree.

  • Patricia trie of suffixes of a string.
  • Linear-time construction: beyond this course.

[Figure: suffix tree for BANANAS.]

Applications.

  • Linear-time: longest repeated substring, longest common substring, longest palindromic substring, substring search, tandem repeats, ….
  • Computational biology databases (BLAST, FASTA).

slide-88
SLIDE 88

String symbol tables summary

A success story in algorithm design and analysis.

Red-black BST.

  • Performance guarantee: log N key compares.
  • Supports ordered symbol table API.

Hash tables.

  • Performance guarantee: constant number of probes.
  • Requires good hash function for key type.

Tries. R-way, TST.

  • Performance guarantee: log N characters accessed.
  • Supports character-based operations.

Bottom line. You can get at anything by examining 50-100 bits (!!!)