The Digital Tree: Analysis and Applications (Philippe Flajolet)



SLIDE 1

The Digital Tree: Analysis and Applications

Philippe Flajolet, INRIA Rocquencourt Séminaire de Probabilités, Paris June 2010

1 Tuesday, June 22, 2010

SLIDE 2

A (finite) tree associated with a (finite) set of words over an alphabet A. Equipped with a randomness model on words, we get a random tree, indexed by the number n of words. Characterize its probabilistic properties, mostly with COMPLEX ANALYSIS.

SLIDE 3
1. Digital Trees & Algorithms

SLIDE 4

Infinite tree:
word <--> branch
set of words <--> partial tree

SLIDE 5

DIGITAL TREE aka “TRIE” := STOP descent by pruning long one-way branches.
~ Only places corresponding to 2+ words (and their immediate descendants) are kept.
~ The digital tree is finite as soon as it is built out of distinct words.

E={a..., bba..., bbb...}

SLIDE 6

TOP-DOWN construction: Set E is separated into Ea,...,Ez according to initial letter; continue with next letter...

INCREMENTAL construction: start with the empty tree and insert elements of E one after the other... (Split leaves as the need arises.)

E={a..., bba..., bbb...}
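A minimal sketch of the INCREMENTAL construction in Python. The `Node` class and the helpers `insert` and `internal_nodes` are names chosen here for illustration and do not come from the slides. It assumes the words are distinct and that no word is a prefix of another (the slides work with infinite words; finite truncations stand in for them).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """A place in the trie: a leaf holding one word, or an internal node."""
    word: Optional[str] = None                      # set only on leaves
    children: Dict[str, "Node"] = field(default_factory=dict)


def insert(root: Optional[Node], word: str, depth: int = 0) -> Node:
    """Insert `word` below `root` (which sits at `depth`); return the new root."""
    if root is None:                                # empty place: create a leaf
        return Node(word=word)
    if root.word is not None:                       # leaf: split it
        old = root.word
        if old == word:
            return root                             # duplicates are ignored
        root = Node()                               # the leaf becomes internal
        root.children[old[depth]] = Node(word=old)
    letter = word[depth]
    root.children[letter] = insert(root.children.get(letter), word, depth + 1)
    return root


def internal_nodes(node: Optional[Node]) -> int:
    """Size of the trie, counted in internal nodes."""
    if node is None or node.word is not None:
        return 0
    return 1 + sum(internal_nodes(child) for child in node.children.values())


# Hypothetical truncations of E = {a..., bba..., bbb...}
root = None
for w in ["abab", "bbab", "bbba"]:
    root = insert(root, w)
print(internal_nodes(root))
```

The two words starting with `bb` force two successive leaf splits, which is the one-way descent that stops as soon as the words are separated.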

SLIDE 7

SUMMARY:

Source models: memoryless (Bernoulli) with probabilities p, q; Markov; continued fractions (CF)

SLIDE 8

Algorithms: 1 - Dictionaries

“TRIE” = tree + retrieval (Fredkin, de la Briandais ~1960)

Manage dictionaries dynamically; hope for O(log n) depth?
Save space by “factoring” common prefixes; hope for O(n) size?
However, the worst case is unbounded...

Analysis?

SLIDE 9

A random trie on n = 500 uniform binary sequences; size = 741 internal nodes; height = 18.
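An experiment of this kind can be sketched in a few lines of Python using the TOP-DOWN construction. The function name `trie_stats`, the 64-bit truncation and the seed are assumptions made here for illustration; a run should give a size and a height of the same order as the figures quoted above, not the same values.

```python
import random


def trie_stats(words, depth=0):
    """Top-down construction on bit strings: return (internal nodes, height)."""
    if len(words) <= 1:
        # Leaf (or empty place) at this depth. An empty place never decides
        # the maximum, because its sibling then holds 2+ words and goes deeper.
        return 0, depth
    left = [w for w in words if w[depth] == "0"]
    right = [w for w in words if w[depth] == "1"]
    size_l, height_l = trie_stats(left, depth + 1)
    size_r, height_r = trie_stats(right, depth + 1)
    return 1 + size_l + size_r, max(height_l, height_r)


rng = random.Random(2010)                      # arbitrary seed
# Distinct 64-bit strings stand in for infinite uniform binary sequences.
words = list({format(rng.getrandbits(64), "064b") for _ in range(500)})
size, height = trie_stats(words)
print(size, height)
```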

SLIDE 10

Algorithms: 2 - Hashing

Data may be highly structured and share long prefixes. Use a transformation h: W -> W’ called “hashing” (akin to random number generators). Uniform binary data are meaningful!

Analysis?

SLIDE 11

Algorithms: 3 - Paging

Data may be accessible by blocks, e.g., pages on disc. Stop recursion as soon as “b” elements are isolated (standard: b=1). Combine with hashing = get index structure.

Analysis?

Index Pages ......
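A small sketch of paging combined with hashing, in Python. The names `hashed_bits` and `build_pages`, the choice of SHA-256 as the hashing transformation and the capacity b = 4 are assumptions made here for illustration. Keys are assumed distinct. Empty pages are kept in the result so that the index is a complete prefix tree.

```python
import hashlib


def hashed_bits(key: str) -> str:
    """Map a key to a string of 256 characters '0'/'1' via SHA-256."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return "".join(format(byte, "08b") for byte in digest)


def build_pages(keys, b=4, prefix=""):
    """Return {bit prefix: keys on that page}; stop once at most b keys remain."""
    if len(keys) <= b:
        return {prefix: list(keys)}
    depth = len(prefix)
    pages = {}
    for bit in "01":
        subset = [k for k in keys if hashed_bits(k)[depth] == bit]
        pages.update(build_pages(subset, b, prefix + bit))
    return pages


keys = [f"user{i}" for i in range(100)]        # structured keys, long common prefix
pages = build_pages(keys, b=4)
print(len(pages), max(len(p) for p in pages.values()))
```

The keys share the prefix `user`, yet the hashed bits separate them quickly; the dictionary keys play the role of the index and the lists the role of the pages.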

SLIDE 12

Algorithms: 4 - MultiDim

Data may be multidimensional & numeric/geometric.

[Figure: quad-trie]

Analysis?

SLIDE 13

Algorithms: 5 - Communication

Data may be distributed and accessible only via a common channel (network). Everybody speaks at the same time; if noise, then SPLIT according to individual coin flips.

[Figure: tree protocol, with node labels B, A, C, AC, ABC, AC; the leader is marked]

Analysis?
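The splitting rule can be simulated directly. This is a sketch in Python of one common formulation of the tree protocol (one group speaks first, then the other); the function name `tree_protocol` and the slot-counting convention (idle, success and collision slots each count as one) are assumptions made here.

```python
import random


def tree_protocol(stations, rng):
    """Resolve a collision; return (order of successful senders, slots used)."""
    if len(stations) == 0:
        return [], 1                     # idle slot
    if len(stations) == 1:
        return list(stations), 1         # successful transmission
    # Collision (noise): every station flips a coin and the group splits.
    heads, tails = [], []
    for s in stations:
        (heads if rng.random() < 0.5 else tails).append(s)
    order_h, slots_h = tree_protocol(heads, rng)
    order_t, slots_t = tree_protocol(tails, rng)
    return order_h + order_t, 1 + slots_h + slots_t


order, slots = tree_protocol(["A", "B", "C"], random.Random(1))
print(order, slots)
```

Each slot corresponds to one node of a random binary trie built from the coin flips, so the number of slots is a trie size parameter.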

SLIDE 14
2. Expectations

Bernoulli vs Poisson models
Mellin technology
Fluctuations and error terms

SLIDE 15

(Proof in a “modernized” version follows....)

[Plot: $S_n$ against $n$]

SLIDE 16

Algebra...

[Figure labels: p, q]

SLIDE 17

Algebra...

SLIDE 18

With $S_n$ the expected tree size when the tree contains $n$ elements and $S(x)$ the Poisson expectation:

\[ S(x) = \sum_{n \ge 0} S_n \, e^{-x} \frac{x^n}{n!}. \]

The Poisson expectation $S(x)$ is like a generating function of $\{S_n\}$. Go back (“depoissonize”) by Taylor expansion. E.g., for $p = q = \tfrac{1}{2}$:

\[ S_n = \sum_{k} 2^k \left[ 1 - \left(1 - \frac{1}{2^k}\right)^{n} - \frac{n}{2^k} \left(1 - \frac{1}{2^k}\right)^{n-1} \right]. \]

Many variants are possible and one can justify that $S_n = S(x) + \text{small}$ when $x = n$.

(elementary)
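The closeness of the fixed-n and Poisson expectations can be checked numerically for p = q = 1/2. The sketch below is in Python; the function names, the truncation `kmax` and the use of `log1p`/`expm1` to limit cancellation are choices made here. The Poisson form used is $S(x) = \sum_{k \ge 0} 2^k g(x/2^k)$ with $g(y) = 1 - (1+y)e^{-y}$, which comes from replacing the binomial occupancy of each place at level k by a Poisson one.

```python
import math


def expected_size_fixed_n(n, kmax=80):
    """S_n = sum_k 2^k [1 - (1-2^-k)^n - n 2^-k (1-2^-k)^(n-1)], p = q = 1/2."""
    if n < 2:
        return 0.0
    total = 1.0                                   # k = 0: the root, present when n >= 2
    for k in range(1, kmax + 1):
        a = 2.0 ** -k
        u = (n - 1) * math.log1p(-a)              # log of (1 - a)^(n-1)
        # 1 - (1-a)^n - n a (1-a)^(n-1) = 1 - (1-a)^(n-1) * (1 + (n-1) a)
        term = -math.expm1(u) - math.exp(u) * (n - 1) * a
        total += 2.0 ** k * term
    return total


def expected_size_poisson(x, kmax=80):
    """S(x) = sum_{k>=0} 2^k g(x / 2^k), with g(y) = 1 - (1 + y) e^{-y}."""
    total = 0.0
    for k in range(kmax + 1):
        y = x / 2.0 ** k
        total += 2.0 ** k * (-math.expm1(-y) - y * math.exp(-y))
    return total


n = 500
print(expected_size_fixed_n(n), expected_size_poisson(float(n)), n / math.log(2))
```

The three printed values should be close to one another, which is the content of "S_n = S(x) + small when x = n" together with the first-order estimate n / log 2.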

SLIDE 19

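The Mellin transform

\[ f(x) \;\xrightarrow{\;\mathcal{M}\;}\; f^\star(s) := \int_0^\infty f(x)\, x^{s-1}\, dx \]

(It exists in strips of $\mathbb{C}$ determined by growth of $f(x)$ at $0$, $+\infty$.)

Property 1. Factors harmonic sums:

\[ \sum_{(\lambda,\mu)} \lambda f(\mu x) \;\xrightarrow{\;\mathcal{M}\;}\; \Bigl( \sum_{(\lambda,\mu)} \lambda \mu^{-s} \Bigr) \cdot f^\star(s). \]

Property 2. Maps asymptotics of $f$ on singularities of $f^\star$:

\[ f^\star \approx \frac{1}{(s - s_0)^m} \;\Longrightarrow\; f(x) \approx x^{-s_0} (\log x)^{m-1}. \]

Proof of P2 is from Mellin inversion + residues:

\[ f(x) = \frac{1}{2i\pi} \int_{c - i\infty}^{c + i\infty} f^\star(s)\, x^{-s}\, ds. \]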

Analysis...

SLIDE 20

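Mellin and Tries

$p = q = 1/2$: $S(x) = \sum_k 2^k g(x/2^k)$, with $g(x) = 1 - (1 + x)e^{-x}$.

Harmonic sum property:

\[ S^\star(s) = \Bigl( \sum_k 2^k 2^{ks} \Bigr) \cdot \bigl( -(s + 1)\Gamma(s) \bigr) = -\frac{(s + 1)\Gamma(s)}{1 - 2^{1+s}}. \]

Mapping properties: $S^\star$ exists in $-2 < \Re(s) < -1$. Poles at $s_k = -1 + 2ik\pi/\log 2$, for $k \in \mathbb{Z}$.

Location of pole ($s_0$) -> Asymptotics of $f(x) \approx x^{-s_0}$
$s_0 = \sigma + i\tau$ -> $x^{-\sigma} e^{-i\tau \log x}$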
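A worked step for the dominant pole, as a sketch under this slide's definitions (the residue computation is filled in here and is not part of the original slides). It uses $g^\star(s) = -(s+1)\Gamma(s)$ for $-2 < \Re(s) < 0$, and the fact that shifting the inversion contour to the right across a pole contributes minus the residue of $S^\star(s)\,x^{-s}$.

\[
\begin{aligned}
S^\star(s) &= -\frac{(s+1)\Gamma(s)}{1 - 2^{1+s}}, \\
(s+1)\Gamma(s) &\to -1, \qquad 1 - 2^{1+s} \sim -(s+1)\log 2 \qquad (s \to -1), \\
S^\star(s) &\sim -\frac{1}{\log 2} \cdot \frac{1}{s+1}
\;\Longrightarrow\; S(x) \approx \frac{x}{\log 2}.
\end{aligned}
\]

The poles $s_k$ with $k \neq 0$ add terms of the form $c_k\, x\, e^{-2ik\pi \log_2 x}$, that is, $x$ times a periodic function of $\log_2 x$. For $n = 500$ the main term gives $500/\log 2 \approx 721$, to be compared with the 741 internal nodes of the random instance shown earlier.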

SLIDE 21

SLIDE 22

Memoryless sources (I)

Correspond to $p \neq q$. Dirichlet series is $\dfrac{1}{1 - p^{-s} - q^{-s}}$.

Theorem (Knuth 1973; Fayolle, F., Hofri 1986, ...) Let $H := p \log p^{-1} + q \log q^{-1}$ be the entropy.

  • In the periodic case, $\frac{\log p}{\log q} \in \mathbb{Q}$, there are fluctuations in $S_n$.
  • In the aperiodic case, $\frac{\log p}{\log q} \notin \mathbb{Q}$: $S_n \sim \frac{n}{H}$ and $D_n \sim \frac{1}{H} \log n$.

Philippe Robert & Hanene Mohamed relate this to the periodic/aperiodic dichotomy of renewal theory (2005+).
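The first-order estimate of the size, n/H, can be compared with exact expectations computed from the root decomposition: with n >= 2 words the root is an internal node, and a Binomial(n, p) number of words goes to one side. The sketch below is in Python; the function names and the sample value p = 0.3 are chosen here for illustration, and the recurrence is the standard one for internal nodes rather than anything quoted from the slides.

```python
import math


def expected_sizes(n_max, p):
    """Exact S_0..S_n_max for a memoryless source with probabilities p, q = 1 - p."""
    q = 1.0 - p
    S = [0.0] * (n_max + 1)                       # S_0 = S_1 = 0
    for n in range(2, n_max + 1):
        acc = 1.0                                 # the root is an internal node
        for k in range(1, n):
            w = math.comb(n, k) * p ** k * q ** (n - k)
            acc += w * (S[k] + S[n - k])
        # The k = 0 and k = n terms contain S_n itself; solve for S_n.
        S[n] = acc / (1.0 - p ** n - q ** n)
    return S


def entropy(p):
    """H = p log(1/p) + q log(1/q), natural logarithms."""
    q = 1.0 - p
    return p * math.log(1.0 / p) + q * math.log(1.0 / q)


S = expected_sizes(300, 0.3)
print(S[300], 300 / entropy(0.3))
```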

SLIDE 23

[Lapidus & van Frankenhuijsen 2006]

($\pi$, $e$, $\tan(1)$, $\log 2$, $\zeta(3)$, ...)

SLIDE 24
3. Distributions

Analytic depoissonization & Saddle-points
Gaussian laws
...

SLIDE 25

[Figure label: $2^h$] = Throw n balls into $2^h$ buckets, each of capacity b
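For the standard case b = 1 with uniform binary words, the balls-in-buckets view gives the height distribution in closed form: the height is at most h exactly when the n prefixes of length h are all distinct. A sketch in Python follows; the function name is chosen here, and the general capacity b is not covered.

```python
import math


def prob_height_at_most(n, h):
    """P(height <= h) for b = 1: n balls fall into n distinct buckets among 2^h."""
    buckets = 2 ** h
    if n > buckets:
        return 0.0
    # Product of (1 - j / 2^h) for j = 0..n-1, accumulated in log scale.
    log_prob = sum(math.log1p(-j / buckets) for j in range(n))
    return math.exp(log_prob)


for h in range(14, 23):
    print(h, round(prob_height_at_most(500, h), 3))
```

For n = 500 the probability rises through the middle of its range around h = 17 to 19, which agrees with the height 18 of the random instance shown earlier.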

SLIDE 26

$E[2^H]$ ->

SLIDE 27

[2001]

SLIDE 28

DISTRIBUTIONS: size, depth, and path-length

SLIDE 29

Start with bivariate generating function F(z,u).
Analyse log ...
Analyse perturbation near u=1.
Use analytic depoissonization.
Conclude by continuity theorem for characteristic functions.

(case of size, p=q=1/2) (p=q=1/2)

SLIDE 30

Profile of tries, after Szpankowski et al. + Cesaratto-Vallée 2010+

SLIDE 31
4. General sources

Comparing and sorting real numbers
Continued fractions
Fundamental intervals...

SLIDE 32

Comparing numbers & sorting by continued fractions

\[ \operatorname{sign}\left( \frac{a}{b} - \frac{c}{d} \right) = \operatorname{sign}(ad - bc). \]

Requires double precision and/or is unstable with floats.

(Computational geometry, Knuth’s Metafont, ...)

Hakmem Algorithm (Gosper, 1972)

\[ \frac{36}{113} = \cfrac{1}{3 + \cfrac{1}{7 + \cfrac{1}{5}}}, \qquad \frac{113}{355} = \cfrac{1}{3 + \cfrac{1}{7 + \cfrac{1}{16}}}. \]

Theorem (Clément, F., Vallée 2000+) Sorting with continued fractions: mean path length of trie is

\[ K_0\, n \log n + K_1 n + Q(n) + K_2 + o(1), \]
\[ K_0 = \frac{6 \log 2}{\pi^2}, \qquad K_1 = \frac{18 \gamma \log 2}{\pi^2} + \frac{9 (\log 2)^2}{\pi^2} - \frac{72 \log 2\, \zeta'(2)}{\pi^4} - \frac{1}{2}, \]

and $Q(n) \approx n^{1/4}$ is equivalent to the Riemann Hypothesis.
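The comparison by continued fractions can be sketched as follows in Python. This is an illustration in the spirit of the Hakmem algorithm, not a transcription of it: the function name `cf_compare` and the handling of terminating expansions are choices made here. It assumes non-negative numerators and positive denominators, and it only ever manipulates integers no larger than the inputs.

```python
import random


def cf_compare(a, b, c, d):
    """Return -1, 0 or 1 according to a/b <, == or > c/d (a, c >= 0; b, d > 0)."""
    sign = 1
    while True:
        qa, ra = divmod(a, b)            # next partial quotient of a/b
        qc, rc = divmod(c, d)            # next partial quotient of c/d
        if qa != qc:
            return sign if qa > qc else -sign
        if ra == 0 or rc == 0:           # at least one expansion terminates
            if ra == rc:
                return 0
            return sign if ra > 0 else -sign
        # Equal integer parts: compare ra/b with rc/d, i.e. b/ra with d/rc
        # in the reversed order.
        a, b, c, d = b, ra, d, rc
        sign = -sign


print(cf_compare(36, 113, 113, 355))     # 36/113 = [0; 3, 7, 5], 113/355 = [0; 3, 7, 16]
```

The two fractions from the slide agree on the partial quotients 0, 3, 7 and are separated at the fourth one, which is the trie view of the comparison: the cost is the length of the common prefix of the two expansions.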

SLIDE 33

View source model in terms of fundamental intervals: $w \to p_w$. Revisit the analysis of tries (e.g., size).

Mellinize:

(0) (1) [Vallée 1997++]

SLIDE 34

For expanding maps T, fundamental intervals are generated by a transfer operator. For binary system (+Markov) and continued fractions, simplifications occur.

Vallée 1997-2001, Baladi-Vallée 2005+, ...

SLIDE 35

...and Nörlund integrals complete the job!

Poisson + Mellin = Newton -> Nörlund = fixed-n model

Q.E.D.

cf. [F., Sedgewick 1995]

SLIDE 36
5. Other trie algorithms

Leader election
The tree communication protocol
“Patricia” trees
Data compression: Lempel-Ziv...
Probabilistic counting
Quicksort is O(n (log n)^2)...

SLIDE 37

[Figure: tree with node labels B, A, C, AC, ABC, AC; the leader is marked]

Leader election = leftmost boundary of a random trie (1/2, 1/2). Proof: tree decompositions + Mellin...
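A simulation sketch of leader election in Python, under one common formulation that is assumed here: in each round every remaining contender flips a fair coin, those who flip 0 survive, and if nobody flips 0 the round eliminates no one. The function name `elect_leader` is chosen for illustration, and the participant list is assumed non-empty.

```python
import random


def elect_leader(participants, rng):
    """Return (leader, rounds): each round keeps the contenders who flip 0."""
    contenders = list(participants)
    rounds = 0
    while len(contenders) > 1:
        rounds += 1
        zeros = [c for c in contenders if rng.random() < 0.5]
        if zeros:                 # non-empty left subtree: descend into it
            contenders = zeros
        # Otherwise everyone flipped 1 and nobody is eliminated this round.
    return contenders[0], rounds


leader, rounds = elect_leader(["A", "B", "C"], random.Random(0))
print(leader, rounds)
```

The sequence of surviving groups follows the leftmost boundary of the random trie built from the coin flips, and the number of rounds is the length of that boundary.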

SLIDE 38

[Figure: tree with node labels B, A, C, AC, ABC, AC]

tree protocol = trie with arrivals

(non-commutative iteration semigroup)

SLIDE 39
  • 0.249999999999999999999999999999999999999999999999999

999999999999999999999999999999999999999999999999999999 999999999999999999999999999999999999999999999999999999 9999999999999999999999999999999999999999999999999998211

($= -1/2 + 10^{-211}$: there are 208 consecutive nines) !!

A curiosity (cf. Mellin)
