LIPN, November 2006
Probabilistic Counting: Between Mathematics and Computer Science
Philippe Flajolet, INRIA, Rocquencourt
http://algo.inria.fr/flajolet
Routers operate in the range of terabits per second (10^14 b/s). Google indexes 6 billion pages and prepares to index 100 petabytes of data (10^17 B). What about content?
Message: a few key characteristics can be obtained QUICKLY and EASILY. Combinatorics + algorithms + probabilities + analysis are useful!
From Estan–Varghese–Fisk: traces of attacks. Need the number of active connections in time slices. Incoming/outgoing flows at 40 Gbit/s. Code Red worm: 0.5 GB of compressed data per hour (2001). CISCO: in 11 minutes, a worm infected 500,000,000 machines.
The situation is like listening to a play of Shakespeare and, at the end, estimating the number of different words. Rules: very little computation per element scanned, very little auxiliary memory.
From Durand–Flajolet, LogLog Counting (ESA 2003): whole of Shakespeare, m = 256 small “bytes” of 4 bits each = 128 bytes:
ghfffghfghgghggggghghheehfhfhhgghghghhfgffffhhhiigfhhffgfiihfhhh
igigighfgihfffghigihghigfhhgeegeghgghhhgghhfhidiigihighihehhhfgg
hfgighigffghdieghhhggghhfghhfiiheffghghihifgggffihgihfggighgiiif
fjgfgjhhjiifhjgehgghfhhfhjhiggghghihigghhihihgiighgfhlgjfgjjjmfl
Estimate n° ≈ 30,897 vs n = 28,239 distinct words. Error: +9.4% with 128 bytes!
Uses:
— Routers: intrusion detection, flow monitoring & control.
— Databases: query optimization, cf. M ∪ M′ for multisets; estimating the size of queries & “sketches”.
— Statistics gathering: on the fly, fast, and with little memory, even on “unclean” data ≃ layer 0 of “data mining”.
This talk:
• Estimating characteristics of large data streams — sampling; size & cardinality & nonuniformity index [F_1, F_0, F_2] ❀ power of randomization via hashing. ⋄ Gains by a factor of > 400 [Palmer et al.]
• Analysis of algorithms — generating functions, complex asymptotics, Mellin transforms. ⋄ Nice problems for theoreticians.
• Theory and practice — interplay of analysis and design ❀ super-optimized algorithms.
Problems on Streams
Given: a large stream S = (r_1, r_2, …, r_ℓ) with duplicates.
— ||S|| = length or size: total # of records (ℓ);
— |S| = cardinality: # of distinct records (c).
♦ How to estimate size, cardinality, etc.? More generally, if f_v is the frequency of value v:
F_p := Σ_{v ∈ D} (f_v)^p.
Cardinality is F_0; size is F_1; F_2 is an indicator of nonuniformity of the distribution; “F_∞” is the most frequent element. [Alon, Matias, Szegedy, STOC 1996]
♦ How to sample? — with or without multiplicity.
♦ How to find icebergs, mice, elephants?
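As a baseline, the frequency moments can be computed exactly, with memory proportional to the cardinality — exactly the cost the streaming algorithms below avoid. A minimal Python sketch, with names chosen for this note:

```python
from collections import Counter

def frequency_moment(stream, p):
    """Exact F_p = sum of (f_v)^p over the distinct values v of the stream."""
    freqs = Counter(stream)               # f_v for every distinct value v
    return sum(f ** p for f in freqs.values())

stream = ["a", "b", "a", "c", "a", "b"]
f0 = frequency_moment(stream, 0)          # cardinality |S|
f1 = frequency_moment(stream, 1)          # size ||S||
f2 = frequency_moment(stream, 2)          # nonuniformity index: 3^2 + 2^2 + 1^2
```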
1. ICEBERGS — COMBINATORICS HELPS!
Definition: A k-iceberg is present in proportion > 1/k.
One-pass detection of icebergs for k = 2 using 1 register is possible:
— Trigger a gang war: equip each individual with a gun.
— Each guy shoots a guy from a different gang, then commits suicide: the majority gang survives.
— Implement sequentially & adapt to k ≥ 2 with k − 1 registers. [Karp et al. 2003]
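The sequential implementation hinted at above (k = 2 is the classical majority algorithm; general k uses k − 1 registers, as in [Karp et al. 2003]) can be sketched in Python; `heavy_hitters` is a name invented here:

```python
def heavy_hitters(stream, k):
    """Return a superset of the k-icebergs (values in proportion > 1/k),
    in one pass, using at most k - 1 (value, counter) registers."""
    registers = {}
    for x in stream:
        if x in registers:
            registers[x] += 1
        elif len(registers) < k - 1:
            registers[x] = 1
        else:
            # the "gang war" step: one shot against each rival gang, then suicide
            for v in list(registers):
                registers[v] -= 1
                if registers[v] == 0:
                    del registers[v]
    return set(registers)
```

Survivors are only candidates: a second verification pass over the stream is needed to weed out false positives.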
2. APPROXIMATE COUNTING — THE DYADIC PARADIGM
How to find an integer while posing few questions?
— Ask whether it lies in [1–2], [2–4], [4–8], [8–16], etc.
— Conclude by binary search: cost is 2 log_2 n.
A paradigm for unbounded search:
• Ethernet proceeds by period doubling + randomization.
• Wake-up process for mobile communication [OCAD: Lavault+].
• Adaptive data structures: e.g., extendible hashing tables.
♥ Approximate Counting
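The paradigm transcribes directly: bracket the unknown n by doubling, then binary-search inside the bracket, for about 2 log_2 n questions in total. A Python sketch; the oracle interface is an assumption made for this note:

```python
def unbounded_find(leq):
    """Locate an unknown integer n >= 1 via an oracle leq(q) answering 'n <= q?'."""
    hi = 1
    while not leq(hi):            # dyadic phase: ask about 1, 2, 4, 8, ...
        hi *= 2
    lo = hi // 2 + 1
    while lo < hi:                # binary search inside the bracket [hi/2 + 1, hi]
        mid = (lo + hi) // 2
        if leq(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```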
Approximate counting — probabilities help!
The oldest algorithm [Morris, CACM 1977]; analysis in [F., 1985]. Maintain F_1, i.e., a counter subject to C := C + 1.
[Figure: Markov chain on counter values, with transition probabilities 1/2, 1/4, 1/8, … and complementary self-loops.]
ALG: Approximate Counting
  Initialize: X := 1;
  Increment: do X := X + 1 with probability 2^{−X};
  Output: 2^X − 2.
Theorem: Count till n probabilistically using log_2 log n + δ bits, with accuracy about 0.59 · 2^{−δ/2}.
Beats information theory(!?): 8 bits for counts ≤ 2^16 with accuracy ≈ 15%.
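The ALG above simulates directly; the averaging experiment below (written for this note) illustrates that 2^X − 2 is an unbiased estimator of n:

```python
import random

def morris_counter(n, rng):
    """Run Morris's approximate counter over n increments;
    X needs only about log2 log2 n bits of storage."""
    X = 1
    for _ in range(n):
        if rng.random() < 2.0 ** (-X):   # increment with probability 2^-X
            X += 1
    return X

rng = random.Random(42)
n = 1000
estimates = [2 ** morris_counter(n, rng) - 2 for _ in range(2000)]
avg = sum(estimates) / len(estimates)    # should hover near n
```

A single run is noisy (relative error ≈ 0.59 per the theorem with δ = 0); only the average over many runs concentrates near n.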
[Figure: 10 runs of APCO — value of X over a stream of n = 10^3 elements.]
Methodology
[Figure: chain of states with self-loops a_1, a_2, a_3 and forward transitions b_1, b_2, b_3.]
♥ Paths in graphs ↦ generating functions: (f_n) ↦ f(z) := Σ_n f_n z^n.
Here, symbolically describe all paths as (a_1)^⋆ b_1 (a_2)^⋆ b_2 (a_3)^⋆, since (f)^⋆ ≃ 1/(1 − f) = 1 + f + f^2 + ···.
Perform the probabilistic valuation a_j ↦ (1 − q^j) z, b_j ↦ q^j z:
H_3(z) = q^{1+2} z^2 / [(1 − (1 − q)z)(1 − (1 − q^2)z)(1 − (1 − q^3)z)].
♥ [Prodinger ’94] Euler transform ξ := z/(1 − z):
z H_k(z) = q^{k(k−1)/2} ξ^{k−1} / [(1 − ξq) ··· (1 − ξq^k)].
Exact moments of X and of the estimator 2^X via Heine’s transformation of q-calculus: the mean is unbiased, the variance ❀ 0.59.
♥ Partial fraction expansions ❀ asymptotic distribution = quantify typical behaviour + risk! (Exponential tails ≫ Chebyshev’s inequality.) We have P_n(X = ℓ) ∼ φ(n/2^ℓ), where, with (q)_j := (1 − q) ··· (1 − q^j),
φ(x) := (1/(q)_∞) Σ_{j ≥ 0} [(−1)^j q^{j(j−1)/2} / (q)_j] e^{−x q^{−j}}.
[Figure: plot of φ(x) on 0 ≤ x ≤ 5; unimodal, with peak ≈ 0.4.]
♣ Fluctuations: …, n/2^L, …, n/4, … depend on L = ⌊log_2 n⌋; cf. Szpankowski, Mahmoud, Fill, Prodinger, …
Analyse storage utilization via the Mellin transform.
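As a numerical sanity check (an illustration added for this note, with q = 1/2 as in the algorithm), the limit probabilities φ(n/2^ℓ) do sum to ≈ 1 over ℓ:

```python
import math

q = 0.5

def qpoch(j):
    """(q)_j = (1 - q)(1 - q^2)...(1 - q^j); empty product 1.0 for j = 0."""
    p = 1.0
    for i in range(1, j + 1):
        p *= 1.0 - q ** i
    return p

Q_INF = qpoch(60)   # (q)_infinity, converged to double precision

def phi(x):
    """phi(x) = (1/(q)_inf) * sum_j (-1)^j q^(j(j-1)/2)/(q)_j * exp(-x q^-j)."""
    s = 0.0
    for j in range(25):          # q^(j(j-1)/2) decays super-exponentially
        s += (-1) ** j * q ** (j * (j - 1) // 2) / qpoch(j) \
             * math.exp(-x * q ** (-j))
    return s / Q_INF

n = 1000.0
total = sum(phi(n / 2 ** l) for l in range(1, 60))   # ≈ 1, up to tiny fluctuations
```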
Approximate Counting
[Figure: E(X) − log_2(n) for 200 ≤ n ≤ 1000: Mean(X) − log_2 n oscillates around ≈ −0.27395 with tiny fluctuations.]
The Mellin transform [F., Régnier, Sedgewick 1985]; [Flajolet, Gourdon, Dumas 1995]:
f^⋆(s) := ∫_0^∞ f(x) x^{s−1} dx.
Mapping properties (complex analysis): Asympt(f) ← Singularities(f^⋆).
♣ Dyadic superpositions of models: F(x) = Σ_ℓ φ(x/2^ℓ) ❀ F^⋆(s) = φ^⋆(s) / (1 − 2^s)
⇒ standard asymptotic terms + (small) fluctuations.
Cultural flashes
— Morris [1977]: counting a large number of events in small memory.
— The power of probabilistic machines & approximation [Freivalds, IFIP 1977].
— The TCP protocol: Additive Increase, Multiplicative Decrease (AIMD) leads to similar functions [Robert et al., 2001].
— Probability theory: exponentials of Poisson processes [Yor et al., 2001].
Randomization and hashing
Theme: randomization is a major algorithmic paradigm.
• Cryptography (implementation, attacks).
• Combinatorial optimization (smoothing, random rounding).
• Hashing and direct-access methods:
— produce (seemingly) uniform data from actual data;
— provide reproducible chance.
Can get random bits from nonrandom data — it works fine!
“To be or not to be…” ❀ justifies the model: Angel & Daemon.
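A toy illustration of the idea, written for this note: hashing the words of the Shakespeare line (through SHA-256, a choice made here) extracts uniform-looking yet perfectly reproducible bits.

```python
import hashlib

def hash_bits(words):
    """One bit per word: hash each word and keep the low-order bit.
    Nonrandom English text yields uniform-looking, reproducible bits."""
    return [int(hashlib.sha256(w.encode()).hexdigest(), 16) & 1 for w in words]

speech = "to be or not to be that is the question".split()
bits = hash_bits(speech)
```

This is “reproducible chance”: the bits look random, but rerunning the hash on the same data gives the same bits.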
3. COUPON COLLECTOR COUNTING
Let a flow of people enter a room.
— Birthday paradox: it takes on average about 23 arrivals to get a birthday collision.
— Coupon collector: after 365 persons have entered, expect a partial collection of ∼ 231 different days of the year (≈ n(1 − e^{−1}) for n days); it would take more than 2364 arrivals to reach a full collection.
For n days, with B the first birthday collision and C the complete collection:
E_n(B) ∼ √(πn/2),   E_n(C) = n H_n ∼ n log n.
BP: Suppose we didn’t know the number N of days in the year, but could identify people with the same birthday. Could we estimate N?
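Both expectations are easy to check by simulation; a sketch written for this note:

```python
import random

def first_collision(n, rng):
    """Birthday paradox: arrivals until two people share one of n days."""
    seen, t = set(), 0
    while True:
        t += 1
        d = rng.randrange(n)
        if d in seen:
            return t
        seen.add(d)

def complete_collection(n, rng):
    """Coupon collector: arrivals until all n days have been seen."""
    seen, t = set(), 0
    while len(seen) < n:
        t += 1
        seen.add(rng.randrange(n))
    return t

rng = random.Random(1)
n = 365
avg_b = sum(first_collision(n, rng) for _ in range(2000)) / 2000    # ~ sqrt(pi*n/2) ≈ 24
avg_c = sum(complete_collection(n, rng) for _ in range(200)) / 200  # ~ n*H_n ≈ 2365
```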
Coupon Collector Counting
First counting algorithm: estimate cardinalities ≡ # of distinct elements. Motivated by query optimization in databases. [Whang+, ACM TODS 1990]
[Figure: x ↦ h(x) hashing into a bitmap T[1..m].]
ALG: Coupon Collector Counting
  Set up a table T[1..m] of m bit-cells;
  — for x in S do mark cell T[h(x)];
  Return −m log V, where V := fraction of empty cells.
The algorithm is independent of replications.
Let n be the sought cardinality; then α := n/m is the filling ratio. Expect V ≈ e^{−α} empty cells, by classical analysis of occupancy. The distribution is concentrated: invert!
Count cardinalities up to N_max using (1/10) N_max bits, for accuracy (standard error) ≈ 2%.
Tools: generating functions for occupancy; Stirling numbers; basic depoissonization.
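A direct implementation of the algorithm above (Python's built-in `hash` stands in for the random hash function h; the inputs here are already random, so this is only a sketch):

```python
import math
import random

def coupon_collector_count(stream, m):
    """Estimate cardinality with m bits: mark T[h(x)], return -m * log(V),
    V being the fraction of empty cells. Requires at least one empty cell."""
    T = [0] * m
    for x in stream:
        T[hash(x) % m] = 1
    V = T.count(0) / m          # expect V ~ e^{-n/m}; invert
    return -m * math.log(V)

rng = random.Random(7)
values = [rng.randrange(10 ** 9) for _ in range(5000)]
stream = values * 3             # duplicates do not affect the estimate
est = coupon_collector_count(stream, m=2 ** 14)
```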
4. SAMPLING
Classical sampling [Vitter, ACM TOMS 1985].
ALG: Reservoir Sampling (with multiplicities)
  Sample m elements from S = (s_1, …, s_N); [N unknown a priori]
  Maintain a cache (reservoir) of size m;
  — for each incoming s_{t+1}: place it in the cache with probability m/(t + 1), dropping a random element.
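The reservoir scheme in Python (a sketch; the empirical check below verifies that each stream element ends up in the sample with the same probability m/N):

```python
import random

def reservoir_sample(stream, m, rng):
    """Vitter's reservoir sampling: a uniform m-sample (with multiplicities)
    from a stream of unknown length, in one pass and O(m) memory."""
    cache = []
    for t, x in enumerate(stream):
        if t < m:
            cache.append(x)           # fill the reservoir first
        else:
            j = rng.randrange(t + 1)  # keep x with probability m/(t+1) ...
            if j < m:
                cache[j] = x          # ... overwriting a random cached element
    return cache

rng = random.Random(0)
counts = {v: 0 for v in range(10)}
for _ in range(10000):
    for v in reservoir_sample(range(10), 3, rng):
        counts[v] += 1
# each of the 10 values should appear ~ 10000 * 3/10 = 3000 times
```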
Can we sample values (i.e., without multiplicity)? Algorithm due to [Wegman, ca. 1984, unpublished], analysed by [F., 1990].
[Figure: hashed values filtered by prefix conditions h(x) = 0…, h(x) = 00…, at sampling depths d = 0, 1, 2, ….]
ALG: Adaptive Sampling (without multiplicities)
  Get a sample of m values from S. Set b := 4m (bucket capacity);
  — oversample by the adaptive method: keep the values x with h(x) = 0^d…, increasing the depth d each time the bucket overflows;
  — get a sample of m elements from the (b ≡ 4m)-bucket.
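A sketch of the adaptive oversampling step in Python (the hash is simulated by cached random 64-bit values, and the function name is invented here):

```python
import random

def adaptive_sample(stream, b, seed=0):
    """Adaptive Sampling sketch: keep the DISTINCT values whose hash has d
    low-order zero bits; on bucket overflow, deepen d and re-filter. Each
    distinct value ends up in the bucket with probability 2^-d."""
    rng = random.Random(seed)
    hcache = {}                         # value -> simulated 64-bit hash
    def h(x):
        if x not in hcache:
            hcache[x] = rng.getrandbits(64)
        return hcache[x]
    d = 0
    bucket = set()
    for x in stream:
        if h(x) % (1 << d) == 0:        # hash has d zero bits: candidate
            bucket.add(x)
            while len(bucket) > b:      # overflow: halve the bucket on average
                d += 1
                bucket = {y for y in bucket if h(y) % (1 << d) == 0}
    return bucket, d

sample, d = adaptive_sample([i % 500 for i in range(5000)], b=64)
```

The pair (sample, d) also yields the unbiased cardinality estimate |sample| · 2^d, which is the link to the counting problem of the earlier sections.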
Analysis. View the collection of records as a set of bitstrings. Digital tree aka trie, paged version:
  Trie(ω) ≡ ω  if card(ω) ≤ b;
  Trie(ω) = ⟨ •, Trie(ω \ 0), Trie(ω \ 1) ⟩  if card(ω) > b.
(Underlies dynamic and extendible hashing, paged data structures, etc.)
The depth in Adaptive Sampling is the length of the leftmost branch; the bucket size is the # of elements in the leftmost page.