Theory and Practice of (some) Probabilistic Counting Algorithms


  1. ICALP-2004, Turku, JUL 2004. Theory and Practice of (some) Probabilistic Counting Algorithms. Philippe Flajolet, INRIA, Rocquencourt. http://algo.inria.fr/flajolet

  2. From Estan-Varghese-Fisk: traces of attacks. Need the number of active connections in time slices. Incoming/outgoing flows at 40 Gbits/second. Code Red worm: 0.5 GBytes of compressed data per hour (2001). CISCO: in 11 minutes, a worm infected 500,000,000 machines.

  3. The situation is like listening to a play of Shakespeare and, at the end, estimating the number of different words. Rules: very little computation per element scanned, very little auxiliary memory. From Durand-Flajolet, LogLog Counting (ESA-2003): the whole of Shakespeare, m = 256 small "bytes" of 4 bits each = 128 bytes:
ghfffghfghgghggggghghheehfhfhhgghghghhfgffffhhhiigfhhffgfiihfhhh
igigighfgihfffghigihghigfhhgeegeghgghhhgghhfhidiigihighihehhhfgg
hfgighigffghdieghhhggghhfghhfiiheffghghihifgggffihgihfggighgiiif
fjgfgjhhjiifhjgehgghfhhfhjhiggghghihigghhihihgiighgfhlgjfgjjjmfl
Estimate n° ≈ 30,897 vs n = 28,239 distinct words. Error: +9.4% with 128 bytes!

  4. Uses: — Routers: intrusion detection, flow monitoring & control. — Databases: query optimization, cf. M ∪ M′ for multisets; estimating the size of queries & "sketches". — Statistics gathering: on the fly, fast, and with little memory, even on "unclean" data ≃ layer 0 of "data mining".

  5. This talk: • Estimating characteristics of large data streams — sampling; size & cardinality & nonuniformity index [F_1, F_0, F_2] ❀ power of randomization via hashing ⋄ gains by a factor of > 400 [Palmer et al.]. • Analysis of algorithms — generating functions, complex asymptotics, Mellin transforms ⋄ nice problems for theoreticians. • Theory and Practice — interplay of analysis and design ❀ super-optimized algorithms.

  6. 1 PROB. ALG. ON STREAMS. Given: a large stream S = (r_1, r_2, ..., r_ℓ) with duplicates. — |S| = length or size: total # of records (ℓ). — ||S|| = cardinality: # of distinct records (c). ♦ How to estimate size, cardinality, etc.? More generally, if f_v is the frequency of value v: F_p := Σ_{v ∈ D} (f_v)^p. Cardinality is F_0; size is F_1; F_2 is an indicator of nonuniformity of the distribution; "F_∞" is the most frequent element [Alon, Matias, Szegedy, STOC96]. ♦ How to sample? — with or without multiplicity.
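As a baseline, the frequency moments F_p can be computed exactly with a dictionary over all distinct values; the whole point of the talk is to approximate them in far less memory. A minimal Python sketch (names are illustrative, not from the slides):

```python
from collections import Counter

def frequency_moment(stream, p):
    """Exact F_p = sum over distinct values v of (f_v)^p, where f_v is
    the frequency of v. Memory is Theta(F_0), i.e. proportional to the
    number of distinct values: exactly what the probabilistic
    algorithms below avoid."""
    freqs = Counter(stream)
    if p == 0:
        return len(freqs)                       # F_0 = cardinality c
    return sum(f ** p for f in freqs.values())  # F_1 = size l, etc.

stream = ["a", "b", "a", "c", "a", "b"]
# F_0 = 3 distinct values; F_1 = 6 records; F_2 = 3^2 + 2^2 + 1^2 = 14
```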

  7. Angel/Daemon — The Model. Pragmatic assumptions / engineer's point of view: can get random bits from data. Works fine! (A1) There exists a "good" hash function h : D → B ≡ {0, 1}^L, from the data domain to bits. Typically L = 30–32 (more or less, maybe). h(x) := λ · ⟨x in base B⟩ mod p. Sometimes, also: (A2) There exists a "good" pseudo-random number generator T : B → B, s.t. the iterates T y_0, T^(2) y_0, T^(3) y_0, ... look random. [T(y) := (a · y mod p)]
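Assumption (A1) can be made concrete with a toy multiplicative hash in the spirit of h(x) := λ · ⟨x in base B⟩ mod p; the constants p, λ, and B below are illustrative choices, not the ones from the talk:

```python
L = 30                      # number of hash bits
p = (1 << 31) - 1           # a Mersenne prime, 2^31 - 1
lam = 0x5bd1e995            # arbitrary odd multiplier (illustrative)
B = 257                     # base in which the data string is read

def h(x: str) -> int:
    """Toy instance of (A1): h(x) = lam * <x in base B> mod p,
    truncated to L bits."""
    v = 0
    for ch in x:
        v = (v * B + ord(ch)) % p   # Horner evaluation of <x in base B>
    return (lam * v % p) & ((1 << L) - 1)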

  8. Two preparatory examples. Let a flow of people enter a room. — Birthday Paradox: it takes on average about 23 arrivals to get a birthday collision. — Coupon Collector: after 365 persons have entered, expect a partial collection of n(1 − e^{−1}) ≈ 231 different days of the year; it would take more than 2364 to reach a full collection. With B the time of the first birthday collision and C the time of a complete collection: E_n(B) ∼ √(πn/2), E_n(C) = n H_n ∼ n log n. Suppose we didn't know the number N of days in the year but could identify people with the same birthday. Could we estimate N?

  9. 1.1 Birthday paradox counting. • A warm-up "abstract" example due to Brassard-Bratley [Book 1996] = a Gedanken experiment. How to weigh an urn by shaking it? The urn contains an unknown number N of balls. ♠ Deterministic: empty it one by one; cost is O(N).

  10. ♥ Probabilistic O(√N): [shake, draw, paint]*; stop! ALG: Birthday Paradox Counting. Shake, pull out a ball, mark it with paint, and return it; repeat until an already-marked ball is drawn. Infer N from T = number of steps.

  11. We have E(T) ∼ √(πN/2) by the Birthday Paradox. • Invert and try X := (2/π) T². Estimate is biased; find E(T²) ∼ 2N and propose X := T²/2. •• Analyse the 2nd moment of BP: the estimate is now (asymptotically) unbiased. ••• Wonder about accuracy: Standard Error := (Std Deviation of estimate X) / (Exact value N) ❀ need to analyse the fourth moment E(T⁴). Do maths: E_N(T^{2r}) ∼ 2^r r! N^r, E_N(T^{2r+1}) ∼ (1 · 3 ··· (2r+1)) √(π/2) N^{r+1/2}. ⇒ E(T⁴) ∼ 8N². Standard error = 1 (i.e., 100%) ⇒ estimate typically in (0, 3N). [N = 10⁶]: 384k; 3,187k; 635k; 29k; 2,678k; 796k; 981k, ... •••• Improve the algorithm: repeat m times and average ❀ time cost O(m√N) for accuracy O(1/√m). Shows usefulness of maths: Ramanujan's Q(n) function, Laplace's method for sums or integrals (cf. Knuth, Vol. 1); singularity analysis...
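Slides 10–11 can be simulated directly. The sketch below (assuming a uniform random draw models the shaken urn) uses the asymptotically unbiased estimator X := T²/2 and the repeat-and-average improvement:

```python
import random

def birthday_estimate(N, rng):
    """One run of Birthday Paradox Counting: draw balls uniformly
    (with replacement) from an urn of N until a marked ball reappears;
    return X = T^2 / 2, asymptotically unbiased for N."""
    marked = set()
    t = 0
    while True:
        t += 1
        ball = rng.randrange(N)
        if ball in marked:
            return t * t / 2
        marked.add(ball)

def averaged_estimate(N, m, rng):
    """Repeat m times and average: standard error O(1/sqrt(m))."""
    return sum(birthday_estimate(N, rng) for _ in range(m)) / m

rng = random.Random(42)
est = averaged_estimate(10**4, 200, rng)   # should land near 10^4
```

A single run really does scatter over (0, 3N), as in the slide's sample values; only the averaging makes the estimate usable.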

  12. 1.2 Coupon Collector Counting. First Counting Algorithm: estimate cardinalities ≡ # of distinct elements. This is real CS, motivated by query optimization in databases [Whang et al., ACM TODS 1990]. ALG: Coupon Collector Counting. Given a multiset S = (s_1, ..., s_ℓ), estimate card(S). Set up a table T[1..m] of m bit-cells; — for x in S do mark cell T[h(x)]; — return −m log V, where V := fraction of empty cells. Simulates a hash table; the algorithm is independent of replications.

  13. Let n be the sought cardinality. Then α := n/m is the filling ratio. Expect V ≈ e^{−α} empty cells by classical analysis of occupancy. The distribution is concentrated. Invert! Count cardinalities till N_max using (1/10) N_max bits, for accuracy (standard error) ≈ 2%. Tools: generating functions for occupancy; Stirling numbers; basic depoissonization.
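A direct implementation of Coupon Collector Counting and its inversion of V ≈ e^{−α}, using CRC32 as a stand-in for the "good" hash function of assumption (A1):

```python
import math
import zlib

def cc_count(stream, m):
    """Coupon Collector Counting: hash each record into a table of m
    bit-cells and return -m * log(V), where V is the fraction of empty
    cells. Duplicates hit the same cell, so the result depends only on
    the set of distinct values."""
    table = [False] * m
    for x in stream:
        table[zlib.crc32(str(x).encode()) % m] = True
    empty = table.count(False)
    if empty == 0:
        raise ValueError("table saturated: increase m")
    return -m * math.log(empty / m)

# 5000 records but only 500 distinct values:
stream = [i % 500 for i in range(5000)]
est = cc_count(stream, 4096)    # close to the true cardinality 500
```

Note that replications are invisible by construction: the stream of 5000 records gives exactly the same marked cells as its 500 distinct values.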

  14. 2 SAMPLING. A very classical problem [Vitter, ACM TOMS 1985]. ALG: Reservoir Sampling (with multiplicities). Sample m elements from S = (s_1, ..., s_N) [N unknown a priori]. Maintain a cache (reservoir) of size m; — for each incoming s_{t+1}: place it in the cache with probability m/(t+1), evicting a uniformly random cached element.
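The reservoir scheme on this slide, as a minimal sketch (Vitter's skip-based optimization of the next slide is omitted):

```python
import random

def reservoir_sample(stream, m, rng):
    """Reservoir Sampling (with multiplicities): keep a cache of m
    records; record number t+1 enters with probability m/(t+1) and,
    if admitted, replaces a uniformly random cache entry. Every
    m-element subset of the stream ends up equally likely."""
    cache = []
    for t, s in enumerate(stream):
        if t < m:
            cache.append(s)              # fill the reservoir first
        elif rng.random() < m / (t + 1):
            cache[rng.randrange(m)] = s
    return cache

rng = random.Random(1)
sample = reservoir_sample(range(10**5), 10, rng)   # 10 uniform records
```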

  15. Math: need the analysis of skipping probabilities. Complexity of Vitter's best algorithm is O(m log N). Useful for building "sketches", order-preserving hash functions & data structures.

  16. Can we sample values (i.e., without multiplicity)? Algorithm due to [Wegman, ca 1984, unpub.], analysed by [F. 1990]. Keep a bucket of the values whose hash starts with 0^d, for depth d = 0, 1, 2, ... (h(x) = 0..., h(x) = 00..., etc.). ALG: Adaptive Sampling (without multiplicities). Get a sample of size m from S's values. Set b := 4m (bucket capacity); — oversample by the adaptive method; — get a sample of m elements from the (b ≡ 4m) bucket.
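A sketch of Adaptive Sampling, with CRC32 again as a stand-in hash and illustrative constants. The depth d grows until the bucket of values whose hash starts with 0^d fits in b = 4m; the cardinality estimator X := 2^d ξ of slide 19 then falls out for free:

```python
import zlib

L = 32   # hash bits

def hbits(x):
    """Stand-in for a good hash to L bits."""
    return zlib.crc32(str(x).encode()) & ((1 << L) - 1)

def adaptive_sample(stream, m):
    """Adaptive Sampling (without multiplicities): keep the values
    whose hash starts with d zero bits, deepening (d -> d+1) and
    re-filtering whenever the bucket exceeds b = 4m."""
    b = 4 * m
    d = 0
    bucket = set()
    for x in stream:
        if hbits(x) >> (L - d) == 0:     # hash begins with 0^d
            bucket.add(x)
            while len(bucket) > b:       # overflow: deepen and filter
                d += 1
                bucket = {y for y in bucket
                          if hbits(y) >> (L - d) == 0}
    return bucket, d

def cardinality_estimate(stream, m=64):
    """Slide 19's bonus: X = 2^d * xi is an unbiased cardinality
    estimate, with xi the final bucket size."""
    bucket, d = adaptive_sample(stream, m)
    return (1 << d) * len(bucket)
```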

  17. Analysis. View the collection of records as a set of bitstrings. Digital tree aka trie, paged version:
Trie(ω) ≡ ω if card(ω) ≤ b;
Trie(ω) = ⟨•, Trie(ω\0), Trie(ω\1)⟩ if card(ω) > b.
(Underlies dynamic and extendible hashing, paged data structures, etc.) Refs: [Knuth Vol 3], [Sedgewick, Algorithms], books by Mahmoud, Szpankowski. General analysis by [Clément-F-Vallée, Alg. 2001], etc. Depth in Adaptive Sampling is the length of the leftmost branch; bucket size is the # of elements in the leftmost page.

  18. For recursively defined parameters α[ω] = β[ω\0]:
E_n(α) := (1/2^n) Σ_{k=0}^{n} C(n,k) E_k(β).
Introduce exponential generating functions (EGF): A(z) := Σ_n E_n(α) z^n/n!, &c. Then A(z) = e^{z/2} B(z/2).
For a recursive parameter φ: Φ(z) = e^{z/2} Φ(z/2) + Init(z).
Solve by iteration, extract coefficients; Mellin-ize ❀ later!
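The step from the binomial recurrence to the EGF equation is a one-line computation (splitting the sum at k and setting j = n − k):

```latex
A(z) = \sum_{n\ge 0} \frac{z^n}{n!}\,\frac{1}{2^n}
       \sum_{k=0}^{n}\binom{n}{k} E_k(\beta)
     = \sum_{k\ge 0} E_k(\beta)\,\frac{(z/2)^k}{k!}
       \sum_{j\ge 0}\frac{(z/2)^j}{j!}
     = e^{z/2}\, B\!\left(\frac{z}{2}\right).
```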

  19. Bonus: Second Counting Algorithm for cardinalities. Let d := sampling depth, ξ := sample size. Theorem [F90]: X := 2^d ξ estimates the cardinality of S using b words of memory, in a way that is unbiased and with standard error ≈ 1.20/√b. • 1.20 ≈ 1/√(log 2); with b = 1,000 words, get 4% accuracy. • Distributional analysis by [Louchard, RSA 1997]. • Related to the folk algorithm for leader election on a channel: "Talk; flip a coin if noisy; sleep if Tails; repeat!" • Related to "tree protocols with counting" ≫ Ethernet. Cf. [Greenberg-F-Ladner, JACM 1987].

  20. 3 APPROXIMATE COUNTING. The oldest algorithm [Morris, CACM 1977], analysis [F, 1985]. Maintain F_1, i.e., a counter subject to C := C + 1. Theorem: count till n probabilistically using log₂ log n + δ bits, with accuracy about 0.59 · 2^{−δ/2}. Beats information theory(!?): 8 bits for counts ≤ 2^16 with accuracy ≈ 15%. ALG: Approximate Counting. Initialize: X := 1; Increment: do X := X + 1 with probability 2^{−X}; Output: 2^X − 2. In base q < 1: increment with probability q^X; output (q^{−X} − q^{−1})/(q^{−1} − 1); use q = 2^{−2^{−δ}} ≈ 1.
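The base-2 algorithm in a few lines; averaging many runs illustrates that 2^X − 2 is unbiased (this simulation is an illustration, not part of the slides):

```python
import random

def approx_count(n, rng):
    """Morris's Approximate Counting, base 2: after n increments, X
    stays near log2 n, so X fits in about log2 log2 n bits."""
    X = 1
    for _ in range(n):
        if rng.random() < 2.0 ** (-X):
            X += 1
    return X

rng = random.Random(7)
# Average the unbiased estimates 2^X - 2 over many runs:
runs = [2 ** approx_count(1000, rng) - 2 for _ in range(2000)]
avg = sum(runs) / len(runs)     # close to the true count n = 1000
```

A single run is noisy (standard error about 0.59 for base 2, i.e. δ = 0); the smoother base q ≈ 1 of the slide trades a few extra bits for accuracy instead of averaging.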

  21. [Plot: 10 runs of APCO, value of X as n grows to 10³; X climbs in steps and stays near log₂ n, between 6 and 10 at n = 1000.]
