On the Biased Partial Word Collector Problem, Philippe Duchon and Cyril Nicaud (PowerPoint PPT Presentation)





SLIDE 1

On the Biased Partial Word Collector Problem

Philippe Duchon and Cyril Nicaud

LIGM, Université Paris-Est & CNRS

February 8, 2018

1 / 16

slide-2
SLIDE 2

Coupon collector

The classical coupon collector problem is the following:

◮ There are n different pictures
◮ Each chocolate bar contains one picture, uniformly at random

How many chocolate bars are required to complete the collection?

Answer: in expectation, around n log n chocolate bars are needed.
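The n log n answer can be checked exactly: by linearity of expectation, the expected number of bars is n·Hn, where Hn is the n-th harmonic number. A quick sketch (illustrative, not from the slides):

```python
def expected_draws(n):
    """Exact expected number of bars needed to collect all n pictures.

    Once i distinct pictures are owned, a new one appears with
    probability (n - i) / n, so the wait for it is geometric with
    mean n / (n - i); summing over i = 0..n-1 gives n * H_n.
    """
    return sum(n / (n - i) for i in range(n))
```

For n = 100 this gives n·H100 ≈ 518.7, in line with the n log n ≈ 460.5 estimate (the gap is the lower-order γ·n term).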

SLIDE 3

Birthday paradox

The classical birthday paradox problem is the following (assuming that all 365 possible birthdays are uniformly random):

In a room with m people, what is the probability that at least two people have the same birthday?

Answer: for m = 23, the probability is just above 50%.
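The 50% figure for m = 23 follows from multiplying the "all distinct so far" probabilities; a small check (illustrative):

```python
def collision_probability(m, n=365):
    """P(at least two of m people share a birthday among n equally
    likely days): complement of all m birthdays being distinct."""
    p_distinct = 1.0
    for i in range(m):
        p_distinct *= (n - i) / n
    return 1.0 - p_distinct
```

collision_probability(23) ≈ 0.507, while collision_probability(22) is still below one half.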

SLIDE 4

Birthday problem

The birthday problem is the following:

How many chocolate bars until the first duplicate?

Answer: in expectation it is ∼ √(πn/2).
SLIDE 5

Biased partial word collector problem

Our problem:
(1) Draw N random words of length L, independently
(2) Remove duplicates
(3) Select a word uniformly at random

The words are generated using a memoryless source S: each letter is chosen independently following a fixed probability distribution on the alphabet.

For instance, with pa = 1/3 and pb = 2/3, the probability of aabba is (1/3)^3 · (2/3)^2 = 4/243.
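The source model is just independent letter draws, so word probabilities multiply; a minimal sketch of the aabba computation (names are illustrative):

```python
import math

def word_probability(word, probs):
    """Probability that the memoryless source S emits `word`:
    letters are drawn independently, so their probabilities multiply."""
    return math.prod(probs[c] for c in word)

probs = {"a": 1 / 3, "b": 2 / 3}
```

word_probability("aabba", probs) gives (1/3)^3 · (2/3)^2 = 4/243.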

SLIDE 6

Biased partial word collector problem

Our problem:
(1) Draw N random words of length L, independently (partial)
(2) Remove duplicates (collector problem)
(3) Select a word uniformly at random

The words are generated using a memoryless source S (biased): each letter is chosen independently following a fixed probability distribution on the alphabet.

For instance, with pa = 1/3 and pb = 2/3, the probability of aabba is (1/3)^3 · (2/3)^2 = 4/243.

SLIDE 7

Related works and motivations

◮ Birthday paradox, coupon collectors, caching algorithms and self-organizing search. Ph. Flajolet, D. Gardy, and L. Thimonier (DAM '92)
◮ The weighted words collector. J. Du Boisberranger, D. Gardy, and Y. Ponty (AofA '12)
◮ On Correlation Polynomials and Subword Complexity. I. Gheorghiciuc and M. D. Ward (AofA '07)
◮ The number of distinct subpalindromes in random words. M. Rubinchik and A. M. Shur (Fundam. Inform. '16)

Subword complexity: the number of distinct factors (of a given length or not) in a string. In our setting: N ≈ |u| and L ≈ |factor|, not completely independent.

SLIDE 8

Getting started

Our problem:
(1) Draw N random words of length L, independently, using S
(2) Remove duplicates
(3) Select a word uniformly at random

Some remarks:

◮ There are |A|^L distinct words
◮ If S is uniform, then the output is a uniform random word
◮ If N is small, the output looks like a word generated by S
◮ If N is large, the output looks like a uniform random word

By "looks like" we mean that the numbers of occurrences of the letters are approximately the same.
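The three-step process above can be sketched directly (a minimal simulation; the parameter values are illustrative):

```python
import random

def sample_output(N, L, probs, rng):
    """(1) Draw N words of length L from the memoryless source S,
    (2) remove duplicates, (3) pick one distinct word uniformly."""
    letters = list(probs)
    weights = [probs[c] for c in letters]
    distinct = {
        "".join(rng.choices(letters, weights=weights, k=L))
        for _ in range(N)
    }
    return rng.choice(sorted(distinct))   # sorted for reproducibility

rng = random.Random(42)
word = sample_output(N=1000, L=8, probs={"a": 1 / 3, "b": 2 / 3}, rng=rng)
```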

SLIDE 9

Full statement

◮ The alphabet is A = {a1, . . . , ak}
◮ The probabilities for S are p = (p1, . . . , pk), with pi = p(ai)
◮ The random variable UN,L denotes the output of our process
◮ H(x) = −Σ_i xi log xi is the classical entropy function
◮ Freq(u) = (f1, . . . , fk) is the frequency vector of u, with fi = |u|i / |u|

Theorem [Duchon & N., LATIN'18]
Let ℓ0 = −k / Σ_i log pi and ℓ1 = 1 / H(p). For L sufficiently large and for any N ≥ 2, there are three different behaviors depending on ℓ = L / log N:

(a) If ℓ ≤ ℓ0, then Freq(UN,L) ≈ (1/k, . . . , 1/k)
(b) If ℓ0 ≤ ℓ ≤ ℓ1, then Freq(UN,L) ≈ xℓ, for some fully characterized xℓ
(c) If ℓ1 ≤ ℓ, then Freq(UN,L) ≈ p

Here Freq(UN,L) ≈ y means P(‖Freq(UN,L) − y‖2 ≥ log L / √L) ≤ L^(−λ log L).
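The two thresholds are explicit functions of p alone; a direct reading of ℓ0 = −k / Σ_i log pi and ℓ1 = 1/H(p) (illustrative):

```python
import math

def thresholds(p):
    """Thresholds ell0 = -k / sum(log pi) and ell1 = 1 / H(p)
    for a memoryless source with letter probabilities p."""
    k = len(p)
    ell0 = -k / sum(math.log(pi) for pi in p)
    ell1 = 1 / -sum(pi * math.log(pi) for pi in p)
    return ell0, ell1
```

For p = (1/3, 2/3) this gives ℓ0 ≈ 1.33 < ℓ1 ≈ 1.57; for a uniform source both thresholds coincide at 1/log k, so the middle regime disappears.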

SLIDE 10

Simplified statement

◮ The probabilities for S are p = (p1, . . . , pk), with pi = p(ai)
◮ The random variable UN,L denotes the output of our process
◮ Freq(u) = (f1, . . . , fk) is the frequency vector of u, with fi = |u|i / |u|

Theorem [Duchon & N., LATIN'18]
There exist two thresholds ℓ0 < ℓ1, depending only on p, such that for L sufficiently large and for any N ≥ 2, there are three different behaviors depending on ℓ = L / log N:

(a) If ℓ ≤ ℓ0, then Freq(UN,L) is almost uniform
(b) If ℓ0 ≤ ℓ ≤ ℓ1, then Freq(UN,L) ≈ xℓ, for some fully characterized xℓ
(c) If ℓ1 ≤ ℓ, then Freq(UN,L) is almost p

SLIDE 11

Interpolation between the uniform distribution and p

Theorem [Duchon & N., LATIN'18]
For ℓ = L / log N:

(a) If ℓ ≤ ℓ0, then Freq(UN,L) is almost uniform
(b) If ℓ0 ≤ ℓ ≤ ℓ1, then Freq(UN,L) ≈ xℓ, for some fully characterized xℓ
(c) If ℓ1 ≤ ℓ, then Freq(UN,L) is almost p

We have xℓ = (p1^c / Φ(c), . . . , pk^c / Φ(c)), where Φ(t) = Σ_{i=1}^k pi^t and c is the unique solution in [0, 1] of ℓ Φ′(c) + Φ(c) = 0.
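The characterizing equation ℓΦ′(c) + Φ(c) = 0 has opposite signs at c = 0 and c = 1 whenever ℓ0 ≤ ℓ ≤ ℓ1, so c can be found by bisection (a sketch; assumes ℓ lies in the middle range):

```python
import math

def interpolated_frequencies(ell, p, tol=1e-12):
    """Solve ell * Phi'(c) + Phi(c) = 0 for c in [0, 1] by bisection,
    then return x_ell = (p1^c / Phi(c), ..., pk^c / Phi(c))."""
    phi = lambda t: sum(pi ** t for pi in p)
    dphi = lambda t: sum(pi ** t * math.log(pi) for pi in p)
    f = lambda t: ell * dphi(t) + phi(t)
    lo, hi = 0.0, 1.0            # f(lo) <= 0 <= f(hi) in the middle range
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) >= 0:
            hi = mid
        else:
            lo = mid
    c = (lo + hi) / 2
    return [pi ** c / phi(c) for pi in p]

x = interpolated_frequencies(1.45, (1 / 3, 2 / 3))
```

For p = (1/3, 2/3) and ℓ = 1.45 (inside the middle range), the output frequencies sit strictly between the uniform vector (1/2, 1/2) and p.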

SLIDE 12

Interpolation between the uniform distribution and p

xℓ = (p1^c / Φ(c), . . . , pk^c / Φ(c)), where Φ(t) = Σ_{i=1}^k pi^t and c is the unique solution in [0, 1] of ℓ Φ′(c) + Φ(c) = 0.

◮ If ℓ = ℓ0 then c = 0 and xℓ0 = (p1^0 / Φ(0), . . . , pk^0 / Φ(0)) = (1/k, . . . , 1/k), since Φ(0) = k
◮ If ℓ = ℓ1 then c = 1 and xℓ1 = (p1^1 / Φ(1), . . . , pk^1 / Φ(1)) = p, since Φ(1) = Σ_i pi = 1

SLIDE 13

Proof sketch 1/3

◮ Let WL(x) be the set of words of length L whose frequency vector is x
◮ All the words of WL(x) have the same probability p(x) of being generated by the source S, with p(x) = Π_i pi^(xi L) = N^(ℓ Σ_i xi log pi)
◮ Hence the probability q(x) that the collection contains a given word of frequency vector x is q(x) = 1 − (1 − p(x))^N
◮ We approximate q(x) by q(x) ≈ min(N p(x), 1) = N^(min(0, 1 + ℓ Σ_i xi log pi))
◮ Since there are binom(L; x1 L, . . . , xk L) ≈ N^(ℓ H(x)) words in WL(x), the expected number of such words in the collection is roughly N^(ℓ min(H(x), Kℓ(x))), with Kℓ(x) = H(x) + 1/ℓ + Σ_i xi log pi
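The q(x) ≈ min(N·p(x), 1) step is a standard bound on 1 − (1 − p)^N, tight in both regimes; a quick numeric check (illustrative):

```python
def q_exact(p_word, N):
    """P(a given word of probability p_word shows up among N draws)."""
    return 1 - (1 - p_word) ** N

def q_approx(p_word, N):
    """The approximation used in the proof sketch."""
    return min(N * p_word, 1.0)
```

When N·p_word is small, 1 − (1 − p)^N ≈ N·p; when it is large, both sides are essentially 1.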

SLIDE 14

Proof sketch 2/3

Goal: find the probability vector x that maximises min(H(x), Kℓ(x)), with Kℓ(x) = H(x) + 1/ℓ + Σ_i xi log pi.

It is the minimum of two strictly concave functions, but we have to do some analysis in several variables x1, . . . , xk.

SLIDE 15

Proof sketch 3/3

◮ For this proof sketch, let's consider that there is only one variable (i.e., two letters)
◮ When maximizing the minimum of two concave functions, there are two cases:
(a) The maximum of one function is smaller than the other function; it is then the maximum of the min
(b) Otherwise, the maximum is on the intersection of the two curves (which can be complicated in several dimensions)
◮ For our problem, the functions are sufficiently nice to work with explicitly, and we have:
◮ Case (a) appears for the two extremal ranges (uniform and p)
◮ Case (b) appears for the middle range (interpolation), and the maximum is found using standard analysis in several variables on the hyperplane where H(x) and Kℓ(x) intersect
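For two letters the max-of-min picture can be checked numerically: a grid search over x = (t, 1 − t) (parameter values illustrative, ℓ chosen in the middle range) lands strictly between the uniform frequency 1/2 and the source frequency p1 = 1/3:

```python
import math

def objective(t, ell, p):
    """min(H(x), K_ell(x)) for x = (t, 1 - t), two-letter alphabet."""
    x = (t, 1 - t)
    h = -sum(xi * math.log(xi) for xi in x)
    k_ell = h + 1 / ell + sum(xi * math.log(pi) for xi, pi in zip(x, p))
    return min(h, k_ell)

p, ell = (1 / 3, 2 / 3), 1.45
grid = [i / 10000 for i in range(1, 10000)]
best = max(grid, key=lambda t: objective(t, ell, p))
```

Here `best` ≈ 0.41, sitting on the curve where H and Kℓ intersect, as case (b) predicts.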

SLIDE 16

Conclusions

◮ There are two thresholds, fully characterized, for our problem
◮ A typical output word goes from uniformly random to distributed as an output of the source S
◮ The interpolation between the two distributions is fully understood
◮ We focused on the distribution of letters; can we say more?
◮ More general sources (Markovian)?
◮ Distinct subpalindromes for memoryless sources?

Thanks!
