slide-1
SLIDE 1

HOW MANY TRIALS IT TAKES TO FIND ALL DIFFERENT SPECIES OF A KIND

Vassilis G. Papanicolaou
Department of Mathematics
National Technical University of Athens
Zografou Campus, 157 80 Athens, GREECE

e-mail: papanico@math.ntua.gr

slide-2
SLIDE 2

Consider a population whose members are of $N$ different types (e.g. colors). For $1 \le j \le N$ we denote by $p_j$ the probability that a member of the population is of type $j$. The members of the population are sampled independently with replacement and their types are recorded. Our main object of study is the number $T_N$ of trials it takes until all $N$ types are detected (at least once). Of course, $P\{T_N \ge k\} = 1$ if $1 \le k \le N$.

slide-3
SLIDE 3

It is convenient to introduce the events $A_j^k$, $1 \le j \le N$, that type $j$ is not detected up to and including trial $k$. Then

$$P\{T_N \ge k\} = P\big(A_1^{k-1} \cup \cdots \cup A_N^{k-1}\big), \qquad k = 1, 2, \dots$$

By invoking the inclusion-exclusion principle one gets

$$P\{T_N \ge k\} = P\big(A_1^{k-1}\big) + \cdots + P\big(A_N^{k-1}\big) - P\big(A_1^{k-1} A_2^{k-1}\big) - \cdots - P\big(A_{N-1}^{k-1} A_N^{k-1}\big) + \cdots + (-1)^{N-1} P\big(A_1^{k-1} \cdots A_N^{k-1}\big),$$

or, in a more compact notation,

$$P\{T_N \ge k\} = \sum_{\substack{J \subset \{1,\dots,N\} \\ J \ne \emptyset}} (-1)^{|J|-1}\, P\Big(\bigcap_{j \in J} A_j^{k-1}\Big), \tag{1}$$

where the sum extends over all $2^N - 1$ nonempty subsets $J$ of $\{1,\dots,N\}$, while $|J|$ denotes the cardinality of $J$.

slide-4
SLIDE 4

Now

$$P\big(A_j^{k-1}\big) = (1 - p_j)^{k-1}, \qquad P\big(A_j^{k-1} A_i^{k-1}\big) = \big[1 - (p_j + p_i)\big]^{k-1},$$

and, in general, if $J \subset \{1,\dots,N\}$, then

$$P\Big(\bigcap_{j \in J} A_j^{k-1}\Big) = \Big(1 - \sum_{j \in J} p_j\Big)^{k-1}. \tag{2}$$

Combining formulas (2) and (1) yields

$$P\{T_N \ge k\} = \sum_{\substack{J \subset \{1,\dots,N\} \\ J \ne \emptyset}} (-1)^{|J|-1} \Big(1 - \sum_{j \in J} p_j\Big)^{k-1}, \qquad k = 1, 2, \dots \tag{3}$$

(valid also for the trivial case $k = 1$ under the convention $0^0 = 1$).
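Formula (3) can be checked numerically for a small example. The following Python sketch (the function names are ours, not part of the original) compares (3) with an exact dynamic-programming computation over the set of types detected so far:

```python
from itertools import combinations

def survival_incl_excl(p, k):
    """P{T_N >= k} via the inclusion-exclusion formula (3)."""
    N = len(p)
    total = 0.0
    for m in range(1, N + 1):
        for J in combinations(range(N), m):
            q = 1.0 - sum(p[j] for j in J)
            total += (-1) ** (m - 1) * q ** (k - 1)
    return total

def survival_dp(p, k):
    """P{T_N >= k}, computed exactly by tracking the distribution of the
    set of types detected during the first k-1 draws."""
    N = len(p)
    full = (1 << N) - 1
    dist = {0: 1.0}                      # empty set detected after 0 draws
    for _ in range(k - 1):
        new = {}
        for state, prob in dist.items():
            for j in range(N):
                s = state | (1 << j)
                new[s] = new.get(s, 0.0) + prob * p[j]
        dist = new
    return 1.0 - dist.get(full, 0.0)     # not all N types seen yet

p = [0.5, 0.3, 0.2]
print([round(survival_incl_excl(p, k), 6) for k in range(1, 6)])
```

Both routines agree to machine precision, and both return 1 for $k \le N$, as noted on slide 2.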

slide-5
SLIDE 5
Remark. A side result of the above analysis is the (somewhat nontrivial) algebraic formula

$$\sum_{J \subset \{1,\dots,N\}} (-1)^{|J|} \Big(1 - \sum_{j \in J} p_j\Big)^n = 0, \qquad n = 0, 1, \dots, N - 1.$$

slide-6
SLIDE 6

Lemma 0. Let $X$ be a random variable taking values in $\mathbb{N} = \{0, 1, 2, \dots\}$. If $g : \mathbb{N} \to \mathbb{R}$ is a function such that $E[g(X)] < \infty$, then

$$E[g(X)] = g(0) + \sum_{k=1}^{\infty} [g(k) - g(k-1)]\, P\{X \ge k\}. \tag{4}$$

In particular ($g(k) = k$),

$$E[X] = \sum_{k=1}^{\infty} P\{X \ge k\}. \tag{5}$$

slide-7
SLIDE 7
Proof. We have

$$E[g(X)] = \sum_{k=0}^{\infty} g(k)\, P\{X = k\} = g(0)P\{X=0\} + g(1)P\{X=1\} + g(2)P\{X=2\} + g(3)P\{X=3\} + \cdots.$$

The above sum can be rewritten as

$$g(0)\big[P\{X=0\} + P\{X=1\} + P\{X=2\} + P\{X=3\} + \cdots\big]$$
$$+\ [g(1) - g(0)]\big[P\{X=1\} + P\{X=2\} + P\{X=3\} + \cdots\big]$$
$$+\ [g(2) - g(1)]\big[P\{X=2\} + P\{X=3\} + \cdots\big]$$
$$+\ [g(3) - g(2)]\big[P\{X=3\} + \cdots\big]$$
$$\vdots$$

which is the right-hand side of (4).

Remark. If $g : \mathbb{N} \to \mathbb{R}$ is increasing, then (4) is valid even if $E[g(X)] = \infty$ (similarly if $g$ is decreasing).

slide-8
SLIDE 8

Combining (3) with (5) yields

$$E[T_N] = \sum_{\substack{J \subset \{1,\dots,N\} \\ J \ne \emptyset}} (-1)^{|J|-1} \sum_{k=1}^{\infty} \Big(1 - \sum_{j \in J} p_j\Big)^{k-1}.$$

Summation of the geometric series gives

$$E[T_N] = \sum_{\substack{J \subset \{1,\dots,N\} \\ J \ne \emptyset}} (-1)^{|J|-1} \Big(\sum_{j \in J} p_j\Big)^{-1}, \tag{6}$$

or

$$E[T_N] = \sum_{m=1}^{N} (-1)^{m-1} \sum_{1 \le j_1 < \cdots < j_m \le N} \frac{1}{p_{j_1} + \cdots + p_{j_m}}.$$
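As a sketch of how (6) arises from (5) and (3), the subset formula can be compared with the truncated tail sum $\sum_{k} P\{T_N \ge k\}$ (function names are ours):

```python
from itertools import combinations

def expected_trials(p):
    """E[T_N] from the subset formula (6)."""
    N = len(p)
    return sum((-1) ** (m - 1) / sum(p[j] for j in J)
               for m in range(1, N + 1)
               for J in combinations(range(N), m))

def expected_trials_tail_sum(p, kmax=500):
    """E[T_N] = sum_{k>=1} P{T_N >= k}, combining (5) with (3);
    the tail beyond kmax decays geometrically and is ignored."""
    N = len(p)
    terms = [((-1) ** (m - 1), 1.0 - sum(p[j] for j in J))
             for m in range(1, N + 1)
             for J in combinations(range(N), m)]
    return sum(sign * q ** (k - 1) for k in range(1, kmax + 1)
               for sign, q in terms)
```

For two equally likely coupons this reproduces the familiar value $E[T_2] = 2 + 2 - 1 = 3$.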

slide-9
SLIDE 9

We proceed by noticing that

$$1 - \prod_{j=1}^{N} \big(1 - e^{-p_j t}\big) = \sum_{\substack{J \subset \{1,\dots,N\} \\ J \ne \emptyset}} (-1)^{|J|-1} \exp\Big(-t \sum_{j \in J} p_j\Big).$$

Thus,

$$\int_0^{\infty} \Big[1 - \prod_{j=1}^{N} \big(1 - e^{-p_j t}\big)\Big] dt = \sum_{\substack{J \subset \{1,\dots,N\} \\ J \ne \emptyset}} \frac{(-1)^{|J|-1}}{\sum_{j \in J} p_j},$$

and hence

$$E[T_N] = \int_0^{\infty} \Big[1 - \prod_{j=1}^{N} \big(1 - e^{-p_j t}\big)\Big] dt, \tag{7}$$

or, by substituting $x = e^{-t}$ in the integral,

$$E[T_N] = \int_0^1 \Big[1 - \prod_{j=1}^{N} (1 - x^{p_j})\Big] \frac{dx}{x}. \tag{8}$$

Formulas (6), (7), and (8) are well known.
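The equivalence of (6) and (7) is easy to verify numerically; the sketch below (our names, and a plain trapezoidal rule chosen for self-containedness) evaluates the integral in (7) directly:

```python
import math
from itertools import combinations

def expected_trials_subsets(p):
    """E[T_N] from the inclusion-exclusion formula (6)."""
    N = len(p)
    return sum((-1) ** (m - 1) / sum(p[j] for j in J)
               for m in range(1, N + 1)
               for J in combinations(range(N), m))

def expected_trials_integral(p, T=150.0, steps=100_000):
    """E[T_N] from formula (7), by the trapezoidal rule on [0, T];
    the integrand decays like exp(-min(p) t), so T = 150 is ample here."""
    h = T / steps
    def integrand(t):
        prod = 1.0
        for pj in p:
            prod *= 1.0 - math.exp(-pj * t)
        return 1.0 - prod
    s = 0.5 * (integrand(0.0) + integrand(T))
    s += sum(integrand(i * h) for i in range(1, steps))
    return s * h
```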

slide-10
SLIDE 10

Likewise, for the generating function of $T_N$,

$$F(z) := E\big[z^{-T_N}\big],$$

we have the formulas

$$F(z) = 1 - (z-1) \sum_{\substack{J \subset \{1,\dots,N\} \\ J \ne \emptyset}} \frac{(-1)^{|J|-1}}{z - 1 + \sum_{j \in J} p_j},$$

$$F(z) = 1 - (z-1) \int_0^{\infty} \Big[1 - \prod_{j=1}^{N} \big(1 - e^{-p_j t}\big)\Big] e^{-(z-1)t}\, dt,$$

and

$$F(z) = 1 - (z-1) \int_0^1 \Big[1 - \prod_{j=1}^{N} (1 - x^{p_j})\Big] x^{z-2}\, dx.$$

slide-11
SLIDE 11

And for the second moment of $T_N$ we have

$$E[T_N(T_N+1)] = 2 \sum_{\substack{J \subset \{1,\dots,N\} \\ J \ne \emptyset}} (-1)^{|J|-1} \Big(\sum_{j \in J} p_j\Big)^{-2},$$

$$E[T_N(T_N+1)] = 2 \int_0^{\infty} \Big[1 - \prod_{j=1}^{N} \big(1 - e^{-p_j t}\big)\Big]\, t\, dt,$$

and

$$E[T_N(T_N+1)] = -2 \int_0^1 \Big[1 - \prod_{j=1}^{N} (1 - x^{p_j})\Big] \frac{\ln x}{x}\, dx.$$
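A sketch of a numerical check for the first of these formulas, using Lemma 0 with $g(k) = k(k+1)$ (so that $g(k) - g(k-1) = 2k$); the function names are ours:

```python
from itertools import combinations

def second_factorial_moment(p):
    """E[T_N (T_N + 1)] via the subset formula on this slide."""
    N = len(p)
    return 2.0 * sum((-1) ** (m - 1) / sum(p[j] for j in J) ** 2
                     for m in range(1, N + 1)
                     for J in combinations(range(N), m))

def second_factorial_moment_tail_sum(p, kmax=2000):
    """E[T_N (T_N + 1)] = 2 sum_{k>=1} k P{T_N >= k}, which follows from
    Lemma 0 (slide 6) with g(k) = k(k+1), using (3) for P{T_N >= k}."""
    N = len(p)
    terms = [((-1) ** (m - 1), 1.0 - sum(p[j] for j in J))
             for m in range(1, N + 1)
             for J in combinations(range(N), m)]
    return sum(2.0 * k * sign * q ** (k - 1)
               for k in range(1, kmax + 1) for sign, q in terms)
```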

slide-12
SLIDE 12

Naturally, the simplest case regarding the previous formulas occurs when one takes $p_1 = \cdots = p_N = 1/N$. It is easy to check (e.g. by taking logarithms) that, for any fixed $t > 0$, the maximum of the quantity

$$\prod_{j=1}^{N} \big(1 - e^{-p_j t}\big),$$

subject to the constraint $p_1 + \cdots + p_N = 1$, occurs when all the $p_j$'s are equal. It follows that $E[T_N]$ attains its minimum value when all the $p_j$'s are equal (see also M.V. Hildebrand [11]). The same is true for $E[T_N(T_N+1)]$. As for $F(z) = E\big[z^{-T_N}\big]$, where $z > 1$, it attains its maximum value when all the $p_j$'s are equal.

slide-13
SLIDE 13

Let $p_1 = \cdots = p_N = 1/N$. Then

$$E[T_N] = N \sum_{m=1}^{N} \binom{N}{m} \frac{(-1)^{m-1}}{m} = \int_0^1 \Big[1 - \big(1 - x^{1/N}\big)^N\Big] \frac{dx}{x}.$$

Substituting $u = 1 - x^{1/N}$ in the integral yields

$$E[T_N] = N \int_0^1 \frac{1 - u^N}{1 - u}\, du = N \int_0^1 \Big(\sum_{m=1}^{N} u^{m-1}\Big) du = N H_N,$$

where $H_N$ is the $N$-th harmonic number,

$$H_N = \sum_{m=1}^{N} \frac{1}{m}.$$
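The identity $N \sum_{m=1}^{N} \binom{N}{m} \frac{(-1)^{m-1}}{m} = N H_N$ can be confirmed exactly with rational arithmetic (a sketch; names are ours):

```python
from fractions import Fraction
from math import comb

def harmonic(N):
    """H_N as an exact rational."""
    return sum(Fraction(1, m) for m in range(1, N + 1))

def alternating_form(N):
    """N * sum_m C(N, m) (-1)^(m-1) / m, the left-hand side on this slide."""
    return N * sum(Fraction(comb(N, m) * (-1) ** (m - 1), m)
                   for m in range(1, N + 1))
```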

slide-14
SLIDE 14

In a similar way we get

$$F(z) = 1 - (z-1) \sum_{m=1}^{N} \binom{N}{m} \frac{(-1)^{m-1}}{z - 1 + (m/N)} = 1 - (z-1) \int_0^1 \Big[1 - \big(1 - x^{1/N}\big)^N\Big] x^{z-2}\, dx,$$

and

$$E[T_N(T_N+1)] = 2N^2 \sum_{m=1}^{N} \frac{H_m}{m} = N^2 \Big(H_N^2 + \sum_{m=1}^{N} \frac{1}{m^2}\Big),$$

which also implies

$$V[T_N] = N^2 \sum_{m=1}^{N} \frac{1}{m^2} - N H_N.$$
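The intermediate identity $2 \sum_{m=1}^{N} H_m/m = H_N^2 + \sum_{m=1}^{N} 1/m^2$ used above can also be verified exactly (a sketch in rational arithmetic; the name is ours):

```python
from fractions import Fraction

def identity_holds(N):
    """Check 2 * sum_{m<=N} H_m / m == H_N^2 + sum_{m<=N} 1/m^2 exactly."""
    H = [Fraction(0)]                       # H[m] = m-th harmonic number
    for m in range(1, N + 1):
        H.append(H[-1] + Fraction(1, m))
    lhs = 2 * sum(H[m] / m for m in range(1, N + 1))
    rhs = H[N] ** 2 + sum(Fraction(1, m * m) for m in range(1, N + 1))
    return lhs == rhs
```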

slide-15
SLIDE 15

Reminder (the Euler-Maclaurin sum formula). If

$$S(N) = \sum_{m=0}^{N} f(m),$$

then, as $N \to \infty$,

$$S(N) \sim \frac{1}{2} f(N) + \int^{N} f(x)\, dx + C + \sum_{k=1}^{\infty} (-1)^{k+1} \frac{B_{k+1}}{(k+1)!}\, f^{(k)}(N),$$

where $C$ is a constant and $B_k$ is the $k$-th Bernoulli number, defined by the formula

$$\frac{z}{e^z - 1} = \sum_{k=0}^{\infty} \frac{B_k}{k!}\, z^k$$

(e.g. $B_0 = 1$, $B_1 = -1/2$, $B_2 = 1/6$, $B_4 = -1/30$, and $B_k = 0$ for all odd $k \ge 3$). For example,

$$H_N = \sum_{m=1}^{N} \frac{1}{m} \sim \ln N + \gamma + \frac{1}{2N} - \sum_{k=2}^{\infty} \frac{B_k}{k N^k},$$

where $\gamma = 0.5772\ldots$ is Euler's constant. Also

$$\sum_{m=1}^{N} \frac{1}{m^2} \sim \frac{\pi^2}{6} - \frac{1}{N} + \frac{1}{2N^2} - \sum_{k=2}^{\infty} \frac{B_k}{N^{k+1}}.$$

slide-16
SLIDE 16

If $p_1 = \cdots = p_N = 1/N$, then by using the Euler-Maclaurin sum formula one obtains

$$E[T_N] \sim N \ln N + \gamma N + \frac{1}{2} - \sum_{k=2}^{\infty} \frac{B_k}{k N^{k-1}},$$

$$E[T_N(T_N+1)] = N^2 \left[(\ln N)^2 + 2\gamma \ln N + \gamma^2 + \frac{\pi^2}{6} + O\Big(\frac{\ln N}{N}\Big)\right],$$

and

$$V[T_N] = \frac{\pi^2 N^2}{6} - N \ln N - (\gamma + 1) N + O\Big(\frac{\ln N}{N}\Big),$$

as $N \to \infty$.
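A quick look at how sharp the first expansion already is at moderate $N$ (a sketch; names are ours):

```python
import math

GAMMA = 0.5772156649015329      # Euler's constant

def exact_ET(N):
    """E[T_N] = N * H_N for equal probabilities (slide 13)."""
    return N * sum(1.0 / m for m in range(1, N + 1))

def asymptotic_ET(N):
    """The leading terms N ln N + gamma N + 1/2 of the expansion above."""
    return N * math.log(N) + GAMMA * N + 0.5
```

The neglected terms are of order $1/N$, so the discrepancy shrinks rapidly as $N$ grows.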

slide-17
SLIDE 17
QUIZ. The town F has population 1825 ($= 5 \times 365$) while the town S has population 2190 ($= 6 \times 365$). Let $f$ and $s$ be the probabilities that all 365 birthdays are represented in F and S respectively. Estimate $s/f$.

slide-18
SLIDE 18

Answer. $s/f \approx 4.8$. In fact, $f \approx 0.085$ and $s \approx 0.4051$.

Hint. It can be shown (see R. Durrett [8]) that, as $N \to \infty$,

$$P\{T_N - N \ln N \le N x\} \to \exp\big(-e^{-x}\big).$$
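The hint's limit law gives the estimates directly. A minimal sketch (the function name is ours; the slide's quoted values come from a more exact computation, so the approximation below is only close, not identical):

```python
import math

def prob_all_types(n, N=365):
    """Gumbel approximation: P{T_N <= n} ~ exp(-e^{-x}) with x = n/N - ln N."""
    x = n / N - math.log(N)
    return math.exp(-math.exp(-x))

f = prob_all_types(5 * 365)   # town F, population 1825
s = prob_all_types(6 * 365)   # town S, population 2190
print(f, s, s / f)
```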

slide-19
SLIDE 19

Some Applications. The above formulas are associated with what is usually called the “Coupon Collector Problem” (CCP), where $N$ different coupons (of arbitrary occurring frequencies) are sampled independently with replacement. We now mention some applications. The first three examples introduce probabilistic computational algorithms which can be modeled by the CCP.

1. Constraint classification in mathematical programming. In 1983, Karwan et al. [14] described a class of randomized algorithms for classifying all the constraints in a mathematical programming problem as necessary or redundant. The basic algorithm, also known as PREDUCE (Probabilistic REDUCE), can be briefly described as follows: given an interior feasible point, each iteration consists of generating a ray in a random direction and recording the first constraint it intersects. Such a constraint is a necessary one. The algorithm generates rays until a stopping rule is satisfied. Then, all the constraints which were not hit at all are classified as redundant (possibly erroneously). Each iteration corresponds to drawing one coupon, with $N$ being the number of necessary constraints. Thus, the CCP model can help to determine an efficient stopping rule.

slide-20
SLIDE 20
2. Multistart methods in global optimization. In an environment with multiple local optima, this method chooses a random initial point and uses any optimization algorithm to find a local optimum of the objective function. We repeatedly pick different random initial points, which in turn lead to different local optima, according to the partition of the domain into attraction regions. Here, the local optima are the coupons, and the probability of finding any local optimum is proportional to the size of its attraction region. The global optimizer wants to know how long it is expected to take until all the local optima are found, thus yielding the global optimum. For more information we refer to A. Torn and A. Zallinskas [18].

slide-21
SLIDE 21
3. Determining the convex closure of a finite subset $S$ of $\mathbb{R}^d$. To determine the subset of $S$ that spans its convex closure, we generate a random $(d-1)$-dimensional hyperplane and measure the orthogonal distance of each point from that hyperplane. It can be shown that the point with the greatest distance belongs to the convex closure. This is done repeatedly until a stopping rule indicates that all the spanning points have been detected.

slide-22
SLIDE 22

The next two examples illustrate applications of the CCP in engineering and the natural sciences.

4. Engineering application: fault detection in electronic hardware. Many fault detection procedures in electronic hardware are based on repeated tests of the hardware. In each such test, a certain number of faults are detected, while some other faults are not. Here, each detection of a fault corresponds to the drawing of a coupon. It is of interest to determine the expected number of tests necessary until all the faults are detected.

slide-23
SLIDE 23
5. Biological application: estimating the number of species. Consider a complex ecosystem inhabited by a great variety of species. Biologists who explore that environment want to detect all the different species of a certain type, for example, all the different species of birds in the Galapagos Islands near Ecuador. Observing an animal from a given species constitutes a detection of that species. The reader should easily see that this scenario falls into the category of the CCP.

slide-24
SLIDE 24

Large N Asymptotics. When $N$ is large, it is not obvious at all what information one can obtain from the formulas for $E[T_N]$ and $E[T_N(T_N+1)]$. For this reason there is a need to develop efficient ways of deriving asymptotics of $E[T_N]$ and $E[T_N(T_N+1)]$ as $N \to \infty$ (we have already analyzed the very special case of equal probabilities).

Reminder (see, e.g., R. Durrett [8]). If the quantity $b_N$ grows fast enough so that $V[T_N] = o(b_N^2)$ as $N \to \infty$, then

$$\frac{T_N - E[T_N]}{b_N} \to 0 \quad \text{in probability.}$$

The asymptotics of $E[T_N]$ and $E[T_N(T_N+1)]$ are necessary for a good choice of $b_N$.

slide-25
SLIDE 25

How to formulate the asymptotics problem? Let $\alpha = \{a_j\}_{j=1}^{\infty}$ be a sequence of strictly positive numbers. Then, for each integer $N > 0$, one can create a probability measure $\pi_N = \{p_1, \dots, p_N\}$ on the set $\{1, \dots, N\}$ by taking

$$p_j = \frac{a_j}{A_N}, \qquad \text{where } A_N = \sum_{j=1}^{N} a_j.$$

Notice that $p_j$ depends on $\alpha$ and $N$; thus, given $\alpha$, it makes sense to consider the asymptotic behavior of $E[T_N]$ as $N \to \infty$.

slide-26
SLIDE 26

Motivated by (6) we introduce the notation

$$E_N(\alpha) := \sum_{\substack{J \subset \{1,\dots,N\} \\ J \ne \emptyset}} (-1)^{|J|-1} \Big(\sum_{j \in J} a_j\Big)^{-1} \tag{9}$$

$$= \sum_{m=1}^{N} (-1)^{m-1} \sum_{1 \le j_1 < \cdots < j_m \le N} \frac{1}{a_{j_1} + \cdots + a_{j_m}}.$$

Then, as in (7) and (8), one has

$$E_N(\alpha) = \int_0^{\infty} \Big[1 - \prod_{j=1}^{N} \big(1 - e^{-a_j t}\big)\Big] dt \tag{10}$$

and

$$E_N(\alpha) = \int_0^1 \Big[1 - \prod_{j=1}^{N} (1 - x^{a_j})\Big] \frac{dx}{x}. \tag{11}$$

slide-27
SLIDE 27

Since $E_N(s\alpha) = \frac{1}{s} E_N(\alpha)$, we have

$$E[T_N] = E_N\big(A_N^{-1}\alpha\big) = A_N E_N(\alpha). \tag{12}$$

Thus, the problem of estimating $E[T_N]$ as $N \to \infty$ can be treated as two separate problems, namely estimating $A_N$ and estimating $E_N(\alpha)$. Our analysis focuses on estimating $E_N(\alpha)$. The estimation of the sum $A_N$ will be considered an external matter which can be handled by existing powerful methods, such as the Euler-Maclaurin sum formula, the Laplace method for sums (see, e.g., C.M. Bender and S.A. Orszag [3]), or even summation by parts.
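The scaling property $E_N(s\alpha) = \frac{1}{s} E_N(\alpha)$, on which (12) rests, is immediate from (9) and easy to confirm numerically (a sketch; the name is ours):

```python
from itertools import combinations

def E_N(a):
    """E_N(alpha) from definition (9), for a finite prefix a = (a_1, ..., a_N)."""
    N = len(a)
    return sum((-1) ** (m - 1) / sum(a[j] for j in J)
               for m in range(1, N + 1)
               for J in combinations(range(N), m))
```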

slide-28
SLIDE 28

Likewise, in order to analyze $E[T_N(T_N+1)]$ we introduce

$$Q_N(\alpha) := 2 \sum_{\substack{J \subset \{1,\dots,N\} \\ J \ne \emptyset}} (-1)^{|J|-1} \Big(\sum_{j \in J} a_j\Big)^{-2}.$$

Then one has

$$E[T_N(T_N+1)] = Q_N\big(A_N^{-1}\alpha\big) = A_N^2\, Q_N(\alpha).$$
28

slide-29
SLIDE 29

The Dichotomy. For convenience, we denote

$$f_N^{\alpha}(x) = \prod_{j=1}^{N} (1 - x^{a_j}), \qquad 0 \le x \le 1.$$

The following properties of $f_N^{\alpha}$ are immediate:

(i) $f_N^{\alpha}(0) = 1$, $f_N^{\alpha}(1) = 0$;

(ii) $f_N^{\alpha}$ is monotone decreasing in $x$;

(iii) $f_{N+1}^{\alpha}(x) \le f_N^{\alpha}(x)$; in particular,

$$\lim_N f_N^{\alpha}(x) = \prod_{j=1}^{\infty} (1 - x^{a_j}) \quad \text{exists.}$$

Thus, by applying the MCT we get the existence of

$$L_1(\alpha) := \lim_N E_N(\alpha) = \int_0^1 \Big[1 - \prod_{j=1}^{\infty} (1 - x^{a_j})\Big] \frac{dx}{x}. \tag{13}$$

Notice that $L_1(\alpha) > 0$ for any $\alpha$ (since, for every $x \in [0,1)$, $f_N^{\alpha}(x) < 1$ and decreases with $N$). However, we may have $L_1(\alpha) = \infty$. Therefore, we have the following dichotomy:

$$\text{(i) } 0 < L_1(\alpha) < \infty \qquad \text{or} \qquad \text{(ii) } L_1(\alpha) = \infty. \tag{14}$$

slide-30
SLIDE 30

Likewise,

$$L_2(\alpha) := \lim_N Q_N(\alpha) = -2 \int_0^1 \Big[1 - \prod_{j=1}^{\infty} (1 - x^{a_j})\Big] \frac{\ln x}{x}\, dx.$$

slide-31
SLIDE 31

The Case $L_1(\alpha) < \infty$. Obviously, as $N \to \infty$,

$$E[T_N] = A_N L_1(\alpha)\, [1 + o(1)]. \tag{15}$$

It would be desirable to give a simple necessary and sufficient condition for $L_1(\alpha) < \infty$.

Theorem 1. $L_1(\alpha) < \infty$ if and only if there exists a $\xi \in (0,1)$ such that

$$\sum_{j=1}^{\infty} \xi^{a_j} < \infty. \tag{16}$$

This condition is easily applied.

slide-32
SLIDE 32

We need the following lemma:

Lemma 1. Let $\{b_j\}_{j=1}^{\infty}$ be a sequence of real numbers such that $0 \le b_j \le 1$ for all $j$. If $\sum_{j=1}^{\infty} b_j < \infty$, then

$$\sum_{j=1}^{\infty} b_j - \sum_{1 \le l < j} b_l b_j \le 1 - \prod_{j=1}^{\infty} (1 - b_j) \le \sum_{j=1}^{\infty} b_j.$$

Proof. Induction and passage to the limit.
32

slide-33
SLIDE 33

Proof of Th. 1. Assume first that there is a $\xi \in (0,1)$ such that (16) is true. Then, by (13) and Lemma 1 we have

$$L_1(\alpha) \le \int_0^{\xi} \Big(\sum_{j=1}^{\infty} x^{a_j}\Big) \frac{dx}{x} + \int_{\xi}^{1} \Big[1 - \prod_{j=1}^{\infty} (1 - x^{a_j})\Big] \frac{dx}{x} \le \int_0^{\xi} \sum_{j=1}^{\infty} x^{a_j - 1}\, dx - \ln \xi. \tag{17}$$

Now, (16) implies that $\xi^{a_j} \to 0$, hence $a_j \to \infty$. Therefore $\min_j \{a_j\} = a_{j_0} > 0$. Evaluating the integral in (17) gives

$$L_1(\alpha) \le \sum_{j=1}^{\infty} \frac{\xi^{a_j}}{a_j} - \ln \xi \le \frac{1}{a_{j_0}} \sum_{j=1}^{\infty} \xi^{a_j} - \ln \xi < \infty.$$

Conversely, if $\sum_{j=1}^{\infty} \xi^{a_j} = \infty$ for all $\xi \in (0,1]$, then, by a well-known property of infinite products (see, e.g., [17]),

$$\prod_{j=1}^{\infty} (1 - x^{a_j}) = 0, \qquad \text{for all } x \in (0,1],$$

and hence (13) yields

$$L_1(\alpha) = \int_0^1 \frac{dx}{x} = \infty.$$

slide-34
SLIDE 34
Remarks. (a) If $\liminf_j a_j < \infty$, then $L_1(\alpha) = \infty$. On the other hand, $\lim_j a_j = \infty$ does not imply that $L_1(\alpha) < \infty$, since we may have $\xi^{a_j} \to 0$ but $\sum_{j=1}^{\infty} \xi^{a_j} = \infty$. For example, if $a_j = (\ln j)^p$, $0 < p < 1$, then $\lim_j a_j = \infty$ and $L_1(\alpha) = \infty$. However, if $\sum_{j=1}^{\infty} 1/a_j < \infty$, then by Tonelli's theorem

$$\int_0^1 \Big(\sum_{j=1}^{\infty} x^{a_j}\Big) \frac{dx}{x} = \sum_{j=1}^{\infty} \int_0^1 x^{a_j}\, \frac{dx}{x} = \sum_{j=1}^{\infty} \frac{1}{a_j} < \infty,$$

hence $\sum_{j=1}^{\infty} x^{a_j} < \infty$ for $x \in (0,1)$. Therefore $L_1(\alpha) < \infty$.

slide-35
SLIDE 35

(b) Consider the error term, defined by $\delta_N = L_1(\alpha) - E_N(\alpha)$. Then

$$\delta_N = \int_0^1 \prod_{j=1}^{N} (1 - x^{a_j}) \Big[1 - \prod_{j=N+1}^{\infty} (1 - x^{a_j})\Big] \frac{dx}{x} \le \int_0^1 \Big(\sum_{j=N+1}^{\infty} x^{a_j}\Big) \frac{dx}{x} = \sum_{j=N+1}^{\infty} \frac{1}{a_j}. \tag{18}$$

slide-36
SLIDE 36

The Case $L_1(\alpha) = \infty$. The problem of determining the asymptotic behavior of $E_N(\alpha)$ becomes much more challenging in this case. We begin the analysis by writing $a_j$ in the form

$$a_j = \frac{1}{f(j)},$$

where $f(x) > 0$ is a real function. For our main result we need to impose the following conditions on the limiting behavior of $f(x)$ as $x \to \infty$:

(i) $f(x) \nearrow \infty$;  (ii) $\dfrac{f'(x)}{f(x)} \searrow 0$;  (iii) $\dfrac{f''(x)/f'(x)}{[f'(x)/f(x)]\, \ln [f'(x)/f(x)]} \to 0$.

slide-37
SLIDE 37

Condition (iii) is merely technical. If we introduce the function

$$g(x) = \frac{f'(x)}{f(x)},$$

namely the logarithmic derivative of $f$, then condition (iii), given (ii), can be expressed in the simpler form

$$\frac{g'(x)}{g(x)^2 \ln g(x)} \to 0.$$

The above three conditions are satisfied by a variety of commonly used functions, for example (i) $f(x) = x^p$, $p > 0$, (ii) $f(x) = \exp(x^r)$, $0 < r < 1$, or convex combinations of products of such functions.
37

slide-38
SLIDE 38

Theorem 2. If $\alpha = \{1/f(j)\}_{j=1}^{\infty}$, where $f$ satisfies (i)-(iii), then, as $N \to \infty$,

$$E_N(\alpha) \sim F(N) := -f(N) \ln \frac{f'(N)}{f(N)}.$$

Proof of Th. 2. Since $F(N) > 0$, (10) can be written (via the substitution $t = F(N)s$) as

$$E_N(\alpha) = F(N) \int_0^{\infty} \Big[1 - \exp\Big(\sum_{j=1}^{N} \ln\big(1 - e^{-\frac{F(N)}{f(j)} s}\big)\Big)\Big] ds.$$

slide-39
SLIDE 39

Lemma 2. Suppose $f$ satisfies (i)-(iii) and

$$F(x) := -f(x) \ln \frac{f'(x)}{f(x)}. \tag{19}$$

Then

$$\lim_N \sum_{j=1}^{N} \ln\big(1 - e^{-\frac{F(N)}{f(j)} s}\big) = \begin{cases} -\infty, & \text{if } s < 1; \\ 0, & \text{if } s \ge 1. \end{cases} \tag{20}$$

slide-40
SLIDE 40

Proof of the lemma. Since $F(N)/f(j) \to \infty$ as $N \to \infty$, and $\ln(1-x) \sim -x$ as $x \to 0$, we can replace

$$\sum_{j=1}^{N} \ln\big(1 - e^{-\frac{F(N)}{f(j)} s}\big) \qquad \text{by} \qquad -\sum_{j=1}^{N} e^{-\frac{F(N)}{f(j)} s}.$$

Furthermore, since $f$ is increasing,

$$\int_1^{N} e^{-\frac{F(N)}{f(x)} s}\, dx \le \sum_{j=1}^{N} e^{-\frac{F(N)}{f(j)} s} \le \int_1^{N+1} e^{-\frac{F(N)}{f(x)} s}\, dx,$$

hence we can work with $\displaystyle\int_1^{N} e^{-\frac{F(N)}{f(x)} s}\, dx$.
40

slide-41
SLIDE 41

First, integration by parts gives

$$\int_1^{N} e^{-\frac{F(N)}{f(x)} s}\, dx = \left[\frac{f(x)^2\, e^{-\frac{F(N)}{f(x)} s}}{s F(N) f'(x)}\right]_{x=1}^{N} - \int_1^{N} e^{-\frac{F(N)}{f(x)} s}\, \frac{1}{s F(N)} \left(\frac{f(x)^2}{f'(x)}\right)' dx. \tag{21}$$

Now,

$$\int_1^{N} e^{-\frac{F(N)}{f(x)} s}\, \frac{1}{s F(N)} \left(\frac{f(x)^2}{f'(x)}\right)' dx = \frac{2}{s} \int_1^{N} \frac{f(x)}{F(N)}\, e^{-\frac{F(N)}{f(x)} s}\, dx - \frac{1}{s} \int_1^{N} \frac{f''(x)/f'(x)}{f'(x)/f(x)} \cdot \frac{f(x)}{F(N)}\, e^{-\frac{F(N)}{f(x)} s}\, dx,$$

thus the conditions (i), (ii) and (iii) and the definition of $F$ give that, as $N \to \infty$,

slide-42
SLIDE 42

$$\int_1^{N} e^{-\frac{F(N)}{f(x)} s}\, \frac{1}{s F(N)} \left(\frac{f(x)^2}{f'(x)}\right)' dx = o\left(\int_1^{N} e^{-\frac{F(N)}{f(x)} s}\, dx\right).$$

The role of condition (iii) is to control the term in which $f''(x)$ appears. Then, (21) implies

$$\int_1^{N} e^{-\frac{F(N)}{f(x)} s}\, dx \sim \frac{f(N)^2\, e^{-\frac{F(N)}{f(N)} s}}{s F(N) f'(N)}.$$

Using again (19), i.e. the definition of $F$, we get

$$\int_1^{N} e^{-\frac{F(N)}{f(x)} s}\, dx \sim \frac{1}{s \ln[f(N)/f'(N)]} \left(\frac{f(N)}{f'(N)}\right)^{1-s}. \tag{22}$$

Since we assumed $f(N)/f'(N) \nearrow \infty$, (22) implies that

$$\lim_N \int_1^{N} e^{-\frac{F(N)}{f(x)} s}\, dx = \begin{cases} \infty, & \text{if } s < 1; \\ 0, & \text{if } s \ge 1. \end{cases} \tag{23}$$

slide-43
SLIDE 43

Using Lemma 2 we obtain that, as $N \to \infty$,

$$E_N(\alpha) = F(N)[1 + o(1)] + F(N) \int_1^{\infty} \Big[1 - \exp\Big(\sum_{j=1}^{N} \ln\big(1 - e^{-\frac{F(N)}{f(j)} s}\big)\Big)\Big] ds. \tag{24}$$

Now, by the DCT,

$$\lim_N \int_1^{\infty} \left[1 - \exp\left(-\frac{C_N^{1-s}}{s \ln C_N}\right)\right] ds = 0,$$

where $C_N = f(N)/f'(N) \to \infty$. Using (22), this implies that

$$\lim_N \int_1^{\infty} \left[1 - \exp\left(-\int_1^{N} e^{-\frac{F(N)}{f(x)} s}\, dx\right)\right] ds = 0.$$

By similar inequalities as in the proof of the lemma we can once again argue that the limit is valid for the sum as well, namely

$$\lim_N \int_1^{\infty} \Big[1 - \exp\Big(\sum_{j=1}^{N} \ln\big(1 - e^{-\frac{F(N)}{f(j)} s}\big)\Big)\Big] ds = 0,$$

and the assertion of the theorem follows immediately from (24).
slide-44
SLIDE 44

It is plausible that the result of Theorem 2 also applies to sequences $\{a_j\}_{j=1}^{\infty}$ other than those satisfying the assumptions (i)-(iii). For sequences to which the result of the theorem cannot be shown to apply, we can obtain upper and lower bounds of $E_N(\alpha)$ (often sharp) by using Lemma 1. In particular, if $\epsilon_j = e^{-\omega_j}$, where $\omega_j \nearrow \infty$, then by Lemma 1

$$E_N(\alpha) \le \int_0^{\epsilon_N} \Big(\sum_{j=1}^{N} x^{a_j}\Big) \frac{dx}{x} - \ln \epsilon_N = \sum_{j=1}^{N} \frac{e^{-\omega_N a_j}}{a_j} + \omega_N. \tag{25}$$

In order to best utilize the estimate given in (25), we should find a sequence $\{\omega_j\}_{j=1}^{\infty}$ with a growth rate such that

$$\sum_{j=1}^{N} \frac{e^{-\omega_N a_j}}{a_j} = \sum_{j=1}^{N} e^{-\ln a_j - \omega_N a_j} = O(\omega_N).$$

Such a sequence $\{\omega_j\}_{j=1}^{\infty}$ would yield

$$E_N(\alpha) = O(\omega_N). \tag{26}$$

In many cases, we will even have

$$\sum_{j=1}^{N} e^{-\ln a_j - \omega_N a_j} \sim C \omega_N, \qquad \text{for some } C \ne 0. \tag{27}$$

slide-45
SLIDE 45

To illustrate this idea, consider a sequence of the type $a_j = e^{-j c_j}$, where $\{c_j\}_{j=1}^{\infty}$ is any nondecreasing sequence of positive numbers. It is easy to see that such sequences $\{a_j\}_{j=1}^{\infty}$ do not satisfy all the conditions (i)-(iii). However, it follows immediately that $\omega_j = 1/a_j = e^{j c_j}$ satisfies (27). By using Lemma 1, it may be concluded that

$$E_N(\alpha) \sim C \omega_N = C e^{N c_N}, \qquad \text{for some } C \ne 0. \tag{28}$$

A specific situation where (28) holds will be illustrated in Example 4 below.

slide-46
SLIDE 46

The next theorem helps us obtain asymptotic estimates by comparison with sequences $\alpha$ for which the asymptotic estimates of $E_N(\alpha)$ are known. First, we recall the following notation. Suppose that $\{s_j\}_{j=1}^{\infty}$ and $\{t_j\}_{j=1}^{\infty}$ are two sequences of nonnegative terms. The symbol $s_j \asymp t_j$ means that there are two constants $C > c > 0$ and an integer $j_0 > 0$ such that $c t_j \le s_j \le C t_j$ for all $j \ge j_0$, i.e. $s_j = O(t_j)$ and $t_j = O(s_j)$.

slide-47
SLIDE 47

Theorem 3. Let $\alpha = \{a_j\}_{j=1}^{\infty}$ and $\beta = \{b_j\}_{j=1}^{\infty}$ be sequences of strictly positive terms such that $\lim_N E_N(\alpha) = \lim_N E_N(\beta) = \infty$.

(i) If there exists a $j_0$ such that $a_j = b_j$ for all $j \ge j_0$, then $E_N(\beta) - E_N(\alpha)$ is bounded;

(ii) if $a_j = O(b_j)$, then $E_N(\beta) = O(E_N(\alpha))$, as $N \to \infty$;

(iii) if $a_j = o(b_j)$, then $E_N(\beta) = o(E_N(\alpha))$, as $N \to \infty$;

(iv) if $a_j \asymp b_j$, then $E_N(\beta) \asymp E_N(\alpha)$, as $N \to \infty$;

(v) if $a_j \sim b_j$, then $E_N(\beta) \sim E_N(\alpha)$, as $N \to \infty$.

slide-48
SLIDE 48
Proof. Part (i) follows easily from (10), whereas (ii)-(v) follow from (i) and (12). For instance, to prove (v), we first fix an $\epsilon > 0$. Then $(1-\epsilon) b_j \le a_j \le (1+\epsilon) b_j$ for all $j \ge j_0(\epsilon)$. Thus, by part (i) there is an $M = M(\epsilon)$ such that

$$\frac{1}{1+\epsilon}\, E_N(\beta) - M \le E_N(\alpha) \le \frac{1}{1-\epsilon}\, E_N(\beta) + M, \qquad \text{for all } N \ge N_0(\epsilon).$$

If we divide by $E_N(\beta)$ and then let $N \to \infty$, we obtain (v), since $\epsilon$ is arbitrary and $\lim_N E_N(\beta) = \infty$.

slide-49
SLIDE 49

EXAMPLES

Example 1. The case $a_j = 1$ for all $j$ has already been analyzed in detail. This case can also provide us with an application of Theorem 3: if $\beta = \{b_n\}_{n=1}^{\infty}$ is a sequence such that $0 < \liminf_n b_n \le \limsup_n b_n < \infty$, then by part (iv) of Theorem 3, $E_N(\beta) \asymp \ln N$. If, in addition, $\lim_n b_n = b$ exists, then by part (v) of Theorem 3, $E_N(\beta) \sim b^{-1} \ln N$.

slide-50
SLIDE 50

Example 2. $a_j = j^p$, $p > 0$. In this case

$$L_p \stackrel{\text{def}}{=} L_1(\alpha) = \int_0^1 \Big[1 - \prod_{j=1}^{\infty} \big(1 - x^{j^p}\big)\Big] \frac{dx}{x}.$$

(Notice that $L_p$ decreases with $p$.) By Theorem 1, $L_p < \infty$. Now, in accordance with (12), we also need to estimate $A_N$. In this case

$$A_N = \sum_{n=1}^{N} n^p = \frac{N^{p+1}}{p+1} \left[1 + O\Big(\frac{1}{N}\Big)\right]$$

by the Euler-Maclaurin sum formula. In fact this formula gives the full asymptotic expansion of $A_N$; it even gives $A_N$ exactly, as a polynomial in $N$ of degree $p+1$, if $p$ is a positive integer. Therefore (12) implies

$$E[T_N] = A_N E_N(\alpha) = \frac{N^{p+1}}{p+1}\, L_p\, [1 + o(1)].$$

slide-51
SLIDE 51

The case $p = 1$ is of particular interest. From Euler's pentagonal-number formula, namely

$$\prod_{j=1}^{\infty} (1 - x^j) = 1 + \sum_{k=1}^{\infty} (-1)^k \big(x^{\omega(k)} + x^{\omega(-k)}\big), \qquad \omega(l) = \frac{3l^2 - l}{2}, \quad l = 0, \pm 1, \pm 2, \dots$$

(see, e.g., T.M. Apostol [1], Th. 14.3), we obtain

$$L_1 = \int_0^1 \sum_{k=1}^{\infty} (-1)^{k+1} \big(x^{\omega(k)} + x^{\omega(-k)}\big)\, \frac{dx}{x} = \sum_{k=1}^{\infty} (-1)^{k+1} \left(\frac{1}{\omega(k)} + \frac{1}{\omega(-k)}\right) = \sum_{k=1}^{\infty} \frac{12(-1)^{k+1}}{9k^2 - 1} = \frac{4\pi\sqrt{3}}{3} - 6.$$
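The closed form can be checked against a partial sum of the alternating series; since the terms decrease in absolute value, the truncation error is bounded by the first omitted term (a sketch; the name is ours):

```python
import math

def L1_partial_sum(terms=100_000):
    """Partial sum of sum_{k>=1} 12 (-1)^{k+1} / (9k^2 - 1)."""
    return sum(12 * (-1) ** (k + 1) / (9 * k * k - 1)
               for k in range(1, terms + 1))

closed_form = 4 * math.pi * math.sqrt(3) / 3 - 6   # approximately 1.2552
```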

slide-52
SLIDE 52

Furthermore, we can derive an upper bound for the error $\delta_N = L_1 - E_N(\alpha)$, as follows. For any $\lambda \in (0,1)$, we have

$$\delta_N = \int_0^{1-N^{-\lambda}} \Big[\prod_{j=1}^{N} (1 - x^j) - \prod_{j=1}^{\infty} (1 - x^j)\Big] \frac{dx}{x} + \int_{1-N^{-\lambda}}^{1} \prod_{j=1}^{N} (1 - x^j) \Big[1 - \prod_{j=N+1}^{\infty} (1 - x^j)\Big] \frac{dx}{x}$$

$$\le \int_0^{1-N^{-\lambda}} \Big|\prod_{j=1}^{N} (1 - x^j) - \Big(1 + \sum_{k=1}^{N} (-1)^k \big(x^{\omega(k)} + x^{\omega(-k)}\big)\Big)\Big|\, \frac{dx}{x} + \int_0^{1-N^{-\lambda}} \Big|\sum_{k=N+1}^{\infty} (-1)^k \big(x^{\omega(k)} + x^{\omega(-k)}\big)\Big|\, \frac{dx}{x} + \int_{1-N^{-\lambda}}^{1} \frac{dx}{x}$$

$$= \int_0^{1-N^{-\lambda}} \Big|\prod_{j=1}^{N} (1 - x^j) - \Big(1 + \sum_{k=1}^{N} (-1)^k \big(x^{\omega(k)} + x^{\omega(-k)}\big)\Big)\Big|\, \frac{dx}{x} + O(N^{-\lambda}).$$

The integrand in the last integral is bounded by $N x^{N+1}$ (see the proof of Th. 14.3 in [1]). Hence

slide-53
SLIDE 53

$$\delta_N \le \frac{N}{N+1} \Big(1 - \frac{1}{N^{\lambda}}\Big)^{N+1} + O(N^{-\lambda}) = O(N^{-\lambda}), \qquad \text{for any } \lambda \in (0,1).$$

Notice that the general error bound given in (18) does not cover this case. Furthermore, numerical evidence suggests that the order of the error $\delta_N$ is even smaller than $N^{-\lambda}$, $\lambda < 1$. In conclusion, we have

$$E[T_N] = \left(\frac{2\pi\sqrt{3}}{3} - 3\right) N(N+1)\, \big[1 + O(N^{-\lambda})\big], \qquad \text{for any } \lambda < 1. \tag{29}$$
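Estimate (29) can be probed numerically through (12): compute $E_N(\alpha)$ for $a_j = j$ via the integral (10), multiply by $A_N = N(N+1)/2$, and compare with the asymptotic prediction (a sketch; names, step sizes and the 2% tolerance are our choices):

```python
import math

def E_N_integral(a, T=60.0, steps=60_000):
    """E_N(alpha) from formula (10), by the trapezoidal rule on [0, T]."""
    h = T / steps
    def integrand(t):
        prod = 1.0
        for aj in a:
            prod *= 1.0 - math.exp(-aj * t)
        return 1.0 - prod
    s = 0.5 * (integrand(0.0) + integrand(T))
    s += sum(integrand(i * h) for i in range(1, steps))
    return s * h

N = 25
ET = (N * (N + 1) // 2) * E_N_integral(list(range(1, N + 1)))  # (12): A_N E_N
predicted = (2 * math.pi * math.sqrt(3) / 3 - 3) * N * (N + 1)
print(ET, predicted)
```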

slide-54
SLIDE 54

Example 3. $a_j = 1/j^p$, $p > 0$. In this case Theorem 1 implies $L_1(\alpha) = \infty$. If $f(x) = x^p$, then $f$ satisfies the assumptions (i)-(iii) and hence Theorem 2 applies. Therefore

$$E_N(\alpha) \sim N^p \ln N.$$

On the other hand,

$$A_N \sim \frac{N^{1-p}}{1-p} \quad \text{if } 0 < p < 1, \qquad A_N = H_N \sim \ln N \quad \text{if } p = 1, \qquad A_N \sim \zeta(p) \quad \text{if } p > 1,$$

where $\zeta(p)$ denotes the Riemann zeta function (the above asymptotics follow from the Euler-Maclaurin sum formula). Thus

$$E[T_N] \sim \frac{N}{1-p} \ln N \quad \text{if } 0 < p < 1, \qquad E[T_N] \sim N (\ln N)^2 \quad \text{if } p = 1, \qquad E[T_N] \sim \zeta(p) N^p \ln N \quad \text{if } p > 1.$$

Note that $p = 1$ corresponds to the so-called Zipf distribution, and indeed the asymptotic estimate here is in agreement with the known result for the Zipf distribution (see P. Flajolet et al. [10]).

slide-55
SLIDE 55

Example 4. $a_j = e^{pj}$, $b_j = e^{-pj}$, $p > 0$. For the sequence $\alpha = \{a_j\}_{j=0}^{\infty}$ we have, by Theorem 1, that $L_1(\alpha) < \infty$. Furthermore, by (18),

$$\delta_N = L_1(\alpha) - E_N(\alpha) \le \sum_{j=N+1}^{\infty} e^{-pj} = \frac{e^{-p(N+1)}}{1 - e^{-p}}.$$

Also,

$$A_N = \frac{e^{p(N+1)} - 1}{e^p - 1}.$$

Therefore, by (12),

$$E[T_N] = \frac{e^{p(N+1)}}{e^p - 1}\, L_1(\alpha) + O(1). \tag{30}$$

In the special case $a_j = 2^j$ we have (see J.M. Borwein and P.B. Borwein [5])

$$\prod_{j=0}^{\infty} \big(1 - x^{2^j}\big) = \sum_{k=0}^{\infty} (-1)^{\delta(k)} x^k,$$

where $\delta(k)$ is the number of ones in the binary expansion of $k$. Therefore, by (13),

$$L_1(\alpha) = \int_0^1 \Big[1 - \sum_{k=0}^{\infty} (-1)^{\delta(k)} x^k\Big] \frac{dx}{x} = \sum_{k=1}^{\infty} \frac{(-1)^{\delta(k)-1}}{k}.$$
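The product identity used here is exact at every finite truncation: $\prod_{j=0}^{J} (1 - x^{2^j})$ has coefficients $(-1)^{\delta(k)}$ for all $k < 2^{J+1}$, since each such $k$ has a unique binary representation among the exponents. A sketch verifying this by plain polynomial multiplication (names are ours):

```python
def poly_mul(p, q, cap):
    """Multiply coefficient lists, keeping degrees up to cap."""
    r = [0] * min(len(p) + len(q) - 1, cap + 1)
    for i, pi in enumerate(p):
        if pi:
            for j, qj in enumerate(q):
                if i + j <= cap:
                    r[i + j] += pi * qj
    return r

cap = 127                         # product over j = 0..6 has degree 2^7 - 1
prod = [1]
for j in range(7):                # factor (1 - x^{2^j})
    factor = [0] * (2 ** j + 1)
    factor[0], factor[-1] = 1, -1
    prod = poly_mul(prod, factor, cap)

thue_morse_signs = [(-1) ** bin(k).count("1") for k in range(cap + 1)]
```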

slide-56
SLIDE 56

Now, for the sequence $\beta = \{b_j\}_{j=0}^{\infty}$ we have $L_1(\beta) = \infty$. Furthermore, $f(x) = e^{-px}$ does not satisfy all the conditions (i)-(iii), thus we cannot use Theorem 2. However,

$$B_N = \sum_{j=0}^{N} b_j = \frac{1 - e^{-p(N+1)}}{1 - e^{-p}}$$

and

$$E_N(\beta) = E_N\big(e^{-pN}\alpha\big) = e^{pN} E_N(\alpha) = e^{pN} \big[L_1(\alpha) + O(e^{-pN})\big] = e^{pN} L_1(\alpha) + O(1).$$

Therefore, for the sequence $\beta$, the estimate of $E[T_N]$ will be the same as in (30). In fact, the sequences $\alpha$ and $\beta$ produce the same coupon probabilities. This follows from the fact that, for each $N$, if we let $c_N = e^{pN}$, then $\{a_j : 0 \le j \le N\} = \{c_N b_j : 0 \le j \le N\}$, i.e. the elements of the two truncated sequences are proportional to each other. Therefore, $\alpha$ and $\beta$ lead to the same asymptotics. The sequence $\beta$ was mentioned in order to illustrate (28) and to exemplify a situation not covered by Theorem 2.

slide-57
SLIDE 57

Example 5. If $\alpha$ is of the type $a_j = e^{j c_j}$, where $c_j \nearrow \infty$, then $L_1(\alpha) < \infty$ and

$$A_N = \sum_{j=1}^{N} a_j \sim a_N. \tag{31}$$

Examples of such sequences are $a_j = e^{j^r}$ with $r > 1$, $a_j = j^j$, and $a_j = j!$. For instance, if $a_j = j!$, then $A_N \sim N!$ and $E[T_N] \sim N!\, L_1(\alpha)$.

Example 6. $a_j = 1/j!$. Notice that $f(x) = \Gamma(x+1)$ does not satisfy the conditions (i)-(iii). However, (26) is valid with $\omega_j = j!$, thus $E_N(\alpha) = O(N!)$. If (28) is true, then $E_N(\alpha) \sim C \cdot N!$, in which case $E[T_N] \sim C e N!$.

slide-58
SLIDE 58

References

[1] T.M. Apostol, Introduction to Analytic Number Theory (Springer-Verlag, New York, 1976).

[2] L.E. Baum and P. Billingsley, Asymptotic distributions for the Coupon Collector's problem, Annals of Mathematical Statistics 36 (1965) 1835-1839.

[3] C.M. Bender and S.A. Orszag, Advanced Mathematical Methods for Scientists and Engineers (McGraw-Hill, New York, 1978).

[4] S. Boneh and V.G. Papanicolaou, General Asymptotic Estimates for the Coupon Collector Problem, Journal of Computational and Applied Mathematics 67 (2) (Mar. 1996) 277-289.

[5] J.M. Borwein and P.B. Borwein, Strange series and high precision fraud, American Mathematical Monthly 99 (7) (1992) 622-640.

[6] R.K. Brayton, On the asymptotic behavior of the number of trials necessary to complete a set with random selection, Journal of Mathematical Analysis and Applications 7 (1963) 31-61.

[7] F.N. David and D.E. Barton, Combinatorial Chance (Charles Griffin & Co., London, 1962).

[8] R. Durrett, Probability: Theory and Examples, Third Edition (Duxbury Advanced Series, Brooks/Cole-Thomson Learning, Belmont, CA, USA, 2005).

slide-59
SLIDE 59

[9] W. Feller, An Introduction to Probability Theory and Its Applications, Vol. II (John Wiley & Sons, New York, 1966).

[10] P. Flajolet, D. Gardy and L. Thimonier, Birthday paradox, coupon collectors, caching algorithms and self-organizing search, Discrete Applied Mathematics 39 (1992) 207-229.

[11] M.V. Hildebrand, The Birthday problem, a notice in the American Mathematical Monthly 100 (7) (1993) p. 643.

[12] L. Holst, On Birthday, Collectors', Occupancy and other classical Urn problems, International Statistical Review 54 (1986) 15-27.

[13] S. Janson, Limit theorems for some sequential occupancy problems, Journal of Applied Probability 20 (1983) 545-553.

[14] M. Karwan, V. Lotfi, J. Telgen, and S. Zionts, Redundancy in Mathematical Programming (Springer-Verlag, Berlin, 1983).

[15] D.J. Newman and L. Shepp, The Double Dixie Cup problem, American Mathematical Monthly 67 (1960) 58-61.

[16] S. Ross, A First Course in Probability, Seventh Edition (Pearson Prentice Hall, 2006).

[17] W. Rudin, Real and Complex Analysis (McGraw-Hill, New York, 1987).

[18] A. Torn and A. Zallinskas, Global Optimization, Lecture Notes in Computer Science, No. 350 (Springer-Verlag, New York, 1989).