Analysis of Approximate Median Selection
M. Hofri, Department of Computer Science, WPI
Collaborators:
Domenico Cantone & students, Università di Catania, Dipartimento di Matematica
Svante Janson, Department of Mathematics, Uppsala University

Finding the median efficiently: a difficult problem

A deterministic algorithm for the exact median was improved in 5/99 by Dor & Zwick; it requires (in the worst case) ≈ $2.942n$ comparisons. Extremely involved...

For the expected number of comparisons, Floyd & Rivest showed (1975) that it can be done in $(1.5 + o(1))n$. Cunto & Munro (1989): this bound is tight.

Our algorithm was developed in 1998 by Cantone, and only much later did we discover that several authors had formulated various analogues earlier, as early as 1978! It is deterministic, uses at most $1.5n$ comparisons, and the expected number is $\frac{4}{3}n$.

Major virtue: extremely easy to implement (and understand), but it only approximates the median.

Sicilian Median Selection

Example with n = 27. Each pass replaces every triplet by its median:

12 22 26 | 13 21 7 | 10 2 16 | 5 11 27 | 9 17 25 | 23 1 14 | 20 3 8 | 24 15 18 | 19 4 6
→ 22 13 10 | 11 17 14 | 8 18 6
→ 13 14 8
→ 13

This is performed in situ. Essentially the same algorithm can be done "on-line", processing a stream of values and using a work area of $4\log_3 n$ positions.
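The passes above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names `median_of_three` and `sicilian_median` are hypothetical, the input length is assumed to be a power of 3, and medians are overwritten into the front of a working copy rather than swapped in situ.

```python
def median_of_three(a, b, c):
    """Return the median of three values using 2 or 3 comparisons."""
    if a < b:
        if b < c:
            return b                  # a < b < c
        return c if a < c else a      # c <= b, so the median is max(a, c)
    if a < c:
        return a                      # b <= a < c
    return c if b < c else b          # c <= a, so the median is max(b, c)


def sicilian_median(values):
    """Approximate median of a sequence whose length is a power of 3."""
    data = list(values)               # working copy (assumption: not truly in situ)
    n = len(data)
    if n == 0:
        raise ValueError("empty input")
    while n > 1:
        if n % 3 != 0:
            raise ValueError("length must be a power of 3")
        # Slot i receives the median of the i-th triplet; reads stay ahead of writes.
        for i in range(n // 3):
            data[i] = median_of_three(data[3 * i], data[3 * i + 1], data[3 * i + 2])
        n //= 3
    return data[0]


example = [12, 22, 26, 13, 21, 7, 10, 2, 16, 5, 11, 27, 9, 17, 25,
           23, 1, 14, 20, 3, 8, 24, 15, 18, 19, 4, 6]
print(sicilian_median(example))       # 13, while the true median of 1..27 is 14
```

On the slide's example the sketch returns 13, one rank away from the true median 14.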

Analysis: Cost of search

Finding the median of three requires 2 comparisons in 2 permutations and 3 comparisons in 4 permutations, out of the 6 possible permutations. Hence $E[C_3] = 8/3$.

The expected total number of comparisons when looking in a list of size $n$:
\[ C_3(n) = \frac{n}{3}\cdot\frac{8}{3} + C_3\!\left(\frac{n}{3}\right), \qquad C_3(1) = 0. \]
Result: $C_3(n) = \frac{4}{3}(n-1)$.

The number of elements that are moved is similarly $E_3(n) = \frac{1}{3}(n-1)$.

The number of three-medians computed: $\frac{1}{2}(n-1)$.
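The comparison count can be checked mechanically. The sketch below assumes one particular decision tree for the median of three (compare the first two, then at most two more comparisons); the function names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations


def comparisons_for_median_of_three(a, b, c):
    """Comparisons spent by a simple decision tree on the triplet (a, b, c)."""
    if a < b:                          # 1st comparison
        return 2 if b < c else 3       # b < c settles it; otherwise compare a with c
    return 2 if a < c else 3           # a < c settles it; otherwise compare b with c


def expected_comparisons_per_triplet():
    """Average over the 6 equally likely orderings of three distinct values."""
    counts = [comparisons_for_median_of_three(*p) for p in permutations((1, 2, 3))]
    return Fraction(sum(counts), len(counts))


def expected_total_comparisons(n):
    """Unroll C_3(n) = (n/3)(8/3) + C_3(n/3), C_3(1) = 0, for n a power of 3."""
    if n == 1:
        return Fraction(0)
    return (Fraction(n, 3) * expected_comparisons_per_triplet()
            + expected_total_comparisons(n // 3))


for n in (3, 9, 27, 81):
    print(n, expected_total_comparisons(n), Fraction(4, 3) * (n - 1))
```

Both columns printed by the loop agree, which is the closed form $\frac{4}{3}(n-1)$ stated above.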

Analysis: Probabilities of selection

To show that the selected median, $X_n$, is likely to be close to the true median, we need to compute the distribution of the rank of the selected entry, $X_n$. Let $n = 3^r$. The key quantity is

$q^{(r)}_{a,b}$ = the number of permutations, out of the $n!$ possible ones, in which the entry which is the $a$th smallest in the array is: (i) selected, and (ii) has rank $b$ (= is the $b$th smallest) in the next set, that has $n/3 = 3^{r-1}$ entries.

The counting is performed in two steps:
1. Count permutations in which $a$ is chosen in the $b$th triplet, and all the entries chosen in the first $b-1$ triplets are smaller than $a$, and all the items chosen in the rightmost $n/3 - b$ triplets are larger than $a$.
2. Compensate for this restriction: multiply the result of step one by the number of rearrangements of such permutations:
\[ \frac{(n/3)!}{(b-1)!\,(n/3-b)!} = \frac{n}{3}\binom{n/3-1}{b-1}. \]

The first step is not that simple, and it produces the following expression,
\[ 2\,(a-1)!\,(n-a)!\,3^{a-b} \sum_i \binom{b-1}{i}\binom{n/3-b}{a-2b-i}\frac{1}{9^i}. \]
We find:
\[ q^{(r)}_{a,b} = 2n\,(a-1)!\,(n-a)!\,3^{a-b-1}\binom{n/3-1}{b-1} \times \sum_i \binom{b-1}{i}\binom{n/3-b}{a-2b-i}\frac{1}{9^i}. \]
The related probability, $p^{(r)}_{a,b} = q^{(r)}_{a,b}/n!$:
\[ p^{(r)}_{a,b} = \frac{2}{3}\,\frac{3^{-b}\binom{n/3-1}{b-1}}{3^{-a}\binom{n-1}{a-1}} \times \sum_i \binom{b-1}{i}\binom{n/3-b}{a-2b-i}\frac{1}{9^i}
= \frac{2}{3}\,\frac{3^{-b}\binom{n/3-1}{b-1}}{3^{-a}\binom{n-1}{a-1}} \times [z^{a-2b}]\left(1+\frac{z}{9}\right)^{b-1}(1+z)^{n/3-b}. \]
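As a sanity check on the count $q^{(r)}_{a,b}$, the closed form can be compared with a direct enumeration for $n = 9$. This is only a sketch: `q_closed_form` and `q_brute_force` are hypothetical names, and the enumeration visits all $9! = 362880$ permutations, so it takes a moment to run.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations
from math import comb, factorial


def q_closed_form(n, a, b):
    """Closed form for q_{a,b} with n = 3**r, 1 <= a <= n, 1 <= b <= n/3."""
    m = n // 3
    if a < 2 * b:                      # fewer than 2b-1 smaller entries: impossible
        return Fraction(0)
    total = Fraction(0)
    for i in range(b):                 # i = 0 .. b-1
        k = a - 2 * b - i
        if 0 <= k <= m - b:
            total += Fraction(comb(b - 1, i) * comb(m - b, k), 9 ** i)
    return (2 * n * factorial(a - 1) * factorial(n - a)
            * 3 ** (a - b - 1) * comb(m - 1, b - 1) * total)


def q_brute_force(n):
    """Count, over all permutations of 1..n, how often entry a is the
    b-th smallest among the n/3 triplet medians."""
    counts = Counter()
    for perm in permutations(range(1, n + 1)):
        medians = sorted(sorted(perm[i:i + 3])[1] for i in range(0, n, 3))
        for b, a in enumerate(medians, start=1):
            counts[a, b] += 1
    return counts


brute = q_brute_force(9)
mismatches = [(a, b) for a in range(1, 10) for b in range(1, 4)
              if brute[a, b] != q_closed_form(9, a, b)]
print(mismatches)                      # an empty list means the two counts agree
```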

Finally, $P^{(r)}_a$: the probability that the algorithm chooses $a$ from an array holding $1, \ldots, n = 3^r$.
\[ P^{(r)}_a = \sum_{b_r} p^{(r)}_{a,b_r}\,P^{(r-1)}_{b_r} = \sum_{b_r, b_{r-1}, \cdots, b_3} p^{(r)}_{a,b_r}\,p^{(r-1)}_{b_r,b_{r-1}} \cdots p^{(2)}_{b_3,2}, \qquad 2^{j-1} \le b_j \le 3^{j-1} - 2^{j-1} + 1. \]
Explicitly,
\[ P^{(r)}_a = \left(\frac{2}{3}\right)^{r} \frac{3^{a-1}}{\binom{n-1}{a-1}} \times \sum_{b_r, b_{r-1}, \cdots, b_3}\;\prod_{j=2}^{r}\;\sum_{i_j \ge 0} \binom{b_j-1}{i_j}\binom{3^{j-1}-b_j}{b_{j+1}-2b_j-i_j}\frac{1}{9^{i_j}}, \]
with $b_j \in [2^{j-1} \,..\, 3^{j-1} - 2^{j-1} + 1]$, $b_2 = 2$ and $b_{r+1} \equiv a$.

No known reduction... Numerical calculations produced:

n       r = log_3 n   Avg.        σ_d         σ_d / n^{2/3}
9       2             0.428571    0.494872    0.114375
27      3             1.475971    1.184262    0.131585
81      4             3.617240    2.782263    0.148619
243     5             8.096189    6.194667    0.159079
729     6             17.377167   13.282273   0.163979
2187    7             36.427027   27.826992   0.165158

Variance ratios for the median selection as a function of array size.

$d$ is the error of the approximation:
\[ d \equiv \left| X_n - \frac{n+1}{2} \right|. \]

What can we expect when $n$ grows?
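The first rows of the table can be recomputed from the recurrence $P^{(r)}_a = \sum_b p^{(r)}_{a,b} P^{(r-1)}_b$. The sketch below assumes that "Avg." is the mean of $d$ and that $\sigma_d$ is the standard deviation of $d$; the helper names are hypothetical, and exact rational arithmetic keeps it slow beyond a few levels.

```python
from fractions import Fraction
from math import comb, sqrt


def p(n, a, b):
    """p_{a,b} for n = 3**r: entry of rank a is a triplet median of rank b."""
    m = n // 3
    s = sum(Fraction(comb(b - 1, i) * comb(m - b, a - 2 * b - i), 9 ** i)
            for i in range(b) if 0 <= a - 2 * b - i <= m - b)
    return (Fraction(2, 3) * Fraction(3) ** (a - b)
            * comb(m - 1, b - 1) / comb(n - 1, a - 1) * s)


def selection_distribution(r):
    """Return {a: P_a} for n = 3**r, built level by level."""
    dist = {2: Fraction(1)}            # r = 1: the median of three has rank 2
    for level in range(2, r + 1):
        n = 3 ** level
        new = {}
        for a in range(1, n + 1):
            prob = sum(p(n, a, b) * pb for b, pb in dist.items())
            if prob:
                new[a] = prob
        dist = new
    return dist


def error_statistics(r):
    """Mean of d, standard deviation of d, and sigma_d / n^(2/3)."""
    n = 3 ** r
    dist = selection_distribution(r)
    mean = sum(pa * abs(Fraction(2 * a - n - 1, 2)) for a, pa in dist.items())
    second = sum(pa * Fraction(2 * a - n - 1, 2) ** 2 for a, pa in dist.items())
    sigma = sqrt(second - mean ** 2)
    return float(mean), sigma, sigma / n ** (2 / 3)


for r in (2, 3, 4):
    print(3 ** r, r, *("%.6f" % v for v in error_statistics(r)))
```

For $r = 2$ the distribution is $P_4 = P_6 = 3/14$, $P_5 = 4/7$, which gives a mean of $3/7 \approx 0.428571$ and $\sigma_d \approx 0.494872$, in agreement with the first row.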

[Figure: plot of the median probability distribution for n=27]

[Figure: plot of the median probability distribution for n=243]

To answer the last question we look at a "similar" situation, with $n$ independent random variables:
\[ \Xi = (\xi_1, \xi_2, \ldots, \xi_n), \qquad \xi_j \sim U(0,1). \]
$\Xi$ is a permutation of their sorted order, $S(\Xi)$:
\[ S(\Xi) = (\xi_{(1)} \le \xi_{(2)} \le \cdots \le \xi_{(n)}). \]
Observation: If the Sicilian algorithm operates on this permutation of $N_n$, and returns $X_n = k$, then running it on $\Xi$ would return $Y_n = \xi_{(k)}$.

The idea: $Y_n$ tracks $X_n/n$, but, due to the independence of the $n$ variables $\xi_i$, it has a simpler distribution.

How good is the tracking? Condition on the sampled value:
\[ E_S\!\left[\left(Y_n - \frac{1}{2} - \frac{X_n - \frac{n+1}{2}}{n}\right)^{2}\right]
= E_S\!\left[\left(Y_n - \frac{X_n - 1/2}{n}\right)^{2}\right]
= E_k\!\left[\left(\xi_{(k)} - \frac{k - 1/2}{n}\right)^{2}\right] < \frac{1}{4n}. \]
And the variance of $|D_n|/n$ is larger, and decreases more slowly!

We said $Y_n$ is simpler... How simple is it? Let $n = 3^r$ and
\[ F_r(x) \equiv \Pr(Y_n - 1/2 \le x), \quad -1/2 \le x \le 1/2, \qquad F_0(x) = x + 1/2. \]
Now we need a recurrence: $Y_{3n}$ is the median of 3 independent values $\sim Y_n$, hence
\[ F_{r+1}(x) = \Pr(Y_{3n} \le x + 1/2) = 3F_r^2(x)\,(1 - F_r(x)) + F_r^3(x) = 3F_r^2(x) - 2F_r^3(x). \]
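The recurrence for $F_r$ is easy to iterate and to compare with a simulation of $Y_n$. The sketch below is illustrative only; `F`, `sample_Y` and `empirical_F` are hypothetical names, and the simulation uses a fixed seed so that runs are repeatable.

```python
import random


def F(r, x):
    """F_r(x) = Pr(Y_n - 1/2 <= x) for n = 3**r and -1/2 <= x <= 1/2."""
    f = x + 0.5                        # F_0: a single U(0,1) value, shifted by 1/2
    for _ in range(r):
        f = 3 * f ** 2 - 2 * f ** 3    # CDF of the median of three i.i.d. copies
    return f


def sample_Y(r, rng):
    """Draw Y_n by running the triplet-median passes on 3**r uniform values."""
    values = [rng.random() for _ in range(3 ** r)]
    while len(values) > 1:
        values = [sorted(values[i:i + 3])[1] for i in range(0, len(values), 3)]
    return values[0]


def empirical_F(r, x, trials=20000, seed=1):
    """Monte Carlo estimate of F_r(x)."""
    rng = random.Random(seed)
    hits = sum(sample_Y(r, rng) - 0.5 <= x for _ in range(trials))
    return hits / trials


print(F(2, 0.1), empirical_F(2, 0.1))  # the two values should be close
```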

A simpler form is obtained by shifting $F_r(\cdot)$ by 1/2:
\[ G_r(x) \equiv F_r(x) - 1/2 \;\Longrightarrow\; G_0(x) = x. \]
We get our first key equation:
\[ G_{r+1}(x) = \frac{3}{2}G_r(x) - 2G_r^3(x). \]
But it is not interesting! It is satisfied by
\[ G_r(x) = \begin{cases} -\frac{1}{2}, & x < 0 \\ 0, & x = 0 \\ \frac{1}{2}, & x > 0. \end{cases} \]
This says: $D_n/n \to 0$, where $D_n \stackrel{\mathrm{def}}{=} X_n - \frac{n+1}{2}$.

Need change of scale. We showed,
\[ \mu^{2r}\, E\!\left[\left(Y_n - \frac{1}{2} - \frac{D_n}{n}\right)^{2}\right] \to 0 \qquad \forall \mu \in [0, \sqrt{3}). \]
Hence we can track $\mu^r (D_n/n)$ with $\mu^r (Y_n - 1/2)$. We pick a convenient value, $\mu = 3/2$, and show:
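The degenerate limit is visible numerically: iterating the key equation at any fixed $x \ne 0$ drives $G_r(x)$ to $\pm 1/2$. A minimal sketch, with the hypothetical name `G`:

```python
def G(r, x):
    """G_r(x): r applications of g -> (3/2) g - 2 g^3, starting from G_0(x) = x."""
    g = x
    for _ in range(r):
        g = 1.5 * g - 2 * g ** 3
    return g


for r in (0, 5, 10, 20, 40):
    print(r, G(r, 0.01), G(r, -0.01), G(r, 0.0))
```

Even a starting point as small as $x = 0.01$ saturates after a few dozen iterations, since each step multiplies small values by roughly 3/2. This is why a change of scale is needed before a nondegenerate limit can appear.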

Theorem [Svante Janson]
Let $n = 3^r$, $r \in \mathbb{N}$, and let $X_n$ be the approximate median of a random permutation of $N_n$. Then a random variable $X$ exists, such that
\[ \left(\frac{3}{2}\right)^{r} \frac{X_n - \frac{n+1}{2}}{n} \longrightarrow X, \]
where $X$ has the distribution $F(\cdot)$; with the same shift, $F(x) \equiv G(x) + 1/2$, we get the equation
\[ G\!\left(\tfrac{3}{2}x\right) = \tfrac{3}{2}G(x) - 2G^3(x), \qquad -\infty < x < \infty. \]
Moreover: The distribution function $F(\cdot)$ is strictly increasing throughout.

The value 3/2 is inherent in the problem!

The proof of the Theorem uses the following technical lemma.

Lemma. Let $a \in (0, \infty)$ and let $\varphi$ map $[0,a]$ into $[0,a]$. For $x > a$ we define $\varphi(x) = x$. Assume
(i) $\varphi(0) = 0$;
(ii) $\varphi(a) = a$;
(iii) $\varphi(x) > x$, for all $x \in (0,a)$;
(iv) $\varphi'(0) = \mu > 1$, and $\varphi'$ is continuous there; $\varphi(\cdot)$ is continuous and strictly increasing on $[0,a)$;
(v) $\varphi(x) < \mu x$, $x \in (0,a)$.
Let $\varphi_r(t) = \varphi(\varphi_{r-1}(t))$, the $r$th iterate of $\varphi(\cdot)$. Then
\[ \varphi_r(x/\mu^r) \longrightarrow \psi(x) \quad \text{as } r \to \infty, \qquad x \ge 0. \]
$\psi(x)$ is well defined, strictly monotonic increasing for all $x$, increases from 0 to $a$, and satisfies the equation $\psi(\mu x) = \varphi(\psi(x))$.

Proof: From Property (v): $\varphi(x/\mu^{r+1}) < x/\mu^r$. Since iteration preserves monotonicity,
\[ \varphi_{r+1}(x/\mu^{r+1}) = \varphi_r(\varphi(x/\mu^{r+1})) < \varphi_r(x/\mu^r). \]
The sequence is therefore decreasing in $r$ and bounded below by 0. Hence a limit $\psi(\cdot)$ exists.
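The lemma can be illustrated with the map from the key equation, $\varphi(x) = \frac{3}{2}x - 2x^3$ on $[0, \frac{1}{2}]$, so that $a = 1/2$ and $\mu = 3/2$. The sketch below uses hypothetical names and floating-point arithmetic; it is a numerical illustration, not a proof.

```python
MU = 1.5     # mu = phi'(0)
A = 0.5      # phi maps [0, A] into [0, A]


def phi(x):
    """phi(x) = (3/2) x - 2 x^3 on [0, A], extended by phi(x) = x for x > A."""
    return x if x > A else MU * x - 2 * x ** 3


def psi_r(r, x):
    """The r-th rescaled iterate phi_r(x / mu^r), for x >= 0."""
    t = x / MU ** r
    for _ in range(r):
        t = phi(t)
    return t


x = 1.0
print([round(psi_r(r, x), 6) for r in range(8)])     # decreasing in r, as in the proof

# phi_{r+1}(mu x / mu^{r+1}) = phi(phi_r(x / mu^r)) holds at every finite r,
# which in the limit gives psi(mu x) = phi(psi(x)).
print(psi_r(21, MU * 0.3), phi(psi_r(20, 0.3)))
```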

The properties of $\psi(x)$ depend on the behavior of $\varphi(\cdot)$ near $x = 0$. Since $\varphi'(x)$ is continuous at $x = 0$, $\psi(\cdot)$ is continuous throughout. Since it is bounded, the convergence is uniform on $[0, \infty)$. Hence, since $\varphi(\cdot)$ and all its iterates are strictly monotonic, so is $\psi(\cdot)$ itself.

We have then the equation
\[ G\!\left(\tfrac{3}{2}x\right) = \tfrac{3}{2}G(x) - 2G^3(x), \qquad -\infty < x < \infty, \]
but we have no explicit solution for it. What can we do? Several things.

We can calculate a power expansion for it. From $G_0(\cdot)$ and the iteration, all $G_r(\cdot)$ are odd, hence we can write
\[ G(x) = \sum_{k \ge 1} b_k x^{2k-1}. \]
$b_1$ is available from the iteration: the derivatives of $G_r(x/\mu^r)$ at $x = 0$ are all 1, hence this is also the derivative there of $G(x)$. Successive coefficients are easy to calculate.
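Matching the coefficient of $x^{2k-1}$ on both sides of the functional equation gives $b_k \left(\mu^{2k-1} - \mu\right) = -2 \sum_{i+j+l=k+1} b_i b_j b_l$, where every index on the right is smaller than $k$. The sketch below solves this recursively in exact arithmetic; the function name is hypothetical, and the recursion is derived here from the equation above rather than quoted from the slides.

```python
from fractions import Fraction


def series_coefficients(count, mu=Fraction(3, 2)):
    """Return [b_1, ..., b_count] for G(x) = sum_k b_k x^(2k-1), with b_1 = 1."""
    b = [None, Fraction(1)]                        # b[0] is an unused placeholder
    for k in range(2, count + 1):
        # Coefficient of x^(2k-1) in G(x)^3: index triples with i + j + l = k + 1.
        cube = sum(b[i] * b[j] * b[k + 1 - i - j]
                   for i in range(1, k)
                   for j in range(1, k - i + 1))
        b.append(-2 * cube / (mu ** (2 * k - 1) - mu))
    return b[1:]


print(series_coefficients(4))
```

The first coefficients come out as $b_1 = 1$, $b_2 = -16/15$ and $b_3 = 1024/975$.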
