

  1. Optimal interval clustering: Application to Bregman clustering and statistical mixture learning
Frank Nielsen¹, Richard Nock²
¹ Sony Computer Science Laboratories / École Polytechnique
² UAG-CEREGMIA / NICTA
May 2014
© 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc.

  2. Hard clustering: Partitioning the data set
◮ Partition X = {x_1, ..., x_n} ⊂ 𝒳 into k clusters C_1 ⊂ X, ..., C_k ⊂ X with X = ∪_{i=1}^k C_i.
◮ Center-based (prototype) hard clustering: k-means [2], k-medians, k-center, ℓ_r-center [10], etc.
◮ Model-based hard clustering: statistical mixtures maximizing the complete likelihood (prototype = model parameter).
◮ k-means: NP-hard when d > 1 and k > 1 [11, 7, 1].
◮ k-medians and k-center: NP-hard [12] (1984).
◮ In 1D, k-means is polynomial [3, 15]: O(n²k).

  3. Euclidean 1D k-means
◮ 1D k-means [8] has a contiguous optimal partition: each cluster is an interval.
◮ Solved by enumerating all C(n−1, k−1) contiguous partitions in 1D (1958) — far fewer than the Stirling numbers of the second kind S(n, k) that count all partitions.
◮ Polynomial, in time O(n²k), using Dynamic Programming (DP) [3] (sketched in 1973 in two pages).
◮ R package Ckmeans.1d.dp [15] (2011).

  4. Interval clustering: Structure
Sort X ⊂ 𝒳 with respect to a total order < on 𝒳 in O(n log n). The output is represented by:
◮ k intervals I_i = [x_{l_i}, x_{r_i}] such that C_i = I_i ∩ X,
◮ or better, k − 1 delimiters l_i (i ∈ {2, ..., k}), since r_i = l_{i+1} − 1 for i < k, r_k = n, and l_1 = 1:
[x_1 ... x_{l_2 − 1}] [x_{l_2} ... x_{l_3 − 1}] ... [x_{l_k} ... x_n]
        C_1                   C_2           ...          C_k

  5. Objective function for interval clustering
Scalars x_1 < ... < x_n are partitioned contiguously into k clusters: C_1 < ... < C_k.
Clustering objective function:
min e_k(X) = ⊕_{j=1}^k e_1(C_j)
e_1(·): intra-cluster cost/energy
⊕: inter-cluster operator (commutative, associative)
Note: the optimal partition need not be unique; for n = kp + 1 equally spaced 1D points there are k different optimal clustering partitions.

  6. Examples of objective functions
In arbitrary dimension 𝒳 = ℝ^d:
◮ ℓ_r-clustering (r ≥ 1): ⊕ = Σ, with
e_1(C_j) = min_{p ∈ 𝒳} Σ_{x ∈ C_j} d(x, p)^r
(the argmin, i.e., the prototype p_j, is the same whether or not we take the power 1/r of the sum).
Euclidean ℓ_r-clustering: r = 1 gives medians, r = 2 gives means.
◮ k-center (limit r → ∞): ⊕ = max, with e_1(C_j) = min_{p ∈ 𝒳} max_{x ∈ C_j} d(x, p).
◮ Discrete clustering: the search space of the min is C_j instead of 𝒳.
Note that in 1D, the ℓ_s-norm distance is always d(p, q) = |p − q|, independent of s ≥ 1.

  7. Optimal interval clustering by Dynamic Programming
X_{j,i} = {x_j, ..., x_i} (j ≤ i), X_i = X_{1,i} = {x_1, ..., x_i}
E = [e_{i,m}]: n × k cost matrix, O(n × k) memory, with e_{i,m} = e_m(X_i).
Optimality equation:
e_{i,m} = min_{m ≤ j ≤ i} { e_{j−1,m−1} ⊕ e_1(X_{j,i}) }
with ⊕ the associative/commutative operator (+ or max).
Initialize with e_{i,1} = e_1(X_i).
Compute E column by column, from left to right.
The best clustering cost is found at e_{n,k}.
Time: n × k entries × O(n) candidates × T_1(n) per e_1 evaluation = O(n²k T_1(n)); O(nk) memory.
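The recurrence above can be sketched as follows, here for 1D k-means (⊕ = +, e_1 = sum of squared deviations, evaluated naively in O(n); 0-based indices and function names are my own, not from the slides):

```python
# DP for optimal interval clustering of sorted 1D data (k-means cost).

def e1(xs, j, i):
    """Cost e1(X_{j,i}) for the interval x_j..x_i (0-based, inclusive)."""
    pts = xs[j:i + 1]
    mu = sum(pts) / len(pts)
    return sum((x - mu) ** 2 for x in pts)

def interval_clustering_cost(xs, k):
    """Return the n x k DP table E; the optimal cost is E[n-1][k-1]."""
    n = len(xs)
    INF = float("inf")
    E = [[INF] * k for _ in range(n)]
    for i in range(n):
        E[i][0] = e1(xs, 0, i)              # initialization e_{i,1}
    for m in range(1, k):                   # columns left to right
        for i in range(m, n):
            E[i][m] = min(E[j - 1][m - 1] + e1(xs, j, i)
                          for j in range(m, i + 1))
    return E

xs = sorted([1.0, 1.2, 5.0, 5.1, 9.0, 9.2])
E = interval_clustering_cost(xs, 3)
print(round(E[-1][-1], 3))  # → 0.045 (three tight pairs)
```

Replacing the naive e_1 by an O(1) evaluation (see the LUT/SAT slides) brings the total down to O(n²k).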

  8. Retrieving the solution: Backtracking
Use an auxiliary matrix S = [s_{i,j}] storing the argmin of each DP entry. Backtrack in O(k) time.
◮ The left index l_k of C_k is stored at s_{n,k}: l_k = s_{n,k}.
◮ Iteratively retrieve the previous left interval indexes at entries l_j = s_{l_{j+1} − 1, j} for j = k − 1, ..., 1.
Note that l_j − 1 = n − Σ_{l=j}^k n_l and l_j − 1 = Σ_{l=1}^{j−1} n_l.
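The O(k) backtrack can be sketched as below; `S[i][m]` is a hypothetical 1-based argmin table where entry (i, m) stores the left index j minimizing e_{i,m}:

```python
def backtrack(S, n, k):
    """Recover the left delimiters l_1, ..., l_k from the argmin table S."""
    left = [0] * (k + 1)          # left[m] = l_m (1-based); left[0] unused
    i = n
    for m in range(k, 0, -1):
        left[m] = S[i][m]         # l_m = s_{i,m}
        i = left[m] - 1           # continue from the element before C_m
    return left[1:]

# Toy example: n = 6 points split as C1 = x1..x2, C2 = x3..x4, C3 = x5..x6,
# i.e., delimiters l = (1, 3, 5). We hand-build only the S entries the
# backtrack visits; unused entries stay None.
n, k = 6, 3
S = [[None] * (k + 1) for _ in range(n + 1)]
S[6][3] = 5                       # l_3 = s_{6,3}
S[4][2] = 3                       # l_2 = s_{4,2}
S[2][1] = 1                       # l_1 = s_{2,1}
print(backtrack(S, n, k))         # → [1, 3, 5]
```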

  9. Optimizing time with a Look-Up Table (LUT)
Save time when computing e_1(X_{j,i}), since we perform n × k × O(n) such computations.
Look-Up Table (LUT): add an extra n × n matrix E_1 with E_1[j][i] = e_1(X_{j,i}). Build it in O(n² T_1(n)) time; the DP then runs in O(n²k), for a total of O(n²(k + T_1(n))).
→ but this requires a quadratic amount of memory (prohibitive for n > 10000).
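For the squared-error cost, the quadratic-memory LUT can in fact be avoided: two O(n) prefix-sum arrays (Summed Area Tables, used later in the deck) already give e_1(X_{j,i}) in O(1), since the interval cost is Σ x² − (Σ x)²/m. A sketch (names are mine):

```python
from itertools import accumulate

def make_interval_cost(xs):
    """Precompute prefix sums so that e1(X_{j,i}) (1-based, inclusive)
    costs O(1) per query, using e1 = sum(x^2) - (sum x)^2 / m."""
    S1 = [0.0] + list(accumulate(xs))                 # prefix sums of x
    S2 = [0.0] + list(accumulate(x * x for x in xs))  # prefix sums of x^2
    def e1(j, i):
        m = i - j + 1
        sx = S1[i] - S1[j - 1]
        sxx = S2[i] - S2[j - 1]
        return sxx - sx * sx / m
    return e1

e1 = make_interval_cost([1.0, 2.0, 3.0, 10.0])
print(e1(1, 3))   # squared-error cost of {1, 2, 3} → 2.0
```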

  10. DP solver with cluster size constraints
n_i^− and n_i^+: lower/upper bound constraints on n_i = |C_i|, with Σ_{l=1}^k n_l^− ≤ n and Σ_{l=1}^k n_l^+ ≥ n.
When there are no constraints: add the dummy constraints n_i^− = 1 and n_i^+ = n − k + 1.
n_m = |C_m| = i − j + 1 must satisfy n_m^− ≤ n_m ≤ n_m^+
→ j ≤ i + 1 − n_m^− and j ≥ i + 1 − n_m^+.
e_{i,m} = min over max{1 + Σ_{l=1}^{m−1} n_l^−, i + 1 − n_m^+} ≤ j ≤ i + 1 − n_m^− of { e_{j−1,m−1} ⊕ e_1(X_{j,i}) }
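The constrained candidate range for j translates directly into code; a sketch with a hypothetical helper (1-based indices, as on the slide):

```python
def j_range(i, m, n_minus, n_plus):
    """Valid left indices j for cluster C_m ending at x_i, given size
    bounds n_minus[l-1] <= n_l <= n_plus[l-1] (1-based clusters)."""
    lo = max(1 + sum(n_minus[:m - 1]), i + 1 - n_plus[m - 1])
    hi = i + 1 - n_minus[m - 1]
    return range(lo, hi + 1)

# k = 3 clusters, each of size between 2 and 4; cluster C_2 ending at i = 6:
print(list(j_range(6, 2, [2, 2, 2], [4, 4, 4])))  # → [3, 4, 5]
```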

  11. Model selection from the DP table
m(k) = e_k(X)/e_1(X) decreases with k and reaches its minimum when k = n.
Model selection: a trade-off; choose the best model among all models with k ∈ [1, n].
Regularized objective function: e'_k(X) = e_k(X) + f(k), where f(k) is related to the model complexity.
Computing the DP table once for k = n yields the costs for all k = 1, ..., n and avoids redundant computations. Then compute the criterion on the last row (indexed by n) and choose the argmin of e'_k.
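Given the last DP row e_{n,k} for k = 1, ..., n, the selection step is just an argmin over penalized costs. A sketch with an illustrative linear penalty f(k) = λk (λ and the costs below are made-up values, not from the slides):

```python
def select_model(last_row, penalty):
    """last_row[k-1] = e_k(X); return the k minimizing e_k(X) + penalty(k)."""
    scores = [ek + penalty(k) for k, ek in enumerate(last_row, start=1)]
    return min(range(1, len(last_row) + 1), key=lambda k: scores[k - 1])

# e_k decreases with k; the penalty f(k) = 2.0 * k trades fit for complexity.
last_row = [10.0, 3.0, 1.5, 1.0, 0.8]
print(select_model(last_row, lambda k: 2.0 * k))  # → 2
```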

  12. A Voronoi cell condition for DP optimality
elements → interval clusters → prototypes
prototypes → interval clusters (via the Voronoi diagram)
Voronoi diagram: partition of 𝒳 with respect to P = {p_1, ..., p_k}.
Voronoi cell: V(p_j) = {x ∈ 𝒳 : d^r(x, p_j) ≤ d^r(x, p_l) ∀ l ∈ {1, ..., k}}.
Since x^r is a monotonically increasing function on ℝ+, this is equivalent to V'(p_j) = {x ∈ 𝒳 : d(x, p_j) < d(x, p_l)}.
DP guarantees an optimal clustering when, for all P, each V'(p_j) is an interval; the case k = 2 exhibits the Voronoi bisector.

  13. 1-mean (centroid): O(n) time
min_p Σ_{i=1}^n (x_i − p)²
D(x, p) = (x − p)², D'(x, p) = 2(x − p), D''(x, p) = 2
Convex optimization (existence and uniqueness of the solution):
Σ_{i=1}^n D'(x_i, p) = 0 ⇒ Σ_{i=1}^n x_i − np = 0
Center of mass: p = (1/n) Σ_{i=1}^n x_i (barycenter).
Extends to Bregman divergences: D_F(x, p) = F(x) − F(p) − (x − p)F'(p).

  14. 2-means: O(n log n) time
Find x_{l_2} (n − 1 potential locations for the separator, from x_2 to x_n):
min_{l_2} { e_1(C_1) + e_1(C_2) }
Sweep from left to right, l_2 = 2, ..., n, updating the cost E_2(l+1) from E_2(l) in constant time (SATs also give O(1)):
E_2(l) = e_2(x_1 ... x_{l−1} | x_l ... x_n)
μ_1(l+1) = ((l − 1) μ_1(l) + x_l) / l
μ_2(l+1) = ((n − l + 1) μ_2(l) − x_l) / (n − l)
v_1(l+1) = Σ_{i=1}^l (x_i − μ_1(l+1))² = Σ_{i=1}^l x_i² − l μ_1²(l+1)
ΔE_2(l) = ((l − 1)/l) (μ_1(l) − x_l)² − ((n − l + 1)/(n − l)) (μ_2(l) − x_l)²
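The sweep above can be sketched as follows (my own implementation of the incremental update, assuming the standard add/remove rules for the sum of squared deviations; the function name is illustrative):

```python
def two_means(xs):
    """Optimal 1D 2-means split of sorted xs by one left-to-right sweep;
    cluster costs are updated incrementally in O(1) per candidate split."""
    n = len(xs)
    best_cost, best_l = float("inf"), None
    mu1, v1 = xs[0], 0.0             # C1 = {x_1}
    mu2 = sum(xs[1:]) / (n - 1)      # C2 = {x_2, ..., x_n}
    v2 = sum((x - mu2) ** 2 for x in xs[1:])
    for l in range(2, n + 1):        # split: C1 = x_1..x_{l-1}, C2 = x_l..x_n
        cost = v1 + v2
        if cost < best_cost:
            best_cost, best_l = cost, l
        if l == n:
            break
        x = xs[l - 1]                # move x_l from C2 into C1
        m1, m2 = l - 1, n - l + 1    # current sizes of C1, C2
        v1 += (m1 / (m1 + 1)) * (x - mu1) ** 2
        mu1 = (m1 * mu1 + x) / (m1 + 1)
        v2 -= (m2 / (m2 - 1)) * (x - mu2) ** 2
        mu2 = (m2 * mu2 - x) / (m2 - 1)
    return best_l, best_cost

print(two_means([1.0, 1.1, 1.2, 8.0, 8.1]))  # best split before x_4
```

The sweep itself is linear; the O(n log n) bound comes from the initial sort.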

  15. 2-means: Experiments
Timings on an Intel i7-4800 (Windows 7):
n        | Brute force | SAT   | Incremental
300000   | 155.022     | 0.010 | 0.0091
1000000  | 1814.44     | 0.018 | 0.015
Do we need sorting, and hence Ω(n log n) time? (k = 1 takes linear time.)
Note that MaxGap does not yield the separator (because the centroid is the minimizer of the sum of squared distances, not of the gap).

  16. Optimal 1D Bregman k-means
Bregman information [2] e_1 (generalizes the cluster variance):
e_1(C_j) = min_{p_j} Σ_{x_l ∈ C_j} w_l B_F(x_l : p_j).   (1)
Expressed as [14]:
e_1(C_j) = (Σ_{x_l ∈ C_j} w_l F(x_l)) + (Σ_{x_l ∈ C_j} w_l)(p_j F'(p_j) − F(p_j)) − F'(p_j)(Σ_{x_l ∈ C_j} w_l x_l)
Preprocess using Summed Area Tables [6] (SATs): S_1(j) = Σ_{l=1}^j w_l, S_2(j) = Σ_{l=1}^j w_l x_l, and S_3(j) = Σ_{l=1}^j w_l F(x_l), built in O(n) time.
Then evaluate the Bregman information e_1(X_{j,i}) in constant time O(1). For example, Σ_{l=j}^i w_l F(x_l) = S_3(i) − S_3(j − 1), with S_3(0) = 0.
Bregman Voronoi diagrams have connected cells [4], thus DP yields an optimal interval clustering.
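A sketch of the O(1) interval cost with the three SATs (illustrative names; uses the fact that for Bregman divergences the optimal prototype is the weighted centroid p_j = Σ w_l x_l / Σ w_l):

```python
from itertools import accumulate

def bregman_interval_cost(xs, ws, F, Fprime):
    """Precompute SATs S1, S2, S3; return e1(X_{j,i}) (1-based) in O(1)."""
    S1 = [0.0] + list(accumulate(ws))
    S2 = [0.0] + list(accumulate(w * x for w, x in zip(ws, xs)))
    S3 = [0.0] + list(accumulate(w * F(x) for w, x in zip(ws, xs)))
    def e1(j, i):
        W = S1[i] - S1[j - 1]          # sum of weights
        Wx = S2[i] - S2[j - 1]         # sum of w_l x_l
        WF = S3[i] - S3[j - 1]         # sum of w_l F(x_l)
        p = Wx / W                     # prototype = weighted centroid
        return WF + W * (p * Fprime(p) - F(p)) - Fprime(p) * Wx
    return e1

# With F(x) = x^2 the Bregman divergence is the squared Euclidean
# distance, so e1 reduces to the (weighted) cluster variance cost.
e1 = bregman_interval_cost([1.0, 2.0, 3.0], [1.0, 1.0, 1.0],
                           lambda x: x * x, lambda x: 2 * x)
print(e1(1, 3))  # → 2.0
```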

  17. Exponential families in statistics
Family of probability distributions: F = {p_F(x; θ) : θ ∈ Θ}
Exponential families [13]:
p_F(x | θ) = exp(t(x)θ − F(θ) + k(x))
For example, the univariate Rayleigh distribution R(σ):
t(x) = x², k(x) = log x, θ = −1/(2σ²), η = −1/θ, F(θ) = log(−1/(2θ)) and F*(η) = −1 + log(2/η).

  18. Uniorder exponential families: MLE
Maximum Likelihood Estimator (MLE) [13]:
e_1(X_{j,i}) = l̂(x_j, ..., x_i) = F*(η̂_{j,i}) + (1/(i − j + 1)) Σ_{l=j}^i k(x_l),
with η̂_{j,i} = (1/(i − j + 1)) Σ_{l=j}^i t(x_l).
By the change of variable y_l = t(x_l), and not accounting for the Σ k(x_l) terms, which are constant for any clustering, we get
e_1(X_{j,i}) ≡ F*( (1/(i − j + 1)) Σ_{l=j}^i y_l )
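For the Rayleigh family of slide 17 (t(x) = x²; I read the garbled conjugate as F*(η) = −1 + log(2/η), which matches the Legendre transform of F(θ) = log(−1/(2θ))), the per-interval cost reduces to evaluating F* at the average of the y_l = t(x_l). A sketch:

```python
import math

def rayleigh_interval_cost(xs, j, i):
    """MLE-based interval cost e1(X_{j,i}) (1-based, inclusive) for the
    Rayleigh family: F*(eta) evaluated at the mean of y_l = t(x_l) = x_l^2.
    The constant sum of k(x_l) = log x_l terms is dropped."""
    ys = [x * x for x in xs[j - 1:i]]
    eta_hat = sum(ys) / len(ys)
    return -1.0 + math.log(2.0 / eta_hat)   # F*(eta) for Rayleigh

# The F* term separates a small-scale interval from the full mixed range:
xs = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(rayleigh_interval_cost(xs, 1, 3) > rayleigh_interval_cost(xs, 1, 6))
```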
