
On the Approximability of Information Theoretic Clustering



  1. On the Approximability of Information Theoretic Clustering
     Ferdinando Cicalese, U. Verona
     Eduardo Laber, PUC-RIO
     Lucas Murtinho, PUC-RIO
     POSTER 165, Pacific Ballroom
     [Figure: plot of H(X); axis ticks at 0, 0.25, 0.5, 0.75, 1]

  2. Impurity Measures
     • Maps a vector v in R^d into a non-negative value
     • The more homogeneous v is with respect to its components (the more evenly its mass is spread), the larger the impurity
       – (1,0,0,19): small impurity
       – (5,5,5,5): large impurity
     • Well-known impurity measures
       – Entropy: $I_{Ent}(v) = \|v\|_1 \sum_{i=1}^{d} \frac{v_i}{\|v\|_1} \log \frac{\|v\|_1}{v_i}$
       – Gini: $I_{Gini}(v) = \|v\|_1 \sum_{i=1}^{d} \frac{v_i}{\|v\|_1} \left(1 - \frac{v_i}{\|v\|_1}\right)$
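The two measures on this slide can be evaluated directly on the slide's own example vectors. The following is a minimal sketch, not code from the presentation: the function names are made up for illustration, and the logarithm is assumed to be base 2 (the slide does not state the base).

```python
import math

def entropy_impurity(v):
    """||v||_1 * sum_i (v_i/||v||_1) * log2(||v||_1 / v_i). Base 2 is an assumption."""
    total = sum(v)
    if total == 0:
        return 0.0
    # Zero components are skipped: x * log(1/x) tends to 0 as x tends to 0.
    return sum(x * math.log2(total / x) for x in v if x > 0)

def gini_impurity(v):
    """||v||_1 * sum_i (v_i/||v||_1) * (1 - v_i/||v||_1)."""
    total = sum(v)
    if total == 0:
        return 0.0
    return sum(x * (1 - x / total) for x in v)

skewed = (1, 0, 0, 19)   # mass concentrated in one component
uniform = (5, 5, 5, 5)   # mass spread evenly

print(gini_impurity(skewed), gini_impurity(uniform))
print(entropy_impurity(skewed), entropy_impurity(uniform))
```

Both vectors have the same 1-norm (20), so the difference in impurity comes only from how the mass is distributed: the skewed vector scores lower under both measures.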

  3. Clustering with minimum impurity
     Input
     • V: set of non-negative vectors in R^d
     • I: impurity measure
     • k: number of clusters
     Goal
     • Partition V into k groups, P = (V^(1), ..., V^(k)), so that the impurity of the partition,
       $I(P) = \sum_{i=1}^{k} I(V^{(i)})$, is the minimum possible
     • $I(V^{(i)})$: impurity of the sum of the vectors in V^(i)
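The objective on this slide can be sketched in a few lines: sum the vectors of each cluster componentwise, apply the impurity measure to each sum, and add up the results. The helper names and the toy input below are hypothetical, and Gini is used only as a convenient choice of impurity measure.

```python
def gini_impurity(v):
    """||v||_1 * sum_i (v_i/||v||_1) * (1 - v_i/||v||_1)."""
    total = sum(v)
    if total == 0:
        return 0.0
    return sum(x * (1 - x / total) for x in v)

def partition_impurity(clusters, impurity):
    """Sum, over clusters, of the impurity of the componentwise sum of the cluster."""
    result = 0.0
    for cluster in clusters:
        summed = [sum(column) for column in zip(*cluster)]
        result += impurity(summed)
    return result

# Toy input (made up): four vectors in R^2, to be split into k = 2 groups.
by_dominant = [[(4, 0), (3, 1)], [(0, 5), (1, 4)]]   # similar vectors together
mixed = [[(4, 0), (0, 5)], [(3, 1), (1, 4)]]         # dissimilar vectors together

print(partition_impurity(by_dominant, gini_impurity))
print(partition_impurity(mixed, gini_impurity))
```

Grouping vectors whose mass sits in the same component keeps each cluster sum skewed, which is why the first partition has the lower impurity.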

  4. Applications / Motivations
     • Generalizes clustering using KL-divergence
       – Entropy impurity and KL-divergence of a clustering differ by a constant factor
     • Clustering probability distributions
     • Clustering nominal attributes in decision tree / random forest construction
     • Channel Quantizer Design [Inf. Theory]

  5. Our Contributions
     Approximation Algorithms
     • 3-approximation for Gini in linear time (arbitrary k)
     • $O(\log^2(\min\{d,k\}))$-approximation for Entropy in polytime
       – First algorithm with approximation independent of n that does not make assumptions on the input domain

  6. Our Contributions
     Approximation Algorithms
     Projecting the vectors onto dimension k incurs a small additive loss
     • 3-approximation for Gini in linear time (arbitrary k)
     • $O(\log^2(\min\{d,k\}))$-approximation for Entropy in polytime
       – First algorithm with approximation independent of n that does not make assumptions on the input domain

  7. Our Contributions
     Approximation Algorithms
     Projecting the vectors onto dimension k incurs a small additive loss
     • 3-approximation for Gini in linear time (arbitrary k)
       – Each cluster is pure: all vectors have the same largest component
     • $O(\log^2(\min\{d,k\}))$-approximation for Entropy in polytime
       – First algorithm with approximation independent of n that does not make assumptions on the input domain

  8. Our Contributions
     Approximation Algorithms
     Projecting the vectors onto dimension k incurs a small additive loss
     • 3-approximation for Gini in linear time (arbitrary k)
       – Each cluster is pure: all vectors have the same largest component
     • $O(\log^2(\min\{d,k\}))$-approximation for Entropy in polytime
       – First algorithm with approximation independent of n that does not make assumptions on the input domain
       – There is a clustering with exactly one non-pure cluster and impurity $O(\log^2 d) \cdot OPT$
       – Find this clustering in a 2-dim projection using DP
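The notion of a pure cluster (all vectors share the same largest component) can be illustrated by grouping vectors by the index of their largest component. This is only a sketch of the purity definition, not the authors' approximation algorithm: it may produce up to d groups rather than exactly k, and the function name and tie-breaking rule (smallest index wins) are assumptions made here.

```python
from collections import defaultdict

def group_by_dominant_component(vectors):
    """Map each dominant index to the vectors whose largest component is at that index."""
    groups = defaultdict(list)
    for v in vectors:
        # max() returns the first maximal index, so ties go to the smallest index.
        dominant = max(range(len(v)), key=lambda i: v[i])
        groups[dominant].append(v)
    return dict(groups)

# Toy input (made up): four vectors in R^3.
vectors = [(4, 0, 1), (3, 1, 0), (0, 5, 1), (1, 4, 0)]
for index, members in sorted(group_by_dominant_component(vectors).items()):
    print(index, members)
```

Every group returned this way is pure by construction; turning such groups into exactly k clusters with a provable guarantee is the part the slides attribute to the approximation algorithms.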

  9. Our Contributions
     APX-Hardness for Entropy
     • Reduction from c-gap vertex cover in cubic graphs
     • Solves open question from [Chaudhuri and McGregor, COLT08] and [Ackermann et al., ECCC11]

  10. Our Contributions
      APX-Hardness for Entropy
      • Reduction from c-gap vertex cover in cubic graphs
        – Edges are mapped to vectors with two 1's, e.g. 0..010…010...00 and 0..000…010...01
      • Theorem. Let $k'(G,k) = (3 \log 3)\,|E| + 6(1 - \log 3)\,k$. Then
        – MinVertexCover ≤ k ⇒ Opt-Impurity ≤ k'(G,k)
        – MinVertexCover > ck ⇒ Opt-Impurity > c'k'(G,k)
      • Solves open question from [Chaudhuri and McGregor, COLT08] and [Ackermann et al., ECCC11]

  11. Our Contributions
      APX-Hardness for Entropy
      • Reduction from c-gap vertex cover in cubic graphs
        – Edges are mapped to vectors with two 1's, e.g. 0..010…010...00 and 0..000…010...01
      • Theorem. Let $k'(G,k) = (3 \log 3)\,|E| + 6(1 - \log 3)\,k$. Then
        – MinVertexCover ≤ k ⇒ Opt-Impurity ≤ k'(G,k)
        – MinVertexCover > ck ⇒ Opt-Impurity > c'k'(G,k)
      • Lemma. G cubic and min-VertexCover ≤ k ⇒ G decomposes into stars of sizes 2 and 3.
      • Solves open question from [Chaudhuri and McGregor, COLT08] and [Ackermann et al., ECCC11]
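One way to see where the quantity in the theorem could come from is to compute the entropy impurity of a clustering in which every cluster is a single star. The sketch below rests on several assumptions that the slides do not state explicitly: logarithms are base 2, the formula is read as (3 log 3)|E| + 6(1 - log 3)k, the decomposition uses exactly k stars covering all |E| edges, and each star forms one cluster. All function names are hypothetical.

```python
import math

def entropy_impurity(v):
    """||v||_1 * sum_i (v_i/||v||_1) * log2(||v||_1 / v_i). Base 2 is an assumption."""
    total = sum(v)
    return sum(x * math.log2(total / x) for x in v if x > 0)

def star_vector(size):
    """Nonzero part of the sum of `size` edge vectors sharing one endpoint:
    the center has weight `size`, each of the `size` leaves has weight 1."""
    return [size] + [1] * size

def star_clustering_impurity(num_edges, k):
    """Impurity when |E| edges are split into exactly k stars of sizes 2 and 3."""
    threes = num_edges - 2 * k   # solves a + b = k and 3a + 2b = |E| for a
    twos = 3 * k - num_edges     # and for b
    if threes < 0 or twos < 0:
        raise ValueError("no decomposition into k stars of sizes 2 and 3")
    return (threes * entropy_impurity(star_vector(3))
            + twos * entropy_impurity(star_vector(2)))

def k_prime(num_edges, k):
    """The theorem's quantity, under the reading (3 log 3)|E| + 6(1 - log 3)k."""
    return 3 * math.log2(3) * num_edges + 6 * (1 - math.log2(3)) * k

# K4 is cubic with |E| = 6 and a vertex cover of size 3.
print(star_clustering_impurity(6, 3), k_prime(6, 3))
```

Under these assumptions a star of size 2 contributes impurity 6 and a star of size 3 contributes 6 + 3 log 3, and the totals agree with k'(G,k) for any feasible pair (|E|, k).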

  12. Our Contributions
      Ratio-Greedy Algorithm
      • Built on top of the theoretical ideas
      • Promising preliminary experimental comparisons
        – much faster than a k-means based method
        – impurity close to that of the k-means based method

