


  1. A Rate-Distortion One-Class Model and its Applications to Clustering (Crammer et al.)
Fernando Pereira¹, Koby Crammer, Partha Pratim Talukdar. University of Pennsylvania. ¹Currently at Google, Inc.

  2. One-Class Prediction
• Problem statement: predict a coherent superset of a small set of positive instances.
• Applications: document retrieval, information extraction, gene expression.
• Prefer high precision over high recall.

  3. Previous Approaches
• (Ester et al. 1996): a density-based, non-exhaustive clustering algorithm. Unfortunately, density estimation is hard in high dimensions.
• (Tax & Duin 1999): find a small ball that contains as many of the seed examples as possible. Most of the points are considered relevant; a few outliers are dropped.
• (Crammer & Chechik 2004): identify a small subset of relevant examples, leaving out most less relevant ones.
• (Gupta & Ghosh 2006): a modified version of (Crammer & Chechik 2004).

  4. Our Approach: A Rate-Distortion One-Class Model
• Express the one-class problem as lossy coding of each instance into instance-dependent codewords (clusters).
• In contrast to previous methods, use more codewords than instances.
• Regularization via sparse coding: each instance has to be assigned to one of only two codewords.

  5. Coding Scheme
[Figure: each point 1..5 can be coded by its own codeword cw1..cw5, or by the shared joint codeword cw0.]
• Instances can be coded as themselves, or as a shared codeword ("0") represented by the vector w.

  6. Notation
[Figure: point x is coded by its own codeword cw_x with probability q(x|x), or by the joint codeword cw_0 with probability q(0|x).]
• p(x): prior on point x.
• q(0|x): probability of x being encoded by the joint code ("0").
• q(x|x): probability of self-coding point x.
• v_x: vector representation of point x.
• w: centroid vector of the single class.
• D(v_x ‖ w): cost (distortion) suffered when point x is assigned to the one class whose centroid is w.

  7. Rate & Distortion Tradeoff
[Figure: the two extremes of the coding scheme. "All in one": every point is coded by the joint codeword cw0, giving high compression (low rate) and high distortion. "All alone": every point is self-coded by its own codeword cw1..cw5, giving low compression (high rate) and low distortion.]

  8. Rate-Distortion Optimization
Random variables:
• X: instance to be coded.
• T: code for an instance, either T = 0 (shared codeword) or T = x > 0 (instance-specific codeword).
Rate: the amount of compression from the source X to the code T, measured by the mutual information I(T; X).
Distortion: how well, on average, the centroid w serves as a proxy for the instances v_x.
Objective (β > 0 is a tradeoff parameter):
    min_{w, {q(0|x)}}  Rate + β × Distortion
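The two terms of this objective can be made concrete with a small sketch. This is a toy illustration, not the authors' code: it assumes a uniform prior p(x) = 1/m and squared Euclidean distortion D(v_x ‖ w) = ‖v_x − w‖², and the function names are invented here. Each point x uses the shared codeword "0" with probability q0[x] and self-codes otherwise, so the rate is 0 when everything is jointly coded and log m when everything is self-coded, matching the two extremes on the previous slide.

```python
import numpy as np

# Toy sketch of the rate and distortion terms (uniform prior, squared
# Euclidean distortion; not the authors' implementation).

def rate(q0):
    """I(T;X) for the two-codewords-per-point coding scheme."""
    m = len(q0)
    p = np.full(m, 1.0 / m)
    q0_marg = float(np.sum(p * q0))          # q(0) = sum_x p(x) q(0|x)
    total = 0.0
    for i in range(m):
        if q0[i] > 0:                        # shared-code term
            total += p[i] * q0[i] * np.log(q0[i] / q0_marg)
        qs = 1.0 - q0[i]                     # q(x|x), the self-code prob.
        if qs > 0:                           # self-code term: q(x) = p(x) q(x|x)
            total += p[i] * qs * np.log(1.0 / p[i])
    return total

def distortion(q0, V, w):
    """Average distortion, paid only when a point uses the shared code."""
    d = np.sum((V - w) ** 2, axis=1)         # D(v_x || w) per point
    return float(np.mean(q0 * d))
```

At q0 = 1 everywhere ("all in one") the rate is 0 and the distortion is maximal; at q0 = 0 everywhere ("all alone") the rate is log m and the distortion is 0.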

  9. Self-Consistent Equations
Solving the rate-distortion optimization in the one-class setting, we get the following three self-consistent equations, as in IB:
    q(0) = Σ_x p(x) q(0|x)                                (1)
    q(0|x) = min( (q(0) / p(x)) e^(−β D(v_x ‖ w)), 1 )    (2)
    w = Σ_x q(x|0) v_x                                    (3)

  10. One-Class Rate-Distortion Algorithm (OCRD)
We optimize the rate-distortion tradeoff following the Blahut-Arimoto and Information Bottleneck (IB) algorithms, alternating between the following two steps:
1. Compute the centroid location w as the weighted average of the instances v_x, with weights proportional to q(0|x) p(x):
    w = Σ_x q(x|0) v_x
2. Fix w and optimize for the coding policy q(0|x), q(0).
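The alternation above can be sketched as a fixed-point iteration on equations (1)-(3). This is a hedged sketch, not the authors' code: it assumes squared Euclidean distortion and a uniform prior p(x) = 1/m, and it uses the naive per-point update for step 2 rather than the O(m log m) threshold algorithm described on the next slide.

```python
import numpy as np

# Sketch of the OCRD alternation (assumptions: squared Euclidean
# distortion D(v_x || w) = ||v_x - w||^2, uniform prior p(x) = 1/m).

def ocrd(V, beta, n_iter=200):
    V = np.asarray(V, dtype=float)
    m = V.shape[0]
    p = np.full(m, 1.0 / m)
    q0x = np.full(m, 0.5)                    # q(0|x), uniform start
    w = V.mean(axis=0)
    for _ in range(n_iter):
        # Step 1: centroid = average of points weighted by q(0|x) p(x).
        wts = q0x * p
        if wts.sum() > 0:
            w = (wts[:, None] * V).sum(axis=0) / wts.sum()
        # Step 2: coding policy from the self-consistent equations.
        d = np.sum((V - w) ** 2, axis=1)             # D(v_x || w)
        q0 = float(np.sum(p * q0x))                  # eq. (1)
        q0x = np.minimum((q0 / p) * np.exp(-beta * d), 1.0)   # eq. (2)
    return w, q0x
```

On a tight cluster plus a distant outlier, the iteration drives q(0|x) to 1 for the cluster points and to ~0 for the outlier, so the centroid w settles on the cluster mean.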

  11. Step 2: Finding a Coding Policy
Let C = {x : q(0|x) = 1} be the set of points assigned to the one class.
Lemma: let s(x) = β d_x + log(p(x)). Then there is a threshold θ such that x ∈ C if and only if s(x) < θ.
The lemma allows us to develop a deterministic algorithm that solves for q(0|x), x = 1, …, m, simultaneously in time O(m log m).
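A small sketch of what the lemma buys us (an illustration, not the paper's full algorithm): since membership in C is a threshold test on the score s(x), C is always a prefix of the points sorted by s(x). After an O(m log m) sort, the deterministic algorithm only has to scan candidate prefixes; the consistency check that selects the right prefix is omitted here.

```python
import numpy as np

# Illustration of the threshold lemma: s(x) = beta * d_x + log p(x),
# and x is in the one class C iff s(x) < theta.

def score(d, p, beta):
    return beta * d + np.log(p)

def one_class(d, p, beta, theta):
    """Membership in C is a single threshold test on s(x)."""
    return score(d, p, beta) < theta

def candidate_prefixes(d, p, beta):
    """Every candidate C is a prefix of the points sorted by s(x)."""
    order = np.argsort(score(d, p, beta))
    return [list(order[:k]) for k in range(len(d) + 1)]
```

With a uniform prior, the score ordering reduces to ordering by distortion d_x, so the class always consists of the points closest to the centroid.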

  12. Phase Transitions in the Optimal Solution
[Plot: q(0|x) as a function of the point id x and the temperature 1/β, showing phase transitions in the optimal coding policy.]

  13. Multiclass Extension


  15. Multiclass Coding Scheme
• We have m points and k centroids. The natural extension doesn't work because 1 − q(x|x) does not specify which centroid x should be assigned to.
• Our multiclass coding scheme:
[Figure: each point is either self-coded by its own codeword or routed through the shared code "0" to one of the k cluster codewords cw_1 … cw_k.]


  17. Multiclass Rate-Distortion Algorithm (MCRD)
MCRD alternates between the following two steps:
1. Use the OCRD algorithm to decide whether or not to self-code each point.
2. Use a hard clustering algorithm (sIB) to cluster the points that we decided not to self-code in the first step. Then iterate.
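The MCRD alternation can be sketched as follows. This is a heavily hedged stand-in, not the authors' implementation: the paper uses sIB for the hard clustering step, but a plain k-means update substitutes for it here; squared Euclidean distance substitutes for the distortion; and the self-coding decision is a crude threshold rather than the full OCRD subroutine.

```python
import numpy as np

# Sketch of the MCRD loop with k-means standing in for sIB and a
# threshold standing in for the OCRD self-coding decision.

def mcrd_sketch(V, k, beta, n_iter=50):
    V = np.asarray(V, dtype=float)
    centroids = V[:k].copy()                 # naive init: first k points
    for _ in range(n_iter):
        # Step 1 (one-class style): self-code points whose distortion to
        # the nearest centroid is too high at inverse temperature beta.
        d = ((V[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        nearest = d.min(axis=1)
        coded = np.exp(-beta * nearest) > 0.5    # stand-in for q(0|x) = 1
        # Step 2: hard-cluster only the points not self-coded.
        assign = d.argmin(axis=1)
        for j in range(k):
            members = coded & (assign == j)
            if members.any():
                centroids[j] = V[members].mean(axis=0)
    return centroids, assign, coded
```

Because self-coded points are excluded from the centroid updates, distant background points stop dragging the cluster centroids, which is the effect the synthetic-data experiment below illustrates.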

  18. Experimental Results
1. One-class document classification.
2. Multiclass clustering of synthetic data.
3. Multiclass clustering of real-world data.

  19. One-Class Document Classification
[PR plots for the categories crude (#578) and acq (#2369), comparing OC-Convex, OC-IB, and OCRD-BA.]
PR plots for two categories of the Reuters-21578 data set using OCRD and two previously proposed methods (OC-IB and OC-Convex). During training, each algorithm searched for a meaningful subset of the training data and generated a centroid. The centroid was then used to label the test data and to compute recall and precision.

  20. Multiclass: Synthetic Data Clustering
[Two scatter plots of MCRD clusterings: left, β = 100, coded = 585/900 (cluster sizes 45, 132, 132, 135, 141); right, β = 140, coded = 490/900 (cluster sizes 0, 119, 121, 122, 128).]
Clusterings produced by MCRD on a synthetic data set for two values of β with k = 5. There were 900 points: 400 sampled from four Gaussian distributions and 500 sampled from a uniform distribution. Self-coded points are marked by black dots, coded points by colored dots, and cluster centroids by bold circles.

  21. Multiclass: Unsupervised Document Clustering
[PR plot comparing sIB and MCRD; 500 points, 5 clusters.]
PR plots for sIB and MCRD (β = 1.6) on the Multi5 1 dataset (2000-word vocabulary). These plots show that better clustering can be obtained if the algorithm is allowed to selectively leave out data points (through self-coding).


  23. Conclusion
• We have cast the problem of identifying a small coherent subset of data as an optimization problem that trades off class size (compression) against accuracy (distortion).
• We also showed that our method allows moving from one-class to standard clustering with the background noise left out (the ability to "give up" some points).
