

SLIDE 1

A Rate-Distortion One-Class Model and its Applications to Clustering

Koby Crammer, Partha Pratim Talukdar, Fernando Pereira¹

University Of Pennsylvania

¹Currently at Google, Inc.

SLIDE 2

One Class Prediction

  • Problem statement: predict a coherent superset of a small set of positive instances.
  • Applications: document retrieval, information extraction, gene expression.
  • Prefer high precision over high recall.

A Rate-Distortion One-Class Model and its Applications to Clustering (Crammer et al.)

SLIDE 3

Previous Approaches

(Ester et al. 1996): Density-based, non-exhaustive clustering algorithm. Unfortunately, density estimation is hard in high dimensions.

(Tax & Duin 1999): Find a small ball that contains as many of the seed examples as possible. Most of the points are considered relevant; a few outliers are dropped.

(Crammer & Chechik 2004): Identify a small subset of relevant examples, leaving out most less-relevant ones.

(Gupta & Ghosh 2006): Modified version of (Crammer & Chechik 2004).

SLIDE 4

Our Approach: A Rate-Distortion One-Class Model

  • Express the one-class problem as lossy coding of each instance into instance-dependent codewords (clusters).
  • In contrast to previous methods, use more codewords than instances.
  • Regularization via sparse coding: each instance has to be assigned to one of only two codewords.

SLIDE 5

Coding Scheme

[Diagram: points 1-5, each connected both to its own codeword (cw 1, ..., cw 5) and to a shared joint codeword.]

  • Instances can be coded as themselves, or as a shared codeword ("0") represented by the vector w.

SLIDE 6

Notation

[Diagram: a point x is coded by the shared codeword with probability q(0|x), or by itself with probability q(x|x).]

p(x)        Prior on point x.
q(0|x)      Probability of x being encoded by the joint code ("0").
q(x|x)      Probability of self-coding point x.
v_x         Vector representation of point x.
w           Centroid vector of the single class.
D(v_x ‖ w)  Cost (distortion) suffered when point x is assigned to the one class whose centroid is w.

SLIDE 7

Rate & Distortion Tradeoff

[Diagram: the two extreme coding policies. "All in one": every point is coded by the shared codeword. "All alone": every point self-codes.]

All in one: high compression (low rate), high distortion. All alone: low compression (high rate), low distortion.

SLIDE 8

Rate-Distortion Optimization

Random variables:

  • X: instance to be coded;
  • T: code for an instance, either T = 0 (shared codeword) or T = x > 0 (instance-specific codeword).

Rate: amount of compression from the source X to the code T, measured by the mutual information I(T; X).

Distortion: how well, on average, the centroid w serves as a proxy for the instances v_x.

Objective (β > 0 tradeoff parameter):

    min over w, {q(0|x)} of  Rate + β × Distortion
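For concreteness, this objective can be evaluated directly for any candidate coding policy. Below is a minimal Python/NumPy sketch; the function name is mine, and squared-Euclidean distortion is an assumption (the slides leave D generic):

```python
import numpy as np

def rate_distortion_objective(V, p, q0x, w, beta):
    """Evaluate Rate + beta * Distortion for a coding policy q(0|x).

    T is either the shared codeword 0 (probability q(0|x)) or the
    instance-specific codeword x (probability q(x|x) = 1 - q(0|x)).
    """
    q0 = np.sum(p * q0x)              # marginal q(0)
    qxx = 1.0 - q0x                   # self-coding probabilities
    # Rate = I(T; X).  The self-codeword marginal is q(x) = p(x) q(x|x),
    # so its log-ratio term reduces to -q(x|x) log p(x).
    with np.errstate(divide="ignore", invalid="ignore"):
        shared = np.where(q0x > 0, q0x * np.log(q0x / q0), 0.0)
    self_c = -qxx * np.log(p)
    rate = np.sum(p * (shared + self_c))
    # Distortion: average cost of letting w stand in for v_x
    # (assumed squared Euclidean here).
    d = np.sum((V - w) ** 2, axis=1)
    distortion = np.sum(p * q0x * d)
    return rate + beta * distortion
```

As a sanity check, the "all alone" policy (q(0|x) = 0) has rate H(X) and zero distortion, while "all in one" (q(0|x) = 1) has zero rate.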

SLIDE 9

Self-Consistent Equations

Solving the rate-distortion optimization in the one-class setting, we get the following three self-consistent equations, as in IB:

    q(0) = Σ_x p(x) q(0|x)                               (1)

    q(0|x) = min( q(0) e^{−β D(v_x ‖ w)} / p(x) , 1 )    (2)

    w = Σ_x q(x|0) v_x                                   (3)

SLIDE 10

One Class Rate Distortion Algorithm (OCRD)

We optimize the rate-distortion tradeoff following the Blahut-Arimoto and Information bottleneck (IB) algorithms, alternating between the following two steps:

1 Compute the centroid location w as the weighted average of instances v_x, with weights proportional to q(0|x) p(x):

       w = Σ_x q(x|0) v_x

2 Fix w and optimize for the coding policy q(0|x), q(0).
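The alternation above can be sketched in a few lines of Python/NumPy. This is illustrative only: the function name is mine, squared-Euclidean distortion is an assumption, and iteration starts from the "all in one" policy to avoid the trivial fixed point q(0) = 0:

```python
import numpy as np

def ocrd(V, p, beta, n_iter=50):
    """Alternate centroid and coding-policy updates (OCRD sketch)."""
    q0x = np.ones(len(V))                 # start "all in one": q(0|x) = 1
    w = V.mean(axis=0)
    for _ in range(n_iter):
        q0 = np.sum(p * q0x)              # eq. (1)
        if q0 == 0:                        # degenerate: everything self-coded
            break
        # Step 1: centroid = weighted average with weights q(x|0),
        # i.e. proportional to q(0|x) p(x).
        w = (p * q0x) @ V / q0
        # Step 2: re-optimize the coding policy via eq. (2),
        # with assumed squared-Euclidean distortion.
        d = np.sum((V - w) ** 2, axis=1)
        q0x = np.minimum(q0 * np.exp(-beta * d) / p, 1.0)
    return w, q0x
```

On a tight cluster plus a distant outlier, the outlier's q(0|x) collapses toward zero (it self-codes) while the cluster points stay fully assigned to the class.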

SLIDE 11

Step 2: Finding a Coding Policy

Let C = {x : q(0|x) = 1} be the set of points assigned to the one class.

Lemma

Let s(x) = β d_x + log p(x). Then there is a θ such that x ∈ C if and only if s(x) < θ.

The lemma allows us to develop a deterministic algorithm that solves for q(0|x), x = 1, ..., m, simultaneously in time O(m log m).
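One way to turn the lemma into code (a sketch of mine, not necessarily the paper's exact procedure): sort the points by s(x) and test each prefix as the candidate class C. For x ∉ C, eq. (2) gives p(x) q(0|x) = q(0) e^{−β d_x}, so eq. (1) fixes q(0) in closed form for each candidate. Returning the largest self-consistent C is a heuristic to skip the trivial empty class, which is always consistent:

```python
import numpy as np

def coding_policy(d, p, beta):
    """Solve for q(0|x) given fixed distortions d_x (deterministic sketch)."""
    m = len(d)
    order = np.argsort(beta * d + np.log(p))     # ascending s(x)
    e = np.exp(-beta * d)
    best = np.zeros(m)                           # fallback: C = {} (all self-coded)
    for k in range(1, m + 1):
        in_c, out_c = order[:k], order[k:]
        S = e[out_c].sum()
        if S >= 1.0:
            continue                             # eq. (1) has no solution here
        q0 = p[in_c].sum() / (1.0 - S)           # from eqs. (1)-(2)
        ok_in = np.all(q0 * e[in_c] / p[in_c] >= 1.0)    # q(0|x) saturates at 1
        ok_out = np.all(q0 * e[out_c] / p[out_c] < 1.0)  # q(0|x) stays below 1
        if ok_in and ok_out:
            best = np.minimum(q0 * e / p, 1.0)   # eq. (2)
    return best
```

The naive consistency checks make this loop O(m²); maintaining a running sum of p over the prefix and of e over the suffix reduces the scan after sorting to O(m), matching the O(m log m) bound.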

SLIDE 12

Phase Transitions in the Optimal Solution

[Plot: q(0|x) for five points (x id 1-5) as a function of temperature 1/β, showing phase transitions in the optimal coding policy.]

SLIDE 13

Multiclass Extension

SLIDE 14

Multiclass Coding Scheme

  • We have m points and k centroids. The natural extension doesn't work because 1 − q(x|x) does not specify which centroid x should be assigned to.

SLIDE 15

Multiclass Coding Scheme

  • We have m points and k centroids. The natural extension doesn't work because 1 − q(x|x) does not specify which centroid x should be assigned to.
  • Our multiclass coding scheme:

[Diagram: each point has its own codeword (cw 1, cw 2, cw 3) plus shared cluster codewords (cw k1, cw k2).]

SLIDE 16

Multiclass Rate-Distortion Algorithm (MCRD)

MCRD alternates between the following two steps:

1 Use the OCRD algorithm to decide whether we want to self-code a point or not.

SLIDE 17

Multiclass Rate-Distortion Algorithm (MCRD)

MCRD alternates between the following two steps:

1 Use the OCRD algorithm to decide whether we want to self-code a point or not.

2 Use a hard clustering algorithm (sIB) to cluster the points which we decided not to self-code in the first step. Then iterate.
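The two steps can be sketched as follows, with a hard nearest-centroid assignment standing in for sIB and squared-Euclidean distortion assumed; the function name and initialization scheme are mine, purely for illustration:

```python
import numpy as np

def mcrd(V, p, beta, init_centroids, n_iter=50):
    """MCRD-style alternation: self-coding decisions + hard clustering."""
    C = np.array(init_centroids, dtype=float)     # k x dim centroids
    q0x = np.ones(len(V))                          # start "all in one"
    for _ in range(n_iter):
        # Step 1 (OCRD-style): distortion to the nearest centroid
        # decides how strongly each point resists self-coding.
        d2 = ((V[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # m x k
        a = d2.argmin(axis=1)                      # hard cluster assignment
        d = d2[np.arange(len(V)), a]
        q0 = np.sum(p * q0x)
        q0x = np.minimum(q0 * np.exp(-beta * d) / p, 1.0)
        # Step 2: recluster the non-self-coded mass (k-means-style
        # stand-in for sIB), weighting each point by p(x) q(0|x).
        for k in range(len(C)):
            wts = p * q0x * (a == k)
            if wts.sum() > 0:
                C[k] = wts @ V / wts.sum()
    return C, q0x, a
```

On two tight clusters plus a far-away noise point, the noise point self-codes while both clusters are recovered.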

SLIDE 18

Experimental Results

1 One-class document classification
2 Multiclass clustering of synthetic data
3 Multiclass clustering of real-world data

SLIDE 19

One Class Document Classification

[PR plots for categories "crude" (#578) and "acq" (#2369), comparing OC-Convex, OC-IB, and OCRD-BA.]

PR plots for two categories of the Reuters-21578 data set using OCRD and two previously proposed methods (OC-IB & OC-Convex). During training, each algorithm searched for a meaningful subset of the training data and generated a centroid. The centroid was then used to label the test data and to compute recall and precision.

SLIDE 20

Multiclass: Synthetic Data Clustering

[Scatter plots: β = 100, coded 585/900 points (cluster sizes 45, 132, 132, 135, 141); β = 140, coded 490/900 points (cluster sizes 0, 119, 121, 122, 128).]

Clusterings produced by MCRD on a synthetic data set for two values of β with k = 5. There were 900 points, 400 sampled from four Gaussian distributions, 500 sampled from a uniform distribution. Self-coded points are marked by black dots, coded points by colored dots and cluster centroids by bold circles.

SLIDE 21

Multiclass: Unsupervised Document Clustering

[PR plot comparing sIB and MCRD; 500 points, 5 clusters.]

PR plots for sIB and MCRD (β = 1.6) on the Multi5 1 dataset (2000 word vocabulary). These plots show that better clustering can be obtained if the algorithm is allowed to selectively leave out data points (through self-coding).

SLIDE 22

Conclusion

  • We have cast the problem of identifying a small coherent subset of data as an optimization problem that trades off between class size (compression) and accuracy (distortion).

SLIDE 23

Conclusion

  • We have cast the problem of identifying a small coherent subset of data as an optimization problem that trades off between class size (compression) and accuracy (distortion).
  • We also show that our method allows us to move from one-class to standard clustering, but with background noise left out (the ability to "give up" some points).

SLIDE 24

Conclusion

  • We have cast the problem of identifying a small coherent subset of data as an optimization problem that trades off between class size (compression) and accuracy (distortion).
  • We also show that our method allows us to move from one-class to standard clustering, but with background noise left out (the ability to "give up" some points).
  • Extend to more general instance spaces and distortions: graphs, manifolds.

SLIDE 25

Thanks!
