

SLIDE 1

A Rate-Distortion One-Class Model and its Applications to Clustering

Koby Crammer, Partha Pratim Talukdar, Fernando Pereira¹

University Of Pennsylvania

¹Currently at Google, Inc.

SLIDE 2

One Class Prediction

  • Problem statement: predict a coherent superset of a small set of positive instances.
  • Applications: document retrieval, information extraction, gene expression.
  • Prefer high precision over high recall.

A Rate-Distortion One-Class Model and its Applications to Clustering (Crammer et al.)

SLIDE 3

Previous Approaches

(Ester et al. 1996): Density-based, non-exhaustive clustering algorithm. Unfortunately, density estimation is hard in high dimensions.

(Tax & Duin 1999): Find a small ball that contains as many of the seed examples as possible. Most of the points are considered relevant; a few outliers are dropped.

(Crammer & Chechik 2004): Identify a small subset of relevant examples, leaving out most less-relevant ones.

(Gupta & Ghosh 2006): Modified version of (Crammer & Chechik 2004).

SLIDE 4

Our Approach: A Rate-Distortion One-Class Model

  • Express the one-class problem as lossy coding of each instance into instance-dependent codewords (clusters).
  • In contrast to previous methods, use more codewords than instances.
  • Regularization via sparse coding: each instance has to be assigned to one of only two codewords.

SLIDE 5

Coding Scheme

[Diagram: points 1-5, each connected both to its own codeword (cw 1, ..., cw 5) and to a shared joint codeword.]

  • Instances can be coded as themselves, or as a shared codeword ("0") represented by the vector w.

SLIDE 6

Notation

[Diagram: a point x is coded by the shared codeword with probability q(0|x), or by itself with probability q(x|x).]

p(x)        Prior on point x.
q(0|x)      Probability of x being encoded by the joint code ("0").
q(x|x)      Probability of self-coding point x.
v_x         Vector representation of point x.
w           Centroid vector of the single class.
D(v_x ‖ w)  Cost (distortion) suffered when point x is assigned to the one class whose centroid is w.

SLIDE 7

Rate & Distortion Tradeoff

[Diagram: the two extreme coding policies. "All in one": every point is coded by the shared codeword. "All alone": every point self-codes.]

All in one: high compression (low rate), high distortion. All alone: low compression (high rate), low distortion.

SLIDE 8

Rate-Distortion Optimization

Random variables:

  • X: instance to be coded;
  • T: code for an instance, either T = 0 (shared codeword) or T = x > 0 (instance-specific codeword).

Rate: amount of compression from the source X to the code T, measured by the mutual information I(T; X).

Distortion: how well, on average, the centroid w serves as a proxy for the instances v_x.

Objective (β > 0 tradeoff parameter):

    min over w, {q(0|x)} of  Rate + β × Distortion
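For concreteness, this objective can be evaluated directly for any candidate coding policy. Below is a minimal Python/NumPy sketch; the function name is mine, and squared-Euclidean distortion is an assumption (the slides leave D generic):

```python
import numpy as np

def rate_distortion_objective(V, p, q0x, w, beta):
    """Evaluate Rate + beta * Distortion for a coding policy q(0|x).

    T is either the shared codeword 0 (probability q(0|x)) or the
    instance-specific codeword x (probability q(x|x) = 1 - q(0|x)).
    """
    q0 = np.sum(p * q0x)              # marginal q(0)
    qxx = 1.0 - q0x                   # self-coding probabilities
    # Rate = I(T; X).  The self-codeword marginal is q(x) = p(x) q(x|x),
    # so its log-ratio term reduces to -q(x|x) log p(x).
    with np.errstate(divide="ignore", invalid="ignore"):
        shared = np.where(q0x > 0, q0x * np.log(q0x / q0), 0.0)
    self_c = -qxx * np.log(p)
    rate = np.sum(p * (shared + self_c))
    # Distortion: average cost of letting w stand in for v_x
    # (assumed squared Euclidean here).
    d = np.sum((V - w) ** 2, axis=1)
    distortion = np.sum(p * q0x * d)
    return rate + beta * distortion
```

As a sanity check, the "all alone" policy (q(0|x) = 0) has rate H(X) and zero distortion, while "all in one" (q(0|x) = 1) has zero rate.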

SLIDE 9

Self-Consistent Equations

Solving the rate-distortion optimization in the one-class setting, we get the following three self-consistent equations, as in IB:

    q(0) = Σ_x p(x) q(0|x)                               (1)

    q(0|x) = min( q(0) e^{−β D(v_x ‖ w)} / p(x) , 1 )    (2)

    w = Σ_x q(x|0) v_x                                   (3)

SLIDE 10

One Class Rate Distortion Algorithm (OCRD)

We optimize the rate-distortion tradeoff following the Blahut-Arimoto and Information bottleneck (IB) algorithms, alternating between the following two steps:

1 Compute the centroid location w as the weighted average of instances v_x, with weights proportional to q(0|x) p(x):

       w = Σ_x q(x|0) v_x

2 Fix w and optimize for the coding policy q(0|x), q(0).
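The alternation above can be sketched in a few lines of Python/NumPy. This is illustrative only: the function name is mine, squared-Euclidean distortion is an assumption, and iteration starts from the "all in one" policy to avoid the trivial fixed point q(0) = 0:

```python
import numpy as np

def ocrd(V, p, beta, n_iter=50):
    """Alternate centroid and coding-policy updates (OCRD sketch)."""
    q0x = np.ones(len(V))                 # start "all in one": q(0|x) = 1
    w = V.mean(axis=0)
    for _ in range(n_iter):
        q0 = np.sum(p * q0x)              # eq. (1)
        if q0 == 0:                        # degenerate: everything self-coded
            break
        # Step 1: centroid = weighted average with weights q(x|0),
        # i.e. proportional to q(0|x) p(x).
        w = (p * q0x) @ V / q0
        # Step 2: re-optimize the coding policy via eq. (2),
        # with assumed squared-Euclidean distortion.
        d = np.sum((V - w) ** 2, axis=1)
        q0x = np.minimum(q0 * np.exp(-beta * d) / p, 1.0)
    return w, q0x
```

On a tight cluster plus a distant outlier, the outlier's q(0|x) collapses toward zero (it self-codes) while the cluster points stay fully assigned to the class.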

SLIDE 11

Step 2: Finding a Coding Policy

Let C = {x : q(0|x) = 1} be the set of points assigned to the one class.

Lemma

Let s(x) = β d_x + log p(x). Then there is a θ such that x ∈ C if and only if s(x) < θ.

The lemma allows us to develop a deterministic algorithm that solves for q(0|x), x = 1, ..., m, simultaneously in time O(m log m).
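One way to turn the lemma into code (a sketch of mine, not necessarily the paper's exact procedure): sort the points by s(x) and test each prefix as the candidate class C. For x ∉ C, eq. (2) gives p(x) q(0|x) = q(0) e^{−β d_x}, so eq. (1) fixes q(0) in closed form for each candidate. Returning the largest self-consistent C is a heuristic to skip the trivial empty class, which is always consistent:

```python
import numpy as np

def coding_policy(d, p, beta):
    """Solve for q(0|x) given fixed distortions d_x (deterministic sketch)."""
    m = len(d)
    order = np.argsort(beta * d + np.log(p))     # ascending s(x)
    e = np.exp(-beta * d)
    best = np.zeros(m)                           # fallback: C = {} (all self-coded)
    for k in range(1, m + 1):
        in_c, out_c = order[:k], order[k:]
        S = e[out_c].sum()
        if S >= 1.0:
            continue                             # eq. (1) has no solution here
        q0 = p[in_c].sum() / (1.0 - S)           # from eqs. (1)-(2)
        ok_in = np.all(q0 * e[in_c] / p[in_c] >= 1.0)    # q(0|x) saturates at 1
        ok_out = np.all(q0 * e[out_c] / p[out_c] < 1.0)  # q(0|x) stays below 1
        if ok_in and ok_out:
            best = np.minimum(q0 * e / p, 1.0)   # eq. (2)
    return best
```

The naive consistency checks make this loop O(m²); maintaining a running sum of p over the prefix and of e over the suffix reduces the scan after sorting to O(m), matching the O(m log m) bound.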

SLIDE 12

Phase Transitions in the Optimal Solution

[Plot: q(0|x) for five points (x id 1-5) as a function of temperature 1/β, showing phase transitions in the optimal coding policy.]

SLIDE 13

Multiclass Extension

SLIDE 14

Multiclass Coding Scheme

  • We have m points and k centroids. The natural extension doesn't work because 1 − q(x|x) does not specify which centroid x should be assigned to.

SLIDE 15

Multiclass Coding Scheme

  • We have m points and k centroids. The natural extension doesn't work because 1 − q(x|x) does not specify which centroid x should be assigned to.
  • Our multiclass coding scheme:

[Diagram: each point has its own codeword (cw 1, cw 2, cw 3) plus shared cluster codewords (cw k1, cw k2).]

SLIDE 16

Multiclass Rate-Distortion Algorithm (MCRD)

MCRD alternates between the following two steps:

1 Use the OCRD algorithm to decide whether we want to self-code a point or not.

SLIDE 17

Multiclass Rate-Distortion Algorithm (MCRD)

MCRD alternates between the following two steps:

1 Use the OCRD algorithm to decide whether we want to self-code a point or not.

2 Use a hard clustering algorithm (sIB) to cluster the points which we decided not to self-code in the first step. Then iterate.
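The two steps can be sketched as follows, with a hard nearest-centroid assignment standing in for sIB and squared-Euclidean distortion assumed; the function name and initialization scheme are mine, purely for illustration:

```python
import numpy as np

def mcrd(V, p, beta, init_centroids, n_iter=50):
    """MCRD-style alternation: self-coding decisions + hard clustering."""
    C = np.array(init_centroids, dtype=float)     # k x dim centroids
    q0x = np.ones(len(V))                          # start "all in one"
    for _ in range(n_iter):
        # Step 1 (OCRD-style): distortion to the nearest centroid
        # decides how strongly each point resists self-coding.
        d2 = ((V[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # m x k
        a = d2.argmin(axis=1)                      # hard cluster assignment
        d = d2[np.arange(len(V)), a]
        q0 = np.sum(p * q0x)
        q0x = np.minimum(q0 * np.exp(-beta * d) / p, 1.0)
        # Step 2: recluster the non-self-coded mass (k-means-style
        # stand-in for sIB), weighting each point by p(x) q(0|x).
        for k in range(len(C)):
            wts = p * q0x * (a == k)
            if wts.sum() > 0:
                C[k] = wts @ V / wts.sum()
    return C, q0x, a
```

On two tight clusters plus a far-away noise point, the noise point self-codes while both clusters are recovered.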

SLIDE 18

Experimental Results

1 One-class document classification
2 Multiclass clustering of synthetic data
3 Multiclass clustering of real-world data

SLIDE 19

One Class Document Classification

[PR plots for categories "crude" (#578) and "acq" (#2369), comparing OC-Convex, OC-IB, and OCRD-BA.]

PR plots for two categories of the Reuters-21578 data set using OCRD and two previously proposed methods (OC-IB & OC-Convex). During training, each algorithm searched for a meaningful subset of the training data and generated a centroid. The centroid was then used to label the test data and to compute recall and precision.

SLIDE 20

Multiclass: Synthetic Data Clustering

[Scatter plots: β = 100, coded 585/900 points (cluster sizes 45, 132, 132, 135, 141); β = 140, coded 490/900 points (cluster sizes 0, 119, 121, 122, 128).]

Clusterings produced by MCRD on a synthetic data set for two values of β with k = 5. There were 900 points, 400 sampled from four Gaussian distributions, 500 sampled from a uniform distribution. Self-coded points are marked by black dots, coded points by colored dots and cluster centroids by bold circles.

SLIDE 21

Multiclass: Unsupervised Document Clustering

[PR plot comparing sIB and MCRD; 500 points, 5 clusters.]

PR plots for sIB and MCRD (β = 1.6) on the Multi5 1 dataset (2000 word vocabulary). These plots show that better clustering can be obtained if the algorithm is allowed to selectively leave out data points (through self-coding).

SLIDE 22

Conclusion

  • We have cast the problem of identifying a small coherent subset of data as an optimization problem that trades off between class size (compression) and accuracy (distortion).

SLIDE 23

Conclusion

  • We have cast the problem of identifying a small coherent subset of data as an optimization problem that trades off between class size (compression) and accuracy (distortion).
  • We also show that our method allows us to move from one-class to standard clustering, but with background noise left out (the ability to "give up" some points).

SLIDE 24

Conclusion

  • We have cast the problem of identifying a small coherent subset of data as an optimization problem that trades off between class size (compression) and accuracy (distortion).
  • We also show that our method allows us to move from one-class to standard clustering, but with background noise left out (the ability to "give up" some points).
  • Extend to more general instance spaces and distortions: graphs, manifolds.

SLIDE 25

Thanks!
