PAC-Bayesian Analysis of Co-clustering, Graph Clustering and Pairwise Clustering - PowerPoint PPT Presentation



SLIDE 1

PAC-Bayesian Analysis of Co-clustering, Graph Clustering and Pairwise Clustering

Yevgeny Seldin

SLIDE 2

Motivation

  • Clustering cannot be analyzed without specifying what it will be used for!

SLIDE 3

Example

  • Cluster then pack
  • Clustering by shape is preferable
  • Evaluate the amount of time saved
SLIDE 4

How to define a clustering problem?

  • Common pitfall: the goal is defined in terms of the solution

– Graph cut
– Spectral clustering
– Information-theoretic approaches

  • Which one to choose? How to compare?
  • Our goal: suggest a problem formulation that is independent of the way it is solved

SLIDE 5

Outline

  • Two problems behind co-clustering

– Discriminative prediction
– Density estimation

  • PAC-Bayesian analysis of discriminative prediction with co-clustering
  • PAC-Bayesian analysis of graph clustering
SLIDE 6

Discriminative Prediction with Co-clustering

  • Example: collaborative filtering
  • Goal: find a discriminative prediction rule q(Y|X1,X2)

[Figure: rating matrix with rows X1 (viewers), columns X2 (movies), and entries Y]
SLIDE 7

Discriminative Prediction with Co-clustering

  • Example: collaborative filtering
  • Goal: find a discriminative prediction rule q(Y|X1,X2)
  • Evaluation:

$$L(q) = \mathbb{E}_{p(X_1,X_2,Y)}\,\mathbb{E}_{q(Y'|X_1,X_2)}\, l(Y,Y')$$

Expectation w.r.t. the true distribution p(X1,X2,Y), expectation w.r.t. the classifier q(Y'|X1,X2), and a given loss l(Y,Y').

[Figure: rating matrix with rows X1 (viewers), columns X2 (movies), and entries Y]
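To make this concrete, here is a minimal numpy sketch of the empirical version of this loss, with hypothetical names and the expectation over p(X1,X2,Y) replaced by an average over an observed sample:

```python
import numpy as np

def discriminative_loss(q, samples, loss):
    """Empirical estimate of L(q) = E_p E_q l(Y, Y').

    q       : array of shape [Y, X1, X2], the rule q(Y | X1, X2)
    samples : list of observed triples (x1, x2, y) from p(X1, X2, Y)
    loss    : array of shape [Y, Y'], the given loss l(y, y')
    """
    total = 0.0
    for x1, x2, y in samples:
        # Inner expectation over the randomized prediction Y' ~ q(. | x1, x2)
        total += q[:, x1, x2] @ loss[y, :]
    return total / len(samples)
```

For the zero-one loss, `loss = 1 - np.eye(n_labels)` and the estimate becomes the expected misclassification rate of the randomized rule.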

SLIDE 8

Co-occurrence Data Analysis

  • Example: words-documents co-occurrence data
  • Goal: find an estimator q(X1,X2) for the joint distribution p(X1,X2)

[Figure: co-occurrence matrix with rows X1 (documents) and columns X2 (words)]

SLIDE 9

Co-occurrence Data Analysis

  • Example: words-documents co-occurrence data
  • Goal: find an estimator q(X1,X2) for the joint distribution p(X1,X2)
  • Evaluation:

$$L(q) = -\mathbb{E}_{p(X_1,X_2)} \ln q(X_1,X_2)$$

Expectation w.r.t. the true distribution p(X1,X2).

[Figure: co-occurrence matrix with rows X1 (documents) and columns X2 (words)]
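A matching sketch for the density-estimation loss, again with hypothetical names and the expectation replaced by an empirical average, assuming q is stored as a dense normalized table:

```python
import numpy as np

def log_loss(q, samples, eps=1e-12):
    """Empirical estimate of L(q) = -E_{p(X1,X2)} ln q(X1, X2).

    q       : array of shape [X1, X2], the estimated joint distribution
    samples : list of observed pairs (x1, x2) from p(X1, X2)
    eps     : guards against log(0) for cells q assigns zero mass
    """
    return -np.mean([np.log(q[x1, x2] + eps) for x1, x2 in samples])
```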

SLIDE 10

Outline

  • Two problems behind co-clustering

– Discriminative prediction
– Density estimation

  • PAC-Bayesian analysis of discriminative prediction with co-clustering
  • PAC-Bayesian analysis of graph clustering

SLIDE 11

Discriminative prediction based on co-clustering

Model:

$$q(Y|X_1,X_2) = \sum_{C_1,C_2} q(Y|C_1,C_2)\, q(C_1|X_1)\, q(C_2|X_2)$$

[Figure: graphical model with X1 → C1, X2 → C2, and (C1, C2) → Y]

Denote: Q = {q(C1|X1), q(C2|X2), q(Y|C1,C2)}
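The factored rule can be assembled into a full conditional table with one tensor contraction. A minimal sketch, assuming the three factors in Q are stored as dense arrays (names are hypothetical):

```python
import numpy as np

def compose_rule(q_y_cc, q_c1_x1, q_c2_x2):
    """q(Y|X1,X2) = sum_{C1,C2} q(Y|C1,C2) q(C1|X1) q(C2|X2).

    q_y_cc  : array [Y, C1, C2] -- q(Y | C1, C2)
    q_c1_x1 : array [C1, X1]    -- q(C1 | X1)
    q_c2_x2 : array [C2, X2]    -- q(C2 | X2)
    Returns an array [Y, X1, X2] representing q(Y | X1, X2).
    """
    return np.einsum('yab,ax,bz->yxz', q_y_cc, q_c1_x1, q_c2_x2)
```

When the cluster assignments are hard, each column of q(C|X) is an indicator, and the contraction reduces to looking up the cluster pair of (X1, X2) in q(Y|C1,C2).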

SLIDE 12

Generalization Bound

  • With probability ≥ 1-δ:

$$\mathrm{kl}\left(\hat{L}(Q)\,\middle\|\,L(Q)\right) \le \frac{\sum_i |X_i|\, I(X_i;C_i) + K}{N}$$

with q(Y|X1,X2) and Q = {q(C1|X1), q(C2|X2), q(Y|C1,C2)} as defined above.

  • A looser, but simpler form of the bound: since

$$\mathrm{kl}\left(\hat{L}(Q)\,\middle\|\,L(Q)\right) = \hat{L}(Q)\ln\frac{\hat{L}(Q)}{L(Q)} + \left(1-\hat{L}(Q)\right)\ln\frac{1-\hat{L}(Q)}{1-L(Q)},$$

the bound implies

$$L(Q) \le \hat{L}(Q) + \sqrt{2\hat{L}(Q)\,\frac{\sum_i |X_i|\, I(X_i;C_i) + K}{N}} + 2\,\frac{\sum_i |X_i|\, I(X_i;C_i) + K}{N}.$$
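Both forms can be evaluated numerically. A sketch with hypothetical names, where `bound_rhs` plays the role of the right-hand side (Σ_i |X_i| I(X_i;C_i) + K)/N:

```python
import numpy as np

def binary_kl(p_hat, p):
    """kl(p_hat || p) between Bernoulli means; assumes 0 < p_hat, p < 1."""
    return (p_hat * np.log(p_hat / p)
            + (1 - p_hat) * np.log((1 - p_hat) / (1 - p)))

def kl_upper_inverse(l_hat, bound_rhs, tol=1e-9):
    """Tightest L(Q) allowed by kl(l_hat || L) <= bound_rhs (binary search)."""
    lo, hi = l_hat, 1.0 - tol
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(l_hat, mid) <= bound_rhs:
            lo = mid  # mid is still consistent with the kl constraint
        else:
            hi = mid
    return lo

def relaxed_bound(l_hat, bound_rhs):
    """Looser form: L(Q) <= l_hat + sqrt(2 l_hat B) + 2 B."""
    return l_hat + np.sqrt(2 * l_hat * bound_rhs) + 2 * bound_rhs
```

`kl_upper_inverse` gives the tightest value implied by the kl form; `relaxed_bound` is the simpler closed-form relaxation from the slide.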

SLIDE 13

Generalization Bound

  • With probability ≥ 1-δ:

$$\mathrm{kl}\left(\hat{L}(Q)\,\middle\|\,L(Q)\right) \le \frac{\sum_i |X_i|\, I(X_i;C_i) + K}{N}$$

where

$$K = \sum_i |C_i| \ln|X_i| + \left(\prod_i |C_i|\right) \ln|Y| + \tfrac{1}{2}\ln(4N) - \ln\delta.$$

The first term is logarithmic in |Xi|, the second counts the number of partition cells, and the remaining terms are the usual PAC-Bayesian bound part.
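Under this reconstruction of K (the exact constants in the last two terms follow the reading above and should be treated as an assumption), the complexity term is direct arithmetic:

```python
import numpy as np

def complexity_K(n_clusters, n_values, n_labels, N, delta):
    """K = sum_i |C_i| ln|X_i| + (prod_i |C_i|) ln|Y| + 0.5 ln(4N) - ln(delta).

    n_clusters : list of |C_i| (partition cells per dimension)
    n_values   : list of |X_i| (values per dimension)
    n_labels   : |Y|
    NOTE: the 0.5 ln(4N) - ln(delta) tail is an assumed reconstruction.
    """
    return (sum(c * np.log(x) for c, x in zip(n_clusters, n_values))
            + np.prod(n_clusters) * np.log(n_labels)
            + 0.5 * np.log(4 * N) - np.log(delta))
```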

SLIDE 14

Generalization Bound

  • With probability ≥ 1-δ:

$$\mathrm{kl}\left(\hat{L}(Q)\,\middle\|\,L(Q)\right) \le \frac{\sum_i |X_i|\, I(X_i;C_i) + K}{N}$$

Low complexity: I(Xi;Ci) = 0. High complexity: I(Xi;Ci) = ln|Xi|.

SLIDE 15

Generalization Bound

  • With probability ≥ 1-δ:

$$\mathrm{kl}\left(\hat{L}(Q)\,\middle\|\,L(Q)\right) \le \frac{\sum_i |X_i|\, I(X_i;C_i) + K}{N}$$

Optimization tradeoff: empirical loss vs. "effective" partition complexity, trading lower against higher complexity.

SLIDE 16

Practice

  • With probability ≥ 1-δ:

$$\mathrm{kl}\left(\hat{L}(Q)\,\middle\|\,L(Q)\right) \le \frac{\sum_i |X_i|\, I(X_i;C_i) + K}{N}$$

  • Replace with a trade-off:

$$F(Q) = N \hat{L}(Q) + \beta \sum_i |X_i|\, I(X_i;C_i)$$
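Evaluating F(Q) requires I(Xi;Ci), computed from q(Ci|Xi) and a marginal over Xi (the choice of marginal, e.g. the empirical one, is an assumption here). A sketch with hypothetical names:

```python
import numpy as np

def mutual_information(q_c_x, p_x, eps=1e-12):
    """I(X;C) for q(C|X) of shape [C, X] and a marginal p(X) of shape [X]."""
    joint = q_c_x * p_x                         # p(c, x) = q(c|x) p(x)
    p_c = joint.sum(axis=1, keepdims=True)      # marginal p(c), shape [C, 1]
    return float(np.sum(joint * np.log(joint / (p_c * p_x + eps) + eps)))

def tradeoff_objective(emp_loss, q_list, p_list, sizes, N, beta):
    """F(Q) = N * L_hat(Q) + beta * sum_i |X_i| * I(X_i; C_i)."""
    info = sum(n * mutual_information(q, p)
               for q, p, n in zip(q_list, p_list, sizes))
    return N * emp_loss + beta * info
```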

SLIDE 17

Application

  • MovieLens dataset

– 100,000 ratings on a 5-star scale
– 80,000 train ratings, 20,000 test ratings
– 943 viewers x 1682 movies
– State-of-the-art Mean Absolute Error (0.72)
– The optimal performance is achieved even with a 300x300 cluster space

SLIDE 18

13x6 Clusters

$$F(Q) = N \hat{L}(Q) + \beta \sum_i |X_i|\, I(X_i;C_i)$$

[Figure: bound and test Mean Absolute Error as functions of β for 13x6 clusters]

SLIDE 19

50x50 Clusters

$$F(Q) = N \hat{L}(Q) + \beta \sum_i |X_i|\, I(X_i;C_i)$$

[Figure: bound and test Mean Absolute Error as functions of β for 50x50 clusters]

SLIDE 20

283x283 Clusters

$$F(Q) = N \hat{L}(Q) + \beta \sum_i |X_i|\, I(X_i;C_i)$$

[Figure: bound and test Mean Absolute Error as functions of β for 283x283 clusters]

SLIDE 21

Weighted Graph Clustering

  • The weights of the edges wij are generated by an unknown distribution p(wij|xi,xj)
  • Given a sample of edge weights of size N
  • Build a model q(w|x1,x2) such that

$$\mathbb{E}_{p(x_1,x_2,w)}\, \mathbb{E}_{q(w'|x_1,x_2)}\, l(w,w')$$

is minimized

SLIDE 22

Other problems

  • Pairwise clustering = clustering of a weighted graph

– Edge weights = pairwise relations

  • Clustering of an unweighted graph

– Present edges = weight 1
– Absent edges = weight 0

SLIDE 23

Weighted Graph Clustering

  • The weights of the links are generated according to:

$$q(w_{ij}|X_i,X_j) = \sum_{C_a,C_b} q(w_{ij}|C_a,C_b)\, q(C_a|X_i)\, q(C_b|X_j)$$

  • This is co-clustering with a shared q(C|X), as sketched below

– The same bounds and (almost the same) algorithms apply
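A sketch of the shared-factor composition, mirroring the co-clustering sketch above (names hypothetical):

```python
import numpy as np

def compose_graph_rule(q_w_cc, q_c_x):
    """q(w|Xi,Xj) = sum_{Ca,Cb} q(w|Ca,Cb) q(Ca|Xi) q(Cb|Xj),
    with the same q(C|X) used on both endpoints of the edge.

    q_w_cc : array [W, C, C] -- q(w | Ca, Cb)
    q_c_x  : array [C, X]    -- the shared q(C | X)
    Returns an array [W, X, X] representing q(w | Xi, Xj).
    """
    return np.einsum('wab,ai,bj->wij', q_w_cc, q_c_x, q_c_x)
```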

SLIDE 24

Application

  • Optimize the trade-off $F(Q) = N \hat{L}(Q) + \beta\, |X|\, I(X;C)$
  • Kings dataset

– Edge weights = exponentiated negative distance between DNS servers
– |X| = 1740
– Number of edges = 1,512,930

SLIDE 25

Graph Clustering Application

[Figure: empirical loss L̂(Q), mutual information I(X;C), and the bound as functions of the number of clusters |C|; left axis: loss, right axis: information (nats)]

SLIDE 26

Relation with Matrix Factorization

  • Co-clustering:

– $g(X_1,X_2) = \sum_{C_1,C_2} q(C_1|X_1)\, g(C_1,C_2)\, q(C_2|X_2)$
– $M \approx Q_1^T G Q_2$

  • Graph clustering:

– $g(X_1,X_2) = \sum_{C_1,C_2} q(C_1|X_1)\, g(C_1,C_2)\, q(C_2|X_2)$, with a shared q(C|X)
– $M \approx Q^T G Q$
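In matrix notation the two cases differ only in whether the assignment matrix is shared; a small numpy sketch (hypothetical names):

```python
import numpy as np

def coclustering_factorization(Q1, G, Q2):
    """M ~ Q1^T G Q2, with Q1: [C1, X1], G: [C1, C2], Q2: [C2, X2]."""
    return Q1.T @ G @ Q2

def graph_factorization(Q, G):
    """M ~ Q^T G Q: the same assignment matrix Q on both sides."""
    return Q.T @ G @ Q
```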

SLIDE 27

Summary of main contributions

  • Formulation of co-clustering and graph clustering (unsupervised learning) as prediction problems
  • PAC-Bayesian analysis of co-clustering and graph clustering

– Regularization terms

  • Encouraging empirical results
SLIDE 28

Future Directions

  • Practice:

– More applications

  • Theory:

– Continuous domains
– Multidimensional matrices

References

Co-clustering: Seldin & Tishby, JMLR 2010 (submitted, available online).
Graph clustering: Seldin, Social Analytics 2010.