PAC-Bayesian Analysis of Co-clustering, Graph Clustering and Pairwise Clustering - PowerPoint PPT Presentation
PAC-Bayesian Analysis of Co-clustering, Graph Clustering and Pairwise Clustering
Yevgeny Seldin
Motivation
- Clustering cannot be analyzed without specifying what it will be used for!
Example
- Cluster then pack
- Clustering by shape is preferable
- Evaluate the amount of time saved
How to define a clustering problem?
- Common pitfall: the goal is defined in terms of the solution
– Graph cut
– Spectral clustering
– Information-theoretic approaches
- Which one to choose? How to compare?
- Our goal: suggest a problem formulation that is independent of the solution method
Outline
- Two problems behind co-clustering
– Discriminative prediction
– Density estimation
- PAC-Bayesian analysis of discriminative prediction with co-clustering
- PAC-Bayesian analysis of graph clustering
Discriminative Prediction with Co-clustering
- Example: collaborative filtering
- Goal: find discriminative prediction rule q(Y|X1,X2)
[Figure: ratings matrix with rows X1 (viewers), columns X2 (movies), entries Y]
- Evaluation:
L(q) = E_{p(X1,X2,Y)} E_{q(Y'|X1,X2)} l(Y, Y')
– Expectation w.r.t. the true distribution p(X1,X2,Y)
– Expectation w.r.t. the classifier q(Y'|X1,X2)
– Given loss l(Y,Y')
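The evaluation criterion above can be computed directly on a toy example. A minimal numerical sketch, assuming made-up sizes and random distributions, with the zero-one loss standing in for a generic l(Y,Y'):

```python
import numpy as np

# Hypothetical toy setup: 2 viewers (X1), 2 movies (X2), binary ratings Y.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(2 * 2 * 2)).reshape(2, 2, 2)  # true p(X1, X2, Y)
q = rng.dirichlet(np.ones(2), size=(2, 2))              # classifier q(Y' | X1, X2)

def zero_one(y, y_pred):
    # Example loss l(Y, Y'): 1 on a mistake, 0 otherwise.
    return float(y != y_pred)

# L(q) = E_{p(X1,X2,Y)} E_{q(Y'|X1,X2)} l(Y, Y')
loss = 0.0
for x1 in range(2):
    for x2 in range(2):
        for y in range(2):
            for y_pred in range(2):
                loss += p[x1, x2, y] * q[x1, x2, y_pred] * zero_one(y, y_pred)

# With a [0,1]-valued loss and proper distributions, L(q) lies in [0,1].
assert 0.0 <= loss <= 1.0
```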
Co-occurrence Data Analysis
- Example: words-documents co-occurrence data
- Goal: find an estimator q(X1,X2) for the joint distribution p(X1,X2)
[Figure: co-occurrence matrix with rows X1 (documents), columns X2 (words)]
- Evaluation:
L(q) = − E_{p(X1,X2)} ln q(X1,X2)
– Expectation w.r.t. the true distribution p(X1,X2)
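The log-loss criterion above can be checked numerically. A sketch, assuming a made-up 3x4 co-occurrence space; Gibbs' inequality gives a built-in sanity check (the expected log-loss of any estimator is at least the entropy of the true distribution):

```python
import numpy as np

# Hypothetical toy setup: 3 documents (X1) x 4 words (X2).
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(12)).reshape(3, 4)  # true joint p(X1, X2)
q = rng.dirichlet(np.ones(12)).reshape(3, 4)  # model estimate q(X1, X2)

# L(q) = -E_{p(X1,X2)} ln q(X1,X2)   (expected log-loss, in nats)
loss = -np.sum(p * np.log(q))

# Gibbs' inequality: log-loss is minimized by q = p, where it equals H(p).
entropy = -np.sum(p * np.log(p))
assert loss >= entropy
```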
Outline
- Two problems behind co-clustering
– Discriminative prediction
– Density estimation
- PAC-Bayesian analysis of discriminative prediction with co-clustering
- PAC-Bayesian analysis of graph clustering
Discriminative prediction based on co-clustering
- Model:
q(Y|X1,X2) = Σ_{C1,C2} q(Y|C1,C2) q(C1|X1) q(C2|X2)
- Graphical model: X1 → C1, X2 → C2; (C1, C2) → Y
- Denote: Q = {q(C1|X1), q(C2|X2), q(Y|C1,C2)}
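The factorized model above is a sum over cluster pairs, which maps directly to a tensor contraction. A sketch, assuming made-up sizes (|X1|=5, |X2|=6, |C1|=2, |C2|=3, |Y|=5) and random parameters:

```python
import numpy as np

# Hypothetical parameters of Q = {q(C1|X1), q(C2|X2), q(Y|C1,C2)}.
rng = np.random.default_rng(0)
q_c1 = rng.dirichlet(np.ones(2), size=5)       # q(C1 | X1), shape (5, 2)
q_c2 = rng.dirichlet(np.ones(3), size=6)       # q(C2 | X2), shape (6, 3)
q_y = rng.dirichlet(np.ones(5), size=(2, 3))   # q(Y | C1, C2), shape (2, 3, 5)

# q(Y | X1, X2) = sum_{C1,C2} q(Y | C1, C2) q(C1 | X1) q(C2 | X2)
q_y_given_x = np.einsum('ac,bd,cdy->aby', q_c1, q_c2, q_y)

# Each (X1, X2) pair yields a proper distribution over Y.
assert np.allclose(q_y_given_x.sum(axis=-1), 1.0)
```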
Generalization Bound
- With probability ≥ 1−δ:
kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i; C_i) + K) / N
where q(Y|X1,X2) = Σ_{C1,C2} q(Y|C1,C2) q(C1|X1) q(C2|X2)
and Q = {q(C1|X1), q(C2|X2), q(Y|C1,C2)}
- A looser, but simpler form of the bound:
kl(L̂(Q) || L(Q)) = L̂(Q) ln (L̂(Q) / L(Q)) + (1 − L̂(Q)) ln ((1 − L̂(Q)) / (1 − L(Q)))
L(Q) ≤ L̂(Q) + √(2 L̂(Q) (Σ_i |X_i| I(X_i; C_i) + K) / N) + 2 (Σ_i |X_i| I(X_i; C_i) + K) / N
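One way to read the two forms of the bound: the tight version is obtained by inverting the binary kl divergence numerically, while the looser version is explicit. A sketch with made-up numbers for the empirical loss, the complexity term Σ_i |X_i| I(X_i;C_i) + K, and the sample size:

```python
import numpy as np

def kl_bin(p_hat, p):
    # Binary kl divergence kl(p_hat || p), in nats.
    eps = 1e-12
    return (p_hat * np.log((p_hat + eps) / (p + eps))
            + (1 - p_hat) * np.log((1 - p_hat + eps) / (1 - p + eps)))

def kl_inverse(p_hat, bound, tol=1e-9):
    # Largest p >= p_hat with kl(p_hat || p) <= bound (kl is increasing there).
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bin(p_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical numbers (for illustration only).
L_hat, complexity, N = 0.2, 500.0, 80000
rhs = complexity / N

L_upper = kl_inverse(L_hat, rhs)                       # tight, via kl inversion
L_loose = L_hat + np.sqrt(2 * L_hat * rhs) + 2 * rhs   # looser explicit form

# The kl-inverted bound is never worse than the explicit relaxation.
assert L_upper <= L_loose + 1e-6
```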
Generalization Bound
- With probability ≥ 1−δ:
kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i; C_i) + K) / N
- where
K = Σ_i |C_i| ln |X_i| + (Π_i |C_i|) ln |Y| + ½ ln(4N) − ln δ
– Σ_i |C_i| ln |X_i|: logarithmic in |X_i|
– (Π_i |C_i|) ln |Y|: number of partition cells
– ½ ln(4N) − ln δ: PAC-Bayesian bound part
Generalization Bound
- With probability ≥ 1−δ:
kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i; C_i) + K) / N
- The complexity of each partition ranges from I(X_i; C_i) = 0 (low complexity) to I(X_i; C_i) = ln |X_i| (high complexity)
Generalization Bound
- With probability ≥ 1−δ:
kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i; C_i) + K) / N
- Optimization tradeoff: empirical loss vs. "effective" partition complexity (lower vs. higher complexity)
Practice
- With probability ≥ 1−δ:
kl(L̂(Q) || L(Q)) ≤ (Σ_i |X_i| I(X_i; C_i) + K) / N
- Replace the bound with a trade-off:
F(Q) = N L̂(Q) + β Σ_i |X_i| I(X_i; C_i)
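The trade-off objective F(Q) needs the mutual informations I(X_i; C_i) induced by the soft cluster assignments. A sketch, assuming a uniform distribution over X_i and made-up sizes and assignments:

```python
import numpy as np

def mutual_information(q_c_given_x):
    # I(X; C) under uniform p(X) and soft assignments q(C|X), in nats.
    n = q_c_given_x.shape[0]
    p_c = q_c_given_x.mean(axis=0)          # cluster marginal q(C)
    return np.sum(q_c_given_x * np.log(q_c_given_x / p_c)) / n

# Hypothetical setup: assignments for the two sides of the matrix.
rng = np.random.default_rng(0)
q1 = rng.dirichlet(np.ones(3), size=10)     # q(C1|X1), |X1|=10, |C1|=3
q2 = rng.dirichlet(np.ones(4), size=20)     # q(C2|X2), |X2|=20, |C2|=4
L_hat, N, beta = 0.25, 1000, 2.0            # made-up loss, sample size, beta

# F(Q) = N * L_hat(Q) + beta * sum_i |X_i| I(X_i; C_i)
F = N * L_hat + beta * (10 * mutual_information(q1)
                        + 20 * mutual_information(q2))

# Mutual information is non-negative, so F can only add to the loss term.
assert F >= N * L_hat
```

Minimizing F over Q for a grid of β values (rather than minimizing the bound directly) is what the trade-off slide suggests; β controls how strongly the partition complexity is penalized.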
Application
- MovieLens dataset
– 100,000 ratings on a 5-star scale
– 80,000 train ratings, 20,000 test ratings
– 943 viewers × 1682 movies
– State-of-the-art Mean Absolute Error (0.72)
– The optimal performance is achieved even with a 300×300 cluster space
13x6 Clusters
[Figure: Mean Absolute Error of the bound and the test MAE as a function of β, for F(Q) = N L̂(Q) + β Σ_i |X_i| I(X_i; C_i)]
50x50 Clusters
[Figure: Mean Absolute Error of the bound and the test MAE as a function of β]
283x283 Clusters
[Figure: Mean Absolute Error of the bound and the test MAE as a function of β]
Weighted Graph Clustering
- The weights of the edges wij are generated by an unknown distribution p(wij|xi,xj)
- Given a sample of size N of edge weights
- Build a model q(w|x1,x2) such that
E_{p(x1,x2,w)} E_{q(w'|x1,x2)} l(w,w') is minimized
Other problems
- Pairwise clustering = clustering of a
weighted graph
– Edge weights = pairwise relations
- Clustering of an unweighted graph
– Present edges = weight 1
– Absent edges = weight 0
Weighted Graph Clustering
- The weights of the links are generated
according to:
q(wij|Xi,Xj) = ΣCa,Cb q(wij|Ca,Cb) q(Ca|Xi) q(Cb|Xj)
- This is co-clustering with shared q(C|X)
– Same bounds and (almost same) algorithms apply
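The only change from the co-clustering model is that both endpoints of an edge use the same assignment distribution q(C|X). A sketch with made-up sizes (|X|=6 nodes, |C|=2 clusters, binary edge weights):

```python
import numpy as np

# Hypothetical parameters: one shared q(C|X) and cluster-level q(w|Ca,Cb).
rng = np.random.default_rng(0)
q_c = rng.dirichlet(np.ones(2), size=6)        # shared q(C|X), shape (6, 2)
q_w = rng.dirichlet(np.ones(2), size=(2, 2))   # q(w | Ca, Cb), shape (2, 2, 2)

# q(w | Xi, Xj) = sum_{Ca,Cb} q(w | Ca, Cb) q(Ca | Xi) q(Cb | Xj)
# Note q_c appears twice: the same assignments serve both endpoints.
q_w_given_x = np.einsum('ia,jb,abw->ijw', q_c, q_c, q_w)

# Each node pair yields a proper distribution over edge weights.
assert np.allclose(q_w_given_x.sum(axis=-1), 1.0)
```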
Application
- Optimize the trade-off:
F(Q) = N L̂(Q) + β |X| I(X; C)
- Kings dataset
– Edge weights = exponentiated negative distance between DNS servers
– |X| = 1740
– Number of edges = 1,512,930
Graph Clustering Application
[Figure: loss L̂(Q), its bound, and information I(X;C) in nats as functions of the number of clusters |C|]
Relation with Matrix Factorization
- Co-clustering:
– g(X1,X2) = Σ_{C1,C2} q(C1|X1) g(C1,C2) q(C2|X2)
– M ≈ Q1ᵀ G Q2
- Graph clustering:
– g(X1,X2) = Σ_{C1,C2} q(C1|X1) g(C1,C2) q(C2|X2) with shared q(C|X)
– M ≈ Qᵀ G Q
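The factorization view above can be sketched numerically: stacking the assignment distributions as columns of Q1 and Q2 turns the double sum into matrix products. All sizes and values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Co-clustering: M ≈ Q1ᵀ G Q2.
Q1 = rng.dirichlet(np.ones(3), size=8).T    # (3, 8): column x1 holds q(C1|x1)
Q2 = rng.dirichlet(np.ones(4), size=9).T    # (4, 9): column x2 holds q(C2|x2)
G = rng.normal(size=(3, 4))                 # g(C1, C2): cluster-level values

# g(X1, X2) = sum_{C1,C2} q(C1|X1) g(C1,C2) q(C2|X2)  as a matrix product.
M = Q1.T @ G @ Q2
assert M.shape == (8, 9)

# Graph clustering: the symmetric case with shared assignments, M ≈ Qᵀ G Q.
Q = rng.dirichlet(np.ones(3), size=8).T     # one shared (3, 8) factor
M_sym = Q.T @ rng.normal(size=(3, 3)) @ Q
assert M_sym.shape == (8, 8)
```

Unlike generic low-rank factorization, the factors here are constrained to be conditional probability distributions, which is what makes the PAC-Bayesian analysis applicable.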
Summary of main contributions
- Formulation of co-clustering and graph
clustering (unsupervised learning) as prediction problems
- PAC-Bayesian analysis of co-clustering and graph clustering
– Regularization terms
- Encouraging empirical results
Future Directions
- Practice:
– More applications
- Theory: