pac bayesian analysis of co clustering graph clustering
play

PAC-Bayesian Analysis of Co-clustering, Graph Clustering and - PowerPoint PPT Presentation

PAC-Bayesian Analysis of Co-clustering, Graph Clustering and Pairwise Clustering Yevgeny Seldin Motivation Example Clustering cannot be analyzed without specifying what it will be used for! be used for! Example Cluster then pack


  1. PAC-Bayesian Analysis of Co-clustering, Graph Clustering and Pairwise Clustering Yevgeny Seldin

  2. Motivation Example • Clustering cannot be analyzed without specifying what it will be used for! be used for!

  3. Example • Cluster then pack • Cluster then pack • Clustering by shape is preferable • Evaluate the amount of time saved

  4. How to define a clustering problem? • Common pitfall: the goal is defined in terms of the solution – Graph cut – Spectral clustering – Spectral clustering – Information-theoretic approaches • Which one to choose??? How to compare? • Our goal: suggest problem formulation which is independent of the way of solution

  5. Outline • Two problems behind co-clustering – Discriminative prediction – Density estimation • PAC-Bayesian analysis of discriminative • PAC-Bayesian analysis of discriminative prediction with co-clustering • PAC-Bayesian analysis of graph clustering

  6. Discriminative Prediction with Co-clustering • Example: collaborative filtering X 2 (movies) • Goal: find discriminative viewers) Y prediction rule q ( Y | X 1 , X 2 ) Y X 1 (vie Y

  7. Discriminative Prediction with Co-clustering • Example: collaborative filtering X 2 (movies) • Goal: find discriminative viewers) Y prediction rule q ( Y | X 1 , X 2 ) Y X 1 (vie • Evaluation: • Evaluation: Y = ( ) ( , ' ) L q E E l Y Y ( , , ) ( '| , ) p X X Y q Y X X 1 2 1 2 Expectation w.r.t. the Expectation Given true distribution w.r.t. the loss p ( X 1 , X 2 , Y ) classifier l ( Y , Y’ ) q ( Y | X 1 , X 2 )

  8. Co-occurrence Data Analysis • Example: words-documents co- X 2 (words) occurrence data cuments) • Goal: find an estimator q ( X 1 , X 2 ) for the joint distribution p ( X , X ) for the joint distribution p ( X 1 , X 2 ) X 1 (docu

  9. Co-occurrence Data Analysis • Example: words-documents co- X 2 (words) occurrence data cuments) • Goal: find an estimator q ( X 1 , X 2 ) for the joint distribution p ( X , X ) for the joint distribution p ( X 1 , X 2 ) X 1 (docu • Evaluation: = − ( ) ln ( , ) L q E q X X p ( X , X ) 1 2 1 2 The true distribution p ( X 1 , X 2 )

  10. Outline • Two problems behind co Two problems behind co Two problems behind co-clustering clustering clustering – Discriminative prediction Discriminative prediction Discriminative prediction – Density estimation Density estimation Density estimation • PAC-Bayesian analysis of discriminative • PAC-Bayesian analysis of discriminative prediction with co-clustering • PAC PAC PAC-Bayesian analysis of graph clustering Bayesian analysis of graph clustering Bayesian analysis of graph clustering

  11. Discriminative prediction based on co-clustering ∑ = Model: ( | , ) ( | , ) ( | ) ( | ) q Y X X q Y C C q C X q C X 1 2 1 2 1 1 2 2 C 1 , C 2 Y C 1 C 2 X 1 X 2 Denote: Q = { q ( C 1 | X 1 ), q ( C 2 | X 2 ), q ( Y | C 1 , C 2 )}

  12. { } ∑ Q = = ( | , ), ( | ), ( | ) ( | , ) ( | , ) ( | ) ( | ) q Y C C q C X q C X q Y X X q Y C C q C X q C X 1 2 1 2 1 1 2 2 1 2 1 1 2 2 C 1 , C 2 Generalization Bound • With probability ≥ 1- δ : ≤ ∑ + ( ; ) X I X C K i i i ˆ i ( ( ) || ( )) kl L Q L Q N ˆ ˆ ( ) ( ) L Q L Q ˆ ˆ ˆ = + − ( ( ) || ( )) ( ) ln ( 1 ( )) ln kl L Q L Q L Q L Q ( ) ( ) L Q L Q • A looser, but simpler form of the bound:     ∑ ∑ ˆ  +   +  2 ( ) ( ; ) 2 ( ; ) L Q X I X C K X I X C K i i i i i i     ≤ ˆ + + i i ( ) ( ) L Q L Q N N

  13. { } ∑ Q = = ( | , ), ( | ), ( | ) ( | , ) ( | , ) ( | ) ( | ) q Y C C q C X q C X q Y X X q Y C C q C X q C X 1 2 1 2 1 1 2 2 1 2 1 1 2 2 C 1 , C 2 Generalization Bound • With probability ≥ 1- δ : ≤ ∑ + ( ; ) X I X C K i i i ˆ i ( ( ) || ( )) kl L Q L Q N   ∑ ∏ = + + − δ   ln ln ln( 4 ) / 2 ln K C X C Y N   i i i i i Logarithmic Number of PAC-Bayesian in | X i | partition cells bound part

  14. { } ∑ Q = = ( | , ), ( | ), ( | ) ( | , ) ( | , ) ( | ) ( | ) q Y C C q C X q C X q Y X X q Y C C q C X q C X 1 2 1 2 1 1 2 2 1 2 1 1 2 2 C 1 , C 2 Generalization Bound • With probability ≥ 1- δ : ≤ ∑ + ( ; ) X I X C K i i i ˆ i ( ( ) || ( )) kl L Q L Q N High Complexity Low Complexity I ( X i ; C i ) = ln| X i | I ( X i ; C i ) = 0

  15. { } ∑ Q = = ( | , ), ( | ), ( | ) ( | , ) ( | , ) ( | ) ( | ) q Y C C q C X q C X q Y X X q Y C C q C X q C X 1 2 1 2 1 1 2 2 1 2 1 1 2 2 C 1 , C 2 Generalization Bound • With probability ≥ 1- δ : ≤ ∑ + ( ; ) X I X C K i i i ˆ i ( ( ) || ( )) kl L Q L Q N Optimization tradeoff: Empirical loss vs. “Effective” partition Lower Higher complexity Complexity Complexity

  16. Practice • With probability ≥ 1- δ : ≤ ∑ + ( ; ) X I X C K i i i ˆ i ( ( ) || ( )) kl L Q L Q N • Replace with a trade-off: ∑ = β ˆ + ( ) ( ) ( ; ) F Q N L Q X I X C i i i i

  17. Application • MovieLens dataset – 100,000 ratings on 5-star scale – 80,000 train ratings, 20,000 test ratings – 943 viewers x 1682 movies – 943 viewers x 1682 movies – State-of-the-art Mean Absolute Error (0.72) – The optimal performance is achieved even with 300x300 cluster space

  18. 13x6 Clusters ∑ ˆ = β + ( ) ( ) ( ; ) F Q N L Q X I X C i i i i Bound Test MAE ute Error Test MAE Mean Absolute β β

  19. 50x50 Clusters ∑ ˆ = β + ( ) ( ) ( ; ) F Q N L Q X I X C i i i i Bound Test MAE ute Error Test MAE Mean Absolute β β

  20. 283x283 Clusters ∑ ˆ = β + ( ) ( ) ( ; ) F Q N L Q X I X C i i i i Test MAE ute Error Bound Bound Mean Absolute Test MAE β β

  21. Weighted Graph Clustering • The weigts of the edges w ij are generated by unknown distribution p ( w ij | x i , x j ) • Given a sample of size N of edge weights • Given a sample of size N of edge weights • Build a model q ( w | x 1 , x 2 ) such that E p ( x1 , x2 , w ) E q ( w ’| x 1, x 2) l ( w , w ’) is minimized

  22. Other problems • Pairwise clustering = clustering of a weighted graph – Edge weights = pairwise relations • Clustering of unweigted graph – Present edges = weight 1 – Absent edges = weight 0

  23. Weighted Graph Clustering • The weights of the links are generated according to: q ( w ij | X i , X j ) = Σ C a , C b q ( w ij | C a , C b ) q ( C a | X i ) q ( C b | X j ) • This is co-clustering with shared q ( C | X ) – Same bounds and (almost same) algorithms apply

  24. Application • Optimize the trade-off = β + ˆ ( ) ( ) ( ; ) F Q N L Q X I X C • Kings dataset – Edge weights = exponentiated negative – Edge weights = exponentiated negative distance between DNS servers – |X| = 1740 – Number of edges = 1,512,930

  25. Graph Clustering Application 0.05 • 2 ^ (Q) L I(X;C) Bound Inform 0.04 1.5 rmation (nats) ss Loss 0.03 1 0.02 0.5 0.01 • 0 0 5 10 15 |C|

  26. Relation with Matrix Factorization • Co-clustering: – g ( X 1 , X 2 ) = Σ C 1, C 2 q ( C 1 | X 1 ) g ( C 1 , C 2 ) q ( C 2 | X 2 ) – M ≈ Q 1 T GQ 2 • Graph clustering: – g ( X 1 , X 2 ) = Σ C 1, C 2 q ( C 1 | X 1 ) g ( C 1 , C 2 ) q ( C 2 | X 2 ) – M ≈ Q T GQ

  27. Summary of main contributions • Formulation of co-clustering and graph clustering (unsupervised learning) as prediction problems • PAC-Bayesian analysis of co-clustering and graph clustering – Regularization terms • Encouraging empirical results

  28. Future Directions • Practice: – More applications • Theory: – Continuous domains – Continuous domains – Multidimensional matrices References Co-clustering: Seldin & Tishby JMLR 2010 submitted, avail.online Graph clustering: Seldin Social Analytics 2010

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend