Convex Biclustering
Eric Chi
Rice University; joint work with Genevera Allen and Rich Baraniuk
The Biclustering Problem
Task: Given a data matrix X ∈ R^{p×n}, find subgroups of rows and columns that go together.
- Text mining: similar documents share a small set of highly correlated words.
- Collaborative filtering: like-minded customers share similar preferences for a subset of products.
- Cancer genomics: subtypes of cancerous tumors share similar molecular profiles over a subset of genes.
Cancer Genomics
“Lung cancer” is heterogeneous at the molecular level. Which genes are driving “lung cancer”? These genes are potential drug targets. Collect expression data.

[Figure: genes × tissue-samples expression heatmap.]
[Figure: heatmap column labels; the tissue samples include Colon, Carcinoid, and SmallCell tumors.]

Simple Solution: Cluster Dendrogram
[Figure: the heatmap with genes and tissue samples reordered by hierarchical clustering.]

Hierarchical Clustering
[Animation: points labeled A through I are merged pairwise, greedily building up a dendrogram from the closest pairs.]
Simple Solution: Cluster Dendrogram
The Good:
- Easy to interpret
- Fast computation (greedy algorithm)

The Bad:
- Non-convex optimization problem
- Local minimizers
- Instability (to initialization, tuning parameters, or the data)

The Ugly:
- How to choose the number of biclusters?
More Sophisticated Approaches
SVD-like methods:
- Plaid: Lazzeroni & Owen (2000)
- Iterative signature algorithm: Bergmann et al. (2003)
- Sparse SVD: Lee et al. (2010)

Graph cut:
- Dhillon (2001); Kluger (2003)

Other approaches:
- LAS: Shabalin et al. (2009)
- Sparse transposable biclustering: Tan & Witten (2013)
- Harmonic analysis of digital databases: Coifman & Gavish (2010)

Goal: simple and interpretable like the clustered dendrogram, with good algorithmic behavior:
- Global minimizer
- Stability with respect to the data and other inputs
Solution: Convex Relaxation
Replace a combinatorially hard problem with a convex surrogate:
- All local minima are global minima.
- Algorithms converge to the global minimizer regardless of initialization.
- Solve a convex optimization problem to go from A to B.
Convex Biclustering
Contributions:
- Characterization of the solution to the convex program
- Stability of the solution in the tuning parameters and the data
- A simple, intuitive meta-algorithm that attains the unique global minimizer by alternating convex clustering of rows and columns
- Essentially one tuning parameter controls the number of biclusters
- A data-adaptive way to select the number of biclusters
Convex Clustering
Not much existing work, and most of it is recent: Pelckmans et al. (2005), Lindsten et al. (2011), Hocking et al. (2011), Chi & Lange (2013).

$$\underset{u_1,\dots,u_n}{\text{minimize}}\quad \frac{1}{2}\sum_{i=1}^{n}\|x_i - u_i\|_2^2 \;+\; \gamma \sum_{i<j} w_{ij}\,\|u_i - u_j\|_2$$

Assign a centroid u_i to each data point x_i (taken at face value, this is too many degrees of freedom!). The convex fusion penalty:
- shrinks the cluster centroids together;
- induces sparsity in the pairwise differences of centroids: u_i − u_j = 0 ⟺ x_i and x_j belong to the same cluster.

γ tunes the overall amount of regularization; the w_ij fine-tune the pairwise shrinkage. This generalizes the fused lasso / edge lasso (Sharpnack et al., 2012). Any ℓ_p norm with p ≥ 1 is okay in the fusion penalty.
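To make the objective concrete, here is a minimal numpy sketch (my own illustration, not the authors' code) that evaluates the convex clustering objective for given data, centroids, weights, and γ:

```python
import numpy as np

def convex_clustering_objective(X, U, W, gamma):
    """Evaluate (1/2) sum_i ||x_i - u_i||_2^2 + gamma * sum_{i<j} w_ij ||u_i - u_j||_2.

    X, U : p x n arrays (columns are data points / their centroids).
    W    : n x n symmetric weight matrix; only the i < j entries are used.
    """
    fidelity = 0.5 * np.sum((X - U) ** 2)
    penalty = 0.0
    n = X.shape[1]
    for i in range(n):
        for j in range(i + 1, n):
            if W[i, j] > 0:  # sparse weights: most pairs contribute nothing
                penalty += W[i, j] * np.linalg.norm(U[:, i] - U[:, j])
    return fidelity + gamma * penalty
```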
Choosing weights
Rules of thumb:
- w_ij ∝ ‖x_i − x_j‖^{−1}
- Most w_ij = 0

Why?
- Encourages similar points to fuse early → better clusterings.
- Computation and storage scale with the number of non-zero w_ij.
- Fiddle-free: set and forget.
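As an illustration of these rules of thumb, here is a hedged sketch (function name and rescaling are my own choices, not from the talk) that builds sparse inverse-distance weights, keeping only each point's k nearest neighbors so that most w_ij = 0:

```python
import numpy as np

def knn_inverse_distance_weights(X, k=5, eps=1e-8):
    """Sparse weights: w_ij proportional to ||x_i - x_j||^{-1}, restricted to
    k-nearest-neighbor pairs so that most w_ij = 0.

    X : p x n data matrix (columns are points). Returns an n x n symmetric W.
    """
    n = X.shape[1]
    # Pairwise Euclidean distances between columns of X.
    D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    W = np.zeros((n, n))
    for i in range(n):
        # Indices of the k nearest neighbors of point i (index 0 is i itself).
        nbrs = np.argsort(D[i])[1:k + 1]
        W[i, nbrs] = 1.0 / (D[i, nbrs] + eps)
    W = np.maximum(W, W.T)      # symmetrize: keep a pair if either point chose it
    return W / W[W > 0].mean()  # optional rescaling so the weights are O(1)
```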
The Solution Path
$$\underset{u_1,\dots,u_n}{\text{minimize}}\quad \frac{1}{2}\sum_{i=1}^{n}\|x_i - u_i\|_2^2 \;+\; \gamma \sum_{i<j} w_{ij}\,\|u_i - u_j\|_2$$

[Animation: as γ increases, centroids successively fuse, tracing out a solution path from every point in its own cluster down to a single cluster.]

γ tunes the number of clusters.
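At any point along the path, the cluster assignments can be read off the solution by grouping points whose centroids have (numerically) fused. A small sketch, assuming a tolerance `tol` for declaring u_i − u_j = 0:

```python
import numpy as np

def clusters_from_centroids(U, tol=1e-6):
    """Group columns of U whose centroids coincide (u_i - u_j ~ 0).

    Returns an integer label per column; the connected components of the
    "fused" graph are the clusters.
    """
    n = U.shape[1]
    labels = -np.ones(n, dtype=int)
    c = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        labels[i] = c
        stack = [i]
        # Sweep: any unlabeled point fused with a member joins cluster c.
        while stack:
            a = stack.pop()
            for b in range(n):
                if labels[b] < 0 and np.linalg.norm(U[:, a] - U[:, b]) < tol:
                    labels[b] = c
                    stack.append(b)
        c += 1
    return labels
```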
Convex Biclustering
Apply fusion penalties to both the columns and the rows:

$$\underset{U}{\text{minimize}}\quad F_\gamma(U) = \frac{1}{2}\|X - U\|_F^2 + \gamma\, J(U)$$

$$J(U) = \sum_{i<j} w_{ij}\,\|U_{\cdot i} - U_{\cdot j}\|_2 \;+\; \sum_{k<l} \tilde{w}_{kl}\,\|U_{k\cdot} - U_{l\cdot}\|_2$$

This leads to a “checkerboard” biclustering: every (i, j)th entry is assigned to one bicluster.
- The ith and jth columns belong to the same column cluster iff U_{·i} − U_{·j} = 0.
- The kth and lth rows belong to the same row cluster iff U_{k·} − U_{l·} = 0.
Biclustering path

$$\underset{U}{\text{minimize}}\quad \frac{1}{2}\|X - U\|_F^2 + \gamma\left[\sum_{i<j} w_{ij}\,\|U_{\cdot i} - U_{\cdot j}\|_2 + \sum_{k<l} \tilde{w}_{kl}\,\|U_{k\cdot} - U_{l\cdot}\|_2\right]$$

[Animation: as γ increases, columns and rows fuse simultaneously, coarsening the checkerboard from one bicluster per entry down to a single bicluster.]
Key Property of the Solution: Stability
$$U^\star = \underset{U}{\arg\min}\ \frac{1}{2}\|X - U\|_F^2 + \gamma\, J(U),\qquad J(U) = \sum_{i<j} w_{ij}\,\|U_{\cdot i} - U_{\cdot j}\|_2 + \sum_{k<l} \tilde{w}_{kl}\,\|U_{k\cdot} - U_{l\cdot}\|_2$$

Proposition: U⋆ exists, is unique, and depends continuously on (γ, W, W̃, X).

Implication: stability to perturbations in the data. The ith and jth columns belong to the same column cluster iff U⋆_{·i} − U⋆_{·j} = 0; since U⋆ is continuous in X, the difference U⋆_{·i} − U⋆_{·j} is also continuous in X.
The Meta-Algorithm: COBRA
COnvex BiclusteRing Algorithm: an instance of the Dykstra-like proximal algorithm (Bauschke & Combettes, 2008).

Algorithm 1 COBRA
Set U_0 = X, P_0 = 0, Q_0 = 0. For m = 0, 1, . . .:
1: repeat
2:   Y_m^T = prox_{γΩ_W̃}(U_m^T + P_m^T)   (convex cluster the rows)
3:   P_{m+1} = U_m + P_m − Y_m   (update correction)
4:   U_{m+1} = prox_{γΩ_W}(Y_m + Q_m)   (convex cluster the columns)
5:   Q_{m+1} = Y_m + Q_m − U_{m+1}   (update correction)
6: until convergence

COBRA converges to the global minimizer. Each proximal step is a convex clustering problem, solved via fast AMA (Chi & Lange, 2013):

$$\text{prox}_{\gamma \Omega_W}(Z) = \underset{U}{\arg\min}\ \frac{1}{2}\|Z - U\|_F^2 + \gamma \sum_{i<j} w_{ij}\,\|U_{\cdot i} - U_{\cdot j}\|_2$$
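The meta-algorithm itself is only a few lines once a convex clustering solver for the prox is available. Below is a sketch of the Dykstra-like iteration of Algorithm 1; `prox_row` and `prox_col` stand in for the convex clustering subproblem solvers (e.g., the fast AMA of Chi & Lange, 2013) and are assumed, not implemented here. The stopping rule is my own addition.

```python
import numpy as np

def cobra(X, prox_row, prox_col, max_iter=100, tol=1e-6):
    """Dykstra-like proximal iteration for convex biclustering (Algorithm 1).

    prox_row(Z): argmin_U 0.5||Z - U||_F^2 + gamma * Omega_Wtilde(U),
                 i.e. convex clustering with the row weights;
    prox_col(Z): same with the column weights W. Both are assumed black boxes.
    """
    U = X.copy()
    P = np.zeros_like(X)  # correction term for the row step
    Q = np.zeros_like(X)  # correction term for the column step
    for _ in range(max_iter):
        U_old = U
        Y = prox_row((U + P).T).T  # convex cluster the rows (via the transpose)
        P = U + P - Y              # update correction
        U = prox_col(Y + Q)        # convex cluster the columns
        Q = Y + Q - U              # update correction
        if np.linalg.norm(U - U_old, 'fro') <= tol * max(1.0, np.linalg.norm(U_old, 'fro')):
            break
    return U
```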
Model Selection
Validation scheme (Wold, 1978): randomly select a hold-out set Θ ⊂ {1, . . . , p} × {1, . . . , n} (∼10% of all entries), and decompose X = P_{Θᶜ}(X) + P_Θ(X), where P_Θ keeps the entries in Θ and zeroes out the rest.

Solve the matrix completion problem for a sequence of candidate γ_m:

$$U^\star_m = \underset{U}{\arg\min}\ \frac{1}{2}\|P_{\Theta^c}(X) - P_{\Theta^c}(U)\|_F^2 + \gamma_m\, J(U)$$

Pick the γ_m that minimizes the prediction error on Θ.

[Plot: hold-out prediction error versus −log(γ).]
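A sketch of the validation loop, assuming a hypothetical fitting routine `solve(X, mask, gamma)` (for instance, COBRA-POD below) that treats the masked entries as missing and returns the fitted U:

```python
import numpy as np

def holdout_validation(X, solve, gammas, frac=0.10, seed=0):
    """Wold-style hold-out: mask ~10% of entries, fit along a gamma grid,
    and score the prediction error on the held-out entries.

    solve(X, mask, gamma) is an assumed black-box fitting routine;
    mask is True on the held-out (Theta) entries.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < frac  # Theta: the held-out entries
    errors = []
    for gamma in gammas:
        U = solve(X, mask, gamma)
        errors.append(np.linalg.norm((X - U)[mask]) ** 2 / mask.sum())
    best = gammas[int(np.argmin(errors))]  # gamma with lowest hold-out error
    return best, errors
```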
Matrix Completion with COBRA-POD
COBRA with Partially Observed Data:
- Fill in the missing entries using the last iterate.
- Use COBRA to get the next iterate.
- Repeat.

Algorithm 2 COBRA-POD
1: Initialize U^(0).
2: repeat
3:   M ← P_{Θᶜ}(X) + P_Θ(U^(k))
4:   U^(k+1) ← COBRA(M)
5: until convergence

This is a majorization-minimization (MM) algorithm; a very similar majorization is used in Mazumder et al. (2010). It converges to the solution of the matrix completion problem, enabling biclustering when data are missing.
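The MM imputation loop is equally short. A sketch, assuming the `cobra` routine from the earlier sketch (wrapped as `cobra_solve` with its weights and γ fixed); the zero initialization and stopping rule are my own choices:

```python
import numpy as np

def cobra_pod(X, mask, cobra_solve, max_iter=50, tol=1e-6):
    """COBRA with Partially Observed Data (Algorithm 2).

    mask is True on the missing/held-out entries (Theta). Each MM step fills
    in the missing entries with the current iterate and reruns COBRA.
    """
    U = np.where(mask, 0.0, X)  # U^(0): observed entries of X, zeros elsewhere
    for _ in range(max_iter):
        M = np.where(mask, U, X)   # M = P_{Theta^c}(X) + P_Theta(U^(k))
        U_new = cobra_solve(M)     # U^(k+1) = COBRA(M)
        if np.linalg.norm(U_new - U) <= tol * max(1.0, np.linalg.norm(U)):
            return U_new
        U = U_new
    return U
```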
Comparison: Checkerboard + Gaussian
X = “checkerboard” + i.i.d. N(0, σ²) noise.

[Plot: distribution of the Rand index (higher is better) for COBRA versus sparse biclustering (spBC) under low-variance and high-variance noise.]

Comparison: Stability
Lung cancer data: baseline clustering versus clustering on perturbed data.
Summary
Convex relaxation of biclustering:
- Simple, interpretable solutions like the clustered dendrogram.
- Algorithmically well behaved: a unique global minimizer, with stability with respect to initialization, parameters, and data.
- The meta-algorithm uses convex clustering as a primitive.
- Essentially one tuning parameter controls the number of biclusters, with a data-dependent way of selecting it.

Future work
- Inexact solutions for very large problems.
- Connections to computational harmonic analysis.
Thank you!