Convex Biclustering
Eric Chi, Rice University
Joint work with Genevera Allen and Rich Baraniuk


SLIDE 1

Convex Biclustering

Eric Chi

Rice University joint work with Genevera Allen and Rich Baraniuk
  • E. Chi
Convex Biclustering 1
SLIDE 2

The Biclustering Problem

Task: given a data matrix X ∈ ℝ^{p×n}, find subgroups of rows and columns that go together.
  • Text mining: similar documents share a small set of highly correlated words.
  • Collaborative filtering: like-minded customers share similar preferences for a subset of products.
  • Cancer genomics: subtypes of cancerous tumors share similar molecular profiles over a subset of genes.
SLIDE 3

Cancer Genomics

  • “Lung cancer” is heterogeneous at the molecular level.
  • Which genes are driving “lung cancer”?
  • These genes are potential drug targets.
  • Collect expression data.

[Figure: heatmap of gene expression (genes × tissue samples), with columns labeled by tumor subtype (Carcinoid, Colon, SmallCell, …)]
SLIDE 4

Simple Solution: Cluster Dendrogram

[Figure: the expression heatmap (genes × tissue samples) reordered by hierarchical clustering, with row and column dendrograms]
SLIDE 5

Hierarchical Clustering

[Figure: scatter plot of five points a–e in the plane alongside the corresponding dendrogram (height 0.0–2.5), built up merge by merge]
SLIDES 6–10

Hierarchical Clustering (animation: the same scatter plot and dendrogram, with successive agglomerative merges added one slide at a time)
SLIDE 11

Simple Solution: Cluster Dendrogram

The Good:
  • Easy to interpret
  • Fast computation (greedy algorithm)
The Bad:
  • Non-convex optimization problem
  • Local minimizers
  • Instability (in initialization, tuning parameters, or data)
The Ugly:
  • How do we choose the number of biclusters?
SLIDE 12

More Sophisticated Approaches

SVD-like methods:
  • Plaid (Lazzeroni & Owen, 2000)
  • Iterative signature algorithm (Bergmann et al., 2003)
  • Sparse SVD (Lee et al., 2010)
Graph cut:
  • Dhillon (2001), Kluger (2003)
Other approaches:
  • LAS (Shabalin et al., 2009)
  • Sparse transposable biclustering (Tan & Witten, 2013)
  • Harmonic analysis of digital databases (Coifman & Gavish, 2010)

Goal: be simple and interpretable like the clustered dendrogram, with good algorithmic behavior:
  • a global minimizer
  • stability with respect to the data and other inputs
SLIDE 13

Solution: Convex Relaxation

  • Solve a combinatorially hard problem with a convex surrogate.
  • All local minima are global minima.
  • Algorithms converge to the global minimizer regardless of initialization.
  • Solve a convex optimization problem to go from A to B.

[Figure: raw expression heatmap (A), with samples in arbitrary order, next to the biclustered heatmap (B)]
SLIDE 14

Convex Biclustering

Contributions:
  • Characterization of the solution to the convex program.
  • Stability of the solution in the tuning parameters and the data.
  • A simple, intuitive meta-algorithm that attains the unique global minimizer: alternate convex clustering of the rows and columns.
  • Essentially one tuning parameter controls the number of biclusters.
  • A data-adaptive way of selecting the number of biclusters.
SLIDE 15

Convex Clustering

Not much existing work, and most of it is recent: Pelckmans et al. (2005), Lindsten et al. (2011), Hocking et al. (2011), Chi & Lange (2013).

    minimize_u  (1/2) Σ_{i=1}^n ‖x_i − u_i‖₂² + γ Σ_{i<j} w_ij ‖u_i − u_j‖₂

Assign a centroid u_i to each data point x_i.

Convex fusion penalty:
  • shrinks cluster centroids together
  • induces sparsity in pairwise differences of centroids: u_i − u_j = 0 ⟺ x_i and x_j belong to the same cluster
  • γ tunes the overall amount of regularization
  • w_ij fine-tunes pairwise shrinkage
  • generalizes the fused lasso / edge lasso (Sharpnack et al., 2012)
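The objective above is easy to state in code. Below is a minimal NumPy sketch that only evaluates it for given centroids; the solver itself (e.g. AMA) is not shown, and the function name and column-wise data layout are illustrative choices, not from the talk.

```python
import numpy as np

def convex_clustering_objective(X, U, W, gamma):
    """Evaluate (1/2) sum_i ||x_i - u_i||_2^2 + gamma * sum_{i<j} w_ij ||u_i - u_j||_2.

    Data points x_i and centroids u_i are the columns of X and U; W holds the
    pairwise weights w_ij in its upper triangle.
    """
    n = X.shape[1]
    fit = 0.5 * np.sum((X - U) ** 2)           # squared-error term
    penalty = 0.0
    for i in range(n):
        for j in range(i + 1, n):              # fusion penalty over pairs i < j
            penalty += W[i, j] * np.linalg.norm(U[:, i] - U[:, j])
    return fit + gamma * penalty
```

At γ = 0 the minimizer is U = X (each point is its own cluster); as γ grows, the fusion term drives centroids, and hence clusters, together.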
SLIDE 16

Convex Clustering

(Same objective as the previous slide.)

Too many degrees of freedom!

SLIDE 17

Convex Clustering

(Same objective as the previous slide; figure callout only.)

SLIDE 18

Convex Clustering

(Same objective as the previous slide.)

p ≥ 1 okay
SLIDE 19

Choosing Weights

Rules of thumb:
  • w_ij ∝ ‖x_i − x_j‖⁻¹
  • most w_ij = 0
Why?
  • Encourages similar points to fuse early → better clusterings.
  • Computation and storage scale with the number of non-zero w_ij.
  • Fiddle-free: set and forget.
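The rules of thumb above can be sketched as follows, under the assumption that the sparsity pattern comes from a k-nearest-neighbor rule; the particular choice of k and the function name are illustrative, not from the slides, and distinct data points are assumed.

```python
import numpy as np

def knn_inverse_distance_weights(X, k=5):
    """Sparse rule-of-thumb weights: w_ij proportional to ||x_i - x_j||^{-1},
    kept only for k-nearest-neighbor pairs; all other w_ij = 0.

    Columns of X are data points; points are assumed distinct (no zero
    distances). Returns a symmetric (n, n) weight matrix.
    """
    n = X.shape[1]
    # pairwise Euclidean distances between columns
    D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]   # k nearest neighbors, skipping i itself
        for j in nbrs:
            W[i, j] = W[j, i] = 1.0 / D[i, j]
    return W
```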
SLIDE 20

The Solution Path

    minimize_u  (1/2) Σ_{i=1}^n ‖x_i − u_i‖₂² + γ Σ_{i<j} w_ij ‖u_i − u_j‖₂

[Figure: data points in the plane; as γ grows, the fitted centroids trace a path and fuse]
SLIDES 21–29

The Solution Path (animation: the same objective as SLIDE 20, with γ increasing slide by slide; the centroids progressively fuse along the path)

γ tunes the number of clusters.
SLIDE 30

Convex Biclustering

Apply fusion penalties to both columns and rows:

    minimize_U  F_γ(U) = (1/2) ‖X − U‖_F² + γ J(U)

    J(U) = Σ_{i<j} w_ij ‖U_{·i} − U_{·j}‖₂ + Σ_{k<l} w̃_kl ‖U_{k·} − U_{l·}‖₂

This leads to a “checkerboard” biclustering: every (i, j)-th entry is assigned to one bicluster.
  • The i-th and j-th columns belong to the same column cluster iff U_{·i} − U_{·j} = 0.
  • The k-th and l-th rows belong to the same row cluster iff U_{k·} − U_{l·} = 0.
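A minimal NumPy sketch evaluating F_γ(U), with samples stored as columns; the function names are illustrative, and only the objective (not the solver) is shown.

```python
import numpy as np

def bicluster_objective(X, U, W_col, W_row, gamma):
    """Evaluate F_gamma(U) = (1/2) ||X - U||_F^2 + gamma * J(U), where J(U)
    adds a fusion penalty over column pairs (weights W_col) and one over
    row pairs (weights W_row)."""
    def fusion(M, W):
        # sum over pairs i < j of W[i, j] * ||M[:, i] - M[:, j]||_2
        m = M.shape[1]
        return sum(W[i, j] * np.linalg.norm(M[:, i] - M[:, j])
                   for i in range(m) for j in range(i + 1, m))
    J = fusion(U, W_col) + fusion(U.T, W_row)   # columns, then rows via transpose
    return 0.5 * np.sum((X - U) ** 2) + gamma * J
```

Setting either weight matrix to zero recovers plain convex clustering of the columns or rows alone.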
SLIDE 31

Biclustering Path

    minimize_U  (1/2) ‖X − U‖_F² + γ [ Σ_{i<j} w_ij ‖U_{·i} − U_{·j}‖₂ + Σ_{k<l} w̃_kl ‖U_{k·} − U_{l·}‖₂ ]

[Figure: heatmap of the fitted U; rows and columns progressively fuse as γ increases]
SLIDES 32–40

Biclustering Path (animation: the same objective as SLIDE 31, with γ increasing slide by slide; the heatmap smooths toward a coarser checkerboard)
SLIDE 41

Key Property of the Solution: Stability

    U* = arg min_U  (1/2) ‖X − U‖_F² + γ J(U)

    J(U) = Σ_{i<j} w_ij ‖U_{·i} − U_{·j}‖₂ + Σ_{k<l} w̃_kl ‖U_{k·} − U_{l·}‖₂

Proposition: U* exists, is unique, and depends continuously on (γ, W, W̃, X).

Implication: stability to perturbations in the data.
  • The i-th and j-th columns belong to the same column cluster iff U*_{·i} − U*_{·j} = 0.
  • Since U* is continuous in X, U*_{·i} − U*_{·j} is continuous in X.
SLIDE 42

The Meta-Algorithm: COBRA

COnvex BiclusteRing Algorithm: an instance of the Dykstra-like proximal algorithm (Bauschke & Combettes, 2008).

Algorithm 1 COBRA
  Set U_0 = X, P_0 = 0, Q_0 = 0
  1: repeat, for m = 0, 1, …
  2:   Y_m^T = prox_{γΩ_W̃}(U_m^T + P_m^T)    (convex cluster the rows)
  3:   P_{m+1} = U_m + P_m − Y_m              (update correction)
  4:   U_{m+1} = prox_{γΩ_W}(Y_m + Q_m)       (convex cluster the columns)
  5:   Q_{m+1} = Y_m + Q_m − U_{m+1}          (update correction)
  6: until convergence

  • COBRA converges to the global minimizer.
  • The convex clustering subproblems are solved via fast AMA (Chi & Lange, 2013).

Here prox_{γΩ_W}(Z) = arg min_U (1/2) ‖Z − U‖_F² + γ Σ_{i<j} w_ij ‖U_{·i} − U_{·j}‖₂.
SLIDES 43–47

The Meta-Algorithm: COBRA (the same algorithm as SLIDE 42, with callouts highlighting each step in turn: convex cluster the rows, update correction, convex cluster the columns, update correction)
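The Dykstra-like loop can be sketched with the two proximal maps passed in as black boxes; the actual row- and column-clustering proxes (solved via fast AMA in the talk) are assumed rather than implemented, and the function name and signature are illustrative.

```python
import numpy as np

def cobra(X, prox_rows, prox_cols, max_iter=100, tol=1e-8):
    """Dykstra-like proximal loop from the COBRA slide.

    prox_rows and prox_cols are black boxes standing in for
    prox_{gamma * Omega_Wtilde} and prox_{gamma * Omega_W}; real
    convex-clustering proxes are assumed, not shown.
    """
    U = X.astype(float).copy()
    P = np.zeros_like(U)
    Q = np.zeros_like(U)
    for _ in range(max_iter):
        Y = prox_rows((U + P).T).T      # Y_m^T = prox(U_m^T + P_m^T): cluster rows
        P = U + P - Y                   # correction for the row step
        U_next = prox_cols(Y + Q)       # cluster the columns
        Q = Y + Q - U_next              # correction for the column step
        if np.linalg.norm(U_next - U) <= tol * (1.0 + np.linalg.norm(U)):
            return U_next
        U = U_next
    return U
```

As a sanity check, passing projections onto the sets {all rows equal} and {all columns equal} (a projection is the prox of an indicator function) reduces the loop to classical Dykstra, whose limit is the grand-mean matrix.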
SLIDE 48

Model Selection

Validation scheme (Wold, 1978): randomly select a hold-out set (∼10% of all entries) Θ ⊂ {1, …, p} × {1, …, n}.

[Figure: X decomposed as X = P_{Θᶜ}(X) + P_Θ(X), with the held-out entries set to zero in the first term]
SLIDE 49

Model Selection

Solve the matrix-completion problem for a sequence of candidate γ_m:

    U*_m = arg min_U  (1/2) ‖P_{Θᶜ}(X) − P_{Θᶜ}(U)‖_F² + γ_m J(U)

Pick the γ_m that minimizes the prediction error on Θ.

[Figure: validation error versus log(γ), showing a clear minimum]
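A minimal sketch of the hold-out step; the ~10% fraction is from the slide, while the function names are illustrative, and the fit U_m for each candidate γ_m is treated as given (the fitting step is not shown).

```python
import numpy as np

rng = np.random.default_rng(0)

def holdout_mask(p, n, frac=0.10):
    """Boolean mask of the hold-out set Theta: ~frac of the p*n entries."""
    return rng.random((p, n)) < frac

def validation_error(X, U, mask):
    """Mean squared prediction error of the fit U on the held-out entries."""
    return float(np.mean((X[mask] - U[mask]) ** 2))
```

Each candidate γ_m would be fit on the retained entries and scored with `validation_error`; the γ_m with the smallest score is selected.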

SLIDE 50

Matrix Completion with COBRA-POD

COBRA with Partially Observed Data:
  • Fill in the missing entries using the last iterate.
  • Use COBRA to get the next iterate.
  • Repeat.

Algorithm 2 COBRA-POD
  1: Initialize U^(0).
  2: repeat
  3:   M ← P_{Θᶜ}(X) + P_Θ(U^(k))
  4:   U^(k+1) ← COBRA(M)
  5: until convergence

  • This is a majorization-minimization (MM) algorithm.
  • A very similar majorization is used in Mazumder et al. (2010).
  • Converges to the solution of the matrix-completion problem.
  • Enables biclustering when data are missing.
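The MM imputation loop can be sketched with the full-data COBRA solve passed in as a black box; the `fit` argument standing in for COBRA is an assumption, and the names are illustrative.

```python
import numpy as np

def impute_iterate(X, mask, fit, max_iter=50, tol=1e-7):
    """MM imputation loop from the COBRA-POD slide.

    mask marks the unobserved entries Theta; fit is a black box standing in
    for a full-data COBRA solve on the completed matrix M.
    """
    U = np.where(mask, X[~mask].mean(), X)    # U^(0): fill Theta with the observed mean
    for _ in range(max_iter):
        M = np.where(mask, U, X)              # M <- P_{Theta^c}(X) + P_Theta(U^(k))
        U_next = fit(M)                       # U^(k+1) <- COBRA(M)
        if np.linalg.norm(U_next - U) <= tol * (1.0 + np.linalg.norm(U)):
            return U_next
        U = U_next
    return U
```

Only the held-out entries change between iterations, which is exactly the majorization step: observed entries are copied from X, missing ones from the previous iterate.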
SLIDE 51

Comparison: Checkerboard + Gaussian

X = “checkerboard” + i.i.d. N(0, σ²) noise.

[Figure: histograms of the Rand index (higher is better) for COBRA vs. sparse biclustering (spBC), under low- and high-variance noise]
SLIDE 52

Comparison: Stability

Lung cancer data: baseline clustering vs. clustering on perturbed data.

[Figure: histograms of the Rand index between baseline and perturbed solutions for COBRA, spBC, spSVD, and ISA, under small, medium, and large perturbations]
SLIDE 53

Summary

Convex relaxation of biclustering:
  • Simple and interpretable solutions, like the clustered dendrogram.
  • Algorithmically well behaved: a unique global minimizer, and stability with respect to initialization, parameters, and data.
  • The meta-algorithm uses a convex clustering primitive.
  • Essentially one tuning parameter controls the number of biclusters, with a data-dependent way of selecting it.

Future work:
  • Inexact solutions for really big problems.
  • Connections to computational harmonic analysis.
SLIDE 54

Thank you!

Code: coming soon.
Contact: Eric Chi, echi@rice.edu
Reference: Eric C. Chi, Genevera I. Allen, and Richard G. Baraniuk, “Convex Biclustering,” arXiv:1408.0856, http://arxiv.org/abs/1408.0856