
SLIDE 1

Jörnsten. Simultaneous Subset Selection via Rate-Distortion Theory
Outline: Analysis of high-dimensional data; Simultaneous Selection via Rate-distortion Theory; Cluster analysis; Significance analysis; Conclusion and Future work

Simultaneous Subset Selection via Rate-Distortion Theory

with application to cluster and significance analysis of gene expression data

Rebecka Jörnsten

Department of Statistics, Rutgers University

rebecka@stat.rutgers.edu, http://www.stat.rutgers.edu/~rebecka

Biostatistics Day, April 25, 2008

SLIDE 2

1. Analysis of high-dimensional data
2. Simultaneous Selection via Rate-distortion Theory
3. Cluster analysis
4. Significance analysis
5. Conclusion and Future work

SLIDE 3

Analysis of high-dimensional data

Clustering

* Popular approach for dimension reduction
* Wide range of applications: engineering, geological data, social networks, high-throughput biology
  * Assign gene function via "guilt by association"
  * Suggestive of biological pathways and networks

Multiple testing

* Massive number of tests performed: how do we control the number (or proportion) of false rejections?
* The problem is encountered in, e.g., clinical trials with multiple end-points, fMRI analysis, proteomics and genomics.
  * Identify a set of genes whose expression levels differ between a set of experimental conditions

SLIDE 4

Thinking about the problems in terms of Model Selection

Clustering

1. How many clusters?
2. Subset model selection: What is the most efficient description of a cluster profile?

Example: We want to be able to state objectively that a cluster corresponds to a particular pattern across experimental conditions (e.g. static).

Multiple testing

1. How many rejections?
2. Subset model selection: For each rejected null-hypothesis, can we identify the alternative?

Example: We want to identify the differentially expressed genes, and the discriminatory experimental conditions.

SLIDE 5

Thinking about the problems in terms of Model Selection

Why are these model selection tasks so important?

1. Reduce the reliance on subjective interpretations of the analysis outcome.
   * "This cluster seems to represent a static expression profile."
   * "Selected genes appear to primarily represent differential expression between only one of the experimental factors."
2. Waste not, want not!
   * Spend the parameter budget where it is needed.
   * If we use efficient representations of simple data structures (e.g. static cluster profiles), we may detect more subtle structures.

SLIDE 6

Challenges

Clustering

Clustering and subset model selection are not separable. The search for the optimal cluster subset models is combinatorial in the number of clusters and experimental conditions.

Significance Analysis

Double multiplicity: multiple genes, and multiple model classes for each gene. Model space is HUGE!
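To get a feel for the size of the search, the short sketch below counts joint subset models under an assumed setup: each cluster has the same number of candidate parameters, and each parameter is either kept or set to zero. The sizes used (9 clusters, 6 parameters) and the function name are illustrative assumptions, not figures from the talk.

```python
def n_subset_models(n_clusters: int, n_params: int) -> int:
    """Number of joint subset models when every cluster independently
    keeps or drops each of its n_params parameters."""
    per_cluster = 2 ** n_params           # on/off choice for every parameter
    return per_cluster ** n_clusters      # clusters choose independently


# Assumed sizes, for illustration only: 9 clusters, 6 parameters per cluster.
print(n_subset_models(9, 6))  # 18014398509481984, i.e. 2**54
```

Even at these modest sizes, an exhaustive search over all joint models is out of reach, which is the point made on the slide.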

SLIDE 7

Proposed strategy

Simultaneous subset selection via rate-distortion theory

* Challenge: clustering and subset model selection are not separable tasks.
* We appeal to results in rate-distortion theory to develop a selection method that is simultaneous across clusters.
* Generalizes to multiple testing.

SLIDE 8

Bit-allocation and Rate-Distortion

We will turn the combinatorial model selection problems into a simultaneous search using results from optimal bit allocation in Rate-Distortion Theory.

[Figure: rate-distortion curve. Axes: rate, distortion. Annotations: selected model, slope constraint.]

Here: What is "Rate"? What is "Distortion"?

SLIDE 9

Bit-allocation

* Consider data blocks (block = single gene, or gene cluster).
* For each data block k, model M results in a distortion $D_k(M)$, with rate $R_k(M)$ (e.g. the number of parameters $p(M)$).
* How do we allocate model complexity to each block fairly?

[Figure: rate-distortion curve. Axes: rate, distortion. Annotations: selected model, slope constraint.]

RD theory: To minimize the overall distortion (e.g. MSE), the optimal allocation is obtained at points of equal slope on the block-wise rate-distortion curves.
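The equal-slope rule can be tried out on a toy problem. The sketch below uses made-up rate-distortion points for three blocks (the `curves` table, the function names and the slope value 1.5 are all illustrative assumptions) and compares the slope-based allocation with an exhaustive search at the same total rate.

```python
from itertools import product

# Hypothetical rate-distortion points for three blocks.
# curves[k][r] = distortion of block k under a model with rate r
# (r = number of parameters). Each curve is decreasing and convex.
curves = [
    [10.0, 4.0, 2.0, 1.5],   # block 0: large early gains
    [3.0, 2.5, 2.2, 2.1],    # block 1: nearly flat, a simple model suffices
    [8.0, 5.0, 3.0, 2.0],    # block 2: steady gains
]


def allocate(curves, slope):
    """For a slope constraint `slope` > 0, give each block the rate that
    minimizes D_k(r) + slope * r (its equal-slope operating point)."""
    return [min(range(len(c)), key=lambda r: c[r] + slope * r) for c in curves]


def brute_force(curves, budget):
    """Exhaustively find the allocation with minimal total distortion
    among all allocations whose total rate equals `budget`."""
    best = None
    for rates in product(*(range(len(c)) for c in curves)):
        if sum(rates) != budget:
            continue
        total = sum(c[r] for c, r in zip(curves, rates))
        if best is None or total < best[0]:
            best = (total, list(rates))
    return best


rates = allocate(curves, slope=1.5)
total = sum(c[r] for c, r in zip(curves, rates))
print(rates, total)                      # [2, 0, 2] 8.0
print(brute_force(curves, sum(rates)))   # (8.0, [2, 0, 2])
```

In this toy case the nearly flat block receives no parameters at all, and the freed budget goes to the two blocks whose distortion drops fastest.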

SLIDE 10

The Bit-allocation/Equal Slope principle

Why does the equal slope constraint give the optimal allocation?

[Figure: rate-distortion curve. Axes: rate, distortion. Annotations: selected model, slope constraint.]

For any other solution, there is at least one block-pair for which the rate-of-change of the distortion differs, and a better solution is obtained by re-allocating model complexity between these blocks.
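The re-allocation argument can be written out under an added idealization: assume each block-wise distortion $D_k$ is a differentiable, convex, decreasing function of a continuous rate $R_k$, and that the total rate is fixed at $R$. The talk works with discrete models, so this is a sketch of the reasoning rather than the exact setting.

```latex
\begin{aligned}
&\min_{R_1,\dots,R_K} \sum_{k=1}^{K} D_k(R_k)
  \quad \text{subject to} \quad \sum_{k=1}^{K} R_k = R, \\
&\mathcal{L} = \sum_{k=1}^{K} D_k(R_k) + \lambda \Big( \sum_{k=1}^{K} R_k - R \Big), \\
&\frac{\partial \mathcal{L}}{\partial R_k} = D_k'(R_k) + \lambda = 0
  \;\Longrightarrow\; D_k'(R_k) = -\lambda \quad \text{for every } k .
\end{aligned}
```

If two blocks $i$ and $j$ had $D_i'(R_i) < D_j'(R_j)$, moving a small amount of rate $\varepsilon$ from block $j$ to block $i$ would change the total distortion by approximately $\varepsilon \, (D_i'(R_i) - D_j'(R_j)) < 0$, so the allocation could not have been optimal.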

SLIDE 11

The Bit-allocation/Equal Slope principle

[Figure: image-coding illustration in three panels: 8bpp original; fixed model at .5bpp; optimal allocation at .5bpp]

SLIDE 12

Motivating example

* Data: mRNA gene expression levels in two divergent neural stem cell lines (one becomes neurons, the other predominantly glia).
* Time course: 0, 1 and 3 days after a growth factor is blocked in the media (initiates/speeds up proliferation).
* Gene cluster expression profiles appear "parallel", "static", "diverging"...

[Figure: cluster profiles 1 to 8, log expression versus time; panels: glia, neuron]

SLIDE 13

Model formulation

* For each gene g we observe a feature vector $x_g$: $x_g \mid g \text{ in cluster } k \sim \mathrm{MVN}(\mu_k, \Sigma_k)$.
* We model each cluster profile as $\mu_k = W\theta_k$, where $W$ is a design matrix that reflects the biological question.
* A sparse representation of $\mu_k$ is obtained if we set some of the parameters $\theta_k$ to 0.
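As an illustration of how a sparse parameter vector yields an easy-to-read profile, the sketch below builds one possible design matrix for two cell lines observed at three time points. The layout of the profile, the meaning of each parameter and the numbers are assumptions made for this example; the talk does not specify its parameterization here.

```python
# Assumed profile layout: [glia d0, glia d1, glia d3,
#                          neuron d0, neuron d1, neuron d3]
# Assumed parameters:     [glia level, glia change d1, glia change d3,
#                          neuron level, neuron change d1, neuron change d3]
W = [
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
]


def matvec(matrix, vector):
    """Plain matrix-vector product, mu = W theta."""
    return [sum(w * t for w, t in zip(row, vector)) for row in matrix]


# Sparse theta: both neuron change parameters are 0,
# so the neuron half of the profile is static.
theta = [0.5, 1.0, 2.0, -1.0, 0.0, 0.0]
mu = matvec(W, theta)
n_active = sum(1 for t in theta if t != 0)

print(mu)        # [0.5, 1.5, 2.5, -1.0, -1.0, -1.0]
print(n_active)  # 4
```

With this choice of $W$, "static in the neuron cell line" corresponds directly to two zero entries in $\theta_k$, which is the kind of objective statement about a cluster that the talk is after.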

SLIDE 14

Model selection

Search strategies?

Classification EM (CEM), a step-wise approach
1. Cluster the data.
2. Perform separate model selection for each cluster.
3. Update the clustering given the parameter constraints.

Combinatorial search
1. Cluster the data.
2. Iterate: consider reducing a cluster-specific model by one parameter. Select the cluster k for which the drop does the least "damage".
3. Stop whenever the BIC increases.

* CEM assumes clustering and cluster model selection are separable.
* The combinatorial search is greedy and computationally intensive.
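The combinatorial (backward) search can be sketched in a heavily simplified form. The code below assumes fixed cluster memberships, one parameter per profile coordinate, and a Gaussian BIC of the form n log(RSS/n) + p log(n); it skips the re-clustering that makes the real procedure expensive. The function name, the BIC form and the summary statistics are illustrative assumptions, not the author's implementation.

```python
import math


def backward_search(sizes, means, rss_full):
    """Greedy backward elimination over (cluster, parameter) pairs.

    Assumes fixed cluster memberships and one parameter per profile
    coordinate, so dropping parameter (k, j) raises the residual sum of
    squares by sizes[k] * means[k][j] ** 2.
    """
    n_obs = sum(sizes) * len(means[0])
    active = {(k, j) for k in range(len(means)) for j in range(len(means[k]))}

    def bic(rss, n_params):
        return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)

    def damage(kj):
        k, j = kj
        return sizes[k] * means[k][j] ** 2

    rss = rss_full
    current = bic(rss, len(active))
    while active:
        drop = min(active, key=damage)        # the drop doing the least "damage"
        new_rss = rss + damage(drop)
        candidate = bic(new_rss, len(active) - 1)
        if candidate > current:               # stop whenever the BIC increases
            break
        active.remove(drop)
        rss, current = new_rss, candidate
    return sorted(active)


# Hypothetical summary statistics: two clusters, three coordinates each.
kept = backward_search(
    sizes=[40, 25],
    means=[[1.2, 0.05, -0.9], [0.02, -0.03, 0.04]],
    rss_full=190.0,
)
print(kept)  # [(0, 0), (0, 2)]
```

In this toy run the second cluster loses all of its parameters and the first keeps two. In the procedure described on the slide, each candidate drop would also require updating the clustering, which is where the computational cost comes from.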

SLIDE 15

Simultaneous selection

Cluster subset selection as a bit-allocation problem:

[Figure: rate-distortion curve. Axes: rate, distortion. Annotations: selected model, slope constraint.]

* Each cluster has its own rate-distortion curve.
* A slope constraint ∆ translates to a set of subset models for the clusters, with a corresponding BIC value.
* Perform a line-search over ∆ to minimize the BIC.
* This solution represents a balanced trade-off between goodness-of-fit and model complexity for all clusters.
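A minimal sketch of the line search over ∆ follows, under strong simplifying assumptions: cluster memberships are fixed, each profile coordinate has its own parameter, and the BIC takes the Gaussian form n log(RSS/n) + p log(n). All names and numbers are hypothetical, and the sketch omits the re-clustering that a full implementation would need.

```python
import math

# Hypothetical summary statistics for three clusters.
# sizes[k] = number of genes in cluster k
# means[k] = fitted profile of cluster k under the full model
sizes = [40, 25, 35]
means = [
    [1.20, 0.05, -0.90],
    [0.02, -0.03, 0.04],
    [0.60, 0.10, 0.00],
]
rss_full = 290.0                  # assumed residual sum of squares, full model
n_obs = sum(sizes) * 3            # total number of observed values


def select(delta):
    """Keep parameter (k, j) only if dropping it would raise the distortion
    by at least `delta`; every cluster shares the same slope constraint."""
    keep, rss = [], rss_full
    for n_k, mu_k in zip(sizes, means):
        row = []
        for m in mu_k:
            cost = n_k * m * m    # RSS increase if this parameter is set to 0
            row.append(cost >= delta)
            if cost < delta:
                rss += cost
        keep.append(row)
    return keep, rss


def bic(delta):
    keep, rss = select(delta)
    n_params = sum(map(sum, keep))
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


# The selected models only change when delta crosses one of the per-parameter
# costs, so those costs (plus infinity, which drops everything) suffice.
costs = {n_k * m * m for n_k, mu_k in zip(sizes, means) for m in mu_k}
candidates = sorted(costs | {math.inf})
best_delta = min(candidates, key=bic)
keep, _ = select(best_delta)
print(keep)  # [[True, False, True], [False, False, False], [True, False, False]]
```

A single scalar ∆ determines the model for every cluster at once, so the search is one-dimensional no matter how many clusters there are.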

SLIDE 16

Subset selection for cluster models

A complete search over all clusters and all models is computationally prohibitive. We compare:
* Backward selection = combinatorial search.
* Simultaneous Rate-Distortion based selection = a line search over slope constraint ∆.
* Selection via Classification EM = separate subset selection for all clusters.

SLIDE 17

Comparison of methods

BIC as a function of the number of clusters, for the 3 search methods.

[Figure: BIC (axis 10500 to 10650) versus number of clusters (7 to 11) for the three search methods; plotting symbols C, R and B]

* RD is competitive with the backward search, and outperforms the CEM approach.
* We gain one cluster by using a sparse (and easy-to-interpret) representation of the cluster profiles.

SLIDE 18

Clustering Results

[Figure: cluster profiles 1 to 9, log expression versus time; panels: glia, neuron]

The winning clustering model: many "static" profiles in the neuron cell line (e.g. clusters 1, 2, 9); cluster 4 shows no cell-line/time interaction.

SLIDE 19

Simulation Results

Simulating from the selected model.

[Figure: cluster selection errors (false positives, false negatives) and computational cost (number of iterations of the clustering procedure required for model selection), for CEM, RD and Back]

SLIDE 20

Subset selection and Multiple testing

* Each gene is its own cluster, and has its own rate-distortion curve.
* For each slope cutoff ∆ we can identify the corresponding gene models.
* We estimate the false discovery rate (for each ∆) using bootstrap.
* Alternative methods: we compare with a stepwise F-test (backward and forward).

SLIDE 21

Selection Results

Gains and losses

* With the RD method we select 909 genes at FDR = 1%, compared with 808 using standard selection.
* Forward schemes are more conservative than backward schemes: RD-forward selects 890 genes.
* Stepwise-F backward does not control the FDR; stepwise-F forward is overly conservative.

Model subset classes

* 57 sign-specific significance classes are selected (out of a possible $3^5 = 243$)...
* ... and only 10 model classes are populated by more than 20 genes.
* These 10 model classes are: main cell line effects, main time effects, and interaction models where the neuron cell line exhibits static expression.
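The count of 243 is consistent with five effect parameters that can each be negative, zero or positive. The sketch below enumerates those sign patterns and maps a fitted parameter vector to its class; reading the classes this way, and the helper `sign_class` with its `tol` threshold, are assumptions made for illustration.

```python
from itertools import product

# Each of 5 effect parameters is classified as negative (-1), zero (0)
# or positive (+1); a model class is one such sign pattern.
N_PARAMS = 5
classes = list(product((-1, 0, 1), repeat=N_PARAMS))
print(len(classes))  # 243


def sign_class(theta, tol=0.0):
    """Map a parameter vector to its sign pattern; `tol` is a hypothetical
    threshold at or below which a parameter counts as zero."""
    return tuple(0 if abs(t) <= tol else (1 if t > 0 else -1) for t in theta)


print(sign_class([0.8, 0.0, -1.3, 0.0, 0.0]))  # (1, 0, -1, 0, 0)
```

Grouping selected genes by this tuple gives the class counts that the slide summarizes (57 populated classes, 10 of them with more than 20 genes).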

SLIDE 22

Selection Results

Top model classes.

[Figure: expression profiles for the top model classes, log expression versus time (1.0 to 3.0); panels: Class 2, Class 10, Class 3, Class 5, Class 11, Class 4, Class 8, Class 6]

SLIDE 23

Simulation Results

[Figure: two panels, FDR (axis 0.00 to 0.20) and Power (axis 0.65 to 1.00)]

* FDR and Power across 50 simulated data sets (order: RD backward/forward, F backward/forward, standard method).
* The RD method controls FDR (at 1, 5, 10%)...
* ... and Power is significantly increased.
* F-backward does not control FDR; F-forward exhibits a significant loss of power.

SLIDE 24

Conclusions

Simultaneous subset selection via rate-distortion theory

* Cluster subset selection provides sparse and easy-to-interpret cluster profiles.
* The Simultaneous Subset Selection method is fast and accurate...
* ... and can be generalized to subset selection in Multiple Testing.
* We increase Power by incorporating subset selection into Multiple Testing, and still control the FDR.

Papers are available at http://www.stat.rutgers.edu/~rebecka

SLIDE 25

Future Work

* Experiments are getting more complex: we need to automate subset selection as much as possible (e.g. consider different parameterizations for different clusters/genes simultaneously).
* Generalizations to other loss functions or coding strategies: model selection and robust estimation.
* Controlling FDR at the parameter level.
* Incorporating prior information into clustering (current work with Sunduz Keles).

SLIDE 26

Acknowledgements

Professor Ron Hart and Loyal Goff, Keck Center for Collaborative Neuroscience, Rutgers University
NSF and EPA

SLIDE 27

Simulation Results

Using an "incorrect" parameterization.

[Figure: two panels, FDR (axis 0.00 to 0.20) and Power (axis 0.5 to 1.0)]

RD controls FDR even for the inefficient parameterizations, and exhibits competitive Power with standard selection. Stepwise F does not control the FDR when an inefficient parameterization is used.
