Seeking Interpretable Models for High Dimensional Data

Bin Yu

Statistics Department and EECS Department, University of California, Berkeley

http://www.stat.berkeley.edu/~binyu


Characteristics of Modern Data Sets

  • Goal: efficient use of data for:

– Prediction – Interpretation (proxy: sparsity)

  • Larger number of variables:

– Number of variables (p) in data sets is large – Sample sizes (n) have not increased at same pace

  • Scientific opportunities:

– New findings in different scientific fields


Today’s Talk

  • Understanding early visual cortex area V1 through fMRI

  • Occam’s Razor
  • Lasso: linear regression and Gaussian graphical models

  • Discovering a compressive property of V1 through shared non-linear sparsity

  • Future work

Understanding visual pathway

Gallant Lab at UCB is a leading vision lab.

Da Vinci (1452-1519): mapping of different visual cortex areas. Polyak (1957): the small left middle grey area is V1.


Understanding visual pathway through fMRI

One goal at Gallant Lab: understand how natural images relate to fMRI signals


Gallant Lab in Nature News

  • Published online 5 March 2008 | Nature | doi:10.1038/news.2008.650

Mind-reading with a brain scan

Brain activity can be decoded using magnetic resonance imaging.

  • Kerri Smith

Scientists have developed a way of 'decoding' someone's brain activity to determine what they are looking at. "The problem is analogous to the classic 'pick a card, any card' magic trick," says Jack Gallant, a neuroscientist at the University of California in Berkeley, who led the study.

Stimuli

  • Natural image stimuli

Stimulus to fMRI response

  • Natural image stimuli drawn randomly from a database of 11,499 images
  • Experiment designed so that responses from different presentations are nearly independent

  • fMRI response is pre-processed and roughly Gaussian

Gabor Wavelet Pyramid


Features


“Neural” (fMRI) encoding for visual cortex V1

Predictor: p = 10,921 features of an image
Response: (preprocessed) fMRI signal at a voxel; n = 1,750 samples
Goal: understanding the human visual system; an interpretable (sparse) model is desired; good prediction is necessary
Minimization of an empirical loss (e.g. L2) leads to

  • ill-posed computational problem, and
  • bad prediction
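Concretely (restating the point above in standard notation), the unpenalized empirical L2 loss for a linear model is

$$\min_{\beta \in \mathbb{R}^{p}} \ \| Y - X\beta \|_2^2 ,$$

and with p = 10,921 > n = 1,750 the design matrix X has rank at most n, so infinitely many coefficient vectors attain zero training error; some form of regularization is needed for a well-posed problem and for good prediction.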

Linear Encoding Model by Gallant Lab

  • Data

  – X: p = 10921 dimensions (features)
  – Y: fMRI signal
  – n = 1750 training samples

  • Separate linear model for each voxel via e-L2boosting (or Lasso); a small sketch follows below

  • Fitted model tested on 120 validation samples

– Performance measured by correlation
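The following is a minimal sketch of componentwise e-L2boosting for a single voxel, with validation correlation as on the slide; the step size, number of steps, and use of NumPy are illustrative assumptions, not the lab's implementation.

```python
import numpy as np

def e_l2boost(X, y, eps=0.01, n_steps=500):
    """Componentwise L2 boosting with a small step eps (epsilon-L2boosting).
    Returns the intercept, feature means/scales, and coefficient vector."""
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-12
    Xs = (X - mu) / sd
    r = y - y.mean()
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):                 # in practice, stop early by CV
        gamma = Xs.T @ r / len(y)            # componentwise least-squares coefficients
        j = int(np.argmax(np.abs(gamma)))    # predictor that best fits the residual
        beta[j] += eps * gamma[j]            # take a small step on that predictor
        r -= eps * gamma[j] * Xs[:, j]       # update the residual
    return y.mean(), mu, sd, beta

def predict(fit, X):
    intercept, mu, sd, beta = fit
    return intercept + ((X - mu) / sd) @ beta

# Per-voxel use, as on the slide: fit on the 1750 training samples, then measure
# correlation (cc) between prediction and response on the 120 validation samples:
#   fit = e_l2boost(X_train, y_train)
#   cc = np.corrcoef(predict(fit, X_val), y_val)[0, 1]
```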


Modeling “history” at Gallant Lab

  • Prediction on validation set is the benchmark
  • Methods tried: neural nets, SVMs, e-L2boosting (Lasso)
  • Among models with similar predictions, simpler (sparser) models by e-L2boosting are preferred for interpretation.

This practice reflects a general trend in statistical machine learning: moving from prediction alone to simpler/sparser models for interpretation, faster computation, or data transmission.

Occam’s Razor

14th-century English logician and Franciscan friar, William of Ockham

Principle of Parsimony:

Entities must not be multiplied beyond necessity.

Wikipedia


Occam’s Razor via Model Selection in Linear Regression

  • Maximum likelihood (ML) is least squares (LS) under the Gaussian assumption
  • There are 2^p submodels
  • ML goes for the largest submodel with all predictors
  • Largest model often gives bad prediction for p large

Model Selection Criteria

Akaike (73, 74) and Mallows' Cp used estimated prediction error to choose a model; Schwarz (1978) proposed BIC. Both are penalized LS criteria. Rissanen's Minimum Description Length (MDL) principle gives rise to many different criteria; the two-part code leads to BIC.
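For reference, the standard penalized least-squares forms these criteria take (Gaussian errors with known noise variance σ², submodel with k predictors) are

$$C_p/\mathrm{AIC}: \ \min_{\beta}\ \|Y - X\beta\|_2^2 + 2\sigma^2 k, \qquad \mathrm{BIC}: \ \min_{\beta}\ \|Y - X\beta\|_2^2 + \sigma^2 k \log n .$$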


Model Selection for image-fMRI problem

For the linear encoding model, the number of submodels is 2^p = 2^10921. Combinatorial search: too expensive and often not necessary.

A recent alternative: continuous embedding into a convex optimization problem through L1 penalized LS (Lasso) -- a third generation computational method in statistics or machine learning.


Lasso: L1-norm as a penalty

  • The L1 penalty on the coefficient vector is $\|\beta\|_1 = \sum_j |\beta_j|$
  • Used initially with the L2 loss:

Signal processing: Basis Pursuit (Chen & Donoho,1994)

Statistics: Non-Negative Garrote (Breiman, 1995)

Statistics: LASSO (Tibshirani, 1996)

  • Properties of Lasso

Sparsity (variable selection) and regularization

Convexity (convex relaxation of L0-penalty)

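Putting the pieces together, the L1-penalized least-squares (Lasso) criterion, with regularization parameter λ ≥ 0, is

$$\hat\beta(\lambda) = \arg\min_{\beta}\ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 .$$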


Lasso: computation and evaluation

The “right” tuning parameter λ is unknown, so a “path” over λ (discretized or continuous) is needed.

  • Initially: a quadratic program (QP) is solved for each λ on a grid.
  • Later: path-following algorithms such as homotopy by Osborne et al (2000) and LARS by Efron et al (2004).
  • Theoretical studies: much recent work on the Lasso in terms of L2 prediction error, L2 error of the parameter, and model selection consistency.
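A minimal sketch of tracing the Lasso path, assuming scikit-learn's LARS implementation (the data here are synthetic placeholders, not the fMRI data):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p, s = 200, 50, 5                         # samples, predictors, true sparsity
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 2.0                               # sparse true coefficient vector
y = X @ beta + 0.5 * rng.standard_normal(n)

# LARS with the "lasso" modification returns the whole piecewise-linear path,
# so no separate QP per grid value of the tuning parameter is needed.
alphas, _, coefs = lars_path(X, y, method="lasso")
print(coefs.shape)                           # (p, number of path breakpoints)
```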


Model Selection Consistency of Lasso

Set-up: linear regression model $Y = X\beta + \varepsilon$ with n observations and p predictors. Assume condition (A) on the design. Knight and Fu (2000) showed L2 estimation consistency under (A).


Model Selection Consistency of Lasso

  • p small, n large (Zhao and Yu, 2006): assume (A). Then, roughly*,

    irrepresentable condition  ⟺  model selection consistency

    where the irrepresentable condition constrains a 1-by-(p − s) vector:
    $\| \mathrm{sign}(\beta_{(1)})^{\top} C_{11}^{-1} C_{12} \|_{\infty} < 1$,
    with $C = X^{\top}X/n$ partitioned into the s relevant (1) and p − s irrelevant (2) predictors (a numerical check is sketched below).

* Some ambiguity when equality holds.

  • Related work: Tropp (06), Meinshausen and Bühlmann (06), Zou (06), Wainwright (06)

[Diagram: population version of the irrepresentable condition and model selection consistency]
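A small numerical check of the strong irrepresentable condition; the helper function and the toy equicorrelated design are illustrative assumptions, chosen to match the s = 2, p = 3 geometry example on the next slide:

```python
import numpy as np

def irrepresentable_value(X, support, signs):
    """Return || sign(beta_1)' C11^{-1} C12 ||_inf for C = X'X/n;
    values below 1 mean the strong irrepresentable condition holds."""
    n, p = X.shape
    C = X.T @ X / n
    S = np.asarray(support)                         # indices of relevant predictors
    Sc = np.setdiff1d(np.arange(p), S)              # indices of irrelevant predictors
    C11 = C[np.ix_(S, S)]
    C12 = C[np.ix_(S, Sc)]
    v = np.asarray(signs, dtype=float) @ np.linalg.solve(C11, C12)
    return np.max(np.abs(v))

# Toy example: p = 3 predictors with pairwise correlation r, first s = 2 relevant.
rng = np.random.default_rng(0)
for r in (0.4, 0.6):
    Sigma = (1 - r) * np.eye(3) + r * np.ones((3, 3))
    X = rng.multivariate_normal(np.zeros(3), Sigma, size=5000)
    print(r, irrepresentable_value(X, support=[0, 1], signs=[1, 1]))
```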


Irrepresentable condition (s = 2, p = 3): geometry

[Figure: geometry of the condition for r = 0.4 and r = 0.6]

Model Selection Consistency of Lasso

  • Consistency also holds when s and p grow with n, assuming:
    – the irrepresentable condition
    – bounds on the max and min eigenvalues of the design matrix
    – the smallest nonzero coefficient bounded away from zero
  • Results for Gaussian noise (Wainwright, 06) and for noise with finite 2k-th moments (Zhao & Yu, 06).


Consistency of Lasso for Model Selection

  • Interpretation of the condition: regress the irrelevant predictors on the relevant predictors. If the L1 norm of the regression coefficients (*) is

    – larger than 1, the Lasso cannot distinguish the irrelevant predictor from the relevant predictors for some parameter values;

    – smaller than 1, the Lasso can distinguish the irrelevant predictor from the relevant predictors.

  • Sufficient conditions (verifiable):
    – constant correlation
    – power-decay correlation
    – bounded correlation*


Sparse cov. estimation via L1 penalty

Banerjee, El Ghaoui, d’Aspremont (08)


L1 penalized log Gaussian Likelihood

Given n i.i.d. observations of X, with sample covariance S, estimate a sparse precision (inverse covariance) matrix by maximizing the L1-penalized Gaussian log-likelihood

$$\hat\Theta = \arg\max_{\Theta \succ 0}\ \log\det\Theta - \mathrm{tr}(S\Theta) - \lambda \|\Theta\|_1 .$$

Banerjee, El Ghaoui, d'Aspremont (08) solve this by a block descent algorithm.
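A minimal sketch of this estimator, assuming scikit-learn's GraphicalLasso solver as a stand-in for the block descent algorithm of Banerjee et al.; the chain-graph precision matrix is a synthetic example:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p = 10
# True sparse precision matrix: a chain graph (nonzeros only on the first off-diagonals).
Theta = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma = np.linalg.inv(Theta)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=500)

# L1-penalized Gaussian maximum likelihood for the precision matrix.
model = GraphicalLasso(alpha=0.1).fit(X)
print(np.sum(np.abs(model.precision_) > 1e-4))   # number of estimated nonzero entries
```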


Model selection consistency

Ravikumar, Wainwright, Raskutti and Yu (08) give sufficient conditions for model selection consistency, stated in terms of the Hessian of the log-likelihood and a "model complexity" parameter K.


Model selection consistency (Ravikumar et al, 08)

Assume the irrepresentable condition holds, and either

  1. X is sub-Gaussian with parameter σ (defined below) and the effective sample size n/log p is sufficiently large, or

  2. X has finite 4m-th moments and the corresponding (smaller, heavier-tail) effective sample size is sufficiently large.

Then with high probability as n tends to infinity, the correct model is chosen.
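For reference, the standard definition behind condition 1: a zero-mean random variable Z is sub-Gaussian with parameter σ if

$$\mathbb{E}\left[e^{tZ}\right] \le e^{\sigma^2 t^2 / 2} \quad \text{for all } t \in \mathbb{R},$$

so its tails decay at least as fast as those of a Gaussian with standard deviation σ.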


Success prob’s dependence on n and p (Gaussian)

  • Edge covariances as

Each point is an average over 100 trials.

  • Curves stack up in second plot, so that (n/log p) controls model selection.



Success prob’s dependence on “model complexity” K and n

  • Curves from left to right have increasing values of K.
  • Models with larger K thus require more samples n for the same probability of success.

Chain graph with p = 120 nodes.


Back to image-fMRI problem: Linear sparse encoding model on complex “cells”

Gallant Lab’s approach:

  • Separate linear model for each voxel
  • Y = Xb + e
  • Model fitting via e-L2boosting and stopping by CV

  – X: p = 10921 dimensions (features or complex “cells”)
  – n = 1750 training samples

  • Fitted model tested on 120 validation samples (not used in fitting)

Performance measured by correlation (cc)


Adding nonlinearity via Sparse Additive Models

  • Additive models (Hastie and Tibshirani, 1990):

    $Y_i = \sum_{j=1}^{p} f_j(X_{ij}) + \varepsilon_i, \qquad i = 1, \dots, n$

  • Sparse: $f_j \equiv 0$ for most $j$
  • High dimensional: p >>> n

SpAM (Sparse Additive Models) by Ravikumar, Lafferty, Liu, Wasserman (2007). Related work: COSSO, Lin and Zhang (2006).


Sparse Additive Models (SpAM)

(Ravikumar, Lafferty, Liu and Wasserman, 07)


Sparse Backfitting
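A minimal sketch of SpAM-style sparse backfitting, assuming a simple Gaussian-kernel smoother; the bandwidth, penalty level, and fixed iteration count are illustrative choices rather than the authors' implementation:

```python
import numpy as np

def smooth(x, r, bandwidth=0.3):
    """Nadaraya-Watson estimate of E[r | x], evaluated at the sample points."""
    d = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * d ** 2)
    return (w @ r) / w.sum(axis=1)

def spam_backfit(X, y, lam=0.1, n_iter=20):
    """Backfitting with componentwise soft-thresholding: components whose
    smoothed partial residual has small norm are set exactly to zero."""
    n, p = X.shape
    f = np.zeros((n, p))                        # fitted component functions f_j(X_ij)
    yc = y - y.mean()
    for _ in range(n_iter):
        for j in range(p):
            r_j = yc - f.sum(axis=1) + f[:, j]  # partial residual for component j
            P_j = smooth(X[:, j], r_j)          # smoothed partial residual
            norm_j = np.sqrt(np.mean(P_j ** 2))
            shrink = max(0.0, 1.0 - lam / norm_j) if norm_j > 0 else 0.0
            f[:, j] = shrink * P_j              # soft-threshold the whole component
            f[:, j] -= f[:, j].mean()           # keep each component centered
    return f

# Fitted values: y.mean() + spam_backfit(X, y).sum(axis=1)
```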


Simple Cell and Complex Cell Models for V1 Neurons


Pooled-Complex “Cell” Model

Allows more flexible fitting. Pooling: aggregated neural responses in fMRI?


SpAM V1 Model (Ravikumar et al, NIPS08)

Connections and components in dashed region are to be estimated, under the assumption that many of them are null


SpAM V1 encoding model (1331 voxels from V1)

For each voxel,

  • Start with 10921+ complex cells (features)
  • Pre-selection of 100 complex cells via correlation
  • Fitting of SpAM to 100 complex cells with AIC stopping
  • Pooling of SpAM-fitted complex cells according to location and frequency to form pooled complex cells

  • Fitting SpAM to the 100 complex cells and the pooled complex cells with AIC stopping (the screening step is sketched below)
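A minimal sketch of the per-voxel screening step above; the correlation-based helper is a direct translation of the bullet, while the subsequent SpAM fit, pooling, and AIC stopping are only indicated in comments (they could reuse, e.g., the sparse-backfitting sketch given earlier):

```python
import numpy as np

def preselect_by_correlation(X, y, k=100):
    """Keep the k features most correlated (in absolute value) with the response."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(corr))[:k]

# For one voxel:
#   keep = preselect_by_correlation(X_train, y_voxel, k=100)
#   1) fit SpAM on X_train[:, keep] with AIC stopping
#   2) pool fitted components by spatial location and frequency into pooled "cells"
#   3) refit SpAM on the original + pooled features, again with AIC stopping
```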


Prediction performance (R2)

Median improvement: 12%.



Spatial RFs and Tuning Curves


Nonlinearities

Compressive effect (finite energy supply) Common nonlinearity for each voxel?


Shared Nonlinearity

  • Shared V-SpAM model (Ravikumar et al, 08):

where is small or sparse.


Nonlinearities vs Shared-Nonlinearity

Compressive effect (finite energy supply) Common nonlinearity for each voxel?


Shared-nonlinearity vs linearity: R^2 prediction

Median improvement 16%


Shared-nonlinearity vs nonlinearity: R^2 prediction

Median improvement 4.9%


Shared-nonlinearity vs. nonlinearity: sparsity

Recall: 100 original features pre-selected. On average:
  • V-SpAM: 70 predictors (original and pooled)
  • Shared V-SpAM: sparser (13 predictors) and better prediction


Ongoing: summarizing the shared nonlinearity (compressive effect)


Ongoing: mapping voxels onto the V1 cortical area


Summary

  • L1-penalized minimization and model selection consistency:
    – the irrepresentable condition comes from the KKT conditions and the L1 penalty
    – effective sample size n/log p? Not always; it depends on the tail of the relevant distribution
    – model complexity parameters matter

  • Understanding fMRI signals in V1

    – discovered a shared nonlinear compressive effect of the fMRI response
    – supporting evidence: improved prediction with sparser models; biologically sensible


Future work

  • V1 voxels:

    – improved decoding? (to decode images seen by the subject using fMRI signals)
    – strength borrowing across voxels for linear and nonlinear models?

  • Higher vision area voxels: benchmark model does not work well.

    – hope: improved encoding models (hence making decoding possible) via nonlinearity/interaction and borrowing strength
    – V4 challenge: how to build feedback into the modeling

  • Better image representation

Acknowledgements

Co-authors:

  • P. Zhao and N. Meinshausen (Lasso)
  • P. Ravikumar, M. Wainwright, G. Raskutti (Graphical model)
  • P. Ravikumar, V. Vu, K. Kay, T. Naselaris, J. Gallant

(V-SpAM and shared V-SpAM)

Funding:
  • NSF-IGMS Grant (06-07 in Gallant Lab)
  • Guggenheim Fellowship (06)
  • NSF DMS Grant
  • ARO Grant