

SLIDE 1

Geometric perspectives for supervised dimension reduction

A Tale of Two Manifolds

S. Mukherjee, K. Mao, F. Liang, Q. Wu, D.-X. Zhou, J. Guinney

Department of Statistical Science, Institute for Genome Sciences & Policy, Department of Computer Science, and Department of Mathematics, Duke University

December 11, 2009


SLIDE 3

Supervised dimension reduction

Information and sufficiency

A fundamental idea in statistical thought is to reduce data to relevant information. This was the paradigm of R. A. Fisher (beloved Bayesian) and goes back at least to Adcock 1878 and Edgeworth 1884. For example, X₁, ..., Xₙ drawn iid from a Gaussian can be reduced to (µ, σ²).


SLIDE 5

Supervised dimension reduction

Regression

Assume the model Y = f(X) + ε, E[ε] = 0, with X ∈ X ⊂ R^p and Y ∈ R.

Data: D = {(x_i, y_i)}_{i=1}^n iid ∼ ρ(X, Y).


SLIDE 7

Supervised dimension reduction

Dimension reduction

If the data live in a p-dimensional space X ∈ R^p, replace X with Θ(X) ∈ R^d, d ≪ p.

My belief: physical, biological and social systems are inherently low dimensional, and the variation of interest in these systems can be captured by a low-dimensional submanifold.


SLIDE 9

Supervised dimension reduction

Supervised dimension reduction (SDR)

Given response variables Y₁, ..., Yₙ ∈ R and explanatory variables or covariates X₁, ..., Xₙ ∈ X ⊂ R^p with

Y_i = f(X_i) + ε_i,  ε_i iid ∼ No(0, σ²).

Is there a submanifold S ≡ S_{Y|X} such that Y ⊥⊥ X | P_S(X)?

SLIDE 10

Supervised dimension reduction

Visualization of SDR

[Figure: (a) the data in (x, y, z); (b) diffusion map; (c) GOP; (d) GDM; each embedding is shown in two dimensions.]


SLIDE 12

Supervised dimension reduction

Linear projections capture nonlinear manifolds

In this talk P_S(X) = BᵀX where B = (b₁, ..., b_d).

Semiparametric model:

Y_i = f(X_i) + ε_i = g(b₁ᵀX_i, ..., b_dᵀX_i) + ε_i,

where span B is the dimension reduction (d.r.) subspace.


SLIDE 14

Learning gradients

SDR model

Semiparametric model:

Y_i = f(X_i) + ε_i = g(b₁ᵀX_i, ..., b_dᵀX_i) + ε_i,

where span B is the dimension reduction (d.r.) subspace. Assume the marginal distribution ρ_X is concentrated on a manifold M ⊂ R^p of dimension d ≪ p.


SLIDE 16

Learning gradients

Gradients and outer products

Given a smooth function f the gradient is

∇f(x) = ( ∂f(x)/∂x₁, ..., ∂f(x)/∂x_p )ᵀ.

Define the gradient outer product (GOP) matrix Γ with entries

Γ_ij = ∫_X (∂f/∂x_i)(x) (∂f/∂x_j)(x) dρ_X(x),

i.e. Γ = E[(∇f) ⊗ (∇f)].



SLIDE 19

Learning gradients

GOP captures the d.r. space

Suppose y = f(X) + ε = g(b₁ᵀX, ..., b_dᵀX) + ε.

Note that for B = (b₁, ..., b_d) the columns are eigenvectors of Γ: λ_i b_i = Γ b_i. For any direction v orthogonal to span B the directional derivative vanishes,

∂f(x)/∂v = vᵀ∇f(x) = 0  ⇒  vᵀΓv = 0,

so if w ⊥ b_i for all i then wᵀΓw = 0.
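As a concrete illustration, the sketch below (assuming analytic gradients of a toy single-index model, not the gradient estimator introduced later in the talk) builds the empirical GOP and recovers the d.r. direction from its top eigenvector.

```python
# A minimal sketch: empirical GOP from known gradients of f(x) = sin(b1^T x),
# for which ∇f(x) = cos(b1^T x) * b1, so Γ has rank one with eigenvector b1.
import numpy as np

rng = np.random.default_rng(0)
p, n = 10, 500
b1 = np.zeros(p); b1[0] = 1.0             # true d.r. direction (assumed)
X = rng.normal(size=(n, p))

grads = np.cos(X @ b1)[:, None] * b1[None, :]   # rows: ∇f(x_i)

Gamma = grads.T @ grads / n               # Γ ≈ (1/n) Σ_i ∇f(x_i) ∇f(x_i)^T
evals, evecs = np.linalg.eigh(Gamma)      # eigenvectors of Γ
b_hat = evecs[:, -1]                      # top eigenvector spans the d.r. space
print(np.abs(b_hat @ b1))                 # ≈ 1: recovers b1 up to sign
```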


SLIDE 21

Learning gradients

Statistical interpretation

Linear case: y = βᵀx + ε, ε iid ∼ No(0, σ²). Let Ω = cov(E[X|Y]), Σ_X = cov(X), σ_Y² = var(Y). Then

Γ = σ_Y² (1 − σ²/σ_Y²)² Σ_X⁻¹ Ω Σ_X⁻¹ ≈ σ_Y² Σ_X⁻¹ Ω Σ_X⁻¹.

SLIDE 22

Learning gradients

Statistical interpretation

For smooth f(x): y = f(x) + ε, ε iid ∼ No(0, σ²). Here the meaning of Ω = cov(E[X|Y]) is not so clear.


SLIDE 28

Learning gradients

Nonlinear case

Partition X into sections and compute local quantities:

X = ⋃_{i=1}^I χ_i,
Ω_i = cov(E[X_{χ_i} | Y_{χ_i}]),
Σ_i = cov(X_{χ_i}),
σ_i² = var(Y_{χ_i}),
m_i = ρ_X(χ_i).

Then

Γ ≈ ∑_{i=1}^I m_i σ_i² Σ_i⁻¹ Ω_i Σ_i⁻¹.
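A hedged sketch of this localization, with the partition chosen by k-means and each Ω_i estimated by slicing Y within the cell; both choices are assumptions for illustration, not the construction from the talk.

```python
# Sketch: local SIR-style estimate of Γ on a partition of the input space.
import numpy as np
from sklearn.cluster import KMeans

def local_gop(X, y, n_parts=5, n_slices=5, reg=1e-3):
    n, p = X.shape
    labels = KMeans(n_clusters=n_parts, n_init=10, random_state=0).fit_predict(X)
    Gamma = np.zeros((p, p))
    for i in range(n_parts):
        Xi, yi = X[labels == i], y[labels == i]
        if len(yi) < 2 * n_slices:
            continue                                  # skip tiny cells
        mi = len(yi) / n                              # m_i: empirical mass of χ_i
        Sigma_inv = np.linalg.inv(np.cov(Xi.T) + reg * np.eye(p))
        # Ω_i = cov(E[X|Y]) estimated by slicing y and averaging X per slice
        order = np.argsort(yi)
        slice_means = [Xi[idx].mean(axis=0)
                       for idx in np.array_split(order, n_slices)]
        Omega = np.cov(np.array(slice_means).T)
        Gamma += mi * yi.var() * Sigma_inv @ Omega @ Sigma_inv
    return Gamma
```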


SLIDE 30

Learning gradients

Estimating the gradient

Taylor expansion: for x_i ≈ x_j,

y_i ≈ f(x_i) ≈ f(x_j) + ⟨∇f(x_j), x_i − x_j⟩ ≈ y_j + ⟨∇f(x_j), x_i − x_j⟩.

Let f⃗ ≈ ∇f; then the following should be small:

∑_{i,j} w_ij ( y_i − y_j − ⟨f⃗(x_j), x_i − x_j⟩ )²,  w_ij = s^{−(p+2)} exp(−‖x_i − x_j‖²/2s²),

where the weights w_ij enforce x_i ≈ x_j.


SLIDE 32

Learning gradients

Estimating the gradient

The gradient estimate:

f⃗_D = argmin_{f⃗ ∈ H^p} [ (1/n²) ∑_{i,j=1}^n w_ij ( y_i − y_j − f⃗(x_j)ᵀ(x_i − x_j) )² + λ ‖f⃗‖²_K ],

where ‖f⃗‖_K is a smoothness penalty, the reproducing kernel Hilbert space norm. Go to board.
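The sketch below is a simplified pointwise variant of this objective: each ∇f(x_j) is fit by weighted ridge regression, swapping the RKHS penalty λ‖f⃗‖²_K for a per-point ridge term. It is not the authors' RKHS estimator, only an illustration of the weighted difference objective.

```python
# Sketch: pointwise weighted-ridge gradient estimates for the objective above.
import numpy as np

def gradient_estimates(X, y, s=0.5, lam=1e-2):
    n, p = X.shape
    grads = np.zeros((n, p))
    for j in range(n):
        D = X - X[j]                                # rows: x_i - x_j
        w = np.exp(-np.sum(D**2, axis=1) / (2 * s**2))
        r = y - y[j]                                # y_i - y_j
        # minimize Σ_i w_ij (r_i - g^T d_i)^2 + λ ||g||^2 for g = ∇f(x_j)
        A = (D * w[:, None]).T @ D + lam * np.eye(p)
        grads[j] = np.linalg.solve(A, (D * w[:, None]).T @ r)
    return grads                                    # row j ≈ ∇f(x_j)
```

Stacking the rows then gives the empirical GOP, Γ̂ ≈ (1/n) ∑_j ∇̂f(x_j) ∇̂f(x_j)ᵀ.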


SLIDE 34

Learning gradients

Computational efficiency

The computation requires fewer than n² parameters and is O(n⁶) time and O(pn) memory:

f⃗_D(x) = ∑_{i=1}^n c_{i,D} K(x_i, x),

with coefficients c_{i,D} ∈ R^p stacked as the columns of c_D = (c_{1,D}, ..., c_{n,D}) ∈ R^{p×n}. Define the Gram matrix K with K_ij = K(x_i, x_j). Then

Γ̂ = c_D K c_Dᵀ.
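A minimal sketch of assembling Γ̂ from the kernel expansion, assuming a Gaussian kernel and coefficients already computed and stored as the columns of a p × n array:

```python
# Sketch: Γ̂ = c_D K c_D^T from expansion coefficients (assumed given).
import numpy as np

def gop_from_coefficients(C, X, s=0.5):
    # C: p × n coefficient array; X: n × p data; Gaussian kernel assumed
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
    K = np.exp(-sq / (2 * s**2))          # Gram matrix K_ij = K(x_i, x_j)
    return C @ K @ C.T                    # p × p GOP estimate
```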

SLIDE 35

Learning gradients

Estimates on manifolds

The marginal distribution ρ_X is concentrated on a compact d-dimensional Riemannian manifold M with isometric embedding ϕ : M → R^p, geodesic metric d_M, and uniform measure dµ on M. Assume a regular distribution:

(i) The density ν(x) = dρ_X(x)/dµ exists and is Hölder continuous: there are c₁ > 0 and 0 < θ ≤ 1 with

|ν(x) − ν(u)| ≤ c₁ d_M(x, u)^θ  for all x, u ∈ M.

(ii) The measure along the boundary is small: there is c₂ > 0 with

ρ_M( { x ∈ M : d_M(x, ∂M) ≤ t } ) ≤ c₂ t  for all t > 0.

SLIDE 36

Learning gradients

Convergence to gradient on manifold

Theorem. Under the above regularity conditions on ρ_X and for f ∈ C²(M), with probability 1 − δ,

‖(dϕ)* f⃗_D − ∇_M f‖²_{L²(ρ_M)} ≤ C log(1/δ) n^{−1/d},

where (dϕ)* (projection onto the tangent space) is the dual of the map dϕ.


SLIDE 38

Learning gradients

Multi-task learning

Definition (single-task notation). n_t samples (x_i, y_i) with x_i ∈ R^d and y_i ∈ {−1, 1} for classification. Assume we are working in the d ≫ n_t paradigm.

Definition (multi-task learning (MTL) formulation). Given T tasks with t ∈ {1, ..., T},

F_t(x) = f₀(x) + f_t(x) + ε,  ε iid ∼ No(0, σ²).


SLIDE 42

Learning gradients

Multi-task gradient learning

Estimate not just the functions {f₀, f₁, ..., f_T} but the gradients as well: {(f₀, ∇f₀), (f_t, ∇f_t)_{t=1}^T}.

This provides us with T + 1 matrices:

1. Γ̂₀, the GOP estimate across all the tasks;
2. Γ̂₁, ..., Γ̂_T, the task-specific GOP estimates.


SLIDE 44

Bayesian Mixture of Inverses

Principal components analysis (PCA)

Algorithmic view of PCA:

1. Given X = (X₁, ..., Xₙ), a p × n matrix, construct Σ̂ = (X − X̄)(X − X̄)ᵀ.
2. Eigen-decompose Σ̂: λ_i v_i = Σ̂ v_i.
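A minimal numpy transcription of this algorithmic view (columns of X are samples):

```python
# Sketch of the two-step algorithmic view of PCA above.
import numpy as np

def pca(X, d):
    Xc = X - X.mean(axis=1, keepdims=True)   # center: X - X̄
    Sigma = Xc @ Xc.T                        # Σ̂ = (X - X̄)(X - X̄)^T, p × p
    evals, evecs = np.linalg.eigh(Sigma)     # λ_i v_i = Σ̂ v_i (ascending order)
    return evecs[:, ::-1][:, :d]             # top-d principal directions
```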


SLIDE 46

Bayesian Mixture of Inverses

Probabilistic PCA

X ∈ R^p is characterized by a multivariate normal

X ∼ No(µ + Aν, ∆),  ν ∼ No(0, I_d),

with µ ∈ R^p, A ∈ R^{p×d}, ∆ ∈ R^{p×p}, ν ∈ R^d. Here ν is a latent variable.

SLIDE 47

Bayesian Mixture of Inverses

SDR model

Semiparametric model:

Y_i = f(X_i) + ε_i = g(b₁ᵀX_i, ..., b_dᵀX_i) + ε_i,

where span B is the dimension reduction (d.r.) subspace.


SLIDE 50

Bayesian Mixture of Inverses

Principal fitted components (PFC)

Define X_y ≡ (X | Y = y) and specify the multivariate normal distribution

X_y ∼ No(µ_y, ∆),  µ_y = µ + Aν_y,

with µ ∈ R^p, A ∈ R^{p×d}, ν_y ∈ R^d. The d.r. directions are B = ∆⁻¹A.

PFC captures global linear predictive structure. It does not generalize to manifolds.


SLIDE 53

Bayesian Mixture of Inverses

Mixture models and localization

A driving idea in manifold learning is that manifolds are locally Euclidean. A driving idea in probabilistic modeling is that mixture models are flexible and can capture "nonparametric" distributions. Mixture models can therefore capture local nonlinear predictive manifold structure.

SLIDE 54

Bayesian Mixture of Inverses

Model specification

X_y ∼ No(µ_yx, ∆),  µ_yx = µ + Aν_yx,  ν_yx ∼ G_y,

where G_y is a density indexed by y having multiple clusters, µ ∈ R^p, ε ∼ No(0, ∆) with ∆ ∈ R^{p×p}, A ∈ R^{p×d}, and ν_yx ∈ R^d.

SLIDE 55

Bayesian Mixture of Inverses

Dimension reduction space

Proposition. For this model the d.r. space is the span of B = ∆⁻¹A:

Y | X  =_d  Y | (∆⁻¹A)ᵀX.

SLIDE 56

Bayesian Mixture of Inverses

Sampling distribution

Define ν_i ≡ ν_{y_i x_i}. The sampling distribution for the data is

x_i | (y_i, µ, ν_i, A, ∆) ∼ No(µ + Aν_i, ∆),  ν_i ∼ G_{y_i}.


SLIDE 59

Bayesian Mixture of Inverses

Categorical response: modeling G_y

Y = {1, ..., C}, so each category has a distribution:

ν_i | (y_i = c) ∼ G_c,  c = 1, ..., C.

Each ν_i is modeled as a mixture of the C distributions G₁, ..., G_C, with a Dirichlet process model for each distribution: G_c ∼ DP(α₀, G₀). Go to board.
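For illustration, a truncated stick-breaking draw from G_c ∼ DP(α₀, G₀), a standard construction assuming G₀ = No(0, I_d); this is not the sampler used in the talk.

```python
# Sketch: truncated stick-breaking approximation to a draw G_c ~ DP(α0, G0).
import numpy as np

def draw_dp(alpha0=1.0, d=2, trunc=50, rng=np.random.default_rng(0)):
    betas = rng.beta(1.0, alpha0, size=trunc)              # stick proportions
    w = betas * np.concatenate(([1.0], np.cumprod(1 - betas)[:-1]))
    atoms = rng.normal(size=(trunc, d))                    # atoms drawn from G0
    return w / w.sum(), atoms        # discrete (weights, atoms) approx. to G_c
```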

SLIDE 60

Bayesian Mixture of Inverses

Likelihood

Lik(data | θ) ≡ Lik(data | A, ∆, ν₁, ..., νₙ, µ):

Lik(data | θ) ∝ det(∆⁻¹)^{n/2} exp( −(1/2) ∑_{i=1}^n (x_i − µ − Aν_i)ᵀ ∆⁻¹ (x_i − µ − Aν_i) ).
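A direct sketch evaluating this log-likelihood up to an additive constant (dense numpy, for illustration):

```python
# Sketch: log-likelihood of the model above, up to an additive constant.
import numpy as np

def log_lik(X, mu, A, nu, Delta_inv):
    # X: n × p data; nu: n × d latent coordinates; Delta_inv: p × p precision
    R = X - mu - nu @ A.T                      # residuals x_i - µ - Aν_i
    n = X.shape[0]
    sign, logdet = np.linalg.slogdet(Delta_inv)
    quad = np.einsum('ij,jk,ik->', R, Delta_inv, R)   # Σ_i r_i^T ∆^{-1} r_i
    return 0.5 * n * logdet - 0.5 * quad
```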
SLIDE 64

Bayesian Mixture of Inverses

Posterior inference

Given data,

P_θ ≡ Post(θ | data) ∝ Lik(data | θ) × π(θ).

1. P_θ provides an estimate of (un)certainty on θ.
2. Requires a prior on θ.
3. How do we sample from P_θ?

SLIDE 67

Bayesian Mixture of Inverses

Markov chain Monte Carlo

There is no closed form for P_θ.

1. Specify a Markov transition kernel K(θ_t, θ_{t+1}) with stationary distribution P_θ.
2. Run the Markov chain to obtain θ₁, ..., θ_T.

SLIDE 69

Bayesian Mixture of Inverses

Sampling from the posterior

Inference consists of drawing samples θ_(t) = (µ_(t), A_(t), ∆⁻¹_(t), ν_(t)) from the posterior. Define

θ/µ_(t) ≡ (A_(t), ∆⁻¹_(t), ν_(t)),
θ/A_(t) ≡ (µ_(t), ∆⁻¹_(t), ν_(t)),
θ/∆⁻¹_(t) ≡ (µ_(t), A_(t), ν_(t)),
θ/ν_(t) ≡ (µ_(t), A_(t), ∆⁻¹_(t)).

SLIDE 73

Bayesian Mixture of Inverses

Gibbs sampling

Conditional probabilities can be used to sample µ, ∆⁻¹ and A:

µ_(t+1) | (data, θ/µ_(t)) ∼ No(data, θ/µ_(t)),
∆⁻¹_(t+1) | (data, θ/∆⁻¹_(t)) ∼ InvWishart(data, θ/∆⁻¹_(t)),
A_(t+1) | (data, θ/A_(t)) ∼ No(data, θ/A_(t)).

Sampling ν_(t) is more involved.
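The overall sweep can be sketched as below; the draw_* update functions are hypothetical placeholders standing in for the full conditionals named above, not the authors' code.

```python
# Skeleton of one Gibbs sweep per iteration over the blocks (µ, ∆^{-1}, A, ν).
def gibbs(data, theta0, n_iter, draw_mu, draw_Delta_inv, draw_A, draw_nu):
    theta = dict(theta0)                    # keys: 'mu', 'A', 'Delta_inv', 'nu'
    draws = []
    for _ in range(n_iter):
        theta['mu'] = draw_mu(data, theta)                 # normal conditional
        theta['Delta_inv'] = draw_Delta_inv(data, theta)   # (inverse-)Wishart
        theta['A'] = draw_A(data, theta)                   # normal conditional
        theta['nu'] = draw_nu(data, theta)                 # DP step, involved
        draws.append(dict(theta))
    return draws
```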


SLIDE 75

Bayesian Mixture of Inverses

Posterior draws from the Grassmann manifold

Given samples (∆⁻¹_(t), A_(t))_{t=1}^m, compute B_(t) = ∆⁻¹_(t) A_(t).

Each B_(t) spans a subspace, which is a point in the Grassmann manifold G(d, p). There is a Riemannian metric on this manifold. This has two implications.


SLIDE 77

Bayesian Mixture of Inverses

Posterior mean and variance

Given draws (B_(t))_{t=1}^m, the posterior mean and variance should be computed with respect to the Riemannian metric. Given two subspaces W and U spanned by orthonormal bases X and Y, the distance entering the Karcher mean is computed from the principal angles:

(I − X(XᵀX)⁻¹Xᵀ) Y (XᵀY)⁻¹ = UΣVᵀ,  Θ = atan(Σ),  dist(W, U) = √(Tr Θ²).
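A sketch of this subspace distance, using the equivalent arccos form of the principal angles (from the singular values of XᵀY) rather than the tangent-space formula above:

```python
# Sketch: geodesic (principal-angle) distance between two subspaces of R^p.
import numpy as np

def grassmann_dist(X, Y):
    # X, Y: p × d matrices with orthonormal columns spanning the subspaces
    sigma = np.linalg.svd(X.T @ Y, compute_uv=False)
    theta = np.arccos(np.clip(sigma, -1.0, 1.0))   # principal angles Θ
    return np.sqrt(np.sum(theta**2))               # dist = sqrt(Tr Θ^2)
```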

SLIDE 79

Bayesian Mixture of Inverses

Posterior mean and variance

The posterior mean subspace:

B_Bayes = argmin_{B ∈ G(d,p)} ∑_{i=1}^m dist(B_i, B).

Uncertainty:

var({B₁, ..., B_m}) = (1/m) ∑_{i=1}^m dist(B_i, B_Bayes).
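Minimizing over G(d, p) requires iterative optimization. A common extrinsic shortcut, sketched below under the assumption that it only approximates the Karcher mean, averages the projection matrices of the draws and re-extracts a rank-d basis:

```python
# Sketch: extrinsic approximation to the mean subspace of posterior draws.
import numpy as np

def extrinsic_mean_subspace(Bs, d):
    # Bs: list of p × d draws B_(t); orthonormalized before averaging
    p = Bs[0].shape[0]
    P_bar = np.zeros((p, p))
    for B in Bs:
        Q, _ = np.linalg.qr(B)                 # orthonormal basis of span(B)
        P_bar += Q @ Q.T / len(Bs)             # average projection matrices
    evals, evecs = np.linalg.eigh(P_bar)
    return evecs[:, ::-1][:, :d]               # basis for the mean subspace
```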

SLIDE 80

Bayesian Mixture of Inverses

Distribution theory on Grassmann manifolds

If B is the linear span of d central normal vectors in R^p with covariance matrix Σ, the density of the Grassmannian distribution G_Σ with respect to the reference measure G_I is

(dG_Σ/dG_I)(X) = ( det(XᵀX) / det(XᵀΣ⁻¹X) )^{d/2},

where X ≡ span(X) with X = (x₁, ..., x_d).

SLIDE 81

Results on data: Swiss roll

Swiss roll

X₁ = t cos(t),  X₂ = h,  X₃ = t sin(t),  X₄, ..., X₁₀ iid ∼ No(0, 1),

where t = (3π/2)(1 + 2θ), θ ∼ Unif(0, 1), h ∼ Unif(0, 1), and

Y = sin(5πθ) + h² + ε,  ε ∼ No(0, 0.01).
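A direct numpy transcription of this simulation (interpreting No(0, 0.01) as variance 0.01, i.e. standard deviation 0.1):

```python
# Sketch: generate the 10-dimensional Swiss roll regression data above.
import numpy as np

rng = np.random.default_rng(0)
n = 600
theta = rng.uniform(0, 1, n)
h = rng.uniform(0, 1, n)
t = 1.5 * np.pi * (1 + 2 * theta)              # t = (3π/2)(1 + 2θ)

X = np.column_stack([
    t * np.cos(t), h, t * np.sin(t),           # X1, X2, X3: the manifold
    rng.normal(size=(n, 7)),                   # X4..X10: noise dimensions
])
Y = np.sin(5 * np.pi * theta) + h**2 + rng.normal(0, 0.1, n)
```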

SLIDE 82

Results on data: Swiss roll

Pictures

SLIDE 83

Results on data: Swiss roll

Metric

Projection of the estimated d.r. space B̂ = (b̂₁, ..., b̂_d) onto the true B:

(1/d) ∑_{i=1}^d ‖P_B b̂_i‖² = (1/d) ∑_{i=1}^d ‖(BBᵀ) b̂_i‖².
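A sketch of this metric, assuming the columns of B are orthonormal so that P_B = BBᵀ:

```python
# Sketch: accuracy of an estimated d.r. space (1.0 = perfect recovery).
import numpy as np

def dr_accuracy(B_hat, B):
    # B, B_hat: p × d matrices; columns of B assumed orthonormal
    P = B @ B.T                                # projection onto span(B)
    return np.mean(np.sum((P @ B_hat)**2, axis=0))   # (1/d) Σ ||P_B b̂_i||²
```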

SLIDE 84

Results on data: Swiss roll

Comparison of algorithms

[Figure: accuracy versus sample size (200 to 600) for BMI, BAGL, SIR, LSIR, PHD, and SAVE.]

SLIDE 85

Results on data: Swiss roll

Posterior variance

[Figure: boxplot of the distances.]

SLIDE 86

Results on data: Swiss roll

Error as a function of d

[Figure: error versus number of e.d.r. directions (1 to 10), two panels.]

SLIDE 87

Results on data: Digits

Digits

[Figure: sample images of handwritten digits.]


SLIDE 89

Results on data: Digits

Two classification problems

3 vs. 8 and 5 vs. 8, with 100 training samples from each class.

SLIDE 90

Results on data: Digits

BMI

[Figure: two-dimensional BMI projection of the digits; both axes on a 10⁻⁴ scale.]

SLIDE 91

Results on data: Digits

3, 5, 8 classification problem

Goal: learn features for a predictive model on three problems: 3 vs 8, 5 vs 8, and 3 and 5 vs 8.

SLIDE 92

Results on data: Digits

3, 5, 8 classification problem

[Figure: sample images of the digits 3, 5 and 8.]

SLIDE 93

Results on data: Digits

Top features: 3 and 5 vs 8

SLIDE 94

Results on data: Digits

Top features: 3 vs 8

SLIDE 95

Results on data: Digits

Top features: 5 vs 8

SLIDE 96

Results on data: Digits

All ten digits

digit     Nonlinear         Linear
0         0.04 (± 0.01)     0.05 (± 0.01)
1         0.01 (± 0.003)    0.03 (± 0.01)
2         0.14 (± 0.02)     0.19 (± 0.02)
3         0.11 (± 0.01)     0.17 (± 0.03)
4         0.13 (± 0.02)     0.13 (± 0.03)
5         0.12 (± 0.02)     0.21 (± 0.03)
6         0.04 (± 0.01)     0.0816 (± 0.02)
7         0.11 (± 0.01)     0.14 (± 0.02)
8         0.14 (± 0.02)     0.20 (± 0.03)
9         0.11 (± 0.02)     0.15 (± 0.02)
average   0.09              0.14

Table: Average classification error rate and standard deviation on the digits data.

SLIDE 97

Results on data: Cancer

Cancer classification

n = 38 samples with expression levels for p = 7129 genes or ESTs. 19 samples are Acute Myeloid Leukemia (AML) and 19 are Acute Lymphoblastic Leukemia (ALL); the ALL samples fall into two subclusters, B-cell and T-cell.

SLIDE 98

Results on data: Cancer

Substructure captured

[Figure: two-dimensional projection of the leukemia samples; both axes on a 10⁴ scale.]

SLIDE 99

The end

Funding

IGSP Center for Systems Biology at Duke; NSF DMS-0732260.