

SLIDE 1

Scalable Posterior Approximation

Galen Reeves

Departments of ECE and Statistical Science, Duke University

August 2015

SLIDE 2

Collaborators at Duke

David B. Dunson
Willem van den Boom

SLIDE 3

variable selection / support recovery

  • identify the locations / identities of agents which have significant effects on observed behaviors
  • e.g., gene expression, face recognition, etc.
  • find relevant features for building a model
  • machine learning
  • recover a sparse signal from noisy linear measurements
  • compressed sensing
  • determine which entries of an unknown parameter vector are significant
  • statistics

SLIDE 4

high-dimensional inference

[Figure: bipartite graph connecting parameters β1, β2, …, βp to observations y1, y2, …, yn]

p unknown parameters, n observations (the data)

SLIDE 5

high-dimensional inference

[Figure: bipartite graph connecting parameters β1, β2, …, βp to observations y1, y2, …, yn]

p unknown parameters, n observations (the data)

Types of questions:

  • posterior distribution p(β|y): a high-dimensional distribution
  • posterior mean E[β|y] and covariance Cov[β|y]: a p x 1 vector and a p x p matrix
  • posterior marginal distribution p(β1|y): a one-dimensional distribution

SLIDE 6

edges mean dependencies

[Figure: bipartite graph of β1, …, βp and y1, …, yn with edges marking dependencies]

p unknown parameters, n observations (the data)

SLIDE 7

inference is easy if graph is sparse…

[Figure: sparsely connected bipartite graph of β1, …, βp and y1, …, yn]

p unknown parameters, n observations (the data)

SLIDE 8

… but dense graphs are challenging

[Figure: densely connected bipartite graph of β1, …, βp and y1, …, yn]

p unknown parameters, n observations (the data)

SLIDE 9

statistical model for parameters

p(β|θ) = ∏_{j=1}^p p(βj|θ)

each marginal prior p(βj|θ) places a probability mass at zero and a distribution on the nonzero values

entries of β are conditionally independent given the hyperparameters θ; the marginal prior is a mixed discrete-continuous distribution
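For concreteness, here is a minimal sketch of drawing β from the conditionally independent spike-and-slab prior; the function name and the hyperparameter values pi0 and slab_sd are illustrative assumptions rather than values from the talk.

```python
import numpy as np

def sample_spike_slab(p, pi0=0.9, slab_sd=1.0, seed=None):
    """Draw beta with independent spike-and-slab (Bernoulli-Gaussian) entries.

    Each beta_j equals 0 with probability pi0 (the spike) and is drawn
    from N(0, slab_sd**2) otherwise (the slab)."""
    rng = np.random.default_rng(seed)
    nonzero = rng.random(p) >= pi0                     # Bernoulli indicators
    return np.where(nonzero, rng.normal(0.0, slab_sd, size=p), 0.0)

beta = sample_spike_slab(p=1000, pi0=0.9, seed=0)
print(float(np.mean(beta == 0)))  # close to pi0 = 0.9
```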

SLIDE 10

standard linear model

y = Xβ + ε

p unknown parameters, n observations, n x p matrix X, Gaussian errors ε ∼ N(0, σ²I)
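Simulating the standard linear model takes only a few lines; the dimensions n, p, the sparsity level, and the noise scale sigma below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 200, 0.5                 # illustrative dimensions

X = rng.normal(size=(n, p))                # n x p design matrix
beta = np.where(rng.random(p) < 0.05,      # sparse spike-and-slab draw
                rng.normal(size=p), 0.0)
eps = rng.normal(0.0, sigma, size=n)       # Gaussian errors N(0, sigma^2 I)
y = X @ beta + eps                         # the standard linear model
print(y.shape)
```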

SLIDE 11

why challenging?

  • number of feature subsets grows exponentially with p
  • curse of dimensionality
  • exact inference requires computing high-dimensional integrals
  • brute-force integration is computationally infeasible
  • extensive research focuses on methods for approximate inference

SLIDE 12

tradeoffs for high-dimensional inference

[Figure: methods placed on an accuracy vs. scalability plane: linear methods (least-squares), LASSO, BCR, AMP, and YFA on the scalable side; MCMC (unbounded time) and brute-force numerical integration on the accurate side; the accurate-and-scalable corner is the focus of recent research]

SLIDE 13

problems with existing methods

SLIDE 14

problems with existing methods

  • regularized least-squares (e.g. LASSO)
  • lack measures of statistical significance

SLIDE 15

problems with existing methods

  • regularized least-squares (e.g. LASSO)
  • lack measures of statistical significance
  • sampling methods like MCMC
  • not clear when sufficiently converged / sampled

SLIDE 16

problems with existing methods

  • regularized least-squares (e.g. LASSO)
  • lack measures of statistical significance
  • sampling methods like MCMC
  • not clear when sufficiently converged / sampled
  • variational approximations
  • difficulty with multimodal posteriors, hard to interpret

SLIDE 17

Example: one-dimensional problem with spike & slab (Bernoulli-Gaussian) prior

y = β + ε

[Figure: prior distribution with a probability mass at zero]

SLIDE 18

Example: one-dimensional problem with spike & slab (Bernoulli-Gaussian) prior

y = β + ε

[Figure: prior distribution and the posterior distribution for a large observation; each has a probability mass at zero]

SLIDE 19

Example: one-dimensional problem with spike & slab (Bernoulli-Gaussian) prior

y = β + ε

[Figure: prior distribution and the posterior distributions for a small and a large observation; each has a probability mass at zero]
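The small-versus-large-observation behavior in these figures can be computed exactly in the scalar case, since the posterior inclusion probability has a closed form; the values of pi0, tau, and sigma below are illustrative.

```python
import math

def posterior_nonzero_prob(y, pi0=0.5, tau=1.0, sigma=1.0):
    """Exact P(beta != 0 | y) for the scalar model y = beta + eps with
    eps ~ N(0, sigma^2) and a Bernoulli-Gaussian prior on beta."""
    def normpdf(x, var):
        return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    spike = pi0 * normpdf(y, sigma**2)                  # beta = 0 branch
    slab = (1.0 - pi0) * normpdf(y, tau**2 + sigma**2)  # beta ~ N(0, tau^2)
    return slab / (spike + slab)

print(posterior_nonzero_prob(0.1))  # small observation: below 1/2
print(posterior_nonzero_prob(4.0))  # large observation: close to 1
```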

SLIDE 20

Example: one-dimensional problem with spike & slab (Bernoulli-Gaussian) prior

y = β + ε

[Figure: posterior distributions for a small and a large observation, each overlaid with its Gaussian approximation]

SLIDE 21

problems with existing methods

  • regularized least-squares (e.g. LASSO)
  • lack measures of statistical significance
  • sampling methods like MCMC
  • not clear when sufficiently converged / sampled
  • variational approximations
  • difficulty with multimodal posteriors, hard to interpret

SLIDE 22

problems with existing methods

  • regularized least-squares (e.g. LASSO)
  • lack measures of statistical significance
  • sampling methods like MCMC
  • not clear when sufficiently converged / sampled
  • variational approximations
  • difficulty with multimodal posteriors, hard to interpret
  • loopy belief propagation, approximate message passing (AMP)
  • lack theoretical guarantees for general matrices

SLIDE 23

high-dimensional variable selection

y = Xβ + ε

p unknown parameters drawn independently with known distribution (e.g. spike & slab), n observations, n x p matrix X, Gaussian errors ε ∼ N(0, σ²I)

SLIDE 24

high-dimensional variable selection

y = Xβ + ε

p unknown parameters drawn independently with known distribution (e.g. spike & slab), n observations, n x p matrix X, Gaussian errors ε ∼ N(0, σ²I)

Goal: compute posterior marginal distribution of first entry

p(β1|y) = ∫ p(β|y) dβ2 ⋯ dβp
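For very small p this marginal can be computed exactly by enumerating all 2^p supports (plausibly how a small-problem ground truth is obtained, though the slides do not say); this sketch is a reconstruction assuming a Bernoulli-Gaussian prior with illustrative hyperparameters sigma, tau, and pi0.

```python
import itertools
import numpy as np

def exact_inclusion_prob(X, y, sigma=1.0, tau=1.0, pi0=0.5):
    """Exact P(beta_1 != 0 | y) by enumerating all 2^p supports.

    Each support S has prior pi0^(p-|S|) (1-pi0)^|S| and Gaussian marginal
    likelihood N(y; 0, sigma^2 I + tau^2 X_S X_S^T); the 2*pi constants
    cancel in the ratio. Cost is exponential in p, hence tiny p only."""
    n, p = X.shape
    log_terms, has_first = [], []
    for size in range(p + 1):
        for S in itertools.combinations(range(p), size):
            cov = sigma**2 * np.eye(n)
            if S:
                Xs = X[:, list(S)]
                cov += tau**2 * Xs @ Xs.T
            _, logdet = np.linalg.slogdet(cov)
            quad = y @ np.linalg.solve(cov, y)
            log_prior = (len(S) * np.log(1 - pi0)
                         + (p - len(S)) * np.log(pi0))
            log_terms.append(log_prior - 0.5 * (logdet + quad))
            has_first.append(0 in S)
    log_terms = np.array(log_terms)
    has_first = np.array(has_first)
    w = np.exp(log_terms - log_terms.max())     # stabilized weights
    return float(w[has_first].sum() / w.sum())

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 4))
y = X @ np.array([2.0, 0.0, 0.0, 0.0]) + 0.5 * rng.normal(size=8)
print(exact_inclusion_prob(X, y, sigma=0.5))  # close to 1: beta_1 is clearly active
```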

SLIDE 25

overview of our approach

  • rotate the data to isolate the parameter of interest
  • introduce an auxiliary variable which summarizes the influence of the other parameters
  • use any means possible to compute / estimate the posterior mean and posterior variance of the auxiliary variable
  • apply Gaussian approx. to auxiliary variable and solve one-dimensional integration problem to obtain posterior approximation

SLIDE 26

step 1: reparameterize

  • Apply a rotation matrix to the data which zeros out all but one entry in the first column of the design matrix
  • Only the first rotated observation depends on the first entry:

ỹ1 = x̃1,1 β1 + ∑_{j=2}^p x̃1,j βj + ε̃1

where the auxiliary variable φ(β2:p) = ∑_{j=2}^p x̃1,j βj captures the influence of the other parameters. In matrix form:

ỹ = [ x̃1,1  x̃1,2  ⋯  x̃1,p ]
    [  0    x̃2,2  ⋯  x̃2,p ] β + ε̃
    [  ⋮     ⋮    ⋱   ⋮   ]
    [  0    x̃n,2  ⋯  x̃n,p ]
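The slides do not specify how the rotation is built; one concrete choice (an assumption here) is a Householder reflection that maps the first column of the design matrix onto the first coordinate axis. Because the reflection is orthogonal, the Gram matrix and the isotropic Gaussian noise distribution are preserved.

```python
import numpy as np

def rotate_out_first_column(X, y):
    """Householder reflection Q with Q @ X[:, 0] = ||X[:, 0]|| * e1.

    After applying Q to (X, y), only the first rotated observation
    depends on beta_1; since Q is orthogonal, Q @ eps is still
    N(0, sigma^2 I)."""
    x1 = X[:, 0]
    v = x1.copy()
    v[0] -= np.linalg.norm(x1)                  # Householder vector
    v /= np.linalg.norm(v)
    Q = np.eye(len(x1)) - 2.0 * np.outer(v, v)  # symmetric orthogonal reflector
    return Q @ X, Q @ y

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 4))
y = X @ rng.normal(size=4) + 0.1 * rng.normal(size=6)
Xt, yt = rotate_out_first_column(X, y)
print(np.round(Xt[:, 0], 8))  # only the first entry is nonzero
```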

SLIDE 27

step 1: reparameterize

[Figure: graph of parameters β1, …, βp and observations y1, …, yn]

SLIDE 28

step 1: reparameterize

[Figure: graph of parameters β1, …, βp and observations y1, …, yn, build continued]

SLIDE 29

step 1: reparameterize

[Figure: graph of parameters β1, …, βp and observations y1, …, yn, build continued]

SLIDE 30

step 1: reparameterize

[Figure: graph of parameters β1, …, βp and observations y1, …, yn, build continued]

SLIDE 31

step 1: reparameterize

[Figure: graph of parameters β1, …, βp and rotated observations ỹ1, …, ỹn]

SLIDE 32

step 1: reparameterize

φ(β2:p) = ∑_{j=2}^p x̃1,j βj

the auxiliary variable encapsulates the influence of the other parameters

[Figure: graph of parameters and rotated observations ỹ1, …, ỹn, with φ(β2:p) mediating the influence of β2, …, βp on ỹ1]

SLIDE 33

step 1: reparameterize

φ(β2:p) = ∑_{j=2}^p x̃1,j βj

the auxiliary variable encapsulates the influence of the other parameters

p(β1, ỹ1 | ỹ2:n) = ∫ p(β1, ỹ1 | φ(β2:p)) p(φ(β2:p) | ỹ2:n) dφ(β2:p)

SLIDE 34

step 2: estimate / compute

  • compute the posterior mean and variance of the auxiliary variable: E[φ(β2:p) | ỹ2:n] and Var[φ(β2:p) | ỹ2:n]
  • can use a variety of methods
  • AMP (if iterations converge)
  • LASSO
  • Bayesian Compressed Regression (BCR)
  • [your favorite method]
  • these quantities are independent of the target parameter!
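As a stand-in for the methods listed above, here is a plug-in sketch of the two moments using a ridge (Gaussian working prior) estimate; the regularizer lam, the noise level, and the toy data are illustrative assumptions, not part of the original method.

```python
import numpy as np

def auxiliary_moments(Xt, yt, sigma=1.0, lam=1.0):
    """Plug-in mean and variance of phi = sum_{j>=2} x~_{1,j} beta_j.

    A ridge estimate (the Bayes posterior under a Gaussian working prior
    N(0, (sigma^2/lam) I)) replaces AMP / LASSO / BCR here. Only rotated
    rows 2..n are used, so the moments do not depend on beta_1."""
    A = Xt[1:, 1:]                           # rotated rows 2..n, columns 2..p
    b = yt[1:]
    w = Xt[0, 1:]                            # weights x~_{1,j}, j = 2..p
    G = A.T @ A + lam * np.eye(A.shape[1])
    beta_hat = np.linalg.solve(G, A.T @ b)   # working posterior mean
    cov = sigma**2 * np.linalg.inv(G)        # working posterior covariance
    return float(w @ beta_hat), float(w @ cov @ w)

Xt = np.array([[2.0, 0.5, -0.2],
               [0.0, 1.0, 0.3],
               [0.0, -0.4, 1.1],
               [0.0, 0.2, 0.8]])
yt = np.array([0.7, 1.2, -0.3, 0.5])
m, v = auxiliary_moments(Xt, yt)
print(m, v)  # a posterior mean and a strictly positive variance
```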

SLIDE 35

step 3: approximate

  • apply the Gaussian approximation to the auxiliary variable to compute the posterior approximation:

p(β1|y) ∝ ∫ p(ỹ1 | φ(β2:p), β1) p(β1) p(φ(β2:p) | ỹ2:n) dφ(β2:p)

the likelihood term is Gaussian by assumption on the noise; p(β1) is the prior distribution; p(φ(β2:p) | ỹ2:n) is replaced with a Gaussian using the mean and variance from the previous step

  • the approximation can be accurate even if the prior and posterior are highly non-Gaussian
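For a Bernoulli-Gaussian prior, the one-dimensional integral above collapses in closed form once the auxiliary variable is replaced by N(m, v); for other priors a numerical one-dimensional quadrature would be needed. The inputs m, v and the hyperparameters below are illustrative.

```python
import math

def approx_inclusion_prob(y1t, x11, m, v, sigma=1.0, pi0=0.5, tau=1.0):
    """Step 3 for a Bernoulli-Gaussian prior: with the Gaussian
    approximation N(m, v) for the auxiliary variable phi, the integral
    over phi collapses, leaving a scalar spike-and-slab update for
    beta_1. Returns the approximate P(beta_1 != 0 | y)."""
    def normpdf(x, var):
        return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    noise_var = sigma**2 + v                        # noise plus phi uncertainty
    spike = pi0 * normpdf(y1t - m, noise_var)       # beta_1 = 0
    slab = (1.0 - pi0) * normpdf(y1t - m, noise_var + (x11 * tau)**2)
    return slab / (spike + slab)

print(approx_inclusion_prob(y1t=5.0, x11=1.0, m=0.0, v=0.5))  # large residual: near 1
print(approx_inclusion_prob(y1t=0.2, x11=1.0, m=0.0, v=0.5))  # small residual: below 1/2
```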

SLIDE 36

advantages of our framework

  • does not apply a Gaussian approximation directly to the posterior
  • has precise theoretical guarantees under the same assumptions as AMP
  • can leverage other methods (e.g. LASSO) to produce accurate approximations in settings where AMP fails

SLIDE 37

results: accuracy of posterior inclusion probabilities p(β1 ≠ 0 | y)

  • for a small problem (p = 12) we can compute the MSE with respect to the true posterior inclusion probability

[Figure: MSE vs. correlation between columns of the matrix for approximate message passing (AMP) and Bayesian compressed regression (BCR)]

SLIDE 38

results: accuracy of posterior inclusion probabilities p(β1 ≠ 0 | y)

  • for large problems the ground truth is intractable, so we compare methods using empirical ROC curves

[Figure: ROC curves (true positive rate vs. false positive rate) for AMP and LASSO; left panel: matrix with iid entries (incoherent columns); right panel: matrix with correlated columns]

SLIDE 39

further directions

framework extends to more general models:

p(β, y) = ∫ p(β|θ) p(y|β, θ) dθ,  p(β|θ) = ∏_{j=1}^p p(βj|θ)

entries of β conditionally independent, observations conditionally Gaussian

SLIDE 40

further directions

framework extends to more general models:

p(β, y) = ∫ p(β|θ) p(y|β, θ) dθ,  p(β|θ) = ∏_{j=1}^p p(βj|θ)

entries of β conditionally independent, observations conditionally Gaussian

provide theoretical guarantees for the approximate Gaussianity of the auxiliary variable:

p(φ(β2:p) | ỹ2:n) ≈ N( E[φ(β2:p) | ỹ2:n], Var[φ(β2:p) | ỹ2:n] )

how close is our Gaussian approximation to the true posterior?

SLIDE 41

main points

  • high-dimensional variable selection is an important problem that is studied heavily
  • many estimators lack measures of statistical significance
  • we introduce a framework which can turn point estimates into marginal posterior approximations
  • the key idea is to reparameterize the data, via a rotation, and apply a Gaussian approximation to an auxiliary variable

SLIDE 42

the end

SLIDE 43

precise theoretical characterization

95% accuracy in detecting nonzeros [Reeves & Gastpar, '12]

[Figure: sampling rate (ratio of observations to unknown parameters) vs. signal-to-noise ratio SNR (dB); regions and curves for Not Achievable, Linear MMSE, AMP with soft thresholding, AMP with MMSE, and Maximum Likelihood]