SLIDE 1

Efficient adaptive experimental design

Liam Paninski

Department of Statistics and Center for Theoretical Neuroscience Columbia University http://www.stat.columbia.edu/∼liam liam@stat.columbia.edu April 1, 2010

— with J. Lewi, S. Woolley

SLIDE 2

Avoiding the curse of insufficient data

1: Estimate some functional f(p) instead of the full joint distribution p(r, s) — information-theoretic functionals
2: Improved nonparametric estimators — minimax theory for discrete distributions under KL loss
3: Select stimuli more efficiently — optimal experimental design
(4: Parametric approaches)

SLIDE 3

Setup

Assume:

  • parametric model pθ(r|x) on responses r given inputs x

  • prior distribution p(θ) on finite-dimensional model space

Goal: estimate θ from experimental data.

Usual approach: draw stimuli i.i.d. from a fixed p(x).

Adaptive approach: choose p(x) on each trial to maximize Ex I(θ; r|x) (e.g. “staircase” methods).
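A minimal sketch of this adaptive loop on a discrete grid, assuming a binary-response (psychometric-style) model with a logistic link; the grid sizes, true parameter, and all variable names are illustrative choices, not specified on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-d setup: binary responses, logistic link, unknown threshold theta
X = np.linspace(-5, 5, 101)            # candidate stimuli
Theta = np.linspace(-5, 5, 201)        # parameter grid for the posterior
a = 1.0                                # known scale
theta_true = 1.3

def p_response(x, theta):
    """p(r = 1 | x, theta) = f((x - theta) / a) with a logistic f."""
    return 1.0 / (1.0 + np.exp(-(x - theta) / a))

def expected_info(x, posterior):
    """Mutual information I(theta; r | x) under the current posterior (in nats)."""
    p1 = p_response(x, Theta)                    # p(r=1 | x, theta) on the grid
    m1 = np.sum(posterior * p1)                  # marginal p(r=1 | x)
    H_marginal = -m1 * np.log(m1) - (1 - m1) * np.log(1 - m1)
    H_conditional = np.sum(posterior * (-p1 * np.log(p1) - (1 - p1) * np.log(1 - p1)))
    return H_marginal - H_conditional

posterior = np.full(Theta.size, 1.0 / Theta.size)     # uniform prior p(theta)

for trial in range(100):
    # infomax: choose the stimulus with the largest expected information gain
    x = X[np.argmax([expected_info(xx, posterior) for xx in X])]
    r = rng.random() < p_response(x, theta_true)      # simulate the response
    likelihood = p_response(x, Theta) if r else 1.0 - p_response(x, Theta)
    posterior = posterior * likelihood
    posterior /= posterior.sum()                      # Bayes update on the grid

print("posterior mean of theta:", np.sum(posterior * Theta))
```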

SLIDE 4

Snapshot: one-dimensional simulation

[Figure: panels showing p(y = 1 | x, θ0) as a function of x, the information I(y; θ | x), and the posterior p(θ) at trial 100, for optimized vs. i.i.d. stimulus selection.]

SLIDE 5

Asymptotic result

Under regularity conditions, a posterior CLT holds (Paninski, 2005):

pN(√N(θ − θ0)) → N(µN, σ²);  µN ∼ N(0, σ²)

  • (σ²_iid)⁻¹ = Ex(Ix(θ0))

  • (σ²_info)⁻¹ = argmax_{C ∈ co(Ix(θ0))} log |C|

⟹ σ²_iid > σ²_info unless Ix(θ0) is constant in x

co(Ix(θ0)) = convex closure (over x) of the Fisher information matrices Ix(θ0). (log |C| is strictly concave, so the maximum is unique.)

SLIDE 6

Illustration of theorem

[Figure: posterior over θ at several trials, together with the posterior mean E(p), the posterior width σ(p) (log scale), and P(θ0), plotted against trial number.]

SLIDE 7

Psychometric example

  • stimuli x one-dimensional: intensity
  • responses r binary: detect/no detect

p(r = 1|x, θ) = f((x − θ)/a)

  • scale parameter a (assumed known)
  • want to learn threshold parameter θ as quickly as possible

[Figure: the psychometric curve p(1 | x, θ) as a function of x, centered at the threshold θ.]

SLIDE 8

Psychometric example: results

  • variance-minimizing and info-theoretic methods are asymptotically the same

  • there is just one function f∗ for which σ_iid = σ_opt; for any other f, σ_iid > σ_opt

Ix(θ) = ḟ_{a,θ}² / [f_{a,θ}(1 − f_{a,θ})]

  • f∗ solves ḟ_{a,θ} = c √(f_{a,θ}(1 − f_{a,θ})), which makes Ix(θ) constant in x (checked numerically below)

  • f∗(t) = (sin(ct) + 1) / 2

  • σ²_iid/σ²_opt ∼ 1/a for small a
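A quick numerical check of the f∗ claim above; c and the grid of arguments are arbitrary choices. It only verifies that ḟ²/(f(1 − f)) is constant for f∗(t) = (sin(ct) + 1)/2.

```python
import numpy as np

# For f*(t) = (sin(c t) + 1) / 2, the Fisher information f'(t)^2 / (f(t)(1 - f(t)))
# equals c^2 everywhere, so i.i.d. and optimized designs coincide for this link.
c = 2.0
t = np.linspace(-0.7, 0.7, 2001)           # stay inside one monotone branch of sin

f = (np.sin(c * t) + 1.0) / 2.0
fdot = c * np.cos(c * t) / 2.0

fisher = fdot**2 / (f * (1.0 - f))
print(fisher.min(), fisher.max())           # both ≈ c**2 = 4.0
```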
SLIDE 9

Part 2: Computing the optimal stimulus

OK, now how do we actually do this in neural case?

  • Computing I(θ; r|x) requires an integration over θ — in general, exponentially hard in dim(θ)

  • Maximizing I(θ; r|x) over x is doubly hard — in general, exponentially hard in dim(x)

Doing all this in real time (∼ 10 ms – 1 sec) is a major challenge!

SLIDE 10

Three key steps

  • 1. Choose a tractable, flexible model of neural encoding

  • 2. Choose a tractable, accurate approximation of the posterior p(θ|{xi, ri}i≤N)

  • 3. Use these approximations and some perturbation theory to reduce the optimization problem to a simple 1-d line search
SLIDE 11

Step 1: focus on GLM case

ri ∼ Poiss(λi);  λi | xi, θ = f(k · xi + Σj aj ri−j)

More generally, log p(ri | θ, xi) = k(r) f(θ · xi) + s(r) + g(θ · xi)

Goal: learn θ = {k, a} in as few trials as possible.

SLIDE 12

GLM likelihood

ri ∼ Poiss(λi);  λi | xi, θ = f(k · xi + Σj aj ri−j)

log p(ri | xi, θ) = −f(k · xi + Σj aj ri−j) + ri log f(k · xi + Σj aj ri−j)

Two key points:

  • Likelihood is “rank-1” — it depends on θ only along the direction z = (x, r).

  • f convex and log-concave ⟹ log-likelihood concave in θ (sketched below)
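A minimal sketch of this single-trial likelihood, assuming the canonical f = exp; the dimensions, toy data, and variable names are made up for illustration.

```python
import numpy as np

def glm_loglik(theta, x, r_hist, r):
    """log p(r | x, theta) for one trial of the Poisson GLM with f = exp.

    theta = (k, a): stimulus filter k and spike-history weights a.
    The likelihood depends on theta only through the projection theta . z,
    where z = (x, r_hist) -- the "rank-1" structure noted above.
    """
    k, a = theta
    z = np.concatenate([x, r_hist])
    u = np.dot(np.concatenate([k, a]), z)      # k.x + sum_j a_j r_{i-j}
    lam = np.exp(u)                            # f(u): convex and log-concave
    return -lam + r * u                        # -f(u) + r log f(u), concave in theta

# Example usage with made-up dimensions
rng = np.random.default_rng(1)
x, r_hist = rng.normal(size=20), rng.poisson(1.0, size=5).astype(float)
k, a = rng.normal(scale=0.1, size=20), rng.normal(scale=0.1, size=5)
print(glm_loglik((k, a), x, r_hist, r=2))
```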

SLIDE 13

Step 2: representing the posterior

Idea: Laplace approximation p(θ|{xi, ri}i≤N) ≈ N(µN, CN)

Justification:

  • posterior CLT

  • the likelihood is log-concave, so the posterior is also log-concave:

log p(θ|{xi, ri}i≤N) ∼ log p(θ|{xi, ri}i≤N−1) + log p(rN | xN, θ)
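A minimal batch sketch of the Laplace approximation for the Poisson GLM with f = exp and a Gaussian prior: Newton's method finds the posterior mode, and the negative inverse Hessian there serves as the covariance. The sequential O(d²) version used on the slides is sketched under the next slide; all names and toy data here are illustrative.

```python
import numpy as np

def laplace_posterior(X, R, mu0, C0, n_iter=20):
    """Return (mu_N, C_N) ≈ mean and covariance of p(theta | {x_i, r_i})."""
    C0_inv = np.linalg.inv(C0)

    def grad_hess(theta):
        lam = np.exp(X @ theta)                         # Poisson rates, f = exp
        g = X.T @ (R - lam) - C0_inv @ (theta - mu0)    # gradient of log posterior
        H = -(X * lam[:, None]).T @ X - C0_inv          # Hessian (negative definite)
        return g, H

    theta = mu0.copy()
    for _ in range(n_iter):
        g, H = grad_hess(theta)
        theta = theta - np.linalg.solve(H, g)           # Newton step toward the mode
    _, H = grad_hess(theta)
    return theta, np.linalg.inv(-H)                     # covariance = (-Hessian)^{-1}

# Toy usage with made-up data
rng = np.random.default_rng(2)
d, N = 5, 200
theta_true = rng.normal(scale=0.3, size=d)
X = rng.normal(size=(N, d))
R = rng.poisson(np.exp(X @ theta_true))
mu_N, C_N = laplace_posterior(X, R, mu0=np.zeros(d), C0=np.eye(d))
print(mu_N.round(2), np.diag(C_N).round(3))
```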

SLIDE 14

Efficient updating

Updating µN: a one-dimensional search (the new likelihood term varies only along z)

Updating CN: rank-one update, CN = (CN−1⁻¹ + b·z zᵗ)⁻¹ — use the Woodbury lemma

Total time for the posterior update: O(d²)
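A minimal sketch of the rank-one covariance update via the Sherman–Morrison/Woodbury identity; here b stands in for the scalar weight of this trial's information along z, and the comparison against the direct inverse is only a consistency check.

```python
import numpy as np

def rank_one_update(C, z, b):
    """Return (C^{-1} + b z z^T)^{-1} in O(d^2) time (Sherman–Morrison)."""
    Cz = C @ z
    return C - np.outer(Cz, Cz) * (b / (1.0 + b * (z @ Cz)))

# Quick consistency check against the direct O(d^3) inverse
rng = np.random.default_rng(3)
d = 6
A = rng.normal(size=(d, d))
C = A @ A.T + np.eye(d)                       # a valid covariance
z, b = rng.normal(size=d), 0.7
direct = np.linalg.inv(np.linalg.inv(C) + b * np.outer(z, z))
print(np.allclose(rank_one_update(C, z, b), direct))   # True
```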

SLIDE 15

Step 3: Efficient stimulus optimization

Laplace approximation ⟹ I(θ; r|x) ∼ Er|x log(|CN−1| / |CN|)

— this is nonlinear and difficult, but we can simplify using perturbation theory: log |I + A| ≈ trace(A).

Now we can take averages over p(r|x) = ∫ p(r|θ, x) pN(θ) dθ: a standard Fisher information calculation given the Poisson assumption on r. Further assuming f(·) = exp(·) lets us compute the expectation exactly, using the m.g.f. of a Gaussian.

...finally, we want to maximize F(x) = g(µN · x) h(xᵗCNx).
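A sketch of the resulting objective under the f = exp assumption. The specific forms g(u) = exp(u) and h(s) = s·exp(s/2) used below are our reading of the exp-Poisson case via the Gaussian m.g.f., not given on the slide; the Monte Carlo comparison only checks the m.g.f. step.

```python
import numpy as np

# Separable objective F(x) = g(mu_N . x) * h(x^T C_N x) with g(u) = exp(u) and
# h(s) = s * exp(s / 2), following from E[exp(theta . x)] = exp(mu.x + x^T C x / 2).
def F(x, mu, C):
    s = x @ C @ x
    return np.exp(mu @ x + 0.5 * s) * s

# Monte Carlo check of the m.g.f. step: E_theta[exp(theta . x)] * x^T C x
rng = np.random.default_rng(4)
d = 4
mu = rng.normal(scale=0.2, size=d)
A = rng.normal(size=(d, d)) * 0.2
C = A @ A.T + 0.1 * np.eye(d)
x = rng.normal(size=d)

theta = rng.multivariate_normal(mu, C, size=200_000)
mc = np.mean(np.exp(theta @ x)) * (x @ C @ x)
print(F(x, mu, C), mc)        # the two values should agree to within Monte Carlo error
```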

SLIDE 16

Computing the optimal x

max_x g(µN · x) h(xᵗCNx) increases with ||x||₂: constraining ||x||₂ reduces the problem to a nonlinear eigenvalue problem.

A Lagrange multiplier approach (Berkes and Wiskott, 2006) reduces the problem to a 1-d line search, once the eigendecomposition is computed — much easier than the full d-dimensional optimization! A rank-one update of the eigendecomposition may be performed in O(d²) time (Gu and Eisenstat, 1994).

⟹ Computing the optimal stimulus takes O(d²) time.
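A brute-force stand-in for this optimization: projected gradient ascent of the same illustrative F(x) as in the previous sketch, on the sphere ||x|| = m. This is not the O(d²) eigendecomposition/line-search method cited on the slide; it is only meant to make the constrained problem concrete.

```python
import numpy as np

def optimize_stimulus(mu, C, m=1.0, n_steps=500, lr=0.05, seed=0):
    """Ascend log F(x) = mu.x + log s + s/2, s = x^T C x, projected onto ||x|| = m."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=mu.size)
    x *= m / np.linalg.norm(x)
    for _ in range(n_steps):
        s = x @ C @ x
        grad = mu + (2.0 / s + 1.0) * (C @ x)   # gradient of log F(x)
        x = x + lr * grad
        x *= m / np.linalg.norm(x)              # project back onto the sphere
    return x

rng = np.random.default_rng(5)
d = 10
mu = rng.normal(scale=0.3, size=d)
A = rng.normal(size=(d, d)) * 0.3
C = A @ A.T + 0.05 * np.eye(d)
x_opt = optimize_stimulus(mu, C)
print(np.linalg.norm(x_opt))                    # ≈ 1.0, the norm constraint
```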

SLIDE 17

Side note: linear-Gaussian case is easy

Linear-Gaussian case: ri = θ · xi + εi, εi ∼ N(0, σ²)

  • The previous approximations are exact; instead of a nonlinear eigenvalue problem, we have a standard eigenvalue problem. No dependence on µN, just CN.

  • The Fisher information does not depend on the observed ri, so the optimal sequence {x1, x2, . . .} can be precomputed: the observed ri do not change the optimal strategy.
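A minimal sketch of this precomputation: in the linear-Gaussian case the information gain for stimulus x is log(1 + xᵗCx/σ²), which on the sphere ||x|| = m is maximized by the top eigenvector of the current posterior covariance, and the covariance update never touches the responses. Names and the toy prior are illustrative.

```python
import numpy as np

def precompute_stimuli(C0, sigma2, m, n_trials):
    """Greedy D-optimal sequence for r = theta . x + N(0, sigma2) noise."""
    C = C0.copy()
    stimuli = []
    for _ in range(n_trials):
        w, V = np.linalg.eigh(C)
        x = m * V[:, -1]                          # top eigenvector of C
        stimuli.append(x)
        # exact posterior covariance update (independent of the response r)
        Cx = C @ x
        C = C - np.outer(Cx, Cx) / (sigma2 + x @ Cx)
    return stimuli

stims = precompute_stimuli(C0=np.diag([4.0, 1.0, 0.25]), sigma2=1.0, m=1.0, n_trials=5)
print(np.array(stims).round(2))
```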
SLIDE 18

Near real-time adaptive design

[Figure: computation time (seconds, log scale) vs. stimulus dimensionality, broken down into total time, diagonalization, posterior update, and 1-d line search.]

SLIDE 19

Simulation overview

SLIDE 20

Gabor example

— infomax approach is an order of magnitude more efficient.

SLIDE 21

Application to songbird data: choosing an optimal stimulus sequence

— stimuli chosen from a fixed pool; greater improvements expected if we can choose arbitrary stimuli on each trial.

SLIDE 22

Handling nonstationary parameters

Various sources of nonsystematic nonstationarity:

  • Eye position drift
  • Changes in arousal / attentive state
  • Changes in health / excitability of preparation

Solution: allow diffusion in an extended Kalman filter:

θN+1 = θN + ε;  ε ∼ N(0, Q)
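A minimal sketch of folding this diffusion step into the Gaussian posterior update: between trials the covariance is inflated by Q (prediction step) before the usual measurement update. The measurement update shown here is the linear-Gaussian one purely for brevity; Q and all names are illustrative.

```python
import numpy as np

def kalman_step(mu, C, x, r, Q, sigma2=1.0):
    # prediction: diffusion inflates the covariance, the mean is unchanged
    C = C + Q
    # measurement update for r = theta . x + N(0, sigma2) noise
    Cx = C @ x
    gain = Cx / (sigma2 + x @ Cx)
    mu = mu + gain * (r - mu @ x)
    C = C - np.outer(gain, Cx)
    return mu, C

# Toy usage: tracking a slowly drifting 2-d parameter
rng = np.random.default_rng(6)
theta = np.array([1.0, -0.5])
mu, C, Q = np.zeros(2), np.eye(2), 1e-3 * np.eye(2)
for _ in range(300):
    theta = theta + rng.normal(scale=np.sqrt(1e-3), size=2)   # true drift
    x = rng.normal(size=2)
    r = theta @ x + rng.normal()
    mu, C = kalman_step(mu, C, x, r, Q)
print(theta.round(2), mu.round(2))
```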

SLIDE 23

Nonstationary example

[Figure: parameter estimates θi (i = 1…100) over trials 1–800 for the true (drifting) θ, info. max. with diffusion, info. max. without diffusion, and random stimulus selection.]

SLIDE 24

Asymptotic efficiency

We made a bunch of approximations; do we still achieve the correct asymptotic rate? Recall:

  • (σ²_iid)⁻¹ = Ex(Ix(θ0))

  • (σ²_info)⁻¹ = argmax_{C ∈ co(Ix(θ0))} log |C|

SLIDE 25

Asymptotic efficiency: finite stimulus set

If |X| < ∞, computing infomax rate is just a finite-dimensional (numerical) convex optimization over p(x).
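A minimal sketch of this computation with a generic constrained optimizer: maximize log|Σi p(xi) Ixi(θ0)| over the probability simplex. The Fisher information matrices below are random stand-ins; in practice they come from the model at θ0.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
d, n_stim = 3, 8
A = rng.normal(size=(n_stim, d, d)) * 0.5
I_x = np.einsum('nij,nkj->nik', A, A)           # n_stim random PSD "Fisher" matrices

def neg_logdet(p):
    # negative objective: -log|sum_i p_i I_xi|
    C = np.tensordot(p, I_x, axes=1)
    sign, logdet = np.linalg.slogdet(C)
    return -logdet if sign > 0 else np.inf

p0 = np.full(n_stim, 1.0 / n_stim)
res = minimize(neg_logdet, p0, method='SLSQP',
               bounds=[(0.0, 1.0)] * n_stim,
               constraints=[{'type': 'eq', 'fun': lambda p: p.sum() - 1.0}])
p_opt = res.x
print(p_opt.round(3))                            # optimal design weights p(x)
print(-neg_logdet(p_opt), -neg_logdet(p0))       # infomax vs. uniform log|C|
```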

SLIDE 26

Asymptotic efficiency: bounded norm case

If X = {x : ||x||₂ < c < ∞}, optimizing over p(x) is now infinite-dimensional, but symmetry arguments reduce this to a two-dimensional problem (Lewi et al., 2009).

— σ²_iid/σ²_opt ∼ dim(x): infomax is most efficient in high-dimensional cases

SLIDE 27

Conclusions

  • Three key assumptions/approximations enable real-time (O(d²)) infomax stimulus design:
    — generalized linear model
    — Laplace approximation
    — first-order approximation of the log-determinant

  • Able to deal with adaptation through spike-history terms and nonstationarity through the Kalman formulation

  • Directions: application to real data; optimizing over a sequence of stimuli {xt, xt+1, . . . , xt+b} instead of just the next stimulus xt

SLIDE 28

References

Berkes, P. and Wiskott, L. (2006). On the analysis and interpretation of inhomogeneous quadratic forms as receptive fields. Neural Computation, 18:1868–1895.

Gu, M. and Eisenstat, S. (1994). A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem. SIAM J. Matrix Anal. Appl., 15(4):1266–1276.

Lewi, J., Butera, R., and Paninski, L. (2009). Sequential optimal design of neurophysiology experiments. Neural Computation, 21:619–687.

Paninski, L. (2005). Asymptotic theory of information-theoretic experimental design. Neural Computation, 17:1480–1507.