



  1. Efficient adaptive experimental design
     Liam Paninski, Department of Statistics and Center for Theoretical Neuroscience, Columbia University
     http://www.stat.columbia.edu/∼liam, liam@stat.columbia.edu
     April 1, 2010, with J. Lewi, S. Woolley

  2. Avoiding the curse of insufficient data
     1: Estimate some functional f(p) instead of the full joint distribution p(r, s) (information-theoretic functionals)
     2: Improved nonparametric estimators (minimax theory for discrete distributions under KL loss)
     3: Select stimuli more efficiently (optimal experimental design)
     (4: Parametric approaches)

  3. Setup
     Assume:
     • a parametric model p_θ(r | x) of responses r given inputs x
     • a prior distribution p(θ) on a finite-dimensional model space
     Goal: estimate θ from experimental data.
     Usual approach: draw stimuli i.i.d. from a fixed p(x).
     Adaptive approach: choose p(x) on each trial to maximize E_x I(θ; r | x) (e.g. "staircase" methods).
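
The greedy rule above ("pick the stimulus that maximizes the expected information about θ") is easy to prototype. Below is a minimal, hypothetical sketch, not the method developed later in the talk: the mutual information is estimated by nested Monte Carlo over posterior samples, and the best stimulus is chosen from a finite candidate pool. The callables `simulate_response` and `log_likelihood`, and all other names, are illustrative assumptions.

```python
import numpy as np

def expected_information(x, theta_samples, simulate_response, log_likelihood):
    """Nested Monte Carlo estimate of I(theta; r | x) under the current posterior.

    theta_samples     : draws from the current posterior p(theta | data so far)
    simulate_response : callable returning r ~ p(r | theta, x)   (assumed)
    log_likelihood    : callable returning log p(r | theta, x)   (assumed)
    """
    total = 0.0
    for theta in theta_samples:
        r = simulate_response(theta, x)
        # I(theta; r | x) = E[ log p(r | theta, x) - log p(r | x) ]
        log_p_cond = log_likelihood(r, theta, x)
        log_p_marg = np.log(np.mean(
            [np.exp(log_likelihood(r, t, x)) for t in theta_samples]))
        total += log_p_cond - log_p_marg
    return total / len(theta_samples)

def choose_next_stimulus(candidates, theta_samples, simulate_response, log_likelihood):
    """Greedy infomax: pick the candidate stimulus with the largest estimated information."""
    scores = [expected_information(x, theta_samples, simulate_response, log_likelihood)
              for x in candidates]
    return candidates[int(np.argmax(scores))]
```

This brute-force estimator costs O(|candidates| x n_samples^2) likelihood evaluations per trial, which is exactly the bottleneck the GLM/Laplace machinery in Part 2 is designed to avoid.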

  4. Snapshot: one-dimensional simulation
     [Figure: panels show p(y = 1 | x, θ₀), the information I(y; θ | x) as a function of x, and posteriors p(θ) at trials 40 and 100 under optimized vs. i.i.d. stimulus selection.]

  5. Asymptotic result
     Under regularity conditions, a posterior CLT holds (Paninski, 2005):
        p_N(√N (θ − θ₀)) → N(µ_N, σ²);  µ_N ∼ N(0, σ²)
     • (σ²_iid)⁻¹ = E_x[I_x(θ₀)]
     • (σ²_info)⁻¹ = argmax_{C ∈ co(I_x(θ₀))} log |C|
     ⇒ σ²_iid > σ²_info unless I_x(θ₀) is constant in x.
     co(I_x(θ₀)) = convex closure (over x) of the Fisher information matrices I_x(θ₀). (log |C| is strictly concave, so the maximum is unique.)
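
A toy numerical illustration of the two rates, for a scalar θ and a Bernoulli response model with a logistic link (an assumed example, not from the slides): for scalar θ the convex closure of {I_x(θ₀)} is an interval, so the argmax of log C is simply max_x I_x(θ₀), while the i.i.d. rate averages I_x over the stimulus distribution.

```python
import numpy as np

# Bernoulli responses with p(r = 1 | x, theta) = f(x - theta), logistic f.
# For the logistic link, f' = f (1 - f), so I_x(theta) = f'^2 / (f (1 - f)) = f (1 - f).
theta0 = 0.0
stimuli = np.linspace(-3.0, 3.0, 61)            # finite stimulus grid (assumed)
f = 1.0 / (1.0 + np.exp(-(stimuli - theta0)))
I_x = f * (1.0 - f)                             # per-stimulus Fisher information

inv_sigma2_iid = I_x.mean()                     # stimuli drawn uniformly i.i.d.
inv_sigma2_info = I_x.max()                     # scalar case: best point of the convex closure

print("sigma^2_iid  =", 1.0 / inv_sigma2_iid)   # larger
print("sigma^2_info =", 1.0 / inv_sigma2_info)  # smaller, since I_x is not constant in x
```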

  6. Illustration of theorem
     [Figure: posteriors over θ vs. trial number (1-100) for the two designs, with the posterior mean E(p), posterior spread σ(p) on a log scale, and P(θ₀) plotted against trial number.]

  7. Psychometric example
     • stimuli x are one-dimensional: intensity
     • responses r are binary: detect / no detect, with p(r = 1 | x, θ) = f((x − θ)/a)
     • scale parameter a (assumed known)
     • want to learn the threshold parameter θ as quickly as possible
     [Figure: sigmoidal psychometric function p(1 | x, θ) rising from 0 to 1 around x = θ.]

  8. Psychometric example: results
     • variance-minimizing and info-theoretic methods are asymptotically the same
     • there is exactly one function f* for which σ_iid = σ_opt; for any other f, σ_iid > σ_opt
        I_x(θ) = (ḟ_{a,θ})² / [f_{a,θ}(1 − f_{a,θ})]
     • f* solves ḟ_{a,θ} / √(f_{a,θ}(1 − f_{a,θ})) = c, giving f*(t) = (sin(ct) + 1)/2
     • σ²_iid / σ²_opt ∼ 1/a for small a
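
A quick numerical check of the last two bullets (a sketch under assumed values c = a = 1, θ = 0): for f*(t) = (sin(ct) + 1)/2 the Fisher information ḟ²/(f(1 − f)) comes out constant (equal to c²/a²) across stimuli, so i.i.d. and optimized designs are equally efficient, whereas for a logistic link it varies strongly with x.

```python
import numpy as np

c, a, theta = 1.0, 1.0, 0.0
x = np.linspace(-1.0, 1.0, 9)                  # stay on one monotone branch of sin
t = (x - theta) / a

# f*(t) = (sin(ct) + 1) / 2: constant Fisher information
f = (np.sin(c * t) + 1.0) / 2.0
fdot = (c / (2.0 * a)) * np.cos(c * t)         # derivative of f*((x - theta)/a) w.r.t. x
print(np.round(fdot**2 / (f * (1.0 - f)), 6))  # = c^2 / a^2 at every x

# Logistic link for comparison: the Fisher information varies with x
g = 1.0 / (1.0 + np.exp(-t / a))
gdot = g * (1.0 - g) / a
print(np.round(gdot**2 / (g * (1.0 - g)), 6))
```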

  9. Part 2: Computing the optimal stimulus
     OK, now how do we actually do this in the neural case?
     • Computing I(θ; r | x) requires an integration over θ: in general, exponentially hard in dim(θ)
     • Maximizing I(θ; r | x) over x is doubly hard: in general, exponentially hard in dim(x)
     Doing all this in real time (∼10 ms - 1 sec) is a major challenge!

  10. Three key steps
      1. Choose a tractable, flexible model of neural encoding
      2. Choose a tractable, accurate approximation of the posterior p(θ | {x_i, r_i}_{i ≤ N})
      3. Use approximations and some perturbation theory to reduce the optimization problem to a simple 1-d linesearch

  11. Step 1: focus on the GLM case
      r_i ∼ Poiss(λ_i);   λ_i | x_i, θ = f(k · x_i + Σ_j a_j r_{i−j})
      More generally, log p(r_i | θ, x_i) = k(r) f(θ · x_i) + s(r) + g(θ · x_i)
      Goal: learn θ = {k, a} in as few trials as possible.

  12. GLM likelihood
      r_i ∼ Poiss(λ_i);   λ_i | x_i, θ = f(k · x_i + Σ_j a_j r_{i−j})
      log p(r_i | x_i, θ) = −f(k · x_i + Σ_j a_j r_{i−j}) + r_i log f(k · x_i + Σ_j a_j r_{i−j})
      Two key points:
      • The likelihood is "rank-1": it only depends on θ along z = (x, r).
      • f convex and log-concave ⇒ the log-likelihood is concave in θ
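
A minimal sketch of this likelihood in code, assuming the canonical f = exp and a spike-history filter of length J (function and variable names are illustrative, not from the authors' code). It makes the two bullet points concrete: the data enter only through z = (x, r_history), and with f = exp the summed negative log-likelihood is convex in θ.

```python
import numpy as np

def glm_loglik(theta, x, r_hist, r):
    """log p(r | theta, x) for a Poisson GLM with f = exp, up to the -log(r!) constant.

    theta  : stacked parameter vector (k, a)
    x      : stimulus vector; r_hist : last J response counts; r : observed count
    """
    z = np.concatenate([x, r_hist])  # likelihood depends on theta only through theta . z ("rank-1")
    u = theta @ z
    lam = np.exp(u)                  # conditional intensity
    return -lam + r * u              # -f(u) + r log f(u), with log f(u) = u for f = exp

def total_neg_loglik(theta, X, R_hist, R):
    """Summed negative log-likelihood over trials; concave log-likelihood => convex objective."""
    return -sum(glm_loglik(theta, x, rh, r) for x, rh, r in zip(X, R_hist, R))
```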

  13. Step 2: representing the posterior
      Idea: Laplace approximation p(θ | {x_i, r_i}_{i ≤ N}) ≈ N(µ_N, C_N)
      Justification:
      • posterior CLT
      • the likelihood is log-concave, so the posterior is also log-concave:
        log p(θ | {x_i, r_i}_{i ≤ N}) ∼ log p(θ | {x_i, r_i}_{i ≤ N−1}) + log p(r_N | x_N, θ)

  14. Efficient updating
      Updating µ_N: one-dimensional search
      Updating C_N: rank-one update, C_N = (C_{N−1}⁻¹ + b z zᵀ)⁻¹, via the Woodbury lemma
      Total time for the posterior update: O(d²)
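
A sketch of both updates under the Laplace/GLM assumptions above. The scalars b and s are taken as given here: b is the curvature of the new trial's negative log-likelihood along z, and s is the step returned by the one-dimensional search for the new MAP; both computations are omitted in this sketch.

```python
import numpy as np

def posterior_update(mu, C, z, b, s):
    """O(d^2) update of the Gaussian posterior N(mu, C) after one trial with regressor z.

    Covariance: C_new = (C^{-1} + b z z^T)^{-1}, computed with the Woodbury identity.
    Mean: because the new likelihood term depends on theta only through theta . z,
    the new MAP lies on the line mu + s * (C z); s comes from a 1-d search (not shown).
    """
    Cz = C @ z                                              # O(d^2)
    C_new = C - np.outer(Cz, Cz) * (b / (1.0 + b * (z @ Cz)))
    mu_new = mu + s * Cz
    return mu_new, C_new
```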

  15. Step 3: Efficient stimulus optimization
      Laplace approximation ⇒ I(θ; r | x) ∼ E_{r|x} log(|C_{N−1}| / |C_N|)
      This is nonlinear and difficult, but we can simplify using perturbation theory: log |I + A| ≈ trace(A).
      Now we can take averages over p(r | x) = ∫ p(r | θ, x) p_N(θ) dθ: a standard Fisher information calculation given the Poisson assumption on r.
      Further assuming f(.) = exp(.) allows us to compute the expectation exactly, using the m.g.f. of a Gaussian.
      ...finally, we want to maximize F(x) = g(µ_N · x) h(xᵀ C_N x).
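
To make the final objective concrete, here is one way g and h come out in the f = exp case (a sketch of the calculation, not code from the paper): the observed Fisher information of a Poisson trial with log link is λ x xᵀ, the trace approximation gives an expected gain of roughly E[λ] · xᵀ C_N x, and the Gaussian m.g.f. gives E[λ] = exp(µ_N · x + xᵀ C_N x / 2).

```python
import numpy as np

def infomax_objective(x, mu, C):
    """F(x) = g(mu . x) h(x^T C x) with g = exp and h(v) = exp(v / 2) * v (f = exp case)."""
    u = mu @ x          # posterior mean of theta . x
    v = x @ (C @ x)     # posterior variance of theta . x
    return np.exp(u + 0.5 * v) * v
```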

  16. Computing the optimal x
      max_x g(µ_N · x) h(xᵀ C_N x) increases with ||x||₂: constraining ||x||₂ reduces the problem to a nonlinear eigenvalue problem.
      A Lagrange multiplier approach (Berkes and Wiskott, 2006) reduces the problem to a 1-d linesearch once the eigendecomposition is computed: much easier than a full d-dimensional optimization!
      The rank-one update of the eigendecomposition may be performed in O(d²) time (Gu and Eisenstat, 1994).
      ⇒ Computing the optimal stimulus takes O(d²) time.
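
The eigendecomposition-plus-linesearch machinery is not reproduced here; as a crude stand-in, the sketch below maximizes the same F(x) over the sphere ||x||₂ = m by normalized projected gradient ascent (the same constrained problem, solved by a simpler and slower method, using the assumed f = exp objective from the previous sketch).

```python
import numpy as np

def optimize_stimulus(mu, C, m, n_steps=200, step=0.05, seed=0):
    """Approximately maximize exp(mu.x + x^T C x / 2) * (x^T C x) subject to ||x||_2 = m."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(len(mu))
    x *= m / np.linalg.norm(x)
    for _ in range(n_steps):
        Cx = C @ x
        v = x @ Cx
        # Gradient of exp(u + v/2) * v is exp(u + v/2) * (v * mu + (v + 2) * Cx);
        # the positive exp(.) factor is dropped since only the direction is used.
        grad = v * mu + (v + 2.0) * Cx
        x = x + step * m * grad / (np.linalg.norm(grad) + 1e-12)
        x *= m / np.linalg.norm(x)             # project back onto the sphere ||x|| = m
    return x
```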

  17. Side note: the linear-Gaussian case is easy
      Linear-Gaussian case: r_i = θ · x_i + ε_i,  ε_i ∼ N(0, σ²)
      • The previous approximations are exact; instead of a nonlinear eigenvalue problem, we have a standard eigenvalue problem. No dependence on µ_N, just C_N.
      • The Fisher information does not depend on the observed r_i, so the responses never change the optimal strategy and the optimal sequence {x_1, x_2, ...} can be precomputed.
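
In this case the whole stimulus-selection step collapses to an eigenvector computation, sketched below: the information gain of a stimulus is 0.5 log(1 + xᵀ C_N x / σ²), so under a norm constraint the optimum is the leading eigenvector of C_N.

```python
import numpy as np

def optimal_stimulus_linear_gaussian(C, m):
    """Leading eigenvector of the posterior covariance, scaled to the norm constraint m."""
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    return m * eigvecs[:, -1]              # no dependence on mu_N or on observed responses
```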

  18. Near real-time adaptive design
      [Figure: computation time (seconds, log scale, roughly 0.001-0.1) vs. stimulus dimensionality (0-600) for the total per-trial update and its components: diagonalization, posterior update, and 1-d line search.]

  19. Simulation overview

  20. Gabor example: the infomax approach is an order of magnitude more efficient.

  21. Application to songbird data: choosing an optimal stimulus sequence. Stimuli were chosen from a fixed pool; greater improvements are expected if we can choose arbitrary stimuli on each trial.

  22. Handling nonstationary parameters
      Various sources of nonsystematic nonstationarity:
      • Eye position drift
      • Changes in arousal / attentive state
      • Changes in health / excitability of the preparation
      Solution: allow diffusion in an extended Kalman filter: θ_{N+1} = θ_N + ε,  ε ∼ N(0, Q)
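
In filtering terms this is just a "predict" step inserted before each trial's rank-one measurement update; a one-line sketch, with Q the assumed diffusion covariance:

```python
import numpy as np

def diffusion_step(mu, C, Q):
    """Propagate the Gaussian posterior through theta_{N+1} = theta_N + eps, eps ~ N(0, Q):
    the mean is unchanged and the covariance grows by Q, so the filter keeps tracking a drifting theta."""
    return mu.copy(), C + Q
```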

  23. Nonstationary example
      [Figure: the true (drifting) θ alongside estimates from info. max. with diffusion, info. max. without diffusion, and random designs; each panel shows θ_i (i = 1-100) across trials 1-800.]

  24. Asymptotic efficiency
      We made a bunch of approximations; do we still achieve the correct asymptotic rate? Recall:
      • (σ²_iid)⁻¹ = E_x[I_x(θ₀)]
      • (σ²_info)⁻¹ = argmax_{C ∈ co(I_x(θ₀))} log |C|

  25. Asymptotic efficiency: finite stimulus set
      If |X| < ∞, computing the infomax rate is just a finite-dimensional (numerical) convex optimization over p(x).
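
For concreteness, here is a sketch of that optimization using the classical multiplicative update for D-optimal designs (one convenient solver for this convex problem; the slides do not specify which one is used). It maximizes log |Σ_x p(x) I_x(θ₀)| over the probability simplex.

```python
import numpy as np

def infomax_rate_finite(fisher_infos, n_iter=500):
    """fisher_infos: list of d x d Fisher information matrices I_x(theta_0), one per stimulus.
    Returns the optimal design p and M = (sigma^2_info)^{-1}."""
    K = len(fisher_infos)
    d = fisher_infos[0].shape[0]
    p = np.full(K, 1.0 / K)                       # start from the uniform design
    for _ in range(n_iter):
        M = sum(w * I for w, I in zip(p, fisher_infos))
        Minv = np.linalg.inv(M)
        gains = np.array([np.trace(Minv @ I) for I in fisher_infos])
        p *= gains / d                            # multiplicative D-optimality update
        p /= p.sum()                              # guard against round-off
    M = sum(w * I for w, I in zip(p, fisher_infos))
    return p, M
```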

  26. Asymptotic efficiency: bounded-norm case
      If X = {x : ||x||₂ < c < ∞}, optimizing over p(x) is now infinite-dimensional, but symmetry arguments reduce this to a two-dimensional problem (Lewi et al., 2009).
      σ²_iid / σ²_opt ∼ dim(x): infomax is most efficient in high-dimensional cases.

  27. Conclusions
      • Three key assumptions/approximations enable real-time (O(d²)) infomax stimulus design:
        - generalized linear model
        - Laplace approximation
        - first-order approximation of the log-determinant
      • Able to deal with adaptation through spike-history terms and with nonstationarity through the Kalman formulation
      • Directions: application to real data; optimizing over a sequence of stimuli {x_t, x_{t+1}, ..., x_{t+b}} instead of just the next stimulus x_t.

  28. References
      Berkes, P. and Wiskott, L. (2006). On the analysis and interpretation of inhomogeneous quadratic forms as receptive fields. Neural Computation, 18:1868–1895.
      Gu, M. and Eisenstat, S. (1994). A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem. SIAM J. Matrix Anal. Appl., 15(4):1266–1276.
      Lewi, J., Butera, R., and Paninski, L. (2009). Sequential optimal design of neurophysiology experiments. Neural Computation, 21:619–687.
      Paninski, L. (2005). Asymptotic theory of information-theoretic experimental design. Neural Computation, 17:1480–1507.
