SLIDE 1

Real-time adaptive information-theoretic optimization of neurophysiology experiments

Presented by Alex Roper, March 5, 2009

SLIDES 2–4

Goals

◮ How do neurons react to stimuli?
◮ What is a neuron’s preferred stimulus?
◮ Minimize the number of trials.
◮ Speed: must run in real time.
◮ Emphasis on dimensional scalability (vision).

SLIDES 5–8

Challenges

◮ Typically high dimension
  ◮ Model complexity: memory
  ◮ Stimulus complexity: visual bitmap
◮ Bayesian approach is expensive
  ◮ Estimation
  ◮ Integration
  ◮ Multivariate optimization
◮ Limited firing capacity of a neuron (exhaustion)
◮ Essential issues
  ◮ Update a posteriori beliefs quickly given new data
  ◮ Find the optimal stimulus quickly

SLIDES 9–15

Neuron Model

p(r_t | {x_t, x_{t−1}, ..., x_{t−t_k}}, {r_{t−1}, ..., r_{t−t_k}})

◮ The response r_t to stimulus x_t depends on x_t itself, as well as on the history of stimuli and responses over a constant sliding window.
◮ This is needed to capture exhaustion, depletion, etc.

λ_t = E(r_t) = f( Σ_i Σ_{l=1}^{t_k} k_{i,t−l} x_{i,t−l} + Σ_{j=1}^{t_a} a_j r_{t−j} )

◮ The filter coefficients k_{i,t−l} represent the dependence on the input itself.
◮ The coefficients a_j model the dependence on observed recent activity.
◮ We summarize all unknown parameters as θ. This is what we’re trying to learn (a short rate-computation sketch follows).
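
To make the rate equation concrete, here is a minimal sketch of evaluating λ_t for one time step; the array names and the exponential link are illustrative assumptions, not the presenter's code.

import numpy as np

def conditional_intensity(stim_history, spike_history, K, a, f=np.exp):
    # lambda_t = f( sum_{i,l} K[i, l] * x[i, t-l] + sum_j a[j] * r[t-j] )
    # stim_history : (d, t_k) array, columns are x_{t-1}, ..., x_{t-t_k}
    # spike_history: (t_a,)  array, entries are r_{t-1}, ..., r_{t-t_a}
    # K            : (d, t_k) stimulus filter coefficients
    # a            : (t_a,)  spike-history coefficients
    drive = np.sum(K * stim_history) + np.dot(a, spike_history)
    return f(drive)

# Toy usage: 2-D stimulus, filter window t_k = 3, spike-history window t_a = 2.
rng = np.random.default_rng(0)
lam = conditional_intensity(rng.normal(size=(2, 3)), np.array([1.0, 0.0]),
                            0.1 * rng.normal(size=(2, 3)), np.array([-0.5, -0.2]))
print(lam)  # expected spike count for the next time bin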

SLIDE 16

Generalized Linear Models

◮ Distribution function (multivariate Gaussian).
◮ Linear predictor, θ.
◮ Link function (exponential); a toy end-to-end sketch follows.
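
A toy view of those three ingredients; the Poisson spike count is my assumption, since the slide only names the Gaussian, the linear predictor, and the exponential link.

import numpy as np

rng = np.random.default_rng(1)
d = 5
theta = rng.multivariate_normal(np.zeros(d), np.eye(d))  # multivariate Gaussian over the parameters
x = rng.normal(size=d)                                   # one stimulus / covariate vector
lam = np.exp(theta @ x)                                  # exponential link applied to the linear predictor
r = rng.poisson(lam)                                     # assumed Poisson spike count
print(lam, r)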

SLIDES 17–25

Updating the Posterior

◮ Ideally, this runs in real time.
◮ Approximate the posterior as Gaussian.
  ◮ The posterior is the product of two smooth, log-concave terms (the GLM likelihood and the Gaussian prior).
  ◮ Use the Laplace approximation to construct a Gaussian approximation of the posterior.
  ◮ Set µ_t to the peak of the posterior.
  ◮ Set the covariance matrix C_t to the negative inverse of the Hessian of the log posterior at µ_t.
◮ Compute this directly?
  ◮ Complexity is O(td² + d³):
    ◮ O(td²) for the product of t likelihood terms.
    ◮ O(d³) for inverting the Hessian.
◮ Instead, approximate p(θ_{t−1} | x_{t−1}, r_{t−1}) as Gaussian.
  ◮ Now Bayes’ rule reduces the update to a one-dimensional problem: O(d²) (a sketch of such an update follows this list).
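
Below is a minimal sketch of one such O(d²) update for a Poisson GLM with exponential link: a single Newton step from the prior mean, with the Sherman–Morrison identity exploiting the rank-one change in the Hessian. It illustrates the idea rather than reproducing the paper's exact algorithm.

import numpy as np

def laplace_update(mu, C, x, r):
    # Gaussian prior N(mu, C) on theta; likelihood r ~ Poisson(exp(x . theta)).
    # The likelihood depends on theta only through x . theta, so the Hessian of the
    # log posterior changes by a rank-one term and the whole step costs O(d^2).
    lam = np.exp(x @ mu)                                            # predicted rate at the prior mean
    Cx = C @ x                                                      # O(d^2)
    C_new = C - np.outer(Cx, Cx) * (lam / (1.0 + lam * (x @ Cx)))   # Sherman-Morrison
    mu_new = mu + C_new @ ((r - lam) * x)                           # one Newton step toward the mode
    return mu_new, C_new

# Toy usage: broad prior, then fold in one (stimulus, response) pair.
d = 4
mu, C = np.zeros(d), np.eye(d)
mu, C = laplace_update(mu, C, x=np.array([1.0, 0.5, -0.3, 0.0]), r=2)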
SLIDES 26–31

Deriving the optimal stimulus

◮ Main idea: maximize the conditional mutual information
  ◮ I(θ; r_{t+1} | x_{t+1}, x_t, r_t) = H(θ | x_t, r_t) − H(θ | x_{t+1}, r_{t+1}).
◮ This turns out to be equivalent to minimizing the conditional entropy H(θ | x_{t+1}, r_{t+1}).
◮ We end up with an equation for the covariance in terms of the Fisher information, J_obs.
◮ We can solve for the optimal stimulus using the Lagrange method for constrained optimization.
◮ Thus we have a system of equations in the Lagrange multiplier, and we can simply line-search over it (see the sketch after this list).
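
A simplified sketch of the objective being optimized: it scores candidate stimuli on the power sphere by the Gaussian-approximation information gain 0.5 * log(1 + λ xᵀC x), taking J_obs = λ x xᵀ as for a Poisson model with exponential link. The paper instead solves the constrained problem exactly with an eigendecomposition of C_t and the Lagrange-multiplier line search; this brute-force candidate search only illustrates the quantity being maximized.

import numpy as np

def expected_info_gain(x, mu, C):
    # Approximate expected entropy reduction for stimulus x under posterior N(mu, C),
    # with lambda = E[exp(theta . x)] = exp(mu . x + 0.5 * x^T C x).
    xCx = x @ C @ x
    lam = np.exp(mu @ x + 0.5 * xCx)
    return 0.5 * np.log1p(lam * xCx)

def pick_stimulus(mu, C, power=1.0, n_candidates=512, rng=None):
    # Score random candidates with ||x||^2 = power and keep the best one.
    rng = np.random.default_rng() if rng is None else rng
    cands = rng.normal(size=(n_candidates, mu.size))
    cands *= np.sqrt(power) / np.linalg.norm(cands, axis=1, keepdims=True)
    scores = [expected_info_gain(x, mu, C) for x in cands]
    return cands[int(np.argmax(scores))]

# Toy usage: x_next = pick_stimulus(np.zeros(8), np.eye(8), power=1.0)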

SLIDES 32–38

Deriving the optimal stimulus

◮ Complexity?
  ◮ Rank-one matrix update and line search to compute µ_t and C_t: O(d²).
  ◮ Eigendecomposition of C_t: O(d³).
  ◮ Line search over the Lagrange multiplier to compute the optimal stimulus: O(d²).
◮ O(d³) for the eigendecomposition isn’t great...
◮ ...but because of our Gaussian approximation of θ, we can obtain C_t from C_{t−1} with a rank-one modification...
◮ ...and there are eigendecomposition algorithms that can take advantage of this (a small numeric check of the rank-one structure follows).
◮ This gives an average-case runtime of O(d²) on the data considered, though the worst-case complexity is still O(d³).
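
A small numeric check of that structural claim (illustrative only): with the Sherman–Morrison form used in the posterior-update sketch above, C_t differs from C_{t−1} by a single outer product, which is exactly the structure that incremental eigendecomposition routines exploit.

import numpy as np

rng = np.random.default_rng(2)
d = 6
A = rng.normal(size=(d, d))
C_prev = A @ A.T + np.eye(d)          # some symmetric positive-definite covariance
x = rng.normal(size=d)                # chosen stimulus
lam = 1.3                             # predicted rate for that stimulus

Cx = C_prev @ x
C_new = C_prev - np.outer(Cx, Cx) * (lam / (1.0 + lam * (x @ Cx)))

print(np.linalg.matrix_rank(C_new - C_prev))  # -> 1: the change is rank one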

SLIDES 39–44

What if θ is dynamic?

◮ Spike-history terms
  ◮ These add a linear term to the quadratic minimization problem for maximizing entropy.
◮ Systematic trends in θ
  ◮ Model them as a random walk, θ_{t+1} = θ_t + ω_t, with ω_t ~ N(0, Q) for known Q, so the prior covariance becomes C_t + Q (see the sketch below).
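
A minimal sketch of the drift handling, assuming the random-walk model above: between trials the belief's mean is carried forward unchanged and its covariance is inflated by Q before the next posterior update.

import numpy as np

def diffuse_prior(mu, C, Q):
    # Random-walk drift theta_{t+1} = theta_t + omega_t, omega_t ~ N(0, Q):
    # the Gaussian belief keeps its mean and its covariance grows by Q.
    return mu.copy(), C + Q

# Toy usage: a slow isotropic drift of known magnitude.
d = 4
mu, C = np.zeros(d), np.eye(d)
mu, C = diffuse_prior(mu, C, Q=1e-3 * np.eye(d))  # then run the usual update for the new trial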

SLIDES 45–47

Results

◮ Simple, memoryless visual cell
  ◮ 25×33 bitmaps.
  ◮ Results were on average much better, and never worse, than random sampling.
◮ Memoryful neuron (simple sine wave)
  ◮ Outperformed random sampling for estimating both spike-history and stimulus coefficients.
◮ Non-systematic time drift
  ◮ Analogous to eye fatigue/exhaustion.
  ◮ Outperformed random sampling for estimating both spike-history and stimulus coefficients.

SLIDES 48–50

Conclusion

◮ Approximations based on GLMs allow a dramatically faster algorithm.
◮ At worst O(d³); on average O(d²).
◮ Fast enough to run in real time even for high-dimensional problems.