Neural Encoding Models - PowerPoint PPT Presentation

SLIDE 1

Neural Encoding Models

Maneesh Sahani

maneesh@gatsby.ucl.ac.uk

Gatsby Computational Neuroscience Unit, University College London. Term 1, Autumn 2011

SLIDE 2

Studying sensory systems

[Diagram: stimulus x(t) → sensory system → response y(t).]

Decoding (reconstruction): x̂(t) = G[y(t)]

Encoding (systems identification): ŷ(t) = F[x(t)]

SLIDE 3

General approach

Goal: Estimate p(spike|x, H) [or λ(t|x[0, t), H(t))] from data.

  • Naive approach: measure p(spike|x, H) directly for every setting of x.

    – too hard: too little data and too many potential inputs.

  • Estimate some functional f(p) instead (e.g. mutual information)
  • Select stimuli efficiently
  • Fit models with smaller numbers of parameters

SLIDE 4

Spikes, or rate?

Most neurons communicate using action potentials — statistically described by a point process:

    P[spike ∈ [t, t + dt)] = λ(t | H(t), stimulus, network activity) dt

To fully model the response we need to identify λ. In general this depends on spike history H(t) and network activity. Three options:

  • Ignore the history dependence, take network activity as a source of “noise” (i.e. assume firing is an inhomogeneous Poisson or Cox process, conditioned on the stimulus).

  • Average multiple trials to estimate

        λ(t, stimulus) = lim_{N→∞} (1/N) Σ_n λ(t | H_n(t), stimulus, network_n),

    the mean intensity (or PSTH), and try to fit this (a simulation sketch follows this list).

  • Attempt to capture history and network effects in simple models.
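
To make the trial-averaging option concrete, here is a minimal NumPy sketch (the rate profile, bin size, and trial count are all invented for illustration) that simulates an inhomogeneous Poisson neuron and estimates λ(t, stimulus) from the trial average:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented rate profile: a 1 s trial in 1 ms bins, rate in spikes/s.
dt = 0.001
t = np.arange(0, 1, dt)
lam = 20 + 15 * np.sin(2 * np.pi * 3 * t)

# Simulate N trials: P[spike in [t, t+dt)] ≈ λ(t)·dt for small dt.
N = 200
spikes = rng.random((N, t.size)) < lam * dt

# The trial average (PSTH) estimates the mean intensity λ(t, stimulus).
psth = spikes.mean(axis=0) / dt
```
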
SLIDE 5

Spike-triggered average

Decoding: the mean of P(x | y = 1). Encoding: a predictive filter.

SLIDE 6

Linear regression

    y(t) = ∫₀ᵀ x(t − τ) w(τ) dτ        W(ω) = X*(ω) Y(ω) / |X(ω)|²

In discrete time, stack lagged copies of the stimulus into a design matrix and the filter taps into a vector:

    [ x₁ x₂ x₃ … x_T   ]   [ w_T ]   [ y_T   ]
    [ x₂ x₃ x₄ … x_T+1 ] × [  ⋮  ] = [ y_T+1 ]
    [        ⋮         ]   [ w₁  ]   [   ⋮   ]

i.e. XW = Y, with least-squares solution

    W = (XᵀX)⁻¹ (XᵀY)

where XᵀX is the stimulus autocovariance Σ_SS and XᵀY is (proportional to) the STA.
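
A minimal NumPy sketch of this estimator (the function name and the small ridge term are my additions, not part of the slides):

```python
import numpy as np

def whitened_sta(x, y, n_lags, ridge=1e-3):
    """Least-squares filter W = (XᵀX + ridge·I)⁻¹ (XᵀY).

    x: 1-D stimulus; y: spike counts per bin (same length);
    n_lags: filter length. The ridge term just stabilises the inversion.
    """
    # Design matrix of lagged stimuli: row t = [x[t], x[t-1], ..., x[t-n_lags+1]]
    X = np.stack([np.roll(x, k) for k in range(n_lags)], axis=1)[n_lags - 1:]
    Y = y[n_lags - 1:]
    sta = X.T @ Y                      # (unnormalised) spike-triggered average
    cov = X.T @ X                      # stimulus autocovariance Σ_SS
    return np.linalg.solve(cov + ridge * np.eye(n_lags), sta)
```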

SLIDE 7

Linear models

So the (whitened) spike-triggered average gives the minimum-squared-error linear model. Issues:

  • overfitting and regularisation

– standard methods for regression

  • negative predicted rates

– can model deviations from background

  • real neurons aren’t linear

    – models are still used extensively
    – interpretable suggestions of underlying sensitivity
    – may provide unbiased estimates of cascade filters (see later)

SLIDE 8

How good are linear predictions?

We would like an absolute measure of model performance. Measured responses can never be predicted perfectly:

  • The measurements themselves are noisy.

Models may fail to predict because:

  • They are the wrong model.
  • Their parameters are mis-estimated due to noise.

SLIDE 9

Estimating predictable power

Write each trial’s response as signal plus noise:

    r(n) = signal + noise

    ⟨P(r(n))⟩ = P_signal + P_noise         (mean single-trial power)
    P(r̄)     = P_signal + (1/N) P_noise    (power of the N-trial average r̄)

    ⇒   P̂_signal = (1/(N − 1)) · (N · P(r̄) − ⟨P(r(n))⟩)
        P̂_noise  = ⟨P(r(n))⟩ − P̂_signal
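
A minimal sketch of these estimators, assuming “power” means the variance of the response about its mean (one common convention; the function name is mine):

```python
import numpy as np

def signal_noise_power(R):
    """Signal/noise power estimates from a trial matrix R of shape (N trials, T bins)."""
    N = R.shape[0]
    mean_trial_power = np.mean([np.var(r) for r in R])   # ⟨P(r(n))⟩
    avg_response_power = np.var(R.mean(axis=0))          # P(r̄)
    p_signal = (N * avg_response_power - mean_trial_power) / (N - 1)
    p_noise = mean_trial_power - p_signal
    return p_signal, p_noise
```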

SLIDE 10

Signal power in A1 responses

[Figure: signal power vs. noise power (spikes²/bin) across recordings.]

SLIDE 11

Testing a model

For a perfect prediction:

    P(trial) − P(residual) = P(signal)

Thus, we can judge the performance of a model by the normalized predictive power

    (P(trial) − P(residual)) / P(signal)

Similar to the coefficient of determination (r²), but the denominator is the predictable variance.
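
A matching sketch of the normalized predictive power, under the same variance-as-power convention (names are mine):

```python
import numpy as np

def normalized_predictive_power(R, prediction):
    """(P(trial) − P(residual)) / P(signal) for trials R (N, T) and a model prediction (T,)."""
    N = R.shape[0]
    p_trial = np.mean([np.var(r) for r in R])               # average single-trial power
    p_resid = np.mean([np.var(r - prediction) for r in R])  # power left unexplained
    p_signal = (N * np.var(R.mean(axis=0)) - p_trial) / (N - 1)
    return (p_trial - p_resid) / p_signal
```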

SLIDE 12

Predictive performance

[Figure: normalised STA predictive power vs. normalised Bayes predictive power, for training error and cross-validation error.]

SLIDE 13

Extrapolating the model performance

SLIDE 14

Jackknifed estimates

[Figure: normalized linearly predictive power vs. normalized noise power, jackknifed estimates.]

SLIDE 15

Extrapolated linearity

[Figure: normalized linearly predictive power vs. normalized noise power, extrapolated to zero noise.]

[extrapolated range: (0.19,0.39); mean Jackknife estimate: 0.29]

SLIDE 16

Simulated (almost) linear data

[Figure: normalized linearly predictive power vs. normalized noise power for simulated data, extrapolated to zero noise.]

[extrapolated range: (0.95,0.97); mean Jackknife estimate: 0.97]

SLIDE 17

Linear fits to non-linear functions

SLIDE 18

Linear fits to non-linear functions

(Stimulus dependence does not always signal response adaptation)

SLIDE 19

Approximations are stimulus dependent

(Stimulus dependence does not always signal response adaptation)

SLIDE 20

Consequences

Local fitting can have counterintuitive consequences for the interpretation of a “receptive field”.

SLIDE 21

“Independently distributed” stimuli

Knowing stimulus power at any set of points in analysis space provides no information about stimulus power at any other point.

[Figure: DRC and ripple stimuli in spectrotemporal space.]

Independence is a property of the stimulus and the analysis space.

Christianson, Sahani, and Linden (2008)

SLIDE 22

Nonlinearity & non-independence distort RF estimates

Stimulus may have higher-order correlations in other analysis spaces — interaction with nonlinearities can produce misleading “receptive fields.”

Christianson, Sahani, and Linden (2008)

SLIDE 23

What about natural sounds?

[Figure: multiplicative RFs and finch song spectrograms; freq. (kHz) vs. time (ms).]

Usually not independent in any space — so STRFs may not be conservative estimates of receptive fields.

Christianson, Sahani, and Linden (2008)

SLIDE 24

Beyond linearity

SLIDE 25

Beyond linearity

Linear models often fail to predict well. Alternatives?

  • Wiener/Volterra functional expansions

    – M-series
    – Linearised estimation
    – Kernel formulations

  • LN (Wiener) cascades

    – Spike-triggered covariance (STC) methods
    – “Maximally informative” dimensions (MID) ⇔ ML nonparametric LNP models
    – ML parametric GLM models

  • NL (Hammerstein) cascades

    – Multilinear formulations

SLIDE 26

Non-linear models

The LNP (Wiener) cascade

[Diagram: stimulus → linear filter k → static nonlinearity → Poisson spiking.]

Rectification addresses negative firing rates. Possible biophysical justification.

SLIDE 27

LNP estimation – the Spike-triggered ensemble

SLIDE 28

Single linear filter

The STA estimates the filter; the non-linearity is then fit to the projected data. The STA is unbiased for spherically (elliptically) symmetric inputs (cf. Bussgang’s theorem); non-spherical inputs introduce biases.

SLIDE 29

Multiple filters

Distribution changes along relevant directions (and, usually, along all linear combinations of relevant directions). Proxies for distribution:

  • mean: STA (can only reveal a single direction)
  • variance: STC
  • binned (or kernel) KL: MID, “maximally informative dimensions” (equivalent to ML in an LNP model with a binned nonlinearity)

SLIDE 30

STC

Project out the STA:

    X̃ = X − (X k_sta) k_staᵀ

Compare stimulus covariances with and without spikes:

    C_prior = X̃ᵀX̃ / N ;        C_spike = X̃ᵀ diag(Y) X̃ / N_spike

Choose directions with the greatest change in variance:

    argmax_{‖v‖=1} vᵀ(C_prior − C_spike) v

⇒ find eigenvectors of (C_prior − C_spike) with large (absolute) eigenvalues.
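
A minimal sketch of the STC computation as just described (function and variable names are mine; X holds one stimulus vector per row, y the spike count per row):

```python
import numpy as np

def stc_directions(X, y, n_dirs=2):
    """Eigenvectors of C_prior − C_spike with the largest |eigenvalues|."""
    sta = X.T @ y / y.sum()
    k = sta / np.linalg.norm(sta)
    Xp = X - np.outer(X @ k, k)                    # project out the STA direction
    c_prior = Xp.T @ Xp / X.shape[0]
    c_spike = (Xp * y[:, None]).T @ Xp / y.sum()   # X̃ᵀ diag(Y) X̃ / N_spike
    evals, evecs = np.linalg.eigh(c_prior - c_spike)
    order = np.argsort(-np.abs(evals))             # largest |eigenvalue| first
    return evecs[:, order[:n_dirs]], evals[order[:n_dirs]]
```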

SLIDE 31

STC

Reconstruct nonlinearity (may assume separability)

SLIDE 32

Biases

STC (obviously) requires that the nonlinearity alter the variance. If so, the subspace estimate is unbiased if the stimulus distribution is

  • radially (elliptically) symmetric
  • AND independent

⇒ Gaussian.

May be possible to correct by transformation, subsampling or weighting (latter two at cost of variance).

SLIDE 33

More LNP methods

  • Non-parametric non-linearities: “Maximally informative dimensions” (MID) ⇔ “non-parametric” maximum likelihood.

    – Intuitively, extends the variance-difference idea to arbitrary differences between marginal and spike-conditioned stimulus distributions:

          k_MID = argmax_k KL[ P(k · x) ‖ P(k · x | spike) ]

      (a binned sketch follows this list)
    – Measuring KL requires binning or smoothing — turns out to be equivalent to fitting a non-parametric nonlinearity by binning or smoothing.
    – Difficult to use for high-dimensional LNP models.

  • Parametric non-linearities: the “generalised linear model” (GLM).
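
The binned sketch promised above: a direct histogram estimate of the MID objective for a candidate direction k (names, bin count, and the orientation of the KL divergence are my choices):

```python
import numpy as np

def mid_objective(k, X, y, n_bins=25):
    """Binned KL between the marginal and spike-conditioned projections."""
    proj = X @ k
    edges = np.histogram_bin_edges(proj, bins=n_bins)
    p_all, _ = np.histogram(proj, bins=edges)
    p_spk, _ = np.histogram(proj[y > 0], bins=edges)
    p_all = p_all / p_all.sum()
    p_spk = p_spk / p_spk.sum()
    keep = (p_all > 0) & (p_spk > 0)   # skip empty bins to avoid log(0)
    return np.sum(p_spk[keep] * np.log(p_spk[keep] / p_all[keep]))
```
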
SLIDE 34

Generalised linear models

LN models with specified nonlinearities and exponential family noise. In general (for monotonic g):

y ∼ ExpFamily[µ(x)]; g(µ) = βx

For our purposes easier to write

y ∼ ExpFamily[f(βx)]

A continuous-time point-process likelihood with GLM-like dependence of λ on covariates is approached, in the limit of bin width → 0, by either a Poisson or a Bernoulli GLM.

Mark Berman and T. Rolf Turner (1992). Approximating point process likelihoods with GLIM. Journal of the Royal Statistical Society, Series C (Applied Statistics), 41(1):31–38.

SLIDE 35

Generalised linear models

Poisson distribution ⇒ f = exp(·) is canonical (natural parameter = βx). Canonical link functions give concave likelihoods ⇒ unique maxima. Generalises (for Poisson) to any f which is convex and log-concave:

    log-likelihood = c − f(βx) + y log f(βx)

(A fitting sketch follows the list below.) Includes:

  • threshold-linear
  • threshold-polynomial
  • “soft-threshold” f(z) = α⁻¹ log(1 + exp(αz)).
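
As promised above, a minimal fitting sketch for the canonical case f = exp(·), where the negative log-likelihood Σ exp(Xβ) − yᵀXβ is convex (function names are mine):

```python
import numpy as np
from scipy.optimize import minimize

def fit_poisson_glm(X, y):
    """ML fit of a Poisson GLM with rate = exp(X @ beta)."""
    def nll(beta):
        eta = X @ beta
        return np.sum(np.exp(eta)) - y @ eta   # −log-likelihood up to a constant

    def grad(beta):
        return X.T @ (np.exp(X @ beta) - y)

    beta0 = np.zeros(X.shape[1])
    return minimize(nll, beta0, jac=grad, method="L-BFGS-B").x
```
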
SLIDE 36

Generalised linear models

ML parameters found by

  • gradient ascent
  • IRLS

Regularisation by L2 (quadratic) or L1 (absolute value – sparse) penalties (MAP with Gaussian/Laplacian priors) preserves concavity.

SLIDE 37

Linear-Nonlinear-Poisson (GLM)

[Diagram: stimulus → stimulus filter k → point nonlinearity → Poisson spiking, rate λ(t).]

SLIDE 38

GLM with history-dependence

  • rate is a product of stim- and spike-history dependent terms
  • output no longer a Poisson process
  • also known as “soft-threshold” Integrate-and-Fire model

[Diagram: stimulus → stimulus filter k; spike train → post-spike filter h; the two are summed and passed through an exponential nonlinearity to give the conditional intensity λ(t) (spike rate), which drives Poisson spiking. (Truccolo et al. 2004)]
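
A toy simulation of this architecture (filter shapes, baseline, and constants are invented; the per-bin Bernoulli draw approximates the Poisson process for small dt):

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 0.001
k = np.exp(-np.arange(30) / 10.0)          # invented 30 ms stimulus filter
h = -5.0 * np.exp(-np.arange(20) / 5.0)    # invented post-spike (refractory) filter

T = 2000
x = rng.standard_normal(T)
drive = np.convolve(x, k)[:T]              # filtered stimulus
spikes = np.zeros(T)

for t in range(T):
    # conditional intensity: exp(stimulus drive + spike-history drive + baseline)
    hist = spikes[max(0, t - h.size):t][::-1]      # most recent spike first
    lam = np.exp(drive[t] + hist @ h[:hist.size] - 3.0)
    spikes[t] = rng.random() < lam * dt
```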

SLIDE 39

GLM with history-dependence

[Diagram: a traditional integrate-and-fire neuron converts filter output to spikes via a “hard threshold”; the “soft-threshold” IF maps filter output to a spike rate instead. Architecture as above: stimulus filter k + post-spike filter h → exponential nonlinearity → Poisson spiking.]

  • “soft-threshold” approximation to the Integrate-and-Fire model

SLIDE 40

GLM dynamic behaviors

[Figure: stimulus x(t), post-spike waveform, and stimulus-induced vs. spike-history-induced drive; regular spiking.]

SLIDE 41

GLM dynamic behaviors

[Figure: stimulus x(t), post-spike waveform, stim-filter output, and spike-history filter output; regular vs. irregular spiking.]

SLIDE 42

GLM dynamic behaviors

[Figure: stimulus x(t) and post-spike waveforms producing bursting and adaptation.]

SLIDE 43

Generalized Linear Model (GLM)

[Diagram: stimulus → stimulus filter; + post-spike filter → exponential nonlinearity → probabilistic spiking.]

SLIDE 44

multi-neuron GLM

[Diagram: neurons 1 and 2, each with its own stimulus filter and post-spike filter → exponential nonlinearity → probabilistic spiking.]

SLIDE 45

multi-neuron GLM

[Diagram: as above, with coupling filters linking neuron 1 and neuron 2.]

SLIDE 46

GLM equivalent diagram:

[Diagram: at each time t, the conditional intensity (spike rate) depends on the stimulus history and past spikes.]

SLIDE 47

Multilinear models

Input nonlinearities (Hammerstein cascades) can be identified in a multilinear (cartesian tensor) framework.

SLIDE 48

Input nonlinearities

The basic linear model (for sounds):

    r̂(i) = Σ_jk w^tf_jk · s(i − j, k)

where r̂ is the predicted rate, w^tf are the STRF weights, and s is the stimulus power.

How to measure s? (pressure, intensity, dB, thresholded, . . . ) We can learn an optimal representation g(·):

    r̂(i) = Σ_jk w^tf_jk · g(s(i − j, k)).

Define basis functions {g_l} such that g(s) = Σ_l w^l_l g_l(s), and a stimulus array M_ijkl = g_l(s(i − j, k)). Now the model is

    r̂(i) = Σ_jkl w^tf_jk w^l_l M_ijkl

or, compactly, r̂ = (w^tf ⊗ w^l) • M.
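
A sketch of how the stimulus array M and the prediction can be formed (dimensions, the triangular level basis, and random weights are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
J, K, L, T = 10, 8, 5, 500              # lags, channels, level basis fns, time bins
s = rng.uniform(0, 70, size=(T, K))     # invented sound-level stimulus

# Level basis: overlapping triangular bumps g_l(s) tiling the level axis.
centers = np.linspace(0, 70, L)
g = lambda v: np.clip(1 - np.abs(v[..., None] - centers) / 20, 0, None)

# Stimulus array M[i, j, k, l] = g_l(s(i − j, k)).
M = np.stack([g(np.roll(s, j, axis=0)) for j in range(J)], axis=1)

# Prediction r̂ = (w^tf ⊗ w^l) • M as a tensor contraction.
w_tf = rng.standard_normal((J, K))
w_l = rng.standard_normal(L)
r_hat = np.einsum('ijkl,jk,l->i', M, w_tf, w_l)
```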

SLIDE 49

Multilinear models

Multilinear forms are straightforward to optimise by alternating least squares. Cost function:

    E = ‖ r − (w^tf ⊗ w^l) • M ‖²

Minimise iteratively: defining matrices B = w^l • M and A = w^tf • M, update w^tf = (BᵀB)⁻¹Bᵀr and w^l = (AᵀA)⁻¹Aᵀr. Each linear regression step can be regularised by evidence optimisation (suboptimal), with uncertainty propagated approximately using variational methods.
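
A minimal alternating-least-squares loop for this cost (names are mine; np.linalg.lstsq stands in for the explicit normal equations (BᵀB)⁻¹Bᵀr, which it solves more stably):

```python
import numpy as np

def als_fit(M, r, n_iter=50):
    """Alternate regressions for r ≈ (w^tf ⊗ w^l) • M, with M shaped (T, J, K, L)."""
    T, J, K, L = M.shape
    w_l = np.ones(L)                                 # arbitrary nonzero start
    for _ in range(n_iter):
        # Hold w_l fixed: B[i,(j,k)] = Σ_l M[i,j,k,l]·w_l[l]; regress r on B.
        B = np.einsum('ijkl,l->ijk', M, w_l).reshape(T, J * K)
        w_tf = np.linalg.lstsq(B, r, rcond=None)[0].reshape(J, K)
        # Hold w_tf fixed: A[i,l] = Σ_jk M[i,j,k,l]·w_tf[j,k]; regress r on A.
        A = np.einsum('ijkl,jk->il', M, w_tf)
        w_l = np.linalg.lstsq(A, r, rcond=None)[0]
    return w_tf, w_l
```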

SLIDE 50

Some input non-linearities

[Figure: input-nonlinearity weights w^l vs. sound level l (dB SPL).]

SLIDE 51

Parameter grouping

Separable STRF: (time) ⊗ (frequency). The input nonlinearity model is separable in another sense: (time, frequency) ⊗ (sound level).

[Figure: weights over (time, frequency) and over intensity.]

Other separations:

  • (time, sound level) ⊗ (frequency):  r̂ = (w^tl ⊗ w^f) • M,
  • (frequency, sound level) ⊗ (time):  r̂ = (w^fl ⊗ w^t) • M,
  • (time) ⊗ (frequency) ⊗ (sound level):  r̂ = (w^t ⊗ w^f ⊗ w^l) • M.

SLIDE 52

(time, frequency) ⊗ (sound level)

[Figure: (time, frequency) kernel, t (ms) × f (kHz), with sound-level weights w^l over l (dB SPL).]

SLIDE 53

(time, sound level) ⊗ (frequency)

[Figure: (time, sound level) kernel, t (ms) × l (dB SPL), with frequency weights w^f over f (kHz).]

SLIDE 54

(frequency, sound level) ⊗ (time)

[Figure: (frequency, sound level) kernel, f (kHz) × l (dB SPL), with time weights w^t over t (ms).]

SLIDE 55

Contextual influences

[Figure: stimulus passed through a context kernel w^τφ and a principal kernel w^tf to predict the PSTH over time and frequency.]

    rate(i) = c + Σ_jkl w^t_j w^f_k w^l_l g_l(sound_{i−j,k}) · ( c₂ + Σ_mnp w^τ_m w^φ_n w^λ_p g_p(sound_{i−j−m,k+n}) )

SLIDE 56

Contextual influences

Introduce extra dimensions:

  • τ: time difference between contextual and primary tone,
  • φ: frequency difference between contextual and primary tone,
  • λ: sound level of the contextual tone.

Model the effective sound level of a primary tone by

    Level(i, j) → Level(i, j) · (const + Context(i, j)),

and the context by

    Context(i, j) = Σ_{m,n,p} w^τ_m w^φ_n w^λ_p h_p(s(i − m, j + n)).

This leads to a multilinear model

    r̂ = (w^t ⊗ w^f ⊗ w^l ⊗ w^τ ⊗ w^φ ⊗ w^λ) • M.

SLIDE 57

Inseparable contexts

We can also allow inseparable contexts (and principal fields), dropping the level-nonlinearity to reduce parameters.

    r̂(i) = c + Σ_jk w^tf_jk sound_{i−j,k} · ( 1 + Σ_mn w^τφ_mn sound_{i−j−m,k+n} )

[Figure: stimulus passed through the context kernel w^τφ and principal kernel w^tf to predict the PSTH over time and frequency.]

SLIDE 58

Performance

[Figure: extrapolated normalised predictive power (training and test sets) for STRF, separable-context, and inseparable-context models.]