Neural Encoding Models – Maneesh Sahani, Gatsby Computational Neuroscience Unit (PowerPoint PPT Presentation)


SLIDE 1

Neural Encoding Models

Maneesh Sahani Gatsby Computational Neuroscience Unit University College London November 2014

SLIDES 2–10

Neural Coding

The brain appears to be modular. Different structures and cortical areas compute, represent and transmit separate pieces of information. The coding questions:

◮ What information is represented by a particular neural population?
  ◮ easy (?) if we know the code
  ◮ more generally, can search for selectivity / invariance
  ◮ encoded quantities might not be obvious: inferred latent variables, uncertainty . . .

◮ How is that information encoded?
  ◮ firing rate, spike timing (relative to other spikes, population oscillations, onset of a time-invariant stimulus)?
  ◮ functional mapping of encoded variable to spikes?
  ◮ easy (?) if we know what is encoded

A complete answer will require convergence of theory and empirical results. Computation plays a vital part in systematising empirical data.

SLIDES 11–12

Stimulus coding

s(t) → r(t)

Decoding: ŝ(t) = G[r(t)]  (reconstruction)
Encoding: r̂(t) = F[s(t)]  (systems identification)

SLIDES 13–16

Why?

The stimulus coding problem has sometimes been identified with the “neural coding” problem. However, on the face of it, mapping either the decoding or the encoding function does not by itself answer either of our basic questions about coding. So why do we do it?

◮ To encapsulate and systematise the response, so that we can ask the questions that we want answered.
◮ To design hypothesis-driven stimulus-coding models: evaluate coding reliability for different function(al)s of s(t) and for different definitions of r(t).
◮ But correlation ⇏ causation: in this case, the presence of information about an aspect of the stimulus in a particular aspect of the response does not mean that the brain uses that information.

SLIDES 17–22

General approach

Goal: estimate p(spike | s, H) [or λ(t | s[0, t), H(t))] from data.

◮ Naive approach: measure p(spike, H | s) directly for every setting of s.
  ◮ too hard: too little data and too many potential inputs.
◮ Estimate some functional F[p] instead (e.g. mutual information).
◮ Select stimuli efficiently.
◮ Fit models with smaller numbers of parameters.

SLIDES 23–26

Spikes, or rate?

Most neurons communicate using action potentials — statistically described by a point process:

  P(spike ∈ [t, t + dt)) = λ(t | H(t), stimulus, network activity) dt

To fully model the response we need to identify λ. In general this depends on the spike history H(t) and on network activity. Three options:

◮ Ignore the history dependence, take network activity as a source of “noise” (i.e. assume firing is an inhomogeneous Poisson or Cox process, conditioned on the stimulus).
◮ Average multiple trials to estimate the mean intensity (or PSTH)

  λ̄(t, stimulus) = lim_{N→∞} (1/N) Σ_n λ(t | H_n(t), stimulus, network_n) ,

  and try to fit this.
◮ Attempt to capture history and network effects in simple models.
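The first two options above can be sketched numerically. This is a minimal simulation, assuming an arbitrary sinusoidal intensity λ(t): each trial is an inhomogeneous Poisson spike train, and trial averaging recovers λ(t) as a PSTH.

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 0.001                                   # 1 ms bins
t = np.arange(0.0, 2.0, dt)                  # 2 s of stimulus time
lam = 20.0 * (1.0 + np.sin(2 * np.pi * 1.5 * t))   # assumed intensity λ(t), in Hz

# Simulate N independent trials: in each small bin, P(spike) ≈ λ(t) dt
N = 500
spikes = rng.random((N, t.size)) < lam * dt

# PSTH: average spike count per bin across trials, converted back to a rate
psth = spikes.mean(axis=0) / dt
```

With enough trials the PSTH tracks λ(t) closely; with few trials it is dominated by the Poisson "noise" that the first option lumps together with network effects.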

SLIDES 27–28

Spike-triggered average

Decoding view: the mean of P(s | r = 1).
Encoding view: a predictive filter.

SLIDES 29–33

Linear regression

  r(t) = ∫₀ᵀ s(t − τ) w(τ) dτ

In discrete time, stack time-lagged copies of the stimulus into a convolution (design) matrix:

  [ s_T      s_{T−1}  …  s_1 ]   [ w_1 ]   [ r_T     ]
  [ s_{T+1}  s_T      …  s_2 ] × [ w_2 ] = [ r_{T+1} ]
  [   ⋮                      ]   [  ⋮  ]   [   ⋮     ]

i.e. SW = R. The least-squares solution is

  W = (SᵀS)⁻¹ (SᵀR) ,

where SᵀS is the stimulus autocovariance Σ_SS and SᵀR is the spike-triggered average (STA). In the frequency domain:

  W(ω) = S(ω)* R(ω) / |S(ω)|²
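A minimal numerical sketch of this solution, on simulated data with a hypothetical exponential filter: a temporally correlated stimulus makes the raw STA a biased filter estimate, while the whitened form W = (SᵀS)⁻¹(SᵀR) recovers it.

```python
import numpy as np

rng = np.random.default_rng(1)
T, lags = 20000, 8

# Temporally correlated (non-white) stimulus, so the raw STA is biased
white = rng.normal(size=T + 1)
s = white[1:] + 0.7 * white[:-1]

w_true = np.exp(-np.arange(lags) / 2.0)      # hypothetical true filter
# Design matrix: row t holds s_t, s_{t-1}, ..., s_{t-lags+1}
S = np.stack([s[t - lags + 1 : t + 1][::-1] for t in range(lags - 1, T)])
r = S @ w_true + rng.normal(scale=0.5, size=S.shape[0])

sta = S.T @ r / len(r)                        # raw spike-triggered average, SᵀR / n
w_ls = np.linalg.solve(S.T @ S, S.T @ r)      # whitened: (SᵀS)⁻¹ SᵀR
```

Here `sta` mixes in the stimulus autocorrelation (its first coefficient overshoots), while `w_ls` matches the filter used to generate the data.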

SLIDES 34–42

Linear models

So the (whitened) spike-triggered average gives the minimum-squared-error linear model. Issues:

◮ overfitting and regularisation
  ◮ standard methods for regression apply
◮ negative predicted rates
  ◮ can model deviations from background
◮ real neurons aren’t linear
  ◮ models are still used extensively
  ◮ interpretable suggestions of underlying sensitivity (but see later)
  ◮ may provide unbiased estimates of cascade filters (see later)

SLIDES 43–46

How good are linear predictions?

We would like an absolute measure of model performance. Two things make this difficult.

Measured responses can never be predicted perfectly, even in principle:

◮ The measurements themselves are noisy.

Even if we can discount this, a model may predict poorly because either:

◮ it is the wrong model, or
◮ the parameters are mis-estimated due to noise.

Approaches:

◮ Compare I(resp; pred) to I(resp; stim).
  ◮ mutual information estimators are biased
◮ Compare E(resp − pred) to E(resp − psth), where the psth is gathered over a very large number of trials.
  ◮ may require impractical amounts of data to estimate the psth
◮ Compare the predictive power to the predictable power (similar to ANOVA).

SLIDE 47

Estimating predictable power

Each trial’s response decomposes as r(n) = signal + noise. Writing ⟨·⟩ for the average over the N trials, P(·) for power, and r̄ for the trial-averaged response:

  ⟨P(r(n))⟩ = P_signal + P_noise
  P(r̄)     = P_signal + (1/N) P_noise

  ⇒  P̂_signal = (1/(N − 1)) [ N P(r̄) − ⟨P(r(n))⟩ ]
     P̂_noise  = ⟨P(r(n))⟩ − P̂_signal
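These two estimators can be sketched directly, taking power to be variance over time and assuming an arbitrary sinusoidal signal shared across trials:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 200, 15
signal = np.sin(np.linspace(0, 4 * np.pi, T))          # assumed trial-locked signal
trials = signal + rng.normal(scale=0.5, size=(N, T))   # N trials of signal + noise

def power(x):
    return np.var(x)                                    # power = variance over time

mean_trial_power = np.mean([power(r_n) for r_n in trials])   # ⟨P(r(n))⟩
avg_response_power = power(trials.mean(axis=0))              # P(r̄)

P_signal = (N * avg_response_power - mean_trial_power) / (N - 1)
P_noise = mean_trial_power - P_signal
```

The recovered `P_signal` matches the variance of the underlying signal, even though no single trial’s power does.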

SLIDES 48–50

Testing a model

For a perfect prediction,

  P(trial) − P(residual) = P(signal) .

Thus, we can judge the performance of a model by the normalised predictive power

  [ P(trial) − P(residual) ] / P(signal) .

This is similar to the coefficient of determination (r²), but the denominator is the predictable variance rather than the total variance.

SLIDE 51

Predictive performance

[Scatter plots: normalised STA and normalised Bayes predictive power, training error vs. cross-validation error.]

SLIDES 52–55

Extrapolating the model performance

[Figures: normalised linearly predictive power vs. normalised noise power, with extrapolation to zero noise.]

SLIDE 56

Jackknife bias correction

Estimate bias by extrapolation in data size:

  T_jn = N T − (N − 1) T̄_loo

where T is the training error on all the data and T̄_loo is the average training error over all N sets of N − 1 data points. For a linear model we can find this in closed form:

  T_jn = (1/N) Σ_i (r_i − s_i w_ML)² / (1 − s_i (SᵀS)⁻¹ s_iᵀ)
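The extrapolation formula itself is easy to check by brute force (no closed form needed): fit a linear model on simulated data, average the training error over all leave-one-out subsets, and combine. The sizes and noise level here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 60, 5
S = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
r = S @ w_true + 0.3 * rng.normal(size=n)

def training_error(S, r):
    w = np.linalg.lstsq(S, r, rcond=None)[0]
    return np.mean((r - S @ w) ** 2)

T_full = training_error(S, r)                                  # T: fit on all n points
T_loo = np.mean([training_error(np.delete(S, i, axis=0), np.delete(r, i))
                 for i in range(n)])                           # average over leave-one-out sets
T_jn = n * T_full - (n - 1) * T_loo                            # bias-corrected error estimate
```

The corrected `T_jn` counteracts the optimism of the raw training error, moving it toward the generalisation error.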

SLIDE 57

Jackknifed estimates

[Figure: normalised linearly predictive power vs. normalised noise power.]

SLIDE 58

Extrapolated linearity

[Figures: normalised linearly predictive power vs. normalised noise power; extrapolated range (0.19, 0.39); mean jackknife estimate 0.29.]

SLIDE 59

Simulated (almost) linear data

[Figures: normalised linearly predictive power vs. normalised noise power; extrapolated range (0.95, 0.97); mean jackknife estimate 0.97.]

SLIDES 60–61

Beyond linearity

Linear models often fail to predict well. Alternatives?

◮ Wiener/Volterra functional expansions
  ◮ M-series
  ◮ linearised estimation
  ◮ kernel formulations
◮ LN (Wiener) cascades
  ◮ spike-triggered covariance (STC) methods
  ◮ “maximally informative” dimensions (MID) ⇔ ML non-parametric LNP models
  ◮ ML parametric GLM models
◮ NL (Hammerstein) cascades
◮ Multilinear formulations

SLIDE 62

The Volterra functional expansion

A polynomial-like expansion for functionals (or operators). Let y(t) = F[x(t)]. Then:

  y(t) ≈ k⁽⁰⁾ + ∫dτ k⁽¹⁾(τ) x(t−τ)
        + ∫∫dτ₁ dτ₂ k⁽²⁾(τ₁, τ₂) x(t−τ₁) x(t−τ₂)
        + ∫∫∫dτ₁ dτ₂ dτ₃ k⁽³⁾(τ₁, τ₂, τ₃) x(t−τ₁) x(t−τ₂) x(t−τ₃) + …

or, in discretised time,

  y_t = K⁽⁰⁾ + Σ_i K⁽¹⁾_i x_{t−i} + Σ_{ij} K⁽²⁾_{ij} x_{t−i} x_{t−j} + Σ_{ijk} K⁽³⁾_{ijk} x_{t−i} x_{t−j} x_{t−k} + …

For a finite expansion, the kernels k⁽⁰⁾, k⁽¹⁾(·), k⁽²⁾(·, ·), k⁽³⁾(·, ·, ·), … are not straightforwardly related to the functional F. Indeed, the values of lower-order kernels change as the maximum order of the expansion is increased.

Estimation: the model is linear in the kernels, so they can be estimated just like a linear (first-order) model with an expanded “input”.

◮ Kernel trick: polynomial kernel K(x₁, x₂) = (1 + x₁ · x₂)ⁿ.
◮ M-series.
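The "expanded input" estimation strategy can be sketched for a second-order expansion: build a design matrix of lagged stimulus values and their pairwise products, then solve by ordinary least squares. The ground-truth system below is an arbitrary toy example.

```python
import numpy as np

rng = np.random.default_rng(2)
T, lags = 2000, 3
x = rng.normal(size=T)

# Toy second-order system: y_t = 1 + 0.5 x_t - 0.3 x_{t-1} + 0.2 x_t x_{t-1}
y = 1.0 + 0.5 * x
y[1:] += -0.3 * x[:-1] + 0.2 * x[1:] * x[:-1]

# Expanded design matrix: constant, lagged x, and lagged pairwise products
rows = []
for t in range(lags, T):
    window = x[t - lags + 1 : t + 1][::-1]            # x_t, x_{t-1}, x_{t-2}
    quad = np.outer(window, window)[np.triu_indices(lags)]
    rows.append(np.concatenate(([1.0], window, quad)))
X = np.array(rows)
coef, *_ = np.linalg.lstsq(X, y[lags:], rcond=None)
```

The fitted `coef` holds K⁽⁰⁾, then the first-order kernel, then the (upper-triangular) second-order kernel; for this noiseless toy system the recovery is exact.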

SLIDE 63

Wiener expansion

The Wiener expansion gives functionals of different orders that are orthogonal for white-noise input x(t) with power P:

  G₀[x(t); h⁽⁰⁾] = h⁽⁰⁾
  G₁[x(t); h⁽¹⁾] = ∫dτ h⁽¹⁾(τ) x(t−τ)
  G₂[x(t); h⁽²⁾] = ∫∫dτ₁ dτ₂ h⁽²⁾(τ₁, τ₂) x(t−τ₁) x(t−τ₂) − P ∫dτ₁ h⁽²⁾(τ₁, τ₁)
  G₃[x(t); h⁽³⁾] = ∫∫∫dτ₁ dτ₂ dτ₃ h⁽³⁾(τ₁, τ₂, τ₃) x(t−τ₁) x(t−τ₂) x(t−τ₃)
                   − 3P ∫∫dτ₁ dτ₂ h⁽³⁾(τ₁, τ₂, τ₂) x(t−τ₁)

It is easy to verify that E[G_i[x(t)] G_j[x(t)]] = 0 for i ≠ j. Thus, these kernels can be estimated independently. But they depend on the stimulus.

SLIDE 64

Cascade models

The LNP (Wiener) cascade: a linear filter k, a static nonlinearity n, then Poisson spiking.

◮ Rectification addresses negative firing rates.
◮ Loose biophysical correspondence.

SLIDE 65

LNP estimation – the Spike-triggered ensemble

slide-66
SLIDE 66

Single linear filter

k n

◮ STA is unbiased estimate of filter for spherical input distribution. (Bussgang’s theorem) ◮ Elliptically-distributed data can be whitened ⇒ linear regression weights are unbiased. ◮ Linear weights are not necessarily maximum-likelihood (or otherwise optimal), even for

spherical/elliptical stimulus distributions.

◮ Linear weights may be biased for general stimuli (binary/uniform or natural).

SLIDE 67

Multiple filters

The stimulus distribution changes along relevant directions (and, usually, along all linear combinations of relevant directions). Proxies to measure the change in distribution:

◮ mean: STA (can only reveal a single direction)
◮ variance: STC
◮ binned (or kernel) KL divergence: MID, “maximally informative dimensions” (equivalent to ML in an LNP model with a binned nonlinearity)

SLIDE 68

STC

Project out the STA:

  X̃ = X − (X k_sta) k_staᵀ

  C_prior = X̃ᵀ X̃ / N ;   C_spike = X̃ᵀ diag(Y) X̃ / N_spike

Choose the directions with the greatest change in variance:

  argmax_{‖v‖=1} vᵀ (C_prior − C_spike) v

⇒ find the eigenvectors of (C_prior − C_spike) with large (absolute) eigenvalues.
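A minimal sketch of this computation on simulated LNP data with a hypothetical quadratic nonlinearity (which changes variance but not mean, so the STA projection step is omitted here as the STA is ≈ 0):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 20000, 8
X = rng.normal(size=(N, D))                   # white Gaussian stimulus vectors

k = np.zeros(D)
k[2] = 1.0                                    # assumed single relevant direction
rate = 0.1 + (X @ k) ** 2                     # symmetric nonlinearity: alters variance only
Y = rng.poisson(rate)                         # spike counts

C_prior = X.T @ X / N
C_spike = (X * Y[:, None]).T @ X / Y.sum()    # spike-weighted covariance
evals, evecs = np.linalg.eigh(C_prior - C_spike)
v = evecs[:, np.argmax(np.abs(evals))]        # direction of greatest variance change
```

Along k the spike-conditioned variance is inflated by the quadratic nonlinearity, so the eigenvector of (C_prior − C_spike) with the largest absolute eigenvalue recovers the filter (up to sign).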

SLIDE 69

STC

Reconstruct nonlinearity (may assume separability)

SLIDE 70

Biases

STC (obviously) requires that the nonlinearity alter the variance. If it does, the subspace estimate is unbiased provided the stimulus distribution is

◮ radially (elliptically) symmetric
◮ AND independent

⇒ Gaussian.

It may be possible to correct for a non-Gaussian stimulus by transformation, subsampling or weighting (the latter two at a cost in variance).

SLIDE 71

More LNP methods

◮ Non-parametric nonlinearities: “maximally informative dimensions” (MID) ⇔ “non-parametric” maximum likelihood.
  ◮ Intuitively, extends the variance-difference idea to arbitrary differences between the marginal and spike-conditioned stimulus distributions:

    k_MID = argmax_k KL[ P(k · x) ‖ P(k · x | spike) ]

  ◮ Measuring the KL divergence requires binning or smoothing — turns out to be equivalent to fitting a non-parametric nonlinearity by binning or smoothing.
  ◮ Difficult to use for high-dimensional LNP models (but the ML viewpoint suggests separable or “cylindrical” basis functions).
◮ Parametric nonlinearities: the “generalised linear model” (GLM).

SLIDE 72

Generalised linear models

LN models with specified nonlinearities and exponential-family noise. In general (for monotonic g):

  y ∼ ExpFamily[µ(x)];  g(µ) = βx

For our purposes it is easier to write y ∼ ExpFamily[f(βx)], with f = g⁻¹. The (continuous-time) point-process likelihood with a GLM-like dependence of λ on the covariates is approached, in the limit of bin width → 0, by either the Poisson or the Bernoulli GLM.

Mark Berman and T. Rolf Turner (1992). Approximating point process likelihoods with GLIM. Journal of the Royal Statistical Society, Series C (Applied Statistics), 41(1):31–38.

SLIDE 73

Generalised linear models

For Poisson noise, f = exp(·) is the canonical choice (natural parameters = βx). Canonical link functions give concave likelihoods ⇒ unique maxima. This generalises (for Poisson) to any f which is convex and log-concave:

  log-likelihood = c − f(βx) + y log f(βx)

Examples include:

◮ threshold-linear: f(z) = [z]₊
◮ threshold-polynomial: e.g. f(z) = [z³]₊
◮ “soft-threshold”: f(z) = α⁻¹ log(1 + e^{αz})
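Because the likelihood is concave, plain gradient ascent suffices to find the unique maximum. A sketch for the canonical (exponential-nonlinearity) Poisson GLM, with an arbitrary made-up weight vector; the score of the Poisson log-likelihood is Xᵀ(y − exp(Xβ)).

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 5000, 4
X = rng.normal(size=(N, D))
beta_true = np.array([0.8, -0.5, 0.3, 0.0])   # hypothetical true weights
y = rng.poisson(np.exp(X @ beta_true))        # Poisson GLM with canonical exp link

# Gradient ascent on the concave log-likelihood: grad = X^T (y - exp(X beta)) / N
beta = np.zeros(D)
lr = 0.1
for _ in range(500):
    beta += lr * X.T @ (y - np.exp(X @ beta)) / N
```

With enough data the maximum-likelihood `beta` recovers the generating weights, including the zero coefficient.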

SLIDE 74

Generalised linear models

ML parameters are found by

◮ gradient ascent
◮ IRLS (iteratively reweighted least squares)

Regularisation by L2 (quadratic) or L1 (absolute value, sparsifying) penalties — i.e. MAP estimation with Gaussian/Laplacian priors — preserves concavity.

SLIDE 75

Linear-Nonlinear-Poisson (GLM)

[Diagram: stimulus → stimulus filter k → point nonlinearity → conditional intensity λ(t) → Poisson spiking.]

SLIDE 76

GLM with history dependence

[Diagram: stimulus → stimulus filter k; spikes → post-spike filter h; summed, then passed through an exponential nonlinearity to give the conditional intensity λ(t) (spike rate), driving Poisson spiking.] (Truccolo et al 04)

◮ The rate is a product of stimulus- and spike-history-dependent terms.
◮ The output is no longer a Poisson process.
◮ Also known as the “soft-threshold” integrate-and-fire model.
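A simulation sketch of this model, with made-up filters: a slow sinusoid stands in for the stimulus-filter output, and a negative exponential post-spike filter produces transient self-suppression (refractoriness), so the output is visibly non-Poisson.

```python
import numpy as np

rng = np.random.default_rng(5)
T, dt = 5000, 0.001                       # 5 s in 1 ms bins
stim_drive = 0.5 * np.sin(np.arange(T) * dt * 2 * np.pi * 2)  # assumed stimulus-filter output
h = -5.0 * np.exp(-np.arange(20) / 5.0)   # hypothetical post-spike filter (suppressive)
b = np.log(20.0)                          # baseline log-rate: 20 Hz

spikes = np.zeros(T)
hist = np.zeros(T + len(h))               # accumulated spike-history drive
for t in range(T):
    lam = np.exp(b + stim_drive[t] + hist[t])   # conditional intensity (Hz)
    spikes[t] = float(rng.random() < lam * dt)  # at most one spike per small bin
    if spikes[t]:
        hist[t + 1 : t + 1 + len(h)] += h       # inject refractory influence
```

Immediately after a spike the intensity is multiplied by e^h, so spikes in adjacent bins are strongly suppressed relative to a Poisson process at the same mean rate.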

SLIDE 77

GLM with history dependence

[Diagram as before, with plots of spike rate vs. filter output: a traditional IF neuron has a “hard threshold”, the GLM a “soft threshold”.]

◮ A “soft-threshold” approximation to the integrate-and-fire model.

SLIDES 78–80

GLM dynamic behaviors

[Figures: stimulus x(t), post-spike waveform, stimulus-filter and spike-history-filter outputs, and resulting spike trains, illustrating regular spiking, irregular spiking, bursting and adaptation.]

SLIDE 81

Generalized Linear Model (GLM)

[Diagram: stimulus → stimulus filter; spikes → post-spike filter; summed → exponential nonlinearity → probabilistic spiking.]

SLIDES 82–84

Multi-neuron GLM

[Diagrams: two neurons, each with a stimulus filter and post-spike filter, linked by coupling filters; each summed input passes through an exponential nonlinearity to drive probabilistic spiking. An equivalent diagram unrolls the conditional intensity (spike rate) over time t.]

SLIDE 85

Non-LN models?

The idea of responses depending on one or a few linear stimulus projections has been dominant, but cannot capture all nonlinearities.

◮ Contrast sensitivity might require normalisation by s.
◮ Linear weighting may depend on the units of stimulus measurement: amplitude? energy? logarithms? thresholds? (NL models: Hammerstein cascades)
◮ Neurons, particularly in the auditory system, are known to be sensitive to combinations of inputs: forward suppression; spectral patterns (Young); time-frequency interactions (Sadagopan and Wang).
◮ Experiments with realistic stimuli reveal nonlinear sensitivity to parts/whole (Bar-Yosef and Nelken).

Many of these questions can be tackled using a multilinear (Cartesian tensor) framework.

SLIDES 86–90

Input nonlinearities

The basic linear model (for sounds):

  r̂(i) = Σ_{jk} w^{tf}_{jk} s(i − j, k)

where r̂(i) is the predicted rate, w^{tf} are the STRF weights, and s is the stimulus power.

How should we measure s? (pressure, intensity, dB, thresholded, . . . ) We can learn an optimal representation g(·):

  r̂(i) = Σ_{jk} w^{tf}_{jk} g(s(i − j, k)) .

Define basis functions {g_l} such that g(s) = Σ_l w^{l}_l g_l(s), and a stimulus array M_{ijkl} = g_l(s(i − j, k)). Now the model is

  r̂(i) = Σ_{jkl} w^{tf}_{jk} w^{l}_l M_{ijkl} ,   i.e.   r̂ = (w^{tf} ⊗ w^{l}) • M .
SLIDE 91

Multilinear models

Multilinear forms are straightforward to optimise by alternating least squares. Cost function:

  E = ‖ r − (w^{tf} ⊗ w^{l}) • M ‖²

Minimise iteratively, defining the matrices B = w^{l} • M and A = w^{tf} • M and updating

  w^{tf} = (BᵀB)⁻¹ Bᵀ r   and   w^{l} = (AᵀA)⁻¹ Aᵀ r .

Each linear regression step can be regularised by evidence optimisation (suboptimal), with uncertainty propagated approximately using variational methods.
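The alternating-least-squares loop can be sketched for a generic bilinear model r̂ = (w ⊗ v) • M, with arbitrary dimensions and ground-truth weights; each step holds one weight vector fixed and solves an ordinary regression for the other.

```python
import numpy as np

rng = np.random.default_rng(6)
I, J, L = 400, 6, 5
M = rng.normal(size=(I, J, L))               # stimulus array M[i, j, l]
w_true = rng.normal(size=J)
v_true = rng.normal(size=L)
r = np.einsum('ijl,j,l->i', M, w_true, v_true) + 0.01 * rng.normal(size=I)

# Alternating least squares: fix one weight vector, regress the other
w = np.ones(J)
v = np.ones(L)
for _ in range(50):
    A = np.einsum('ijl,j->il', M, w)         # design matrix for v given w  (A = w • M)
    v = np.linalg.lstsq(A, r, rcond=None)[0]
    B = np.einsum('ijl,l->ij', M, v)         # design matrix for w given v  (B = v • M)
    w = np.linalg.lstsq(B, r, rcond=None)[0]

pred = np.einsum('ijl,j,l->i', M, w, v)
```

Note the overall scale is split arbitrarily between `w` and `v` (only their outer product is identified), but the prediction itself converges.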

SLIDE 92

Some input non-linearities

[Figure: learned input-nonlinearity weights w^{l} as a function of sound level l (25–70 dB SPL).]

SLIDE 93

Parameter grouping

Separable models: (time) ⊗ (frequency). The input-nonlinearity model is separable in another sense: (time, frequency) ⊗ (sound level).

[Diagram: the (intensity, time, frequency) weight array factored into a (time, frequency) map and a sound-level weighting.]

Other separations:

◮ (time, sound level) ⊗ (frequency):  r̂ = (w^{tl} ⊗ w^{f}) • M
◮ (frequency, sound level) ⊗ (time):  r̂ = (w^{fl} ⊗ w^{t}) • M
◮ (time) ⊗ (frequency) ⊗ (sound level):  r̂ = (w^{t} ⊗ w^{f} ⊗ w^{l}) • M
SLIDE 94

Some examples

[Figures: fitted weights for the (time, frequency) ⊗ (sound level), (time, sound level) ⊗ (frequency), and (frequency, sound level) ⊗ (time) factorisations, over t (60–180 ms), f (2–32 kHz) and l (25–70 dB SPL).]

SLIDES 95–97

Variable (combination-dependent) input gain

◮ Sensitivities to different points in sensory space are not independent.
◮ Rather, the sensitivity at one point depends on other elements of the stimulus that create a local sensory context.
◮ This context adjusts the input gain of the cell from moment to moment, dynamically refining the shape of the weighted receptive field.

SLIDES 98–100

A context-sensitive model

The model maps stimulus s(i, k) to rate r(i). A linear receptive field is modulated by a local contextual term:

  r̂(i) = c + Σ_{j=0}^{J} Σ_{k=1}^{K} w^{tf}_{j+1,k} s(i − j, k) [ 1 + Σ_{m=0}^{M} Σ_{n=−N}^{N} w^{τφ}_{m+1,n+N+1} s(i − j − m, k + n) ]
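A direct (loop-based) sketch of this prediction, with made-up weights, zero-indexed arrays in place of the slide's 1-based w^{tf}_{j+1,k} and w^{τφ}_{m+1,n+N+1}, and zero padding outside the stimulus boundaries:

```python
import numpy as np

def predict(s, c, w_tf, w_cgf):
    """Context-sensitive prediction: receptive-field weights modulated by a local gain term."""
    I, K = s.shape
    Jp1 = w_tf.shape[0]
    Mp1, width = w_cgf.shape
    N_f = (width - 1) // 2

    def sp(i, k):
        # zero-pad the stimulus outside its boundaries
        return s[i, k] if (0 <= i < I and 0 <= k < K) else 0.0

    r_hat = np.full(I, float(c))
    for i in range(I):
        for j in range(Jp1):
            for k in range(K):
                gain = 1.0
                for m in range(Mp1):
                    for n in range(-N_f, N_f + 1):
                        gain += w_cgf[m, n + N_f] * sp(i - j - m, k + n)
                r_hat[i] += w_tf[j, k] * sp(i - j, k) * gain
    return r_hat

rng = np.random.default_rng(7)
I, K = 30, 8
s = rng.normal(size=(I, K))
w_tf = 0.1 * rng.normal(size=(4, K))          # hypothetical receptive field (J = 3)
w_cgf = 0.05 * rng.normal(size=(3, 5))        # hypothetical contextual gain field (M = 2, N = 2)
r_hat = predict(s, 0.1, w_tf, w_cgf)
```

Setting `w_cgf` to zero makes every gain exactly 1, so the model collapses to the basic linear model of slides 86–90.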

SLIDE 101

Some examples

SLIDE 102

Predictive performance

[Scatter plot: STRF generalisation vs. CGF-model generalisation (both 0–0.8), for cortex and thalamus.]

SLIDE 103

Predictive performance

[Figures: predictive power vs. normalised noise power for the STRF and CGF models. Cortex: 0.79, 0.51, 0.37, 0.31; thalamus: 0.83, 0.68, 0.52, 0.48.]

SLIDE 104

Range of input gain

[Figure: input gain (0.25–2, from suppression to facilitation) vs. CGF generalisation advantage; median and IQR shown for cortex and thalamus.]

SLIDE 105

Input gain fluctuates rapidly

SLIDE 106

Mean CGFs

SLIDE 107

CGF variability

[Figures: difference in predictive power (individual − fixed CGF, ±0.06) vs. normalised noise power.]

SLIDES 108–110

Component significance

[Figures: significance of CGF components in cortex and thalamus.]

SLIDE 111

CGF consistency across the PRF

◮ As the CGF can be associated with the PRF weights rather than the stimulus, we can apply different CGFs to different PRF domains.

SLIDE 112

CGF consistency across the PRF

[Figures: excitatory and inhibitory CGFs (CGFexc, CGFinh) for cortex and thalamus, over τ (−240–0 ms) and φ (±1 octave), weights ±20%.]

SLIDE 113

CGF consistency across the PRF

[Figures: normalised correlation (−1 to 1) between CGFs for true pairs vs. shuffled pairs, in cortex and thalamus.]

SLIDE 114

CGF consistency across the PRF

[Scatter plot: generalisation of the single-CGF model vs. the dual (CGFexc, CGFinh) model, cortex and thalamus.]

SLIDES 115–116

Linear fits to non-linear functions

(Stimulus dependence does not always signal response adaptation.)

SLIDES 117–121

Approximations are stimulus dependent

(Stimulus dependence does not always signal response adaptation.)

SLIDE 122

Consequences

Local fitting can have counterintuitive consequences for the interpretation of a “receptive field”.

SLIDE 123

“Independently distributed” stimuli

Knowing the stimulus power at any set of points in the analysis space provides no information about the stimulus power at any other point. Examples: the DRC in spectrotemporal space; the spectrotemporal ripple. Independence is a property of both the stimulus and the analysis space.

SLIDE 124

Nonlinearity & non-independence distort RF estimates

The stimulus may have higher-order correlations in other analysis spaces — interaction with nonlinearities can produce misleading “receptive fields”.

SLIDE 125

What about natural sounds?

[Figures: multiplicative RFs and finch-song spectrograms, frequency 1–7 kHz vs. time −30 to −5 ms.]

Natural sounds are usually not independent in any analysis space — so STRFs may not be conservative estimates of receptive fields.

SLIDES 126–128

Issues: complex selectivity

SLIDE 129

Issues: adaptation, task-dependence

SLIDE 130

The “agnostic” coding approach can only take us so far. Eventually, we need solid scientifically (and probably theoretically) motivated hypotheses.