

SLIDE 1

Mathematical Tools for Neural and Cognitive Science

Section 4: Statistics and Inference

Fall semester, 2016

Probability: an abstract mathematical framework for describing random quantities (e.g. measurements)

Statistics: use of probability to summarize, analyze, interpret data. Fundamental to all experimental science.

SLIDE 2

Statistics as a form of summary

[figure: a sequence of binary data 0, 1, 0, 0, 0, 1, 0, 1, ... summarized by a distribution P(x)]

The purpose of statistics is to replace a quantity of data by relatively few quantities which shall ... contain as much as possible, ideally the whole, of the relevant information contained in the original data.

  • R.A. Fisher, 1934

Statistics for Data Summary...

  • Sample average (minimizes mean squared error)
  • Sample median (minimizes mean absolute deviation) - see the numerical check below
  • Least-squares regression - summarizes relationships between controlled and measured quantities
  • TLS regression - summarizes relationships between measured quantities
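A quick numerical check of the first two bullets (a sketch with made-up data; the sample and the grid of candidate values are arbitrary choices): scanning candidate summary values shows that mean squared error is minimized at the sample average and mean absolute deviation at the sample median.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)   # skewed sample data (illustrative)

# Candidate summary values to test
candidates = np.linspace(data.min(), data.max(), 2001)

# Mean squared error and mean absolute deviation for each candidate summary
mse = np.array([np.mean((data - c) ** 2) for c in candidates])
mad = np.array([np.mean(np.abs(data - c)) for c in candidates])

print("argmin MSE :", candidates[np.argmin(mse)], " sample mean  :", data.mean())
print("argmin MAD :", candidates[np.argmin(mad)], " sample median:", np.median(data))
```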

SLIDE 3

Scientific process

Create/modify hypothesis/model → generate predictions, design experiment → observe/measure → summarize, and compare with expectations → (back to create/modify hypothesis/model)

  • Efron & Tibshirani, Introduction to the Bootstrap

SLIDE 4

Probability basics

  • discrete probability distributions
  • continuous probability densities
  • cumulative distributions
  • translation and scaling of distributions (adding or multiplying by a constant)
  • monotonic nonlinear transformations
  • drawing samples from a distribution via inverse cumulative mapping (see the sketch below)
  • example densities/distributions

[on board]
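A minimal sketch of sampling via the inverse cumulative mapping, using the exponential distribution, whose CDF F(x) = 1 − e^(−λx) inverts in closed form; the rate λ and the sample count are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0                       # illustrative rate parameter (assumption)

# Draw uniform samples on [0, 1), then map them through the inverse CDF.
# For the exponential: F(x) = 1 - exp(-lam * x)  =>  F^{-1}(u) = -ln(1 - u) / lam
u = rng.uniform(size=100_000)
x = -np.log(1.0 - u) / lam

# The resulting samples should have mean 1/lam and variance 1/lam^2
print(x.mean(), x.var())        # ~0.5, ~0.25
```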

Example distributions

[figures: roll of a fair die; a not-quite-fair coin; sum of two rolled fair dice; clicks of a Geiger counter in a fixed time interval (and time between clicks); horizontal velocity of gas molecules exiting a fan]

SLIDE 5
Multi-dimensional random variables

  • Joint distributions
  • Marginals (integrating)
  • Conditionals (slicing)
  • Bayes’ Rule (inverting)
  • Statistical independence

Joint distribution: p(x, y)

SLIDE 6

Marginal distribution (integrate the joint p(x, y) over y):

p(x) = ∫ p(x, y) dy

Generalized marginal distribution, using vector notation: for a unit vector û, the density of the projection z = û·x is obtained by integrating the joint density over each plane û·x = z:

p(z) = ∫_{û·x = z} p(x) dx
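As a rough numerical illustration (the 2-D density and grid below are made up for the example), marginalizing a discretized joint amounts to summing it along the axis being integrated out:

```python
import numpy as np

# Discretized joint distribution p(x, y) on a grid (a made-up example density)
x = np.linspace(-3, 3, 121)
y = np.linspace(-3, 3, 121)
X, Y = np.meshgrid(x, y, indexing="ij")
joint = np.exp(-0.5 * (X**2 + (Y - 0.5 * X) ** 2))
joint /= joint.sum()                      # normalize so it sums to 1

# Marginal over y: sum (approximate integral) along the y axis
p_x = joint.sum(axis=1)
print(p_x.sum())                          # 1.0: still a valid distribution
```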

SLIDE 7

Conditional distribution: slice the joint distribution, then normalize (by the marginal). For example, conditioning on y = 68:

p(x | y = 68) = p(x, y = 68) / ∫ p(x, y = 68) dx = p(x, y = 68) / p(y = 68)

More generally:

p(x | y) = p(x, y) / p(y)

[figure: the joint p(x, y) and the slice p(x | y = 68)]
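The same discretized-joint idea illustrates slice-and-normalize conditioning (again, the density and the conditioning value y ≈ 1.0 are arbitrary choices for the sketch):

```python
import numpy as np

# Discretized joint p(x, y) on a grid (same made-up example as above)
x = np.linspace(-3, 3, 121)
y = np.linspace(-3, 3, 121)
X, Y = np.meshgrid(x, y, indexing="ij")
joint = np.exp(-0.5 * (X**2 + (Y - 0.5 * X) ** 2))
joint /= joint.sum()

# Condition on a particular y value: take the nearest grid slice...
j = np.argmin(np.abs(y - 1.0))            # e.g., condition on y ≈ 1.0
slice_xy = joint[:, j]

# ...and normalize by its sum (proportional to the marginal p(y) at that value)
p_x_given_y = slice_xy / slice_xy.sum()
print(p_x_given_y.sum())                  # 1.0
```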

SLIDE 8

Bayes’ Rule

p(x | y) = p(y | x) p(x) / p(y)

(a direct consequence of the definition of conditional probability)

Conditional vs. marginal

[figure: a conditional, p(x | y = 120), compared with the marginal p(x)]

In general, these differ. When are they the same? In particular, when are all conditionals equal to the marginal?

SLIDE 9

Statistical independence

Variables x and y are statistically independent if (and only if):

p(x, y) = p(x) p(y)

Independence implies that all conditionals are equal to the corresponding marginal:

p(y | x) = p(y, x) / p(x) = p(y),  ∀x

Uncorrelated doesn’t mean independent...

Statistical independence is a stronger assumption than uncorrelatedness:

  ⇒ All independent variables are uncorrelated
  ⇒ Not all uncorrelated variables are independent (see the sketch below)
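A standard counterexample, sketched numerically: y = x² is completely determined by x, yet for a symmetric zero-mean x the correlation is (essentially) zero.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(100_000)
y = x**2                                   # a deterministic function of x, so clearly dependent

# Correlation is (near) zero even so:
# cov(x, x^2) = E[x^3] - E[x] E[x^2] = 0 for a symmetric, zero-mean x.
print(np.corrcoef(x, y)[0, 1])             # ≈ 0
```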

SLIDE 10

Expected value

E(x) = ∫ x p(x) dx   [the mean, µ]

E(x²) = ∫ x² p(x) dx   [the “second moment”]

E((x − µ)²) = ∫ (x − µ)² p(x) dx = ∫ x² p(x) dx − µ²   [the variance, σ²]

E(f(x)) = ∫ f(x) p(x) dx   [note: an inner product, and thus linear!]

Mean and (co)variance

  • One-D: mean and variance summarize centroid/width
  • translation and rescaling of random variables
  • nonlinear transformations - “warping”
  • Multi-D: vector mean and covariance matrix, elliptical geometry
  • Mean/variance of weighted sum of random variables
  • The sample average
  • ... converges to true mean (except for bizarre distributions)
  • ... with variance σ²/N for N independent samples (see the sketch below)
  • ... most common choice for an estimate ...
  • Correlation
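A quick check of the sample-average claim, with illustrative values of σ, N, and the number of repeated experiments:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, N, n_repeats = 2.0, 25, 20_000      # illustrative values (assumptions)

# Many repeated experiments, each computing the average of N i.i.d. samples
samples = rng.normal(loc=1.0, scale=sigma, size=(n_repeats, N))
averages = samples.mean(axis=1)

print(averages.mean())                     # ≈ true mean (1.0)
print(averages.var(), sigma**2 / N)        # ≈ sigma^2 / N (0.16)
```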

SLIDE 11

The Central Limit Theorem

Distribution of a sum of independent R.V.’s - the return of convolution

[on board]

Central limit for a uniform distribution...

[figures: 10^4 samples of a uniform density (sigma = 1); (u+u)/sqrt(2); (u+u+u+u)/sqrt(4); 10 u’s divided by sqrt(10)]
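A minimal numerical sketch of the uniform case: sums of N independent, unit-variance uniform variables, rescaled by 1/sqrt(N), approach a Gaussian. Excess kurtosis (0 for a Gaussian) is used here as a convenient one-number summary; that choice is mine, not the notes'.

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples = 10_000

# Sum N independent, zero-mean, unit-variance uniform variables, rescaled by sqrt(N).
# As N grows, the distribution of z approaches a Gaussian (variance stays 1).
for N in (1, 2, 4, 10):
    u = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n_samples, N))  # each has variance 1
    z = u.sum(axis=1) / np.sqrt(N)
    # Excess kurtosis is -1.2 for a uniform and 0 for a Gaussian; it shrinks roughly as 1/N
    kurt = np.mean(z**4) / np.mean(z**2) ** 2 - 3
    print(N, round(z.var(), 3), round(kurt, 3))
```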

SLIDE 12

Central limit for a binary distribution...

[figures: one coin; avg of 4 coins; avg of 16 coins; avg of 64 coins; avg of 256 coins]

The Gaussian

  • parameterized by mean and stdev (position / width)
  • joint density of two indep Gaussian RVs is circular! [easy]
  • product of two Gaussians is Gaussian! [easy]
  • conditionals of a Gaussian are Gaussian! [easy]
  • sum of Gaussian RVs is Gaussian! [moderate]
  • marginals of a Gaussian are Gaussian! [moderate]
  • central limit theorem: sum of many RVs is Gaussian! [hard]
  • most random (max entropy) density with this variance! [moderate]
SLIDE 13

Gaussian densities

[figure: 2-D Gaussian density with mean [0.2, 0.8] and covariance [1.0 -0.3; -0.3 0.4]]


SLIDE 14

Product of Gaussians is Gaussian

p(x|y) ∝ p(y|x) p(x) ∝ e^(−(1/2)(1/σ_n²)(x − y)²) · e^(−(1/2)(1/σ_x²)(x − µ_x)²)
       = e^(−(1/2)[(1/σ_n² + 1/σ_x²) x² − 2 (y/σ_n² + µ_x/σ_x²) x + ...])

Completing the square shows that this posterior is also Gaussian, with

mean = (y/σ_n² + µ_x/σ_x²) / (1/σ_n² + 1/σ_x²)   (average, weighted by inverse variances!)

1/σ² = 1/σ_n² + 1/σ_x²
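A small numerical check of the precision-weighted result (the prior and likelihood parameters are illustrative):

```python
import numpy as np

# Prior N(mu_x, sigma_x^2) and likelihood N(y, sigma_n^2); illustrative values
mu_x, sigma_x = 0.0, 2.0
y, sigma_n = 3.0, 1.0

# Posterior mean and variance from combining precisions (inverse variances)
post_prec = 1 / sigma_n**2 + 1 / sigma_x**2
post_mean = (y / sigma_n**2 + mu_x / sigma_x**2) / post_prec
post_var = 1 / post_prec
print(post_mean, post_var)                 # 2.4, 0.8

# Numerical check: multiply the two Gaussian curves on a grid and normalize
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
prod = np.exp(-0.5 * (x - y) ** 2 / sigma_n**2) * np.exp(-0.5 * (x - mu_x) ** 2 / sigma_x**2)
prod /= prod.sum() * dx
print((x * prod).sum() * dx)               # ≈ 2.4 (matches post_mean)
```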

For a multi-dimensional Gaussian, x ∼ N(µ, C), let P = C⁻¹ (known as the “precision” matrix), and partition x = (x₁, x₂), with corresponding blocks of µ, C, and P.

Marginal: p(x₁) = ∫ p(x) dx₂ is Gaussian, with mean µ₁ and covariance C₁₁.

Conditional: p(x₁ | x₂) is Gaussian, with covariance P₁₁⁻¹ and mean µ₁ − P₁₁⁻¹ P₁₂ (x₂ − µ₂).
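A sampling-based check of those marginal and conditional formulas, for an arbitrary 2-D example (conditioning is approximated by keeping samples in a thin slab around the conditioning value):

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative 2-D Gaussian: the mean and covariance are arbitrary choices
mu = np.array([0.0, 1.0])
C = np.array([[1.0, -0.3],
              [-0.3, 0.4]])
P = np.linalg.inv(C)                       # precision matrix

samples = rng.multivariate_normal(mu, C, size=200_000)

# Marginal of x1: mean mu_1 and variance C_11
print(samples[:, 0].mean(), samples[:, 0].var())        # ≈ 0.0, 1.0

# Conditional of x1 given x2 ≈ 0.5: keep samples in a thin slab around x2 = 0.5
slab = samples[np.abs(samples[:, 1] - 0.5) < 0.01, 0]
cond_mean = mu[0] - P[0, 1] / P[0, 0] * (0.5 - mu[1])
cond_var = 1.0 / P[0, 0]
print(slab.mean(), slab.var())             # empirical slice statistics
print(cond_mean, cond_var)                 # formulas: ≈ 0.375, 0.775
```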

SLIDE 15

Generalized marginals of a Gaussian

For a unit vector û, the projection z = ûᵀx of a Gaussian x is Gaussian, with:

µ_z = ûᵀ µ_x
σ_z² = ûᵀ C_x û

[figure: Measurement (sampling) and Inference. True density: mean [0 0.8], cov [1.0 -0.25; -0.25 0.3]; from 700 samples, sample mean [-0.05 0.83], sample cov [0.95 -0.23; -0.23 0.29]]
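A quick sampling check of the generalized-marginal formulas, with illustrative parameters and an arbitrary projection direction û:

```python
import numpy as np

rng = np.random.default_rng(6)

mu = np.array([0.0, 0.8])                  # illustrative parameters
C = np.array([[1.0, -0.25],
              [-0.25, 0.3]])

u = np.array([1.0, 1.0])
u = u / np.linalg.norm(u)                  # unit vector defining the projection

x = rng.multivariate_normal(mu, C, size=100_000)
z = x @ u                                  # generalized marginal: project every sample

print(z.mean(), u @ mu)                    # ≈ u^T mu
print(z.var(), u @ C @ u)                  # ≈ u^T C u
```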

SLIDE 16
Point Estimates

  • Estimator: any function of the data, intended to represent the best approximation of the true value of a parameter
  • Most common estimator is the sample average
  • Statistically-motivated examples (compared in the sketch below):
  • Maximum likelihood (ML): the x maximizing p(y|x)
  • Max a posteriori (MAP): the x maximizing p(x|y)
  • Min Mean Squared Error (MMSE): the posterior mean, E(x|y)

Bayes’ rule gives the posterior: p(x|y) ∝ p(x) p(y|x)

  • why must both prior and likelihood be taken into account?
  • why doesn’t data dominate?
  • when would it? when would prior dominate?
  • what if prior and likelihood are incompatible?
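A sketch comparing the three point estimates for the coin example (x = probability of heads), computed on a grid; the observed counts and the Beta-shaped prior are assumptions made for illustration:

```python
import numpy as np

# Coin example: x = probability of heads; observed H heads and T tails (assumptions)
H, T = 7, 3
a, b = 5.0, 5.0                            # hypothetical Beta(a, b) prior: "suspect fair"

x = np.linspace(1e-6, 1 - 1e-6, 100_001)
log_lik = H * np.log(x) + T * np.log(1 - x)
log_prior = (a - 1) * np.log(x) + (b - 1) * np.log(1 - x)
log_post = log_lik + log_prior

post = np.exp(log_post - log_post.max())
post /= post.sum()                         # normalized posterior on the grid

print("ML  :", x[np.argmax(log_lik)])      # ≈ H / (H + T) = 0.70
print("MAP :", x[np.argmax(log_post)])     # ≈ (H + a - 1) / (H + T + a + b - 2) ≈ 0.611
print("MMSE:", (x * post).sum())           # posterior mean ≈ (H + a) / (H + T + a + b) = 0.60
```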
SLIDE 17
SLIDE 18

Posteriors, p(x | H heads, T tails), assuming prior p(x) = 1

[figure grid: posteriors for H = 0...3 and T = 0...3; likelihood for 1 head and likelihood for 1 tail; arrows indicate “more heads” / “more tails”]

SLIDE 19

Example: infer whether a coin is fair by flipping it repeatedly. Here, x is the probability of heads (50% is fair), and y1...yn are the outcomes of the flips.

Consider three different priors: suspect fair, suspect biased, no idea.

prior (fair / biased / uncertain) × likelihood (heads) = posterior

SLIDE 20

previous posteriors × likelihood (heads) = new posterior
previous posteriors × likelihood (tails) = new posterior

SLIDE 21

Posteriors after observing 75 heads, 25 tails → prior differences are ultimately overwhelmed by data
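A sketch of the sequential updating described above; the three prior shapes are made-up stand-ins for “suspect fair”, “suspect biased”, and “no idea”:

```python
import numpy as np

rng = np.random.default_rng(7)

# Grid over x, the probability of heads
x = np.linspace(0.001, 0.999, 999)

# Three illustrative priors (assumed shapes)
priors = {
    "suspect fair":   x**20 * (1 - x) ** 20,     # peaked at 0.5
    "suspect biased": x**20 + (1 - x) ** 20,     # mass near 0 and 1
    "no idea":        np.ones_like(x),           # flat
}
priors = {k: p / p.sum() for k, p in priors.items()}

# Simulate 75 heads and 25 tails in random order
flips = rng.permutation(np.array([1] * 75 + [0] * 25))

for name, post in priors.items():
    for flip in flips:
        post = post * (x if flip == 1 else 1 - x)   # multiply by the per-flip likelihood
        post = post / post.sum()                    # renormalize
    mean = (x * post).sum()
    print(f"{name:15s} posterior mean ≈ {mean:.3f}")  # all pulled toward the empirical 0.75
```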

Confidence

[figure: PDFs and CDFs of the posteriors after 2H/1T, 10H/5T, and 20H/10T, with interval endpoints read off the CDFs at the .025 and .975 levels]

SLIDE 22

Bias & Variance

  • MSE = bias² + variance (see the sketch below)
  • Bias is difficult to assess (since it requires knowing the “true” value). But variance is easier.
  • Classical statistics generally aims for an unbiased estimator, with minimal variance
  • The MLE is asymptotically unbiased (under fairly general conditions), but this is only useful if
  • the likelihood model is correct
  • the optimum can be computed
  • you have enough data
  • More general/modern view: estimation is about trading off bias and variance, through model selection, “regularization”, or Bayesian priors.
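A brief numerical illustration of MSE = bias² + variance, using the 1/N vs. 1/(N−1) variance estimators as a stand-in example (this particular comparison is not from the notes):

```python
import numpy as np

rng = np.random.default_rng(8)
true_var, N, n_repeats = 4.0, 10, 200_000   # illustrative settings

data = rng.normal(0.0, np.sqrt(true_var), size=(n_repeats, N))

# Two estimators of the variance: biased (divide by N) and unbiased (divide by N-1)
for name, ddof in [("1/N (biased)", 0), ("1/(N-1) (unbiased)", 1)]:
    est = data.var(axis=1, ddof=ddof)
    bias = est.mean() - true_var
    var = est.var()
    mse = np.mean((est - true_var) ** 2)
    print(f"{name:20s} bias^2 + variance = {bias**2 + var:.3f}   MSE = {mse:.3f}")
```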

Optimization...

  • Quadratic: closed-form, and unique
  • Convex: iterative descent, unique
  • Smooth (C2): iterative descent, (possibly) nonunique
  • Otherwise: heuristics, exhaustive search, (pain & suffering)

SLIDE 23

Bootstrapping

  • “The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps” [Adventures of Baron von Munchausen, by Rudolph Erich Raspe]
  • A resampling method for computing estimator distributions (incl. stdev or error bars)
  • Idea: instead of running the experiment multiple times, resample from the existing data (with replacement). Compute estimates from these “bootstrap” data sets.

[Efron & Tibshirani ’98] [New York Times, 27 Jan 1987]

[figure: histogram of bootstrapped estimates, with the original estimate and 95% confidence interval marked]

=> with 95% confidence,
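A minimal bootstrap sketch (the data set and the choice of the sample mean as the estimator are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)

# Original data set (illustrative) and the estimator of interest (here, the mean)
data = rng.exponential(scale=2.0, size=50)
estimate = data.mean()

# Resample the data with replacement many times; recompute the estimate each time
n_boot = 10_000
boot = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(n_boot)
])

# Spread of the bootstrap estimates gives an error bar; percentiles give a 95% interval
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"estimate = {estimate:.2f}, stdev ≈ {boot.std():.2f}, 95% interval ≈ [{lo:.2f}, {hi:.2f}]")
```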

SLIDE 24

Cross-validation

A resampling method for determining the predictive power of a model. Widely used to identify/avoid over-fitting.

(1) Randomly partition your data into a “training” set and a “test” set.
(2) Fit the model to the training set. Measure error on the test set.
(3) Repeat (many times).

[figure: MSE vs. polynomial degree, showing fit error, cross-validation (x-val) error, the true degree, and the true error]
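A sketch of this procedure for choosing polynomial degree; the synthetic cubic data, the split sizes, and the use of np.polyfit are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(10)

# Synthetic data from a cubic polynomial plus noise (illustrative)
x = rng.uniform(-1, 1, size=200)
y = 1 + 2 * x - 3 * x**2 + 4 * x**3 + rng.normal(0, 0.3, size=x.size)

def cv_error(degree, n_repeats=100):
    """Average test-set MSE over random train/test splits, for one polynomial degree."""
    errs = []
    for _ in range(n_repeats):
        idx = rng.permutation(x.size)
        train, test = idx[:150], idx[150:]
        coeffs = np.polyfit(x[train], y[train], degree)      # fit on the training set
        pred = np.polyval(coeffs, x[test])                   # evaluate on the held-out test set
        errs.append(np.mean((pred - y[test]) ** 2))
    return np.mean(errs)

for d in range(1, 9):
    print(d, round(cv_error(d), 4))   # x-val error is smallest near the true degree (3)
```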

SLIDE 25

Ridge regression (a.k.a. Tikhonov regularization, or linear regularization)

Ordinary least squares regression:

argmin_β ||y − Xβ||²

“Regularized” least squares regression:

argmin_β ||y − Xβ||² + λ ||β||²

Note: this is the negative log posterior, assuming a Gaussian likelihood & prior. Choose lambda (λ) by cross-validation.

[figure: 7th-order polynomial regression, comparing the data, the LS regression fit, and the ridge regression fit]
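A minimal sketch comparing ordinary and ridge-regularized least squares on a 7th-order polynomial fit; the data, λ, and design matrix are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(11)

# Noisy data and a 7th-order polynomial design matrix (illustrative setup)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)
X = np.vander(x, 8)                          # columns x^7 ... x^0

lam = 1e-3                                   # in practice, choose lambda by cross-validation

# Ordinary least squares
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge regression: beta = (X^T X + lam I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(np.linalg.norm(beta_ls), np.linalg.norm(beta_ridge))   # ridge coefficients are shrunk
```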

L1 regularization (a.k.a. LASSO - least absolute shrinkage and selection operator)

argmin_β ||y − Xβ||² + λ Σ_k |β_k|

Using an absolute-error (L1 norm) regularization term promotes selection of regressors: many coefficients are pushed exactly to zero.
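A minimal LASSO sketch using proximal gradient descent (ISTA: gradient steps followed by soft-thresholding); this is one standard solver, not necessarily the one used in the course, and λ, the step size, and the synthetic data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(12)

# Sparse ground truth: only a few regressors actually matter (illustrative)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[2, 7, 11]] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(0, 0.5, size=n)

lam = 10.0                                   # regularization weight (tune by cross-validation)
step = 1.0 / np.linalg.norm(X, 2) ** 2       # step size from the largest singular value

beta = np.zeros(p)
for _ in range(5000):
    grad = X.T @ (X @ beta - y)              # gradient of (1/2)||y - X beta||^2
    z = beta - step * grad
    beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding (prox of L1)

print(np.round(beta, 2))                     # most entries end up exactly zero
```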