Probability, Statistics and Inference


SLIDE 1

Mathematical Tools for Neural and Cognitive Science

Probability, Statistics and Inference

Fall semester, 2017

Probability: an abstract mathematical framework for describing random quantities (e.g., measurements).
Statistics: the use of probability to summarize, analyze, and interpret data. Fundamental to all experimental science.

SLIDE 2

Probabilistic Middleville

The stork delivers boys and girls randomly, with equal probability.

[Diagram: probabilistic model and data, linked in one direction by the model generating data, and in the other by statistical inference]

In Middleville, every family has two children, brought by the stork. You pick a family at random and discover that one of the children is a girl. What is the probability that the other child is a girl?

Statistical Middleville

In Middleville, every family has two children, brought by the stork. In a survey of 100 Middleville families, 32 have two girls, 24 have two boys, and the remainder have one of each. Is this consistent with the stork delivering boys and girls randomly, with equal probability?
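For the "Probabilistic Middleville" question, a minimal simulation sketch (plain Python, assuming boys and girls are equally likely and independent across children, and reading the puzzle as "you learn only that at least one of the two children is a girl"):

```python
import random

random.seed(0)
trials = 100_000
at_least_one_girl = 0
other_also_girl = 0
for _ in range(trials):
    kids = [random.choice("BG"), random.choice("BG")]
    if "G" in kids:                      # condition: at least one child is a girl
        at_least_one_girl += 1
        if kids.count("G") == 2:         # "the other child" is also a girl
            other_also_girl += 1

print(other_also_girl / at_least_one_girl)   # close to 1/3, not 1/2
```

Of the four equally likely families (BB, BG, GB, GG), three have at least one girl and only one has two, which is why the conditional probability comes out near 1/3.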

SLIDE 3

Some historical context

  • 1600’s: Early notions of data summary/averaging
  • 1700’s: Bayesian probability/statistics (Bayes, Laplace)
  • 1920’s: Frequentist statistics for science (e.g., Fisher)
  • 1940’s: Statistical signal analysis and communication, estimation/decision theory (Shannon, Wiener, etc.)
  • 1970’s: Computational optimization and simulation (e.g., Tukey)
  • 1990’s: Machine learning (large-scale computing + statistical inference + lots of data)
  • Since 1950’s: statistical neural/cognitive models

(Reference: Efron & Tibshirani, Introduction to the Bootstrap)
SLIDE 4

Scientific process

[Cycle diagram: create/modify hypothesis/model → generate predictions, design experiment → observe/measure data → summarize/fit, compare with predictions → back to create/modify hypothesis/model]

Estimating model parameters

  • How do I compute the estimate? (mathematics vs. numerical optimization)
  • How “good” are my estimates?
  • How well does my model explain the data? Future data (prediction/generalization)?
  • How do I compare two (or more) models?
SLIDE 5

Outline of what’s coming

Themes:

  • Uni-variate vs. multi-variate
  • Discrete vs. continuous
  • Math vs. simulation
  • Bayesian vs. frequentist inference

Topics:

  • Descriptive statistics
  • Basic probability theory: univariate, multivariate
  • Model parameter estimation
  • Hypothesis testing / model comparison

Example: Localization

Issues: Mean and variability (accuracy and precision)

SLIDE 6

Descriptive statistics: Central tendency

  • We often summarize data with the average. Why?
  • The average minimizes the squared error (think regression!)
  • More generally, for Lp norms (see the formula and the numerical sketch below):
  • minimum L1 norm: median
  • minimum L0 norm: mode
  • Issues: Data from a common source, outliers, asymmetry, bimodality

The sample average minimizes the squared error:

\arg\min_{\hat{x}} \frac{1}{N}\sum_{n=1}^{N}(x_n - \hat{x})^2 \;=\; \frac{1}{N}\sum_{n=1}^{N} x_n

More generally, one can minimize the Lp norm of the deviations:

\left[ \frac{1}{N}\sum_{n=1}^{N} |x_n - \hat{x}|^p \right]^{1/p}
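A minimal numerical sketch of the Lp story (numpy assumed; the Poisson data and candidate grid are illustrative choices, not from the slides): the L2 minimizer matches the mean, the L1 minimizer matches the median, and as p shrinks toward 0 the minimizer moves to the mode.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=3, size=1000).astype(float)      # example data (integer-valued, skewed)

def lp_cost(xhat, p):
    d = np.abs(x - xhat)
    return np.mean(d ** p) ** (1.0 / p)

grid = np.linspace(x.min(), x.max(), 5001)           # candidate estimates x_hat
best = {p: grid[np.argmin([lp_cost(c, p) for c in grid])] for p in (2, 1)}

print("L2 minimizer:", best[2], "  mean:  ", x.mean())       # these two should agree
print("L1 minimizer:", best[1], "  median:", np.median(x))   # and so should these
# As p -> 0, the minimizer approaches the most frequent value (the mode):
vals, counts = np.unique(x, return_counts=True)
print("mode:", vals[np.argmax(counts)])
```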

Descriptive statistics: Dispersion

  • Sample variance
  • Why N-1?
  • Sample standard deviation
  • Mean absolute deviation

Sample variance:

s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2

Mean absolute deviation:

\frac{1}{N}\sum_{i=1}^{N}|x_i - \bar{x}|
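To make the N−1 concrete, a small simulation sketch (numpy assumed; the variance and sample size are illustrative): dividing the summed squared deviations by N−1 gives an unbiased estimate of the true variance, while dividing by N underestimates it on average.

```python
import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0
N = 5                                    # small samples make the bias easy to see
reps = 200_000

x = rng.normal(0.0, np.sqrt(true_var), size=(reps, N))
dev2 = (x - x.mean(axis=1, keepdims=True)) ** 2      # squared deviations from sample mean

print("divide by N-1:", dev2.sum(axis=1).mean() / (N - 1))   # ~4.0 (unbiased)
print("divide by N  :", dev2.sum(axis=1).mean() / N)         # ~3.2 (= 4 * (N-1)/N)
```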

SLIDE 7

Example: Localization

I find that \bar{x} \neq 0. Is that convincing? Is the apparent bias real? To answer this, we need tools from probability…

Probability: notation

Let X, Y, Z be random variables. They can take on values (like ‘heads’ or ‘tails’; or integers 1–6; or real-valued numbers).
Let x, y, z stand generically for values they can take, and also, in shorthand, for events like X = x.
We write the probability that X takes on value x as P(X = x), or P_X(x), or sometimes just P(x).
P(x) is a function over x, which we call the probability “distribution” function (pdf) (or, for continuous variables only, “density”).

SLIDE 8

[Figures: a discrete pdf P(x), the distribution of the sum of 2 dice rolls; and a continuous pdf p(x), the IQ of a randomly chosen person]

Normalization

Discrete:

0 \le P(x) \le 1, \qquad \sum_i P(x_i) = 1

Continuous:

0 \le p(x), \qquad \int_{-\infty}^{\infty} p(x)\, dx = 1
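A minimal sketch (numpy assumed) that builds the discrete pdf of the sum of two fair dice by enumeration and checks the normalization condition:

```python
import numpy as np

faces = np.arange(1, 7)
sums = faces[:, None] + faces[None, :]            # all 36 equally likely outcomes
values, counts = np.unique(sums, return_counts=True)
P = counts / counts.sum()                         # P(x) for x = 2..12

print(dict(zip(values.tolist(), P.round(4).tolist())))
print("sums to one:", np.isclose(P.sum(), 1.0))   # normalization check
```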

SLIDE 9

Probability basics

  • discrete probability distributions
  • continuous probability densities
  • cumulative distributions
  • translation and scaling of distributions
  • monotonic nonlinear transformations
  • drawing samples from a distribution: uniform samples + inverse cumulative mapping (see the sketch below)
  • example densities/distributions

[on board]
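As a concrete illustration of inverse cumulative mapping, a minimal sketch (numpy assumed; the exponential distribution is just an illustrative target): pass uniform samples through the inverse CDF of the target distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)          # samples uniform on [0, 1)

# Inverse CDF of an exponential with rate lam: F(x) = 1 - exp(-lam*x),
# so F^{-1}(u) = -ln(1 - u) / lam.
lam = 2.0
x = -np.log(1.0 - u) / lam

print(x.mean())                        # ~ 1/lam = 0.5
print(np.mean(x <= 1.0))               # ~ F(1) = 1 - exp(-2), about 0.865
```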

Example distributions

[Plots: a not-quite-fair coin; roll of a fair die; sum of two rolled fair dice; clicks of a Geiger counter in a fixed time interval; time between clicks; horizontal velocity of gas molecules exiting a fan]

SLIDE 10

Expected value - discrete

E(X) = \sum_{i=1}^{N} x_i\, p(x_i) \qquad [\text{the mean, } \mu]

[Example plots: number of credit cards owned, shown as a histogram of counts (# of students vs. # of credit cards) and as the corresponding probability distribution P(x)]

Expected value - continuous

E(x) = \int x\, p(x)\, dx \qquad [\text{the mean, } \mu]

E(x^2) = \int x^2\, p(x)\, dx \qquad [\text{the “second moment”}]

E\!\left((x-\mu)^2\right) = \int (x-\mu)^2\, p(x)\, dx = \int x^2\, p(x)\, dx - \mu^2 \qquad [\text{the variance, } \sigma^2]

E(f(x)) = \int f(x)\, p(x)\, dx

Note: expectation is an inner product, and thus linear, i.e., E(a f(X) + b g(X)) = a\,E(f(X)) + b\,E(g(X)).
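A minimal numerical check (numpy assumed; the two-dice distribution is reused purely as an example) of the discrete expectation, the variance identity, and linearity:

```python
import numpy as np

# Discrete pdf of the sum of two fair dice, by enumeration.
faces = np.arange(1, 7)
vals, counts = np.unique(faces[:, None] + faces[None, :], return_counts=True)
P = counts / counts.sum()

mu = np.sum(vals * P)                        # E(X) = 7
second = np.sum(vals**2 * P)                 # E(X^2), the second moment
var = second - mu**2                         # E(X^2) - mu^2 = 35/6, about 5.83

# Linearity: E(a f(X) + b g(X)) = a E(f(X)) + b E(g(X))
a, b = 2.0, -3.0
lhs = np.sum((a * vals**2 + b * np.sqrt(vals)) * P)
rhs = a * np.sum(vals**2 * P) + b * np.sum(np.sqrt(vals) * P)
print(mu, var, np.isclose(lhs, rhs))
```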

SLIDE 11

Joint and conditional probability - discrete

  • P(Ace)
  • P(Heart)
  • P(Ace & Heart)
  • P(Ace | Heart)
  • P(not Jack of Diamonds)
  • P(Ace | not Jack of Diamonds)
  • “Independence”
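A minimal enumeration sketch (plain Python; the deck construction and helper functions are an illustration, not from the slides) for checking these card probabilities:

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]           # 52 equally likely cards

def prob(event):
    return Fraction(sum(event(c) for c in deck), len(deck))

def cond(event, given):
    sub = [c for c in deck if given(c)]
    return Fraction(sum(event(c) for c in sub), len(sub))

is_ace = lambda c: c[0] == "A"
is_heart = lambda c: c[1] == "hearts"
not_jack_of_diamonds = lambda c: c != ("J", "diamonds")

print(prob(is_ace))                               # P(Ace) = 1/13
print(prob(is_heart))                             # P(Heart) = 1/4
print(prob(lambda c: is_ace(c) and is_heart(c)))  # P(Ace & Heart) = 1/52
print(cond(is_ace, is_heart))                     # P(Ace | Heart) = 1/13
print(prob(not_jack_of_diamonds))                 # P(not Jack of Diamonds) = 51/52
print(cond(is_ace, not_jack_of_diamonds))         # P(Ace | not Jack of Diamonds) = 4/51
```

Note that P(Ace | Heart) equals P(Ace), so "Ace" and "Heart" are independent, while conditioning on "not Jack of Diamonds" slightly changes the probability of an Ace.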

SLIDE 12

Multi-variate probability

  • Joint distributions
  • Marginals (integrating)
  • Conditionals (slicing)
  • Bayes’ Rule (inverting)
  • Statistical independence (separability)

[on board]

SLIDE 13

Marginal distribution

[Figure: joint density p(x, y) and its marginal p(x)]

p(x) = \int p(x, y)\, dy

Conditional probability

[Venn diagram: events A, B, their intersection A & B, and the region “neither A nor B”]

p(A \mid B) = \text{probability of A, given that B is asserted to be true} = \frac{p(A \,\&\, B)}{p(B)}
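A minimal sketch (numpy assumed; the joint table values are arbitrary illustrations) of marginalizing a discrete joint distribution by summing over the other variable:

```python
import numpy as np

# A small discrete joint distribution p(x, y) (rows index x, columns index y).
pxy = np.array([[0.10, 0.05, 0.05],
                [0.20, 0.10, 0.10],
                [0.05, 0.15, 0.20]])
assert np.isclose(pxy.sum(), 1.0)

px = pxy.sum(axis=1)     # marginal p(x): "integrate" (here, sum) over y
py = pxy.sum(axis=0)     # marginal p(y)
print(px, py)
```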

SLIDE 14

Conditional distribution

[Figure: joint density p(x, y) with a slice at y = 68, giving p(x | y = 68)]

p(x \mid y = 68) = \frac{p(x, y = 68)}{\int p(x, y = 68)\, dx} = \frac{p(x, y = 68)}{p(y = 68)}

Recipe: slice the joint distribution, then normalize (by the marginal).

More generally:

p(x \mid y) = \frac{p(x, y)}{p(y)}
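A minimal sketch of the slice-then-normalize recipe (numpy assumed; the same illustrative joint table as above):

```python
import numpy as np

pxy = np.array([[0.10, 0.05, 0.05],       # illustrative joint table p(x, y), rows index x
                [0.20, 0.10, 0.10],
                [0.05, 0.15, 0.20]])

y = 1                                      # condition on the second value of y
slice_xy = pxy[:, y]                       # slice the joint at that y
p_x_given_y = slice_xy / slice_xy.sum()    # normalize by the marginal p(y)
print(p_x_given_y, p_x_given_y.sum())      # a proper distribution over x
```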

SLIDE 15

Bayes’ Rule

[Venn diagram: events A, B, and A & B]

p(A \mid B) = \text{probability of A given that B is asserted to be true} = \frac{p(A \,\&\, B)}{p(B)}

p(A \,\&\, B) = p(B)\, p(A \mid B) = p(A)\, p(B \mid A) \;\Rightarrow\; p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}

Bayes’ Rule

p(x|y) = p(y|x) p(x)/p(y)

(a direct consequence of the definition of conditional probability)

SLIDE 16

Conditional vs. marginal

[Figure: a conditional P(x | Y = 120) compared with the marginal P(x)]

In general, these differ. When are they the same? In particular, when are all conditionals equal to the marginal?

Statistical independence

Random variables X and Y are statistically independent if (and only if):

p(x, y) = p(x)\, p(y) \quad \forall\, x, y

Independence implies that all conditionals are equal to the corresponding marginal:

p(x \mid y) = p(x, y) / p(y) = p(x) \quad \forall\, x, y

[Note: for discrete distributions, an independent joint is an outer product of the marginals!]
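A minimal sketch of the outer-product remark (numpy assumed; the marginals are arbitrary illustrations): building a joint as the outer product of its marginals makes every conditional equal to the marginal.

```python
import numpy as np

px = np.array([0.2, 0.5, 0.3])            # marginal over x
py = np.array([0.6, 0.4])                 # marginal over y

pxy_indep = np.outer(px, py)              # independent joint: p(x, y) = p(x) p(y)
print(pxy_indep)

# Every conditional p(x | y) equals the marginal p(x):
for j in range(len(py)):
    col = pxy_indep[:, j]
    print(np.allclose(col / col.sum(), px))
```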

SLIDE 17

Sums of independent RVs

Suppose X and Y are independent, then

E(XY) = E(X)\,E(Y)

\sigma^2_{X+Y} = E\!\left[\big((X+Y) - (\mu_X + \mu_Y)\big)^2\right] = \sigma^2_X + \sigma^2_Y

and p_{X+Y}(z) is a convolution of p_X and p_Y.

Implications: (1) Sums of Gaussians are Gaussian, (2) Properties of the sample average

For any two random variables (independent or not):

E(X+Y) = E(X) + E(Y)

Mean and variance

  • Mean and variance summarize centroid/width
  • translation and rescaling of random variables
  • nonlinear transformations - “warping”
  • Mean/variance of weighted sum of random variables
  • The sample average
  • ... converges to true mean (except for bizarre distributions)
  • ... with variance \sigma^2 / N (see the sketch below)
  • ... most common choice for an estimate ...
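A minimal simulation sketch (numpy assumed; the variance and sample sizes are illustrative) of the last point: the variance of the sample average of N independent draws shrinks as sigma^2 / N.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 9.0                                 # true variance of each sample
for N in (1, 4, 16, 64):
    x = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, N))
    avg = x.mean(axis=1)                     # sample average of N independent draws
    print(N, avg.var(), sigma2 / N)          # empirical variance vs. predicted sigma^2 / N
```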

SLIDE 18

Point Estimates

  • Estimator: Any function of the data, intended to compute an estimate of the true value of a parameter
  • The most common estimator is the sample average, used to estimate the true mean of the distribution.
  • Statistically-motivated examples:
  • Maximum likelihood (ML)
  • Maximum a posteriori (MAP)
  • Minimum Mean Squared Error (MMSE)

Example: Estimate the bias of a coin
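A minimal sketch (numpy assumed; the true bias and number of flips are illustrative) of the maximum-likelihood estimate for the coin example: with H heads in n flips, the likelihood x^H (1-x)^(n-H) is maximized at H/n.

```python
import numpy as np

rng = np.random.default_rng(0)
true_bias = 0.7
flips = rng.random(50) < true_bias           # 50 simulated Bernoulli(0.7) flips

H, n = flips.sum(), flips.size
x_grid = np.linspace(1e-3, 1 - 1e-3, 999)    # candidate values of the bias
log_lik = H * np.log(x_grid) + (n - H) * np.log(1 - x_grid)

print("grid ML estimate:", x_grid[np.argmax(log_lik)])
print("closed form H/n :", H / n)            # the two agree (up to grid spacing)
```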

SLIDE 19

Bayes’ Rule and Estimation

p(\text{parameter value} \mid \text{data}) = \frac{p(\text{data} \mid \text{parameter value})\; p(\text{parameter value})}{p(\text{data})}

[Posterior = Likelihood × Prior / nuisance normalizing term]

SLIDE 20

[Figure: single-flip likelihoods for one head and one tail, and a grid of posteriors over the coin bias x for increasing numbers of heads (more heads across) and tails (more tails down), H, T = 0…3]

Posteriors over x (∝ p(H, T | x)), assuming prior p(x) = 1
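A minimal sketch (numpy assumed; H and T are illustrative) of one cell of such a grid: with a flat prior, the posterior over the bias x after H heads and T tails is proportional to x^H (1 − x)^T.

```python
import numpy as np

H, T = 3, 1                                   # observed heads and tails
x = np.linspace(0.0, 1.0, 1001)               # grid over the coin bias
dx = x[1] - x[0]

prior = np.ones_like(x)                       # flat prior p(x) = 1
likelihood = x**H * (1 - x)**T                # p(data | x)
posterior = prior * likelihood
posterior /= posterior.sum() * dx             # normalize so it integrates to 1

print("posterior mode:", x[np.argmax(posterior)])       # = H / (H + T) = 0.75
print("posterior mean:", np.sum(x * posterior) * dx)    # = (H+1)/(H+T+2), about 0.667
```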

SLIDE 21

Example

Infer whether a coin is fair by flipping it repeatedly.
Here, x is the probability of heads (50% is fair), and y_1 ... y_n are the outcomes of the flips.
Consider three different priors: suspect fair, suspect biased, no idea.

[Figure, three columns: prior “fair”, prior “biased”, prior “uncertain”; each prior × likelihood (heads) = posterior]
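A minimal sketch (numpy assumed; the specific prior shapes and flip sequence are illustrative choices, not the ones on the slides) of updating three different priors with the same flip outcomes:

```python
import numpy as np

x = np.linspace(0.001, 0.999, 999)                 # grid over probability of heads
dx = x[1] - x[0]

def normalize(p):
    return p / (p.sum() * dx)

priors = {
    "suspect fair":   normalize(np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2)),  # peaked at 0.5
    "suspect biased": normalize(x**4 + (1 - x)**4),                       # pushed toward 0 or 1
    "no idea":        normalize(np.ones_like(x)),                         # flat
}

flips = [1, 1, 0, 1]                               # 1 = heads, 0 = tails
posteriors = dict(priors)
for y in flips:
    lik = x if y == 1 else (1 - x)                 # single-flip likelihood
    posteriors = {name: normalize(p * lik) for name, p in posteriors.items()}

for name, p in posteriors.items():
    print(name, "posterior mean:", np.sum(x * p) * dx)
```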

SLIDE 22

[Figure: previous posteriors × likelihood (heads) = new posterior; previous posteriors × likelihood (tails) = new posterior]

SLIDE 23

Posteriors after observing 75 heads, 25 tails → prior differences are ultimately overwhelmed by data

Confidence

[Figure: posterior PDFs and CDFs after 2H/1T, 10H/5T, and 20H/10T, annotated with cumulative probability values]

SLIDE 24

Bias & Variance

  • Mean squared error = bias² + variance
  • Bias is difficult to assess (requires knowing the “true” value). Variance is easier.
  • Classical statistics generally aims for an unbiased estimator, with minimal variance (“MVUE”).
  • The MLE is asymptotically unbiased (under fairly general conditions), but this is only useful if
  • the likelihood model is correct
  • the optimum can be computed
  • you have enough data
  • More general/modern view: estimation is about trading off bias and variance, through model selection, “regularization”, or Bayesian priors.

Bayesian Model Comparison

  • Is the coin fair? Compared to what?
  • Point hypotheses:

M_1: p = p_1 = 0.5 \qquad M_2: p = p_2 = 0.6

p(M_1 \mid D) = \frac{p(D \mid M_1)\, P(M_1)}{p(D)} = \frac{p(D \mid p_1)\, P(p_1)}{p(D)}

Assuming equal priors over models, the Bayes factor is

\frac{p(M_1 \mid D)}{p(M_2 \mid D)} = \frac{p(D \mid M_1)\, P(M_1)}{p(D \mid M_2)\, P(M_2)} = \frac{p(D \mid p_1)\, P(p_1)}{p(D \mid p_2)\, P(p_2)}
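A minimal sketch (plain Python; the flip counts are illustrative) of this Bayes factor for the two point hypotheses:

```python
# Bayes factor for two point hypotheses about the coin's bias, given H heads in n flips.
# (The binomial coefficient is common to both likelihoods and cancels in the ratio.)
H, n = 60, 100                      # illustrative data
p1, p2 = 0.5, 0.6                   # M1: fair coin, M2: bias 0.6

lik1 = p1**H * (1 - p1)**(n - H)
lik2 = p2**H * (1 - p2)**(n - H)
print("Bayes factor p(D|M1)/p(D|M2):", lik1 / lik2)
# Values below 1 favor M2 (the 0.6-bias coin) for these data.
```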

SLIDE 25

Bayesian Model Comparison

  • Is the coin fair? Compared to what?
  • Alternative hypothesis:

M_1: p = p_1 = 0.5 \qquad M_2: p \ne 0.5

p(M_2 \mid D) = \frac{p(D \mid M_2)\, P(M_2)}{p(D)} = \left[\int_0^1 p(D \mid M_2, p_{\text{coin}})\, p(p_{\text{coin}})\, dp_{\text{coin}}\right] \frac{P(M_2)}{p(D)}

Compute the Bayes factor as before.
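A minimal sketch (numpy assumed; a flat prior on p_coin and the flip counts are illustrative choices) of the Bayes factor between the point hypothesis p = 0.5 and this composite alternative, with the marginal likelihood computed by numerical integration:

```python
import numpy as np

H, n = 60, 100                              # illustrative data: 60 heads in 100 flips
p_grid = np.linspace(0.001, 0.999, 999)
dp = p_grid[1] - p_grid[0]

lik = p_grid**H * (1 - p_grid)**(n - H)     # p(D | p_coin); binomial coefficient omitted
prior = np.ones_like(p_grid)                # flat prior p(p_coin) = 1 on [0, 1]

evidence_M2 = np.sum(lik * prior) * dp      # marginal likelihood: integrate over p_coin
evidence_M1 = 0.5**H * 0.5**(n - H)         # point hypothesis p = 0.5
# The omitted binomial coefficient is common to both and cancels in the ratio.
print("Bayes factor p(D|M1)/p(D|M2):", evidence_M1 / evidence_M2)
```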

The Gaussian

  • parameterized by mean and stdev (position / width)
  • joint density of two indep Gaussian RVs is circular! [easy]
  • product of two Gaussians is Gaussian! [easy]
  • conditionals of a Gaussian are Gaussian! [easy]
  • sum of Gaussian RVs is Gaussian! [moderate]
  • marginals of a Gaussian are Gaussian! [moderate]
  • central limit theorem: sum of many RVs is Gaussian! [hard]
  • most random (max entropy) density with this variance! [moderate]
SLIDE 26

Product of Gaussians is Gaussian

p(x \mid y) \propto p(y \mid x)\, p(x) \propto e^{-\frac{1}{2}\frac{1}{\sigma_n^2}(x-y)^2} \cdot e^{-\frac{1}{2}\frac{1}{\sigma_x^2}(x-\mu_x)^2} = e^{-\frac{1}{2}\left[\left(\frac{1}{\sigma_n^2}+\frac{1}{\sigma_x^2}\right)x^2 \;-\; 2\left(\frac{y}{\sigma_n^2}+\frac{\mu_x}{\sigma_x^2}\right)x \;+\; \ldots\right]}

Completing the square shows that this posterior is also Gaussian, with mean given by an average of y and \mu_x, weighted by the inverse variances!
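A minimal numerical check (numpy assumed; the particular means and variances are illustrative): multiplying two Gaussian densities and renormalizing gives a Gaussian whose mean is the inverse-variance-weighted average and whose precision is the sum of the precisions.

```python
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu)**2 / var) / np.sqrt(2 * np.pi * var)

y, var_n = 2.0, 1.0          # "measurement" Gaussian, centered at y
mu_x, var_x = 0.0, 4.0       # prior Gaussian

post = gauss(x, y, var_n) * gauss(x, mu_x, var_x)
post /= post.sum() * dx      # renormalize the product

post_mean = np.sum(x * post) * dx
post_var = np.sum((x - post_mean)**2 * post) * dx

pred_var = 1.0 / (1.0 / var_n + 1.0 / var_x)          # precisions add
pred_mean = pred_var * (y / var_n + mu_x / var_x)     # inverse-variance weighting
print(post_mean, pred_mean)   # both about 1.6
print(post_var, pred_var)     # both about 0.8
```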

SLIDE 27

Gaussian densities

[Figure: bivariate Gaussian density with mean [0.2, 0.8] and covariance [1.0 -0.3; -0.3 0.4]]

\vec{x} \sim N(\vec{\mu}, C); let P = C^{-1} (known as the “precision” matrix).

Marginal:  p(x_1) = \int p(\vec{x})\, dx_2  is Gaussian.

Conditional:  p(x_1 \mid x_2)  is also Gaussian.
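A minimal simulation sketch (numpy assumed) using the mean and covariance shown in the figure above: samples from the bivariate Gaussian have the expected marginal mean/variance, and the conditional mean/variance of x1 given x2 approximately follow the standard bivariate-Gaussian formulas.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.8])
C = np.array([[1.0, -0.3],
              [-0.3, 0.4]])                       # mean and covariance from the figure

x = rng.multivariate_normal(mu, C, size=200_000)

# Marginal of x1: Gaussian with mean mu[0] and variance C[0, 0]
print(x[:, 0].mean(), x[:, 0].var())              # about 0.2 and 1.0

# Conditional of x1 given x2 near a fixed value (standard bivariate-Gaussian formulas):
x2_val = 1.0
sel = np.abs(x[:, 1] - x2_val) < 0.02              # thin slice around x2 = 1.0
cond_mean = mu[0] + C[0, 1] / C[1, 1] * (x2_val - mu[1])
cond_var = C[0, 0] - C[0, 1]**2 / C[1, 1]
print(x[sel, 0].mean(), cond_mean)                 # about 0.05
print(x[sel, 0].var(), cond_var)                   # about 0.775
```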

SLIDE 28

Generalized marginals of a Gaussian

The projection of \vec{x} onto a unit vector \hat{u},  z = \hat{u}^T \vec{x},  is Gaussian, with:

\mu_z = \hat{u}^T \vec{\mu}_x, \qquad \sigma_z^2 = \hat{u}^T C_x \hat{u}

[Figure: 700 samples from a bivariate Gaussian with true mean [0, 0.8] and true covariance [1.0 -0.25; -0.25 0.3]; measurement (sampling) followed by inference gives sample mean [-0.05, 0.83] and sample covariance [0.95 -0.23; -0.23 0.29], close to the true density.]
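A minimal numerical check (numpy assumed; the unit vector is an arbitrary illustrative choice) that a 1-D projection of Gaussian samples has the predicted mean and variance:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.8])
C = np.array([[1.0, -0.25],
              [-0.25, 0.3]])                      # "true" mean and covariance from the slide

u = np.array([1.0, 1.0]) / np.sqrt(2.0)           # an arbitrary unit vector
x = rng.multivariate_normal(mu, C, size=100_000)
z = x @ u                                         # generalized marginal: project each sample

print(z.mean(), u @ mu)                           # both about 0.566 (= u^T mu)
print(z.var(), u @ C @ u)                         # both about 0.4 (= u^T C u)
```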

SLIDE 29

Central limit for a uniform distribution...

[Figure: histograms of 10,000 samples of a uniform density (sigma = 1), and of (u+u)/sqrt(2), (u+u+u+u)/sqrt(4), and the sum of 10 u’s divided by sqrt(10), looking increasingly Gaussian]

Central limit for a binary distribution...

[Figure: histograms of one coin, and of the average of 4, 16, 64, and 256 coins, looking increasingly Gaussian]
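A minimal simulation sketch (numpy assumed; sample sizes chosen to mirror the panels) of both figures: scaled sums of uniform samples keep unit variance while their shape becomes Gaussian, and the average of many coin flips concentrates with standard deviation shrinking as 1/(2 sqrt(n)).

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 10_000

# Uniform case: sums of n uniform samples, rescaled by sqrt(n) so the variance stays 1.
for n in (1, 2, 4, 10):
    u = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(reps, n))   # uniform with sigma = 1
    s = u.sum(axis=1) / np.sqrt(n)
    print("uniform, n =", n, " var:", s.var())

# Binary case: averages of n fair coin flips.
for n in (1, 4, 16, 64, 256):
    coins = rng.integers(0, 2, size=(reps, n))
    avg = coins.mean(axis=1)
    print("coins, n =", n, " mean:", avg.mean(), " std:", avg.std())   # std ~ 0.5 / sqrt(n)
```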