Probability, Statistics and Inference Probability : an abstract - - PDF document
Mathematical Tools for Neural and Cognitive Science
Fall semester, 2017
Probability, Statistics and Inference
Probability: an abstract mathematical framework for describing random quantities (e.g., measurements)
Statistics: use of
Probabilistic Middleville
The stork delivers boys and girls randomly, with equal probability.
[Diagram labels: probabilistic model, data, statistical inference]
In Middleville, every family has two children, brought by the stork. You pick a family at random and discover that one of the children is a girl. What is the probability that the other child is a girl?
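One way to check an answer to this puzzle is by simulation. The sketch below assumes that "one of the children is a girl" means "at least one child is a girl"; a different reading (for example, observing one specific child) gives a different answer. The function name and sample size are illustrative.

```python
import random

def simulate_other_is_girl(n_families=200_000, seed=0):
    """Estimate P(both girls | at least one girl) for two-child families."""
    rng = random.Random(seed)
    at_least_one_girl = 0
    both_girls = 0
    for _ in range(n_families):
        # Each child is independently a girl with probability 1/2.
        kids = (rng.random() < 0.5, rng.random() < 0.5)
        if kids[0] or kids[1]:
            at_least_one_girl += 1
            if kids[0] and kids[1]:
                both_girls += 1
    return both_girls / at_least_one_girl

estimate = simulate_other_is_girl()
print(f"simulated: {estimate:.3f}, exact under this reading: {1/3:.3f}")
```

Of the four equally likely families (GG, GB, BG, BB), three contain a girl and only one of those has two, which is where 1/3 comes from.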
Statistical Middleville
In Middleville, every family has two children, brought by the stork. You pick a family at random and discover that one of the children is a girl. What is the probability that the other child is a girl?
In a survey of 100 Middleville families, 32 have two girls, 24 have two boys, and the remainder have one of each.
The stork delivers boys and girls randomly, with equal probability.
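With survey data, the same question can be answered from observed frequencies rather than from an assumed model. A minimal sketch, again reading "one of the children is a girl" as "at least one girl"; the variable names are illustrative.

```python
from fractions import Fraction

# Survey counts from the slide: 100 families in total.
n_total = 100
n_two_girls = 32
n_two_boys = 24
n_mixed = n_total - n_two_girls - n_two_boys  # "the remainder"

# Families with at least one girl are the two-girl and the mixed families.
n_at_least_one_girl = n_two_girls + n_mixed

# Empirical estimate of P(other child is a girl | at least one girl).
p_survey = Fraction(n_two_girls, n_at_least_one_girl)
p_model = Fraction(1, 3)  # value implied by the equal-probability stork model

print(f"survey estimate: {p_survey} = {float(p_survey):.3f}")
print(f"model value:     {p_model} = {float(p_model):.3f}")
```

Whether the gap between the two numbers is meaningful with only 100 families is exactly the kind of question statistical inference addresses.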
- Efron & Tibshirani, Introduction to the Bootstrap
Some historical context
- 1600’s: Early notions of data summary/averaging
- 1700’s: Bayesian prob/statistics (Bayes, Laplace)
- 1920’s: Frequentist statistics for science (e.g., Fisher)
- 1940’s: Statistical signal analysis and communication,
estimation/decision theory (Shannon, Wiener, etc)
- 1970’s: Computational optimization and simulation
(e.g., Tukey)
- 1990’s: Machine learning (large-scale computing +
statistical inference + lots of data)
- Since 1950’s: statistical neural/cognitive models
Scientific process
Create/modify hypothesis/model → Generate predictions, design experiment → Observe / measure data → Summarize/fit, compare with predictions → (back to Create/modify hypothesis/model)
Estimating model parameters
- How do I compute the estimate?
(mathematics vs. numerical optimization)
- How “good” are my estimates?
- How well does my model explain the data?
Future data (prediction/generalization)?
- How do I compare two (or more) models?
Outline of what’s coming
Themes:
- Uni-variate vs. multi-variate
- Discrete vs. continuous
- Math vs. simulation
- Bayesian vs. frequentist inference
Topics:
- Descriptive statistics
- Basic probability theory: univariate, multivariate
- Model parameter estimation
- Hypothesis testing / model comparison
Example: Localization
Issues: Mean and variability (accuracy and precision)
Descriptive statistics: Central tendency
- We often summarize data with the average. Why?
- Average minimizes the squared error (think regression!)
- More generally, for Lp norms:
- minimum L1 norm: median
- minimum L0 norm: mode
- Issues: Data from a common source, outliers,
asymmetry, bimodality
$$\arg\min_{\hat{x}} \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{x})^2 = \frac{1}{N} \sum_{n=1}^{N} x_n$$
$$\left[ \frac{1}{N} \sum_{n=1}^{N} |x_n - \hat{x}|^p \right]^{1/p}$$
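The claims that the average minimizes squared error and the median minimizes absolute (L1) error can be checked by brute force. The sketch below searches a grid of candidate summaries; the data values, grid, and function name are made up for illustration.

```python
import numpy as np

def best_constant(data, p, grid):
    """Return the grid value x_hat minimizing mean(|x_n - x_hat|**p)."""
    costs = [np.mean(np.abs(data - x_hat) ** p) for x_hat in grid]
    return grid[int(np.argmin(costs))]

# Hypothetical data with one outlier, chosen only for illustration.
data = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 6.0, 17.0])
grid = np.linspace(0.0, 20.0, 2001)  # candidate summaries, step 0.01

x_l2 = best_constant(data, 2, grid)  # squared error
x_l1 = best_constant(data, 1, grid)  # absolute error

print(f"L2 minimizer {x_l2:.2f} vs mean   {data.mean():.2f}")
print(f"L1 minimizer {x_l1:.2f} vs median {np.median(data):.2f}")
```

The outlier pulls the mean well above the median, which illustrates the "outliers" issue listed on the slide.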
Descriptive statistics: Dispersion
- Sample variance
- Why N-1?
- Sample standard deviation
- Mean absolute deviation
Sample variance: $$s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2$$
Mean absolute deviation: $$\frac{1}{N} \sum_{i=1}^{N} |x_i - \bar{x}|$$
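One way to see why the sample variance divides by N-1 is to simulate many small samples and average the two candidate estimates. Deviations are measured from the sample mean, which is itself fit to the data, so dividing by N underestimates on average. The population variance and sample size below are assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0          # assumed population variance (sigma = 2)
n, n_trials = 5, 100_000

# Draw many small samples; each row is one sample of size n.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(n_trials, n))

var_div_n = samples.var(axis=1, ddof=0)          # divide by N
var_div_n_minus_1 = samples.var(axis=1, ddof=1)  # divide by N - 1

print(f"true variance:               {true_var}")
print(f"average of 1/N estimate:     {var_div_n.mean():.3f}")
print(f"average of 1/(N-1) estimate: {var_div_n_minus_1.mean():.3f}")
```

The 1/N estimate averages to roughly (N-1)/N times the true variance, while the 1/(N-1) estimate averages to the true variance.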
Example: Localization
I find that $\bar{x} \neq 0$. Is that convincing? Is the apparent bias real? To answer this, we need tools from probability…
Probability: notation
Let X, Y, Z be random variables. They can take on values (like ‘heads’ or ‘tails’; or integers 1-6; or real-valued numbers).
Let x, y, z stand generically for values they can take, and also, in shorthand, for events like X = x.
We write the probability that X takes on value x as P(X = x), or P_X(x), or sometimes just P(x).
P(x) is a function over x, which we call the probability “distribution” function (pdf) (or, for continuous variables only, “density”).
[Figure: a distribution P(x) (the sum of 2 dice rolls), a discrete pdf; another distribution p(x) (IQ of a randomly chosen person), a continuous pdf]
Normalization
Discrete P(x): $0 \le P(x_i) \le 1$, $\sum_i P(x_i) = 1$
Continuous p(x): $0 \le p(x)$, $\int_{-\infty}^{\infty} p(x)\, dx = 1$
Probability basics
- discrete probability distributions
- continuous probability densities
- cumulative distributions
- translation and scaling of distributions
- monotonic nonlinear transformations
- drawing samples from a distribution.
- Uniform. Inverse cumulative mapping
- example densities/distributions
[on board]
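The "Uniform. Inverse cumulative mapping" item can be sketched as follows: draw u uniformly on [0, 1) and pass it through the inverse of the target CDF. The exponential distribution is used here only because its CDF inverts in closed form; the rate value and function name are illustrative.

```python
import math
import random

def sample_exponential(rate, n, seed=0):
    """Draw n exponential samples by inverse-CDF mapping of uniforms."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        u = rng.random()  # uniform on [0, 1)
        # CDF: F(x) = 1 - exp(-rate * x)  =>  F^{-1}(u) = -ln(1 - u) / rate
        samples.append(-math.log(1.0 - u) / rate)
    return samples

rate = 2.0  # hypothetical rate parameter
draws = sample_exponential(rate, n=100_000)
sample_mean = sum(draws) / len(draws)
print(f"sample mean {sample_mean:.3f}, theoretical mean {1 / rate:.3f}")
```

The same recipe works for any distribution whose cumulative function can be inverted, analytically or by table lookup.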
Example distributions
- roll of a fair die
- a not-quite-fair coin
- sum of two rolled fair dice
- clicks of a Geiger counter, in a fixed time interval
- ... and, time between clicks
- horizontal velocity of gas molecules exiting a fan
Expected value - discrete
$E(X) = \sum_{i=1}^{N} x_i\, p(x_i)$ [the mean, $\mu$]
[Figure: # of students (×10^4) vs. # of credit cards, and the corresponding P(x) vs. # of credit cards]
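As a concrete instance of the discrete expected value, the sketch below computes the mean of a fair die and of a loaded die. The loaded probabilities are invented for illustration.

```python
from fractions import Fraction

def expected_value(pmf):
    """E(X) = sum_i x_i p(x_i) for a dict mapping value -> probability."""
    assert sum(pmf.values()) == 1, "probabilities must sum to 1"
    return sum(x * p for x, p in pmf.items())

fair_die = {x: Fraction(1, 6) for x in range(1, 7)}

# Hypothetical loaded die: a six comes up half the time.
loaded_die = {x: Fraction(1, 10) for x in range(1, 6)}
loaded_die[6] = Fraction(1, 2)

mean_fair = expected_value(fair_die)
mean_loaded = expected_value(loaded_die)
print(f"fair die:   E(X) = {mean_fair} = {float(mean_fair)}")
print(f"loaded die: E(X) = {mean_loaded} = {float(mean_loaded)}")
```

Note that the mean of the fair die is not a value the die can actually show.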
Expected value - continuous
$E(x) = \int x\, p(x)\, dx$ [the mean, $\mu$]
$E(x^2) = \int x^2\, p(x)\, dx$ [the “second moment”]
$E\big((x - \mu)^2\big) = \int (x - \mu)^2\, p(x)\, dx = \int x^2\, p(x)\, dx - \mu^2$ [the variance, $\sigma^2$]
$E(f(x)) = \int f(x)\, p(x)\, dx$
note: an inner product, and thus linear, i.e.,
$E(a f(X) + b g(X)) = a E(f(X)) + b E(g(X))$
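The second equality in the variance expression follows from expanding the square and using linearity together with normalization, $\int p(x)\, dx = 1$. A short derivation in the same notation:

```latex
\begin{aligned}
E\big((x-\mu)^2\big) &= \int (x^2 - 2\mu x + \mu^2)\, p(x)\, dx \\
&= \int x^2\, p(x)\, dx \;-\; 2\mu \int x\, p(x)\, dx \;+\; \mu^2 \int p(x)\, dx \\
&= E(x^2) - 2\mu^2 + \mu^2 \\
&= E(x^2) - \mu^2
\end{aligned}
```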
Joint and conditional probability - discrete
- P(Ace)
- P(Heart)
- P(Ace & Heart)
- P(Ace | Heart)
- P(not Jack of Diamonds)
- P(Ace | not Jack of Diamonds)
- “Independence”
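These quantities can be computed by counting cards, assuming a standard 52-card deck with every card equally likely. The helper name `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 equally likely cards

def prob(event, given=lambda card: True):
    """P(event | given) by counting equally likely cards."""
    conditioned = [c for c in deck if given(c)]
    return Fraction(sum(1 for c in conditioned if event(c)), len(conditioned))

def is_ace(card):
    return card[0] == "A"

def is_heart(card):
    return card[1] == "hearts"

def not_jd(card):
    return card != ("J", "diamonds")

p_ace = prob(is_ace)
p_heart = prob(is_heart)
p_ace_and_heart = prob(lambda c: is_ace(c) and is_heart(c))
p_ace_given_heart = prob(is_ace, given=is_heart)
p_not_jd = prob(not_jd)
p_ace_given_not_jd = prob(is_ace, given=not_jd)

print(p_ace, p_heart, p_ace_and_heart, p_ace_given_heart, p_not_jd, p_ace_given_not_jd)
```

Knowing the card is a heart leaves P(Ace) unchanged, so Ace and Heart are independent. Knowing it is not the Jack of Diamonds changes P(Ace) slightly, so those two events are not.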
Multi-variate probability
- Joint distributions
- Marginals (integrating)
- Conditionals (slicing)
- Bayes’ Rule (inverting)
- Statistical independence (separability)
[on board]
Marginal distribution
$p(x) = \int p(x, y)\, dy$
[Figure: joint density p(x, y) and its marginal p(x)]
Conditional probability
[Venn diagram: regions A, B, A & B, and neither A nor B]
p(A | B) = probability of A given that B is asserted to be true = p(A & B) / p(B)
Conditional distribution
$$p(x \mid y = 68) = \frac{p(x, y = 68)}{\int p(x, y = 68)\, dx} = \frac{p(x, y = 68)}{p(y = 68)}$$
[Figure: joint density p(x, y) and the conditional P(x | Y = 68)]
Conditional distribution
slice joint distribution → normalize (by marginal)
More generally:
- p(x|y) = p(x, y)/p(y)
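For a discrete joint distribution stored as a table, marginalizing is a sum along one axis and conditioning is "slice, then normalize". The table entries below are hypothetical.

```python
import numpy as np

# Hypothetical joint distribution P(x, y): rows index x, columns index y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.05, 0.25, 0.30]])
assert np.isclose(joint.sum(), 1.0)

p_x = joint.sum(axis=1)  # marginal of x (sum over y)
p_y = joint.sum(axis=0)  # marginal of y (sum over x)

# Conditional P(x | y = y_j): take column j (slice), divide by P(y_j) (normalize).
j = 2
p_x_given_y = joint[:, j] / p_y[j]

print("P(x)         =", p_x)
print("P(y)         =", p_y)
print("P(x | y=y_2) =", p_x_given_y)
```

The normalization step is what turns a slice of the joint (which sums to P(y_j)) into a proper distribution over x.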
Bayes’ Rule
[Venn diagram: regions A, B, A & B]
p(A & B) = p(B) p(A | B) = p(A) p(B | A)
⇒ p(A | B) = p(B | A) p(A) / p(B)
p(A | B) = probability of A given that B is asserted to be true = p(A & B) / p(B)
Bayes’ Rule
p(x|y) = p(y|x) p(x)/p(y)
(a direct consequence of the definition of conditional probability)
Conditional vs. marginal
[Figure: conditional P(x | Y = 120) compared with marginal P(x)]
In general, these differ. When are they the same? In particular, when are all conditionals equal to the marginal?
- Statistical independence
Random variables X and Y are statistically independent if (and only if):
p(x, y) = p(x)p(y) ∀ x, y
[note: for discrete distributions, this is an outer product!]
Independence implies that all conditionals are equal to the corresponding marginal:
p(x | y) = p(x, y) / p(y) = p(x) ∀ x, y
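The outer-product remark can be made concrete: build a joint table from two marginals, then confirm that every conditional equals the marginal. The marginals and the perturbation below are hypothetical.

```python
import numpy as np

# Hypothetical marginals for two discrete variables.
p_x = np.array([0.2, 0.5, 0.3])
p_y = np.array([0.6, 0.4])

# Under independence the joint is the outer product: P(x, y) = P(x) P(y).
joint_indep = np.outer(p_x, p_y)

# Every conditional P(x | y) (each normalized column) equals the marginal P(x).
conditionals = joint_indep / joint_indep.sum(axis=0, keepdims=True)
print(np.allclose(conditionals, p_x[:, None]))

# A dependent joint with the same marginals is not separable.
joint_dep = joint_indep.copy()
joint_dep[0, 0] += 0.05
joint_dep[0, 1] -= 0.05
joint_dep[1, 0] -= 0.05
joint_dep[1, 1] += 0.05
is_separable = np.allclose(
    joint_dep, np.outer(joint_dep.sum(axis=1), joint_dep.sum(axis=0))
)
print(is_separable)
```

The second table has exactly the same marginals as the first, which shows that marginals alone do not determine the joint.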
Sums of independent RVs
For any two random variables (independent or not):
$E(X + Y) = E(X) + E(Y)$
Suppose X and Y are independent, then
$E(XY) = E(X)E(Y)$
$\sigma^2_{X+Y} = E\big( ((X + Y) - (\mu_X + \mu_Y))^2 \big) = \sigma_X^2 + \sigma_Y^2$
and $p_{X+Y}(z)$ is a convolution
Implications: (1) Sums of Gaussians are Gaussian, (2) Properties of the sample average
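The convolution property is easy to see with two fair dice: convolving the single-die distribution with itself gives the distribution of the sum, and the means and variances add. A minimal sketch:

```python
import numpy as np

die = np.full(6, 1 / 6)          # P(X = 1..6) for one fair die
sum_pmf = np.convolve(die, die)  # P(X + Y = 2..12) for two independent dice
totals = np.arange(2, 13)

def mean_var(values, pmf):
    """Mean and variance of a discrete distribution."""
    mu = np.sum(values * pmf)
    return mu, np.sum((values - mu) ** 2 * pmf)

mu_die, var_die = mean_var(np.arange(1, 7), die)
mu_sum, var_sum = mean_var(totals, sum_pmf)

print(f"P(sum = 7) = {sum_pmf[totals == 7][0]:.4f}")
print(f"mean: {mu_sum:.3f} vs {2 * mu_die:.3f}")
print(f"variance: {var_sum:.3f} vs {2 * var_die:.3f}")
```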
Mean and variance
- Mean and variance summarize centroid/width
- translation and rescaling of random variables
- nonlinear transformations - “warping”
- Mean/variance of weighted sum of random variables
- The sample average
- ... converges to true mean (except for bizarre distributions)
- ... with variance $\sigma^2/N$
- ... most common choice for an estimate ...
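For N independent measurements that each have variance σ², the sample average has variance σ²/N. The simulation below checks this; the Gaussian measurement model, the value of σ, and the function name are assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 3.0          # assumed standard deviation of a single measurement
n_repeats = 20_000   # number of simulated experiments per sample size

def variance_of_average(n):
    """Empirical variance of the average of n independent measurements."""
    # Each row is one experiment with n measurements; average within rows.
    averages = rng.normal(0.0, sigma, size=(n_repeats, n)).mean(axis=1)
    return averages.var()

for n in (1, 4, 16, 64):
    print(f"N={n:3d}  var of average={variance_of_average(n):.3f}  "
          f"sigma^2/N={sigma**2 / n:.3f}")
```

Quadrupling the number of measurements cuts the variance by four, so the standard deviation of the average only halves.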
Point Estimates
- Estimator: Any function of the data, intended to compute an estimate of the true value of a parameter
- The most common estimator is the sample average, used to estimate the true mean of the distribution.
- Statistically-motivated examples:
- Maximum likelihood (ML):
- Max a posteriori (MAP):
- Min Mean Squared Error (MMSE):
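The three statistically motivated estimators are usually defined as follows. The notation is an assumption, not taken from the slide: x is the unknown parameter and d is the observed data.

```latex
\begin{aligned}
\hat{x}_{\mathrm{ML}}   &= \arg\max_{x}\; p(d \mid x) \\
\hat{x}_{\mathrm{MAP}}  &= \arg\max_{x}\; p(x \mid d) = \arg\max_{x}\; p(d \mid x)\, p(x) \\
\hat{x}_{\mathrm{MMSE}} &= \arg\min_{\hat{x}}\; E\big((x - \hat{x})^2 \mid d\big) = E(x \mid d)
\end{aligned}
```

ML uses only the likelihood, MAP picks the peak of the posterior, and MMSE is the posterior mean.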
Example: Estimate the bias of a coin
Bayes’ Rule and Estimation
p(parameter value | data) = p(data | parameter value) p(parameter value) / p(data)
Posterior: p(parameter value | data)
Likelihood: p(data | parameter value)
Prior: p(parameter value)
Nuisance normalizing term: p(data)
[Figure: likelihood functions for 1 head and for 1 tail]
[Figure: grid of posteriors for H = 0, 1, 2, 3 heads (more heads) and T = 0, 1, 2, 3 tails (more tails)]
Posteriors, p(x|H,T) ∝ p(H,T|x), assuming prior p(x)=1
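With a flat prior, the posterior over the coin's heads probability x is proportional to the likelihood x^H (1-x)^T. The sketch below evaluates it on a grid and reads off three point estimates; the counts, grid resolution, and function name are illustrative.

```python
import numpy as np

def coin_estimates(heads, tails, n_grid=10_001):
    """ML, MAP and MMSE estimates of P(heads) under a flat prior on [0, 1]."""
    x = np.linspace(0.0, 1.0, n_grid)
    likelihood = x**heads * (1.0 - x) ** tails   # p(H, T | x) up to a constant
    prior = np.ones_like(x)                      # flat prior, p(x) = 1
    posterior = likelihood * prior
    posterior /= posterior.sum()                 # normalize on the grid

    x_ml = x[np.argmax(likelihood)]
    x_map = x[np.argmax(posterior)]
    x_mmse = np.sum(x * posterior)               # posterior mean
    return x_ml, x_map, x_mmse

x_ml, x_map, x_mmse = coin_estimates(heads=3, tails=1)
print(f"ML={x_ml:.3f}  MAP={x_map:.3f}  MMSE={x_mmse:.3f}")
```

Under a flat prior the ML and MAP estimates coincide at H/(H+T), while the posterior mean (H+1)/(H+T+2) is pulled toward 0.5.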
example
Infer whether a coin is fair by flipping it repeatedly. Here, x is the probability of heads (50% is fair), and y1...n are the outcomes of flips.
Consider three different priors: suspect fair, suspect biased, no idea.
[Figure: prior (fair, biased, uncertain) × likelihood (heads) = posterior]
[Figure: previous posteriors × likelihood (heads) = new posterior; previous posteriors × likelihood (tails) = new posterior]
Posteriors after observing 75 heads, 25 tails → prior differences are ultimately overwhelmed by data
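The sequential update "previous posterior × likelihood = new posterior" can be sketched on a grid. The three prior shapes below are assumptions chosen to resemble "suspect fair", "suspect biased", and "no idea"; they are not the priors used in the slides.

```python
import numpy as np

x = np.linspace(0.001, 0.999, 999)  # candidate values for P(heads)

def normalize(p):
    return p / p.sum()

# Illustrative prior shapes (assumptions, not the ones from the slides).
priors = {
    "suspect fair":   normalize(x**4 * (1 - x) ** 4),       # peaked at 0.5
    "suspect biased": normalize(1 / np.sqrt(x * (1 - x))),  # mass near 0 and 1
    "no idea":        normalize(np.ones_like(x)),           # flat
}

def update(belief, flip):
    """One Bayes step: previous posterior x likelihood of this flip, renormalized."""
    likelihood = x if flip == "H" else 1 - x
    return normalize(belief * likelihood)

flips = ["H"] * 75 + ["T"] * 25  # 75 heads, 25 tails (the order does not matter)
posterior_means = {}
for name, belief in priors.items():
    for flip in flips:
        belief = update(belief, flip)
    posterior_means[name] = float(np.sum(x * belief))
    print(f"{name:15s} posterior mean = {posterior_means[name]:.3f}")
```

All three priors have mean 0.5, yet after 100 flips the posterior means sit close together near the observed fraction of heads.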
Confidence
[Figure: posterior PDFs and CDFs for 2H / 1T, 10H / 5T, and 20H / 10T, with CDF levels .025 and .975 marked; values read off the plots: .19, .93, .49, .80]
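A central posterior interval can be read off a CDF by finding where it crosses the chosen tail probabilities. The sketch assumes a flat prior and that the .025 and .975 marks denote a central 95% interval; the function name and grid size are illustrative.

```python
import numpy as np

def central_interval(heads, tails, mass=0.95, n_grid=100_001):
    """Central posterior interval for P(heads), flat prior, via a grid CDF."""
    x = np.linspace(0.0, 1.0, n_grid)
    pdf = x**heads * (1.0 - x) ** tails   # unnormalized posterior
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]                        # normalize so the CDF ends at 1
    tail = (1.0 - mass) / 2.0
    lo = x[np.searchsorted(cdf, tail)]
    hi = x[np.searchsorted(cdf, 1.0 - tail)]
    return lo, hi

for heads, tails in [(2, 1), (10, 5), (20, 10)]:
    lo, hi = central_interval(heads, tails)
    print(f"{heads}H / {tails}T: [{lo:.2f}, {hi:.2f}]")
```

All three data sets have the same 2:1 ratio of heads to tails, but the interval narrows as the number of flips grows.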
Bias & Variance
- Mean squared error = bias^2 + variance
- Bias is difficult to assess (requires knowing the “true”
value). Variance is easier.
- Classical statistics generally aims for an unbiased
estimator, with minimal variance (“MVUE”).
- The MLE is asymptotically unbiased (under fairly
general conditions), but this is only useful if
- the likelihood model is correct
- the optimum can be computed
- you have enough data
- More general/modern view: estimation is about trading
off bias and variance, through model selection,
“regularization”, or Bayesian priors.
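The decomposition and the trade-off can both be seen in a small simulation. Here the sample mean is shrunk toward zero by a fixed factor, a deliberately simple stand-in for regularization. The true mean, noise level, sample size, and shrinkage factors are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
true_mean, sigma, n = 1.0, 2.0, 5   # hypothetical ground truth and sample size
n_trials = 200_000

samples = rng.normal(true_mean, sigma, size=(n_trials, n))
sample_means = samples.mean(axis=1)

def decompose(estimates, truth):
    """Return (mse, bias^2, variance) of a set of repeated estimates."""
    mse = np.mean((estimates - truth) ** 2)
    bias_sq = (np.mean(estimates) - truth) ** 2
    variance = np.var(estimates)
    return mse, bias_sq, variance

for shrink in (1.0, 0.8, 0.5):
    # shrink = 1.0 is the unbiased sample mean; smaller values pull toward 0.
    mse, bias_sq, variance = decompose(shrink * sample_means, true_mean)
    print(f"shrink={shrink:.1f}  mse={mse:.3f}  bias^2={bias_sq:.3f}  var={variance:.3f}")
```

In this setup the biased, shrunken estimators reach a lower mean squared error than the unbiased sample mean, because the drop in variance outweighs the added squared bias. That would not hold for every true mean.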
Bayesian Model Comparison
- Is the coin fair? Compared to what?
- Point hypotheses:
M1: p = p1 = 0.5
M2: p = p2 = 0.6
$$p(M_1 \mid D) = \frac{p(D \mid M_1)\, P(M_1)}{p(D)} = \frac{p(D \mid M_1)\, P(p_1)}{p(D)}$$
Assuming equal priors over models, the Bayes factor is
$$\frac{p(M_1 \mid D)}{p(M_2 \mid D)} = \frac{p(D \mid M_1)\, P(M_1)}{p(D \mid M_2)\, P(M_2)} = \frac{p(D \mid M_1)\, P(p_1)}{p(D \mid M_2)\, P(p_2)}$$
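For two point hypotheses the ratio reduces to a likelihood ratio when the model priors are equal. The sketch below evaluates it in log space for a hypothetical data set of 60 heads and 40 tails; the counts are made up for illustration.

```python
import math

def log_likelihood(p, heads, tails):
    """Log probability of a specific flip sequence with the given counts."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

heads, tails = 60, 40   # hypothetical data D, for illustration only
p1, p2 = 0.5, 0.6       # the two point hypotheses from the slide

log_bf = log_likelihood(p1, heads, tails) - log_likelihood(p2, heads, tails)
bayes_factor = math.exp(log_bf)   # p(D | M1) / p(D | M2)

# With equal model priors, the posterior odds equal the Bayes factor.
posterior_m1 = bayes_factor / (1.0 + bayes_factor)
print(f"Bayes factor M1:M2 = {bayes_factor:.3f}, p(M1 | D) = {posterior_m1:.3f}")
```

Working with log likelihoods avoids underflow when the number of flips is large.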
Bayesian Model Comparison
- Is the coin fair? Compared to what?
- Alternative hypothesis:
M1: p = p1 = 0.5
M2: p ≠ 0.5
$$p(M_2 \mid D) = \frac{p(D \mid M_2)\, p(M_2)}{p(D)} = \left[ \int_0^1 p(D \mid M_2, p_{coin})\, p(p_{coin})\, dp_{coin} \right] \frac{P(M_2)}{p(D)}$$
Compute Bayes factor as before.
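For the composite alternative, the likelihood must be averaged over the prior on p_coin. The sketch below assumes a uniform prior on [0, 1] under M2 (the slide does not specify one), for which the integral has the closed form H! T! / (H+T+1)!. The data counts are hypothetical.

```python
import math

def log_marginal_uniform_prior(heads, tails):
    """log of integral_0^1 p^H (1-p)^T dp = log[H! T! / (H+T+1)!]."""
    return (math.lgamma(heads + 1) + math.lgamma(tails + 1)
            - math.lgamma(heads + tails + 2))

def marginal_by_quadrature(heads, tails, n=100_000):
    """Midpoint-rule check of the same integral."""
    h = 1.0 / n
    return sum(((i + 0.5) * h) ** heads * (1 - (i + 0.5) * h) ** tails
               for i in range(n)) * h

heads, tails = 60, 40  # hypothetical data D, for illustration only
log_m1 = (heads + tails) * math.log(0.5)            # log p(D | M1), p = 0.5
log_m2 = log_marginal_uniform_prior(heads, tails)   # log p(D | M2)
bayes_factor = math.exp(log_m1 - log_m2)            # M1 : M2

print(f"closed form p(D | M2) = {math.exp(log_m2):.3e}")
print(f"quadrature  p(D | M2) = {marginal_by_quadrature(heads, tails):.3e}")
print(f"Bayes factor M1:M2 = {bayes_factor:.3f}")
```

A broad alternative spreads its prior over many values of p_coin that fit the data poorly, so it is penalized relative to a well-chosen point hypothesis.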
The Gaussian
- parameterized by mean and stdev (position / width)
- joint density of two indep Gaussian RVs is circular! [easy]
- product of two Gaussians is Gaussian! [easy]
- conditionals of a Gaussian are Gaussian! [easy]
- sum of Gaussian RVs is Gaussian! [moderate]
- marginals of a Gaussian are Gaussian! [moderate]
- central limit theorem: sum of many RVs is Gaussian! [hard]
- most random (max entropy) density with this variance! [moderate]
Product of Gaussians is Gaussian
$$p(x \mid y) \propto p(y \mid x)\, p(x) \propto e^{-\frac{1}{2} \frac{1}{\sigma_n^2} (x - y)^2} \cdot e^{-\frac{1}{2} \frac{1}{\sigma_x^2} (x - \mu_x)^2} = e^{-\frac{1}{2} \left[ \left( \frac{1}{\sigma_n^2} + \frac{1}{\sigma_x^2} \right) x^2 - 2 \left( \frac{y}{\sigma_n^2} + \frac{\mu_x}{\sigma_x^2} \right) x + \ldots \right]}$$
Completing the square shows that this posterior is also Gaussian, with
$$\frac{1}{\sigma^2} = \frac{1}{\sigma_n^2} + \frac{1}{\sigma_x^2}, \qquad \mu = \frac{y / \sigma_n^2 + \mu_x / \sigma_x^2}{1 / \sigma_n^2 + 1 / \sigma_x^2}$$
(average, weighted by inverse variances!)
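The result of completing the square is that precisions (inverse variances) add, and the posterior mean is the precision-weighted average of the measurement y and the prior mean. The sketch below checks this closed form against a brute-force product on a grid; all numerical values are hypothetical.

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical values: measurement y with noise sigma_n, prior N(mu_x, sigma_x^2).
y, sigma_n = 2.0, 1.0
mu_x, sigma_x = 0.0, 2.0

# Closed form from completing the square (precisions add).
precision = 1 / sigma_n**2 + 1 / sigma_x**2
post_var = 1 / precision
post_mean = (y / sigma_n**2 + mu_x / sigma_x**2) / precision

# Numerical check: multiply likelihood and prior on a grid, then normalize.
x = np.linspace(-10, 10, 20_001)
dx = x[1] - x[0]
product = gaussian(y, x, sigma_n) * gaussian(x, mu_x, sigma_x)
posterior = product / (product.sum() * dx)
num_mean = np.sum(x * posterior) * dx
num_var = np.sum((x - num_mean) ** 2 * posterior) * dx

print(f"mean: closed form {post_mean:.4f}, numerical {num_mean:.4f}")
print(f"var:  closed form {post_var:.4f}, numerical {num_var:.4f}")
```

Because the measurement is more precise than the prior here, the posterior mean lands closer to y than to the prior mean.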
Gaussian densities
[Figure: 2-D Gaussian density; mean: [0.2, 0.8], cov: [1.0 -0.3; -0.3 0.4]]
$\vec{x} \sim N(\vec{\mu}, C)$, let $P = C^{-1}$ (known as the “precision” matrix)
Conditional: Gaussian
Marginal: $p(x_1) = \int p(\vec{x})\, dx_2$, Gaussian
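For reference, the standard parameters of the marginal and conditional of a Gaussian are given below. The block partition of the mean, covariance, and precision is notation assumed here, not recovered from the slide.

```latex
\vec{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad
\vec{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad
C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}, \quad
P = C^{-1} = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix}

\begin{aligned}
\text{Marginal:} \quad & p(x_1) = N\big(\mu_1,\; C_{11}\big) \\
\text{Conditional:} \quad & p(x_1 \mid x_2) = N\big(\mu_1 - P_{11}^{-1} P_{12}\,(x_2 - \mu_2),\; P_{11}^{-1}\big) \\
& \phantom{p(x_1 \mid x_2)} = N\big(\mu_1 + C_{12} C_{22}^{-1} (x_2 - \mu_2),\; C_{11} - C_{12} C_{22}^{-1} C_{21}\big)
\end{aligned}
```

The marginal is simplest in terms of the covariance, while the conditional is simplest in terms of the precision matrix.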
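Generalized marginals of a Gaussian
$z = \hat{u}^T \vec{x}$ is Gaussian, with:
$\mu_z = \hat{u}^T \vec{\mu}_x$, $\sigma_z^2 = \hat{u}^T C_x \hat{u}$
[Figure: 2-D Gaussian density with projection direction $\hat{u}$ and marginal p(z); labels: x1, x2, w]
[Figure: true density → Measurement (sampling) → 700 samples → Inference]
true mean: [0 0.8], true cov: [1.0 -0.25; -0.25 0.3]
sample mean: [-0.05 0.83], sample cov: [0.95 -0.23; -0.23 0.29]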
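The measurement and inference steps, and the projection formula for generalized marginals, can be sketched with the true mean and covariance quoted above. The random seed and the projection direction are arbitrary choices, so the printed sample statistics will not match the slide's numbers exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
true_mean = np.array([0.0, 0.8])
true_cov = np.array([[1.0, -0.25],
                     [-0.25, 0.3]])

# Measurement (sampling): draw 700 samples from the true density.
samples = rng.multivariate_normal(true_mean, true_cov, size=700)

# Inference: estimate the parameters from the samples.
sample_mean = samples.mean(axis=0)
sample_cov = np.cov(samples, rowvar=False)   # uses the N-1 normalization
print("sample mean:", np.round(sample_mean, 2))
print("sample cov:\n", np.round(sample_cov, 2))

# Generalized marginal: project onto a unit vector u_hat (arbitrary choice here).
u_hat = np.array([1.0, 1.0]) / np.sqrt(2.0)
z = samples @ u_hat
print(f"mean of z: {z.mean():.3f}  vs  u^T mu  = {u_hat @ true_mean:.3f}")
print(f"var of z:  {z.var(ddof=1):.3f}  vs  u^T C u = {u_hat @ true_cov @ u_hat:.3f}")
```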
Central limit for a uniform distribution...
[Figure: histograms of 10^4 samples: uniform dist (sigma=1); (u+u)/sqrt(2); (u+u+u+u)/sqrt(4); 10 u's divided by sqrt(10)]
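The convergence toward a Gaussian can be checked numerically. The sketch below sums k unit-variance uniform samples, rescales by sqrt(k) so the variance stays at 1, and tracks the excess kurtosis, which is zero for a Gaussian. Using kurtosis as the summary and the seed are choices made here, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples = 10_000

def excess_kurtosis(v):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    centered = v - v.mean()
    return np.mean(centered**4) / np.mean(centered**2) ** 2 - 3.0

half_width = np.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has sigma = 1
results = {}
for k in (1, 2, 4, 10):
    u = rng.uniform(-half_width, half_width, size=(n_samples, k))
    s = u.sum(axis=1) / np.sqrt(k)   # e.g. (u+u)/sqrt(2): variance stays 1
    results[k] = (s.var(), excess_kurtosis(s))
    print(f"k={k:2d}  variance={results[k][0]:.3f}  excess kurtosis={results[k][1]:.3f}")
```

A single uniform has excess kurtosis -1.2, and for a sum of k independent copies the value shrinks as -1.2/k.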
Central limit for a binary distribution...
[Figure: histograms for one coin, avg of 4 coins, avg of 16 coins, avg of 64 coins, avg of 256 coins]