Probability & Statistics: Intro, summary statistics, probability





Mathematical Tools for Neural and Cognitive Science

Probability & Statistics: Intro, summary statistics, probability

Fall semester, 2018


  • Efron & Tibshirani, Introduction to the Bootstrap, 1998



Some history…

  • 1600’s: Early notions of data summary/averaging
  • 1700’s: Bayesian prob/statistics (Bayes, Laplace)
  • 1920’s: Frequentist statistics for science (e.g., Fisher)
  • 1940’s: Statistical signal analysis and communication, estimation/decision theory (e.g., Shannon, Wiener, etc.)
  • 1950’s: Return of Bayesian statistics (e.g., Jeffreys, Wald, Savage, Jaynes…)
  • 1970’s: Computation, optimization, simulation (e.g., Tukey)
  • 1990’s: Machine learning (large-scale computing + statistical inference + lots of data)
  • Since the 1950’s: statistical neural/cognitive models

Scientific process

A repeating cycle:

  • Create/modify hypothesis/model
  • Generate predictions, design experiment
  • Observe / measure data
  • Summarize/fit model(s), compare with predictions



Descriptive statistics: Central tendency

  • We often summarize data with the average. Why?
  • Average minimizes the squared error (as in regression!):

    $\mu(\vec{x}) = \arg\min_c \frac{1}{N}\sum_{n=1}^{N} (x_n - c)^2 = \frac{1}{N}\sum_{n=1}^{N} x_n$

  • Generalize: minimize the Lp norm:

    $\arg\min_c \left[ \frac{1}{N}\sum_{n=1}^{N} |x_n - c|^p \right]^{1/p}$

    – minimize L1 norm: median, $m(\vec{x})$
    – minimize L0 norm: mode
    – minimize $L_\infty$ norm: midpoint of range

  • Issues: outliers, asymmetry, bimodality
  • How do we choose?
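The Lp minimizers above can be checked numerically. A minimal sketch on toy data (the data values are mine, not from the slides), confirming by brute force that the mean is the L2 minimizer:

```python
import numpy as np

# Toy data (not from the slides), with one outlier
x = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 10.0])

mu = x.mean()                       # L2 minimizer: the mean
med = np.median(x)                  # L1 minimizer: the median
midrange = (x.min() + x.max()) / 2  # L-infinity minimizer: midpoint of range

# Brute-force check that the mean minimizes the mean squared error
cs = np.linspace(x.min(), x.max(), 9001)
mse = ((x[None, :] - cs[:, None]) ** 2).mean(axis=1)
best_c = cs[np.argmin(mse)]
```

Note how the outlier at 10 pulls the mean (≈3.67) and the midrange (5.5) upward, while the median (2.5) is unaffected, which is the "outliers" issue in the bullet list.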



Descriptive statistics: Dispersion

  • Sample standard deviation

    $\sigma(\vec{x}) = \min_c \left[ \frac{1}{N}\sum_{n=1}^{N} (x_n - c)^2 \right]^{1/2} = \left[ \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu(\vec{x}))^2 \right]^{1/2}$

  • Mean absolute deviation (MAD) about the median

    $d(\vec{x}) = \frac{1}{N}\sum_{n=1}^{N} \left| x_n - m(\vec{x}) \right|$

  • Quantiles
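Both dispersion measures are one-liners. A quick sketch on the same kind of toy data (values mine, not from the slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 10.0])  # toy data, one outlier

# Sample standard deviation: root mean squared deviation about the mean
sigma = np.sqrt(((x - x.mean()) ** 2).mean())

# Mean absolute deviation (MAD) about the median
mad = np.abs(x - np.median(x)).mean()
```

As with central tendency, the L1-style measure (MAD ≈ 2.0) is less affected by the outlier than the L2-style measure (σ ≈ 2.98).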


Descriptive statistics: Dispersion

Summary statistics (e.g., sample mean/var) can be interpreted as estimates of model parameters. To formalize this, we need tools from probability…

data $\{x_n\}$ → histogram $\{c_k, h_k\}$ → probability distribution $p(x)$


data $\{\vec{x}_n\}$ ⇄ probabilistic model $p_\theta(\vec{x})$
(Measurement in one direction, Inference in the other)

Probabilistic Middleville

In Middleville, every family has two children, brought by the stork. The stork delivers boys and girls randomly, with family probability {BB, BG, GB, GG} = {0.2, 0.3, 0.2, 0.3}. [probabilistic model]

You pick a family at random and discover that one of the children is a girl. [data]

What are the chances that the other child is a girl? [inference]
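Working the inference through with the stated model (reading "discover that one of the children is a girl" as "at least one child is a girl"):

```python
# Probabilistic model from the slide
p = {"BB": 0.2, "BG": 0.3, "GB": 0.2, "GG": 0.3}

# Data: at least one of the two children is a girl
p_girl = sum(v for k, v in p.items() if "G" in k)  # P(at least one girl) = 0.8

# Inference: P(other child is also a girl | at least one girl) = P(GG) / 0.8
p_other_girl = p["GG"] / p_girl                    # 0.3 / 0.8 = 0.375
```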


Statistical Middleville

In Middleville, every family has two children, brought by the stork. In a survey of 100 of the Middleville families, 32 have two girls, 23 have two boys, and the remainder one of each.

Inference from the data: the stork delivers boys and girls randomly, with family probability {BB, BG, GB, GG} = {0.2, 0.3, 0.2, 0.3}.

You pick a family at random and discover that one of the children is a girl. What are the chances that the other child is a girl?
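One simple way to connect the survey to the inference question is to estimate the family probabilities by relative frequency and then condition, as before. This is my sketch, not the slide's calculation; note the raw frequencies (0.32, 0.23, 0.45) differ slightly from the slide's round-number model:

```python
# Survey data from the slide: 100 families
counts = {"GG": 32, "BB": 23, "mixed": 45}
total = sum(counts.values())                 # 100 families

# Estimate family probabilities by relative frequency
p_hat = {k: v / total for k, v in counts.items()}

# Redo the inference with the estimated model:
# P(other child is a girl | at least one girl)
p_girl = p_hat["GG"] + p_hat["mixed"]        # families with at least one girl
p_other_girl = p_hat["GG"] / p_girl          # 32/77, about 0.416
```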

Probability basics (outline)

  • distributions: discrete and continuous
  • expected value, moments
  • cumulative distributions, quantiles, Q-Q plots, drawing samples
  • transformations: affine, monotonic nonlinear


Probability: Definitions/notation

Let X, Y, Z be random variables. They can take on values (like ‘heads’ or ‘tails’; or integers 1–6; or real-valued numbers).

Let x, y, z stand generically for values they can take, and denote events such as X = x.

Write the probability that X takes on value x as P(X = x), or P_X(x), or sometimes just P(x).

P(x) is a function over values x, which we call the probability “distribution” function (pdf) (for continuous variables, “density”).

[Useful to have this notation up on the slide, while introducing concepts on the board.]

Probability distributions

Discrete random variable, $P(x)$:
$0 \le P(x_i) \le 1, \ \forall i$;  $\sum_i P(x_i) = 1$

Continuous random variable, $p(x)$:
$0 \le p(x)$;  $\int_{-\infty}^{\infty} p(x)\,dx = 1$


Example distributions

[Figure: histograms/PMFs for a not-quite-fair coin; the roll of a fair die; the sum of two rolled fair dice; clicks of a Geiger counter in a fixed time interval (and the time between clicks); the horizontal velocity of gas molecules exiting a fan.]

Expected value - discrete

$\mu = E(X) = \sum_{i=1}^{N} x_i \, p(x_i)$   [the mean, $\mu$]

More generally:

$E(f(X)) = \sum_{i=1}^{N} f(x_i)\, p(x_i)$

[Figure: example PMF P(x) over # of credit cards, x = 1–4]
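Both sums are dot products of values against probabilities. A minimal sketch, using a made-up PMF over # of credit cards (the probabilities are assumed for illustration, not read off the slide's figure):

```python
import numpy as np

# Hypothetical PMF over # of credit cards (values assumed, not from the slide)
x = np.array([1, 2, 3, 4])
p = np.array([0.6, 0.2, 0.1, 0.1])
assert np.isclose(p.sum(), 1.0)   # a valid PMF sums to 1

mu = np.sum(x * p)           # E(X) = sum_i x_i p(x_i)
e_f = np.sum((x ** 2) * p)   # E(f(X)) with f(x) = x^2, the second moment
```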


Expected value - continuous

$E(x) = \int x\, p(x)\, dx$   [mean, $\mu$]

$E(x^2) = \int x^2\, p(x)\, dx$   [“second moment”, $m_2$]

$E\left((x - \mu)^2\right) = \int (x - \mu)^2\, p(x)\, dx = \int x^2\, p(x)\, dx - \mu^2$   [variance, $\sigma^2$; equal to $m_2$ minus $\mu^2$]

$E(f(x)) = \int f(x)\, p(x)\, dx$   [“expected value of f”]

Note: this is an inner product, and thus linear:

$E(af(x) + bg(x)) = aE(f(x)) + bE(g(x))$

Cumulatives

$c(y) = \int_{-\infty}^{y} p(x)\, dx$

[Figure: a discrete PMF and a continuous density, each shown with its cumulative c(x).]


Drawing samples - discrete

[Figure: a discrete distribution and its cumulative; samples are drawn by mapping uniform values through the inverse of the cumulative.]
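The inverse-cumulative trick can be sketched in a few lines. The PMF here is a hypothetical example (chosen so its cumulative has convenient values), not data from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.125, 0.375, 0.375, 0.125])  # hypothetical PMF over values 0..3
c = np.cumsum(p)                            # cumulative: 0.125, 0.5, 0.875, 1.0

# Draw uniform samples in [0, 1) and invert the cumulative:
# each sample is the first bin whose cumulative value exceeds u
u = rng.random(100_000)
samples = np.searchsorted(c, u, side="right")

# Empirical frequencies approximate the PMF
freqs = np.bincount(samples, minlength=4) / len(u)
```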

Multi-variate probability [on board]

  • joint distributions
  • marginals (integrating)
  • conditionals (slicing)
  • Bayes’ rule (inverse probability)
  • statistical independence (separability)
  • linear transformations



Joint and conditional probability - discrete

  • P(Ace)
  • P(Heart)
  • P(Ace & Heart)
  • P(Ace | Heart)
  • P(not Jack of Diamonds)
  • P(Ace | not Jack of Diamonds)
  • “Independence”
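These card probabilities can all be computed by counting over a uniform 52-card deck, which is a nice sanity check on the definitions. A sketch using exact fractions:

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))   # 52 equally likely cards

def P(event):
    # Probability of an event = (# cards satisfying it) / 52
    return Fraction(sum(1 for c in deck if event(c)), len(deck))

p_ace = P(lambda c: c[0] == "A")                      # 4/52 = 1/13
p_heart = P(lambda c: c[1] == "hearts")               # 13/52 = 1/4
p_ace_and_heart = P(lambda c: c == ("A", "hearts"))   # 1/52

# Conditioning on Heart does not change the probability of Ace: independent
p_ace_given_heart = p_ace_and_heart / p_heart         # 1/13

# Conditioning on "not Jack of Diamonds" does change it: not independent
p_not_jd = P(lambda c: c != ("J", "diamonds"))                                # 51/52
p_ace_given_not_jd = P(lambda c: c[0] == "A" and c != ("J", "diamonds")) / p_not_jd  # 4/51
```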


Conditional probability

[Venn diagram: events A, B, their overlap A & B, and the region “neither A nor B”]

$p(A|B)$ = probability of A given that B is asserted to be true $= \dfrac{p(A \& B)}{p(B)}$

Conditional distribution

[Figure: joint density p(x, y), and the slice p(x | y = 68)]


Conditional distribution

$p(x \mid y=68) = \dfrac{p(x, y=68)}{\int p(x, y=68)\, dx} = \dfrac{p(x, y=68)}{p(y=68)}$

More generally:  $p(x \mid y) = p(x, y)/p(y)$

(slice the joint distribution, then normalize by the marginal)
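The slice-then-normalize recipe is two lines for a discrete joint. A sketch with a small hypothetical joint table (values mine, chosen to sum to 1):

```python
import numpy as np

# Hypothetical discrete joint p(x, y): rows index x, columns index y
joint = np.array([[0.10, 0.20],
                  [0.30, 0.15],
                  [0.10, 0.15]])

p_y = joint.sum(axis=0)       # marginal p(y): sum the joint over x
cond = joint[:, 0] / p_y[0]   # p(x | y=0): slice the joint, then normalize
```

The slice `joint[:, 0]` sums to p(y=0) = 0.5, not to 1; dividing by the marginal is exactly what restores a valid distribution.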

Bayes’ Rule

[Venn diagram: events A, B, and their overlap A & B]

$p(A|B)$ = probability of A given that B is asserted to be true $= \dfrac{p(A \& B)}{p(B)}$

$p(A \& B) = p(B)\,p(A|B) = p(A)\,p(B|A) \;\Rightarrow\; p(A|B) = \dfrac{p(B|A)\,p(A)}{p(B)}$


Bayes’ Rule

$p(x|y) = p(y|x)\, p(x) / p(y)$

(a direct consequence of the definition of conditional probability)
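Bayes' rule can be verified numerically on any discrete joint: build both conditionals from the joint and check the identity. The joint table here is hypothetical:

```python
import numpy as np

# Hypothetical 2x2 joint p(x, y)
joint = np.array([[0.10, 0.25],
                  [0.40, 0.25]])
p_x = joint.sum(axis=1)              # marginal p(x)
p_y = joint.sum(axis=0)              # marginal p(y)

p_x_given_y = joint / p_y            # columns normalized: p(x|y)
p_y_given_x = joint / p_x[:, None]   # rows normalized: p(y|x)

# Bayes' rule: p(x|y) = p(y|x) p(x) / p(y)
bayes = p_y_given_x * p_x[:, None] / p_y
```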

Conditional vs. marginal

[Figure: conditional P(x | Y=120) compared with the marginal P(x)]

In general, the conditionals for different Y values differ. When are they the same? In particular, when are all conditionals equal to the marginal?


Statistical independence

Random variables X and Y are statistically independent if (and only if):

$p(x, y) = p(x)\,p(y) \quad \forall\, x, y$

Independence implies that all conditionals are equal to the corresponding marginal:

$p(x \mid y) = p(x, y)/p(y) = p(x) \quad \forall\, x, y$

[note: for discrete distributions, this is an outer product!]
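The outer-product note can be made concrete: build a joint from two marginals (mine, for illustration) and confirm that every conditional equals the marginal:

```python
import numpy as np

p_x = np.array([0.2, 0.5, 0.3])   # hypothetical marginal p(x)
p_y = np.array([0.6, 0.4])        # hypothetical marginal p(y)

# For independent discrete variables, the joint is the outer product of marginals
joint = np.outer(p_x, p_y)

# ... so every conditional p(x|y) equals the marginal p(x)
conds = joint / joint.sum(axis=0)
```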

Sums of RVs

Let Z = X + Y. Since expectation is linear:

$E(X + Y) = E(X) + E(Y)$

In addition, if X and Y are independent, then

$E(XY) = E(X)\,E(Y)$

$\sigma_Z^2 = E\left( \left( (X + Y) - (\mu_X + \mu_Y) \right)^2 \right) = \sigma_X^2 + \sigma_Y^2$

and $p_Z(z)$ is a convolution of $p_X(x)$ and $p_Y(y)$. [on board]
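The convolution claim connects back to the "sum of two rolled fair dice" example distribution: convolving the fair-die PMF with itself gives the triangular PMF of the sum, and the mean and variance add as stated. A sketch:

```python
import numpy as np

die = np.ones(6) / 6    # PMF of one fair die, values 1..6

# PMF of Z = X + Y: convolve the two PMFs (support of the result: 2..12)
p_sum = np.convolve(die, die)

values = np.arange(1, 7)
sums = np.arange(2, 13)
mu = (values * die).sum()                   # mean of one die: 3.5
var = ((values - mu) ** 2 * die).sum()      # variance of one die
mu_z = (sums * p_sum).sum()                 # mean of the sum
var_z = ((sums - mu_z) ** 2 * p_sum).sum()  # variance of the sum
```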

Mean and variance

  • Mean and variance summarize the centroid/width
  • Translation and rescaling of random variables
  • Mean/variance of weighted sum of random variables
  • The sample average
    – ... converges to the true mean (except for bizarre distributions)
    – ... with variance $\sigma^2/N$
    – ... most common choice for an estimate ...
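The $\sigma^2/N$ behavior of the sample average is easy to see by simulation (this sketch uses a Gaussian with σ = 2 and N = 100, values chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# The sample average of N iid draws has variance sigma^2 / N
sigma, N, reps = 2.0, 100, 20_000
avgs = rng.normal(0.0, sigma, size=(reps, N)).mean(axis=1)

emp_var = avgs.var()   # should be close to sigma**2 / N = 0.04
```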

Central limit for a uniform distribution...

[Figure: histograms of 10^4 samples of a uniform density (σ = 1), and of (u+u)/√2, (u+u+u+u)/√4, and 10 u’s divided by √10 — the rescaled sums approach a Gaussian.]


Central limit for a binary distribution...

[Figure: histograms of one coin, and of the average of 4, 16, 64, and 256 coins — again approaching a Gaussian.]
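The coin-averaging panels can be reproduced in a few lines; the averages concentrate around 1/2 with standard deviation $(1/2)/\sqrt{n}$ and an increasingly Gaussian shape. A sketch for the n = 256 panel:

```python
import numpy as np

rng = np.random.default_rng(2)

# Average of n fair coin flips, repeated many times: the histogram of the
# averages approaches a Gaussian with mean 1/2 and std (1/2)/sqrt(n)
n, reps = 256, 50_000
avgs = rng.integers(0, 2, size=(reps, n)).mean(axis=1)
```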