Chapter II.2: Basic Probability Theory and Statistics



SLIDE 1

IR&DM, WS'13/14 II.2- 24 October 2013

Chapter II.2: Basic Probability Theory and Statistics

  • 1. What is a probability?
    – 1.1. Probability spaces, events, and random variables
  • 2. Distributions
    – 2.1. Discrete distributions
    – 2.2. Continuous distributions
  • 3. Moments, independence, and Bayes’ rule
    – 3.1. Expectation, variance, and higher moments
    – 3.2. Independence
    – 3.3. Bayes’ rule
  • 4. Bounds and convergence
  • 5. Statistical inference

Wasserman, Ch. 1–5

SLIDE 2

What is a probability?

  • “If I throw a die, I will probably get 4 or less”
  • “I’ll probably go running after this lecture”
  • The term “probability” here means different things
    – The outcome of a repeatable experiment
    – My personal belief

SLIDE 3

Views on probability

  • In the classical definition, probability is shared equally among all outcomes, provided the outcomes are equally likely
    – “Equally likely” is decided based on physical symmetries or the like
  • In frequentism, a probability is the frequency with which something happens over repeated experiments
    – Requires an infinite number of repetitions
  • In subjectivism (Bayesianism), probability refers to my subjective “degree of belief”
    – But everybody’s belief is different

SLIDE 4

Axiomatic approach: sample spaces and events

  • A sample space Ω is the set of all possible outcomes of an experiment
    – An element e ∈ Ω is a sample outcome or realization
  • Subsets E ⊆ Ω are events
  • Examples:
    – If we toss a coin twice, Ω = {HH, HT, TH, TT}
      • Event “Second toss is tails” is A = {HT, TT}
    – If we toss a coin until we get tails, Ω = {T, HT, HHT, HHHT, HHHHT, HHHHHT, …}
    – If we measure a temperature in kelvins, Ω = {x ∈ ℝ : x ≥ 0}
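
The two-toss example can be sketched in Python by enumerating the sample space and picking out an event as a subset. This is only an illustration; the variable names `omega` and `A` are assumptions chosen to mirror the notation above.

```python
from itertools import product

# Sample space for two coin tosses: all length-2 strings over {H, T}.
omega = {"".join(tosses) for tosses in product("HT", repeat=2)}

# Event "second toss is tails": the subset of outcomes ending in T.
A = {e for e in omega if e[1] == "T"}

print(sorted(omega))  # ['HH', 'HT', 'TH', 'TT']
print(sorted(A))      # ['HT', 'TT']
```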

SLIDE 5

Axiomatic approach: probability measures

  • Collection 𝒜 ⊆ 2^Ω is a σ-algebra of Ω if
    – Ω ∈ 𝒜
    – If A ∈ 𝒜, then (Ω \ A) ∈ 𝒜
    – If A_1, A_2, A_3, … ∈ 𝒜, then (∪_i A_i) ∈ 𝒜
  • Function Pr: 𝒜 → [0, 1] is a probability measure if
    – Axiom 1: Pr[A] ≥ 0 for every A ∈ 𝒜
    – Axiom 2: Pr[Ω] = 1
    – Axiom 3: If A_1, A_2, … are disjoint, then Pr[∪_i A_i] = ∑_i Pr[A_i] (countably many A_i’s)
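
For a finite Ω, the three σ-algebra conditions can be checked by brute force. The following is a minimal sketch under stated assumptions: events are represented as frozensets, `is_sigma_algebra` is a hypothetical helper name, and closure under pairwise unions is used because, for a finite collection, it already gives closure under countable unions.

```python
from itertools import combinations

def is_sigma_algebra(omega, collection):
    """Check the sigma-algebra conditions for a collection of subsets of a finite omega."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in collection}
    if omega not in sets:                              # must contain the whole space
        return False
    if any(omega - s not in sets for s in sets):       # closed under complement
        return False
    return all(a | b in sets for a, b in combinations(sets, 2))  # closed under union

omega = {1, 2, 3, 4}
print(is_sigma_algebra(omega, [set(), {1}, {2, 3, 4}, omega]))  # True
print(is_sigma_algebra(omega, [set(), {1}, omega]))             # False
```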

SLIDE 6

Intermission: some combinatorics

  • The power set of a set A, 2^A (or 𝒫(A)), is the collection of all subsets of A
    – If A = {1, 2, 3}, then 2^A = {∅, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}
    – The size of the power set is 2^|A|
      • If A is finite, this is a natural number
      • If A = ℕ, this is the same cardinality as the real numbers
      • If A = ℝ, this is an even larger cardinal number
  • The number of size-k subsets of A is $\binom{|A|}{k} = \frac{|A|!}{k!\,(|A|-k)!}$
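
Both counting facts can be verified on the small example A = {1, 2, 3}. A minimal sketch, assuming Python 3.8+ for `math.comb`; `power_set` is a hypothetical helper name.

```python
from itertools import combinations
from math import comb

def power_set(a):
    """Return all subsets of the finite set a, grouped by increasing size."""
    items = list(a)
    return [set(c) for k in range(len(items) + 1) for c in combinations(items, k)]

A = {1, 2, 3}
subsets = power_set(A)
size_two = [s for s in subsets if len(s) == 2]

print(len(subsets))   # 8, which is 2 ** len(A)
print(len(size_two))  # 3, which is comb(3, 2)
```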

SLIDE 7

Axiomatic approach: probability spaces and further properties

  • A probability space is a triple (Ω, 𝒜, Pr)
    – 𝒜 contains all the events to which we can assign a probability
      • If Ω is finite or countably infinite, we can have 𝒜 = 2^Ω
      • If Ω is uncountable, 2^Ω contains sets that cannot be assigned a probability (unmeasurable sets)
  • From the axioms we can derive that
    – Pr[∅] = 0
    – If A ⊆ B, then Pr[A] ≤ Pr[B]
    – Pr[Ω \ A] = 1 – Pr[A]
    – Pr[A ∪ B] = Pr[A] + Pr[B] – Pr[A ∩ B]
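
The derived properties can be checked numerically on a small finite space. The sketch below assumes a uniform measure on a six-element Ω (a fair die); the function `pr` and the events `A`, `B`, `C` are illustrative choices, not part of the lecture.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))

def pr(event):
    """Uniform probability measure on omega (assumption: all outcomes equally likely)."""
    return Fraction(len(event & omega), len(omega))

A = frozenset({1, 2})
B = frozenset({1, 2, 3, 4})
C = frozenset({2, 4, 6})

print(pr(frozenset()))                               # 0
print(pr(A) <= pr(B))                                # True, since A is a subset of B
print(pr(omega - A) == 1 - pr(A))                    # True
print(pr(B | C) == pr(B) + pr(C) - pr(B & C))        # True
```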

SLIDE 8

Axiomatic approach: random variables

  • A random variable (r.v.) is a function X: Ω → ℝ such that {e ∈ Ω : X(e) ≤ r} ∈ 𝒜 for all r ∈ ℝ
    – This is needed to define probabilities like Pr[a ≤ X ≤ b]
    – Pr[X = x] is a shorthand for Pr[{e ∈ Ω : X(e) = x}]
  • An r.v. is discrete if it takes at most countably many different values
    – None of the complexities applies
  • An r.v. is continuous if it varies continuously in one or more intervals
    – These are the ones that cause problems

SLIDE 9

Example r.v.’s

  • Indicator variable 𝟙_E or χ_E for event E ∈ 𝒜
    – 𝟙_E(x) = 1 if x ∈ E and 𝟙_E(x) = 0 otherwise
    – Pr[E] = Pr[𝟙_E = 1]
  • Let r.v. X be the number of heads in 10 coin flips
    – If e = HTTTTTHHTT, then X(e) = 3
    – Discrete r.v.
  • Let r.v. Y be the room temperature of my kitchen (in Celsius)
    – If e = “00:22 on 22 Oct”, then Y(e) = 22.7
    – Continuous r.v.
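
A random variable is just a function on outcomes, which the heads-counting example makes concrete. The sketch below assumes a fair coin, so that every one of the 2^10 outcomes is equally likely; the name `pr_three_heads` is illustrative.

```python
from fractions import Fraction
from itertools import product

def X(e):
    """Random variable: number of heads in the outcome string e."""
    return e.count("H")

# All 2**10 outcomes of ten coin flips.
omega = ["".join(flips) for flips in product("HT", repeat=10)]

# Pr[X = 3] is the probability of the event {e in omega : X(e) = 3}.
pr_three_heads = Fraction(sum(1 for e in omega if X(e) == 3), len(omega))

print(X("HTTTTTHHTT"))  # 3
print(pr_three_heads)   # 15/128
```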

SLIDE 10

Some diagrams (1)

  • The Venn diagram is a way to visualize the combinatorial relationships of three sets

[Figure: Venn diagram of three sets A, B, and C, with the regions A∩B, A∩C, B∩C, and A∩B∩C labelled]

  • The inclusion–exclusion principle for three sets: Pr[A ∪ B ∪ C] = Pr[A] + Pr[B] + Pr[C] – Pr[A ∩ B] – Pr[A ∩ C] – Pr[B ∩ C] + Pr[A ∩ B ∩ C]

SLIDE 11

Some diagrams (2)

  • An r.v. X that takes a finite number of values partitions the sample space into finitely many sets (the pre-images of the values of X)
    – If X is a roll of a die, we have E_1 = {e ∈ Ω : X(e) = 1} = X^–1(1), and similarly for E_2, E_3, …, E_6
    – If Y is the indicator variable for “X ≥ 2”, we get Y^–1(0) = E_1 and Y^–1(1) = E_2 ∪ … ∪ E_6

[Figure: the sample space partitioned into six cells labelled 1 to 6]

SLIDE 12

Distributions

  • The cumulative distribution function (cdf) of r.v. X is a function F_X: ℝ → [0, 1], F_X(x) = Pr[X ≤ x]
  • If X is discrete, the probability mass function (pmf) of X is f_X(x) = Pr[X = x]
  • If X is continuous, the probability density function (pdf) of X is a function f_X for which
    – f_X(x) ≥ 0 for all x
    – $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$
    – We have that $F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$
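
In the discrete case the cdf is obtained from the pmf by summing instead of integrating. A minimal sketch, assuming a fair six-sided die as the example distribution; `pmf` and `cdf` are illustrative names.

```python
from fractions import Fraction

# pmf of a fair die: f_X(k) = 1/6 for k = 1, ..., 6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def cdf(x):
    """F_X(x) = Pr[X <= x], the sum of the pmf over all values k <= x."""
    return sum((p for k, p in pmf.items() if k <= x), Fraction(0))

print(cdf(0))    # 0
print(cdf(2.5))  # 1/3
print(cdf(3))    # 1/2
print(cdf(6))    # 1
```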

SLIDE 13

Example of a CDF and PDF

[Figure: an example CDF (vertical axis from 0 to 1) and an example PDF (vertical axis from 0 to 0.5), each plotted for x from –5 to 5]

SLIDE 14

Some discrete distributions

  • Uniform distribution over {1, 2, …, m}
    – Pr[X = k] = 1/m for 1 ≤ k ≤ m
  • Bernoulli distribution with parameter p
    – Binary, single coin toss
    – Pr[X = k] = p^k (1 – p)^(1 – k) for k ∈ {0, 1}
  • Binomial distribution with parameters p and n
    – n repeated Bernoulli experiments with parameter p
    – $\Pr[X = k] = \binom{n}{k} p^k (1-p)^{n-k}$ for 0 ≤ k ≤ n
  • Geometric distribution with parameter p
    – Pr[X = k] = (1 – p)^k p for k ≥ 0
  • Poisson distribution with rate parameter λ
    – Pr[X = k] = e^(–λ) λ^k / k! for k ≥ 0
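
The binomial and geometric pmfs can be written down directly and sanity-checked with exact rational arithmetic. This is a sketch under assumptions: the function names are illustrative, `math.comb` needs Python 3.8+, and p = 1/3 is an arbitrary choice.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """Pr[X = k] for n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    """Pr[X = k] with k >= 0 counting failures before the first success."""
    return (1 - p) ** k * p

p = Fraction(1, 3)
binomial_total = sum(binomial_pmf(k, 10, p) for k in range(11))
geometric_partial = sum(geometric_pmf(k, p) for k in range(20))

print(binomial_total)                        # 1
print(geometric_partial == 1 - (1 - p)**20)  # True: the partial sums approach 1
```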

SLIDE 15

Some continuous distributions

  • Uniform distribution in the interval [a, b]
    – $f_X(x) = \frac{1}{b-a}$ for x ∈ [a, b]
  • Exponential distribution with rate λ
    – Time between two events in a Poisson process
    – $f_X(x) = \lambda e^{-\lambda x}$ for x ≥ 0
  • t-distribution with ν degrees of freedom
    – Typical distribution for test statistics
    – $f_X(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$
  • χ^2 distribution with k degrees of freedom
    – $f_X(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{\frac{k}{2}-1} e^{-\frac{x}{2}}$

SLIDE 16

Normal (Gaussian) distribution

  • Two parameters, µ (mean) and σ^2 (variance)
    – $f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
  • For the standard normal distribution µ = 0 and σ^2 = 1
  • Many, many applications
  • R.v. X is log-normally distributed if its logarithm is normally distributed

[Figure: two plots over x from –5 to 5, one with vertical axis from 0 to 1 and one with vertical axis from 0 to 0.5]
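
The normal density is easy to code directly and to cross-check against the standard library. A minimal sketch: `normal_pdf` is a hypothetical helper parameterized by the variance σ², whereas `statistics.NormalDist` (Python 3.8+) takes the standard deviation σ.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Density of N(mu, sigma2) at x; sigma2 is the variance, not the standard deviation."""
    return exp(-((x - mu) ** 2) / (2 * sigma2)) / sqrt(2 * pi * sigma2)

# Standard normal at 0 equals 1 / sqrt(2 * pi).
print(normal_pdf(0.0))

# Cross-check against the standard library for mu = 2, variance 0.5.
print(normal_pdf(1.3, 2.0, 0.5), NormalDist(2.0, sqrt(0.5)).pdf(1.3))
```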

SLIDE 17

Multivariate distributions

  • If X and Y are two discrete variables, their joint mass function is f_{X,Y}(x, y) = Pr[X = x, Y = y]
    – For continuous variables it is a non-negative function f_{X,Y}(x, y) s.t. $\int_{\mathbb{R}} \int_{\mathbb{R}} f_{X,Y}(x, y)\,dx\,dy = 1$
      • For any A ⊆ ℝ × ℝ, $\Pr[(X, Y) \in A] = \iint_A f_{X,Y}(x, y)\,dx\,dy$
  • The marginal distribution (mass function) for X is
    – $f_X(x) = \Pr[X = x] = \sum_y f_{X,Y}(x, y)$ for discrete X
    – $f_X(x) = \int_{\mathbb{R}} f_{X,Y}(x, y)\,dy$ for continuous X
  • All these concepts extend naturally to more than two variables

SLIDE 18

Multivariate normal distribution

  • A.k.a. multidimensional Gaussian distribution
  • Two parameters, vector µ and matrix Σ
    – For n variables, µ ∈ ℝ^n and Σ ∈ ℝ^(n×n)
  • The density function is $f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right\}$
  • In the standard multivariate normal distribution, µ is all-zeros and Σ is the identity, giving $f(x) = \frac{1}{(2\pi)^{n/2}} \exp\left\{-\frac{1}{2} x^T x\right\}$

SLIDE 19

Bivariate normal distribution

SLIDE 20

Independence, moments & Bayes’ rule

  • Two events A and B are independent if Pr[A ∩ B] = Pr[A]Pr[B]
  • Two r.v.’s X and Y are independent if f_{X,Y}(x, y) = f_X(x) f_Y(y) for all x, y
  • The conditional probability of A given B is Pr[A | B] = Pr[A ∩ B]/Pr[B]
    – Assumes Pr[B] > 0
    – If A and B are independent, Pr[A | B] = Pr[A]
  • The conditional pmf/pdf is f_{X|Y}(x | y) = f_{X,Y}(x, y)/f_Y(y)
    – For independent X and Y, f_{X|Y}(x | y) = f_X(x)
  • A and B are conditionally independent given C if Pr[A ∩ B | C] = Pr[A | C]Pr[B | C]

SLIDE 21

Example

  • Test for sickness with outcomes + and –
  • Test seems to work:
    – Pr[+ | sick] = Pr[+ ∩ sick]/Pr[sick] = 0.9
    – Pr[– | healthy] ≈ 0.9
  • But what is the probability that you are sick if you get +?
    – Pr[sick | +] = Pr[+ ∩ sick]/Pr[+] ≈ 0.08

Joint probabilities of test outcome and health status:

            sick     healthy
    +       0.009    0.099
    –       0.001    0.891
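
The three conditional probabilities follow mechanically from the joint table. The sketch below stores the table as a dictionary and uses exact fractions; the names `joint` and `marginal` are illustrative, and "-" stands for the negative test outcome.

```python
from fractions import Fraction

# Joint probabilities Pr[test outcome, health status] from the table.
joint = {
    ("+", "sick"): Fraction(9, 1000),
    ("+", "healthy"): Fraction(99, 1000),
    ("-", "sick"): Fraction(1, 1000),
    ("-", "healthy"): Fraction(891, 1000),
}

def marginal(index, value):
    """Sum the joint table over the other variable (index 0: test, index 1: status)."""
    return sum(p for key, p in joint.items() if key[index] == value)

pr_pos_given_sick = joint[("+", "sick")] / marginal(1, "sick")
pr_neg_given_healthy = joint[("-", "healthy")] / marginal(1, "healthy")
pr_sick_given_pos = joint[("+", "sick")] / marginal(0, "+")

print(pr_pos_given_sick)     # 9/10
print(pr_neg_given_healthy)  # 9/10
print(pr_sick_given_pos)     # 1/12, about 0.08
```

The low posterior comes from the small marginal Pr[sick] = 0.01: most positive results come from the much larger healthy group.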

SLIDE 22

Bayes’ theorem and total probability

  • The law of total probability states that if A_1, A_2, …, A_k partition Ω, then for any event B, $\Pr[B] = \sum_{i=1}^{k} \Pr[B \mid A_i]\,\Pr[A_i]$
    – Sum B piece-wise over the A_i’s
  • Bayes’ theorem states that if A_1, A_2, …, A_k is a partition of Ω s.t. Pr[A_i] > 0 for all i, then for any B s.t. Pr[B] > 0 and for each i = 1, …, k, $\Pr[A_i \mid B] = \frac{\Pr[B \mid A_i]\,\Pr[A_i]}{\sum_{j=1}^{k} \Pr[B \mid A_j]\,\Pr[A_j]}$
    – Pr[A_i] is the prior probability and Pr[A_i | B] the posterior probability
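
Bayes’ theorem translates into a few lines of code once the priors Pr[A_i] and likelihoods Pr[B | A_i] are given. In this sketch `posterior` is a hypothetical helper, and the numbers are the ones implied by the sickness test example (prior Pr[sick] = 0.01, Pr[+ | sick] = 0.9, Pr[+ | healthy] = 0.1).

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return Pr[A_i | B] for every i, given Pr[A_i] and Pr[B | A_i]."""
    # Denominator: law of total probability, Pr[B] = sum_j Pr[B | A_j] Pr[A_j].
    evidence = sum(likelihoods[a] * priors[a] for a in priors)
    return {a: likelihoods[a] * priors[a] / evidence for a in priors}

priors = {"sick": Fraction(1, 100), "healthy": Fraction(99, 100)}
likelihoods = {"sick": Fraction(9, 10), "healthy": Fraction(1, 10)}  # Pr[+ | status]

post = posterior(priors, likelihoods)
print(post["sick"])     # 1/12
print(post["healthy"])  # 11/12
```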

SLIDE 23

Expectation and variance

  • The expected value of r.v. X is
    – $E[X] = \sum_k k\, f_X(k)$ for discrete X
    – $E[X] = \int_{\mathbb{R}} x\, f_X(x)\,dx$ for continuous X
      • Exists only if $\int_{\mathbb{R}} |x|\, f_X(x)\,dx < \infty$
  • The i-th moment is $E[X^i] = \int_{\mathbb{R}} x^i f_X(x)\,dx$
    – Assuming that $\int_{\mathbb{R}} |x^i|\, f_X(x)\,dx < \infty$
  • The variance of X is V[X] = E[(X – E[X])^2] = E[X^2] – E[X]^2
    – Also denoted by σ^2
    – Standard deviation sd(X) is √V[X]
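
For a discrete r.v. the moments are finite sums, so both forms of the variance can be compared exactly. A minimal sketch, assuming a fair six-sided die; `moment` is an illustrative helper.

```python
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}  # fair die

def moment(i):
    """i-th moment E[X^i] = sum_k k^i f_X(k)."""
    return sum(k**i * p for k, p in pmf.items())

mean = moment(1)
variance = moment(2) - mean**2                                   # E[X^2] - E[X]^2
variance_central = sum((k - mean) ** 2 * p for k, p in pmf.items())  # E[(X - E[X])^2]

print(mean, variance)  # 7/2 35/12
```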

SLIDE 24

Properties of expectation and variance

  • E[aX + b] = aE[X] + b for constants a and b
  • E[X_1 + X_2 + … + X_n] = E[X_1] + E[X_2] + … + E[X_n]
    – Linearity of expectation
    – Works for any X_i’s (e.g. they don’t have to be independent)
  • E[XY] = E[X]E[Y] for independent X and Y
  • V[aX + b] = a^2 V[X] for constants a and b
  • V[X_1 + X_2 + … + X_n] = V[X_1] + V[X_2] + … + V[X_n]
    – For independent X_i’s

SLIDE 25

Correlation and covariance

  • The covariance between r.v.’s X and Y is Cov(X, Y) = E[(X – E[X])(Y – E[Y])]
    – Cov(X, Y) = E[XY] – E[X]E[Y]
      • Cov(X, X) = V[X]
    – If X and Y are independent, then Cov(X, Y) = 0
      • The converse is not generally true
  • The correlation between X and Y is ρ_{X,Y} = Cov(X, Y)/(sd(X)×sd(Y))
    – We have –1 ≤ ρ_{X,Y} ≤ 1
    – If Y = aX + b for some constants a ≠ 0 and b, then ρ_{X,Y} = sign(a) (i.e. either –1 or 1)
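
A standard counterexample shows why zero covariance does not imply independence. The example is an assumption added for illustration: X is uniform on {–1, 0, 1} and Y = X², so Y is completely determined by X, yet Cov(X, Y) = 0.

```python
from fractions import Fraction

pmf_x = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

def E(g):
    """Expectation of g(X) under pmf_x."""
    return sum(g(x) * p for x, p in pmf_x.items())

# Cov(X, Y) = E[XY] - E[X]E[Y] with Y = X**2, so XY = X**3.
cov = E(lambda x: x**3) - E(lambda x: x) * E(lambda x: x**2)

# Dependence: Y = 0 exactly when X = 0, so Pr[X = 0, Y = 0] = Pr[X = 0].
pr_joint = pmf_x[0]
pr_product = pmf_x[0] * pmf_x[0]  # Pr[X = 0] * Pr[Y = 0]

print(cov)                     # 0
print(pr_joint == pr_product)  # False, so X and Y are not independent
```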

SLIDE 26

Conditional expectation

  • The conditional expectation of X given Y = y is
    – E[X | Y = y] = ∑_x x f_{X|Y}(x | y) for discrete X
    – E[X | Y = y] = ∫ x f_{X|Y}(x | y) dx for continuous X
  • The conditional expectation E[X | Y] is an r.v. that is a function of Y
    – It only becomes a number when we observe Y = y
    – If X is a roll of a die and Y is an indicator variable for event “X ≥ 5”, then E[X | Y] is
      • (1 + 2 + 3 + 4)×(1/6)/(4/6) = 2.5 if Y = 0
      • (5 + 6)×(1/6)/(2/6) = 5.5 if Y = 1
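
The die example can be reproduced by treating E[X | Y] as a function of the observed value of Y. A minimal sketch with illustrative names `Y` and `cond_expectation`, assuming a fair die.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die

def Y(x):
    """Indicator of the event X >= 5."""
    return 1 if x >= 5 else 0

def cond_expectation(y):
    """E[X | Y = y] = sum_x x * Pr[X = x, Y = y] / Pr[Y = y]."""
    pr_y = sum(p for x, p in pmf.items() if Y(x) == y)
    return sum(x * p for x, p in pmf.items() if Y(x) == y) / pr_y

print(cond_expectation(0))  # 5/2
print(cond_expectation(1))  # 11/2
```

Because the result depends on which value of Y is observed, `cond_expectation` is itself a function of Y, which is what makes E[X | Y] a random variable.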

SLIDE 27

Bounds and convergence

  • Sometimes we don’t know everything about an r.v., but we still want to study its behaviour
    – E.g. we want to bound the “tail probability”
  • Trivial bound: If E[X] exists, then Pr[X ≤ E[X]] > 0
    – Also Pr[X ≥ E[X]] > 0
  • Markov’s inequality: Pr[X ≥ t] ≤ E[X]/t
    – Assumes X is nonnegative and t > 0
  • Chebyshev’s inequality: Pr[|X – E[X]| ≥ t] ≤ V[X]/t^2
    – Any X, t > 0
    – Corollary of Markov’s with (X – E[X])^2 as the r.v.
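
Both inequalities can be compared with exact tail probabilities on a small example. The sketch assumes a fair die and arbitrarily chosen thresholds (t = 5 for Markov, t = 5/2 for Chebyshev); all names are illustrative.

```python
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}  # fair die, nonnegative r.v.
mean = sum(k * p for k, p in pmf.items())
var = sum((k - mean) ** 2 * p for k, p in pmf.items())

def pr(pred):
    """Probability of the event {k : pred(k)}."""
    return sum(p for k, p in pmf.items() if pred(k))

# Markov: Pr[X >= t] <= E[X] / t
t = 5
markov_lhs, markov_rhs = pr(lambda k: k >= t), mean / t

# Chebyshev: Pr[|X - E[X]| >= s] <= V[X] / s^2
s = Fraction(5, 2)
cheb_lhs, cheb_rhs = pr(lambda k: abs(k - mean) >= s), var / s**2

print(markov_lhs, markov_rhs)  # 1/3 7/10
print(cheb_lhs, cheb_rhs)      # 1/3 7/15
```

In both cases the bound holds but is far from tight, which is typical when only the mean or the variance is used.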

SLIDE 28

More bounds

  • Chernoff–Hoeffding: If X_1, …, X_n ~ Bernoulli(p) are independent, then for any ε > 0, $\Pr[|\bar{X}_n - p| > \varepsilon] \le 2e^{-2n\varepsilon^2}$, where $\bar{X}_n = n^{-1} \sum_{i=1}^{n} X_i$
    – A large family of inequalities for different settings
  • Mill’s inequality: $\Pr[|Z| > t] \le \sqrt{\frac{2}{\pi}}\, \frac{\exp\{-t^2/2\}}{t}$ for Z ~ N(0, 1) and t > 0
  • Cauchy–Schwarz: |E[XY]|^2 ≤ E[X^2]E[Y^2]
    – Assumes finite variances
  • Jensen’s inequality: E[g(X)] ≥ g(E[X]) for convex g and E[g(X)] ≤ g(E[X]) for concave g
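
The Chernoff–Hoeffding bound for Bernoulli variables can be compared with the exact tail probability, since the sum of n independent Bernoulli(p) variables is binomial. A sketch under assumptions: `exact_tail` and `hoeffding_bound` are hypothetical helpers, and n = 100, p = 1/2, ε = 1/10 are arbitrary choices.

```python
from fractions import Fraction
from math import comb, exp

def exact_tail(n, p, eps):
    """Exact Pr[|mean - p| > eps] for n independent Bernoulli(p) trials (p, eps as Fractions)."""
    total = Fraction(0)
    for k in range(n + 1):
        if abs(Fraction(k, n) - p) > eps:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total

def hoeffding_bound(n, eps):
    """Right-hand side 2 * exp(-2 * n * eps^2) of the inequality."""
    return 2 * exp(-2 * n * float(eps) ** 2)

n, p, eps = 100, Fraction(1, 2), Fraction(1, 10)
exact = float(exact_tail(n, p, eps))
bound = hoeffding_bound(n, eps)
print(exact, bound)  # the exact tail is well below the bound
```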

SLIDE 29

Convergence

  • A sequence X_1, X_2, … of r.v.’s can converge to r.v. X in the following senses
    – X_n converges to X almost surely, X_n →^{a.s.} X, if Pr[lim_{n→∞} X_n = X] = 1
    – X_n converges to X in probability, X_n →^P X, if for every ε > 0, Pr[|X_n – X| > ε] → 0 as n → ∞
    – X_n converges to X in distribution, X_n →^D X, if lim_{n→∞} F_n(x) = F(x) at all points x where F is continuous
      • F_n is the cdf of X_n and F the cdf of X
  • Almost sure convergence implies convergence in probability, which in turn implies convergence in distribution

SLIDE 30

Laws of large numbers

  • The weak law of large numbers states that if X_1, X_2, …, X_n are independent and identically distributed (i.i.d.) r.v.’s with mean µ, then $\bar{X}_n = n^{-1} \sum_{i=1}^{n} X_i \xrightarrow{P} \mu$
  • The strong law of large numbers replaces the convergence in probability with almost sure convergence
  • The laws of large numbers show that the expected value is the average value over an infinite number of repetitions
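
A small simulation illustrates the law of large numbers: the sample mean of fair die rolls drifts towards the expected value 3.5 as n grows. This is only a sketch; `sample_mean_of_die` is a hypothetical helper and the seed is fixed solely to make runs repeatable.

```python
import random

def sample_mean_of_die(n, seed=0):
    """Average of n simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n)) / n

# The sample mean should get closer to E[X] = 3.5 for larger n.
for n in (10, 1_000, 100_000):
    print(n, sample_mean_of_die(n))
```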

SLIDE 31

Central limit theorem

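  • If X_1, X_2, …, X_n are i.i.d. with mean µ and variance σ^2, and if Z ~ N(0, 1), then per the central limit theorem, $\sqrt{n}\,(\bar{X}_n - \mu)/\sigma \xrightarrow{D} Z$
    – That is, for large n the sample mean $\bar{X}_n$ is approximately distributed as N(µ, σ^2/n)
    – Does not depend on the distribution of the X_i’s
      • Except that they must have a mean and a variance
    – One main reason why the normal distribution is ubiquitous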
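
The central limit theorem can be illustrated by standardizing sample means of a clearly non-normal distribution. In this sketch the X_i are assumed to be Uniform(0, 1), which has mean 1/2 and variance 1/12; the standardized mean √n(X̄_n – µ)/σ should then fall within ±1.96 about 95% of the time, as a standard normal would. Sample size, number of trials, and seed are arbitrary choices.

```python
import random
from math import sqrt

def standardized_mean(n, rng):
    """sqrt(n) * (sample mean - mu) / sigma for n draws from Uniform(0, 1)."""
    mu, sigma = 0.5, sqrt(1 / 12)  # mean and standard deviation of Uniform(0, 1)
    xbar = sum(rng.random() for _ in range(n)) / n
    return sqrt(n) * (xbar - mu) / sigma

rng = random.Random(0)
z = [standardized_mean(30, rng) for _ in range(5_000)]

# Fraction of standardized means within the central 95% region of N(0, 1).
coverage = sum(abs(v) <= 1.96 for v in z) / len(z)
print(coverage)
```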

SLIDE 32

Statistical inference

  • A statistical model M is a set of distributions
    – All smooth distributions, all unimodal distributions, all discrete distributions with mean 1, …
  • M is a parametric model if it can be completely described with a finite number of parameters
    – E.g. the family of Normal distributions with parameters µ and σ^2: M = {N(µ, σ^2) : µ ∈ ℝ, σ^2 ∈ ℝ+}

SLIDE 33

Statistical inference

  • Given a parametric model M and a sample X_1, …, X_n, how do we infer the parameters of M?
  • The sample mean is $\bar{X}_n = n^{-1} \sum_{i=1}^{n} X_i$
  • The sample variance is $S^2_{X_n} = (n-1)^{-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$
  • The bias of the estimator $\hat{\theta}$ for parameter θ is $E[\hat{\theta}] - \theta$
    – The estimator is unbiased if it has bias 0
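
The two estimators are short enough to write out by hand and compare with the standard library. A minimal sketch: the data set is made up for illustration, and `statistics.variance` is used as a cross-check because it also divides by n – 1.

```python
import statistics

def sample_mean(xs):
    """Sample mean: the sum of the observations divided by n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sample variance with the (n - 1) denominator."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
print(sample_mean(data))          # 5.0
print(sample_variance(data))      # 32/7, about 4.571
print(statistics.variance(data))  # same value from the standard library
```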

SLIDE 34

Summary

  • What “probability” means is debatable
    – Axiomatic approach side-steps interpretation issues
  • With discrete r.v.’s, most of probability theory is simple combinatorics
    – Continuous variables are more problematic
  • Conditional expectation is a random variable!