

SLIDE 1

Basic Statistics and Probability Theory

Based on “Foundations of Statistical NLP”

  • C. Manning & H. Schütze, ch. 2, MIT Press, 2002

“Probability theory is nothing but common sense reduced to calculation.” Pierre Simon, Marquis de Laplace (1749-1827)

SLIDE 2

PLAN

  • 1. Elementary Probability Notions:
  • Event Space and Probability Function
  • Conditional Probability
  • Bayes’ Theorem
  • Independence of Probabilistic Events
  • 2. Random Variables:
  • Discrete Variables and Continuous Variables
  • Mean, Variance and Standard Deviation
  • Standard Distributions
  • Joint, Marginal and Conditional Distributions
  • Independence of Random Variables
  • 3. Limit Theorems
  • 4. Estimating the parameters of probabilistic models from data
  • 5. Elementary Information Theory

SLIDE 3

  • 1. Elementary Probability Notions
  • sample space: Ω (either discrete or continuous)
  • event: A ⊆ Ω
    – the certain event: Ω
    – the impossible event: ∅
  • event space: F = 2^Ω (or a subspace of 2^Ω that contains ∅ and is closed under complement and countable union)
  • probability function/distribution: P : F → [0, 1] such that:
    – P(Ω) = 1
    – the “countable additivity” property: for all A1, . . ., Ak, . . . disjoint events, P(∪i Ai) = Σi P(Ai)

Consequence: for a uniform distribution over a finite sample space, P(A) = #favorable outcomes / #all outcomes.

SLIDE 4

Conditional Probability

  • P(A | B) = P(A ∩ B) / P(B)

Note: P(A | B) is called the a posteriori probability of A, given B.

  • The “multiplication” rule:

P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A)

  • The “chain” rule:

P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1)P(A2 | A1)P(A3 | A1, A2) . . . P(An | A1, A2, . . . , An−1)

SLIDE 5

  • The “total probability” formula:

P(A) = P(A | B)P(B) + P(A | ¬B)P(¬B)

More generally: if A ⊆ ∪i Bi and Bi ∩ Bj = ∅ for all i ≠ j, then P(A) = Σi P(A | Bi)P(Bi)

  • Bayes’ Theorem:

P(B | A) = P(A | B)P(B) / P(A)

  • or: P(B | A) = P(A | B)P(B) / [P(A | B)P(B) + P(A | ¬B)P(¬B)]

  • or . . .
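A minimal sketch (not part of the original slides) of Bayes’ theorem with a binary partition {B, ¬B}, mirroring the expanded formula above; the numbers are illustrative only:

```python
# Bayes' theorem over a binary partition {B, not-B}; values are made up.
def bayes_binary(p_b: float, p_a_given_b: float, p_a_given_not_b: float) -> float:
    """Return P(B | A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|not B)P(not B)]."""
    numerator = p_a_given_b * p_b
    evidence = numerator + p_a_given_not_b * (1.0 - p_b)  # total probability P(A)
    return numerator / evidence

# Example: rare event B with prior 0.002; P(A|B) = 1, P(A|not B) = 0.01
print(bayes_binary(0.002, 1.0, 0.01))   # ≈ 0.167
```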

SLIDE 6

Independence of Probabilistic Events

  • Independent events: P(A ∩ B) = P(A)P(B)

Note: When P(B) ≠ 0, the above definition is equivalent to P(A | B) = P(A).

  • Conditionally independent events:

P(A ∩ B | C) = P(A | C)P(B | C), assuming, of course, that P(C) ≠ 0.

Note: When P(B ∩ C) ≠ 0, the above definition is equivalent to P(A | B, C) = P(A | C).

SLIDE 7

  • 2. Random Variables

2.1 Basic Definitions

Let Ω be a sample space, and P : 2^Ω → [0, 1] a probability function.

  • A random variable of distribution P is a function

X : Ω → R^n

  • For now, let us consider n = 1.
  • The cumulative distribution function of X is F : R → [0, 1] defined by

F(x) = P(X ≤ x) = P({ω ∈ Ω | X(ω) ≤ x})

SLIDE 8

2.2 Discrete Random Variables

Definition: Let P : 2^Ω → [0, 1] be a probability function, and X be a random variable of distribution P.

  • If Image(X) is either finite or countably infinite, then X is called a discrete random variable.
  • For such a variable we define the probability mass function (pmf) p : R → [0, 1] as

p(x) def.= P(X = x) = P({ω ∈ Ω | X(ω) = x}).

(Obviously, it follows that Σ_{xi ∈ Image(X)} p(xi) = 1.)

Mean, Variance, and Standard Deviation:

  • Expectation / mean of X: E(X) not.= E[X] = Σx x p(x) if X is a discrete random variable.
  • Variance of X: Var(X) not.= Var[X] = E((X − E(X))²).
  • Standard deviation: σ = √Var(X).

Covariance of X and Y, two random variables of distribution P:

  • Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
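A small illustration (mine, not from the slides): computing the mean, variance and standard deviation of a discrete random variable directly from its pmf, here a fair die:

```python
import math

pmf = {x: 1/6 for x in range(1, 7)}                     # p(x) for x in Image(X)

mean = sum(x * p for x, p in pmf.items())               # E[X] = Σx x p(x)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # E[(X − E[X])²]
sigma = math.sqrt(var)                                  # σ = √Var(X)

print(mean, var, sigma)   # 3.5, 2.9166..., 1.7078...
```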

SLIDE 9

Exemplification:

  • the Binomial distribution: b(r; n, p) = C(n, r) p^r (1 − p)^(n−r) (0 ≤ r ≤ n)

mean: np, variance: np(1 − p)

  • the Bernoulli distribution: b(r; 1, p)

[Figure: the probability mass function and the cumulative distribution function of the Binomial distribution]
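A hedged sketch of the binomial pmf b(r; n, p) = C(n, r) p^r (1 − p)^(n−r), with a numerical check of mean = np and variance = np(1 − p); n and p are arbitrary:

```python
from math import comb

def binom_pmf(r: int, n: int, p: float) -> float:
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.3
mean = sum(r * binom_pmf(r, n, p) for r in range(n + 1))
var = sum((r - mean) ** 2 * binom_pmf(r, n, p) for r in range(n + 1))
print(mean, var)   # 3.0, 2.1  (= np and np(1−p))
```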

SLIDE 10

2.3 Continuous Random Variables

Definitions: Let P : 2^Ω → [0, 1] be a probability function, and X : Ω → R be a random variable of distribution P.

  • If Image(X) is an uncountably infinite set, and F, the cumulative distribution function of X, is continuous, then X is called a continuous random variable. (It follows, naturally, that P(X = x) = 0 for all x ∈ R.)
  • If there exists p : R → [0, ∞) such that F(x) = ∫_{−∞}^{x} p(t) dt, then X is called absolutely continuous. In such a case, p is called the probability density function (pdf) of X.
  • For B ⊆ R for which ∫_B p(x) dx exists,

Pr(B) def.= P({ω ∈ Ω | X(ω) ∈ B}) = ∫_B p(x) dx.

  • In particular, ∫_{−∞}^{+∞} p(x) dx = 1.
  • Expectation / mean of X: E(X) not.= E[X] = ∫ x p(x) dx.

SLIDE 11

Exemplification:

  • Normal (Gaussian) distribution: N(x; µ, σ) = 1/(√(2π) σ) · e^{−(x − µ)²/(2σ²)}

mean: µ, variance: σ²

  • Standard Normal distribution: N(x; 0, 1)
  • Remark:

For n, p such that np(1 − p) > 5, the Binomial distributions can be approximated by Normal distributions.
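A quick numerical check (my own illustration, with arbitrary n and p) of the remark above: when np(1 − p) > 5, b(r; n, p) is close to a normal density with µ = np and σ² = np(1 − p):

```python
from math import comb, exp, pi, sqrt

n, p = 50, 0.4                        # np(1−p) = 12 > 5
mu, sigma = n * p, sqrt(n * p * (1 - p))

for r in (15, 20, 25):
    binom = comb(n, r) * p**r * (1 - p)**(n - r)
    normal = exp(-(r - mu) ** 2 / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)
    print(r, round(binom, 4), round(normal, 4))   # the two columns nearly agree
```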

SLIDE 12

[Figure: the Normal distribution, its probability density function and its cumulative distribution function]

SLIDE 13

2.4 Basic Properties of Random Variables

Let P : 2^Ω → [0, 1] be a probability function, X : Ω → R^n be a discrete/continuous random variable of distribution P.

  • If g : R^n → R^m is a function, then g(X) is a random variable.

If g(X) is discrete, then E(g(X)) = Σx g(x) p(x).
If g(X) is continuous, then E(g(X)) = ∫ g(x) p(x) dx.

  • E(aX + b) = aE(X) + b.
  • If g is non-linear, then in general E(g(X)) ≠ g(E(X)).
  • E(X + Y) = E(X) + E(Y).
  • Var(X) = E(X²) − E²(X).
  • Var(aX) = a²Var(X).
  • Cov(X, Y) = E[XY] − E[X]E[Y].
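A minimal Monte Carlo sanity check (illustration only, with invented parameters) of two of the properties above, E(aX + b) = aE(X) + b and Var(aX) = a² Var(X):

```python
import random

xs = [random.gauss(2.0, 3.0) for _ in range(200_000)]    # X ~ N(2, 3²)
a, b = 5.0, -1.0

def mean(v):
    return sum(v) / len(v)

def var(v):                           # Var = E(X²) − E²(X)
    return mean([u * u for u in v]) - mean(v) ** 2

print(mean([a * x + b for x in xs]), a * mean(xs) + b)   # both ≈ 9
print(var([a * x for x in xs]), a ** 2 * var(xs))        # both ≈ 225
```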

SLIDE 14

2.5 Joint, Marginal and Conditional Distributions

Exemplification for the bi-variate case: Let Ω be a sample space, P : 2^Ω → [0, 1] a probability function, and V : Ω → R² be a random variable of distribution P. One can naturally see V as a pair of two random variables X : Ω → R and Y : Ω → R. (More precisely, V(ω) = (x, y) = (X(ω), Y(ω)).)

  • the joint pmf/pdf of X and Y is defined by

p(x, y) not.= pX,Y(x, y) = P(X = x, Y = y) = P({ω ∈ Ω | X(ω) = x, Y(ω) = y}).

  • the marginal pmf/pdf functions of X and Y are:

for the discrete case: pX(x) = Σy p(x, y), pY(y) = Σx p(x, y)
for the continuous case: pX(x) = ∫ p(x, y) dy, pY(y) = ∫ p(x, y) dx

  • the conditional pmf/pdf of X given Y is:

pX|Y(x | y) = pX,Y(x, y) / pY(y)
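A sketch (values invented for illustration) of marginal and conditional pmfs computed from a small joint pmf table, following the definitions above:

```python
from collections import defaultdict

joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}   # pX,Y(x, y)

pX, pY = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    pX[x] += p        # pX(x) = Σy p(x, y)
    pY[y] += p        # pY(y) = Σx p(x, y)

def cond_x_given_y(x, y):
    return joint[(x, y)] / pY[y]      # pX|Y(x | y) = pX,Y(x, y) / pY(y)

print(dict(pX), dict(pY), cond_x_given_y(0, 1))   # {0: 0.5, 1: 0.5} ... 0.2/0.6
```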

SLIDE 15

2.6 Independence of Random Variables

Definitions:

  • Let X, Y be random variables of the same type (i.e. either discrete or continuous), and pX,Y their joint pmf/pdf. X and Y are said to be independent if pX,Y(x, y) = pX(x) · pY(y) for all possible values x and y of X and Y respectively.
  • Similarly, let X, Y and Z be random variables of the same type, and p their joint pmf/pdf. X and Y are conditionally independent given Z if pX,Y|Z(x, y | z) = pX|Z(x | z) · pY|Z(y | z) for all possible values x, y and z of X, Y and Z respectively.

SLIDE 16

Properties of random variables pertaining to independence

  • If X, Y are independent, then Var(X + Y) = Var(X) + Var(Y).
  • If X, Y are independent, then E(XY) = E(X)E(Y), i.e. Cov(X, Y) = 0.
  • Cov(X, Y) = 0 does NOT imply that X, Y are independent.
  • The covariance matrix corresponding to a vector of random variables is symmetric and positive semi-definite.
  • If the covariance matrix of a multi-variate Gaussian distribution is diagonal, then the marginal distributions are independent.

SLIDE 17

  • 3. Limit Theorems

[Sheldon Ross, A First Course in Probability, 5th ed., 1998] “The most important results in probability theory are limit theorems. Of these, the most important are laws of large numbers, concerned with stating conditions under which the average of a sequence of random variables converges (in some sense) to the expected average; [and] central limit theorems, concerned with determining the conditions under which the sum of a large number of random variables has a probability distribution that is approximately normal.”

SLIDE 18

Two basic inequalities and the weak law of large numbers

Markov’s inequality:

If X is a random variable that takes only non-negative values, then for any value a > 0, P(X ≥ a) ≤ E[X]/a.

Chebyshev’s inequality:

If X is a random variable with finite mean µ and variance σ², then for any value k > 0, P(|X − µ| ≥ k) ≤ σ²/k².

The weak law of large numbers (Bernoulli; Khintchine):

Let X1, X2, . . . , Xn be a sequence of independent and identically distributed random variables, each having a finite mean E[Xi] = µ. Then, for any value ε > 0,

P(|(X1 + . . . + Xn)/n − µ| ≥ ε) → 0 as n → ∞
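An illustration (mine, not from the slides): the running average of i.i.d. Bernoulli(p) variables drifts toward µ = p as n grows, as the weak law of large numbers states:

```python
import random

p = 0.3
for n in (10, 100, 10_000, 1_000_000):
    avg = sum(random.random() < p for _ in range(n)) / n
    print(n, avg)    # the averages approach 0.3 as n grows
```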

SLIDE 19

The central limit theorem for i.i.d. random variables

[Pierre Simon, Marquis de Laplace; Liapunoff in 1901-1902] Let X1, X2, . . . , Xn be a sequence of independent and identically distributed random variables, each having mean µ and variance σ². Then the distribution of

(X1 + . . . + Xn − nµ) / (σ√n)

tends to the standard normal (Gaussian) as n → ∞. That is, for −∞ < a < ∞,

P((X1 + . . . + Xn − nµ) / (σ√n) ≤ a) → (1/√(2π)) ∫_{−∞}^{a} e^{−x²/2} dx as n → ∞
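A sketch (mine, with arbitrary choices of n and trial count): standardized sums of i.i.d. uniform variables look standard normal, so the empirical P(Zn ≤ a) tracks Φ(a):

```python
import random
from statistics import NormalDist

n, trials = 30, 20_000
mu, sigma = 0.5, (1 / 12) ** 0.5          # mean and std of Uniform(0, 1)

zs = [(sum(random.random() for _ in range(n)) - n * mu) / (sigma * n**0.5)
      for _ in range(trials)]

for a in (-1.0, 0.0, 1.0):
    empirical = sum(z <= a for z in zs) / trials
    print(a, round(empirical, 3), round(NormalDist().cdf(a), 3))
```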

SLIDE 20

The central limit theorem for independent random variables

Let X1, X2, . . ., Xn be a sequence of independent random variables having respective means µi and variances σi².

If (a) the variables Xi are uniformly bounded, i.e. for some M ∈ R+, P(|Xi| < M) = 1 for all i, and (b) Σ_{i=1}^{∞} σi² = ∞, then

P( Σ_{i=1}^{n} (Xi − µi) / √(Σ_{i=1}^{n} σi²) ≤ a ) → Φ(a) as n → ∞

where Φ is the cumulative distribution function of the standard normal (Gaussian) distribution.

SLIDE 21

The strong law of large numbers

Let X1, X2, . . . , Xn be a sequence of independent and identically distributed random variables, each having a finite mean E[Xi] = µ. Then, with probability 1,

(X1 + . . . + Xn)/n → µ as n → ∞

That is, P( lim_{n→∞} (X1 + . . . + Xn)/n = µ ) = 1

SLIDE 22

Other inequalities

One-sided Chebyshev inequality: If X is a random variable with mean 0 and finite variance σ², then for any a > 0, P(X ≥ a) ≤ σ²/(σ² + a²).

Corollary: If E[X] = µ and Var(X) = σ², then for a > 0,
P(X ≥ µ + a) ≤ σ²/(σ² + a²)
P(X ≤ µ − a) ≤ σ²/(σ² + a²)

Chernoff bounds: Let M(t) not.= E[e^{tX}]. Then
P(X ≥ a) ≤ e^{−ta}M(t) for all t > 0
P(X ≤ a) ≤ e^{−ta}M(t) for all t < 0

SLIDE 23

  • 4. Estimation/inference of the parameters of probabilistic models from data

(based on [Durbin et al., Biological Sequence Analysis, 1998], pp. 311-313, 319-321)

A probabilistic model can be anything from a simple distribution to a complex stochastic grammar with many implicit probability distributions. Once the type of the model is chosen, the parameters have to be inferred from data. We will first consider the case of the multinomial distribution, and then we will present the different strategies that can be used in general.

SLIDE 24

A case study: Estimation of the parameters of a multinomial distribution from data

Assume that the observations (for example, when rolling a die about which we don’t know whether it is fair or not, or when counting the number of times the amino acid i occurs in a column of a multiple sequence alignment) can be expressed as counts ni for each outcome i (i = 1, . . . , K), and we want to estimate the probabilities θi of the underlying distribution.

Case 1:

When we have plenty of data, it is natural to use the maximum likelihood (ML) solution, i.e. the observed frequency θML_i = ni / Σj nj not.= ni/N.

Note: it is easy to show that indeed P(n | θML) > P(n | θ) for any θ ≠ θML:

log [P(n | θML) / P(n | θ)] = log [Πi (θML_i)^{ni} / Πi θi^{ni}] = Σi ni log(θML_i / θi) = N Σi θML_i log(θML_i / θi) > 0

The inequality follows from the fact that the relative entropy is always positive except when the two distributions are identical.
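A minimal sketch of the ML estimate for a multinomial, i.e. the observed frequencies; the counts below are invented (for a die, K = 6):

```python
counts = [3, 7, 5, 4, 6, 5]            # n_i
N = sum(counts)
theta_ml = [n / N for n in counts]     # θ^ML_i = n_i / N
print(theta_ml)
```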

SLIDE 25

Case 2:

When the data is scarce, it is not clear what the best estimate is. In general, we should use prior knowledge, via Bayesian statistics. For instance, one can use the Dirichlet distribution with parameters α:

P(θ | n) = P(n | θ)D(θ | α) / P(n)

It can be shown (see the calculation in R. Durbin et al.’s BSA book, p. 320) that the posterior mean estimate (PME) of the parameters is

θPME_i def.= ∫ θi P(θ | n) dθ = (ni + αi) / (N + Σj αj)

The α’s are like pseudocounts added to the real counts. (If we think of the α’s as extra observations added to the real ones, this is precisely the ML estimate!) This makes the Dirichlet regulariser very intuitive.

How to use the pseudocounts: If it is fairly obvious that a certain residue, let’s say i, is very common, then we should give it a very high pseudocount αi; if the residue j is generally rare, we should give it a low pseudocount.
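A sketch of the posterior mean estimate with Dirichlet pseudocounts, θPME_i = (ni + αi)/(N + Σj αj); the counts and α values are illustrative:

```python
counts = [0, 2, 1, 0, 0, 1]            # scarce data: only 4 die rolls observed
alphas = [1.0] * 6                     # uniform pseudocounts (Laplace smoothing)

N, A = sum(counts), sum(alphas)
theta_pme = [(n + a) / (N + A) for n, a in zip(counts, alphas)]
print(theta_pme)   # no outcome gets probability 0, unlike the ML estimate
```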

SLIDE 26

Strategies to be used in the general case

  • A. The Maximum Likelihood (ML) Estimate

When we wish to infer the parameters θ = (θi) for a model M from a set of data D, the most obvious strategy is to maximise P(D | θ, M) over all possible values of θ. Formally:

θML = argmax_θ P(D | θ, M)

Note: Generally speaking, when we treat P(x | y) as a function of x (and y is fixed), we refer to it as a probability. When we treat P(x | y) as a function of y (and x is fixed), we call it a likelihood. Note that a likelihood is not a probability distribution or density; it is simply a function of the variable y.

A serious drawback of maximum likelihood is that it gives poor results when data is scarce. The solution then is to introduce more prior knowledge, using Bayes’ theorem. (In the Bayesian framework, the parameters are themselves seen as random variables!)

SLIDE 27

  • B. The Maximum A Posteriori Probability (MAP) Estimate

θMAP def.= argmax_θ P(θ | D, M) = argmax_θ P(D | θ, M)P(θ | M) / P(D | M) = argmax_θ P(D | θ, M)P(θ | M)

The prior probability P(θ | M) has to be chosen in some reasonable manner, and this is the art of Bayesian estimation (although this freedom to choose a prior has made Bayesian statistics controversial at times...).

  • C. The Posterior Mean Estimator (PME)

θPME = ∫ θ P(θ | D, M) dθ

where the integral is over all probability vectors, i.e. all those that sum to one.

  • D. Yet another solution is to use the posterior probability P(θ | D, M) to sample from it (see [Durbin et al., 1998], section 11.4) and thereby locate regions of high probability for the model parameters.

SLIDE 28

  • 5. Elementary Information Theory

Definitions:

Let X and Y be discrete random variables.

  • Entropy: H(X) def.= Σx p(x) log(1/p(x)) = −Σx p(x) log p(x) = Ep[−log p(X)].
  • Specific conditional entropy: H(Y | X = x) def.= −Σ_{y∈Y} p(y | x) log p(y | x).
  • Average conditional entropy: H(Y | X) def.= Σ_{x∈X} p(x) H(Y | X = x) immed.= −Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y | x).
  • Joint entropy: H(X, Y) def.= −Σx Σy p(x, y) log p(x, y) dem.= H(X) + H(Y | X) dem.= H(Y) + H(X | Y).
  • Mutual information (or: Information gain): IG(X; Y) def.= H(X) − H(X | Y) immed.= H(Y) − H(Y | X) immed.= H(X, Y) − H(X | Y) − H(Y | X).
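A sketch computing entropy, conditional entropy and mutual information from the small joint pmf used earlier (values invented), with logarithms in base 2:

```python
from math import log2

joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def H(pmf_values):                            # H = −Σ p log p
    return -sum(p * log2(p) for p in pmf_values if p > 0)

pX = [0.5, 0.5]                               # marginals of the table above
pY = [0.4, 0.6]
H_XY = H(joint.values())
H_Y_given_X = H_XY - H(pX)                    # H(Y|X) = H(X,Y) − H(X)
IG = H(pX) + H(pY) - H_XY                     # IG(X;Y) = H(X) + H(Y) − H(X,Y)
print(H_XY, H_Y_given_X, IG)
```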

SLIDE 29

Basic properties of Entropy, Conditional Entropy, Joint Entropy and Mutual Information / Information Gain

  • 0 ≤ H(p1, . . . , pn) ≤ H(1/n, . . . , 1/n);

H(X) = 0 iff X is a constant random variable.

  • H(X, Y) ≤ H(X) + H(Y);

H(X, Y) = H(X) + H(Y) iff X and Y are independent;
H(X | Y) = H(X) iff X and Y are independent.

  • IG(X; Y) ≥ 0;

IG(X; Y) = 0 iff X and Y are independent.

SLIDE 30

The Relationship between Entropy, Conditional Entropy, Joint Entropy and Mutual Information

[Figure: Venn-style diagram with regions labelled I(X,Y), H(X|Y), H(Y|X), H(X,Y), H(X), H(Y)]

SLIDE 31

Other definitions

Let X and Y be discrete random variables, and p and q their respective pmf’s.

  • Relative entropy (or, Kullback-Leibler divergence):

KL(p || q) = −Σ_{x∈X} p(x) log(q(x)/p(x)) = Ep[log(p(X)/q(X))]

  • Cross-entropy:

CH(X, q) = −Σ_{x∈X} p(x) log q(x) = Ep[log(1/q(X))]
SLIDE 32

Basic properties of relative entropy and cross-entropy

  • KL(p || q) ≥ 0 for all p and q;

KL(p || q) = 0 iff p and q are identical.

  • KL is NOT a distance metric (because it is not symmetric)!!
  • The quantity

d(X, Y) def.= H(X, Y) − IG(X; Y) = H(X) + H(Y) − 2 IG(X; Y) = H(X | Y) + H(Y | X),

known as variation of information, is a distance metric.

  • IG(X; Y) = KL(pXY || pX pY) = Σx Σy p(x, y) log [p(x, y) / (p(x)p(y))].
  • If X is a discrete random variable, p its pmf and q another pmf (usually a model of p), then CH(X, q) = H(X) + KL(p || q), and therefore CH(X, q) ≥ H(X) ≥ 0.
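A sketch of KL divergence and cross-entropy for two small pmfs (made-up values), checking the identity CH(X, q) = H(X) + KL(p || q) from the last bullet:

```python
from math import log2

p = [0.5, 0.25, 0.25]      # true pmf
q = [0.4, 0.4, 0.2]        # model pmf

H_p = -sum(pi * log2(pi) for pi in p)
KL = sum(pi * log2(pi / qi) for pi, qi in zip(p, q))
CH = -sum(pi * log2(qi) for pi, qi in zip(p, q))
print(KL >= 0, abs(CH - (H_p + KL)) < 1e-12)   # True True
```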

SLIDE 33

  • 6. Recommended Exercises
  • From [Manning & Schütze, 2002], ch. 2:

Examples 1, 2, 4, 5, 7, 8, 9; Exercises 2.1, 2.3, 2.4, 2.5

  • From [Sheldon Ross, 1998], ch. 8:

Examples 2a, 2b, 3a, 3b, 3c, 5a, 5b

SLIDE 34

Addenda

  • A. Other Examples of Probability Distributions

SLIDE 35

Multinomial distribution:

generalises the binomial distribution to the case where there are K independent outcomes with probabilities θi, i = 1, . . . , K. The probability of getting ni occurrences of outcome i is given by

P(n | θ) = (n! / Πi ni!) Π_{i=1}^{K} θi^{ni}

where n = n1 + . . . + nK, θ = (θ1, . . . , θK).

Example: The outcome of rolling a die N times is described by a multinomial. The probabilities of each of the 6 outcomes are θ1, . . . , θ6. For a fair die, θ1 = . . . = θ6, and the probability of rolling it 12 times and getting each outcome twice is:

12!/(2!)⁶ · (1/6)¹² = 3.4 × 10⁻³

Note: The particular case n = 1 represents the categorical distribution. This is a generalisation of the Bernoulli distribution.
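A sketch of the multinomial pmf P(n | θ) = (n!/Πi ni!) Πi θi^{ni}, reproducing the fair-die example on the slide:

```python
from math import factorial, prod

def multinomial_pmf(counts, thetas):
    n = sum(counts)
    coef = factorial(n) / prod(factorial(c) for c in counts)
    return coef * prod(t**c for t, c in zip(thetas, counts))

print(multinomial_pmf([2] * 6, [1/6] * 6))   # ≈ 3.4e-3, as on the slide
```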

SLIDE 36

Poisson distribution (or, Poisson law of small numbers):

p(k) = (λ^k / k!) · e^{−λ}, with k ∈ N and parameter λ > 0. Mean = variance = λ.

[Figure: the probability mass function and the cumulative distribution function]

SLIDE 37

Exponential distribution (a.k.a. the negative exponential distribution):

p(x) = λe^{−λx} for x ≥ 0 and parameter λ > 0. Mean = λ⁻¹, variance = λ⁻².

[Figure: the probability density function and the cumulative distribution function]

SLIDE 38

Gamma distribution:

p(x) = x^(k−1) e^{−x/θ} / (Γ(k) θ^k) for x ≥ 0 and parameters k > 0 (shape) and θ > 0 (scale). Mean = kθ, variance = kθ².

The gamma function is a generalisation of the factorial function to real values. For any positive real number x, Γ(x + 1) = xΓ(x). (Thus, for integers, Γ(n) = (n − 1)!.)

[Figure: the probability density function and the cumulative distribution function]

SLIDE 39

Student’s t distribution:

p(x) = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) · (1 + x²/ν)^{−(ν+1)/2}

for x ∈ R and the parameter ν > 0 (the degrees of freedom).

Mean = 0 for ν > 1, otherwise undefined. Variance = ν/(ν−2) for ν > 2, ∞ for 1 < ν ≤ 2, otherwise undefined.

[Figure: the probability density function and the cumulative distribution function]

Note [from Wikipedia]: The t-distribution is symmetric and bell-shaped, like the normal distribution, but it has heavier tails, meaning that it is more prone to producing values that fall far from its mean.

SLIDE 40

Dirichlet distribution:

D(θ | α) = (1/Z(α)) Π_{i=1}^{K} θi^{αi−1} δ(Σ_{i=1}^{K} θi − 1)

where α = α1, . . . , αK with αi > 0 are the parameters, and the θi satisfy 0 ≤ θi ≤ 1 and sum to 1 (this being indicated by the delta function term δ(Σi θi − 1)). The normalising factor can be expressed in terms of the gamma function:

Z(α) = ∫ Π_{i=1}^{K} θi^{αi−1} δ(Σi θi − 1) dθ = Πi Γ(αi) / Γ(Σi αi)

Mean of θi: αi / Σj αj.

For K = 2, the Dirichlet distribution reduces to the beta distribution, and the normalising constant is the beta function.

SLIDE 41

Remark:

Concerning the multinomial and Dirichlet distributions:

The algebraic expression for the parameters θi is similar in the two distributions. However, the multinomial is a distribution over its exponents ni, whereas the Dirichlet is a distribution over the numbers θi that are exponentiated. The two distributions are said to be conjugate distributions, and their close formal relationship leads to a harmonious interplay in many estimation problems. Similarly, the gamma distribution is the conjugate of the Poisson distribution.

SLIDE 42

Addenda

  • B. Some Proofs

SLIDE 43

E[X + Y] = E[X] + E[Y]

where X and Y are random variables of the same type (i.e. either discrete or continuous).

The discrete case:

E[X + Y] = Σ_{ω∈Ω} (X(ω) + Y(ω)) · P(ω) = Σω X(ω) · P(ω) + Σω Y(ω) · P(ω) = E[X] + E[Y]

The continuous case:

E[X + Y] = ∫x ∫y (x + y) pXY(x, y) dy dx
= ∫x ∫y x pXY(x, y) dy dx + ∫x ∫y y pXY(x, y) dy dx
= ∫x x [∫y pXY(x, y) dy] dx + ∫y y [∫x pXY(x, y) dx] dy
= ∫x x pX(x) dx + ∫y y pY(y) dy = E[X] + E[Y]

SLIDE 44

X and Y are independent ⇒ E[XY] = E[X] · E[Y],

X and Y being random variables of the same type (i.e. either discrete or continuous).

The discrete case:

E[XY] = Σ_{x∈Val(X)} Σ_{y∈Val(Y)} xy P(X = x, Y = y)
= Σ_{x∈Val(X)} Σ_{y∈Val(Y)} xy P(X = x) · P(Y = y)
= Σ_{x∈Val(X)} [x P(X = x) Σ_{y∈Val(Y)} y P(Y = y)]
= Σ_{x∈Val(X)} x P(X = x) E[Y] = E[X] · E[Y]

The continuous case:

E[XY] = ∫x ∫y xy p(X = x, Y = y) dy dx
= ∫x ∫y xy p(X = x) · p(Y = y) dy dx
= ∫x x p(X = x) [∫y y p(Y = y) dy] dx
= ∫x x p(X = x) E[Y] dx = E[Y] · ∫x x p(X = x) dx = E[X] · E[Y]
slide-45
SLIDE 45

Binomial distribution: b(r; n, p)

def.

= Cr

n pr(1 − p)n−r Significance: b(r; n, p) is the number of heads in n independent flips of a coin having the head probability p. b(r; n, p) indeed represents a probability distribution:

  • b(r; n, p) = Cr

n pr(1 − p)n−r ≥ 0 for all p ∈ [0, 1], n ∈ N and r ∈ {0, 1, . . ., n},

  • n

r=0 b(r; n, p) = 1:

(1 − p)n + C1

np(1 − p)n−1 + · · · + Cn−1 n

pn−1(1 − p) + pn = [p + (1 − p)]n = 1

44.

SLIDE 46

Binomial distribution: calculating the mean

E[b(r; n, p)] def.= Σ_{r=0}^{n} r · b(r; n, p)
= 1 · C(n, 1) p(1 − p)^(n−1) + 2 · C(n, 2) p²(1 − p)^(n−2) + · · · + (n−1) · C(n, n−1) p^(n−1)(1 − p) + n · p^n
= p [C(n, 1)(1 − p)^(n−1) + 2 · C(n, 2) p(1 − p)^(n−2) + · · · + (n−1) · C(n, n−1) p^(n−2)(1 − p) + n · p^(n−1)]
= np [(1 − p)^(n−1) + C(n−1, 1) p(1 − p)^(n−2) + · · · + C(n−1, n−2) p^(n−2)(1 − p) + C(n−1, n−1) p^(n−1)]
= np [p + (1 − p)]^(n−1) = np

SLIDE 47

Binomial distribution: calculating the variance

following www.proofwiki.org/wiki/Variance_of_Binomial_Distribution, which cites “Probability: An Introduction”, by Geoffrey Grimmett and Dominic Welsh, Oxford Science Publications, 1986

We will make use of the formula Var[X] = E[X²] − E²[X]. By denoting q = 1 − p, it follows:

E[b²(r; n, p)] def.= Σ_{r=0}^{n} r² C(n, r) p^r q^(n−r)
= Σ_{r=0}^{n} r² [n(n−1) . . . (n−r+1) / r!] p^r q^(n−r)
= Σ_{r=1}^{n} r [n(n−1) . . . (n−r+1) / (r−1)!] p^r q^(n−r)
= Σ_{r=1}^{n} r n C(n−1, r−1) p^r q^(n−r)
= np Σ_{r=1}^{n} r C(n−1, r−1) p^(r−1) q^((n−1)−(r−1))

SLIDE 48

Binomial distribution: calculating the variance (cont’d)

By denoting j = r − 1 and m = n − 1, we get:

E[b²(r; n, p)] = np Σ_{j=0}^{m} (j + 1) C(m, j) p^j q^(m−j)
= np [Σ_{j=0}^{m} j C(m, j) p^j q^(m−j) + Σ_{j=0}^{m} C(m, j) p^j q^(m−j)]
= np [Σ_{j=0}^{m} j (m · . . . · (m−j+1) / j!) p^j q^(m−j) + (p + q)^m]   (note that p + q = 1)
= np [Σ_{j=1}^{m} m C(m−1, j−1) p^j q^(m−j) + 1]
= np [mp Σ_{j=1}^{m} C(m−1, j−1) p^(j−1) q^((m−1)−(j−1)) + 1]
= np [(n − 1)p (p + q)^(m−1) + 1] = np [(n − 1)p + 1] = n²p² − np² + np

Finally,

Var[X] = E[b²(r; n, p)] − (E[b(r; n, p)])² = n²p² − np² + np − n²p² = np(1 − p)

SLIDE 49

Binomial distribution: calculating the variance (another solution)

  • it can be shown relatively easily that any random variable following the binomial distribution b(r; n, p) can be seen as a sum of n independent variables, each following the Bernoulli distribution with parameter p;*
  • we know (or it can be proved immediately) that the variance of the Bernoulli distribution with parameter p is p(1 − p);
  • taking into account the additivity of variances (Var[X1 + X2 + . . . + Xn] = Var[X1] + Var[X2] + . . . + Var[Xn] if X1, X2, . . ., Xn are independent variables), it follows that Var[X] = np(1 − p).

*See www.proofwiki.org/wiki/Bernoulli_Process_as_Binomial_Distribution, which also cites “Probability: An Introduction” by Geoffrey Grimmett and Dominic Welsh, Oxford Science Publications, 1986.

SLIDE 50

The Gaussian distribution: p(X = x) = 1/(√(2π) σ) · e^{−(x − µ)²/(2σ²)}

Calculating the mean

E[N_{µ,σ}(x)] def.= ∫_{−∞}^{∞} x p(x) dx = 1/(√(2π) σ) ∫_{−∞}^{∞} x · e^{−(x − µ)²/(2σ²)} dx

Using the variable transformation v = (x − µ)/σ, which implies x = σv + µ and dx = σ dv:

E[X] = 1/(√(2π) σ) ∫_{−∞}^{∞} (σv + µ) e^{−v²/2} (σ dv)
= 1/√(2π) [σ ∫_{−∞}^{∞} v e^{−v²/2} dv + µ ∫_{−∞}^{∞} e^{−v²/2} dv]
= 1/√(2π) [−σ e^{−v²/2} |_{−∞}^{∞} + µ ∫_{−∞}^{∞} e^{−v²/2} dv]   (the first term is 0)
= µ/√(2π) ∫_{−∞}^{∞} e^{−v²/2} dv.

The last integral is computed as shown on the next slide.

SLIDE 51

The Gaussian distribution: calculating the mean (cont’d)

(∫_{v=−∞}^{∞} e^{−v²/2} dv)² = (∫_{x=−∞}^{∞} e^{−x²/2} dx) · (∫_{y=−∞}^{∞} e^{−y²/2} dy) = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} e^{−(x² + y²)/2} dy dx = ∫∫_{R²} e^{−(x² + y²)/2} dy dx

By switching from x, y to polar coordinates r, θ, it follows:

(∫_{v=−∞}^{∞} e^{−v²/2} dv)² = ∫_{r=0}^{∞} ∫_{θ=0}^{2π} e^{−r²/2} (r dr dθ) = ∫_{r=0}^{∞} r e^{−r²/2} (∫_{θ=0}^{2π} dθ) dr = 2π ∫_{r=0}^{∞} r e^{−r²/2} dr = 2π (−e^{−r²/2}) |_{0}^{∞} = 2π(0 − (−1)) = 2π

Note: x = r cos θ and y = r sin θ, with r ≥ 0 and θ ∈ [0, 2π). Therefore x² + y² = r², and the Jacobian determinant is

∂(x, y)/∂(r, θ) = | cos θ  −r sin θ ; sin θ  r cos θ | = r cos²θ + r sin²θ = r ≥ 0.

So, dx dy = r dr dθ.

SLIDE 52

The Gaussian distribution: calculating the variance

We will make use of the formula Var[X] = E[X²] − E²[X].

E[X²] = ∫_{−∞}^{∞} x² p(x) dx = 1/(√(2π) σ) ∫_{−∞}^{∞} x² · e^{−(x − µ)²/(2σ²)} dx

Again, using v = (x − µ)/σ, which implies x = σv + µ and dx = σ dv, we get:

E[X²] = 1/(√(2π) σ) ∫_{−∞}^{∞} (σv + µ)² e^{−v²/2} (σ dv)
= 1/√(2π) ∫_{−∞}^{∞} (σ²v² + 2σµv + µ²) e^{−v²/2} dv
= 1/√(2π) [σ² ∫_{−∞}^{∞} v² e^{−v²/2} dv + 2σµ ∫_{−∞}^{∞} v e^{−v²/2} dv + µ² ∫_{−∞}^{∞} e^{−v²/2} dv]

Note that we have already computed ∫_{−∞}^{∞} v e^{−v²/2} dv = 0 and ∫_{−∞}^{∞} e^{−v²/2} dv = √(2π).

SLIDE 53

The Gaussian distribution: calculating the variance (cont’d)

Therefore, we only need to compute

∫_{−∞}^{∞} v² e^{−v²/2} dv = ∫_{−∞}^{∞} (−v)(−v e^{−v²/2}) dv = ∫_{−∞}^{∞} (−v)(e^{−v²/2})′ dv
= (−v) e^{−v²/2} |_{−∞}^{∞} − ∫_{−∞}^{∞} (−1) e^{−v²/2} dv = 0 + ∫_{−∞}^{∞} e^{−v²/2} dv = √(2π).

So, E[X²] = 1/√(2π) [σ²√(2π) + 2σµ · 0 + µ²√(2π)] = σ² + µ².

Finally, Var[X] = E[X²] − (E[X])² = (σ² + µ²) − µ² = σ².

SLIDE 54

The covariance matrix Σ corresponding to a vector X made of n random variables is symmetric and positive semi-definite

  • a. Cov(X)_{i,j} def.= Cov(Xi, Xj) for all i, j ∈ {1, . . . , n}, and

Cov(Xi, Xj) def.= E[(Xi − E[Xi])(Xj − E[Xj])] = Cov(Xj, Xi),

therefore Cov(X) is a symmetric matrix.

  • b. We will show that zᵀΣz ≥ 0 for any z ∈ R^n:

zᵀΣz = Σ_{i=1}^{n} Σ_{j=1}^{n} (zi Σij zj) = Σ_{i=1}^{n} Σ_{j=1}^{n} (zi Cov[Xi, Xj] zj)
= Σ_{i=1}^{n} Σ_{j=1}^{n} (zi E[(Xi − E[Xi])(Xj − E[Xj])] zj)
= Σ_{i=1}^{n} Σ_{j=1}^{n} (E[(Xi − E[Xi])(Xj − E[Xj])] zi zj)
= E[Σ_{i=1}^{n} Σ_{j=1}^{n} (Xi − E[Xi])(Xj − E[Xj]) zi zj]
= E[((X − E[X])ᵀ · z)²] ≥ 0

SLIDE 55

If the covariance matrix of a multi-variate Gaussian distribution is diagonal, then its density is equal to the product of independent univariate Gaussian densities

Let us consider X = [X1 . . . Xn]ᵀ, µ ∈ R^n and Σ ∈ S^n₊, where S^n₊ is the set of symmetric positive definite matrices (which implies |Σ| ≠ 0 and (x − µ)ᵀΣ⁻¹(x − µ) > 0, therefore −(1/2)(x − µ)ᵀΣ⁻¹(x − µ) < 0).

The probability density function of a multi-variate Gaussian distribution with parameters µ and Σ is:

p(x; µ, Σ) = 1/((2π)^(n/2) |Σ|^(1/2)) exp(−(1/2)(x − µ)ᵀΣ⁻¹(x − µ))

Notation: X ∼ N(µ, Σ)

We will make the proof for n = 2:

x = [x1; x2], µ = [µ1; µ2], Σ = [σ1², 0; 0, σ2²]
SLIDE 56

A property of multi-variate Gaussians whose covariance matrices are diagonal (cont’d)

p(x; µ, Σ) = 1/(2π (σ1² σ2²)^(1/2)) exp(−(1/2) [x1 − µ1; x2 − µ2]ᵀ [σ1², 0; 0, σ2²]⁻¹ [x1 − µ1; x2 − µ2])
= 1/(2π σ1 σ2) exp(−(1/2) [x1 − µ1; x2 − µ2]ᵀ [1/σ1², 0; 0, 1/σ2²] [x1 − µ1; x2 − µ2])
= 1/(2π σ1 σ2) exp(−(1/2) [x1 − µ1; x2 − µ2]ᵀ [(x1 − µ1)/σ1²; (x2 − µ2)/σ2²])
= 1/(2π σ1 σ2) exp(−(1/(2σ1²))(x1 − µ1)² − (1/(2σ2²))(x2 − µ2)²)
= p(x1; µ1, σ1²) · p(x2; µ2, σ2²).

SLIDE 57

Derivation of the entropy definition, starting from a set of desirable properties

CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2.2

SLIDE 58

Remark:

The definition Hn(X) = −Σi pi log pi is not very intuitive.

Theorem:

If ψn(p1, . . . , pn) satisfies the following axioms:

  • A1. Hn should be continuous in the pi and symmetric in its arguments;
  • A2. if pi = 1/n, then Hn should be a monotonically increasing function of n;

(If all events are equally likely, then having more events means being more uncertain.)

  • A3. if a choice among N events is broken down into successive choices, then entropy should be the weighted sum of the entropy at each stage;

then ψn(p1, . . ., pn) = −K Σi pi log pi, where K is a positive constant.

Note: We will restrict the proof to the case p1, . . ., pn ∈ Q.

SLIDE 59

Example for axiom A3:

[Figure: two encodings of the choice among (a, b, c) with probabilities 1/2, 1/3, 1/6; Encoding 1 chooses among a, b, c directly, while Encoding 2 first chooses between a (1/2) and (b, c) (1/2), then between b (2/3) and c (1/3)]

H(1/2, 1/3, 1/6) = (1/2) log 2 + (1/3) log 3 + (1/6) log 6 = (1/2 + 1/6) log 2 + (1/3 + 1/6) log 3 = 2/3 + (1/2) log 3

H(1/2, 1/2) + (1/2) H(2/3, 1/3) = 1 + (1/2) [(2/3) log(3/2) + (1/3) log 3] = 1 + (1/2) [log 3 − 2/3] = 2/3 + (1/2) log 3

The next 3 slides:

Case 1: pi = 1/n for i = 1, . . ., n; proof steps

SLIDE 60

  • a. A(n) not.= ψn(1/n, 1/n, . . ., 1/n) implies A(s^m) = m A(s) for any s, m ∈ N*. (1)
  • b. If s, m ∈ N* (fixed), s ≠ 1, and t, n ∈ N* are such that s^m ≤ t^n ≤ s^(m+1), then

|m/n − log t / log s| ≤ 1/n. (2)

  • c. For s^m ≤ t^n ≤ s^(m+1) as above, it follows (immediately) that

ψ_{s^m}(1/s^m, . . . , 1/s^m) ≤ ψ_{t^n}(1/t^n, . . . , 1/t^n) ≤ ψ_{s^(m+1)}(1/s^(m+1), . . . , 1/s^(m+1)),

i.e. A(s^m) ≤ A(t^n) ≤ A(s^(m+1))

  • c′. Show that |m/n − A(t)/A(s)| ≤ 1/n for s ≠ 1. (3)
  • d. Combining (2) + (3) immediately gives |A(t)/A(s) − log t / log s| ≤ 2/n for s ≠ 1. (4)
  • d′. Show that this inequality implies A(t) = K log t with K > 0 (due to A2). (5)

SLIDE 61

Proof

a.

[Figure: a tree of depth m in which each node branches into s equally likely choices (1/s each); level 1, level 2, . . ., level m, with s branches at each node]

Applying axiom A3 to the encoding above gives:

A(s^m) = A(s) + s · (1/s) A(s) + s² · (1/s²) A(s) + . . . + s^(m−1) · (1/s^(m−1)) A(s) = A(s) + A(s) + A(s) + . . . + A(s)  (m times)  = m A(s)

SLIDE 62

Proof (cont’d)

b. s^m ≤ t^n ≤ s^(m+1) ⇒ m log s ≤ n log t ≤ (m + 1) log s ⇒ m/n ≤ log t / log s ≤ m/n + 1/n ⇒ 0 ≤ log t / log s − m/n ≤ 1/n ⇒ |log t / log s − m/n| ≤ 1/n

c. A(s^m) ≤ A(t^n) ≤ A(s^(m+1)) ⇒ (by (1)) m A(s) ≤ n A(t) ≤ (m + 1) A(s) ⇒ (s ≠ 1) m/n ≤ A(t)/A(s) ≤ m/n + 1/n ⇒ 0 ≤ A(t)/A(s) − m/n ≤ 1/n ⇒ |A(t)/A(s) − m/n| ≤ 1/n

d. Consider again s^m ≤ t^n ≤ s^(m+1) with s, t fixed. If m → ∞ then n → ∞, and from |A(t)/A(s) − log t / log s| ≤ 2/n it follows that |A(t)/A(s) − log t / log s| → 0. Therefore A(t)/A(s) − log t / log s = 0, and so A(t)/A(s) = log t / log s. Finally, A(t) = (A(s)/log s) log t = K log t, where K = A(s)/log s > 0 (if s ≠ 1).

SLIDE 63

Case 2: pi ∈ Q for i = 1, . . ., n

Let us consider a set of N equiprobable random events, and P = (S1, S2, . . . , Sk) a partition of this set. Let us denote pi = |Si|/N. A “natural” two-step encoding (as shown in the figure) leads to

A(N) = ψk(p1, . . . , pk) + Σi pi A(|Si|),

based on axiom A3. Finally, using the result A(t) = K log t gives:

K log N = ψk(p1, . . . , pk) + K Σi pi log |Si|

[Figure: two-level encoding; level 1 chooses a class Si with probability |Si|/N, level 2 chooses an element within Si with probability 1/|Si|]

⇒ ψk(p1, . . . , pk) = K [log N − Σi pi log |Si|] = K [log N · Σi pi − Σi pi log |Si|] = −K Σi pi log(|Si|/N) = −K Σi pi log pi

SLIDE 64

Addenda

  • C. Some Examples

SLIDE 65

Exemplifying the computation of expected values for random variables and the [use of] sensitivity of a test in a real-world application

CMU, 2009 fall, Geoff Gordon, HW1, pr. 2

SLIDE 66

There is a disease which affects 1 in 500 people. A 100.00 dollar blood test can help reveal whether a person has the disease. A positive outcome indicates that the person may have the disease. The test has perfect sensitivity (true positive rate), i.e., a person who has the disease tests positive 100% of the time. However, the test has 99% specificity (true negative rate), i.e., a healthy person tests positive 1% of the time.

  • a. A randomly selected individual is tested and the result is positive. What is the probability of the individual having the disease?
  • b. There is a second, more expensive test which costs 10,000.00 dollars but is exact, with 100% sensitivity and specificity. If we require all people who test positive with the less expensive test to be tested with the more expensive test, what is the expected cost to check whether an individual has the disease?
  • c. A pharmaceutical company is attempting to decrease the cost of the second (perfect) test. How much would it have to make the second test cost, so that the first test is no longer needed? That is, at what cost is it cheaper simply to use the perfect test alone, instead of screening with the cheaper test as described in part b?

SLIDE 67

Random variables:

B: 1/true for persons having the disease, 0/false otherwise;
T1: the result of the first test: + (indicating disease) or −;
T2: the result of the second test: again + or −.

Known facts:

P(B) = 1/500
P(T1 = + | B) = 1, P(T1 = + | ¬B) = 1/100
P(T2 = + | B) = 1, P(T2 = + | ¬B) = 0

a.

P(B | T1 = +) = P(T1 = + | B) · P(B) / [P(T1 = + | B) · P(B) + P(T1 = + | ¬B) · P(¬B)] = (1 · 1/500) / (1 · 1/500 + (1/100) · (499/500)) = 100/599 ≈ 0.1669

SLIDE 68

b. C = c1 if the person is tested only with the first test; C = c1 + c2 if the person is tested with both tests
⇒ P(C = c1) = P(T1 = −) and P(C = c1 + c2) = P(T1 = +)

P(T1 = +) = P(T1 = + | B) · P(B) + P(T1 = + | ¬B) · P(¬B) = 1 · 1/500 + (1/100) · (499/500) = 599/50000 = 0.01198

⇒ E[C] = c1 · (1 − P(T1 = +)) + (c1 + c2) · P(T1 = +) = c1 − c1 · P(T1 = +) + c1 · P(T1 = +) + c2 · P(T1 = +) = c1 + c2 · P(T1 = +) = 100 + 10000 · 599/50000 = 219.8 ≈ 220 dollars

SLIDE 69

c. Let cn be the new price of the second test (T′2). Using the perfect test alone is cheaper when

cn ≤ E[C′] = c1 · P(C′ = c1) + (c1 + cn) · P(C′ = c1 + cn) = c1 + cn · P(T1 = +) = 100 + cn · 599/50000

At the break-even point, cn = 100 + cn · 0.01198 ⇒ cn = 100/0.98802 ≈ 101.2125.
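A sketch reproducing the three answers above, with all values taken from the problem statement:

```python
p_b = 1 / 500                     # disease prevalence
p_pos_healthy = 1 / 100           # 1 − specificity of the cheap test
c1, c2 = 100.0, 10_000.0

p_pos = 1.0 * p_b + p_pos_healthy * (1 - p_b)       # P(T1 = +)
posterior = 1.0 * p_b / p_pos                       # a. P(B | T1 = +) ≈ 0.1669
expected_cost = c1 + c2 * p_pos                     # b. ≈ 219.8 dollars
break_even = c1 / (1 - p_pos)                       # c. ≈ 101.21 dollars
print(posterior, expected_cost, break_even)
```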

SLIDE 70

Using the Central Limit Theorem (the i.i.d. version) to compute the real error of a classifier

CMU, 2008 fall, Eric Xing, HW3, pr. 3.3

SLIDE 71

Chris recently adopted a new (binary) classifier to filter email spam. He wants to quantitatively evaluate how good the classifier is. He has a small dataset of 100 emails on hand which, you can assume, are randomly drawn from all emails. He tests the classifier on the 100 emails and gets 83 classified correctly, so the error rate on the small dataset is 17%. However, the error rate measured on 100 samples could be either higher or lower than the real error rate just by chance. With a confidence level of 95%, what is likely to be the range of the real error rate? Please write down all important steps. (Hint: You need some approximation in this problem.)

SLIDE 72

Notations:

Let Xi, i = 1, . . . , n = 100 be defined as: Xi = 1 if email i was incorrectly classified, and 0 otherwise;

E[Xi] not.= µ not.= e_real;  Var(Xi) not.= σ²

e_sample not.= (X1 + . . . + Xn)/n = 0.17

Zn = (X1 + . . . + Xn − nµ) / (√n σ)  (the standardized form of X1 + . . . + Xn)

Key insight:

Calculating the real error of the classifier (more exactly, a symmetric interval around the real error p not.= µ) with a “confidence” of 95% amounts to finding a > 0 such that P(|Zn| ≤ a) ≥ 0.95.

SLIDE 73

Calculus:

|Zn| ≤ a ⇔ |X1 + . . . + Xn − nµ| / (√n σ) ≤ a ⇔ |X1 + . . . + Xn − nµ| ≤ aσ√n ⇔ |(X1 + . . . + Xn − nµ)/n| ≤ aσ/√n ⇔ |(X1 + . . . + Xn)/n − µ| ≤ aσ/√n ⇔ |e_sample − e_real| ≤ aσ/√n ⇔ |e_real − e_sample| ≤ aσ/√n ⇔ −aσ/√n ≤ e_real − e_sample ≤ aσ/√n ⇔ e_sample − aσ/√n ≤ e_real ≤ e_sample + aσ/√n ⇔ e_real ∈ [e_sample − aσ/√n, e_sample + aσ/√n]
SLIDE 74

Important facts:

The Central Limit Theorem: Zn → N(0; 1). Therefore, P(|Zn| ≤ a) ≈ P(|X| ≤ a) = Φ(a) − Φ(−a), where X ∼ N(0; 1) and Φ is the cumulative distribution function of N(0; 1).

Calculus:

Φ(−a) + Φ(a) = 1 ⇒ P(|Zn| ≤ a) = Φ(a) − Φ(−a) = 2Φ(a) − 1

P(|Zn| ≤ a) = 0.95 ⇔ 2Φ(a) − 1 = 0.95 ⇔ Φ(a) = 0.975 ⇔ a ≅ 1.96 (see the Φ table)

Finally: σ² not.= Var_real ≈ Var_sample due to the above theorem, and Var_sample = e_sample(1 − e_sample) because the Xi are Bernoulli variables.

⇒ aσ/√n = 1.96 · √(0.17 · (1 − 0.17)) / √100 ≅ 0.07

|e_real − e_sample| ≤ 0.07 ⇔ |e_real − 0.17| ≤ 0.07 ⇔ −0.07 ≤ e_real − 0.17 ≤ 0.07 ⇔ e_real ∈ [0.10, 0.24]
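A sketch reproducing the interval above: a normal-approximation confidence interval for a binomial proportion, using the 0.975 quantile of N(0, 1):

```python
from statistics import NormalDist
from math import sqrt

n, e_sample = 100, 0.17
a = NormalDist().inv_cdf(0.975)                        # ≈ 1.96
half_width = a * sqrt(e_sample * (1 - e_sample) / n)   # aσ/√n ≈ 0.07
print(e_sample - half_width, e_sample + half_width)    # ≈ [0.10, 0.24]
```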