SLIDE 1

Probability Distributions

Sargur N. Srihari

SLIDE 2

Distributions: Landscape

Discrete, binary: Bernoulli, Binomial, Beta
Discrete, multivalued: Multinomial, Dirichlet
Continuous: Gaussian, Angular (Von Mises), Gamma, Wishart, Student's-t, Exponential, Uniform

SLIDE 3

Distributions: Relationships

  • Bernoulli: single binary variable
  • Binomial: N samples of a Bernoulli (reduces to the Bernoulli for N=1; approaches a Gaussian for large N)
  • Multinomial: one of K values = K-dimensional binary vector (reduces to the Bernoulli for K=2)
  • Beta: continuous variable in [0,1]; conjugate prior of the Bernoulli/Binomial parameter
  • Dirichlet: K random variables in [0,1]; conjugate prior of the Multinomial parameters
  • Gamma: conjugate prior of the univariate Gaussian precision
  • Wishart: conjugate prior of the multivariate Gaussian precision matrix
  • Gaussian-Gamma: conjugate prior of the univariate Gaussian with unknown mean and precision
  • Gaussian-Wishart: conjugate prior of the multivariate Gaussian with unknown mean and precision matrix
  • Student's-t: generalization of the Gaussian robust to outliers; an infinite mixture of Gaussians
  • Exponential: special case of the Gamma
  • Uniform: special case of the Beta (a=b=1)
  • Von Mises: angular (circular) distribution

SLIDE 4

Binary Variables

Bernoulli, Binomial and Beta

SLIDE 5

Bernoulli Distribution

  • Expresses the distribution of a single binary-valued random variable x ∈ {0,1}
  • Probability of x=1 is denoted by the parameter µ, i.e.,

p(x=1|µ) = µ

  • Therefore

p(x=0|µ) = 1−µ

  • The probability distribution has the form Bern(x|µ) = µ^x (1−µ)^(1−x)
  • Mean is shown to be E[x] = µ
  • Variance is var[x] = µ(1−µ)
  • Likelihood of N observations x1,..,xN independently drawn from p(x|µ) is

p(D|µ) = Πn µ^xn (1−µ)^(1−xn)

  • Log-likelihood is

ln p(D|µ) = Σn { xn ln µ + (1−xn) ln(1−µ) }

  • The maximum likelihood estimator, obtained by setting the derivative of ln p(D|µ) wrt µ equal to zero, is µML = (1/N) Σn xn
  • If the number of observations with x=1 is m, then µML = m/N (checked numerically below)

Jacob Bernoulli 1654-1705
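
As a quick numerical check (a minimal sketch, not from the slides; the synthetic dataset and mu_true are illustrative assumptions):

```python
# Sketch: maximum-likelihood estimate of the Bernoulli parameter mu.
import numpy as np

rng = np.random.default_rng(0)
mu_true = 0.7                                 # assumed "true" parameter
x = rng.binomial(n=1, p=mu_true, size=100)    # 100 Bernoulli draws

m = x.sum()          # number of observations with x = 1
mu_ml = m / len(x)   # mu_ML = m / N
print(f"mu_ML = {mu_ml:.3f}  (true mu = {mu_true})")
```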

SLIDE 6

Binomial Distribution

  • Related to the Bernoulli distribution
  • Expresses the distribution of m

– the number of observations for which x=1 in N trials

  • Each such sequence of observations has probability proportional to µ^m (1−µ)^(N−m)
  • Adding up all ways of obtaining m heads gives

Bin(m|N,µ) = (N choose m) µ^m (1−µ)^(N−m)

  • Mean and variance are

E[m] = Nµ    var[m] = Nµ(1−µ)

Histogram of Binomial for N=10 and µ=0.25
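
A minimal sketch, assuming SciPy is available, that evaluates the pmf behind this histogram (scipy.stats.binom is the library's binomial distribution, not code from the slides):

```python
# Sketch: binomial pmf, mean and variance for N=10, mu=0.25.
from scipy.stats import binom

N, mu = 10, 0.25
for m in range(N + 1):
    print(m, round(binom.pmf(m, N, mu), 4))
# mean = N*mu, var = N*mu*(1-mu)
print("mean:", binom.mean(N, mu), "var:", binom.var(N, mu))
```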

SLIDE 7

Beta Distribution

  • Beta distribution:

Beta(µ|a,b) = [Γ(a+b) / (Γ(a)Γ(b))] µ^(a−1) (1−µ)^(b−1)

  • where the Gamma function is defined as

Γ(x) = ∫0^∞ u^(x−1) e^(−u) du

  • a and b are hyperparameters that control the distribution of the parameter µ
  • Mean and variance:

E[µ] = a/(a+b)    var[µ] = ab / ((a+b)^2 (a+b+1))

[Figure: Beta distribution as a function of µ for hyperparameter settings a=0.1, b=0.1; a=1, b=1; a=2, b=3; a=8, b=4]
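
A minimal sketch, assuming SciPy, that reproduces the moments and densities for the hyperparameter settings in the figure:

```python
# Sketch: Beta(mu|a,b) density, mean and variance via scipy.stats.beta.
import numpy as np
from scipy.stats import beta

mu = np.linspace(0.01, 0.99, 5)   # a few points in (0,1)
for a, b in [(0.1, 0.1), (1, 1), (2, 3), (8, 4)]:
    print(f"a={a}, b={b}: mean={beta.mean(a, b):.3f}, var={beta.var(a, b):.4f}")
    print("  pdf:", np.round(beta.pdf(mu, a, b), 3))
```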

SLIDE 8

Bayesian Inference with Beta

  • MLE of µ in the Bernoulli is the fraction of observations with x=1

– Severely over-fitted for small data sets

  • The likelihood function takes products of factors of the form µ^x (1−µ)^(1−x)
  • If the prior distribution of µ is chosen to be proportional to powers of µ and 1−µ, the posterior will have the same functional form as the prior

– Called conjugacy

  • The Beta has a form suitable for a prior distribution p(µ)

SLIDE 9

Bayesian Inference with Beta

  • The posterior, obtained by multiplying the beta prior with the binomial likelihood, yields

p(µ|m,l,a,b) ∝ µ^(m+a−1) (1−µ)^(l+b−1)

– where m is the number of heads
– l = N−m is the number of tails

  • It is another beta distribution

– Effectively increases the value of a by m and of b by l (see the sketch below)
– As the number of observations increases, the distribution becomes more peaked

[Figure: illustration of one step of sequential inference: prior Beta(µ|a=2,b=2), likelihood µ^1(1−µ)^0 for a single observation N=m=1 with x=1, posterior Beta(µ|a=3,b=2)]
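
The update rule is simple enough to state as code; this is a minimal sketch and beta_posterior is a hypothetical helper name:

```python
# Sketch: conjugate update Beta(a,b) + (m heads, l tails) -> Beta(a+m, b+l).
def beta_posterior(a, b, m, l):
    """Posterior hyperparameters after observing m ones and l zeros."""
    return a + m, b + l

# Slide example: prior Beta(2,2), single observation x=1 (m=1, l=0) -> Beta(3,2)
print(beta_posterior(2, 2, 1, 0))  # (3, 2)
```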

SLIDE 10

Predicting next trial outcome

  • Need the predictive distribution of x given the observed data D

– From the sum and product rules:

p(x=1|D) = ∫₀¹ p(x=1,µ|D) dµ = ∫₀¹ p(x=1|µ) p(µ|D) dµ = ∫₀¹ µ p(µ|D) dµ = E[µ|D]

  • The expected value of the posterior distribution can be shown to be

p(x=1|D) = (m+a) / (m+a+l+b)

– which is the fraction of observations (both fictitious and real) that correspond to x=1 (see the sketch below)

  • Maximum likelihood and Bayesian results agree in the limit of infinitely many observations

– On average, uncertainty (variance) decreases with more observed data
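
A one-line check of the predictive formula (a sketch; predictive is a hypothetical helper):

```python
# Sketch: predictive probability p(x=1|D) = E[mu|D] = (m+a)/(m+a+l+b).
def predictive(a, b, m, l):
    return (m + a) / (m + a + l + b)

# Prior Beta(2,2) plus one observed head: (1+2)/(1+2+0+2) = 0.6
print(predictive(2, 2, 1, 0))
```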

SLIDE 11

Summary

  • The distribution of a single binary variable is represented by the Bernoulli
  • The Binomial is related to the Bernoulli

– Expresses the distribution of the number of occurrences of either 1 or 0 in N trials

  • The Beta distribution is a conjugate prior for the Bernoulli

– Both have the same functional form

SLIDE 12

Multinomial Variables

Generalized Bernoulli and Dirichlet

SLIDE 13

Generalization of Bernoulli

  • Discrete variable that takes one of K values (instead of 2)
  • Represent as a 1-of-K scheme

– Represent x as a K-dimensional vector
– If x takes value 3 (with K=6) we represent it as x=(0,0,1,0,0,0)^T
– Such vectors satisfy Σk xk = 1

  • If the probability of xk=1 is denoted µk, then the distribution of x is given by

p(x|µ) = Πk µk^xk,  where µ=(µ1,..,µK)^T with µk ≥ 0 and Σk µk = 1

A generalized Bernoulli (see the sketch below)
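
A minimal sketch of the 1-of-K representation and this distribution (the vector mu below is an illustrative assumption):

```python
# Sketch: 1-of-K coding and the generalized Bernoulli p(x|mu) = prod_k mu_k^x_k.
import numpy as np

K = 6
x = np.zeros(K)
x[2] = 1.0                                      # value 3 -> (0,0,1,0,0,0)^T
mu = np.array([0.1, 0.1, 0.3, 0.2, 0.2, 0.1])   # mu_k >= 0, sums to 1

p_x = np.prod(mu ** x)   # the product picks out mu_k of the active component
print(x, p_x)            # 0.3
```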

SLIDE 14

Likelihood Function

  • Given a data set D of N independent observations x1,..,xN
  • The likelihood function has the form

p(D|µ) = Πn Πk µk^xnk = Πk µk^mk

  • where mk = Σn xnk is the number of observations with xk=1
  • The maximum likelihood solution (obtained from the log-likelihood, setting its derivative wrt µk to zero subject to Σk µk = 1) is

µk^ML = mk/N

which is the fraction of the N observations for which xk=1 (see the sketch below)
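
A minimal sketch of the ML solution on synthetic one-hot data (mu_true is an illustrative assumption):

```python
# Sketch: mu_k^ML = m_k / N from N one-hot observations.
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([0.2, 0.5, 0.3])
X = rng.multinomial(1, mu_true, size=1000)   # N x K matrix of one-hot rows

m = X.sum(axis=0)      # counts m_k of observations with x_k = 1
mu_ml = m / len(X)     # fraction of observations in each state
print(mu_ml)
```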

SLIDE 15

Generalized Binomial Distribution

  • Multinomial distribution:

Mult(m1,..,mK|µ,N) = (N choose m1 m2 .. mK) Πk µk^mk

  • where the normalization coefficient is the number of ways of partitioning N objects into K groups of sizes m1,..,mK
  • Given by

(N choose m1 m2 .. mK) = N! / (m1! m2! .. mK!)
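
A minimal sketch, assuming SciPy, that evaluates this pmf and its normalization coefficient (the counts and probabilities are illustrative):

```python
# Sketch: multinomial pmf and the coefficient N!/(m_1!..m_K!).
from math import factorial
from scipy.stats import multinomial

N, mu = 10, [0.2, 0.5, 0.3]
m = [2, 5, 3]
coeff = factorial(N) // (factorial(m[0]) * factorial(m[1]) * factorial(m[2]))
print("coefficient:", coeff)
print("pmf:", multinomial.pmf(m, N, mu))
```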
SLIDE 16

Dirichlet Distribution

  • Family of prior distributions for the parameters µk of the multinomial distribution
  • By inspection of the multinomial, the form of the conjugate prior is

p(µ|α) ∝ Πk µk^(αk−1)

  • Normalized form of the Dirichlet distribution:

Dir(µ|α) = [Γ(α0) / (Γ(α1)..Γ(αK))] Πk µk^(αk−1),  where α0 = Σk αk

Lejeune Dirichlet 1805-1859
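
A minimal sketch, assuming SciPy, that evaluates the density at one point of the simplex for the parameter settings shown on the next slide:

```python
# Sketch: Dirichlet density Dir(mu|alpha) via scipy.stats.dirichlet.
from scipy.stats import dirichlet

mu = [0.2, 0.3, 0.5]             # a point on the K=3 simplex (sums to 1)
for alpha in ([0.1] * 3, [1.0] * 3, [10.0] * 3):
    print(alpha, dirichlet.pdf(mu, alpha))
```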

SLIDE 17

Dirichlet over 3 variables

  • Due to the summation constraint Σk µk = 1

– The distribution over the space of {µk} is confined to a simplex of dimensionality K−1
– For K=3 the simplex is a triangle

[Figure: plots of the Dirichlet distribution over the simplex for parameter settings αk=0.1, αk=1, αk=10]

SLIDE 18

Dirichlet Posterior Distribution

  • Multiplying the prior by the likelihood gives

p(µ|D,α) ∝ p(D|µ) p(µ|α) ∝ Πk µk^(αk+mk−1)

  • which has the form of a Dirichlet distribution, Dir(µ|α+m) (see the sketch below)
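
A minimal sketch of the conjugate update (the prior and counts are illustrative):

```python
# Sketch: Dirichlet posterior Dir(mu|alpha + m) from prior Dir(mu|alpha).
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # prior hyperparameters alpha_k
m = np.array([2, 5, 3])             # observed counts m_k
alpha_post = alpha + m              # posterior hyperparameters
print(alpha_post)                   # [3. 6. 4.]
```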
SLIDE 19

Summary

  • Multinomial is a generalization of the Bernoulli

– The variable takes one of K values instead of 2

  • The conjugate prior of the multinomial is the Dirichlet distribution