 
Machine Learning for Signal Processing
Expectation Maximization
Mixture Models
Bhiksha Raj
27 Oct 2016
11755/18797
Learning Distributions for Data
• Problem: Given a collection of examples from some data, estimate its distribution
• Solution: Assign a model to the distribution
  – Learn parameters of the model from data
• Models can be arbitrarily complex
  – Mixture densities, hierarchical models
A Thought Experiment
6 3 1 5 4 1 2 4 …
• A person shoots a loaded die repeatedly
• You observe the series of outcomes
• You can form a good idea of how the die is loaded
  – Figure out what the probabilities of the various numbers are for the die
• P(number) = count(number)/count(rolls)
• This is a maximum likelihood estimate
  – The estimate that makes the observed sequence of numbers most probable
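The counting rule P(number) = count(number)/count(rolls) can be sketched in a few lines of Python. This is only an illustration: the function name `ml_estimate` is made up here, and the data are just the eight rolls shown on the slide, far too few for a reliable estimate.

```python
from collections import Counter

def ml_estimate(rolls, faces=range(1, 7)):
    """Maximum likelihood estimate of each face's probability by counting."""
    counts = Counter(rolls)          # how often each number was rolled
    total = len(rolls)               # total number of observed rolls
    return {face: counts[face] / total for face in faces}

# The first eight outcomes listed on the slide.
rolls = [6, 3, 1, 5, 4, 1, 2, 4]
probs = ml_estimate(rolls)
for face, p in probs.items():
    print(face, p)
```

Faces that never appear get probability 0 under this estimate, which is one reason many observations are needed in practice.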
The Multinomial Distribution
• A probability distribution over a discrete collection of items is a multinomial
  $$P(X : X \text{ belongs to a discrete set}) = P(X)$$
• E.g. the roll of a die
  – X : X in (1,2,3,4,5,6)
• Or the toss of a coin
  – X : X in (heads, tails)
Maximum Likelihood Estimation
[Figure: histogram of observed counts n_1 … n_6 and the corresponding probabilities p_1 … p_6]
• Basic principle: Assign a form to the distribution
  – E.g. a multinomial
  – Or a Gaussian
• Find the distribution that best fits the histogram of the data
Defining “Best Fit”
• The data are generated by draws from the distribution
  – I.e. the generating process draws from the distribution
• Assumption: The world is a boring place
  – The data you have observed are very typical of the process
• Consequent assumption: The distribution has a high probability of generating the observed data
  – Not necessarily true
• Select the distribution that has the highest probability of generating the data
  – Should assign lower probability to less frequent observations and vice versa
Maximum Likelihood Estimation: Multinomial
• Probability of generating (n_1, n_2, n_3, n_4, n_5, n_6)
  $$P(n_1, n_2, n_3, n_4, n_5, n_6) = \mathrm{Const} \prod_i p_i^{n_i}$$
• Find p_1, p_2, p_3, p_4, p_5, p_6 so that the above is maximized
• Alternately maximize
  $$\log P(n_1, n_2, n_3, n_4, n_5, n_6) = \log(\mathrm{Const}) + \sum_i n_i \log p_i$$
  – log() is a monotonic function
  – argmax_x f(x) = argmax_x log(f(x))
• Solving for the probabilities gives us
  $$p_i = \frac{n_i}{\sum_j n_j}$$
  – Requires constrained optimization to ensure probabilities sum to 1
• EVENTUALLY IT'S JUST COUNTING!
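The constrained optimization mentioned above can be carried out with a Lagrange multiplier. The following is a sketch of that standard derivation; the multiplier symbol $\lambda$ is introduced here and does not appear on the slide.

```latex
% Maximize the log-likelihood subject to \sum_i p_i = 1
\mathcal{L}(p, \lambda) = \sum_i n_i \log p_i + \lambda \Bigl( 1 - \sum_i p_i \Bigr)

% Set the derivative with respect to each p_i to zero
\frac{\partial \mathcal{L}}{\partial p_i} = \frac{n_i}{p_i} - \lambda = 0
\quad \Rightarrow \quad p_i = \frac{n_i}{\lambda}

% Enforce the constraint to solve for \lambda
\sum_i p_i = 1 \quad \Rightarrow \quad \lambda = \sum_j n_j
\quad \Rightarrow \quad p_i = \frac{n_i}{\sum_j n_j}
```

The constant term drops out because it does not depend on the p_i.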
Segue: Gaussians
$$P(X) = N(X; \mu, \Theta) = \frac{1}{\sqrt{(2\pi)^d |\Theta|}} \exp\left( -0.5\, (X - \mu)^T \Theta^{-1} (X - \mu) \right)$$
• Parameters of a Gaussian:
  – Mean $\mu$, covariance $\Theta$
Maximum Likelihood: Gaussian
• Given a collection of observations (X_1, X_2, …), estimate mean $\mu$ and covariance $\Theta$
  $$P(X_1, X_2, \ldots) = \prod_i \frac{1}{\sqrt{(2\pi)^d |\Theta|}} \exp\left( -0.5\, (X_i - \mu)^T \Theta^{-1} (X_i - \mu) \right)$$
  $$\log P(X_1, X_2, \ldots) = C - 0.5 \sum_i \left( \log |\Theta| + (X_i - \mu)^T \Theta^{-1} (X_i - \mu) \right)$$
• Maximizing w.r.t. $\mu$ and $\Theta$ gives us
  $$\mu = \frac{1}{N} \sum_i X_i \qquad \Theta = \frac{1}{N} \sum_i (X_i - \mu)(X_i - \mu)^T$$
• IT'S STILL JUST COUNTING!
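As a rough illustration of the closed-form Gaussian estimates (sample mean, and average outer product of the centered observations), here is a dependency-free Python sketch. The function name `gaussian_ml` and the three data points are invented for the example. Note that the ML covariance divides by N, not N - 1.

```python
def gaussian_ml(samples):
    """ML estimates of the mean vector and covariance matrix (divides by N)."""
    n = len(samples)
    d = len(samples[0])
    # Mean: average of the observations, one dimension at a time.
    mean = [sum(x[k] for x in samples) / n for k in range(d)]
    # Covariance: average outer product of the centered observations.
    cov = [
        [sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in samples) / n
         for b in range(d)]
        for a in range(d)
    ]
    return mean, cov

# Made-up 2-D observations, purely for illustration.
data = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
mean, cov = gaussian_ml(data)
print(mean)
print(cov)
```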
Laplacian
$$P(x) = L(x; \mu, b) = \frac{1}{2b} \exp\left( -\frac{|x - \mu|}{b} \right)$$
• Parameters: Median $\mu$, scale $b$ ($b > 0$)
  – $\mu$ is also the mean, but is better viewed as the median
Maximum Likelihood: Laplacian
• Given a collection of observations (x_1, x_2, …), estimate mean $\mu$ and scale $b$
  $$\log P(x_1, x_2, \ldots) = C - N \log(b) - \sum_i \frac{|x_i - \mu|}{b}$$
• Maximizing w.r.t. $\mu$ and $b$ gives us
  $$\mu = \mathrm{median}(\{x_i\}) \qquad b = \frac{1}{N} \sum_i |x_i - \mu|$$
• Still just counting
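The Laplacian estimates are equally short in code: the location is the sample median and the scale is the mean absolute deviation from it. The sketch below uses a hypothetical function name `laplacian_ml` and made-up observations.

```python
from statistics import median

def laplacian_ml(xs):
    """ML estimates for a Laplacian: median location, mean absolute deviation scale."""
    mu = median(xs)
    b = sum(abs(x - mu) for x in xs) / len(xs)
    return mu, b

# Made-up scalar observations, purely for illustration.
mu, b = laplacian_ml([1.0, 2.0, 4.0, 7.0, 11.0])
print(mu, b)
```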
Dirichlet
[Figure (from Wikipedia): log of the density as we change α from α = (0.3, 0.3, 0.3) to (2.0, 2.0, 2.0), keeping all the individual α_i's equal to each other]
[Figure (from Wikipedia): K = 3. Clockwise from top left: α = (6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4)]
$$P(X) = D(X; \alpha) = \frac{\Gamma\left( \sum_i \alpha_i \right)}{\prod_i \Gamma(\alpha_i)} \prod_i x_i^{\alpha_i - 1}$$
• Parameters are the $\alpha_i$s
  – Determine mode and curvature
• Defined only for probability vectors
  – X = [x_1 x_2 .. x_K], $\sum_i x_i = 1$, $x_i \geq 0$ for all i
Maximum Likelihood: Dirichlet
• Given a collection of observations (X_1, X_2, …), estimate $\alpha$
  $$\log P(X_1, X_2, \ldots) = \sum_j \sum_i (\alpha_i - 1) \log(X_{j,i}) + N \log \Gamma\left( \sum_i \alpha_i \right) - N \sum_i \log \Gamma(\alpha_i)$$
• No closed form solution for the $\alpha$s
  – Needs gradient ascent
• Several distributions have this property: the ML estimates of their parameters have no closed form solution
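For reference, the gradient used by such an ascent procedure follows directly from the Dirichlet log-likelihood. The sketch below assumes N observations $X_j$ with components $X_{j,i}$; the digamma function $\psi$ and the step size $\eta$ are notation introduced here, not taken from the slide.

```latex
% Log-likelihood of N observations under a Dirichlet with parameters \alpha
\log P(X_1, X_2, \ldots) = \sum_j \sum_i (\alpha_i - 1) \log X_{j,i}
  + N \log \Gamma\Bigl( \sum_i \alpha_i \Bigr) - N \sum_i \log \Gamma(\alpha_i)

% Gradient with respect to one parameter \alpha_k, where \psi = \Gamma' / \Gamma
\frac{\partial \log P}{\partial \alpha_k} = \sum_j \log X_{j,k}
  + N \psi\Bigl( \sum_i \alpha_i \Bigr) - N \psi(\alpha_k)

% One gradient ascent step (\alpha_k must be kept positive)
\alpha_k \leftarrow \alpha_k + \eta \, \frac{\partial \log P}{\partial \alpha_k}
```

Setting the gradient to zero gives equations in $\psi$ that cannot be solved for $\alpha$ in closed form, which is why an iterative method is needed.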
Continuing the Thought Experiment
6 3 1 5 4 1 2 4 …
4 4 1 6 3 2 1 2 …
• Two persons shoot loaded dice repeatedly
  – The dice are differently loaded for the two of them
• We observe the series of outcomes for both persons
• How to determine the probability distributions of the two dice?
Estimating Probabilities
6 4 5 1 2 3 4 5 2 2 1 4 3 4 6 2 1 6 …
• Observation: The sequence of numbers from the two dice
  – As indicated by the colors, we know who rolled what number
• Segregation: Separate the blue observations from the red
  – Collection of “blue” numbers: 4 1 3 5 2 4 4 2 6 ..
  – Collection of “red” numbers: 6 5 2 4 2 1 3 6 1 ..
Estimating Probabilities
6 4 5 1 2 3 4 5 2 2 1 4 3 4 6 2 1 6 …
• Observation: The sequence of numbers from the two dice
  – As indicated by the colors, we know who rolled what number
• Segregation: Separate the blue observations from the red
  – 4 1 3 5 2 4 4 2 6 ..
  – 6 5 2 4 2 1 3 6 1 ..
• From each set compute probabilities for each of the 6 possible outcomes
  $$P(\text{number}) = \frac{\text{no. of times number was rolled}}{\text{total number of observed rolls}}$$
[Figure: two bar charts of the estimated P(number) for numbers 1 to 6, one per die]
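When the color of every roll is known, the procedure is segregation followed by per-die counting. The Python sketch below illustrates it on the two short collections listed on the slides. The function name `estimate_per_die` and the `(color, number)` pair representation are assumptions made for this example, and nine rolls per die are far too few for a trustworthy estimate.

```python
from collections import Counter, defaultdict

def estimate_per_die(labelled_rolls, faces=range(1, 7)):
    """Segregate rolls by color, then estimate each die's probabilities by counting."""
    groups = defaultdict(list)
    for color, number in labelled_rolls:
        groups[color].append(number)           # segregation step
    estimates = {}
    for color, numbers in groups.items():
        counts = Counter(numbers)
        estimates[color] = {f: counts[f] / len(numbers) for f in faces}
    return estimates

# The two segregated collections from the slide, re-labelled as (color, number) pairs.
blue = [4, 1, 3, 5, 2, 4, 4, 2, 6]
red = [6, 5, 2, 4, 2, 1, 3, 6, 1]
labelled = [("blue", n) for n in blue] + [("red", n) for n in red]

estimates = estimate_per_die(labelled)
print(estimates["blue"])
print(estimates["red"])
```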
A Thought Experiment
Called out: 6 4 1 5 3 2 2 2 …
Rolled by the two shooters: 6 3 1 5 4 1 2 4 … and 4 4 1 6 3 2 1 2 …
• Now imagine that you cannot observe the dice yourself
• Instead there is a “caller” who randomly calls out the outcomes
  – 40% of the time he calls out the number from the left shooter, and 60% of the time, the one from the right (and you know this)
• At any time, you do not know which of the two he is calling out
• How do you determine the probability distributions for the two dice?
A Thought Experiment
Called out: 6 4 1 5 3 2 2 2 …
Rolled by the two shooters: 6 3 1 5 4 1 2 4 … and 4 4 1 6 3 2 1 2 …
• How do you now determine the probability distributions for the two sets of dice …
• … if you do not even know what fraction of the time the blue numbers are called, and what fraction are red?
A Mixture Multinomial
• The caller will call out a number X in any given callout IF
  – He selects “RED”, and the Red die rolls the number X
  – OR
  – He selects “BLUE” and the Blue die rolls the number X
• P(X) = P(Red)P(X|Red) + P(Blue)P(X|Blue)
  – E.g. P(6) = P(Red)P(6|Red) + P(Blue)P(6|Blue)
• A distribution that combines (or mixes) multiple multinomials is a mixture multinomial
  $$P(X) = \sum_Z P(Z) P(X|Z)$$
  – Mixture weights: P(Z)
  – Component multinomials: P(X|Z)
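Evaluating a mixture multinomial is a weighted sum over the components. In the sketch below every number is hypothetical: the mixture weights and the two loaded-die distributions are made up for illustration, and `mixture_prob` is a name chosen for this example.

```python
def mixture_prob(x, weights, components):
    """P(X = x) = sum over Z of P(Z) * P(x | Z)."""
    return sum(weights[z] * components[z][x] for z in weights)

# Hypothetical mixture weights P(Z) and component multinomials P(X | Z).
weights = {"red": 0.4, "blue": 0.6}
components = {
    "red":  {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
    "blue": {1: 0.3, 2: 0.3, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1},
}

# P(6) = P(Red) P(6|Red) + P(Blue) P(6|Blue)
p6 = mixture_prob(6, weights, components)
total = sum(mixture_prob(x, weights, components) for x in range(1, 7))
print(p6, total)
```

Because the weights sum to 1 and each component sums to 1, the mixture also sums to 1 over the six faces.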
Mixture Distributions
General form (mixture weights P(Z), component distributions P(X|Z)):
$$P(X) = \sum_Z P(Z) P(X|Z)$$
Mixture Gaussian:
$$P(X) = \sum_Z P(Z) N(X; \mu_z, \Theta_z)$$
Mixture of Gaussians and Laplacians:
$$P(X) = \sum_Z P(Z) N(X; \mu_z, \Theta_z) + \sum_Z P(Z) \prod_i L(X_i; \mu_{z,i}, b_{z,i})$$
• Mixture distributions mix several component distributions
  – Component distributions may be of varied type
• Mixing weights must sum to 1.0
• Component distributions integrate to 1.0
• Mixture distribution integrates to 1.0
Maximum Likelihood Estimation
• For our problem:
  $$P(X) = \sum_Z P(Z) P(X|Z)$$
  – Z = color of dice
  $$P(n_1, n_2, n_3, n_4, n_5, n_6) = \mathrm{Const} \prod_X P(X)^{n_X} = \mathrm{Const} \prod_X \left( \sum_Z P(Z) P(X|Z) \right)^{n_X}$$
• Maximum likelihood solution: Maximize
  $$\log(P(n_1, n_2, n_3, n_4, n_5, n_6)) = \log(\mathrm{Const}) + \sum_X n_X \log\left( \sum_Z P(Z) P(X|Z) \right)$$
• No closed form solution (summation inside log)!
  – In general ML estimates for mixtures do not have a closed form
  – USE EM!
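To make the objective concrete, the sketch below evaluates the mixture log-likelihood (without the constant term) for a candidate set of parameters. It only computes the quantity to be maximized and does not perform the maximization. The function name `mixture_log_likelihood` and the candidate parameters (two fair dice with weights 0.4 and 0.6) are assumptions for illustration; the counts come from the eight called-out numbers 6 4 1 5 3 2 2 2.

```python
import math

def mixture_log_likelihood(counts, weights, components):
    """sum over X of n_X * log( sum over Z of P(Z) P(X|Z) ), constant omitted."""
    return sum(
        n * math.log(sum(weights[z] * components[z][x] for z in weights))
        for x, n in counts.items()
    )

# Counts n_X of each face among the called-out numbers 6 4 1 5 3 2 2 2.
counts = {1: 1, 2: 3, 3: 1, 4: 1, 5: 1, 6: 1}

# Hypothetical candidate parameters: both dice fair, weights 0.4 / 0.6.
fair = {face: 1.0 / 6.0 for face in range(1, 7)}
weights = {"red": 0.4, "blue": 0.6}
components = {"red": fair, "blue": fair}

ll = mixture_log_likelihood(counts, weights, components)
print(ll)
```

The inner sum over Z sits inside the logarithm, so the terms for the individual components cannot be separated; that is the obstacle to a closed-form solution noted above.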