MLE part 2

Gaussian Mixture Model. Suppose data is drawn from $k$ Gaussians, meaning
$$Y \sim \mathrm{Discrete}(\pi_1, \ldots, \pi_k), \qquad X = x \mid Y = j \sim \mathcal{N}(\mu_j, \Sigma_j),$$
and the parameters are $\theta = ((\pi_1, \mu_1, \Sigma_1), \ldots, (\pi_k, \mu_k, \Sigma_k))$.
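To make the generative process concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) that samples from such a mixture; the parameter values and the helper name sample_gmm are placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a k = 2 mixture in d = 2 dimensions (placeholders).
pi = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [5.0, 5.0]])
Sigmas = np.array([np.eye(2), 2.0 * np.eye(2)])

def sample_gmm(n):
    # Draw Y ~ Discrete(pi_1, ..., pi_k), then X | Y = j ~ N(mu_j, Sigma_j).
    ys = rng.choice(len(pi), size=n, p=pi)
    xs = np.stack([rng.multivariate_normal(mus[j], Sigmas[j]) for j in ys])
    return xs, ys

X, Y = sample_gmm(200)
```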


Latent variable models. Since $1 = \sum_{j=1}^k p_\theta(y_i = j \mid x_i)$ and $p_\theta(y_i = j \mid x_i) = p_\theta(y_i = j, x_i) / p_\theta(x_i)$, then
$$L(\theta) = \sum_{i=1}^n \ln p_\theta(x_i) = \sum_{i=1}^n 1 \cdot \ln p_\theta(x_i) = \sum_{i=1}^n \sum_{j=1}^k p_\theta(y_i = j \mid x_i) \ln p_\theta(x_i) = \sum_{i=1}^n \sum_{j=1}^k p_\theta(y_i = j \mid x_i) \ln \frac{p_\theta(x_i, y_i = j)}{p_\theta(y_i = j \mid x_i)}.$$
Therefore: define the augmented likelihood
$$L(\theta; R) := \sum_{i=1}^n \sum_{j=1}^k R_{ij} \ln \frac{p_\theta(x_i, y_i = j)}{R_{ij}};$$
note that $R_{ij} := p_\theta(y_i = j \mid x_i)$ implies $L(\theta; R) = L(\theta)$.

E-M method for latent variable models. Define the augmented likelihood
$$L(\theta; R) := \sum_{i=1}^n \sum_{j=1}^k R_{ij} \ln \frac{p_\theta(x_i, y_i = j)}{R_{ij}},$$
with responsibility matrix $R \in \mathcal{R}_{n,k} := \{R \in [0,1]^{n \times k} : R \mathbf{1}_k = \mathbf{1}_n\}$. Alternate two steps:
◮ E-step: set $(R_t)_{ij} := p_{\theta_{t-1}}(y_i = j \mid x_i)$.
◮ M-step: set $\theta_t = \arg\max_{\theta \in \Theta} L(\theta; R_t)$.
Soon: we'll see this gives nondecreasing likelihood!
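As a minimal sketch of this alternation (my own illustration, not from the slides), the E and M steps can be passed in as callables; the helper name em and its signature are assumptions.

```python
def em(X, theta0, e_step, m_step, iters=100):
    # Generic E-M alternation for a latent variable model.
    # e_step(X, theta) returns the responsibility matrix R with rows summing to 1;
    # m_step(X, R) returns argmax_theta of the augmented likelihood L(theta; R).
    theta = theta0
    for _ in range(iters):
        R = e_step(X, theta)   # (R_t)_{ij} := p_{theta_{t-1}}(y_i = j | x_i)
        theta = m_step(X, R)   # theta_t := argmax_{theta} L(theta; R_t)
    return theta
```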

E-M for Gaussian mixtures. Initialization: a standard choice is $\pi_j = 1/k$, $\Sigma_j = I$, and $(\mu_j)_{j=1}^k$ given by $k$-means.
◮ E-step: set $R_{ij} = p_\theta(y_i = j \mid x_i)$, meaning
$$R_{ij} = p_\theta(y_i = j \mid x_i) = \frac{p_\theta(y_i = j, x_i)}{p_\theta(x_i)} = \frac{\pi_j \, p_{\mu_j, \Sigma_j}(x_i)}{\sum_{l=1}^k \pi_l \, p_{\mu_l, \Sigma_l}(x_i)}.$$
◮ M-step: solve $\arg\max_{\theta \in \Theta} L(\theta; R)$, meaning
$$\pi_j := \frac{\sum_{i=1}^n R_{ij}}{\sum_{i=1}^n \sum_{l=1}^k R_{il}} = \frac{\sum_{i=1}^n R_{ij}}{n}, \qquad \mu_j := \frac{\sum_{i=1}^n R_{ij} x_i}{\sum_{i=1}^n R_{ij}} = \frac{\sum_{i=1}^n R_{ij} x_i}{n \pi_j}, \qquad \Sigma_j := \frac{\sum_{i=1}^n R_{ij} (x_i - \mu_j)(x_i - \mu_j)^{\mathsf{T}}}{n \pi_j}.$$
(These are as before.)
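Below is a hedged NumPy/SciPy sketch of these two steps, matching the formulas above; the parameter packing theta = (pi, mus, Sigmas) and the use of scipy.stats.multivariate_normal for the Gaussian density are my own choices, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, theta):
    # R[i, j] proportional to pi_j * N(x_i; mu_j, Sigma_j), normalized over j.
    pi, mus, Sigmas = theta
    R = np.column_stack([
        pi[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
        for j in range(len(pi))
    ])
    return R / R.sum(axis=1, keepdims=True)

def m_step(X, R):
    # Weighted counts, means, and covariances, matching the M-step formulas above.
    n, d = X.shape
    Nj = R.sum(axis=0)                     # sum_i R_ij  (= n * pi_j)
    pi = Nj / n
    mus = (R.T @ X) / Nj[:, None]          # sum_i R_ij x_i / (n pi_j)
    Sigmas = []
    for j in range(R.shape[1]):
        diff = X - mus[j]
        Sigmas.append((R[:, j, None] * diff).T @ diff / Nj[j])
    return pi, mus, np.array(Sigmas)
```

Plugged into the generic em skeleton above, one full iteration is theta = m_step(X, e_step(X, theta)).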

Demo: spherical clusters. [Figure: successive E-M iterations fitting a Gaussian mixture to spherical clusters.] (Initialized with k-means, thus not so dramatic.)

Demo: elliptical clusters. [Figure: alternating E and M steps fitting a Gaussian mixture to elliptical clusters over many iterations.]

Theorem. Suppose $(R_0, \theta_0) \in \mathcal{R}_{n,k} \times \Theta$ is arbitrary, and thereafter $(R_t, \theta_t)$ is given by E-M:
$$(R_t)_{ij} := p_{\theta_{t-1}}(y_i = j \mid x_i) \quad \text{and} \quad \theta_t := \arg\max_{\theta \in \Theta} L(\theta; R_t).$$
Then
$$L(\theta_t; R_t) \le \max_{R \in \mathcal{R}_{n,k}} L(\theta_t; R) = L(\theta_t; R_{t+1}) = L(\theta_t) \le L(\theta_{t+1}; R_{t+1}).$$
In particular, $L(\theta_t) \le L(\theta_{t+1})$.
Remarks.
◮ We proved a similar guarantee for $k$-means, which is also an alternating minimization scheme.
◮ Similarly, MLE for Gaussian mixtures is NP-hard; it is also known to need exponentially many samples in $k$ to information-theoretically recover the parameters.

Proof. We've already shown:
◮ $L(\theta_t; R_{t+1}) = L(\theta_t)$;
◮ $L(\theta_t; R_{t+1}) \le \max_{\theta \in \Theta} L(\theta; R_{t+1}) = L(\theta_{t+1}; R_{t+1})$ by definition of $\theta_{t+1}$.
We still need to show: $\max_{R \in \mathcal{R}_{n,k}} L(\theta_t; R) = L(\theta_t; R_{t+1})$. We'll give two proofs.
By concavity of $\ln$ ("Jensen's inequality" in the convexity lectures), for any $R \in \mathcal{R}_{n,k}$,
$$L(\theta_t; R) = \sum_{i=1}^n \sum_{j=1}^k R_{ij} \ln \frac{p_{\theta_t}(x_i, y_i = j)}{R_{ij}} \le \sum_{i=1}^n \ln\left( \sum_{j=1}^k R_{ij} \frac{p_{\theta_t}(x_i, y_i = j)}{R_{ij}} \right) = \sum_{i=1}^n \ln p_{\theta_t}(x_i) = L(\theta_t) = L(\theta_t; R_{t+1}).$$
Since $R$ was arbitrary, $\max_{R \in \mathcal{R}_{n,k}} L(\theta_t; R) = L(\theta_t; R_{t+1})$.

Proof (continued). Here's a second proof of that missing fact. To evaluate $\arg\max_{R \in \mathcal{R}_{n,k}} L(\theta; R)$, consider the Lagrangian
$$\sum_{i=1}^n \left[ \sum_{j=1}^k R_{ij} \ln p_\theta(x_i, y_i = j) - \sum_{j=1}^k R_{ij} \ln R_{ij} + \lambda_i \left( \sum_{j=1}^k R_{ij} - 1 \right) \right].$$
Fixing $i$ and taking the gradient with respect to $R_{ij}$ for any $j$,
$$0 = \ln p_\theta(x_i, y_i = j) - \ln R_{ij} - 1 + \lambda_i,$$
giving $R_{ij} = p_\theta(x_i, y_i = j) \exp(\lambda_i - 1)$. Since moreover
$$1 = \sum_j R_{ij} = \exp(\lambda_i - 1) \sum_j p_\theta(x_i, y_i = j) = \exp(\lambda_i - 1) \, p_\theta(x_i),$$
it follows that $\exp(\lambda_i - 1) = 1/p_\theta(x_i)$, and the optimal $R$ satisfies $R_{ij} = p_\theta(x_i, y_i = j)/p_\theta(x_i) = p_\theta(y_i = j \mid x_i)$. $\square$
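As a numerical sanity check of the guarantee $L(\theta_t) \le L(\theta_{t+1})$, one can track the log-likelihood across iterations and confirm it never decreases; this sketch reuses the hypothetical e_step/m_step helpers from the earlier GMM sketch and allows a small floating-point tolerance.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, theta):
    # L(theta) = sum_i ln( sum_j pi_j * N(x_i; mu_j, Sigma_j) )
    pi, mus, Sigmas = theta
    dens = np.column_stack([
        pi[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
        for j in range(len(pi))
    ])
    return np.log(dens.sum(axis=1)).sum()

def check_monotone(X, theta, iters=20):
    # Run E-M and verify L(theta_t) <= L(theta_{t+1}) up to floating-point slack.
    lls = [log_likelihood(X, theta)]
    for _ in range(iters):
        theta = m_step(X, e_step(X, theta))   # helpers from the earlier sketch
        lls.append(log_likelihood(X, theta))
    assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
    return lls
```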

Related issues.

Parameter constraints. E-M for GMMs still works if we freeze or constrain some parameters. Examples:
◮ No weights: initialize $\pi = (1/k, \ldots, 1/k)$ and never update it.
◮ Diagonal covariance matrices: update everything as before, except $\Sigma_j := \mathrm{diag}((\sigma_j)_1^2, \ldots, (\sigma_j)_d^2)$, where
$$(\sigma_j)_l^2 := \frac{\sum_{i=1}^n R_{ij} (x_i - \mu_j)_l^2}{n \pi_j};$$
that is, we use coordinate-wise sample variances weighted by $R$.
Why is this a good idea? Computation (of the inverse), sample complexity, . . .
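Here is a hedged sketch of the diagonal-covariance M-step described above, again following the conventions of the earlier GMM sketch (the helper name m_step_diagonal is my own); only the covariance update changes, using R-weighted coordinate-wise variances.

```python
import numpy as np

def m_step_diagonal(X, R):
    # Same as the full-covariance M-step, except Sigma_j = diag((sigma_j)_1^2, ..., (sigma_j)_d^2)
    # with R-weighted coordinate-wise variances.
    n, d = X.shape
    Nj = R.sum(axis=0)                     # sum_i R_ij  (= n * pi_j)
    pi = Nj / n
    mus = (R.T @ X) / Nj[:, None]
    Sigmas = []
    for j in range(R.shape[1]):
        diff = X - mus[j]
        var_j = (R[:, j, None] * diff**2).sum(axis=0) / Nj[j]
        Sigmas.append(np.diag(var_j))
    return pi, mus, np.array(Sigmas)
```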

Gaussian Mixture Model with diagonal covariances. [Figure: successive E-M iterations fitting a diagonal-covariance Gaussian mixture to the same data.]
