
Final exam review
CS 446, selected lecture slides

Hoeffding's inequality

Theorem (Hoeffding's inequality). Given IID Z_i ∈ [a, b],
    Pr[ (1/n) Σ_{i=1}^n Z_i ≥ E Z_1 + ε ] ≤ exp( -2 n ε² / (b - a)² ).
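A quick numerical sanity check (ours, not from the slides): simulate bounded IID variables and compare the empirical one-sided deviation frequency of the sample mean against the Hoeffding bound. All constants (n, ε, the Uniform[0, 1] choice) are arbitrary illustration values.

import numpy as np

rng = np.random.default_rng(0)
n, trials, eps = 100, 10000, 0.1          # arbitrary choices for illustration
# Z_i ~ Uniform[0, 1], so [a, b] = [0, 1] and E Z_1 = 0.5.
Z = rng.uniform(0.0, 1.0, size=(trials, n))
deviations = Z.mean(axis=1) - 0.5

# One-sided empirical frequency vs. the Hoeffding bound exp(-2 n eps^2 / (b - a)^2).
empirical = np.mean(deviations >= eps)
bound = np.exp(-2 * n * eps**2 / (1.0 - 0.0)**2)
print(f"empirical Pr[mean - E >= {eps}]: {empirical:.4f}, Hoeffding bound: {bound:.4f}")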


Rademacher complexity examples

Definition. Given examples (x_1, ..., x_n) and functions F,
    Rad(F) = E_ε [ max_{f ∈ F} (1/n) Σ_{i=1}^n ε_i f(x_i) ],
where (ε_1, ..., ε_n) are IID Rademacher random variables (Pr[ε_i = +1] = Pr[ε_i = -1] = 1/2).

Examples.
◮ If ‖x‖ ≤ R, then Rad({ x ↦ x^T w : ‖w‖ ≤ W }) ≤ RW/√n. For SVM, we can set W = √(2/λ).
◮ For deep networks, we have Rad(F) ≤ Lipschitz · Junk/√n; still very loose.
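The empirical Rademacher complexity of the bounded linear class can be estimated by Monte Carlo, since for { x ↦ x^T w : ‖w‖ ≤ W } the inner maximum has the closed form (W/n)‖Σ_i ε_i x_i‖. A minimal numpy sketch (ours; all sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, d, W = 200, 5, 1.0                           # arbitrary illustration sizes
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # force ||x_i|| = 1, so R = 1
R = 1.0

# For {x -> <w, x> : ||w|| <= W}, the sup over w has a closed form:
#   max_{||w|| <= W} (1/n) sum_i eps_i <w, x_i> = (W/n) * || sum_i eps_i x_i ||.
trials = 2000
eps = rng.choice([-1.0, 1.0], size=(trials, n))
sup_values = W * np.linalg.norm(eps @ X, axis=1) / n
print("Monte Carlo Rad estimate:", sup_values.mean())
print("Upper bound R*W/sqrt(n): ", R * W / np.sqrt(n))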

Unsupervised learning

Now we only receive (x_i)_{i=1}^n, and the goal is...?
◮ Encoding data in some compact representation (and decoding this).
◮ Data analysis; recovering "hidden structure" in data (e.g., recovering cliques or clusters).
◮ Features for supervised learning.
◮ ...?

The task is less clear-cut. In 2019 we still have people trying to formalize it!

SVD reminder

1. SV triples: (s, u, v) satisfies Mv = s u and M^T u = s v.
2. Thin decomposition SVD: M = Σ_{i=1}^r s_i u_i v_i^T.
3. Full factorization SVD: M = U S V^T.
4. "Operational" view of SVD: for M ∈ R^{n×d},
     M = [ u_1 ... u_r  u_{r+1} ... u_n ] · diag(s_1, ..., s_r, 0, ..., 0) · [ v_1 ... v_r  v_{r+1} ... v_d ]^T,
   i.e. U collects the left singular vectors, V the right singular vectors, and the middle n×d
   matrix carries s_1, ..., s_r on its diagonal and zeros elsewhere.

The first parts of U, V span the column / row space (respectively); the second parts span the
left / right nullspaces (respectively).

New: let (U_k, S_k, V_k) denote the truncated SVD, with U_k ∈ R^{n×k} (the first k columns of U),
and similarly for the others.
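A small numpy illustration (ours) of the views above: full factorization, thin decomposition as a sum of rank-one terms, SV triples, and the truncated SVD. The matrix and k are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 4, 2
M = rng.normal(size=(n, d))

# Full factorization: U is n x n, V is d x d, s holds the min(n, d) singular values.
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# Thin decomposition: M = sum_i s_i u_i v_i^T using only the first r = rank terms.
r = int(np.sum(s > 1e-12))
M_thin = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
print("thin SVD reconstructs M:", np.allclose(M, M_thin))

# SV triple check for the top triple: M v = s u and M^T u = s v.
u, v = U[:, 0], Vt[0, :]
print("M v = s u:", np.allclose(M @ v, s[0] * u), " M^T u = s v:", np.allclose(M.T @ u, s[0] * v))

# Truncated SVD: keep the first k columns / singular values.
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
M_k = U_k @ S_k @ V_k.T   # rank-k approximation of M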

PCA properties

Theorem. Let X ∈ R^{n×d} with SVD X = U S V^T and integer k ≤ r be given. Then

    min_{D ∈ R^{k×d}, E ∈ R^{d×k}} ‖X - X E D‖_F²
      = min_{D ∈ R^{d×k}, D^T D = I} ‖X - X D D^T‖_F²
      = ‖X - X V_k V_k^T‖_F²
      = Σ_{i=k+1}^r s_i².

Additionally,

    min_{D ∈ R^{d×k}, D^T D = I} ‖X - X D D^T‖_F²
      = ‖X‖_F² - max_{D ∈ R^{d×k}, D^T D = I} ‖X D‖_F²
      = ‖X‖_F² - ‖X V_k‖_F²
      = ‖X‖_F² - Σ_{i=1}^k s_i².

Remark 1. The SVD is not unique, but Σ_{i=1}^r s_i² is unique.
Remark 2. As written, this is not a convex optimization problem!
Remark 3. The second form is interesting...
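A minimal numerical check (ours, on random data) of the identity ‖X - X V_k V_k^T‖_F² = Σ_{i>k} s_i², and of the second form ‖X‖_F² - ‖X V_k‖_F²:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
k = 3

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k, :].T                                  # top-k right singular vectors, d x k

err = np.linalg.norm(X - X @ V_k @ V_k.T, "fro") ** 2
tail = np.sum(s[k:] ** 2)
print(err, tail, np.isclose(err, tail))            # matches sum_{i>k} s_i^2
print(np.isclose(np.linalg.norm(X, "fro")**2 - np.linalg.norm(X @ V_k, "fro")**2, tail))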

Centered PCA

Some treatments replace X with X - 1μ^T, with mean μ = (1/n) Σ_{i=1}^n x_i.

◮ (1/n) X^T X ∈ R^{d×d} is the data covariance;
◮ (1/n) (XD)^T (XD) is the data covariance after projection;
◮ lastly,
    (1/n) ‖XD‖_F² = (1/n) tr( (XD)^T (XD) ) = (1/n) Σ_{i=1}^k (X D e_i)^T (X D e_i),
  therefore PCA is maximizing the resulting per-coordinate variances!
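A short numpy sketch (ours) of the centered view: after centering, the per-coordinate variances of the projected data X V_k are s_i²/n, and their sum is (1/n)‖X V_k‖_F².

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5)) + 3.0
n = A.shape[0]

X = A - A.mean(axis=0)                      # centered data, X = A - 1 mu^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
D = Vt[:k, :].T                             # top-k principal directions

proj = X @ D                                # projected (still centered) data
per_coord_var = (proj ** 2).mean(axis=0)    # variance of each projected coordinate
print(per_coord_var, s[:k] ** 2 / n)        # these should agree
print(np.isclose(per_coord_var.sum(), np.linalg.norm(proj, "fro") ** 2 / n))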

Lloyd's method revisited

1. Choose initial clusters (S_1, ..., S_k).
2. Repeat until convergence:
   2.1 (Recenter.) Set μ_j := mean(S_j) for j ∈ (1, ..., k).
   2.2 (Reassign.) Update S_j := { x_i : μ(x_i) = μ_j } for j ∈ (1, ..., k).
       ("μ(x_i)" means "center closest to x_i"; break ties arbitrarily.)

Geometric perspective:
◮ Centers define a Voronoi diagram/partition: for each μ_j, define the cell
  V_j := { x ∈ R^d : μ(x) = μ_j } (break ties arbitrarily).
◮ Reassignment leaves the assignment consistent with the Voronoi cells.
◮ Recentering might shift data outside Voronoi cells, except if we've converged!
◮ See http://mjt.cs.illinois.edu/htv/ for an interactive demo.
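A minimal numpy implementation of Lloyd's method as stated above (ours; it initializes with a random partition, one of the "easy choices" mentioned on a later slide, and re-seeds empty clusters with a random point):

import numpy as np

def lloyd(X, k, iters=100, seed=0):
    """Lloyd's method: init clusters, then alternate recenter / reassign until convergence."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(k, size=len(X))           # 1. initial clusters via a random partition
    for _ in range(iters):
        # 2.1 Recenter: mu_j := mean(S_j); empty clusters get re-seeded with a random point.
        centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                            else X[rng.integers(len(X))] for j in range(k)])
        # 2.2 Reassign: S_j := { x_i : closest center of x_i is mu_j }.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                                   # converged: assignment stopped changing
        assign = new_assign
    return centers, assign

X = np.random.default_rng(3).normal(size=(300, 2))
centers, assign = lloyd(X, k=4)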

Does Lloyd's method solve the original problem?

Theorem.
◮ For all t, φ(C_t; A_{t-1}) ≥ φ(C_t; A_t) ≥ φ(C_{t+1}; A_t).
◮ The method terminates.

Proof.
◮ The first property follows from the earlier theorem and the definition of the algorithm:
    φ(C_t; A_t) = φ(C_t; A(C_t)) = min_{A ∈ A} φ(C_t; A) ≤ φ(C_t; A_{t-1}),
    φ(C_{t+1}; A_t) = φ(C(A_t); A_t) = min_{C ∈ C} φ(C; A_t) ≤ φ(C_t; A_t).
◮ The previous property implies the cost is nonincreasing. Combined with the termination
  condition: all but the final partition are visited at most once, and there are finitely many
  partitions of (x_i)_{i=1}^n. □

(That didn't answer the question...)

Seriously: does Lloyd's method solve the original problem?

◮ In practice, Lloyd's method seems to optimize well; in theory, the output can have unboundedly
  poor cost. (Suppose the width is c > 1 and the height is 1.)
◮ In practice, the method takes few iterations; in theory, it can take 2^{Ω(√n)} iterations!
  (Examples of this are painful; but note, the problem is NP-hard, and the convergence proof used
  the number of partitions...)

So: in practice, yes; in theory, we don't know...

Application: vector quantization

Vector quantization with k-means.
◮ Let (x_i)_{i=1}^n be given.
◮ Run k-means to obtain (μ_1, ..., μ_k).
◮ Replace each (x_i)_{i=1}^n with (μ(x_i))_{i=1}^n.

Encoding size reduces from O(nd) to O(kd + n ln(k)).

Examples.
◮ Audio compression.
◮ Image compression.
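A hedged sketch (ours, not the course code) of vector quantization with k-means via scikit-learn, quantizing rows of a data matrix in the spirit of the patch-quantization figures below:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))          # e.g., 1000 flattened 4x4 image patches (made up)

k = 32
km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)

codes = km.predict(X)                    # n integers in {0, ..., k-1}: ~n ln(k) bits
codebook = km.cluster_centers_           # k x d exemplars: O(k d) storage
X_quantized = codebook[codes]            # each x_i replaced by its closest exemplar mu(x_i)

print("mean squared quantization error:", np.mean((X - X_quantized) ** 2))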

[Figure slides: image patch quantization demos on a roughly 500×500-pixel image, comparing
reconstructions at patch width 10 (8, 32, 128, 512, 2048 exemplars), width 25 (8, 32, 128, 256
exemplars), and width 50 (8, 32, 64 exemplars).]

Initialization matters!

◮ Easy choices:
  ◮ k random points from the dataset.
  ◮ A random partition.
◮ Standard choice (theory and practice): "D²-sampling" / kmeans++.
  1. Choose μ_1 uniformly at random from the data.
  2. For j ∈ (2, ..., k): choose x_i with probability ∝ min_{l<j} ‖x_i - μ_l‖_2².
◮ kmeans++ is a randomized furthest-first traversal; regular furthest-first fails with outliers.
◮ scikit-learn and Matlab both default to kmeans++.
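A minimal numpy sketch (ours) of the D²-sampling seeding step; in practice one would simply pass init="k-means++" to scikit-learn's KMeans, as in the earlier vector-quantization example.

import numpy as np

def kmeanspp_init(X, k, seed=0):
    """D^2-sampling: each new center is drawn with probability proportional to the squared
    distance to the nearest center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]            # 1. first center uniformly at random
    for _ in range(2, k + 1):                      # 2. remaining centers
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

X = np.random.default_rng(1).normal(size=(500, 2))
init_centers = kmeanspp_init(X, k=5)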

Maximum likelihood: abstract formulation

We've had one main "meta-algorithm" this semester:
◮ The (regularized) ERM principle: pick the model that minimizes an average loss over training data.

We've also discussed another: the "maximum likelihood estimation (MLE)" principle:
◮ Pick a set of probability models for your data: P := { p_θ : θ ∈ Θ }.
◮ p_θ will denote both densities and masses; the literature is similarly inconsistent.
◮ Given samples (z_i)_{i=1}^n, pick the model that maximizes the likelihood:
    max_{θ ∈ Θ} L(θ) = max_{θ ∈ Θ} ln Π_{i=1}^n p_θ(z_i) = max_{θ ∈ Θ} Σ_{i=1}^n ln p_θ(z_i),
  where the ln(·) is for mathematical convenience, and z_i can be a labeled pair (x_i, y_i) or just x_i.

Example 1: coin flips

◮ We flip a coin of bias θ ∈ [0, 1].
◮ Write x_i = 0 for tails, x_i = 1 for heads; then
    p_θ(x_i) = x_i θ + (1 - x_i)(1 - θ),  or alternatively  p_θ(x_i) = θ^{x_i} (1 - θ)^{1 - x_i}.
  The second form will be more convenient.
◮ Writing H := Σ_i x_i and T := Σ_i (1 - x_i) = n - H for convenience,
    L(θ) = Σ_{i=1}^n [ x_i ln θ + (1 - x_i) ln(1 - θ) ] = H ln θ + T ln(1 - θ).

Differentiating and setting to 0,
    0 = H/θ - T/(1 - θ),
which gives θ = H/(T + H) = H/n.
◮ In this way, we've justified a natural algorithm.
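A tiny numerical check (ours): the closed form H/n matches the maximizer of L(θ) = H ln θ + T ln(1 - θ) found by a grid search.

import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=200)        # flips of a coin with (made-up) true bias 0.3
H, T = x.sum(), len(x) - x.sum()

thetas = np.linspace(1e-6, 1 - 1e-6, 10001)
L = H * np.log(thetas) + T * np.log(1 - thetas)
print("grid maximizer:", thetas[L.argmax()], " closed form H/n:", H / len(x))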

Example 2: mean of a Gaussian

◮ Suppose x_i ∼ N(μ, σ²), so θ = (μ, σ²), and
    ln p_θ(x_i) = ln [ exp( -(x_i - μ)² / (2σ²) ) / √(2πσ²) ] = -(x_i - μ)²/(2σ²) - ln(2πσ²)/2.
◮ Therefore
    L(θ) = -(1/(2σ²)) Σ_{i=1}^n (x_i - μ)² + (terms without μ);
  applying ∇_μ and setting to zero gives μ = (1/n) Σ_i x_i.
◮ A similar derivation gives σ² = (1/n) Σ_i (x_i - μ)².

Example 4: Naive Bayes

◮ Let's try a simple prediction setup, with the (Bayes) optimal classifier
    argmax_{y ∈ Y} p(Y = y | X = x).
  (We haven't discussed this concept a lot, but it's widespread in ML.)
◮ One way to proceed is to learn p(Y | X) exactly; that's a pain.
◮ Let's assume the coordinates of X = (X_1, ..., X_d) are independent given Y:
    p(Y = y | X = x) = p(Y = y, X = x) / p(X = x)
                     = p(X = x | Y = y) p(Y = y) / p(X = x)
                     = p(Y = y) Π_{j=1}^d p(X_j = x_j | Y = y) / p(X = x),
  and
    argmax_{y ∈ Y} p(Y = y | X = x) = argmax_{y ∈ Y} p(Y = y) Π_{j=1}^d p(X_j = x_j | Y = y).

Example 4: Naive Bayes (part 2)

    argmax_{y ∈ Y} p(Y = y | X = x) = argmax_{y ∈ Y} p(Y = y) Π_{j=1}^d p(X_j = x_j | Y = y).

Examples where this helps:
◮ Suppose X ∈ {0, 1}^d has an arbitrary distribution; it's specified with 2^d - 1 numbers, whereas
  the factored form above needs only d numbers. That is: instead of having to learn a probability
  model over 2^d possibilities, we now have to learn d + 1 models, each over 2 possibilities
  (binary labels).
◮ HW5 will use the standard "Iris dataset". This data is continuous; Naive Bayes would approximate
  univariate distributions.
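Since HW5 uses the Iris data, here is a hedged scikit-learn sketch (ours, not the homework's required approach): GaussianNB fits one univariate Gaussian per feature and class, i.e. the factored form above with continuous coordinates.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# GaussianNB models p(X_j | Y = y) as a univariate Gaussian per feature j and class y,
# and predicts argmax_y p(Y = y) * prod_j p(X_j = x_j | Y = y).
clf = GaussianNB().fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))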

Gaussian Mixture Model

◮ Suppose the data is drawn from k Gaussians, meaning
    Y = j ∼ Discrete(π_1, ..., π_k),    X | Y = j ∼ N(μ_j, Σ_j),
  and the parameters are θ = ((π_1, μ_1, Σ_1), ..., (π_k, μ_k, Σ_k)).
  (Note: this is a generative model, and we have a way to sample.)
◮ The probability density (with parameters θ = ((π_j, μ_j, Σ_j))_{j=1}^k) at a given x is
    p_θ(x) = Σ_{j=1}^k p_θ(x | y = j) p_θ(y = j) = Σ_{j=1}^k π_j p_{μ_j, Σ_j}(x),
  and the likelihood problem is
    L(θ) = Σ_{i=1}^n ln Σ_{j=1}^k ( π_j / √((2π)^d |Σ_j|) ) exp( -(1/2) (x_i - μ_j)^T Σ_j^{-1} (x_i - μ_j) ).
  The ln and the exp are no longer next to each other; we can't just take the derivative and set
  the answer to 0.
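Because the model is generative ("we have a way to sample"), here is a minimal numpy sampler for the mixture; the parameters below are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
pis = np.array([0.5, 0.3, 0.2])                        # mixing weights pi_j
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])   # means mu_j
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])])

n = 500
y = rng.choice(len(pis), size=n, p=pis)                # Y ~ Discrete(pi_1, ..., pi_k)
X = np.array([rng.multivariate_normal(mus[j], Sigmas[j]) for j in y])  # X | Y=j ~ N(mu_j, Sigma_j)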

Pearson's crabs

Statistician Karl Pearson wanted to understand the distribution of "forehead breadth to body
length" for 1000 crabs.

[Figure: histogram of the ratio, ranging roughly over 0.58 to 0.68.]

Doesn't look Gaussian!

Pearson fit a mixture of two Gaussians.

Remark. Pearson did not use E-M. For this he invented the "method of moments" and obtained a
solution by hand.

Gaussian mixture likelihood with responsibility matrix R

Let's replace Σ_{i=1}^n ln Σ_{j=1}^k π_j p_{μ_j, Σ_j}(x_i) with
    Σ_{i=1}^n Σ_{j=1}^k R_{ij} ln( π_j p_{μ_j, Σ_j}(x_i) ),
where R ∈ R_{n,k} := { R ∈ [0, 1]^{n×k} : R 1_k = 1_n } is a responsibility matrix.

Holding R fixed and optimizing θ gives
    π_j := Σ_{i=1}^n R_{ij} / Σ_{i=1}^n Σ_{l=1}^k R_{il} = Σ_{i=1}^n R_{ij} / n;
    μ_j := Σ_{i=1}^n R_{ij} x_i / Σ_{i=1}^n R_{ij} = Σ_{i=1}^n R_{ij} x_i / (n π_j);
    Σ_j := Σ_{i=1}^n R_{ij} (x_i - μ_j)(x_i - μ_j)^T / (n π_j).
(Should use the new mean in Σ_j so that all derivatives are 0.)

Generalizing the assignment matrix to GMMs

We introduced an assignment matrix A ∈ {0, 1}^{n×k}:
◮ For each x_i, define μ(x_i) to be a closest center: ‖x_i - μ(x_i)‖ = min_j ‖x_i - μ_j‖.
◮ For each i, set A_{ij} = 1[μ(x_i) = μ_j].
◮ Key property: by this choice,
    φ(C; A) = Σ_{i=1}^n Σ_{j=1}^k A_{ij} ‖x_i - μ_j‖² = Σ_{i=1}^n min_j ‖x_i - μ_j‖² = φ(C);
  therefore we can decrease φ(C) = φ(C; A): first optimize C to get φ(C'; A) ≤ φ(C; A), then set A'
  as above to get φ(C') = φ(C'; A') ≤ φ(C'; A) ≤ φ(C; A) = φ(C).
  In other words: we minimize φ(C) via φ(C; A).

What fulfills the same role for L?

E-M method for latent variable models

Define the augmented likelihood
    L(θ; R) := Σ_{i=1}^n Σ_{j=1}^k R_{ij} ln( p_θ(x_i, y_i = j) / R_{ij} ),
with responsibility matrix R ∈ R_{n,k} := { R ∈ [0, 1]^{n×k} : R 1_k = 1_n }.

Alternate two steps:
◮ E-step: set (R_t)_{ij} := p_{θ_{t-1}}(y_i = j | x_i).
◮ M-step: set θ_t = argmax_{θ ∈ Θ} L(θ; R_t).

Soon: we'll see this gives nondecreasing likelihood!

E-M for Gaussian mixtures

Initialization: a standard choice is π_j = 1/k, Σ_j = I, and (μ_j)_{j=1}^k given by k-means.

◮ E-step: set R_{ij} = p_θ(y_i = j | x_i), meaning
    R_{ij} = p_θ(y_i = j, x_i) / p_θ(x_i) = π_j p_{μ_j, Σ_j}(x_i) / Σ_{l=1}^k π_l p_{μ_l, Σ_l}(x_i).
◮ M-step: solve argmax_{θ ∈ Θ} L(θ; R), meaning
    π_j := Σ_{i=1}^n R_{ij} / Σ_{i=1}^n Σ_{l=1}^k R_{il} = Σ_{i=1}^n R_{ij} / n,
    μ_j := Σ_{i=1}^n R_{ij} x_i / Σ_{i=1}^n R_{ij} = Σ_{i=1}^n R_{ij} x_i / (n π_j),
    Σ_j := Σ_{i=1}^n R_{ij} (x_i - μ_j)(x_i - μ_j)^T / (n π_j).
  (These are as before.)
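A compact numpy sketch (ours) of these E-M updates, with the simple initialization π_j = 1/k, Σ_j = I and random data points as means (the slide suggests k-means means instead); a small ridge is added to each Σ_j for numerical stability.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pis = np.full(k, 1.0 / k)                       # pi_j = 1/k
    mus = X[rng.choice(n, size=k, replace=False)]   # random points (slide: use k-means instead)
    Sigmas = np.array([np.eye(d) for _ in range(k)])
    for _ in range(iters):
        # E-step: R_ij proportional to pi_j * N(x_i; mu_j, Sigma_j), rows normalized to sum to 1.
        R = np.column_stack([pis[j] * multivariate_normal.pdf(X, mus[j], Sigmas[j])
                             for j in range(k)])
        R /= R.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from the slide.
        Nj = R.sum(axis=0)                          # n * pi_j
        pis = Nj / n
        mus = (R.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mus[j]                       # uses the new mean, as the slide notes
            Sigmas[j] = (R[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return pis, mus, Sigmas, R

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, size=(150, 2)), rng.normal([5, 5], 0.5, size=(150, 2))])
pis, mus, Sigmas, R = em_gmm(X, k=2)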

Demo: elliptical clusters

[Figure slides: a sequence of alternating E and M steps (E, M, E, M, ...) of E-M run on a 2-D
dataset with elliptical clusters; each slide shows the current fitted mixture over the data.]
