  1. Probabilistic modeling. Subhransu Maji, CMPSCI 689: Machine Learning. 3 March 2015 / 5 March 2015.

  2. Administrivia. Mini-project 1 is due Thursday, March 05. Turn in a hard copy:
  ‣ In the next class, or
  ‣ In the CS main office reception area by 4:00pm (mention 689 hw)
  Clearly write your name and student ID on the front page.
  Late submissions:
  ‣ At most 48 hours late, at a 50% deduction (by 4:00pm March 07)
  ‣ More than 48 hours late gets zero
  ‣ Submit a pdf via email to the TA: xiaojian@cs.umass.edu

  3. Overview. So far the models and algorithms you have learned about are relatively disconnected. The probabilistic modeling framework unites them: learning can be viewed as statistical inference.
  Two kinds of data models:
  ‣ Generative
  ‣ Conditional
  Two kinds of probability models:
  ‣ Parametric
  ‣ Non-parametric

  4. Classification by density estimation. The data is generated according to a distribution D:
  (x, y) ~ D
  Suppose you had access to D. Then classification becomes simple:
  ŷ(x̂) = argmax_y D(x̂, y)
  This is the Bayes optimal classifier, which achieves the smallest expected loss among all classifiers. The expected loss of a predictor ŷ is
  ε(ŷ) = E_{(x,y)~D}[ ℓ(y, ŷ) ]
  where, for y ∈ {0, 1}, the 0-1 loss is ℓ(y, ŷ) = 1 if y ≠ ŷ and 0 otherwise.
  Unfortunately, we don't have access to the distribution.
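To make the Bayes optimal rule concrete, here is a small sketch (not from the slides) with a made-up joint distribution D over three x values and binary y; the prediction simply picks the label with the larger joint probability.

```python
import numpy as np

# Hypothetical joint distribution D(x, y) over x in {0, 1, 2} and y in {0, 1}.
# Rows index x, columns index y; entries sum to 1.
D = np.array([[0.30, 0.05],
              [0.10, 0.25],
              [0.05, 0.25]])

def bayes_optimal(x):
    """Predict the label with the highest joint (equivalently posterior) probability."""
    return int(np.argmax(D[x]))

# Bayes error: the probability mass of the less likely label at each x.
bayes_error = np.sum(np.min(D, axis=1))
print([bayes_optimal(x) for x in range(3)], bayes_error)  # [0, 1, 1], 0.2
```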

  5. Classification by density estimation. This suggests that one way to learn a classifier is to estimate D.
  Training data: (x_1, y_1) ~ D, (x_2, y_2) ~ D, ..., (x_n, y_n) ~ D
  Estimation: fit a parametric distribution D̂, e.g. a Gaussian N(μ, σ²), by estimating the parameters of the distribution.
  We will assume that each point is independently generated from D:
  ‣ A new point doesn't depend on previous points
  ‣ Commonly referred to as the i.i.d. (independently and identically distributed) assumption
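As a sketch of the estimation step, assuming the one-dimensional Gaussian model N(μ, σ²) mentioned on the slide and synthetic i.i.d. samples (the true mean, standard deviation, and sample size below are made-up values):

```python
import numpy as np

# Estimate the parameters of a Gaussian from i.i.d. samples.
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=1000)   # (x_1, ..., x_n) ~ N(2, 1.5^2)

mu_hat = samples.mean()        # MLE of the mean
sigma2_hat = samples.var()     # MLE of the variance (divides by n, not n - 1)
print(mu_hat, sigma2_hat)      # close to 2.0 and 2.25
```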

  6. Statistical estimation. Coin toss: observed sequence {H, T, H, H}. Let β be the probability of H. What value of β best explains the observed data?
  Maximum likelihood principle (MLE): pick the parameters of the distribution that maximize the likelihood of the observed data.
  Likelihood of the data (using the i.i.d. assumption):
  p_β(data) = p_β(H, T, H, H)
            = p_β(H) p_β(T) p_β(H) p_β(H)
            = β × (1 − β) × β × β
            = β³(1 − β)
  Maximize the likelihood by setting the derivative to zero:
  d p_β(data)/dβ = d[β³(1 − β)]/dβ = 3β²(1 − β) − β³ = 0  ⇒  β = 3/4
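A minimal sketch (not from the slides) that numerically confirms the coin-toss MLE by grid-searching the likelihood over β:

```python
import numpy as np

# Likelihood of the observed sequence {H, T, H, H} as a function of beta.
def likelihood(beta):
    return beta**3 * (1 - beta)

betas = np.linspace(0.0, 1.0, 10001)        # grid over [0, 1]
best = betas[np.argmax(likelihood(betas))]  # grid-search maximizer
print(best)                                 # ~0.75, matching the closed-form MLE beta = 3/4
```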

  7. Log-likelihood. It is convenient to maximize the logarithm of the likelihood instead. Log-likelihood of the observed data:
  log p_β(data) = log p_β(H, T, H, H)
               = log p_β(H) + log p_β(T) + log p_β(H) + log p_β(H)
               = log β + log(1 − β) + log β + log β
               = 3 log β + log(1 − β)
  Maximizing the log-likelihood is equivalent to maximizing the likelihood:
  ‣ Log is a concave, monotonic function
  ‣ Products become sums
  ‣ Numerically stable

  8. Log-likelihood. Log-likelihood of observing H-many heads and T-many tails:
  log p_β(data) = H log β + T log(1 − β)
  Maximizing the log-likelihood by setting the derivative to zero:
  d[H log β + T log(1 − β)]/dβ = H/β − T/(1 − β) = 0  ⇒  β = H/(H + T)
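A quick numerical check of the closed-form result, sketched under the assumption that SciPy is available; the head/tail counts below are made-up example values.

```python
import numpy as np
from scipy.optimize import minimize_scalar

H, T = 7, 3  # hypothetical head/tail counts

# Negative log-likelihood of a Bernoulli(beta) coin given H heads and T tails.
def neg_log_likelihood(beta):
    return -(H * np.log(beta) + T * np.log(1 - beta))

# Maximize the log-likelihood by minimizing its negative on (0, 1).
res = minimize_scalar(neg_log_likelihood, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(res.x, H / (H + T))  # both ~0.7: the numerical maximizer matches H / (H + T)
```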

  9. Rolling a die. Suppose you are rolling a k-sided die with parameters θ_1, θ_2, ..., θ_k, and you observe counts x_1, x_2, ..., x_k (the number of times each face comes up). Log-likelihood of the data:
  log p(data) = Σ_k x_k log θ_k
  Maximizing the log-likelihood by setting the derivative to zero:
  d log p(data)/dθ_k = x_k/θ_k = 0  ⇒  θ_k → ∞
  We need the additional constraint:
  Σ_k θ_k = 1

  10. Lagrange multipliers. Constrained optimization:
  max_{θ_1, θ_2, ..., θ_k} Σ_k x_k log θ_k
  subject to: Σ_k θ_k = 1
  Unconstrained optimization:
  min_λ max_{θ_1, θ_2, ..., θ_k} Σ_k x_k log θ_k + λ(1 − Σ_k θ_k)
  ‣ At optimality: x_k/θ_k = λ  ⇒  θ_k = x_k/λ
  ‣ The constraint Σ_k θ_k = 1 then gives λ = Σ_k x_k, so θ_k = x_k / Σ_k x_k
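A minimal sketch (not from the slides) of the resulting MLE for a k-sided die: per the Lagrange-multiplier derivation, the estimate is just the normalized counts. The face counts are hypothetical.

```python
import numpy as np

counts = np.array([4, 1, 3, 2, 6, 4])   # hypothetical face counts x_1..x_6
theta_mle = counts / counts.sum()       # theta_k = x_k / sum_k x_k

# Sanity check: the estimate sums to 1 and has higher log-likelihood
# sum_k x_k * log(theta_k) than, e.g., the uniform die.
def log_likelihood(theta):
    return np.sum(counts * np.log(theta))

uniform = np.full(6, 1 / 6)
print(theta_mle, log_likelihood(theta_mle) >= log_likelihood(uniform))
```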

  11. Naive Bayes. Consider a binary prediction problem, and let the data be distributed according to a probability distribution:
  p_θ(y, x) = p_θ(y, x_1, x_2, ..., x_D)
  We can factorize this using the chain rule of probability:
  p_θ(y, x) = p_θ(y) p_θ(x_1 | y) p_θ(x_2 | x_1, y) ... p_θ(x_D | x_1, x_2, ..., x_{D−1}, y)
            = p_θ(y) Π_{d=1}^{D} p_θ(x_d | x_1, x_2, ..., x_{d−1}, y)
  Naive Bayes assumption: each feature is conditionally independent of the others given the label,
  p_θ(x_d | x_{d'}, y) = p_θ(x_d | y)  for all d' ≠ d
  E.g., the words "free" and "money" are independent given spam.

  12. Naive Bayes. Under the naive Bayes assumption,
  p_θ(x_d | x_{d'}, y) = p_θ(x_d | y)  for all d' ≠ d,
  we can simplify the joint probability distribution as:
  p_θ(y, x) = p_θ(y) Π_{d=1}^{D} p_θ(x_d | x_1, x_2, ..., x_{d−1}, y)
            = p_θ(y) Π_{d=1}^{D} p_θ(x_d | y)   // a simpler distribution
  At this point we can start parametrizing the distribution.

  13. Naive Bayes: a simple case. Case: binary labels and binary features.
  p_θ(y) = Bernoulli(θ_0)
  p_θ(x_d | y = +1) = Bernoulli(θ_d^+)
  p_θ(x_d | y = −1) = Bernoulli(θ_d^-)
  This gives 1 + 2D parameters in total. Probability of the data:
  p_θ(y, x) = p_θ(y) Π_{d=1}^{D} p_θ(x_d | y)
            = θ_0^[y=+1] (1 − θ_0)^[y=−1]
              × Π_{d=1}^{D} (θ_d^+)^[x_d=1, y=+1] (1 − θ_d^+)^[x_d=0, y=+1]   // label +1
              × Π_{d=1}^{D} (θ_d^-)^[x_d=1, y=−1] (1 − θ_d^-)^[x_d=0, y=−1]   // label −1

  14. Naive Bayes: parameter estimation. Given data, we can estimate the parameters by maximizing the data likelihood. As in the coin toss example, the maximum likelihood estimates are:
  θ̂_0 = Σ_n [y_n = +1] / N                                  // fraction of the data with label +1
  θ̂_d^+ = Σ_n [x_{d,n} = 1, y_n = +1] / Σ_n [y_n = +1]      // fraction of +1 instances with x_d = 1
  θ̂_d^- = Σ_n [x_{d,n} = 1, y_n = −1] / Σ_n [y_n = −1]      // fraction of −1 instances with x_d = 1
  Other cases (a choice of inductive bias):
  ‣ Nominal features: multinomial distribution (like rolling a die)
  ‣ Continuous features: Gaussian distribution
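A minimal sketch (not the course's reference code) of these maximum likelihood estimates for the binary-label, binary-feature model; X and y below are made-up example data.

```python
import numpy as np

# X is an N x D binary feature matrix, y is a length-N vector of +1 / -1 labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5))
y = rng.choice([-1, +1], size=100)

pos, neg = (y == +1), (y == -1)
theta_0 = pos.mean()                 # fraction of examples labeled +1
theta_pos = X[pos].mean(axis=0)      # per-feature estimate of P(x_d = 1 | y = +1)
theta_neg = X[neg].mean(axis=0)      # per-feature estimate of P(x_d = 1 | y = -1)
print(theta_0, theta_pos, theta_neg)
```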

  15. Naive Bayes: prediction. To make predictions, compute the posterior distribution:
  ŷ = argmax_y p_θ(y | x)            // Bayes optimal prediction
    = argmax_y p_θ(y, x) / p_θ(x)    // Bayes rule
    = argmax_y p_θ(y, x)
  For binary labels we can also compute the likelihood ratio:
  LR = p_θ(+1, x) / p_θ(−1, x),   ŷ = +1 if LR ≥ 1, −1 otherwise
  Or the log-likelihood ratio:
  LLR = log p_θ(+1, x) − log p_θ(−1, x),   ŷ = +1 if LLR ≥ 0, −1 otherwise
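A sketch of prediction via the log-likelihood ratio for the Bernoulli model above; the parameter values are hypothetical (in practice they would come from the estimates on the previous slide).

```python
import numpy as np

# Hypothetical, already-estimated parameters for a D = 3 feature model.
theta_0 = 0.4                           # P(y = +1)
theta_pos = np.array([0.8, 0.1, 0.6])   # P(x_d = 1 | y = +1)
theta_neg = np.array([0.2, 0.7, 0.5])   # P(x_d = 1 | y = -1)

def log_likelihood_ratio(x):
    """LLR = log p(+1, x) - log p(-1, x) under the naive Bayes factorization."""
    log_pos = np.log(theta_0) + np.sum(x * np.log(theta_pos) + (1 - x) * np.log(1 - theta_pos))
    log_neg = np.log(1 - theta_0) + np.sum(x * np.log(theta_neg) + (1 - x) * np.log(1 - theta_neg))
    return log_pos - log_neg

x = np.array([1, 0, 1])                 # a new binary feature vector
print(+1 if log_likelihood_ratio(x) >= 0 else -1)
```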

  16. Naive Bayes: decision boundary.
  LLR = log p_θ(+1, x) − log p_θ(−1, x)
      = log( θ_0 Π_{d=1}^{D} (θ_d^+)^[x_d=1] (1 − θ_d^+)^[x_d=0] ) − log( (1 − θ_0) Π_{d=1}^{D} (θ_d^-)^[x_d=1] (1 − θ_d^-)^[x_d=0] )
      = log θ_0 − log(1 − θ_0) + Σ_d [x_d = 1] (log θ_d^+ − log θ_d^-) + Σ_d [x_d = 0] (log(1 − θ_d^+) − log(1 − θ_d^-))
      = log(θ_0 / (1 − θ_0)) + Σ_d [x_d = 1] log(θ_d^+ / θ_d^-) + Σ_d [x_d = 0] log((1 − θ_d^+) / (1 − θ_d^-))
      = log(θ_0 / (1 − θ_0)) + Σ_d x_d log(θ_d^+ / θ_d^-) + Σ_d (1 − x_d) log((1 − θ_d^+) / (1 − θ_d^-))
      = log(θ_0 / (1 − θ_0)) + Σ_d log((1 − θ_d^+) / (1 − θ_d^-)) + Σ_d x_d ( log(θ_d^+ / θ_d^-) − log((1 − θ_d^+) / (1 − θ_d^-)) )
      = w^T x + b
  The naive Bayes classifier has a linear decision boundary!
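A sketch, using the same hypothetical parameter values as above, that reads w and b off the last line of the derivation and checks that w^T x + b equals the directly computed LLR.

```python
import numpy as np

theta_0 = 0.4
theta_pos = np.array([0.8, 0.1, 0.6])   # P(x_d = 1 | y = +1)
theta_neg = np.array([0.2, 0.7, 0.5])   # P(x_d = 1 | y = -1)

# Weights and bias from the final line of the derivation.
w = np.log(theta_pos / theta_neg) - np.log((1 - theta_pos) / (1 - theta_neg))
b = np.log(theta_0 / (1 - theta_0)) + np.sum(np.log((1 - theta_pos) / (1 - theta_neg)))

def llr(x):
    """Direct log-likelihood ratio, for comparison."""
    log_pos = np.log(theta_0) + np.sum(x * np.log(theta_pos) + (1 - x) * np.log(1 - theta_pos))
    log_neg = np.log(1 - theta_0) + np.sum(x * np.log(theta_neg) + (1 - x) * np.log(1 - theta_neg))
    return log_pos - log_neg

x = np.array([1, 0, 1])
print(np.isclose(w @ x + b, llr(x)))    # True: the linear form and the LLR agree
```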

  17. Generative and conditional models. Generative models:
  ‣ Model the joint distribution p(x, y)
  ‣ Use Bayes rule to compute the label posterior
  ‣ Need to make simplifying assumptions (e.g. naive Bayes)
  In most cases we are given x and are only interested in the labels y.
  Conditional models:
  ‣ Model the distribution p(y | x)
  ‣ Save some modeling effort
  ‣ Can assume a simpler parametrization of the distribution p(y | x)
  ‣ Most of the ML we have done so far directly aimed at predicting y from x
