Probabilistic modeling (Subhransu Maji, CMPSCI 689: Machine Learning)



SLIDE 1

Probabilistic modeling

Subhransu Maji
CMPSCI 689: Machine Learning
3 March 2015 and 5 March 2015

SLIDE 2

Administrivia

Mini-project 1 is due Thursday, March 05. Turn in a hard copy:

  • In the next class
  • Or in the CS main office reception area by 4:00pm (mention 689 hw)

Clearly write your name and student ID on the front page. Late submissions:

  • At most 48 hours late at a 50% deduction (by 4:00pm March 07)
  • More than 48 hours late receives zero
  • Submit a PDF via email to the TA: xiaojian@cs.umass.edu

SLIDE 3

Overview

So far, the models and algorithms you have learned about are relatively disconnected. The probabilistic modeling framework unites them: learning can be viewed as statistical inference.

Two kinds of data models:

  • Generative
  • Conditional

Two kinds of probability models:

  • Parametric
  • Non-parametric

SLIDE 4

Classification by density estimation

The data is generated according to a distribution D, i.e. (x, y) ∼ D(x, y) with y ∈ {0, 1}.

  • Suppose you had access to D; then classification becomes simple:
      ŷ(x̂) = arg max_y D(x̂, y)
  • This is the Bayes optimal classifier, which achieves the smallest expected loss among all classifiers:
      ε(ŷ) = E_{(x,y)∼D}[ℓ(y, ŷ)],   where ℓ(y, ŷ) = 1 if y ≠ ŷ and 0 otherwise
  • Unfortunately, we don't have access to the distribution D.
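To make the arg max rule concrete, here is a tiny sketch with a made-up joint distribution D over a single binary feature and binary label (purely illustrative, not from the slides):

```python
# Hypothetical joint distribution D(x, y) over x in {0, 1} and y in {0, 1}
# (made-up numbers, just to illustrate the arg max rule).
D = {(0, 0): 0.35, (0, 1): 0.05,
     (1, 0): 0.10, (1, 1): 0.50}

def bayes_optimal(x):
    """Predict the label y with the highest joint probability D(x, y)."""
    return max((0, 1), key=lambda y: D[(x, y)])

print(bayes_optimal(0), bayes_optimal(1))   # 0 1
```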

SLIDE 5

Classification by density estimation

This suggests that one way to learn a classifier is to estimate D.

  • We will assume that each point is independently generated from D:
      (x1, y1) ∼ D,  (x2, y2) ∼ D,  …,  (xn, yn) ∼ D
  • A new point doesn't depend on previous points.
  • This is commonly referred to as the i.i.d. assumption: independently and identically distributed.

Estimation: training data → parametric distribution D̂ (e.g., a Gaussian N(µ, σ²)). Estimate the parameters of the distribution.

SLIDE 6

Statistical estimation

Coin toss: observed sequence {H, T, H, H}. Let β be the probability of H. What value of β best explains the observed data?

Maximum likelihood principle (MLE): pick parameters of the distribution that maximize the likelihood of the observed data.

  • Likelihood of the (i.i.d.) data:
      p_β(data) = p_β(H, T, H, H) = p_β(H) p_β(T) p_β(H) p_β(H) = β × (1 − β) × β × β = β³(1 − β)
  • Maximize the likelihood:
      d p_β(data)/dβ = d[β³(1 − β)]/dβ = 3β²(1 − β) + β³(−1) = 0   ⟹   β = 3/4

SLIDE 7

Log-likelihood

It is convenient to maximize the logarithm of the likelihood instead.

  • Maximizing the log-likelihood is equivalent to maximizing the likelihood
  • Log is a concave, monotonic function
  • Products become sums
  • Numerically stable

Log-likelihood of the observed data:

  log p_β(data) = log p_β(H, T, H, H)
                = log p_β(H) + log p_β(T) + log p_β(H) + log p_β(H)
                = log β + log(1 − β) + log β + log β
                = 3 log β + log(1 − β)

SLIDE 8

Log-likelihood

Log-likelihood of observing H-many heads and T-many tails:

  log p_β(data) = H log β + T log(1 − β)

  • Maximizing the log-likelihood by setting the derivative to zero:
      d[H log β + T log(1 − β)]/dβ = H/β − T/(1 − β) = 0   ⟹   β = H/(H + T)
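As a sanity check (not from the slides), here is a short sketch comparing the closed-form MLE with a brute-force grid maximization of the log-likelihood:

```python
import numpy as np

def bernoulli_log_likelihood(beta, H, T):
    """Log-likelihood of observing H heads and T tails with P(head) = beta."""
    return H * np.log(beta) + T * np.log(1 - beta)

H, T = 3, 1                                     # the observed sequence {H, T, H, H}
betas = np.linspace(0.001, 0.999, 9999)
beta_grid = betas[np.argmax(bernoulli_log_likelihood(betas, H, T))]
beta_closed = H / (H + T)                       # closed-form MLE from the derivative condition
print(beta_grid, beta_closed)                   # both are approximately 0.75
```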

SLIDE 9

Rolling a die

Suppose you are rolling a k-sided die with parameters θ1, θ2, …, θk. You observe counts x1, x2, …, xk for the sides.

  • Log-likelihood of the data:
      log p(data) = Σ_k xk log θk
  • Maximizing the log-likelihood by setting the derivative to zero fails:
      d log p(data)/dθk = xk/θk = 0   ⟹   θk = ∞
  • We need the additional constraint:
      Σ_k θk = 1

SLIDE 10

Lagrangian multipliers

Constrained optimization:

  max_{θ1, θ2, …, θk}  Σ_k xk log θk   subject to:  Σ_k θk = 1

  • Unconstrained optimization with a Lagrange multiplier λ:
      min_λ max_{θ1, θ2, …, θk}  Σ_k xk log θk + λ (1 − Σ_k θk)
  • At optimality:
      xk/θk = λ   ⟹   θk = xk/λ,   and  λ = Σ_k xk,   so  θk = xk / Σ_k xk
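In code, the constrained MLE is simply the vector of normalized counts; a minimal sketch with hypothetical counts:

```python
import numpy as np

def multinomial_mle(counts):
    """MLE for a k-sided die: theta_k = x_k / sum_k x_k, i.e. the normalized counts."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

print(multinomial_mle([10, 5, 5]))   # [0.5  0.25 0.25]
```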

SLIDE 11

Naive Bayes

Consider the binary prediction problem. Let the data be distributed according to a probability distribution:

  p_θ(y, x) = p_θ(y, x1, x2, …, xD)

  • We can factorize this using the chain rule of probability:
      p_θ(y, x) = p_θ(y) p_θ(x1|y) p_θ(x2|x1, y) … p_θ(xD|x1, x2, …, xD−1, y)
                = p_θ(y) Π_{d=1}^{D} p_θ(xd | x1, x2, …, xd−1, y)
  • Naive Bayes assumption:
      p_θ(xd | xd′, y) = p_θ(xd | y)   for all d′ ≠ d
  • E.g., the words "free" and "money" are independent given spam.

SLIDE 12

Naive Bayes

Naive Bayes assumption:

  p_θ(xd | xd′, y) = p_θ(xd | y)   for all d′ ≠ d

  • We can simplify the joint probability distribution as:
      p_θ(y, x) = p_θ(y) Π_{d=1}^{D} p_θ(xd | x1, x2, …, xd−1, y) = p_θ(y) Π_{d=1}^{D} p_θ(xd | y)   // simpler distribution
  • At this point we can start parametrizing the distribution.

SLIDE 13

Naive Bayes: a simple case

Case: binary labels and binary features.

  p_θ(y) = Bernoulli(θ0)
  p_θ(xd | y = +1) = Bernoulli(θd+)   // label +1
  p_θ(xd | y = −1) = Bernoulli(θd−)   // label -1

  1 + 2D parameters in total.

  • Probability of the data:
      p_θ(y, x) = p_θ(y) Π_{d=1}^{D} p_θ(xd | y)
                = θ0^[y=+1] (1 − θ0)^[y=−1]
                  × Π_{d=1}^{D} (θd+)^[xd=1, y=+1] (1 − θd+)^[xd=0, y=+1]
                  × Π_{d=1}^{D} (θd−)^[xd=1, y=−1] (1 − θd−)^[xd=0, y=−1]

SLIDE 14

Naive Bayes: parameter estimation

Given data we can estimate the parameters by maximizing the data likelihood. Similar to the coin toss example, the maximum likelihood estimates are:

  θ̂0  = Σ_n [yn = +1] / N                               // fraction of the data with label +1
  θ̂d+ = Σ_n [xd,n = 1, yn = +1] / Σ_n [yn = +1]         // fraction of instances with xd = 1 among label +1
  θ̂d− = Σ_n [xd,n = 1, yn = −1] / Σ_n [yn = −1]         // fraction of instances with xd = 1 among label -1

  • Other cases (the choice of distribution is an inductive bias):
  • Nominal features: Multinomial distribution (like rolling a die)
  • Continuous features: Gaussian distribution
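A minimal sketch of these counting estimates (assuming a {0,1}-valued feature matrix X and labels in {−1, +1}; the function and variable names are my own, not from the slides):

```python
import numpy as np

def nb_fit(X, y):
    """ML estimates for binary-feature naive Bayes: X is an N x D {0,1} matrix, y is in {-1,+1}."""
    pos, neg = (y == +1), (y == -1)
    theta0 = pos.mean()                  # fraction of the data with label +1
    theta_pos = X[pos].mean(axis=0)      # per-feature fraction of 1s among the +1 examples
    theta_neg = X[neg].mean(axis=0)      # per-feature fraction of 1s among the -1 examples
    return theta0, theta_pos, theta_neg
```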

SLIDE 15

Naive Bayes: prediction

To make predictions compute the posterior distribution:

  ŷ = arg max_y p_θ(y|x)                  // Bayes optimal prediction
    = arg max_y p_θ(y, x)/p_θ(x)          // Bayes rule
    = arg max_y p_θ(y, x)

  • For binary labels we can also compute the likelihood ratio:
      LR = p_θ(+1, x)/p_θ(−1, x),   ŷ = +1 if LR ≥ 1, −1 otherwise
  • Or the log-likelihood ratio:
      LLR = log p_θ(+1, x) − log p_θ(−1, x),   ŷ = +1 if LLR ≥ 0, −1 otherwise
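Continuing the hypothetical nb_fit sketch above, prediction via the log-likelihood ratio might look like this (in practice the estimates are smoothed so that no θ is exactly 0 or 1, which would make the logs blow up):

```python
def nb_predict(x, theta0, theta_pos, theta_neg):
    """Predict +1 or -1 from the log-likelihood ratio LLR = log p(+1, x) - log p(-1, x)."""
    llr = np.log(theta0) - np.log(1 - theta0)
    llr += np.sum(x * np.log(theta_pos) + (1 - x) * np.log(1 - theta_pos))
    llr -= np.sum(x * np.log(theta_neg) + (1 - x) * np.log(1 - theta_neg))
    return +1 if llr >= 0 else -1
```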
SLIDE 16

Naive Bayes: decision boundary

The naive Bayes classifier has a linear decision boundary!

  LLR = log p_θ(+1, x) − log p_θ(−1, x)
      = log( θ0 Π_{d=1}^{D} (θd+)^[xd=1] (1 − θd+)^[xd=0] ) − log( (1 − θ0) Π_{d=1}^{D} (θd−)^[xd=1] (1 − θd−)^[xd=0] )
      = log θ0 − log(1 − θ0) + Σ_{d=1}^{D} [xd = 1] (log θd+ − log θd−) + Σ_{d=1}^{D} [xd = 0] (log(1 − θd+) − log(1 − θd−))
      = log(θ0/(1 − θ0)) + Σ_{d=1}^{D} [xd = 1] log(θd+/θd−) + Σ_{d=1}^{D} [xd = 0] log((1 − θd+)/(1 − θd−))
      = log(θ0/(1 − θ0)) + Σ_{d=1}^{D} xd log(θd+/θd−) + Σ_{d=1}^{D} (1 − xd) log((1 − θd+)/(1 − θd−))
      = log(θ0/(1 − θ0)) + Σ_{d=1}^{D} xd ( log(θd+/θd−) − log((1 − θd+)/(1 − θd−)) ) + Σ_{d=1}^{D} log((1 − θd+)/(1 − θd−))
      = wᵀx + b
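Continuing the same hypothetical sketch, the weights and bias of this linear rule can be read off directly from the estimated parameters:

```python
def nb_linear(theta0, theta_pos, theta_neg):
    """Read off w and b of the linear rule LLR = w^T x + b implied by the derivation above."""
    w = (np.log(theta_pos) - np.log(theta_neg)
         - np.log(1 - theta_pos) + np.log(1 - theta_neg))
    b = (np.log(theta0) - np.log(1 - theta0)
         + np.sum(np.log(1 - theta_pos) - np.log(1 - theta_neg)))
    return w, b
```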

SLIDE 17

Generative and conditional models

Generative models:

  • Model the joint distribution p(x, y)
  • Use Bayes rule to compute the label posterior
  • Need to make simplifying assumptions (e.g., naive Bayes)

In most cases we are given x and are only interested in the labels y.

Conditional models:

  • Model the distribution p(y | x)
  • Saves some modeling effort
  • Can assume a simpler parametrization of the distribution p(y | x)
  • Most of the ML we did so far directly aimed at predicting y from x

SLIDE 18

Conditional models: regression

Assume that y has a linear relationship with x. Generative story of the dataset:

  • For n = 1 to N,
      ➡ Compute: tn = wᵀxn
      ➡ Compute: εn ∼ N(0, σ²)
      ➡ Compute: yn = tn + εn

This can be written as yn ∼ N(wᵀxn, σ²), and

  p(yn | xn) = 1/(σ√(2π)) exp( −(yn − wᵀxn)²/(2σ²) )

  • The log-likelihood of the dataset is:
      log p(D) = Σ_n −(yn − wᵀxn)²/(2σ²) + constants

Maximizing the log-likelihood is equivalent to minimizing the squared error.
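For concreteness, a minimal sketch (synthetic data with made-up dimensions and noise level, not from the slides) showing that the maximum-likelihood w under this Gaussian-noise story coincides with the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma = 200, 3, 0.5
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + sigma * rng.normal(size=N)       # y_n = w^T x_n + eps_n

w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)     # minimizes sum_n (y_n - w^T x_n)^2
print(w_mle)                                      # close to w_true
```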

SLIDE 19

Conditional models: classification

The sigmoid function:  σ(z) = 1 / (1 + exp(−z))

  • Maps −∞ → 0, ∞ → 1
  • σ(−z) = 1 − σ(z)
  • dσ/dz = σ(z)(1 − σ(z))

Generative story of the data:

  • For n = 1 to N,
      ➡ Compute: tn = σ(wᵀxn)
      ➡ Compute: zn ∼ Bernoulli(tn)
      ➡ Compute: yn = 2zn − 1

SLIDE 20

Conditional models: classification

The log-likelihood of the dataset is (ignoring constants):

  log p(D) = Σ_n [yn = +1] log σ(wᵀxn) + [yn = −1] log(1 − σ(wᵀxn))
           = Σ_n [yn = +1] log σ(wᵀxn) + [yn = −1] log σ(−wᵀxn)
           = Σ_n log σ(yn wᵀxn)
           = Σ_n −log(1 + exp(−yn wᵀxn))
           = Σ_n −ℓ^(log)(yn, wᵀxn)

Maximizing the log-likelihood is equivalent to minimizing the logistic loss. This is also called logistic regression.
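A minimal sketch (not from the slides) of the resulting objective: the logistic loss summed over the data, together with its gradient, which could be handed to any gradient-based optimizer:

```python
import numpy as np

def logistic_nll(w, X, y):
    """Negative log-likelihood (logistic loss): sum_n log(1 + exp(-y_n w^T x_n))."""
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins)))

def logistic_nll_grad(w, X, y):
    """Gradient of the logistic loss with respect to w."""
    margins = y * (X @ w)
    return -(X.T @ (y / (1.0 + np.exp(margins))))
```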

SLIDE 21

Regularization with priors

Coin toss: {H, H, H, H} → the MLE gives β = 1, an overconfident estimate from a small sample.

  • Maximum likelihood estimation (MLE):
      arg max_θ p(D|θ)                                          // likelihood
  • Maximum a-posteriori estimation (MAP):
      arg max_θ p(θ|D) = arg max_θ p(θ, D)/p(D) = arg max_θ p(θ) p(D|θ) / p(D)
      where p(θ) is the prior and p(D) = ∫_θ p(θ, D) dθ is the data probability
  • MAP estimation in log space:
      arg max_θ [log p(θ) + log p(D|θ)]                         // log-prior + log-likelihood

SLIDE 22

Regularization with priors: coin toss

Beta distribution as a prior on β:

  Beta(β; a, b) = c β^(a−1) (1 − β)^(b−1),   with mode (a − 1)/(a + b − 2)

  • Posterior over β given the prior and H-many heads and T-many tails:
      p(β|D) ∝ p(β) p(D|β) ∝ β^(a−1) (1 − β)^(b−1) β^H (1 − β)^T = Beta(β; a + H, b + T)
  • MAP estimate: β_MAP = (a + H − 1)/(a + H + b + T − 2),   compared with β_MLE = H/(H + T)
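As a quick illustration (the prior parameters a = b = 2 are an arbitrary choice, not from the slides), the MAP estimate is just the mode of the Beta posterior:

```python
def beta_bernoulli_map(H, T, a=2, b=2):
    """MAP estimate of the head probability under a Beta(a, b) prior (mode of the posterior)."""
    return (a + H - 1) / (a + H + b + T - 2)

print(beta_bernoulli_map(4, 0))   # {H,H,H,H}: about 0.83 instead of the MLE's 1.0
```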

SLIDE 23

Conjugate priors

If the prior and posterior are in the same family, then the prior is conjugate to the likelihood:

  posterior ∝ prior × likelihood,   i.e.  p(θ|D) ∝ p(θ) p(D|θ)

Beta is conjugate to Bernoulli:

  • Prior: Beta(a, b)   Data: {H, T}   Posterior: Beta(a + H, b + T)
  • Interpretation of the prior: pseudo-counts of a − 1 heads and b − 1 tails

Dirichlet is conjugate to Multinomial:

  • Prior: Dirichlet(θ; a1, a2, …, an) ∝ Π_k θk^(ak − 1)
  • Data: {k1, k2, …, kn} occurrences of each side
  • Posterior: Dirichlet(θ; a1 + k1, a2 + k2, …, an + kn)
  • Interpretation of the prior: pseudo-count of ai − 1 for the i-th side

SLIDE 24

Regularization with priors: regression

Assume that y has a linear relationship with x. Generative story of the dataset:

  • For n = 1 to N,
      ➡ Compute: tn = wᵀxn
      ➡ Compute: εn ∼ N(0, σ²)
      ➡ Compute: yn = tn + εn,   i.e.  yn ∼ N(wᵀxn, σ²)

Assume a Gaussian prior on the weights:

  p(w) = N(0_D, τ²I_D) = c exp( Σ_i −wi²/(2τ²) )

  • MAP estimate of w:
      arg max_w  Σ_i −wi²/(2τ²) + Σ_n −(yn − wᵀxn)²/(2σ²) + constants

MAP is the same as l2-regularized least-squares regression. (A Laplace prior, c exp( Σ_i −|wi|/b ), gives l1 regularization.)
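A minimal sketch of the resulting closed-form MAP solution (the sigma2 and tau2 values are hypothetical): scaling the objective shows it is least squares plus an l2 penalty with weight lambda = sigma² / tau²:

```python
import numpy as np

def map_ridge(X, y, sigma2=1.0, tau2=10.0):
    """MAP weights under a N(0, tau2 I) prior: ridge regression with lam = sigma2 / tau2."""
    lam = sigma2 / tau2
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```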

SLIDE 25

Non-parametric density models

So far we assumed that the probability distribution was parametric:

  • Gaussian distribution, Binomial distribution, etc.
  • This allowed us to estimate the data distribution by estimating the parameters of the probability distribution.

However, the data distribution can be complicated:

  • For example, there might be multiple modes.

Non-parametric density models offer a flexible alternative.

SLIDE 26

Density estimation using histograms

The histogram is the simplest example of a non-parametric density model.

  • The bin size is a hyperparameter of the model.

[Figure: histogram estimates of p(x) with a bin size that is too large and one that is too small]

SLIDE 27

Kernel density estimation

Histograms can be viewed as sums of delta-like functions centered at each point:

  p(x) = (1/N) Σ_{i=1}^{N} K(x − xi)

  • The hyperparameter b controls the width of the delta function:
      K(x − xi) = 1/b if |x − xi| ≤ b/2, and 0 otherwise
  • The function K is called the kernel function.
  • These density estimators are also called Parzen window estimators.
  • Set hyperparameters by cross-validation: the MLE estimate is b = 0, which is clearly wrong (overfitting).
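A minimal sketch (made-up data points, not from the slides) of a Parzen window estimate with the rectangle kernel above:

```python
import numpy as np

def kde(x, data, b=0.5):
    """Parzen window estimate of p(x): average of rectangle kernels of width b centered at each point."""
    data = np.asarray(data)
    inside = np.abs(x - data) <= b / 2.0     # which training points fall inside the window
    return inside.mean() / b                 # (1/N) * sum_i K(x - x_i), with K = 1/b inside the window

print(kde(0.0, [-0.3, -0.1, 0.2, 1.5, 2.0], b=1.0))   # 0.6
```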

SLIDE 28

Kernel density estimation: example

Rectangle kernel:  K(x − xi) = 1/b if |x − xi| ≤ b/2, and 0 otherwise

[Figure: density estimate obtained with the rectangle kernel]
SLIDE 29

Kernel density estimation: example

Gaussian kernel:  K(x − xi) = 1/(σ√(2π)) exp( −(x − xi)²/(2σ²) )

[Figure: density estimate obtained with the Gaussian kernel]

SLIDE 30

Kernel density classifier

Estimate p(x | +1) and p(x | −1) separately, then compute the likelihood ratio: p(+1)p(x | +1) / p(−1)p(x | −1).

  • Predict class +1 if LR > 1.
  • The kNN classifier is a kernel density classifier with the kernel width set to the distance to the k-th nearest neighbor.

[Figure: density estimates with a small and a large kernel width; from Duda et al.]

SLIDE 31

Summary

Probabilistic modeling views learning as statistical inference.

Two ways to estimate the parameters of the distribution:

  • Maximum likelihood: maximize p(D|θ)
  • Maximum a-posteriori: maximize p(θ|D)

Two kinds of data models:

  • Generative: p(y, x)
    ➡ Examples: naive Bayes, kernel density
  • Conditional: p(y | x)
    ➡ Examples: linear and logistic regression

Two kinds of probability models:

  • Parametric: Gaussian, Bernoulli, etc.
    ➡ Learning by MLE and MAP
  • Non-parametric: kernel density estimators
    ➡ Learning by cross-validation

SLIDE 32

Slides credit

Figures of logistic and linear regression are from Wikipedia. The figure of the beta distribution is from Wikipedia. Figures for kernel density estimation are from http://www.mglerner.com/blog/?p=28 (the page has an interactive demo). Parzen window figure: "Pattern Classification", Duda, Hart & Stork. Some slides are based on the CIML book by Hal Daume III.