Machine Learning - MT 2017
10. Classification: Generative Models
SLIDE 1

Machine Learning - MT 2017

10. Classification: Generative Models

Varun Kanade University of Oxford October 30, 2017

SLIDE 2

Recap: Supervised Learning - Regression

Discriminative Model: linear model with Gaussian noise, p(y | w, x) = N(y | w · x, σ²)

◮ Other noise models possible, e.g., Laplace
◮ Non-linearities using basis expansion
◮ Regularisation to avoid overfitting: Ridge, Lasso
◮ (Cross-)validation to choose hyperparameters
◮ Optimisation algorithms for model fitting

[Figure: timeline from Gauss and Legendre (least squares, c. 1800) to 2017 (Ridge, Lasso)]

SLIDE 3

Supervised Learning - Classification

In classification problems, the target/output y is a category: y ∈ {1, 2, . . . , C}. The input is x = (x1, . . . , xD), where each feature xi is either:

◮ Categorical: xi ∈ {1, . . . , K}
◮ Real-valued: xi ∈ R

Discriminative Model: only model the conditional distribution p(y | x, θ)
Generative Model: model the full joint distribution p(x, y | θ)

SLIDE 4

Prediction Using Generative Models

Suppose we have a model p(x, y | θ) for the joint distribution over inputs and outputs. Given a new input xnew, we can write the conditional distribution for y. For c ∈ {1, . . . , C},

p(y = c | xnew, θ) = p(y = c | θ) · p(xnew | y = c, θ) / Σ_{c′=1}^{C} p(y = c′ | θ) · p(xnew | y = c′, θ)

The numerator is simply the joint probability p(xnew, y = c | θ) and the denominator the marginal probability p(xnew | θ). We can pick ŷ = argmax_c p(y = c | xnew, θ).
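The prediction rule above can be sketched in a few lines of Python. The class priors and likelihood values below are made-up numbers purely for illustration, not part of the slides:

```python
def predict(priors, likelihood, x_new):
    """Posterior p(y = c | x_new) and prediction from a generative model.

    priors[c] = p(y = c); likelihood(x, c) = p(x | y = c).
    """
    joint = {c: priors[c] * likelihood(x_new, c) for c in priors}
    marginal = sum(joint.values())                    # p(x_new)
    posterior = {c: j / marginal for c, j in joint.items()}
    y_hat = max(posterior, key=posterior.get)         # argmax_c p(y = c | x_new)
    return posterior, y_hat

# Toy two-class example with hypothetical numbers:
priors = {"clinton": 0.6, "trump": 0.4}

def lik(x, c):
    # Hypothetical class-conditional likelihood values at this particular x:
    return {"clinton": 0.1, "trump": 0.3}[c]

posterior, y_hat = predict(priors, lik, x_new="NY, 100K")
```

The normalising marginal cancels in the argmax, but computing it gives calibrated posterior probabilities as well as the predicted class.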

SLIDE 5

Toy Example

Predict voter preference in US elections.

Voted in 2012?   Annual Income   State   Candidate Choice
Y                50K             OK      Clinton
N                173K            CA      Clinton
Y                80K             NJ      Trump
Y                150K            WA      Clinton
N                25K             WV      Johnson
Y                85K             IL      Clinton
...              ...             ...     ...
Y                1050K           NY      Trump
N                35K             CA      Trump
N                100K            NY      ?

SLIDE 6

Classification : Generative Model

In order to fit a generative model, we’ll express the joint distribution as

p(x, y | θ, π) = p(y | π) · p(x | y, θ)

To model p(y | π), we’ll use parameters πc such that Σ_{c} πc = 1 and p(y = c | π) = πc.

For the class-conditional densities, for each class c = 1, . . . , C, we will have a model p(x | y = c, θc).

SLIDE 7

Classification : Generative Model

So in our example,

p(y = clinton | π) = πclinton
p(y = trump | π) = πtrump
p(y = johnson | π) = πjohnson

Given that a voter supports Trump, p(x | y = trump, θtrump) models the distribution over x. Similarly, we have p(x | y = clinton, θclinton) and p(x | y = johnson, θjohnson). We need to pick a ‘‘model’’ for p(x | y = c, θc) and estimate the parameters πc, θc for c = 1, . . . , C.

SLIDE 8

Naïve Bayes Classifier (NBC)

Assume that the features are conditionally independent given the class label:

p(x | y = c, θc) = ∏_{j=1}^{D} p(xj | y = c, θjc)

So, for example, we are ‘modelling’ that, conditioned on being a Trump supporter, the state, previous voting record and annual income are mutually independent. Clearly, this assumption is ‘‘naïve’’ and never satisfied. But model fitting becomes very easy. Although the generative model is clearly inadequate, it actually works quite well in practice: the goal is predicting the class, not modelling the data!
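Under the naïve assumption, the class-conditional probability of a full input is just a product of per-feature probabilities. A minimal sketch, with per-feature tables invented for illustration:

```python
def class_conditional(x, per_feature_probs):
    """p(x | y = c) = prod_j p(x_j | y = c) under the naive assumption.

    per_feature_probs[j] maps the value of feature j to p(x_j | y = c).
    """
    p = 1.0
    for j, xj in enumerate(x):
        p *= per_feature_probs[j][xj]
    return p

# Hypothetical tables for a single class:
tables = [
    {"Y": 0.7, "N": 0.3},        # voted in 2012?
    {"low": 0.4, "high": 0.6},   # annual income, binned for simplicity
]
p = class_conditional(["Y", "high"], tables)   # 0.7 * 0.6
```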

SLIDE 9

Naïve Bayes Classifier (NBC)

Real-Valued Features

◮ xj is real-valued, e.g., annual income
◮ Example: use a Gaussian model, so θjc = (µjc, σ²jc)
◮ Can use other distributions, e.g., age is probably not Gaussian!

Categorical Features

◮ xj is categorical with values in {1, . . . , K}
◮ Use the multinoulli distribution, i.e., xj = i with probability µjc,i, where Σ_{i=1}^{K} µjc,i = 1
◮ In the special case when xj ∈ {0, 1}, use a single parameter θjc ∈ [0, 1]
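The per-feature factors for the two cases above can be sketched with only the standard library; the parameter values here are illustrative, not fitted:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x, for a real-valued feature."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def bernoulli_pmf(x, theta):
    """p(x) for a binary feature x in {0, 1}, with theta = p(x = 1)."""
    return theta if x == 1 else 1 - theta

# e.g. annual income (in $1000s) for one class, with made-up parameters:
p_income = gaussian_pdf(100.0, mu=90.0, sigma2=400.0)
p_voted = bernoulli_pmf(1, theta=0.65)
```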

SLIDE 10

Naïve Bayes Classifier (NBC)

Assume that all the features are binary, i.e., every xj ∈ {0, 1}. If we have C classes, overall we have only O(C · D) parameters: one θjc for each j = 1, . . . , D and c = 1, . . . , C.

Without the conditional independence assumption:

◮ We have to assign a probability to each of the 2^D combinations
◮ Thus, we have O(C · 2^D) parameters!
◮ The ‘naïve’ assumption breaks the curse of dimensionality and avoids overfitting!
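The parameter counts can be checked directly. A full joint over D binary features needs 2^D − 1 free parameters per class (one per combination, minus normalisation), still O(C · 2^D); the C = 2, D = 10 below are illustrative:

```python
C, D = 2, 10  # illustrative numbers of classes and binary features

naive_params = C * D            # one Bernoulli parameter theta_jc per (feature, class)
full_params = C * (2 ** D - 1)  # one probability per feature combination, minus normalisation

print(naive_params, full_params)  # 20 2046
```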

SLIDE 11

Maximum Likelihood for the NBC

Let us suppose we have data (xi, yi), i = 1, . . . , N, drawn i.i.d. from some joint distribution p(x, y). The probability of a single datapoint is given by:

p(xi, yi | θ, π) = p(yi | π) · p(xi | θ, yi) = ∏_{c=1}^{C} πc^{I(yi = c)} · ∏_{c=1}^{C} ∏_{j=1}^{D} p(xij | θjc)^{I(yi = c)}

Let Nc be the number of datapoints with yi = c, so that Σ_{c=1}^{C} Nc = N.

We write the log-likelihood of the data as:

log p(D | θ, π) = Σ_{c=1}^{C} Nc log πc + Σ_{c=1}^{C} Σ_{j=1}^{D} Σ_{i: yi = c} log p(xij | θjc)

The log-likelihood is easily separated into sums involving different parameters!

SLIDE 12

Maximum Likelihood for the NBC

We have the log-likelihood for the NBC:

log p(D | θ, π) = Σ_{c=1}^{C} Nc log πc + Σ_{c=1}^{C} Σ_{j=1}^{D} Σ_{i: yi = c} log p(xij | θjc)

Let us obtain estimates for π. We get the following optimisation problem:

maximise Σ_{c=1}^{C} Nc log πc   subject to: Σ_{c=1}^{C} πc = 1

This constrained optimisation problem can be solved using the method of Lagrange multipliers.

SLIDE 13

Constrained Optimisation Problem

Suppose f(z) is some function that we want to maximise subject to g(z) = 0.

Constrained objective:
argmax_z f(z), subject to g(z) = 0

Lagrangian (dual) form:
Λ(z, λ) = f(z) + λ g(z)

Any optimal solution to the constrained problem is a stationary point of Λ(z, λ).

SLIDE 14

Constrained Optimisation Problem

Any optimal solution to the constrained problem is a stationary point of Λ(z, λ) = f(z) + λ g(z):

∇z Λ(z, λ) = 0 ⇒ ∇z f = −λ ∇z g
∂Λ(z, λ) / ∂λ = 0 ⇒ g(z) = 0

SLIDE 15

Maximum Likelihood for NBC

Recall that we want to solve:

maximise Σ_{c=1}^{C} Nc log πc   subject to: Σ_{c=1}^{C} πc − 1 = 0

We can write the Lagrangian form:

Λ(π, λ) = Σ_{c=1}^{C} Nc log πc + λ ( Σ_{c=1}^{C} πc − 1 )

We can write the partial derivatives and set them to 0:

∂Λ(π, λ) / ∂πc = Nc / πc + λ = 0
∂Λ(π, λ) / ∂λ = Σ_{c=1}^{C} πc − 1 = 0

SLIDE 16

Maximum Likelihood for NBC

From the first condition, Nc / πc + λ = 0, and so πc = −Nc / λ. Substituting into the second condition,

Σ_{c=1}^{C} πc − 1 = Σ_{c=1}^{C} (−Nc / λ) − 1 = 0

and hence λ = −Σ_{c=1}^{C} Nc = −N. Thus, we get the estimates

πc = Nc / N
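The estimate πc = Nc/N can be sanity-checked numerically: with made-up class counts, it beats nearby feasible perturbations of the objective Σc Nc log πc:

```python
import math

Nc = [30, 50, 20]              # made-up class counts
N = sum(Nc)
pi_mle = [n / N for n in Nc]   # the MLE: pi_c = Nc / N

def objective(pi):
    """The constrained objective: sum_c Nc * log(pi_c)."""
    return sum(n * math.log(p) for n, p in zip(Nc, pi))

# Move a little probability mass between two classes (stays on the simplex):
eps = 0.01
perturbed = [pi_mle[0] + eps, pi_mle[1] - eps, pi_mle[2]]
assert objective(pi_mle) > objective(perturbed)
```

Since the objective is strictly concave on the simplex, the stationary point found by the Lagrangian argument is the unique maximiser.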

SLIDE 17

Maximum Likelihood for the NBC

We have the log-likelihood for the NBC:

log p(D | θ, π) = Σ_{c=1}^{C} Nc log πc + Σ_{c=1}^{C} Σ_{j=1}^{D} Σ_{i: yi = c} log p(xij | θjc)

We obtained the estimates πc = Nc / N.

We can estimate θjc by taking a similar approach. To estimate θjc we only need the jth feature of the examples with yi = c. The estimates depend on the model, e.g., Gaussian, Bernoulli, Multinoulli, etc. Fitting an NBC is very fast!
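For binary features the whole fitting procedure reduces to counting; a minimal end-to-end sketch (the tiny dataset is invented for illustration):

```python
from collections import Counter

def fit_bernoulli_nbc(X, y):
    """MLE for a naive Bayes classifier with binary features.

    Returns class priors pi[c] = Nc / N and per-feature parameters
    theta[c][j] = fraction of class-c examples with x_j = 1.
    """
    N = len(y)
    counts = Counter(y)
    pi = {c: counts[c] / N for c in counts}
    D = len(X[0])
    theta = {}
    for c in counts:
        rows = [x for x, yi in zip(X, y) if yi == c]
        theta[c] = [sum(r[j] for r in rows) / counts[c] for j in range(D)]
    return pi, theta

X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = ["a", "a", "b", "b"]
pi, theta = fit_bernoulli_nbc(X, y)
```

Note that each θjc uses only the jth feature of the class-c examples, exactly as the separated log-likelihood predicts.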

SLIDE 18

Summary: Naïve Bayes Classifier

Generative Model: fit the joint distribution p(x, y | θ). Make the naïve, and obviously untrue, assumption that the features are conditionally independent given the class:

p(x | y = c, θc) = ∏_{j=1}^{D} p(xj | y = c, θjc)

Despite this, the classifier often works quite well in practice. The conditional independence assumption reduces the number of parameters and avoids overfitting. Fitting the model is very straightforward, and it is easy to mix and match different models for different features.

SLIDE 19

NBC: Handling Missing Data at Test Time

Let’s recall our example about trying to predict voter preferences

Voted in 2012?   Annual Income   State   Candidate Choice
Y                50K             OK      Clinton
N                173K            CA      Clinton
Y                80K             NJ      Trump
Y                150K            WA      Clinton
N                25K             WV      Johnson
Y                85K             IL      Clinton
...              ...             ...     ...
Y                1050K           NY      Trump
N                35K             CA      Trump
?                100K            NY      ?

Suppose a voter does not reveal whether or not they voted in 2012 For now, let’s assume we had no missing entries during training

SLIDE 20

NBC: Prediction for Examples With Missing Data

The prediction rule in a generative model is

p(y = c | xnew, θ) = p(y = c | θ) · p(xnew | y = c, θ) / Σ_{c′=1}^{C} p(y = c′ | θ) · p(xnew | y = c′, θ)

Let us suppose our datapoint is xnew = (?, x2, . . . , xD), e.g., (?, 100K, NY). With all features present,

p(y = c | xnew, θ) = πc ∏_{j=1}^{D} p(xj | y = c, θjc) / Σ_{c′=1}^{C} πc′ ∏_{j=1}^{D} p(xj | y = c′, θjc′)

Since x1 is missing, we can marginalise it out:

p(y = c | xnew, θ) = πc ∏_{j=2}^{D} p(xj | y = c, θjc) / Σ_{c′=1}^{C} πc′ ∏_{j=2}^{D} p(xj | y = c′, θjc′)

This can be done for other generative models too, but in general marginalisation requires summation/integration.
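Because the naïve likelihood factorises, marginalising out a missing feature amounts to dropping its factor from the product. A sketch with invented parameters, where None marks a missing value:

```python
def posterior_with_missing(x, pi, theta):
    """p(y = c | x) for binary features; entries of x that are None are skipped.

    theta[c][j] = p(x_j = 1 | y = c).
    """
    scores = {}
    for c in pi:
        p = pi[c]
        for j, xj in enumerate(x):
            if xj is None:
                continue  # marginalised out: its factor sums to 1 over x_j
            p *= theta[c][j] if xj == 1 else 1 - theta[c][j]
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

pi = {"a": 0.5, "b": 0.5}
theta = {"a": [0.9, 0.8], "b": [0.1, 0.8]}
post = posterior_with_missing([None, 1], pi, theta)
```

In this invented example feature 0 is missing and feature 1 has the same distribution under both classes, so the posterior falls back to the prior.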

SLIDE 21

NBC: Training With Missing Data

For Naïve Bayes Classifiers, training with missing entries is quite easy.

Voted in 2012?   Annual Income   State   Candidate Choice
?                50K             OK      Clinton
N                173K            CA      Clinton
?                80K             NJ      Trump
Y                150K            WA      Clinton
N                25K             WV      Johnson
Y                85K             ?       Clinton
...              ...             ...     ...
Y                1050K           NY      Trump
N                35K             CA      Trump
?                100K            NY      ?

Let’s say that among the Clinton voters, 103 had voted in 2012, 54 had not, and 25 didn’t answer. You can simply set θ = 103/157 as the probability that a voter had voted in 2012, conditioned on being a Clinton supporter: the missing entries are just left out of the count.
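The arithmetic of the example: of the 103 + 54 = 157 Clinton voters who answered, 103 voted in 2012.

```python
voted, not_voted, missing = 103, 54, 25   # counts from the example

# Ignore the missing entries when estimating the conditional probability:
theta = voted / (voted + not_voted)       # 103/157
```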

SLIDE 22

Outline

◮ Generative Models for Classification
◮ Naïve Bayes Model
◮ Gaussian Discriminant Analysis

SLIDE 23

Generative Model: Gaussian Discriminant Analysis

Recall the form of the joint distribution in a generative model:

p(x, y | θ, π) = p(y | π) · p(x | y, θ)

For the classes, we use parameters πc such that Σ_{c} πc = 1 and p(y = c | π) = πc.

Suppose x ∈ R^D. We model the class-conditional density for each class c = 1, . . . , C as a multivariate normal distribution with mean µc and covariance matrix Σc:

p(x | y = c, θc) = N(x | µc, Σc)

SLIDE 24

Quadratic Discriminant Analysis (QDA)

Let’s first see what the prediction rule for this model is:

p(y = c | xnew, θ) = p(y = c | θ) · p(xnew | y = c, θ) / Σ_{c′=1}^{C} p(y = c′ | θ) · p(xnew | y = c′, θ)

When the densities p(x | y = c, θc) are multivariate normal, we get

p(y = c | x, θ) = πc |2πΣc|^{-1/2} exp( -(1/2)(x − µc)^T Σc^{-1} (x − µc) ) / Σ_{c′=1}^{C} πc′ |2πΣc′|^{-1/2} exp( -(1/2)(x − µc′)^T Σc′^{-1} (x − µc′) )

The denominator is the same for all classes, so the boundary between classes c and c′ is given by

πc |2πΣc|^{-1/2} exp( -(1/2)(x − µc)^T Σc^{-1} (x − µc) ) / πc′ |2πΣc′|^{-1/2} exp( -(1/2)(x − µc′)^T Σc′^{-1} (x − µc′) ) = 1

Thus the boundaries are quadratic surfaces; hence the method is called quadratic discriminant analysis.

SLIDE 25

Quadratic Discriminant Analysis (QDA)

SLIDE 26

Linear Discriminant Analysis

A special case is when the covariance matrices are shared or tied across the different classes: Σc = Σ for all c. We can write

p(y = c | x, θ) ∝ πc exp( -(1/2)(x − µc)^T Σ^{-1} (x − µc) )
               = exp( µc^T Σ^{-1} x − (1/2) µc^T Σ^{-1} µc + log πc ) · exp( -(1/2) x^T Σ^{-1} x )

Let us set

γc = -(1/2) µc^T Σ^{-1} µc + log πc
βc = Σ^{-1} µc

and so

p(y = c | x, θ) ∝ exp( βc^T x + γc )

SLIDE 27

Linear Discriminant Analysis (LDA) & Softmax

Recall that we wrote p(y = c | x, θ) ∝ exp( βc^T x + γc ). And so,

p(y = c | x, θ) = exp( βc^T x + γc ) / Σ_{c′} exp( βc′^T x + γc′ ) = softmax(η)c

where η = [β1^T x + γ1, . . . , βC^T x + γC].

Softmax maps a vector of numbers to a probability distribution with its mode at the maximum:

softmax([1, 2, 3]) ≈ [0.090, 0.245, 0.665]
softmax([10, 20, 30]) ≈ [2 × 10^{-9}, 4 × 10^{-5}, ≈ 1]
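The softmax values quoted above can be reproduced directly. Subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the definition:

```python
import math

def softmax(eta):
    """Map a vector of scores to a probability distribution."""
    m = max(eta)                          # subtract the max for numerical stability
    exps = [math.exp(e - m) for e in eta]
    z = sum(exps)
    return [e / z for e in exps]

print([round(p, 3) for p in softmax([1, 2, 3])])   # [0.09, 0.245, 0.665]
```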

SLIDE 28

QDA and LDA

SLIDE 29

Two class LDA

When we have only 2 classes, say 0 and 1,

p(y = 1 | x, θ) = exp( β1^T x + γ1 ) / ( exp( β1^T x + γ1 ) + exp( β0^T x + γ0 ) )
              = 1 / ( 1 + exp( −((β1 − β0)^T x + (γ1 − γ0)) ) )
              = sigmoid( (β1 − β0)^T x + (γ1 − γ0) )

Sigmoid Function: the sigmoid function is defined as sigmoid(t) = 1 / (1 + e^{−t})

[Figure: plot of the sigmoid function for t ∈ [−4, 4], rising from 0 towards 1 through sigmoid(0) = 0.5]
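The two-class reduction can be checked numerically: the two-score softmax equals the sigmoid of the score difference. The β and γ values below are invented, and x is a scalar for simplicity:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def two_class_posterior(x, b1, g1, b0, g0):
    """p(y = 1 | x) for two-class LDA with scalar x."""
    s1 = math.exp(b1 * x + g1)
    s0 = math.exp(b0 * x + g0)
    return s1 / (s1 + s0)

x, b1, g1, b0, g0 = 0.7, 2.0, -0.5, -1.0, 0.3
p_direct = two_class_posterior(x, b1, g1, b0, g0)
p_sig = sigmoid((b1 - b0) * x + (g1 - g0))   # same value, by the algebra above
```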

SLIDE 30

MLE for QDA (or LDA)

We can write the log-likelihood given data D = (xi, yi), i = 1, . . . , N, as:

log p(D | θ) = Σ_{c=1}^{C} Nc log πc + Σ_{c=1}^{C} Σ_{i: yi = c} log N(xi | µc, Σc)

As in the case of Naïve Bayes, we get πc = Nc / N. For the other parameters, it is possible to show that

µc = (1 / Nc) Σ_{i: yi = c} xi
Σc = (1 / Nc) Σ_{i: yi = c} (xi − µc)(xi − µc)^T

(See Chapter 4.1 of Murphy for details.)
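The MLE formulas for µc and Σc are plain within-class averages. A one-dimensional sketch (so Σc is a scalar variance) on an invented dataset:

```python
def fit_gaussian_per_class(xs, ys):
    """MLE of class prior, mean and variance for 1-D Gaussian class-conditionals."""
    N = len(ys)
    params = {}
    for c in set(ys):
        xc = [x for x, y in zip(xs, ys) if y == c]
        Nc = len(xc)
        mu = sum(xc) / Nc
        var = sum((x - mu) ** 2 for x in xc) / Nc   # MLE uses 1/Nc, not 1/(Nc - 1)
        params[c] = {"pi": Nc / N, "mu": mu, "var": var}
    return params

params = fit_gaussian_per_class([1.0, 3.0, 10.0, 14.0], ["a", "a", "b", "b"])
```

Note the 1/Nc normalisation: the MLE covariance is biased, unlike the usual 1/(Nc − 1) sample estimate.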

SLIDE 31

How to Prevent Overfitting

◮ The number of parameters in the model is roughly C · D^2
◮ In high dimensions this can lead to overfitting
◮ Use diagonal covariance matrices (basically Naïve Bayes)
◮ Use weight tying, a.k.a. parameter sharing (LDA vs QDA)
◮ Bayesian approaches
◮ Use a discriminative classifier (+ regularise if needed)
