Discriminative vs. Generative Learning
CS 760@UW-Madison
Goals for the lecture
You should understand the following concepts:
- the relationship between discriminative and generative models
- the relationship between logistic regression and Naïve Bayes

Discriminative vs. generative models
Discriminative approach:
model only how the label is predicted from the input: $y = h(x)$, or more generally, $p(y \mid x) = h(x)$

Generative approach:
model how the whole data point is created: $p(x, y) = h(x, y)$
Maximum A Posteriori (MAP)
Predict the label $y$ with the largest posterior probability given $x$. Under the Naïve Bayes assumption (conditionally independent features), the joint distribution factorizes as
$$p(x, y) = p(y)\, p(x \mid y) = p(y) \prod_j p(x_j \mid y).$$
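To make the factorization concrete, here is a minimal sketch of MAP prediction under a Naïve Bayes model. It is illustrative only (not the lecture's code); the table names `prior` and `cond` are hypothetical, standing for parameters already estimated, e.g. by counting.

```python
import numpy as np

def nb_map_predict(x, prior, cond):
    """MAP prediction under Naive Bayes:
    argmax_y  log p(y) + sum_j log p(x_j | y)."""
    scores = []
    for y in range(len(prior)):
        log_joint = np.log(prior[y])            # log p(y)
        for j, v in enumerate(x):
            log_joint += np.log(cond[y][j][v])  # log p(x_j = v | y)
        scores.append(log_joint)
    return int(np.argmax(scores))

# Toy example: 2 classes, 2 binary features (hypothetical numbers).
prior = [0.6, 0.4]                      # p(y)
cond = [                                # cond[y][j][v] = p(x_j = v | y)
    [[0.8, 0.2], [0.5, 0.5]],           # class 0
    [[0.3, 0.7], [0.1, 0.9]],           # class 1
]
print(nb_map_predict([1, 1], prior, cond))  # -> 1
```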
How to design the hypotheses and the loss? We can design both by a generative approach: posit a generative story, then read off the hypothesis class and the loss (or use MAP to derive the loss).

Generative story: each class generates its features from a spherical Gaussian,
$$p(x \mid y) = p(x \mid Y = y) = \mathcal{N}(x \mid \mu_y, I) = \frac{1}{(2\pi)^{d/2}} \exp\!\left(-\tfrac{1}{2}\|x - \mu_y\|^2\right).$$

By Bayes' rule, the posterior is
$$p(Y = y \mid x) = \frac{p(x \mid Y = y)\, p(Y = y)}{\sum_l p(x \mid Y = l)\, p(Y = l)} = \frac{\exp(a_y)}{\sum_l \exp(a_l)},$$
where
$$a_l := \ln\big(p(x \mid Y = l)\, p(Y = l)\big) = -\tfrac{1}{2} x^\top x + w_l^\top x + b_l,$$
with $w_l = \mu_l$ and $b_l = -\tfrac{1}{2}\mu_l^\top \mu_l + \ln p(Y = l) + \ln \frac{1}{(2\pi)^{d/2}}$.

The term $-\tfrac{1}{2} x^\top x$ is the same for every class, so it cancels from the ratio, leaving
$$p(Y = y \mid x) = \frac{\exp(w_y^\top x + b_y)}{\sum_l \exp(w_l^\top x + b_l)},$$
which is exactly the hypothesis class for multiclass logistic regression.
The corresponding loss is the negative log-likelihood:
$$-\frac{1}{n} \sum_{k=1}^{n} \log p\big(Y = y^{(k)} \mid x^{(k)}\big).$$
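As a quick sanity check of the derivation, here is a small numerical sketch of mine (not lecture code): for Gaussian class-conditionals with identity covariance, the exact Bayes posterior matches the softmax of $w_l^\top x + b_l$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 3, 4
mus = rng.normal(size=(K, d))            # class means mu_l
prior = np.array([0.1, 0.2, 0.3, 0.4])   # p(Y = l)
x = rng.normal(size=d)

# Bayes posterior from the Gaussian densities
# (the common (2*pi)^(-d/2) factor cancels in the normalization).
lik = np.exp(-0.5 * np.sum((x - mus) ** 2, axis=1))
posterior = lik * prior / np.sum(lik * prior)

# Softmax form with w_l = mu_l and b_l = -0.5 mu_l^T mu_l + log p(Y = l);
# the -0.5 x^T x term cancels too, so it is omitted.
a = mus @ x - 0.5 * np.sum(mus ** 2, axis=1) + np.log(prior)
softmax = np.exp(a) / np.sum(np.exp(a))

print(np.allclose(posterior, softmax))  # True
```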
Note that this generative story factorizes over coordinates:
$$p(x \mid Y = y) = \mathcal{N}(x \mid \mu_y, I) = \frac{1}{(2\pi)^{d/2}} \exp\!\left(-\tfrac{1}{2}\|x - \mu_y\|^2\right) = \frac{1}{(2\pi)^{d/2}} \prod_j \exp\!\left(-\tfrac{1}{2}(x_j - \mu_{yj})^2\right),$$
which is a special case of Naïve Bayes!

What if we start from a general Naïve Bayes model (instead of the more specific Normal distribution assumption)? Can we still derive logistic regression?
Consider Naïve Bayes for a binary classification task. By Bayes' rule and the Naïve Bayes factorization,
$$P(Y = 1 \mid x_1, \ldots, x_n) = \frac{P(Y=1) \prod_{i=1}^{n} P(x_i \mid Y=1)}{P(Y=1) \prod_{i=1}^{n} P(x_i \mid Y=1) + P(Y=0) \prod_{i=1}^{n} P(x_i \mid Y=0)}.$$

Expanding the denominator and dividing everything by the numerator:
$$P(Y = 1 \mid x_1, \ldots, x_n) = \frac{1}{1 + \frac{P(Y=0) \prod_i P(x_i \mid Y=0)}{P(Y=1) \prod_i P(x_i \mid Y=1)}}.$$

Applying $\exp(\ln a) = a$ and then $\ln(a/b) = -\ln(b/a)$:
$$P(Y = 1 \mid x_1, \ldots, x_n) = \frac{1}{1 + \exp\!\left(-\ln \frac{P(Y=1) \prod_i P(x_i \mid Y=1)}{P(Y=0) \prod_i P(x_i \mid Y=0)}\right)}.$$

Converting the log of products to a sum of logs:
$$P(Y = 1 \mid x_1, \ldots, x_n) = \frac{1}{1 + \exp\!\left(-\left[\ln \frac{P(Y=1)}{P(Y=0)} + \sum_i \ln \frac{P(x_i \mid Y=1)}{P(x_i \mid Y=0)}\right]\right)}.$$

Does this look familiar? Compare it with the logistic regression hypothesis
$$f(x) = \frac{1}{1 + \exp\!\left(-\left(w_0 + \sum_i w_i x_i\right)\right)}.$$
Naïve Bayes vs. logistic regression: matching the two expressions requires the linearity assumption that the log-ratio is linear in $x$ (for binary features this holds, since $\ln \frac{P(x_i \mid Y=1)}{P(x_i \mid Y=0)}$ is then a linear function of $x_i$).

Summary: if we begin with a Naïve Bayes generative story and derive the corresponding discriminative approach (assuming linearity), we get logistic regression! Naïve Bayes is the generative counterpart of logistic regression, and logistic regression is the discriminative counterpart of Naïve Bayes.
[Diagram: starting from the conditional independence (Naïve Bayes) assumption, the generative approach leads to the Naïve Bayes method, while the discriminative approach (+ linearity assumption) leads to logistic regression.]
[Figure: a logistic regression unit over features Color ∈ {red, blue} and Size ∈ {big, small}, with bias $\ln \frac{P(Y=1)}{P(Y=0)}$ and indicator-feature weights such as $\ln \frac{P(\text{red} \mid Y=1)}{P(\text{red} \mid Y=0)}$ and $\ln \frac{P(\text{blue} \mid Y=1)}{P(\text{blue} \mid Y=0)}$.]
The connection gives an interpretation for the weights in logistic regression: the weights correspond to log ratios.
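Here is a small sketch of mine (hypothetical parameter values, not lecture code) verifying the correspondence on binary features: the Naïve Bayes posterior $P(Y=1 \mid x)$ equals a sigmoid whose weights are log ratios of the Naïve Bayes parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical NB parameters for two binary features.
p_y1 = 0.3                      # P(Y = 1)
theta1 = np.array([0.7, 0.6])   # P(x_i = 1 | Y = 1)
theta0 = np.array([0.2, 0.5])   # P(x_i = 1 | Y = 0)

x = np.array([1, 0])

# Naive Bayes posterior, computed directly.
lik1 = p_y1 * np.prod(np.where(x == 1, theta1, 1 - theta1))
lik0 = (1 - p_y1) * np.prod(np.where(x == 1, theta0, 1 - theta0))
posterior = lik1 / (lik1 + lik0)

# Logistic-regression form: each weight is a difference of log ratios,
# and the bias absorbs the prior log-odds plus the x_i = 0 log ratios.
w = np.log(theta1 / theta0) - np.log((1 - theta1) / (1 - theta0))
w0 = np.log(p_y1 / (1 - p_y1)) + np.sum(np.log((1 - theta1) / (1 - theta0)))
print(np.allclose(posterior, sigmoid(w0 + w @ x)))  # True
```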
The two approaches share the same hypothesis space bias (recall our discussion of inductive bias). Are they therefore the same?

In general, no. They use different methods to estimate the model parameters: Naïve Bayes uses MLE to learn the parameters $p(x_j \mid y)$, whereas logistic regression minimizes the loss to learn the parameters $w_j$.
Asymptotic comparison (# training instances → ∞):
- when the conditional independence assumptions are correct, NB and LR produce identical classifiers
- when the conditional independence assumptions are incorrect, LR is less biased: its weights can compensate for the incorrect assumptions (e.g., what if we have two redundant but relevant features?), so it is expected to outperform NB given enough training data
Non-asymptotic analysis [Ng & Jordan, NIPS 2001]: how many training instances are needed to get good estimates, i.e., to approach the asymptotic parameter values?
- naïve Bayes: O(log n)
- logistic regression: O(n)
where n = # features.
[Figure: learning curves (error vs. size of training set) for naïve Bayes and logistic regression. Ng and Jordan compared learning curves for the two approaches on 15 data sets (some with discrete features, some with continuous features). The general trend supports the theory: naïve Bayes is more accurate on small training sets, while logistic regression tends to do better once the training set is large.]
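The qualitative trend can be reproduced on synthetic data. The sketch below is mine (not Ng & Jordan's experiment): binary features with a few duplicated columns, so the conditional independence assumption is violated; typically NB leads at small training-set sizes and LR at large ones.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
n_features, n_test = 30, 5000

def sample(n):
    """Labels plus binary features; duplicating the first 5 columns
    violates the conditional independence assumption."""
    y = rng.integers(0, 2, size=n)
    theta = np.where(y[:, None] == 1, 0.6, 0.4)        # P(x_j = 1 | y)
    x = (rng.random((n, n_features)) < theta).astype(int)
    return np.hstack([x, x[:, :5]]), y

x_test, y_test = sample(n_test)
for m in [20, 50, 200, 1000, 5000]:
    x_tr, y_tr = sample(m)
    nb = BernoulliNB().fit(x_tr, y_tr)
    lr = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    print(m, round(nb.score(x_test, y_test), 3), round(lr.score(x_test, y_test), 3))
```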
Take-aways, for the same model class:
- if the modeling assumptions are correct (e.g., conditionally independent features in NB), the two will produce identical classifiers in the limit (# training instances → ∞)
- if the assumptions are incorrect, the discriminative approach is likely to be more accurate for large training sets
- the generative approach may be more accurate for small training sets, because its parameters converge to their asymptotic values more quickly (in terms of training set size)
Q: How do we know whether a given task is better suited for a generative or discriminative method? A: Empirically compare the two.
Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Elad Hazan, Tom Dietterich, and Pedro Domingos.