Lecture 9: Logistic Regression, Discriminative vs. Generative Classification, Linear Discriminant Functions
Aykut Erdem
October 2017, Hacettepe University
Administrative:
− Midterm exam will be held on November 6.
− Project proposals
slide by Barnabás Póczos & Aarti Singh
Naïve Bayes prediction for test data:

$$\hat{y} = \arg\max_{y}\; \underbrace{P(Y=y)}_{\text{class prior}}\; \prod_{i=1}^{d} \underbrace{P(X_i = x_i \mid Y=y)}_{\text{likelihood}}$$

We need to estimate these probabilities!
slide by Barnabás Póczos & Aarti Singh
Example: classifying a MEDLINE article by topic.
slide by Dan Jurafsky
Typical additional assumption: position in the document doesn't matter:
P(Xi = xi | Y = y) = P(Xk = xi | Y = y)
− "Bag of words" model: the order of words on the page is ignored. The document is just a bag of i.i.d. words.
− Sounds really silly, but often works very well!

With a 50,000-word vocabulary, this leaves K(50,000 − 1) parameters to estimate.
slide by Barnabás Póczos & Aarti Singh
Worked example (multinomial Naïve Bayes with add-1 smoothing):

           Doc  Words                                 Class
Training   1    Chinese Beijing Chinese               c
           2    Chinese Chinese Shanghai              c
           3    Chinese Macao                         c
           4    Tokyo Japan Chinese                   j
Test       5    Chinese Chinese Chinese Tokyo Japan   ?

Estimators:

$$\hat{P}(c) = \frac{N_c}{N}, \qquad \hat{P}(w \mid c) = \frac{\mathrm{count}(w,c)+1}{\mathrm{count}(c)+|V|}$$

Priors: P(c) = 3/4, P(j) = 1/4

Conditional probabilities:
P(Chinese|c) = (5+1)/(8+6) = 6/14 = 3/7
P(Tokyo|c) = (0+1)/(8+6) = 1/14
P(Japan|c) = (0+1)/(8+6) = 1/14
P(Chinese|j) = (1+1)/(3+6) = 2/9
P(Tokyo|j) = (1+1)/(3+6) = 2/9
P(Japan|j) = (1+1)/(3+6) = 2/9

Choosing a class:
P(c|d5) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001
→ predict class c.
slide by Dan Jurafsky
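A runnable sketch of this worked example in Python (variable and function names are illustrative; only the toy corpus above is assumed):

```python
from collections import Counter

# Toy corpus from the slide: (words, class)
train = [("Chinese Beijing Chinese".split(), "c"),
         ("Chinese Chinese Shanghai".split(), "c"),
         ("Chinese Macao".split(), "c"),
         ("Tokyo Japan Chinese".split(), "j")]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

classes = {c for _, c in train}
vocab = {w for words, _ in train for w in words}
N = len(train)

def posterior(c):
    """Unnormalized P(c|d): class prior times smoothed word likelihoods."""
    docs_c = [words for words, y in train if y == c]
    prior = len(docs_c) / N                       # P̂(c) = Nc / N
    counts = Counter(w for words in docs_c for w in words)
    total = sum(counts.values())
    score = prior
    for w in test_doc:                            # add-1 (Laplace) smoothing
        score *= (counts[w] + 1) / (total + len(vocab))
    return score

for c in sorted(classes):
    print(c, posterior(c))   # c: ~0.0003, j: ~0.0001 -> predict c
```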
Gaussian Naïve Bayes (GNB): model each class-conditional as a Gaussian,

$$P(X_i = x \mid Y = y_k) = N(x;\, \mu_{ik}, \sigma_{ik})$$

with a different mean and variance for each class k and each pixel (feature) i. Sometimes the variance is assumed to be independent of the class (σi instead of σik).
slide by Barnabás Póczos & Aarti Singh
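A minimal sketch of GNB parameter estimation under this model, assuming a NumPy feature matrix `X` and label vector `y` (all names are illustrative):

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate per-class, per-feature Gaussian parameters and class priors.

    X: (n_samples, d) feature matrix; y: (n_samples,) integer labels.
    Returns dicts mapping class k -> prior, mean vector, variance vector.
    """
    priors, means, variances = {}, {}, {}
    for k in np.unique(y):
        Xk = X[y == k]
        priors[k] = len(Xk) / len(X)          # P̂(Y = k)
        means[k] = Xk.mean(axis=0)            # μ_ik for every feature i
        variances[k] = Xk.var(axis=0) + 1e-9  # σ²_ik (tiny floor for stability)
    return priors, means, variances

def log_joint(x, k, priors, means, variances):
    """log P(Y=k) + Σ_i log N(x_i; μ_ik, σ_ik); predict via argmax over k."""
    var = variances[k]
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - means[k])**2 / var)
    return np.log(priors[k]) + log_lik
```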
slide by Aarti Singh & Barnabás Póczos
Gaussian class conditional densities
slide by Aarti Singh & Barnabás Póczos
Decision boundary:

$$\prod_{i=1}^{d} P(X_i \mid Y=0)\,P(Y=0) \;=\; \prod_{i=1}^{d} P(X_i \mid Y=1)\,P(Y=1)$$
slide by Aarti Singh & Barnabás Póczos
Decision boundary, in log-odds form:

$$\log\frac{P(Y=0)\,\prod_{i=1}^{d}P(X_i\mid Y=0)}{P(Y=1)\,\prod_{i=1}^{d}P(X_i\mid Y=1)} \;=\; \log\frac{1-\pi}{\pi} \;+\; \sum_{i=1}^{d}\log\frac{P(X_i\mid Y=0)}{P(X_i\mid Y=1)}$$

where $\pi = P(Y=1)$.
slide by Aarti Singh & Barnabás Póczos
slide by Aarti Singh & Barnabás Póczos
Decision boundary: a constant term plus a first-order term. For Gaussian class-conditionals with class-independent variances, the log-odds above reduces to exactly these two pieces, so the boundary is linear in X.
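To see where the two terms come from, here is a short derivation, assuming Gaussian class-conditionals $P(X_i \mid Y=y) = N(\mu_{iy}, \sigma_i)$ with class-independent variances:

$$\log\frac{P(X_i\mid Y=0)}{P(X_i\mid Y=1)} = \frac{(X_i-\mu_{i1})^2-(X_i-\mu_{i0})^2}{2\sigma_i^2} = \underbrace{\frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\,X_i}_{\text{first-order term}} \;+\; \underbrace{\frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}}_{\text{constant term}}$$

Summing over i and adding $\log\frac{1-\pi}{\pi}$ gives a function that is linear in X.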
Decision Boundary

$X = (x_1, x_2)$, $P_1 = P(Y=0)$, $P_2 = P(Y=1)$,
$p_1(X) = p(X \mid Y=0) \sim N(M_1, \Sigma_1)$, $p_2(X) = p(X \mid Y=1) \sim N(M_2, \Sigma_2)$
slide by Aarti Singh & Barnabás Póczos
Logistic Regression assumes the following functional form for P(Y|X): the logistic function applied to a linear function of the data,

$$P(Y=1\mid X) = \frac{\exp\!\big(w_0+\sum_{i} w_i X_i\big)}{1+\exp\!\big(w_0+\sum_{i} w_i X_i\big)} = \frac{1}{1+\exp\!\big(-(w_0+\sum_{i} w_i X_i)\big)}$$

Logistic function (or sigmoid): $\sigma(z) = \dfrac{1}{1+e^{-z}}$

Features can be discrete or continuous!

slide by Aarti Singh & Barnabás Póczos
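As a concrete illustration, a minimal NumPy sketch of this functional form (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, w0):
    """P(Y=1|X) = sigmoid(w0 + X·w) for each row of X."""
    return sigmoid(w0 + X @ w)

def predict(X, w, w0):
    """Label 1 iff P(Y=1|X) > 0.5, i.e. iff w0 + X·w > 0."""
    return (predict_proba(X, w, w0) > 0.5).astype(int)
```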
With this functional form for P(Y|X), the decision boundary P(Y=0|X) = P(Y=1|X) is where $w_0+\sum_i w_i X_i = 0$: a linear decision boundary!

slide by Aarti Singh & Barnabás Póczos
In the more general multiclass case, logistic regression assumes the following functional form for P(Y|X):

$$P(Y=k\mid X) = \frac{\exp\!\big(w_{k0}+\sum_i w_{ki}X_i\big)}{1+\sum_{j=1}^{K-1}\exp\!\big(w_{j0}+\sum_i w_{ji}X_i\big)} \quad \text{for } k<K$$

$$P(Y=K\mid X) = \frac{1}{1+\sum_{j=1}^{K-1}\exp\!\big(w_{j0}+\sum_i w_{ji}X_i\big)} \quad \text{for } k=K \text{ (normalization, so no weights for this class)}$$

slide by Aarti Singh & Barnabás Póczos
But there is a problem… we don't have a model for P(X) or P(X|Y), only for P(Y|X).

We'll focus on binary classification:

slide by Aarti Singh & Barnabás Póczos
Discriminative philosophy: don't waste effort learning P(X); focus on P(Y|X), since that's all that matters for classification!

slide by Aarti Singh & Barnabás Póczos
Instead, maximize the conditional log-likelihood:

$$l(w) \equiv \ln \prod_{j} P(y^j \mid x^j, w)$$
slide by Aarti Singh & Barnabás Póczos
Recall the model:

$$P(Y=0\mid X) = \frac{1}{1+\exp\!\big(w_0+\sum_{i=1}^{d} w_i X_i\big)}, \qquad P(Y=1\mid X) = \frac{\exp\!\big(w_0+\sum_{i=1}^{d} w_i X_i\big)}{1+\exp\!\big(w_0+\sum_{i=1}^{d} w_i X_i\big)}$$

Using these, we can re-express the log of the conditional likelihood:

$$l(w) = \sum_j \Big[ y^j \ln P(Y=1\mid x^j,w) + (1-y^j)\ln P(Y=0\mid x^j,w) \Big] = \sum_j \Big[ y^j \big(w_0+\textstyle\sum_{i=1}^{d} w_i x_i^j\big) - \ln\big(1+\exp(w_0+\textstyle\sum_{i=1}^{d} w_i x_i^j)\big) \Big]$$

slide by Aarti Singh & Barnabás Póczos
Bad news: there is no closed-form solution to maximize l(w).
Good news: l(w) is a concave function of w! Concave functions are easy to optimize (unique maximum).

slide by Aarti Singh & Barnabás Póczos
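For the gradient steps used below, differentiating the expression for l(w) above gives a convenient form (a standard derivation):

$$\frac{\partial l(w)}{\partial w_i} = \sum_j x_i^j\Big(y^j - \hat{P}(Y^j=1\mid x^j, w)\Big)$$

i.e., a sum of prediction errors weighted by feature values.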
Gradient ascent (concave) / gradient descent (convex)

Gradient: $\nabla_w l(w) = \left[\frac{\partial l(w)}{\partial w_0},\ldots,\frac{\partial l(w)}{\partial w_d}\right]$

Update rule: $w_i^{(t+1)} = w_i^{(t)} + \eta\,\frac{\partial l(w)}{\partial w_i}\Big|_{w^{(t)}}$, with learning rate $\eta > 0$.

slide by Aarti Singh & Barnabás Póczos
Gradient ascent algorithm: iterate until change < ε

$$w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \sum_j \big( y^j - \hat{P}(Y^j=1 \mid x^j, w^{(t)}) \big)$$

For i = 1, …, d, repeat:

$$w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_j x_i^j \big( y^j - \hat{P}(Y^j=1 \mid x^j, w^{(t)}) \big)$$

Here $\hat{P}(Y^j=1\mid x^j, w)$ predicts what the current weights think label Y should be. More sophisticated optimizers can also be used, e.g. Newton's method, conjugate gradient ascent, IRLS (see Bishop 4.3.3).

slide by Aarti Singh & Barnabás Póczos
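A minimal runnable sketch of this update loop, assuming a NumPy feature matrix `X` and binary label vector `y`; the per-example averaging of the gradient and the convergence test are implementation choices, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, eta=0.1, eps=1e-6, max_iters=10000):
    """Batch gradient ascent on the conditional log-likelihood l(w)."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])     # prepend x0 = 1 so w[0] is w0
    w = np.zeros(d + 1)
    for _ in range(max_iters):
        p = sigmoid(Xb @ w)                  # current P(Y=1 | x, w) per example
        grad = Xb.T @ (y - p)                # ∂l/∂w_i = Σ_j x_i^j (y^j − p^j)
        w_new = w + eta * grad / n           # averaged step for stability
        if np.max(np.abs(w_new - w)) < eps:  # iterate until change < ε
            return w_new
        w = w_new
    return w
```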
slide by Aarti Singh & Barnabás Póczos
− Large η → fast convergence, but larger residual error; also possible oscillations
− Small η → slow convergence, but small residual error
Set of Gaussian Naïve Bayes parameters (feature variance independent of class label) ↔ Set of Logistic Regression parameters
− But only in a special case!!! (GNB with class-independent variances)
− Optimize different functions! Obtain different solutions

slide by Aarti Singh & Barnabás Póczos
slide by Aarti Singh & Barnabás Póczos
Discriminative (LR) and generative (GNB) classifiers perform similarly when the Naïve Bayes conditional independence assumption holds. If the conditional independence assumption does NOT hold, the discriminative classifier outperforms the generative one.
slide by Aarti Singh & Barnabás Póczos
[Ng & Jordan, NIPS 2001]
slide by Aarti Singh & Barnabás Póczos
[Ng & Jordan, NIPS 2001]
Naïve Bayes (generative) requires only n = O(log d) examples to converge to its asymptotic error, whereas logistic regression (discriminative) requires n = O(d). This is because Naïve Bayes estimates its parameters independently, not jointly, from the training data.
slide by Aarti Singh & Barnabás Póczos
[Figure: classification error vs. number of training examples m on UCI datasets (pima, adult, boston (predict if > median price), ionosphere), comparing Naïve Bayes and logistic regression.]

Experiments on the UCI Machine Learning Repository: 15 datasets, 8 with continuous features, 7 with discrete features. More in the paper...
slide by Aarti Singh & Barnabás Póczos
slide by Aarti Singh & Barnabás Póczos
Naïve Bayes vs. logistic regression, in summary:
− LR is a linear classifier: the decision rule is a hyperplane
− LR is trained by maximizing the conditional likelihood: no closed-form solution, but l(w) is concave → global optimum with gradient ascent
− GNB with class-independent variances is representationally equivalent to LR; the solutions differ because of the objective (loss) function
− They make different assumptions. NB: features independent given class, an assumption on P(X|Y). LR: functional form of P(Y|X), no assumption on P(X|Y)
− GNB (usually) needs less data; LR (usually) gets to better solutions in the limit
A linear discriminant function:

$$y(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + w_0, \qquad \operatorname{sign}(a) = \begin{cases} +1, & a>0 \\ -1, & a<0 \end{cases}$$
slide by Ce Liu
For any point x on the decision surface, $y(\mathbf{x})=0$, so $\dfrac{\mathbf{w}^\top\mathbf{x}}{\|\mathbf{w}\|} = -\dfrac{w_0}{\|\mathbf{w}\|}$; the normal distance from the origin to the decision surface is therefore $-w_0/\|\mathbf{w}\|$.
[Figure (cf. Bishop Fig. 4.1): geometry of a linear discriminant in 2-D (x1, x2): decision surface y = 0 between regions R1 (y > 0) and R2 (y < 0), normal vector w, projection x⊥, signed distance y(x)/‖w‖ of a point x from the surface, and offset w0/‖w‖ of the surface from the origin.]
The decision surface is perpendicular to w, and its displacement from the origin is controlled by the bias parameter w0. The signed perpendicular distance r of a general point x from the decision surface is given by y(x)/‖w‖.
slide by Ce Liu
Write $\mathbf{x} = \mathbf{x}_\perp + r\,\dfrac{\mathbf{w}}{\|\mathbf{w}\|}$. Then

$$\mathbf{w}^\top\mathbf{x} + w_0 = \mathbf{w}^\top\mathbf{x}_\perp + w_0 + r\|\mathbf{w}\| \;\Rightarrow\; y(\mathbf{x}) = r\|\mathbf{w}\| \;\Rightarrow\; r = \frac{y(\mathbf{x})}{\|\mathbf{w}\|}$$

(using $y(\mathbf{x}_\perp) = 0$). With augmented vectors, define $\tilde{\mathbf{w}} = (w_0, \mathbf{w})$ and $\tilde{\mathbf{x}} = (1, \mathbf{x})$, so that $\tilde{y}(\mathbf{x}) = \tilde{\mathbf{w}}^\top\tilde{\mathbf{x}}$.
slide by Ce Liu
[Figure (cf. Bishop Fig. 4.2): combining binary classifiers leaves ambiguous regions, marked "?". Left: one-versus-the-rest (C1 vs. not C1, C2 vs. not C2) gives regions R1, R2, R3 with an ambiguous region. Right: one-versus-one (C1/C2, C1/C3, C2/C3) also leaves an ambiguous region.]
slide by Ce Liu
A solution: use K linear discriminant functions $y_k(\mathbf{x}) = \mathbf{w}_k^\top\mathbf{x} + w_{k0}$ and assign x to class $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$.
slide by Ce Liu
Theorem. The decision regions of the K-class discriminant $y_k(\mathbf{x}) = \mathbf{w}_k^\top\mathbf{x} + w_{k0}$ are singly connected and convex.

Proof. Suppose two points $\mathbf{x}_A$ and $\mathbf{x}_B$ both lie inside decision region $R_k$. Any point $\hat{\mathbf{x}}$ on the line between $\mathbf{x}_A$ and $\mathbf{x}_B$ can be expressed as

$$\hat{\mathbf{x}} = \lambda\mathbf{x}_A + (1-\lambda)\mathbf{x}_B, \quad 0 \le \lambda \le 1.$$

By linearity,

$$y_k(\hat{\mathbf{x}}) = \lambda y_k(\mathbf{x}_A) + (1-\lambda)y_k(\mathbf{x}_B) > \lambda y_j(\mathbf{x}_A) + (1-\lambda)y_j(\mathbf{x}_B) = y_j(\hat{\mathbf{x}}) \quad (\forall j \neq k).$$

Therefore $\hat{\mathbf{x}} \in R_k$, and the region $R_k$ is singly connected and convex. ∎
slide by Ce Liu
[Figure (cf. Bishop Fig. 4.3): decision regions Ri, Rj, Rk of a multiclass linear discriminant, with points xA, xB inside Rk and a point x̂ on the line between them.]
If two points xA and xB both lie inside the same decision region Rk, then any point x that lies on the line connecting these two points must also lie in Rk, and hence the decision region must be singly connected and convex.
slide by Ce Liu
A way to view a linear classification model is in terms of dimensionality reduction: project the input onto one dimension, $y = \mathbf{w}^\top\mathbf{x}$, and adjust w so that the classes can be maximally separated in the projection.
[Figure (cf. Bishop Fig. 4.6): two-class data projected onto one dimension. Left: projection onto the difference of the class means. Right: Fisher's linear discriminant, which gives better separation.]
Class means:

$$\mathbf{m}_1 = \frac{1}{N_1}\sum_{n\in C_1}\mathbf{x}_n, \qquad \mathbf{m}_2 = \frac{1}{N_2}\sum_{n\in C_2}\mathbf{x}_n$$
slide by Ce Liu
The squared separation of the projected means is

$$\big(\mathbf{w}^\top(\mathbf{m}_1-\mathbf{m}_2)\big)^2 = \mathbf{w}^\top(\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^\top\mathbf{w} = \mathbf{w}^\top S_B\,\mathbf{w}$$

where $S_B = (\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^\top$ is called the between-class covariance matrix.
Similarly, the within-class variance of the projected data is $\mathbf{w}^\top S_W\,\mathbf{w}$, where

$$S_W = \sum_{n\in C_1}(\mathbf{x}_n-\mathbf{m}_1)(\mathbf{x}_n-\mathbf{m}_1)^\top + \sum_{n\in C_2}(\mathbf{x}_n-\mathbf{m}_2)(\mathbf{x}_n-\mathbf{m}_2)^\top$$

is the within-class covariance matrix.
slide by Ce Liu
Fisher criterion: maximize the ratio

$$J(\mathbf{w}) = \frac{\text{between-class variance}}{\text{within-class variance}} = \frac{\mathbf{w}^\top S_B\,\mathbf{w}}{\mathbf{w}^\top S_W\,\mathbf{w}}$$

Using the quotient rule, for $f(x) = \frac{g(x)}{h(x)}$, $f'(x) = \frac{g'(x)h(x) - g(x)h'(x)}{h^2(x)}$, setting the derivative of J to zero gives

$$(\mathbf{w}^\top S_B\,\mathbf{w})\,S_W\mathbf{w} = (\mathbf{w}^\top S_W\,\mathbf{w})\,S_B\mathbf{w} = (\mathbf{w}^\top S_W\,\mathbf{w})\,(\mathbf{m}_2-\mathbf{m}_1)\big((\mathbf{m}_2-\mathbf{m}_1)^\top\mathbf{w}\big)$$

We only care about the direction of w, so the scalar factors are dropped. Therefore

$$\mathbf{w} \propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$$
slide by Ce Liu
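A minimal sketch of Fisher's direction in NumPy, assuming arrays `X1` and `X2` holding one class's samples each (names are illustrative):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction: w ∝ S_W^{-1} (m2 - m1).

    X1, X2: (n1, d) and (n2, d) arrays of samples from the two classes.
    Returns the (unnormalized) projection direction w.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class covariance S_W = Σ (x-m1)(x-m1)^T + Σ (x-m2)(x-m2)^T
    D1, D2 = X1 - m1, X2 - m2
    S_W = D1.T @ D1 + D2.T @ D2
    return np.linalg.solve(S_W, m2 - m1)   # avoids explicitly inverting S_W

# Projecting onto w reduces the problem to 1-D: y = X @ w, then threshold y.
```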
Fisher's discriminant thus reduces the d-dimensional classification problem to 1D: classify by thresholding the projected value (adaptive thresholds lead to nonlinear classifiers). The final classifier has the form below, where the nonlinear activation function sign(·) is a step function.
$$y(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^\top\mathbf{x} + w_0), \qquad \operatorname{sign}(a) = \begin{cases} +1, & a>0 \\ -1, & a<0 \end{cases}$$
slide by Ce Liu