Lecture 9: Logistic Regression (10-17-2008) - PowerPoint PPT Presentation



  1. Lecture 9: Logistic Regression, 10-17-2008

  2. Review
• Training a Naïve Bayes classifier
  – estimate p(y) and p(x_i|y) for i = 1, …, m
• Predicting with a Naïve Bayes classifier
  – p(y|X) = p(X|y)p(y)/p(X)
  – predict the y that maximizes p(y|X)
• Zero probabilities cause headaches for Bayes classifiers
  – Laplace smoothing
• Generative vs. discriminative approaches
  – Naïve Bayes is a generative approach
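
As a concrete reminder of the training and prediction steps reviewed above, here is a minimal sketch for binary 0/1 features with Laplace smoothing; the function names, the `alpha` parameter, and the binary-feature assumption are illustrative and not from the slides.

```python
import numpy as np

def train_naive_bayes(X, y, alpha=1.0):
    """Estimate p(y) and p(x_i = 1 | y) with Laplace smoothing (alpha)."""
    X, y = np.asarray(X), np.asarray(y)
    classes = np.unique(y)
    prior = {c: float(np.mean(y == c)) for c in classes}
    cond = {c: (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
            for c in classes}  # smoothed p(x_i = 1 | y = c); never exactly 0 or 1
    return prior, cond

def predict_naive_bayes(x, prior, cond):
    """Predict the y that maximizes p(y|x), proportional to p(y) * prod_i p(x_i|y)."""
    x = np.asarray(x)
    scores = {c: np.log(prior[c]) + np.sum(np.log(np.where(x == 1, cond[c], 1 - cond[c])))
              for c in prior}
    return max(scores, key=scores.get)

# Tiny made-up example: two binary features, labels 0/1
prior, cond = train_naive_bayes([[1, 0], [1, 1], [0, 1], [0, 0]], [1, 1, 0, 0])
print(predict_naive_bayes([1, 0], prior, cond))  # -> 1
```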

  3. Logistic Regression
• Assume that the log odds of y=1 is a linear function of x:
  log [ P(y=1|x) / P(y=0|x) ] = w_0 + w_1 x_1 + … + w_m x_m
• Or equivalently, we have:
  P(y=1|x) = 1 / (1 + e^-(w_0 + w_1 x_1 + … + w_m x_m))   (the sigmoid function)
Side note: the odds in favor of an event are the quantity p / (1 − p), where p is the probability of the event. If I toss a fair die, what are the odds that I will get a six?
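
A small sketch of the sigmoid form (the names `sigmoid`, `p_y1_given_x`, `w0`, and `w` are illustrative); the last two lines work out the odds side note for a fair die.

```python
import numpy as np

def sigmoid(v):
    # the sigmoid maps any real v into (0, 1)
    return 1.0 / (1.0 + np.exp(-v))

def p_y1_given_x(w0, w, x):
    # P(y=1|x) = sigmoid(w0 + w . x), i.e. the log odds are linear in x
    return sigmoid(w0 + np.dot(w, x))

print(p_y1_given_x(0.5, np.array([1.0, -2.0]), np.array([3.0, 1.0])))  # ~0.818

# Side note: odds = p / (1 - p). For a six on one roll of a fair die:
p = 1 / 6
print(p / (1 - p))  # 0.2, i.e. odds of 1 to 5
```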

  4. Learning w for logistic regression
Sigmoid: t = 1 / (1 + e^-v)
Given a set of training data points, we would like to find a weight vector w such that
  P(y=1|X) = 1 / (1 + e^-(w_0 + w_1 x_1 + … + w_m x_m))
is large (e.g., 1) for positive training examples, and small (e.g., 0) otherwise.
In other words, a good weight vector W should satisfy the following: if we plot (v = W ∙ X^i, t = y^i), i = 1, …, n, the points should lie in the area close to t = 0 (for y^i = 0) and t = 1 (for y^i = 1).

  5. Learning w for logistic regression
• This can be captured in the following objective function:
  L(w) = Σ_i log P(y^i | x^i, w)
       = Σ_i [ y^i log P(y=1 | x^i, w) + (1 − y^i) log(1 − P(y=1 | x^i, w)) ]
Note that the superscript i is an index to the examples in the training set.
This is called the likelihood function of w, and by maximizing this objective function we perform what we call "maximum likelihood estimation" of the parameter w.
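
The objective can be written directly in code; this is a minimal sketch (the separate bias term `w0` and the `eps` guard are my additions, not from the slide).

```python
import numpy as np

def log_likelihood(w0, w, X, y):
    """L(w) = sum_i [ y^i * log P(y=1|x^i,w) + (1 - y^i) * log(1 - P(y=1|x^i,w)) ]
    X: N x m array of examples, y: length-N array of 0/1 labels."""
    p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))  # P(y=1 | x^i, w) for every example
    eps = 1e-12                              # guard against log(0)
    return float(np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
```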

  6. Maximum Likelihood Estimation
Goal: estimate the parameters given data, assuming the data is i.i.d. (independently and identically distributed).
For example, given the results of n coin tosses, we would like to estimate the probability of heads p.
Likelihood function:
  L(θ) = log P(D | θ) = log Π_{i=1..n} P(x^i, y^i | θ) = Σ_{i=1..n} log P(x^i, y^i | θ)
MLE estimator:
  θ_MLE = argmax_θ L(θ)

  7. Example
• Data: n iid coin tosses: D = {0, 0, 1, 0, …, 1}
• Parameter: θ = P(x=1)
• Binary distribution: P(x) = θ^x (1 − θ)^(1−x)
• Likelihood function?
• MLE estimate?

  8. Example
• Data: n iid coin tosses: D = {0, 0, 1, 0, …, 1}
• Parameter: θ = P(x=1)
• Binary distribution: P(x) = θ^x (1 − θ)^(1−x)
• Likelihood function:
  L(θ) = log [ θ^(n_1) (1 − θ)^(n_0) ] = n_1 log θ + n_0 log(1 − θ)
• MLE estimate:
  dL/dθ = n_1/θ − n_0/(1 − θ) = 0
  ⇒ n_1 (1 − θ) = n_0 θ
  ⇒ n_1 = (n_1 + n_0) θ
  ⇒ θ = n_1 / (n_1 + n_0)
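
A quick numerical check of the derivation (the sample data `D` is made up for illustration): the closed-form estimate n_1/(n_1+n_0) agrees with a brute-force maximization of L(θ) over a grid.

```python
import numpy as np

D = np.array([0, 0, 1, 0, 1, 1, 1, 0, 1, 0])       # illustrative coin-toss data
n1 = D.sum()
n0 = len(D) - n1

theta_mle = n1 / (n1 + n0)                          # closed-form estimate from above

thetas = np.linspace(0.001, 0.999, 999)             # brute-force grid search
L = n1 * np.log(thetas) + n0 * np.log(1 - thetas)   # L(theta) = n1 log(t) + n0 log(1-t)
print(theta_mle, thetas[np.argmax(L)])              # both ~0.5 for this sample
```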

  9. MLE for logistic regression
  L(W) = log P(D | W) = log Π_{i=1..n} P(X^i, y^i | W) = Σ_{i=1..n} log P(X^i, y^i | W)
       = Σ_{i=1..n} log [ P(y^i | W, X^i) P(X^i | W) ] = Σ_{i=1..n} log P(y^i | W, X^i) + Σ_{i=1..n} log P(X^i | W)
The second sum does not depend on W (logistic regression does not model the distribution of X), so it can be dropped from the maximization:
  W_MLE = argmax_W L(W) = argmax_W Σ_{i=1..n} log P(y^i | W, X^i)
        = argmax_W Σ_{i=1..n} [ y^i log P(y=1 | W, X^i) + (1 − y^i) log(1 − P(y=1 | W, X^i)) ]
Equivalently, given a set of training data points, we would like to find a weight vector w such that P(y=1 | W, X) is large (e.g., 1) for positive training examples, and small (e.g., 0) otherwise – the same as our intuition.

  10. Optimizing L(w)
• Unfortunately, this does not have a closed-form solution
• Instead, we iteratively search for the optimal w
• Start with a random w, iteratively improve it (similar to the Perceptron)

  11. Logistic regression learning (equation figure not preserved in this extraction; the annotation labels the learning rate)

  12. Batch Learning for Logistic Regression
Given: training examples (x^i, y^i), i = 1, …, N
Let w ← (0, 0, 0, …, 0)
Repeat until convergence:
  d ← (0, 0, 0, …, 0)
  For i = 1 to N do
    ŷ^i ← 1 / (1 + e^(−w · x^i))
    error ← y^i − ŷ^i
    d ← d + error · x^i
  w ← w + η d
Note: y takes 0/1 here, not 1/−1
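
A runnable version of the slide's batch algorithm, vectorized over examples; the convergence test, `eta`, `max_iter`, and the prepended bias column are my choices and are not specified on the slide.

```python
import numpy as np

def train_logistic_batch(X, y, eta=0.1, max_iter=1000, tol=1e-6):
    # X: N x m feature matrix, y: length-N vector of 0/1 labels
    X = np.hstack([np.ones((len(X), 1)), np.asarray(X, float)])  # bias column -> w0
    y = np.asarray(y, float)
    w = np.zeros(X.shape[1])                       # w <- (0, 0, ..., 0)
    for _ in range(max_iter):                      # "repeat until convergence"
        y_hat = 1.0 / (1.0 + np.exp(-(X @ w)))     # predicted P(y=1|x^i) per example
        d = X.T @ (y - y_hat)                      # sum_i error^i * x^i
        w_next = w + eta * d                       # w <- w + eta * d
        if np.max(np.abs(w_next - w)) < tol:       # crude convergence test
            return w_next
        w = w_next
    return w

# Tiny usage example with made-up data
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]
print(train_logistic_batch(X_toy, y_toy, eta=0.5, max_iter=5000))
```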

  13. Logistic Regression vs. Perceptron
• Note the striking similarity between the two algorithms
• In fact, LR learns a linear decision boundary – how so?
  – We can show this mathematically (see board; a sketch follows below)
• What are the differences?
  – Different ways to train the weights
  – LR produces a probability estimate
  – (a maybe not so interesting difference) LR comes from statisticians, the Perceptron from computer science
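
The board argument can be reconstructed in one line (a sketch, not copied from the slide): the classifier predicts y=1 exactly when P(y=1|x) ≥ 1/2, and

$$
P(y=1 \mid x) = \frac{1}{1 + e^{-(w_0 + \mathbf{w}\cdot\mathbf{x})}} \ge \frac{1}{2}
\iff e^{-(w_0 + \mathbf{w}\cdot\mathbf{x})} \le 1
\iff w_0 + \mathbf{w}\cdot\mathbf{x} \ge 0,
$$

so the decision boundary w_0 + w ∙ x = 0 is a hyperplane, the same form of boundary a perceptron learns.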

  14. There are more!
• If we assume a Gaussian distribution for p(x_i|y) in Naïve Bayes, p(y=1|X) will take the same functional form as logistic regression
• What are the differences here?
  – Different ways of training
  – Naïve Bayes estimates θ_i by maximizing P(X | y=v_i, θ_i), and while doing so assumes conditional independence among attributes
  – Logistic regression estimates w by maximizing P(y | x, w) and makes no conditional independence assumption
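
A sketch of why the Gaussian assumption yields the logistic form (my reconstruction; it additionally assumes each feature's variance σ_i² is shared by the two classes, which the slide does not state): with p(x_i | y=c) = N(μ_ic, σ_i²) and class priors π_1, π_0, the log odds are

$$
\log\frac{P(y=1\mid X)}{P(y=0\mid X)}
= \log\frac{\pi_1}{\pi_0}
+ \sum_i \frac{(x_i-\mu_{i0})^2 - (x_i-\mu_{i1})^2}{2\sigma_i^2}
= w_0 + \sum_i w_i x_i,
\qquad w_i = \frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2},
$$

which is linear in X (w_0 collects the terms that do not depend on x), so P(y=1|X) is a sigmoid of a linear function of X, exactly the logistic regression form.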

  15. Comparatively
• Naïve Bayes - generative model: P(X|y)
  – makes strong conditional independence assumptions about the data attributes
  – when the assumptions are OK, Naïve Bayes can use a small amount of training data and estimate a reasonable model
• Logistic regression - discriminative model: directly learns p(y|X)
  – has fewer parameters to estimate, but they are tied together, which makes learning harder
  – makes no strong assumptions
  – may need a large number of training examples
Bottom line: if the Naïve Bayes assumption holds and the probabilistic models are accurate (i.e., x is Gaussian given y, etc.), NB would be a good choice; otherwise, logistic regression works better.

  16. Summary
• We introduced the concept of generative vs. discriminative methods
  – Given a method that we discussed in class, you need to know which category it belongs to
• Logistic regression
  – Assumes that the log odds of y=1 is a linear function of X (i.e., W ∙ X)
  – The learning goal is a weight vector W such that examples with y=1 are predicted to have high P(y=1|X), and vice versa
  – Maximum likelihood estimation is an approach that achieves this
  – An iterative algorithm learns W using MLE
• Similarities and differences between LR and perceptrons
  – Logistic regression learns a linear decision boundary
