CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447
Julia Hockenmaier
juliahmr@illinois.edu 3324 Siebel Center
Lecture 5: Logistic Regression

Part 1: Review and Overview
Probabilistic classifiers
We want to find the most likely class y for the input x:

y* = argmax_y P(Y = y ∣ X = x)

Notation: P(Y = y ∣ X = x) is the probability that the class label is y when the input feature vector is x. Writing y* = argmax_y f(y) means: let y* be the y that maximizes f(y).
Modeling with Bayes Rule

Bayes Rule relates P(Y|X) to P(X|Y) and P(Y):

P(Y|X) = P(Y, X) / P(X) = P(X|Y) P(Y) / P(X) ∝ P(X|Y) P(Y)

The posterior P(Y|X) is proportional to the prior P(Y) times the likelihood P(X|Y).
Posterior P(Y ∣ X): the probability of the label Y after having seen the data X.
Likelihood P(X ∣ Y): the probability of the data X according to class Y.
Prior P(Y): the probability of the label Y, independent of the data X.
Using Bayes Rule for our classifier

y* = argmax_y P(Y ∣ X)
   = argmax_y P(X ∣ Y) P(Y) / P(X)    [Bayes Rule]
   = argmax_y P(X ∣ Y) P(Y)           [P(X) doesn't change the argmax over y]
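To make the argmax concrete, here is a minimal sketch (not from the slides) of a Bayes-rule classifier that scores each class by P(x ∣ y) P(y); the two class names and the toy probabilities are invented for illustration, and the scores are computed in log space to avoid underflow.

```python
import math

# Toy prior P(y) and likelihood P(x | y) for two made-up classes,
# evaluated for one particular input x.
prior = {"pos": 0.6, "neg": 0.4}
likelihood = {"pos": 0.01, "neg": 0.03}

def classify(prior, likelihood):
    """Return argmax_y P(x | y) P(y), computed in log space."""
    scores = {y: math.log(likelihood[y]) + math.log(prior[y]) for y in prior}
    return max(scores, key=scores.get)

print(classify(prior, likelihood))  # -> "neg" (0.03 * 0.4 > 0.01 * 0.6)
```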
Classification more generally

Raw Data → [Feature function] → Feature vector → [Classifier] → Class Label(s)

Before we can use a classifier on our data, we have to map the data to "feature" vectors.
Feature engineering as a prerequisite for classification

To talk about classification mathematically, we assume each input item is represented as a 'feature' vector x = (x1, …, xN):
— Each element of x is one feature.
— The number of elements/features N is fixed, and may be very large.
— x has to capture all the information about the item that the classifier needs.

But the raw data points (e.g. documents to classify) are typically not in vector form. Before we can train a classifier, we therefore have to first define a suitable feature function that maps raw data points to vectors. In practice, feature engineering (designing suitable feature functions) is very important for accurate classification.
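As a concrete illustration, here is a minimal sketch of a feature function that maps a raw document to a fixed-length vector; the tiny vocabulary and the bag-of-words feature choice are invented for this example, not taken from the slides.

```python
# A toy feature function: map a raw document (a string) to a fixed-length
# bag-of-words count vector over a small, hand-picked vocabulary.
VOCAB = ["great", "awful", "boring", "fun"]  # hypothetical vocabulary

def featurize(document: str) -> list[int]:
    tokens = document.lower().split()
    # x_i = how often vocabulary word i occurs in the document
    return [tokens.count(word) for word in VOCAB]

x = featurize("A great and fun movie , great cast")
print(x)  # -> [2, 0, 0, 1]
```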
Probabilistic classifiers

A probabilistic classifier returns the most likely class y* for input x:

y* = argmax_y P(Y = y ∣ X = x)

[Last class:] Naive Bayes uses Bayes Rule:
y* = argmax_y P(y ∣ x) = argmax_y P(x ∣ y) P(y)
Naive Bayes models the joint distribution of the class and the data: P(x ∣ y) P(y) = P(x, y)
Joint models are also called generative models because we can view them as stochastic processes that generate (labeled) items: sample/pick a label y with P(y), and then an item x with P(x ∣ y).

[Today:] Logistic Regression models P(y ∣ x) directly.
This is also called a discriminative or conditional model, because it only models the probability of the class given the input, and not of the raw data itself.
Key questions for today's class

What do we mean by generative vs. discriminative models/classifiers?
Why is it difficult to incorporate complex features into a generative model like Naive Bayes?
How can we use (standard or multinomial) logistic regression for (binary or multiclass) classification?
How can we train logistic regression models with (stochastic) gradient descent?
Today's class

Part 1: Review and Overview
Part 2: From Generative to Discriminative Classifiers (Logistic Regression and Multinomial Regression)
Part 3: Learning Logistic Regression Models with (Stochastic) Gradient Descent

Reading: Chapter 5 (Jurafsky & Martin, 3rd Edition)
Part 2: From Generative to Discriminative Probability Models
(Directed) Graphical Models

Graphical models are a visual notation for probability models.
Each node represents a distribution; arrows represent dependencies (i.e. what other random variables the current node is conditioned on):

A single node X: P(X)
Y → X: P(Y) P(X ∣ Y)
Y → X ← Z: P(Y) P(Z) P(X ∣ Y, Z)
Generative vs Discriminative Models

In classification:
— The data x = (x1, …, xn) is observed (shaded nodes).
— The label y is hidden (and needs to be inferred).

Generative Model (Naive Bayes): models P(x ∣ y); the label Y is the parent of the features X1 … Xi … Xn.
Discriminative Model (Logistic Regression): models P(y ∣ x); the features X1 … Xi … Xn are the parents of the label Y.
How do we model P(Y = y ∣ X = x) such that we can compute it for any x?

We've probably never seen any particular x that we want to classify at test time.
Even if we could define and compute probability distributions P(Y = y ∣ Xi = xi) for any single feature xi ∈ x = (x1, …, xi, …, xn)
(Good! Each of these sums to 1: Σ_{yj∈Y} P(Y = yj ∣ Xi = xi) = 1)
… we can't just multiply these probabilities together to get one distribution over all yj ∈ Y for a given x
(Bad! Setting P(Y = y ∣ X = x) := ∏_{i=1…n} P(Y = y ∣ Xi = xi) does not sum to 1: Σ_{yj∈Y} [ ∏_{i=1…n} P(Y = yj ∣ Xi = xi) ] < 1)
The sigmoid function σ(x)

The sigmoid function maps any real number x to the range (0,1):

σ(x) = e^x / (e^x + 1) = 1 / (1 + e^(−x))
Using σ() with feature vectors x

We can use the sigmoid to express a Bernoulli distribution.
Coin flips: P(Heads) = σ(x) and P(Tails) = 1 − P(Heads) = 1 − σ(x)

But to use the sigmoid for binary classification, we need to model the conditional probability P(Y ∈ {0,1} ∣ X = x) such that it depends on the particular feature vector x ∈ X.
Also: we don't know how important each feature (element) xi of x = (x1, …, xn) is for our particular classification task…
… and we need to feed a single real number into σ()!

Solution: Assign (learn) a vector of feature weights f = (f1, …, fn) and compute fx = Σ_{i=1}^{n} fi·xi to obtain a single real number, and then compute σ(fx).
P(Y | X) with Logistic Regression: Binary Classification

Task: Model P(y ∈ {0,1} ∣ x) for any input (feature) vector x = (x1, …, xn)
Idea: Learn feature weights w = (w1, …, wn) (and a bias term b) to capture how important each feature xi is for predicting y = 1.

For binary classification (y ∈ {0,1}), (standard) logistic regression uses the sigmoid function:

P(Y = 1 ∣ x) = σ(wx + b) = 1 / (1 + exp(−(wx + b)))

Parameters to learn: one feature weight vector w and one bias term b.
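Putting the pieces together, here is a minimal sketch of a binary logistic regression prediction; the weights, bias, and feature vector below are made-up numbers, not trained parameters.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w: list[float], b: float, x: list[float]) -> float:
    """P(Y = 1 | x) = sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

w = [1.5, -2.0, 0.5]   # hypothetical learned weights
b = -0.3               # hypothetical learned bias
x = [2, 0, 1]          # feature vector for one input item

p = predict_proba(w, b, x)
print(p, "-> label", 1 if p >= 0.5 else 0)  # p ~ 0.96 -> label 1
```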
What about multi-class classification?

Now we need to model P(Y ∣ X) such that…
… the probability of any class yj depends on j and x
… the probability of any one class yj (for any input x) is positive: ∀x∈X ∀j∈{1…K}: P(Y = yj ∣ X = x) > 0
… and the probabilities of all classes (for each input x) sum to one: ∀x∈X: Σ_{j=1..K} P(Y = yj ∣ X = x) = 1

Idea: Learn one weight vector fj per class, compute a score fjx for each class, make it positive by exponentiating it (exp(fjx) > 0), and normalize:

P(Y = yj ∣ X = x) = exp(fjx) / Σk exp(fkx)
P(Y | X) with Logistic Regression: Multiclass Classification

Task: Model P(y ∈ {y1, …, yK} ∣ x) for any input (feature) vector x = (x1, …, xn)
Idea: Learn one feature weight vector wj = (w1j, …, wnj) (and a bias term bj) per class, to capture how important each feature xi is for predicting class yj.

For multiclass classification (y ∈ {y1, …, yK}), multinomial logistic regression applies the softmax function to the class scores zj = wjx + bj:

P(Y = yj ∣ x) = softmax(z)j = exp(zj) / Σ_{k=1}^{K} exp(zk) = exp(wjx + bj) / Σ_{k=1}^{K} exp(wkx + bk)

Parameters to learn: one feature weight vector wj and one bias term bj per class.
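A minimal sketch of a multiclass prediction under these definitions; the three classes, weight vectors, and biases are invented for illustration, and the maximum score is subtracted before exponentiating for numerical stability (a standard trick, not discussed on the slide).

```python
import math

def softmax(scores: list[float]) -> list[float]:
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["pos", "neg", "neutral"]           # hypothetical classes
W = [[2.0, -1.0], [-1.5, 0.5], [0.1, 0.1]]    # one weight vector per class
b = [0.0, 0.2, -0.1]                          # one bias term per class
x = [1.0, 2.0]                                # feature vector for one input

scores = [sum(wi * xi for wi, xi in zip(W[j], x)) + b[j] for j in range(len(classes))]
probs = softmax(scores)
print(dict(zip(classes, probs)))
print("prediction:", classes[probs.index(max(probs))])
```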
The softmax function

The softmax function turns any vector of reals z = (z1, …, zn) into a discrete probability distribution p = (p1, …, pn), where ∀j∈{1,…,n}: 0 < pj < 1 and Σ_{j=1}^{n} pj = 1:

pj = softmax(z)j = exp(zj) / Σ_{k=1}^{n} exp(zk)

Logistic regression applies the softmax to a linear combination of the input features x: z = fx.

Models based on logistic regression are also known as Maximum Entropy (MaxEnt) models.
We will see the softmax again when we talk about neural nets, but there the input to the softmax is typically a much more complex, nonlinear function of the input features.
NB: Binary logistic regression is just a special case of multinomial logistic regression

Binary logistic regression defines a distribution over y ∈ {0,1}:

P(Y = 1 ∣ x) = 1 / (1 + exp(−(wx + b)))
P(Y = 0 ∣ x) = exp(−(wx + b)) / (1 + exp(−(wx + b))) = 1 − P(Y = 1 ∣ x)

Compare with multinomial logistic regression over y ∈ {0,1}:

P(Y = 1 ∣ x) = exp(w1x + b1) / (exp(w1x + b1) + exp(w0x + b0))
P(Y = 0 ∣ x) = exp(w0x + b0) / (exp(w1x + b1) + exp(w0x + b0))

➜ Binary logistic regression is the special case of multinomial logistic regression over two classes with exp(w0x + b0) = 1 (i.e. where w0 is set to the null vector and b0 := 0): dividing numerator and denominator by exp(w1x + b1) then gives P(Y = 1 ∣ x) = 1 / (1 + exp(−(w1x + b1))), i.e. the binary model with w = w1 and b = b1.
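A tiny numerical check of this equivalence (with made-up numbers): once the class-0 weights and bias are fixed at zero, the two-class softmax reproduces the sigmoid.

```python
import math

w1, b1 = [1.5, -2.0], -0.3   # hypothetical weights/bias for class 1
x = [2.0, 1.0]
z1 = sum(wi * xi for wi, xi in zip(w1, x)) + b1  # score for class 1
z0 = 0.0                                         # class 0: null weights, zero bias

p1_softmax = math.exp(z1) / (math.exp(z1) + math.exp(z0))
p1_sigmoid = 1.0 / (1.0 + math.exp(-z1))
print(p1_softmax, p1_sigmoid)  # identical (up to floating point)
```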
Using Logistic Regression

How do we create a (binary) logistic regression classifier?
1) Feature design: decide how to map raw inputs to feature vectors x
2) Training: learn the parameters w and b on training data
Feature Design: From raw inputs to feature vectors x

Feature design for generative models (Naive Bayes):
— In a generative model, we have to learn a model for P(x ∣ y).
— Getting a proper distribution (Σ_x P(x ∣ y) = 1) is difficult.
— NB assumes that the features (elements of x) are independent* and defines P(x ∣ y) = ∏i P(xi ∣ y), with each P(xi ∣ y) a multinomial or Bernoulli
  (*more precisely, conditionally independent given y).
— Different kinds of feature values (boolean, integer, real) require different kinds of distributions (Bernoulli, multinomial, etc.)
Feature Design: From raw inputs to feature vectors x

Feature design for conditional models (Logistic Regression):
— In a conditional model, we only have to learn P(y ∣ x).
— It is much easier to get a proper distribution (Σ_{j=1..K} P(yj ∣ x) = 1).
— We don't need to assume that our features are independent.
— Any numerical feature xi can be used directly to compute exp(wij·xi).
Useful features that are not independent

Different features can overlap in the input
(e.g. we can model both unigrams and bigrams, or overlapping bigrams).

Features can capture properties of the input
(e.g. whether words are capitalized, in all-caps, contain particular [classes of] letters or characters, etc.).
This also makes it easy to use predefined dictionaries of words (e.g. for sentiment analysis, or gazetteers for names):
Is this word "positive" ('happy') or "negative" ('awful')?
Is this the name of a person ('Smith') or a city ('Boston')? [It may be both ('Paris').]

Features can capture combinations of properties
(e.g. whether a word is capitalized and ends in a full stop).

We can use the outputs of other classifiers as features
(e.g. to combine weak [less accurate] classifiers for the same task).
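To illustrate, here is a minimal sketch of overlapping, non-independent features extracted from one sentence; the feature names and the tiny sentiment lexicon are invented for this example. Unigrams, bigrams, a capitalization property, and a lexicon lookup all fire on the same underlying words.

```python
# Extract overlapping features from one input sentence as a dict of
# feature-name -> count.
POSITIVE_WORDS = {"happy", "great"}   # hypothetical sentiment lexicon

def extract_features(sentence: str) -> dict[str, int]:
    tokens = sentence.split()
    feats: dict[str, int] = {}
    for i, tok in enumerate(tokens):
        feats[f"unigram={tok.lower()}"] = feats.get(f"unigram={tok.lower()}", 0) + 1
        if tok[0].isupper():
            feats["has_capitalized_word"] = feats.get("has_capitalized_word", 0) + 1
        if tok.lower() in POSITIVE_WORDS:
            feats["in_positive_lexicon"] = feats.get("in_positive_lexicon", 0) + 1
        if i + 1 < len(tokens):
            bigram = f"bigram={tok.lower()}_{tokens[i + 1].lower()}"
            feats[bigram] = feats.get(bigram, 0) + 1
    return feats

print(extract_features("Boston is a great city"))
```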
Feature Design and Selection

How do you specify features?
We can't manually enumerate 10,000s of features
(e.g. for every possible bigram: "an apple", …, "zillion zebras").
Instead we use feature templates that define what type of feature we want to use
(e.g. "any pair of adjacent words that appears >2 times in the training data").

How do you know which features to use?
Identifying useful sets of feature templates requires expertise and a lot of experimentation (e.g. ablation studies).
Which specific set of feature templates works well depends very much on the task and the data.
Feature selection methods prune useless features
(e.g. 'of the' may not be useful for sentiment analysis, but 'very cool' is).
Part 3: Training Logistic Regression Models with (Stochastic) Gradient Descent
Learning parameters w and b

Training objective: Find parameters w and b that "capture the training data Dtrain as well as possible".
More formally (and since we're being probabilistic): find w and b that assign the largest possible conditional probability to the labels of the items in Dtrain:

(w*, b*) = argmax_(w,b) ∏_{(xi,yi)∈Dtrain} P(yi ∣ xi)

⇒ Maximize P(1 ∣ xi) for any (xi, 1) with a positive label in Dtrain
⇒ Maximize P(0 ∣ xi) for any (xi, 0) with a negative label in Dtrain

Since yi ∈ {0,1}, we can rewrite this as:

(w*, b*) = argmax_(w,b) ∏_{(xi,yi)∈Dtrain} P(1 ∣ xi)^yi ⋅ [1 − P(1 ∣ xi)]^(1−yi)

For yi = 1, this comes out to: P(1 ∣ xi)^1 (1 − P(1 ∣ xi))^0 = P(1 ∣ xi)
For yi = 0, this is: P(1 ∣ xi)^0 (1 − P(1 ∣ xi))^1 = 1 − P(1 ∣ xi) = P(0 ∣ xi)
Learning = Optimization = Loss Minimization

Learning = parameter estimation = optimization:
Given a particular class of model (logistic regression, Naive Bayes, …) and data Dtrain, find the best parameters for this class of model on Dtrain.

If the model is a probabilistic classifier, think of optimization as maximum likelihood estimation:
"Best" = return (among all possible parameters for models of this class) the parameters that assign the largest probability to Dtrain.

In general (incl. for probabilistic classifiers), think of optimization as loss minimization:
"Best" = return (among all possible parameters for models of this class) the parameters that have the smallest loss on Dtrain.

"Loss": how bad are the predictions of a model?
The loss function L(ŷ, y) we use to measure loss depends on the class of model: how bad is it to predict ŷ if the correct label is y?
Conditional MLE ⟹ Cross-Entropy Loss

Conditional MLE: maximize the probability of the labels in Dtrain:

(w*, b*) = argmax_(w,b) ∏_{(xi,yi)∈Dtrain} P(yi ∣ xi)

⇒ Maximize P(1 ∣ xi) for any (xi, 1) with a positive label in Dtrain
⇒ Maximize P(0 ∣ xi) for any (xi, 0) with a negative label in Dtrain

Equivalently: minimize the negative log probability of the correct labels in Dtrain.

The negative log probability of the correct label, −log(P(yi ∣ xi)), is a loss function:
It is smallest (0) when we assign all probability to the correct label:
  P(yi ∣ x) = 1 ⇔ −log(P(yi ∣ x)) = 0 (if yi is the correct label for x, this is the best possible model).
It is largest (+∞) when we assign all probability to the wrong label:
  P(yi ∣ x) = 0 ⇔ −log(P(yi ∣ x)) = +∞ (if yi is the correct label for x, this is the worst possible model).

This negative log likelihood loss is also called cross-entropy loss.
From loss to per-example cost

Let's define the "cost" of our classifier on the whole dataset as its average loss on each of the m training examples:

Cost_CE(Dtrain) = (1/m) Σ_{i=1..m} −log P(yi ∣ xi)

For each example:

−log P(yi ∣ xi) = −log( P(1 ∣ xi)^yi ⋅ P(0 ∣ xi)^(1−yi) )                       [either yi = 1 or yi = 0]
                = −[ yi log(P(1 ∣ xi)) + (1 − yi) log(P(0 ∣ xi)) ]               [moving the log inside]
                = −[ yi log(σ(wxi + b)) + (1 − yi) log(1 − σ(wxi + b)) ]         [plugging in the definition of P(1 ∣ xi)]
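As a sketch (with a made-up toy dataset), the per-example cross-entropy loss and the average cost can be computed directly from these formulas:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def ce_loss(w: list[float], b: float, x: list[float], y: int) -> float:
    """Cross-entropy loss -[y log p + (1 - y) log(1 - p)] with p = sigmoid(w.x + b)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def cost(w, b, data):
    """Average loss over all m training examples."""
    return sum(ce_loss(w, b, x, y) for x, y in data) / len(data)

# Hypothetical training set of (feature vector, label) pairs.
train = [([1.0, 0.0], 1), ([0.0, 2.0], 0), ([1.0, 1.0], 1)]
print(cost([0.5, -0.5], 0.0, train))
```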
The loss surface

Any specific parameter setting (any instantiation of the feature weights f) yields a particular loss on the training data.
Imagine a (very high-)dimensional landscape, where each f is one point, and the height at f = the loss of the classifier with weights f.
[Figure: the loss surface, with loss on the vertical axis and the parameters on the horizontal axis]
Learning = Moving in this landscape

Learning = finding the parameters that correspond to the global minimum of the loss surface.
You start at a random point… but you don't see very far… and you can only take small, local steps.
[Figure: the loss surface with its global minimum marked]
Moving with Gradient Descent

How do you know where and how much to move?
— Determine a step size (learning rate) η.
— The gradient of the loss, ∇L(f) = (∂L(f)/∂f1, …, ∂L(f)/∂fn), i.e. the vector of partial derivatives, indicates the direction of steepest increase in L(f): go in the opposite direction (i.e. downhill).
⇒ Update your weights with f := f − η∇L(f)
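A minimal sketch of this update rule on a toy one-dimensional loss; the quadratic loss function below is invented just to show the mechanics of f := f − η∇L(f).

```python
# Toy example: minimize L(f) = (f - 3)^2, whose gradient is dL/df = 2 * (f - 3).
def grad(f: float) -> float:
    return 2.0 * (f - 3.0)

f = 0.0      # start at a random point
eta = 0.1    # learning rate (step size)
for step in range(50):
    f = f - eta * grad(f)   # take a small step downhill
print(f)     # -> close to 3.0, the minimum of the toy loss
```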
Gradient Descent finds local optima

Finding the global minimum is hard in general: you often get stuck in local minima (or on plateaus).
[Figure: a loss surface with a global minimum, a local minimum, and a plateau]
(Stochastic) Gradient Descent

— We want to find parameters that have minimal cost (loss).
— But we don't know the whole loss surface.
— However, the gradient of the cost (loss) at our current parameters tells us the slope of the loss surface at the point given by our current parameters…
— … and then we can take a (small) step in the right (downhill) direction (to update our parameters).

Gradient descent: compute the loss for the entire dataset before updating the weights.
Stochastic gradient descent: compute the loss for one (randomly sampled) training example before updating the weights.
Stochastic Gradient Descent

function STOCHASTIC GRADIENT DESCENT(L(), f(), x, y) returns θ
   # where: L is the loss function
   #        f is a function parameterized by θ
   #        x is the set of training inputs x(1), x(2), ..., x(n)
   #        y is the set of training outputs (labels) y(1), y(2), ..., y(n)
   θ ← 0
   repeat T times
      for each training tuple (x(i), y(i)) (in random order)
         Compute ŷ(i) = f(x(i); θ)        # What is our estimated output ŷ(i)?
         Compute the loss L(ŷ(i), y(i))   # How far off is ŷ(i) from the true output y(i)?
         g ← ∇θ L(f(x(i); θ), y(i))       # How should we move θ to maximize the loss?
         θ ← θ − η g                      # go the other way instead
   return θ
Gradient for Logistic Regression

Computing the gradient of the loss for example xi and weight wj is very simple (xji: the j-th feature of xi):

∂L(w, b)/∂wj = [σ(wxi + b) − yi] xji
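Combining this gradient with the SGD pseudocode above, here is a compact sketch of training binary logistic regression; the toy dataset, learning rate, and epoch count are made up, and a real implementation would also monitor a held-out set and tune the learning rate.

```python
import math
import random

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_sgd(data, n_features, eta=0.1, epochs=100):
    """Stochastic gradient descent for binary logistic regression."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        random.shuffle(data)                       # visit examples in random order
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            error = p - y                          # gradient factor: sigmoid(w.x + b) - y
            w = [wi - eta * error * xi for wi, xi in zip(w, x)]
            b = b - eta * error                    # gradient for the bias (feature value 1)
    return w, b

# Toy dataset: label 1 iff the first feature is larger than the second.
train = [([2.0, 0.0], 1), ([0.0, 2.0], 0), ([3.0, 1.0], 1), ([1.0, 3.0], 0)]
w, b = train_sgd(list(train), n_features=2)
print(w, b)
```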
More details

The learning rate η affects convergence.
There are many options for setting the learning rate: fixed, decaying (as a function of time), adaptive, …
Often people use more complex schemes and optimizers.

Mini-batch training computes the gradient over small batches (subsets) of training examples before each update.
Often more stable than SGD.

Regularization keeps the size of the weights under control
(L1 or L2 regularization).
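As a sketch of how L2 regularization changes the update (the regularization strength lam below is a made-up hyperparameter): adding the penalty λ · Σ wj² to the loss contributes an extra 2 · λ · wj term to each weight's gradient.

```python
# One SGD step with an L2 penalty added to the loss:
#   L_reg(w, b) = L(w, b) + lam * sum(w_j ** 2)
# so the gradient for weight w_j gains an extra 2 * lam * w_j term.
def sgd_step_l2(w, b, x, y, p, eta=0.1, lam=0.01):
    """p = sigmoid(w.x + b), computed elsewhere; returns the updated (w, b)."""
    error = p - y
    w = [wj - eta * (error * xj + 2 * lam * wj) for wj, xj in zip(w, x)]
    b = b - eta * error        # the bias term is usually not regularized
    return w, b
```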