SLIDE 1

Logistic Regression

Jia-Bin Huang Virginia Tech

Spring 2019

ECE-5424G / CS-5824

SLIDE 2

Administrative

  • Please start HW 1 early!
  • Questions are welcome!
SLIDE 3

Two principles for estimating parameters

  • Maximum Likelihood Estimate (MLE)

Choose $\theta$ that maximizes the probability of the observed data

$\hat{\theta}_{\text{MLE}} = \arg\max_\theta P(\text{Data} \mid \theta)$

  • Maximum a posteriori estimation (MAP)

Choose $\theta$ that is most probable given the prior probability and the data

$\hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta \mid \text{Data}) = \arg\max_\theta \frac{P(\text{Data} \mid \theta)\, P(\theta)}{P(\text{Data})}$

Slide credit: Tom Mitchell
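A tiny numeric illustration (ours, not from the deck) of the two principles for a Bernoulli parameter, where both estimates have closed forms; the coin-flip data and the Beta(2, 2) prior are made up:

```python
import numpy as np

# Hypothetical coin-flip example: estimate theta = P(heads).
flips = np.array([1, 1, 1, 0, 1])  # 4 heads, 1 tail

# MLE: argmax_theta P(Data | theta) -- for a Bernoulli this is just
# the sample frequency of heads.
theta_mle = flips.mean()  # 0.8

# MAP: argmax_theta P(Data | theta) P(theta) with a Beta(a, b) prior --
# the posterior mode is (#heads + a - 1) / (#flips + a + b - 2).
a, b = 2.0, 2.0  # prior pulls the estimate toward 0.5
theta_map = (flips.sum() + a - 1) / (len(flips) + a + b - 2)  # ~0.714

print(f"MLE: {theta_mle:.3f}, MAP: {theta_map:.3f}")
```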

SLIDE 4

Naïve Bayes classifier

  • Want to learn $P(Y \mid X_1, \cdots, X_n)$
  • But this requires $2^n$ parameters...
  • How about applying Bayes rule?
  • $P(Y \mid X_1, \cdots, X_n) = \frac{P(X_1, \cdots, X_n \mid Y)\, P(Y)}{P(X_1, \cdots, X_n)} \propto P(X_1, \cdots, X_n \mid Y)\, P(Y)$
  • $P(X_1, \cdots, X_n \mid Y)$: need $(2^n - 1) \times 2$ parameters
  • $P(Y)$: need 1 parameter
  • Apply the conditional independence assumption
  • $P(X_1, \cdots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$: need $n \times 2$ parameters

SLIDE 5

Naïve Bayes classifier

  • Bayes rule:

$P(Y = y_k \mid X_1, \cdots, X_n) = \frac{P(Y = y_k)\, P(X_1, \cdots, X_n \mid Y = y_k)}{\sum_j P(Y = y_j)\, P(X_1, \cdots, X_n \mid Y = y_j)}$

  • Assume conditional independence among the $X_i$'s:

$P(Y = y_k \mid X_1, \cdots, X_n) = \frac{P(Y = y_k)\, \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j)\, \prod_i P(X_i \mid Y = y_j)}$

  • Pick the most probable $Y$:

$\hat{Y} \leftarrow \arg\max_{y_k}\, P(Y = y_k)\, \prod_i P(X_i \mid Y = y_k)$

Slide credit: Tom Mitchell

SLIDE 6

Example

  • $P(Y \mid X_1, X_2) \propto P(Y)\, P(X_1, X_2 \mid Y)$ (Bayes rule) $= P(Y)\, P(X_1 \mid Y)\, P(X_2 \mid Y)$ (conditional indep.)
  • Estimated parameters:

$P(Y = 1) = 0.4 \qquad P(Y = 0) = 0.6$
$P(X_1 = 1 \mid Y = 1) = 0.2 \qquad P(X_1 = 0 \mid Y = 1) = 0.8 \qquad P(X_1 = 1 \mid Y = 0) = 0.7 \qquad P(X_1 = 0 \mid Y = 0) = 0.3$
$P(X_2 = 1 \mid Y = 1) = 0.3 \qquad P(X_2 = 0 \mid Y = 1) = 0.7 \qquad P(X_2 = 1 \mid Y = 0) = 0.9 \qquad P(X_2 = 0 \mid Y = 0) = 0.1$

  • Test example: $X_1 = 1$, $X_2 = 0$
  • $Y = 1$: $P(Y = 1)\, P(X_1 = 1 \mid Y = 1)\, P(X_2 = 0 \mid Y = 1) = 0.4 \times 0.2 \times 0.7 = 0.056$
  • $Y = 0$: $P(Y = 0)\, P(X_1 = 1 \mid Y = 0)\, P(X_2 = 0 \mid Y = 0) = 0.6 \times 0.7 \times 0.1 = 0.042$
  • So predict $Y = 1$
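The same computation in a few lines of Python; the dictionary encoding of the parameter table is our own:

```python
# Parameter table from the slide.
p_y = {1: 0.4, 0: 0.6}
p_x1 = {(1, 1): 0.2, (0, 1): 0.8, (1, 0): 0.7, (0, 0): 0.3}  # (x1, y) -> P(X1=x1|Y=y)
p_x2 = {(1, 1): 0.3, (0, 1): 0.7, (1, 0): 0.9, (0, 0): 0.1}  # (x2, y) -> P(X2=x2|Y=y)

# Test example: score each class with P(Y) * P(X1|Y) * P(X2|Y).
x1, x2 = 1, 0
scores = {y: p_y[y] * p_x1[(x1, y)] * p_x2[(x2, y)] for y in (0, 1)}
print(scores)                       # {0: ~0.042, 1: ~0.056}
print(max(scores, key=scores.get))  # predict Y = 1
```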

SLIDE 7

Naïve Bayes algorithm – discrete $X_i$

  • For each value $y_k$:

Estimate $\pi_k = P(Y = y_k)$

For each value $x_{ij}$ of each attribute $X_i$:

Estimate $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$

  • Classify $X^{\text{test}}$:

$\hat{Y} \leftarrow \arg\max_{y_k}\, P(Y = y_k)\, \prod_i P(X_i^{\text{test}} \mid Y = y_k)$

$\hat{Y} \leftarrow \arg\max_{y_k}\, \pi_k \prod_i \theta_{ijk}$

Slide credit: Tom Mitchell
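A minimal sketch of this algorithm in Python (raw MLE counts, no smoothing yet; the helper names are our own):

```python
import numpy as np

def train_nb(X, y):
    """X: (m, n) array of discrete attribute values; y: (m,) class labels."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}  # pi_k = P(Y = y_k)
    cond = {}                                       # theta_ijk = P(X_i = x_ij | Y = y_k)
    for c in classes:
        Xc = X[y == c]
        for i in range(X.shape[1]):
            for v in np.unique(X[:, i]):
                cond[(i, v, c)] = np.mean(Xc[:, i] == v)
    return classes, priors, cond

def classify_nb(x, classes, priors, cond):
    """argmax_k pi_k * prod_i theta_ijk (attribute values unseen in training will KeyError)."""
    scores = {c: priors[c] * np.prod([cond[(i, v, c)] for i, v in enumerate(x)])
              for c in classes}
    return max(scores, key=scores.get)
```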

SLIDE 8

Estimating parameters: discrete $Y$, $X_i$

  • Maximum likelihood estimates (MLE)

$\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\}}{|D|} \qquad \hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}$

Slide credit: Tom Mitchell

SLIDE 9
  • F = 1 iff you live in Fox Ridge
  • S = 1 iff you watched the superbowl last night
  • D = 1 iff you drive to VT
  • G = 1 iff you went to gym in the last month

Parameters to estimate (blanks to fill in on the slide):

$P(F = 1) = \qquad P(F = 0) =$
$P(S = 1 \mid F = 1) = \qquad P(S = 0 \mid F = 1) = \qquad P(S = 1 \mid F = 0) = \qquad P(S = 0 \mid F = 0) =$
$P(D = 1 \mid F = 1) = \qquad P(D = 0 \mid F = 1) = \qquad P(D = 1 \mid F = 0) = \qquad P(D = 0 \mid F = 0) =$
$P(G = 1 \mid F = 1) = \qquad P(G = 0 \mid F = 1) = \qquad P(G = 1 \mid F = 0) = \qquad P(G = 0 \mid F = 0) =$

$P(F \mid S, D, G) \propto P(F)\, P(S \mid F)\, P(D \mid F)\, P(G \mid F)$

SLIDE 10

Naïve Bayes: Subtlety #1

  • Often the $X_i$ are not really conditionally independent
  • Naïve Bayes often works pretty well anyway
  • Often gives the right classification, even when not the right probability [Domingos & Pazzani, 1996]
  • What is the effect on the estimated $P(Y \mid X)$?
  • What if we have two copies: $X_i = X_j$? (That feature's evidence gets counted twice in the product.)

$P(Y = y_k \mid X_1, \cdots, X_n) \propto P(Y = y_k)\, \prod_i P(X_i \mid Y = y_k)$

Slide credit: Tom Mitchell

SLIDE 11

Naïve Bayes: Subtlety #2

  • The MLE estimate for $P(X_i \mid Y = y_k)$ might be zero (for example, $X_i =$ birthdate, and the value Feb_4_1995 never occurs in the training data)
  • Why worry about just one parameter out of many? Because a single zero factor drives the entire product to zero:

$P(Y = y_k \mid X_1, \cdots, X_n) \propto P(Y = y_k)\, \prod_i P(X_i \mid Y = y_k)$

  • What can we do to address this?
  • MAP estimates (adding "imaginary" examples)

Slide credit: Tom Mitchell

SLIDE 12

Estimating parameters: discrete $Y$, $X_i$

  • Maximum likelihood estimates (MLE)

$\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\}}{|D|} \qquad \hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij},\, Y = y_k\}}{\#D\{Y = y_k\}}$

  • MAP estimates (Dirichlet priors):

$\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\} + (\beta_k - 1)}{|D| + \sum_m (\beta_m - 1)} \qquad \hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij},\, Y = y_k\} + (\beta_k - 1)}{\#D\{Y = y_k\} + \sum_m (\beta_m - 1)}$

Slide credit: Tom Mitchell
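With all $\beta = 2$ the Dirichlet-MAP counts above reduce to "add-one" (Laplace) smoothing. A sketch for a single attribute, with made-up toy data:

```python
import numpy as np

def smoothed_conditionals(xi, y, n_values, beta=2.0):
    """P(X_i = v | Y = c) with (beta - 1) imaginary examples per value."""
    out = {}
    for c in np.unique(y):
        xc = xi[y == c]
        denom = len(xc) + n_values * (beta - 1)
        for v in range(n_values):
            out[(v, c)] = (np.sum(xc == v) + (beta - 1)) / denom
    return out

xi = np.array([0, 0, 1, 2, 2, 2])
y = np.array([0, 0, 0, 1, 1, 1])
print(smoothed_conditionals(xi, y, n_values=3))  # no zero estimates remain
```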

SLIDE 13

What if we have continuous $X_i$?

  • Gaussian Naïve Bayes (GNB): assume

$P(X_i = x \mid Y = y_k) = \frac{1}{\sigma_{ik}\sqrt{2\pi}} \exp\left( -\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2} \right)$

  • Possible additional assumptions on $\sigma_{ik}$:
  • independent of $Y$ ($\sigma_i$)
  • independent of $X_i$ ($\sigma_k$)
  • independent of both $X_i$ and $Y$ ($\sigma$)

Slide credit: Tom Mitchell

SLIDE 14

Naïve Bayes algorithm – continuous $X_i$

  • For each value $y_k$:

Estimate $\pi_k = P(Y = y_k)$

For each attribute $X_i$, estimate the class-conditional mean $\mu_{ik}$ and variance $\sigma_{ik}$

  • Classify $X^{\text{test}}$:

$\hat{Y} \leftarrow \arg\max_{y_k}\, P(Y = y_k)\, \prod_i P(X_i^{\text{test}} \mid Y = y_k)$

$\hat{Y} \leftarrow \arg\max_{y_k}\, \pi_k \prod_i \mathcal{N}(X_i^{\text{test}};\, \mu_{ik}, \sigma_{ik})$

Slide credit: Tom Mitchell
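A minimal Gaussian Naïve Bayes sketch following the recipe above (our own code; works in log space to avoid underflow, and assumes every per-class variance is positive):

```python
import numpy as np

def train_gnb(X, y):
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])     # pi_k
    mu = np.array([X[y == c].mean(axis=0) for c in classes])  # mu_ik
    var = np.array([X[y == c].var(axis=0) for c in classes])  # sigma_ik^2
    return classes, priors, mu, var

def classify_gnb(x, classes, priors, mu, var):
    # log pi_k + sum_i log N(x_i; mu_ik, sigma_ik^2), maximized over k
    log_like = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
    return classes[np.argmax(np.log(priors) + log_like)]
```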

SLIDE 15

Things to remember

  • Probability basics
  • Conditional probability, joint probability, Bayes rule
  • Estimating parameters from data
  • Maximum likelihood (ML): maximize $P(\text{Data} \mid \theta)$
  • Maximum a posteriori estimation (MAP): maximize $P(\theta \mid \text{Data})$
  • Naïve Bayes

$P(Y = y_k \mid X_1, \cdots, X_n) \propto P(Y = y_k)\, \prod_i P(X_i \mid Y = y_k)$

SLIDE 16

Logistic Regression

  • Hypothesis representation
  • Cost function
  • Logistic regression with gradient descent
  • Regularization
  • Multi-class classification
SLIDE 17

Logistic Regression

  • Hypothesis representation
  • Cost function
  • Logistic regression with gradient descent
  • Regularization
  • Multi-class classification
SLIDE 18
  • Threshold the classifier output $h_\theta(x)$ at 0.5:
  • If $h_\theta(x) \geq 0.5$, predict "$y = 1$"
  • If $h_\theta(x) < 0.5$, predict "$y = 0$"

(Figure: Malignant? 0 (No) / 1 (Yes) vs. tumor size, with a linear fit $h_\theta(x) = \theta^\top x$)

Slide credit: Andrew Ng

SLIDE 19

Classification: ๐‘ง = 1 or ๐‘ง = 0 โ„Ž๐œ„ ๐‘ฆ = ๐œ„โŠค๐‘ฆ (from linear regression) can be > 1 or < 0 Logistic regression: 0 โ‰ค โ„Ž๐œ„ ๐‘ฆ โ‰ค 1 Logistic regression is actually for classification

Slide credit: Andrew Ng

SLIDE 20

Hypothesis representation

  • Want $0 \leq h_\theta(x) \leq 1$
  • $h_\theta(x) = g(\theta^\top x)$, where $g(z) = \frac{1}{1 + e^{-z}}$
  • Sigmoid function
  • Logistic function

$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$

(Figure: the sigmoid curve $g(z)$ as a function of $z$)

Slide credit: Andrew Ng
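The hypothesis in code (a plain version; in practice the exponent is often clipped for numerical stability):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """h_theta(x) = g(theta^T x); x includes the intercept term x_0 = 1."""
    return sigmoid(theta @ x)

theta = np.array([-3.0, 1.0, 1.0])
x = np.array([1.0, 2.0, 2.0])  # x_0 = 1
print(h(theta, x))             # ~0.73, since theta^T x = 1 > 0
```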

SLIDE 21

Interpretation of hypothesis output

  • โ„Ž๐œ„ ๐‘ฆ = estimated probability that ๐‘ง = 1 on input ๐‘ฆ
  • Example: If ๐‘ฆ = ๐‘ฆ0

x1 = 1 tumorSize

  • โ„Ž๐œ„ ๐‘ฆ = 0.7
  • Tell patient that 70% chance of tumor being malignant

Slide credit: Andrew Ng

SLIDE 22

Logistic regression

โ„Ž๐œ„ ๐‘ฆ = ๐‘• ๐œ„โŠค๐‘ฆ ๐‘• ๐‘จ = 1 1 + ๐‘“โˆ’๐‘จ Suppose predict โ€œy = 1โ€ if โ„Ž๐œ„ ๐‘ฆ โ‰ฅ 0.5 ๐‘จ = ๐œ„โŠค๐‘ฆ โ‰ฅ 0 predict โ€œy = 0โ€ if โ„Ž๐œ„ ๐‘ฆ < 0.5 ๐‘จ = ๐œ„โŠค๐‘ฆ < 0

๐‘จ = ๐œ„โŠค๐‘ฆ ๐‘•(๐‘จ)

Slide credit: Andrew Ng

SLIDE 23

Decision boundary

  • โ„Ž๐œ„ ๐‘ฆ = ๐‘•(๐œ„0 + ๐œ„1๐‘ฆ1 + ๐œ„2๐‘ฆ2)

E.g., ๐œ„0 = โˆ’3, ๐œ„1= 1, ๐œ„2 = 1

  • Predict โ€œ๐‘ง = 1โ€ if โˆ’3 + ๐‘ฆ1 + ๐‘ฆ2 โ‰ฅ 0

Tumor Size Age

Slide credit: Andrew Ng

SLIDE 24
  • โ„Ž๐œ„ ๐‘ฆ = ๐‘•(๐œ„0 + ๐œ„1๐‘ฆ1 + ๐œ„2๐‘ฆ2

+ ๐œ„3๐‘ฆ1

2 + ๐œ„4๐‘ฆ2 2)

E.g., ๐œ„0 = โˆ’1, ๐œ„1= 0, ๐œ„2 = 0, ๐œ„3 = 1, ๐œ„4 = 1

  • Predict โ€œ๐‘ง = 1โ€ if โˆ’1 + ๐‘ฆ1

2 + ๐‘ฆ2 2 โ‰ฅ 0

  • โ„Ž๐œ„ ๐‘ฆ = ๐‘•(๐œ„0 + ๐œ„1๐‘ฆ1 + ๐œ„2๐‘ฆ2 + ๐œ„3๐‘ฆ1

2 +

๐œ„4๐‘ฆ1

2๐‘ฆ2 + ๐œ„5๐‘ฆ1 2๐‘ฆ2 2 + ๐œ„6๐‘ฆ1 3๐‘ฆ2 + โ‹ฏ )

Slide credit: Andrew Ng
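A quick check (ours) that the quadratic example traces the unit circle: points outside $x_1^2 + x_2^2 = 1$ are predicted as $y = 1$:

```python
def predict(x1, x2):
    # Boundary: -1 + x1^2 + x2^2 = 0, i.e., the unit circle.
    return int(-1 + x1 ** 2 + x2 ** 2 >= 0)

print(predict(0.5, 0.5))  # 0: inside the circle
print(predict(1.0, 1.0))  # 1: outside the circle
```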

SLIDE 25

Where does the form come from?

  • Logistic regression hypothesis representation

$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}} = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n)}}$

  • Consider learning $f: X \rightarrow Y$, where
  • $X$ is a vector of real-valued features $(X_1, \cdots, X_n)^\top$
  • $Y$ is Boolean
  • Assume all $X_i$ are conditionally independent given $Y$
  • Model $P(X_i \mid Y = y_k)$ as Gaussian $\mathcal{N}(\mu_{ik}, \sigma_i)$
  • Model $P(Y)$ as Bernoulli($\pi$)

What is $P(Y \mid X_1, X_2, \cdots, X_n)$?

Slide credit: Tom Mitchell

SLIDE 26
  • $P(Y = 1 \mid X) = \frac{P(Y = 1)\, P(X \mid Y = 1)}{P(Y = 1)\, P(X \mid Y = 1) + P(Y = 0)\, P(X \mid Y = 0)}$  (applying Bayes rule)

$= \frac{1}{1 + \frac{P(Y = 0)\, P(X \mid Y = 0)}{P(Y = 1)\, P(X \mid Y = 1)}}$  (divide by $P(Y = 1)\, P(X \mid Y = 1)$)

$= \frac{1}{1 + \exp\left( \ln \frac{P(Y = 0)\, P(X \mid Y = 0)}{P(Y = 1)\, P(X \mid Y = 1)} \right)}$  (apply $\exp(\ln(\cdot))$)

$= \frac{1}{1 + \exp\left( \ln \frac{1 - \pi}{\pi} + \sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)} \right)}$  (conditional independence)

Plugging in the Gaussian $P(x \mid y_k) = \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{(x - \mu_{ik})^2}{2\sigma_i^2}}$, the sum becomes

$\sum_i \left( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right)$

which is linear in the $X_i$, so

$P(Y = 1 \mid X_1, X_2, \cdots, X_n) = \frac{1}{1 + \exp(\theta_0 + \sum_i \theta_i X_i)}$

Slide credit: Tom Mitchell

SLIDE 27

SLIDE 28

Logistic Regression

  • Hypothesis representation
  • Cost function
  • Logistic regression with gradient descent
  • Regularization
  • Multi-class classification
SLIDE 29

Training set with $m$ examples: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$

$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$, $x_0 = 1$, $y \in \{0, 1\}$, $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$

How to choose parameters $\theta$?

Slide credit: Andrew Ng

SLIDE 30

Cost function for Linear Regression

๐พ ๐œ„ = 1 2๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘—

2 = 1

๐‘› เท

๐‘—=1 ๐‘›

Cost(โ„Ž๐œ„(๐‘ฆ ๐‘— ), ๐‘ง))

Cost(โ„Ž๐œ„ ๐‘ฆ , ๐‘ง) = 1 2 โ„Ž๐œ„ ๐‘ฆ โˆ’ ๐‘ง 2

Slide credit: Andrew Ng

SLIDE 31

Cost function for Logistic Regression

Cost(โ„Ž๐œ„ ๐‘ฆ , ๐‘ง) = เตโˆ’log โ„Ž๐œ„ ๐‘ฆ if ๐‘ง = 1 โˆ’log 1 โˆ’ โ„Ž๐œ„ ๐‘ฆ if ๐‘ง = 0

1

if ๐‘ง = 1

โ„Ž๐œ„ ๐‘ฆ 1

if ๐‘ง = 0

โ„Ž๐œ„ ๐‘ฆ

Slide credit: Andrew Ng

SLIDE 32

Logistic regression cost function

  • Cost(โ„Ž๐œ„ ๐‘ฆ , ๐‘ง) = เตโˆ’log โ„Ž๐œ„ ๐‘ฆ

if ๐‘ง = 1 โˆ’log 1 โˆ’ โ„Ž๐œ„ ๐‘ฆ if ๐‘ง = 0

  • Cost โ„Ž๐œ„ ๐‘ฆ , ๐‘ง = โˆ’๐‘ง log h๐œ„ x

โˆ’ (1 โˆ’ y) log 1 โˆ’ โ„Ž๐œ„ ๐‘ฆ

  • If ๐‘ง = 1: Cost โ„Ž๐œ„ ๐‘ฆ , ๐‘ง = โˆ’log โ„Ž๐œ„ ๐‘ฆ
  • If ๐‘ง = 0: Cost โ„Ž๐œ„ ๐‘ฆ , ๐‘ง = โˆ’log 1 โˆ’ โ„Ž๐œ„ ๐‘ฆ

Slide credit: Andrew Ng

SLIDE 33

Logistic regression

๐พ ๐œ„ = 1 ๐‘› เท

๐‘—=1 ๐‘›

Cost(โ„Ž๐œ„(๐‘ฆ ๐‘— ), ๐‘ง(๐‘—))) = โˆ’

1 ๐‘› ฯƒ๐‘—=1 ๐‘› ๐‘ง(๐‘—) log โ„Ž๐œ„ ๐‘ฆ(๐‘—)

+ (1 โˆ’ ๐‘ง(๐‘—)) log 1 โˆ’ โ„Ž๐œ„ ๐‘ฆ(๐‘—)

Prediction: given new ๐‘ฆ Output โ„Ž๐œ„ ๐‘ฆ =

1 1+๐‘“โˆ’๐œ„โŠค๐‘ฆ

Learning: fit parameter ๐œ„ min

๐œ„ ๐พ(๐œ„)

Slide credit: Andrew Ng
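The cost $J(\theta)$, vectorized over the $m$ examples (the small eps guarding log(0) is our own addition):

```python
import numpy as np

def cost(theta, X, y, eps=1e-12):
    """X: (m, n) with x_0 = 1 in the first column; y: (m,) labels in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta(x^(i)) for every i
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```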

SLIDE 34

Where does the cost come from?

  • Training set with $m$ examples:

$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$

  • Maximum likelihood estimate for parameter $\theta$:

$\theta_{\text{MLE}} = \arg\max_\theta P_\theta\left( (x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)}) \right) = \arg\max_\theta \prod_{i=1}^{m} P_\theta(x^{(i)}, y^{(i)})$

  • Maximum conditional likelihood estimate for parameter $\theta$:

$\theta_{\text{MCLE}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})$

Slide credit: Tom Mitchell

SLIDE 35
  • Goal: choose $\theta$ to maximize the conditional likelihood of the training data
  • $P_\theta(Y = 1 \mid X = x) = h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
  • $P_\theta(Y = 0 \mid X = x) = 1 - h_\theta(x) = \frac{e^{-\theta^\top x}}{1 + e^{-\theta^\top x}}$
  • Training data $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$
  • Data likelihood $= \prod_{i=1}^{m} P_\theta(x^{(i)}, y^{(i)})$
  • Data conditional likelihood $= \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})$

$\theta_{\text{MCLE}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})$

Slide credit: Tom Mitchell

SLIDE 36

Expressing conditional log-likelihood

$l(\theta) = \log \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)}) = \sum_{i=1}^{m} \log P_\theta(y^{(i)} \mid x^{(i)})$

$= \sum_{i=1}^{m} \left[ y^{(i)} \log P_\theta(y = 1 \mid x^{(i)}) + (1 - y^{(i)}) \log P_\theta(y = 0 \mid x^{(i)}) \right]$

$= \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$

Maximizing this log-likelihood is exactly minimizing the per-example cost:

$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

SLIDE 37

Logistic Regression

  • Hypothesis representation
  • Cost function
  • Logistic regression with gradient descent
  • Regularization
  • Multi-class classification
SLIDE 38

Gradient descent

๐พ ๐œ„ = โˆ’ 1 ๐‘› เท

๐‘—=1 ๐‘›

๐‘ง(๐‘—) log โ„Ž๐œ„ ๐‘ฆ(๐‘—) + (1 โˆ’ ๐‘ง(๐‘—)) log 1 โˆ’ โ„Ž๐œ„ ๐‘ฆ(๐‘—)

Goal: min

๐œ„ ๐พ(๐œ„)

Repeat { ๐œ„

๐‘˜ โ‰” ๐œ„ ๐‘˜ โˆ’ ๐›ฝ ๐œ–

๐œ–๐œ„

๐‘˜

๐พ(๐œ„) }

(Simultaneously update all ๐œ„

๐‘˜)

๐œ– ๐œ–๐œ„

๐‘˜

๐พ ๐œ„ = 1 ๐‘› เท

๐‘—=1 ๐‘›

(โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง(๐‘—)) ๐‘ฆ๐‘˜

(๐‘—)

Good news: Convex function! Bad news: No analytical solution

Slide credit: Andrew Ng

SLIDE 39

Gradient descent

๐พ ๐œ„ = โˆ’ 1 ๐‘› เท

๐‘—=1 ๐‘›

๐‘ง(๐‘—) log โ„Ž๐œ„ ๐‘ฆ(๐‘—) + (1 โˆ’ ๐‘ง(๐‘—)) log 1 โˆ’ โ„Ž๐œ„ ๐‘ฆ(๐‘—)

Goal: min

๐œ„ ๐พ(๐œ„)

Repeat { ๐œ„

๐‘˜ โ‰” ๐œ„ ๐‘˜ โˆ’ ๐›ฝ 1

๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง(๐‘—) ๐‘ฆ๐‘˜

(๐‘—)

}

(Simultaneously update all ๐œ„

๐‘˜)

Slide credit: Andrew Ng
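A batch gradient descent sketch of the update rule above; the learning rate, iteration count, and toy data are illustrative choices of our own:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta for all examples
        grad = X.T @ (p - y) / m              # (1/m) sum_i (h - y) x_j^(i)
        theta -= alpha * grad                 # simultaneous update of all theta_j
    return theta

# Toy 1-D problem: y = 1 exactly when x_1 > 0.5.
X = np.column_stack([np.ones(100), np.linspace(0, 1, 100)])
y = (X[:, 1] > 0.5).astype(float)
print(gradient_descent(X, y))
```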

SLIDE 40

Gradient descent for Linear Regression

Repeat { ๐œ„

๐‘˜ โ‰” ๐œ„ ๐‘˜ โˆ’ ๐›ฝ 1

๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง(๐‘—) ๐‘ฆ๐‘˜

(๐‘—)

}

Gradient descent for Logistic Regression

Repeat { ๐œ„

๐‘˜ โ‰” ๐œ„ ๐‘˜ โˆ’ ๐›ฝ 1

๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง(๐‘—) ๐‘ฆ๐‘˜

(๐‘—)

}

โ„Ž๐œ„ ๐‘ฆ = ๐œ„โŠค๐‘ฆ โ„Ž๐œ„ ๐‘ฆ = 1 1 + ๐‘“โˆ’๐œ„โŠค๐‘ฆ

Slide credit: Andrew Ng

SLIDE 41

Logistic Regression

  • Hypothesis representation
  • Cost function
  • Logistic regression with gradient descent
  • Regularization
  • Multi-class classification
SLIDE 42

How about MAP?

  • Maximum conditional likelihood estimate (MCLE)

$\theta_{\text{MCLE}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})$

  • Maximum conditional a posteriori estimate (MCAP)

$\theta_{\text{MCAP}} = \arg\max_\theta \left[ \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)}) \right] P(\theta)$

SLIDE 43

Prior ๐‘„(๐œ„)

  • Common choice of ๐‘„(๐œ„):
  • Normal distribution, zero mean, identity covariance
  • โ€œPushesโ€ parameters towards zeros
  • Corresponds to Regularization
  • Helps avoid very large weights and overfitting

Slide credit: Tom Mitchell

SLIDE 44

MLE vs. MAP

  • Maximum conditional likelihood estimate (MCLE)

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

  • Maximum conditional a posteriori estimate (MCAP)

$\theta_j := \theta_j - \alpha \lambda \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

SLIDE 45

Logistic Regression

  • Hypothesis representation
  • Cost function
  • Logistic regression with gradient descent
  • Regularization
  • Multi-class classification
SLIDE 46

Multi-class classification

  • Email foldering/tagging: Work, Friends, Family, Hobby
  • Medical diagnosis: Not ill, Cold, Flu
  • Weather: Sunny, Cloudy, Rain, Snow

Slide credit: Andrew Ng

SLIDE 47

Binary classification

๐‘ฆ2 ๐‘ฆ1

Multiclass classification

๐‘ฆ2 ๐‘ฆ1

SLIDE 48

One-vs-all (one-vs-rest)

๐‘ฆ2 ๐‘ฆ1

Class 1: Class 2: Class 3:

โ„Ž๐œ„

๐‘— ๐‘ฆ = ๐‘„ ๐‘ง = ๐‘— ๐‘ฆ; ๐œ„

(๐‘— = 1, 2, 3) ๐‘ฆ2 ๐‘ฆ1 ๐‘ฆ2 ๐‘ฆ1 ๐‘ฆ2 ๐‘ฆ1

โ„Ž๐œ„

1 ๐‘ฆ

โ„Ž๐œ„

2 ๐‘ฆ

โ„Ž๐œ„

3 ๐‘ฆ

Slide credit: Andrew Ng

SLIDE 49

One-vs-all

  • Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$
  • Given a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$:

$\hat{y} = \arg\max_i\, h_\theta^{(i)}(x)$

Slide credit: Andrew Ng
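One-vs-all on top of any binary trainer (e.g., the gradient descent sketch from slide 39); this wrapper and its names are our own:

```python
import numpy as np

def train_one_vs_all(X, y, classes, train_binary):
    # One binary classifier per class, trained on the labels 1[y == c].
    return {c: train_binary(X, (y == c).astype(float)) for c in classes}

def predict_one_vs_all(x, thetas):
    # Pick the class whose classifier reports the highest probability.
    h = {c: 1.0 / (1.0 + np.exp(-theta @ x)) for c, theta in thetas.items()}
    return max(h, key=h.get)
```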

SLIDE 50

Generative Approach

Ex: Naïve Bayes

Estimate $P(Y)$ and $P(X \mid Y)$

Prediction: $\hat{y} = \arg\max_y\, P(Y = y)\, P(X = x \mid Y = y)$

Discriminative Approach

Ex: Logistic regression

Estimate $P(Y \mid X)$ directly (or a discriminant function, e.g., SVM)

Prediction: $\hat{y} = \arg\max_y\, P(Y = y \mid X = x)$

SLIDE 51

Further readings

  • Tom M. Mitchell

Generative and discriminative classifiers: Naïve Bayes and Logistic Regression
http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf

  • Andrew Ng, Michael Jordan

On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes
http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf

SLIDE 52

Things to remember

  • Hypothesis representation
  • Cost function
  • Logistic regression with gradient descent
  • Regularization
  • Multi-class classification

โ„Ž๐œ„ ๐‘ฆ = 1 1 + ๐‘“โˆ’๐œ„โŠค๐‘ฆ Cost(โ„Ž๐œ„ ๐‘ฆ , ๐‘ง) = เตโˆ’log โ„Ž๐œ„ ๐‘ฆ if ๐‘ง = 1 โˆ’log 1 โˆ’ โ„Ž๐œ„ ๐‘ฆ if ๐‘ง = 0 ๐œ„

๐‘˜ โ‰” ๐œ„ ๐‘˜ โˆ’ ๐›ฝ 1

๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง(๐‘—) ๐‘ฆ๐‘˜

(๐‘—)

๐œ„

๐‘˜ โ‰” ๐œ„ ๐‘˜ โˆ’ ๐›ฝ๐œ‡๐œ„ ๐‘˜ โˆ’ ๐›ฝ 1

๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง(๐‘—) ๐‘ฆ๐‘˜

(๐‘—)

max

i

โ„Ž๐œ„

๐‘— ๐‘ฆ

SLIDE 53

Coming up…

  • Regularization
  • Support Vector Machine