
Logistic Regression - Jia-Bin Huang, Virginia Tech, Spring 2019 - PowerPoint PPT Presentation



  1. Logistic Regression, Jia-Bin Huang, Virginia Tech, Spring 2019, ECE-5424G / CS-5824

  2. Administrative • Please start HW 1 early! • Questions are welcome!

  3. Two principles for estimating parameters
  • Maximum likelihood estimate (MLE): choose the θ that maximizes the probability of the observed data
    θ_MLE = argmax_θ P(Data | θ)
  • Maximum a posteriori estimation (MAP): choose the θ that is most probable given the prior probability and the data
    θ_MAP = argmax_θ P(θ | Data) = argmax_θ P(Data | θ) P(θ) / P(Data)
  Slide credit: Tom Mitchell
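As a quick illustration of the difference, here is a minimal sketch for a single Bernoulli (coin-flip) parameter; the observed counts and the Beta(3, 3) prior are assumptions made up for this example, not from the slides.

```python
# MLE vs. MAP for a Bernoulli parameter theta = P(heads).
# Observed data: 3 heads, 7 tails (illustrative counts only).
heads, tails = 3, 7

# MLE: argmax_theta P(Data | theta) for a Bernoulli likelihood.
theta_mle = heads / (heads + tails)                        # 0.30

# MAP with an assumed Beta(3, 3) prior (i.e., "imaginary" prior flips):
# the posterior is Beta(heads + 3, tails + 3); its mode is the MAP estimate.
a, b = 3, 3
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)  # 5/14 ≈ 0.357

print(theta_mle, theta_map)
```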

  4. Naïve Bayes classifier
  • Want to learn P(Y | X_1, ..., X_n)
  • But modeling it directly requires 2^n parameters...
  • How about applying Bayes rule?
    P(Y | X_1, ..., X_n) = P(X_1, ..., X_n | Y) P(Y) / P(X_1, ..., X_n) ∝ P(X_1, ..., X_n | Y) P(Y)
  • P(X_1, ..., X_n | Y): needs (2^n − 1) × 2 parameters
  • P(Y): needs 1 parameter
  • Apply the conditional independence assumption:
    P(X_1, ..., X_n | Y) = Π_{i=1}^{n} P(X_i | Y): needs n × 2 parameters

  5. Naïve Bayes classifier
  • Bayes rule:
    P(Y = y_k | X_1, ..., X_n) = P(Y = y_k) P(X_1, ..., X_n | Y = y_k) / Σ_j P(Y = y_j) P(X_1, ..., X_n | Y = y_j)
  • Assume conditional independence among the X_i's:
    P(Y = y_k | X_1, ..., X_n) = P(Y = y_k) Π_i P(X_i | Y = y_k) / Σ_j P(Y = y_j) Π_i P(X_i | Y = y_j)
  • Pick the most probable Y:
    Ŷ ← argmax_{y_k} P(Y = y_k) Π_i P(X_i | Y = y_k)
  Slide credit: Tom Mitchell

  6. Example
  • P(Y | X_1, X_2) ∝ P(Y) P(X_1, X_2 | Y) = P(Y) P(X_1 | Y) P(X_2 | Y)   (Bayes rule, then conditional independence)
  • Estimated parameters:
    P(Y = 1) = 0.4                P(Y = 0) = 0.6
    P(X_1 = 1 | Y = 1) = 0.2      P(X_1 = 0 | Y = 1) = 0.8
    P(X_1 = 1 | Y = 0) = 0.7      P(X_1 = 0 | Y = 0) = 0.3
    P(X_2 = 1 | Y = 1) = 0.3      P(X_2 = 0 | Y = 1) = 0.7
    P(X_2 = 1 | Y = 0) = 0.9      P(X_2 = 0 | Y = 0) = 0.1
  • Test example: X_1 = 1, X_2 = 0
  • Y = 1: P(Y = 1) P(X_1 = 1 | Y = 1) P(X_2 = 0 | Y = 1) = 0.4 × 0.2 × 0.7 = 0.056
  • Y = 0: P(Y = 0) P(X_1 = 1 | Y = 0) P(X_2 = 0 | Y = 0) = 0.6 × 0.7 × 0.1 = 0.042
  • Since 0.056 > 0.042, predict Y = 1
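A minimal sketch reproducing this slide's arithmetic; the dictionary layout of the parameters is just one convenient encoding, not part of the slides.

```python
# Parameters from the slide.
p_y = {1: 0.4, 0: 0.6}
p_x1 = {(1, 1): 0.2, (0, 1): 0.8, (1, 0): 0.7, (0, 0): 0.3}   # (x1, y) -> P(X1 = x1 | Y = y)
p_x2 = {(1, 1): 0.3, (0, 1): 0.7, (1, 0): 0.9, (0, 0): 0.1}   # (x2, y) -> P(X2 = x2 | Y = y)

x1, x2 = 1, 0   # test example

# Unnormalized scores P(Y = y) P(X1 = x1 | Y = y) P(X2 = x2 | Y = y).
scores = {y: p_y[y] * p_x1[(x1, y)] * p_x2[(x2, y)] for y in (0, 1)}
print(scores)                        # {0: 0.042, 1: 0.056} (up to float rounding)
print(max(scores, key=scores.get))   # 1  -> predict Y = 1

# Normalizing gives the posterior P(Y = 1 | X1 = 1, X2 = 0) ≈ 0.571.
print(scores[1] / (scores[0] + scores[1]))
```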

  7. Naïve Bayes algorithm – discrete X_i
  • For each value y_k:
    Estimate π_k = P(Y = y_k)
    For each value x_ij of each attribute X_i, estimate θ_ijk = P(X_i = x_ij | Y = y_k)
  • Classify X^test:
    Ŷ ← argmax_{y_k} P(Y = y_k) Π_i P(X_i^test | Y = y_k)
    Ŷ ← argmax_{y_k} π_k Π_i θ_ijk
  Slide credit: Tom Mitchell

  8. Estimating parameters: discrete Y, X_i
  • Maximum likelihood estimates (MLE):
    π̂_k = P̂(Y = y_k) = #D{Y = y_k} / |D|
    θ̂_ijk = P̂(X_i = x_ij | Y = y_k) = #D{X_i = x_ij, Y = y_k} / #D{Y = y_k}
    (#D{...} counts the training examples in D satisfying the condition; |D| is the training set size)
  Slide credit: Tom Mitchell
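A minimal sketch of these counting estimates on a tiny made-up dataset with two binary attributes (the data values are illustrative, not from the slides).

```python
from collections import Counter

# Toy training set D: each row is ((x1, x2), y); values are made up for illustration.
D = [((1, 0), 1), ((1, 1), 1), ((0, 0), 1), ((1, 0), 0), ((0, 1), 0), ((1, 1), 0)]

# pi_k = #D{Y = y_k} / |D|
y_counts = Counter(y for _, y in D)
pi = {y_k: c / len(D) for y_k, c in y_counts.items()}

# theta_ijk = #D{X_i = x_ij, Y = y_k} / #D{Y = y_k}
theta = {}
for i in range(2):                    # attribute index (0-based: i = 0 is X_1)
    for x_ij in (0, 1):               # attribute value
        for y_k in (0, 1):            # class value
            joint = sum(1 for x, y in D if x[i] == x_ij and y == y_k)
            theta[(i, x_ij, y_k)] = joint / y_counts[y_k]

print(pi)                             # {1: 0.5, 0: 0.5}
print(theta[(0, 1, 1)])               # P(X_1 = 1 | Y = 1) = 2/3
```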

  9. • F = 1 iff you live in Fox Ridge
  • S = 1 iff you watched the Super Bowl last night
  • D = 1 iff you drive to VT
  • G = 1 iff you went to the gym in the last month
  Parameters to estimate:
    P(F = 1) =            P(F = 0) =
    P(S = 1 | F = 1) =    P(S = 0 | F = 1) =
    P(S = 1 | F = 0) =    P(S = 0 | F = 0) =
    P(D = 1 | F = 1) =    P(D = 0 | F = 1) =
    P(D = 1 | F = 0) =    P(D = 0 | F = 0) =
    P(G = 1 | F = 1) =    P(G = 0 | F = 1) =
    P(G = 1 | F = 0) =    P(G = 0 | F = 0) =
  P(F | S, D, G) ∝ P(F) P(S | F) P(D | F) P(G | F)

  10. Naïve Bayes: Subtlety #1
  • Often the X_i are not really conditionally independent
  • Naïve Bayes often works pretty well anyway
  • Often the right classification, even when not the right probability [Domingos & Pazzani, 1996]
  • What is the effect on the estimated P(Y|X)?
  • What if we have two copies: X_i = X_j?
    P(Y = y_k | X_1, ..., X_n) ∝ P(Y = y_k) Π_i P(X_i | Y = y_k)
  Slide credit: Tom Mitchell

  11. Naïve Bayes: Subtlety #2
  • The MLE estimate for P(X_i | Y = y_k) might be zero (for example, X_i = birthdate, X_i = Feb_4_1995)
  • Why worry about just one parameter out of many?
    P(Y = y_k | X_1, ..., X_n) ∝ P(Y = y_k) Π_i P(X_i | Y = y_k)
    (a single zero factor forces the whole product to zero)
  • What can we do to address this?
  • MAP estimates (adding "imaginary" examples)
  Slide credit: Tom Mitchell

  12. Estimating parameters: discrete Y, X_i
  • Maximum likelihood estimates (MLE):
    π̂_k = P̂(Y = y_k) = #D{Y = y_k} / |D|
    θ̂_ijk = P̂(X_i = x_ij | Y = y_k) = #D{X_i = x_ij, Y = y_k} / #D{Y = y_k}
  • MAP estimates (Dirichlet priors):
    π̂_k = P̂(Y = y_k) = (#D{Y = y_k} + (β_k − 1)) / (|D| + Σ_m (β_m − 1))
    θ̂_ijk = P̂(X_i = x_ij | Y = y_k) = (#D{X_i = x_ij, Y = y_k} + (β_k − 1)) / (#D{Y = y_k} + Σ_m (β_m − 1))
  Slide credit: Tom Mitchell
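A minimal sketch of the smoothed estimate; setting every Dirichlet hyperparameter β to 2 gives one "imaginary" example per value (Laplace smoothing). The counts below are illustrative, not from the slides.

```python
def map_estimate(count_joint, count_class, n_values, pseudo=1.0):
    """Smoothed estimate of P(X_i = x_ij | Y = y_k).

    count_joint: #D{X_i = x_ij, Y = y_k}
    count_class: #D{Y = y_k}
    n_values:    number of distinct values X_i can take
    pseudo:      imaginary examples per value ((beta - 1) in the slide's notation)
    """
    return (count_joint + pseudo) / (count_class + pseudo * n_values)

# MLE would give 0 / 5 = 0 for a value never seen with this class;
# the MAP/smoothed estimate stays strictly positive.
print(map_estimate(0, 5, n_values=2))   # 1/7 ≈ 0.143 instead of 0.0
print(map_estimate(3, 5, n_values=2))   # 4/7 ≈ 0.571
```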

  13. What if we have continuous X_i?
  • Gaussian Naïve Bayes (GNB): assume
    P(X_i = x | Y = y_k) = (1 / (√(2π) σ_ik)) exp(−(x − μ_ik)² / (2σ_ik²))
  • Additional assumption on σ_ik:
    • is independent of Y (σ_i)
    • is independent of X_i (σ_k)
    • is independent of both X_i and Y (σ)
  Slide credit: Tom Mitchell

  14. Naïve Bayes algorithm – continuous X_i
  • For each value y_k:
    Estimate π_k = P(Y = y_k)
    For each attribute X_i, estimate the class-conditional mean μ_ik and variance σ_ik
  • Classify X^test:
    Ŷ ← argmax_{y_k} P(Y = y_k) Π_i P(X_i^test | Y = y_k)
    Ŷ ← argmax_{y_k} π_k Π_i Normal(X_i^test; μ_ik, σ_ik)
  Slide credit: Tom Mitchell
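A minimal NumPy sketch of these two steps (estimate π_k, μ_ik, σ_ik, then take the argmax). The toy data are made up, and the product of Gaussians is computed in log space for numerical stability rather than as a literal product.

```python
import numpy as np

# Toy data: 6 examples, 2 continuous attributes (illustrative values only).
X = np.array([[1.0, 2.1], [0.9, 1.8], [1.2, 2.4],    # class 0
              [3.1, 0.2], [2.8, 0.1], [3.3, 0.4]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(y)
pi  = np.array([np.mean(y == k) for k in classes])               # P(Y = y_k)
mu  = np.array([X[y == k].mean(axis=0) for k in classes])        # mu_ik
var = np.array([X[y == k].var(axis=0) + 1e-6 for k in classes])  # sigma_ik^2 (+ eps)

def predict(x_test):
    # log pi_k + sum_i log Normal(x_i; mu_ik, sigma_ik), maximized over k.
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x_test - mu) ** 2 / var).sum(axis=1)
    return classes[np.argmax(np.log(pi) + log_lik)]

print(predict(np.array([1.1, 2.0])))   # -> 0
print(predict(np.array([3.0, 0.3])))   # -> 1
```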

  15. Things to remember
  • Probability basics: conditional probability, joint probability, Bayes rule
  • Estimating parameters from data:
    Maximum likelihood (ML): maximize P(Data | θ)
    Maximum a posteriori estimation (MAP): maximize P(θ | Data)
  • Naïve Bayes:
    P(Y = y_k | X_1, ..., X_n) ∝ P(Y = y_k) Π_i P(X_i | Y = y_k)

  16. Logistic Regression • Hypothesis representation • Cost function • Logistic regression with gradient descent • Regularization • Multi-class classification

  17. Logistic Regression • Hypothesis representation • Cost function • Logistic regression with gradient descent • Regularization • Multi-class classification

  18. [Plot: Malignant? (1 = Yes, 0 = No) vs. Tumor Size, with the linear fit h_θ(x) = θ^T x]
  • Threshold the classifier output h_θ(x) at 0.5:
    If h_θ(x) ≥ 0.5, predict "y = 1"
    If h_θ(x) < 0.5, predict "y = 0"
  Slide credit: Andrew Ng

  19. Classification: y = 1 or y = 0
  • h_θ(x) = θ^T x (from linear regression) can be > 1 or < 0
  • Logistic regression: 0 ≤ h_θ(x) ≤ 1
  • Logistic regression is actually for classification, despite the name
  Slide credit: Andrew Ng

  20. Hypothesis representation
  • Want 0 ≤ h_θ(x) ≤ 1
  • h_θ(x) = 1 / (1 + e^(−θ^T x))
  • Equivalently h_θ(x) = g(θ^T x), where g(z) = 1 / (1 + e^(−z))
  • g is the sigmoid function (also called the logistic function)
  [Plot: the sigmoid g(z) as a function of z]
  Slide credit: Andrew Ng
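A minimal sketch of the sigmoid hypothesis; the parameter vector and input are arbitrary example values (the θ chosen here happens to match the later decision-boundary slide).

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Logistic regression hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(theta @ x)

theta = np.array([-3.0, 1.0, 1.0])   # example parameters
x = np.array([1.0, 1.0, 1.0])        # x_0 = 1 (intercept term), x_1, x_2
print(h(theta, x))                   # ≈ 0.27 -> below 0.5, so predict y = 0
```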

  21. Interpretation of hypothesis output
  • h_θ(x) = estimated probability that y = 1 on input x
  • Example: x = [x_0; x_1], with x_0 = 1 and x_1 = tumorSize
  • If h_θ(x) = 0.7, tell the patient there is a 70% chance of the tumor being malignant
  Slide credit: Andrew Ng

  22. Logistic regression
  • h_θ(x) = g(θ^T x), where g(z) = 1 / (1 + e^(−z)) and z = θ^T x
  • Suppose we predict "y = 1" if h_θ(x) ≥ 0.5, i.e. z = θ^T x ≥ 0
  • and predict "y = 0" if h_θ(x) < 0.5, i.e. z = θ^T x < 0
  Slide credit: Andrew Ng

  23. Decision boundary
  • h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2)   [Plot: Tumor Size vs. Age]
  • E.g., θ_0 = −3, θ_1 = 1, θ_2 = 1
  • Predict "y = 1" if −3 + x_1 + x_2 ≥ 0
  Slide credit: Andrew Ng
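A minimal sketch of this boundary with the slide's parameters θ_0 = −3, θ_1 = 1, θ_2 = 1; the two test points are made up.

```python
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])   # theta_0, theta_1, theta_2 from the slide

def predict(x1, x2):
    # Predict y = 1 exactly when theta_0 + theta_1*x1 + theta_2*x2 >= 0, i.e. x1 + x2 >= 3.
    return int(theta @ np.array([1.0, x1, x2]) >= 0)

print(predict(1.0, 1.0))   # 0  (1 + 1 < 3: below the line x1 + x2 = 3)
print(predict(2.0, 2.0))   # 1  (2 + 2 >= 3: on the y = 1 side of the boundary)
```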

  24. • h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1² + θ_4 x_2²)
  • E.g., θ_0 = −1, θ_1 = 0, θ_2 = 0, θ_3 = 1, θ_4 = 1
  • Predict "y = 1" if −1 + x_1² + x_2² ≥ 0
  • Higher-order terms give more complex boundaries:
    h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1² + θ_4 x_1² x_2 + θ_5 x_1² x_2² + θ_6 x_1³ x_2 + ...)
  Slide credit: Andrew Ng
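A minimal sketch of the circular boundary x_1² + x_2² = 1 implied by the slide's example θ_0 = −1, θ_3 = θ_4 = 1; the test points are made up.

```python
import numpy as np

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])   # theta_0 ... theta_4 from the slide

def predict(x1, x2):
    # Polynomial features (1, x1, x2, x1^2, x2^2); predict y = 1 when theta^T phi >= 0,
    # i.e. on or outside the unit circle x1^2 + x2^2 = 1.
    phi = np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])
    return int(theta @ phi >= 0)

print(predict(0.5, 0.5))   # 0  (inside the unit circle)
print(predict(1.0, 1.0))   # 1  (outside the unit circle)
```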
