  1. Logistic Regression CSCI 447/547 MACHINE LEARNING

  2. Outline
  - Math Behind Logistic Regression
  - Visualizing Logistic Regression
  - Loss Function
  - Minimizing Log Likelihood Function
  - Batch/Full Logistic Regression
  - Gradient Descent for OLS
  - Gradient Descent for Logistic Regression
  - Comparing OLS and Logistic Regression
  - Multi-Class Logistic Regression Using Softmax

  3. OLS Recap
  - Linear regression
    - Predicts continuous and potentially unbounded labels based on given features
    - Ŷ = XB, where B are the coefficients and X is the data matrix
  - The Issue:
    - Unbounded output means we cannot use it for discrete classification

  4. Logistic Regression
  - Classification
    - Predicts discrete labels based on given features
  - The Setup
    - y = 0 or 1, with probabilities 1 - p and p
    - Predict a probability instead of a value
    - Estimate P(y=1|X)

  5. Definitions
  - Link Function
    - Relates the mean of the distribution to the output of the linear model
    - Converts unbounded predictions to bounded predictions
    - Converts continuous output to a discrete interpretation
    - Typically the link function is exponential in form
  - Logistic Function
    - Input ranges over (-∞, ∞) and output over (0, 1): f: (-∞, ∞) -> (0, 1)
    - Typically the sigmoid function
    - f(0) = 0.5, f(-∞) = 0, f(∞) = 1
    - This value is a probability
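
To make the link function concrete, here is a minimal NumPy sketch of the sigmoid described above. The function name `sigmoid` and the clipping bound are my own choices, not from the slides.

```python
import numpy as np

def sigmoid(z):
    """Logistic link: maps (-inf, inf) to (0, 1), with sigmoid(0) = 0.5."""
    # Clipping guards against overflow in exp for very large |z|.
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))

# Spot-check the properties listed on the slide.
print(sigmoid(0.0))     # 0.5
print(sigmoid(-50.0))   # ~0 (approaches 0 as z -> -inf)
print(sigmoid(50.0))    # ~1 (approaches 1 as z -> +inf)
```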

  6. Note on Choice of Link Function
  - Properties:
    - Bounded: range [0, 1]
    - Domain: (-∞, ∞)
    - Differentiable everywhere
      - Used in optimization
    - Increasing function
  - We do not lose the property of coefficient effect (positive and negative) in moving from linear to logistic

  7. Visualizing Logistic Regression

  8. Logistic Regression Loss
  - Our Goal
  - The Full Log-Likelihood
  - A Note on Minimization

  9. Our Goal
  - Find P(y=1 | X), the probability to be estimated
  - Logistic function: Φ(z) = 1 / (1 + e^(-z))
  - The OLS function Y = XB becomes Y = Φ(XB), applied elementwise:
    Φ(XB) = (Φ(x_1 · B), ..., Φ(x_N · B))^T, so y_i = Φ(x_i · B)
  - MLE (Maximum Likelihood Estimation):
    L = ∏_{i=1}^N P(Y = y_i | x_i)
  - Observations are independent, so the joint probability is the product over the individual observations

  10. Our Goal
  - L = ∏_{i=1}^N P(Y = y_i | x_i) = ∏_{i=1}^N p_i^(y_i) (1 - p_i)^(1 - y_i)
  - Log-likelihood:
    log(L) = Σ_{i=1}^N [y_i log(p_i) + (1 - y_i) log(1 - p_i)]
           = Σ_{i=1}^N [y_i log(Φ(x_i · B)) + (1 - y_i) log(1 - Φ(x_i · B))]
  - Since we want to minimize, take the negative log-likelihood: we will minimize -log(L)
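
As a rough illustration of the quantity being minimized, the sketch below evaluates the negative log-likelihood for labels y in {0, 1} and predicted probabilities p = Φ(XB). The helper name and the small epsilon guard are assumptions added for numerical safety.

```python
import numpy as np

def negative_log_likelihood(y, p, eps=1e-12):
    """-log L = -sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]."""
    p = np.clip(p, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```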

  11. The Full Log-Likelihood
  - -log(L) = -Σ_{i=1}^N [y_i log(Φ(x_i · B)) + (1 - y_i) log(1 - Φ(x_i · B))]
  - Minimize this in logistic regression
  - So far, similar to OLS
  - But this does not have an explicit solution, so we need to minimize it numerically

  12. Logistic Regression Loss

  13. Gradient Descent

  14. Batch/Full Gradient Descent
  - The Gradient
  - Algorithm
    1. Choose x randomly
    2. Compute the gradient of f at x: ∇f(x)
    3. Step in the direction of the negative of the gradient: x <- x - η ∇f(x)
       - η = step size (too large and it can overshoot; too small and it takes too long)
    - Repeat steps 2 and 3 until convergence
      - The difference between iterations is no longer decreasing, or a set number of iterations has been reached
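
A minimal sketch of the batch gradient descent loop described in steps 1 to 3, assuming the caller supplies the gradient function `grad_f`. The tolerance-based stopping rule is one possible reading of the convergence criterion above.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-6, max_iter=10000):
    """Batch gradient descent: x <- x - eta * grad_f(x) until convergence."""
    x = x0.astype(float)
    for _ in range(max_iter):
        x_new = x - eta * grad_f(x)
        # Stop when the update no longer changes x meaningfully.
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x  # hit the iteration cap

# Example: minimize f(x) = ||x||^2, whose gradient is 2x.
x_min = gradient_descent(lambda x: 2 * x, np.array([3.0, -4.0]))
print(x_min)  # close to [0, 0]
```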

  15. Gradient Descent for OLS
  - Numerical Minimization of OLS
  - Mean log-likelihood: L = -(1/N) ||y - XB||_2^2
  - The Algorithm for OLS
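
The OLS update that appears later on slide 22, B <- B + (2η/N) X^T (y - XB), follows from this loss. A minimal sketch of one such step; the function name is my own.

```python
import numpy as np

def ols_gradient_step(X, y, B, eta):
    """One batch gradient step for the mean squared error (1/N)||y - XB||^2."""
    N = X.shape[0]
    residual = y - X @ B                      # residual error
    return B + (2.0 * eta / N) * (X.T @ residual)
```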

  16. Gradient Descent for Logistic Regression
  - Numerical Minimization of the Logistic Loss
    L(B) = -Σ_i [y_i log(Φ(x_i · B)) + (1 - y_i) log(1 - Φ(x_i · B))]
  - Two Components
  - Recall Φ(z) = 1 / (1 + e^(-z))
  - Because this is symmetric about 1/2: Φ(z) + Φ(-z) = 1

  17. Gradient Descent for Logistic Regression
  - First Term: ∂/∂B_k [log(Φ(x_i · B))]
    - ∂/∂B_k [log(Φ(x_i · B))] = (1 / Φ(x_i · B)) ∂/∂B_k Φ(x_i · B), with Φ(x_i · B) = 1 / (1 + e^(-x_i · B))
    - ∂/∂B_k Φ(x_i · B) = [1 / (1 + e^(-x_i · B))^2] e^(-x_i · B) x_ik
      = [1 / (1 + e^(-x_i · B))] [1 / (1 + e^(x_i · B))] x_ik = Φ(x_i · B) Φ(-x_i · B) x_ik
    - So ∂/∂B_k [log(Φ(x_i · B))] = Φ(-x_i · B) x_ik = (1 - Φ(x_i · B)) x_ik

  18. Gradient Descent for Logistic Regression
  - Second Term: ∂/∂B_k [log(1 - Φ(x_i · B))]
    - ∂/∂B_k [log(1 - Φ(x_i · B))] = ∂/∂B_k [log(Φ(-x_i · B))]
      = -(1 - Φ(-x_i · B)) x_ik = -Φ(x_i · B) x_ik

  19. Gradient Descent for Logistic Regression
  - All Terms:
    ∂L(B)/∂B = -Σ_i [y_i x_i^T (1 - Φ(x_i · B)) + (1 - y_i)(-x_i^T) Φ(x_i · B)]
    = -Σ_i [y_i x_i^T - y_i x_i^T Φ(x_i · B) + y_i x_i^T Φ(x_i · B) - x_i^T Φ(x_i · B)]
    = -Σ_i x_i^T (y_i - Φ(x_i · B))
    = -X^T (y - Φ(XB))
    where Φ(XB) = (Φ(x_1 · B), ..., Φ(x_N · B))^T

  20. Gradient Descent for Logistic Regression
  - ∂L/∂B = -X^T (y - Φ(XB))
    - where Φ is applied elementwise (the individual logistic functions)
  - Normalize this by the number of observations
    - Divide by N to get the mean loss
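
In code, the mean-loss gradient -(1/N) X^T (y - Φ(XB)) could be evaluated as in this sketch; the function and variable names are assumptions.

```python
import numpy as np

def logistic_gradient(X, y, B):
    """Mean-loss gradient: -(1/N) X^T (y - sigmoid(XB))."""
    N = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(X @ B)))   # elementwise logistic function
    return -(X.T @ (y - p)) / N
```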

  21. Gradient Descent for Logistic Regression
  - Algorithm:
    1. Pick learning rate η
    2. Initialize B randomly
    3. Iterate: B <- B - η ∂L/∂B = B + (η/N) X^T [y - Φ(XB)]
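
Putting the pieces together, here is a sketch of the full training loop under the update rule above; the random seed, initialization scale, and iteration count are arbitrary choices of mine.

```python
import numpy as np

def fit_logistic(X, y, eta=0.1, n_iter=5000):
    """Batch gradient descent: B <- B + (eta/N) X^T (y - sigmoid(XB))."""
    rng = np.random.default_rng(0)
    N, d = X.shape
    B = rng.normal(scale=0.01, size=d)       # random initialization
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ B)))   # predicted probabilities
        B = B + (eta / N) * (X.T @ (y - p))  # step against the mean-loss gradient
    return B
```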

  22. Comparing OLS and Logistic Regression
  - The Two Update Steps
    - OLS: B <- B + (2η/N) X^T (y - XB)   (a constant times the residual error)
    - Logistic: B <- B + (η/N) X^T (y - Φ(XB))   (a constant times the error)

  23. Comparing OLS and Logistic Regression
  - Regularization
    - OLS loss function under regularization: L = ||y - XB||_2^2 + λ ||B||_2^2
    - Logistic: L = ||y - Φ(XB)||_2^2 + λ ||B||_2^2
  - L2 Norm (Ridge Regression): uniform regularization
  - L1 Norm (LASSO Regression): dimensionality reduction
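
For the L2 (ridge) case, the penalty λ ||B||_2^2 contributes 2λB to the gradient. A sketch, under the assumption that the penalty is added to the mean negative log-likelihood used in the earlier slides; the name `lam` and the function name are mine.

```python
import numpy as np

def ridge_logistic_gradient(X, y, B, lam):
    """Gradient of the L2-regularized logistic loss: mean NLL + lam * ||B||^2."""
    N = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(X @ B)))
    return -(X.T @ (y - p)) / N + 2.0 * lam * B   # penalty term shrinks B uniformly
```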

  24. Multi-Class Logistic Regression Using Softmax
  - Softmax:
    P(y = c_k | x_i) = e^(B_k · x_i) / Σ_{l=1}^K e^(B_l · x_i),  y ∈ {1, ..., K}
  - Σ_{k=1}^K P(y = c_k | x_i) = 1, so it is a probability
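
A minimal sketch of the softmax mapping, applied row-wise to a matrix of scores B_k · x_i. The max-subtraction step is an added numerical-stability detail, not something from the slides.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: P(y = c_k | x_i) = exp(s_ik) / sum_l exp(s_il)."""
    # Subtracting the row max keeps exp from overflowing; it cancels in the ratio.
    shifted = scores - scores.max(axis=1, keepdims=True)
    expd = np.exp(shifted)
    return expd / expd.sum(axis=1, keepdims=True)

probs = softmax(np.array([[2.0, 1.0, 0.1]]))
print(probs, probs.sum())   # each row sums to 1, so it is a probability vector
```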

  25. Multi-Class Logistic Regression Using Softmax
  - Comparison with Logistic Regression in the Case of Two Classes
    - 2 classes: P(y = 1 | x_i) = 1 / (1 + e^(-x_i · B))
    - Softmax with 2 classes:
      P(y = 1 | x_i) = e^(x_i · B_1) / (e^(x_i · B_1) + e^(x_i · B_2))
      Multiply numerator and denominator by e^(-x_i · B_1):
      = 1 / (1 + e^(-x_i · (B_1 - B_2)))
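
A quick numeric check of the two-class equivalence derived above, using arbitrary values for x, B_1, and B_2.

```python
import numpy as np

# Two-class softmax should equal the sigmoid of the score difference.
x, B1, B2 = 0.7, 1.5, -0.4
p_softmax = np.exp(x * B1) / (np.exp(x * B1) + np.exp(x * B2))
p_sigmoid = 1.0 / (1.0 + np.exp(-x * (B1 - B2)))
print(np.isclose(p_softmax, p_sigmoid))   # True
```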

  26. Softmax Optimization
  - Classification Probabilities:
    Π_{i,k} = e^(x_i · B_k) / Σ_l e^(x_i · B_l)   (probability of observation i, class k)
    p_k = (Π_{1,k}, ..., Π_{N,k})^T
  - The Gradients: ∂L/∂B_k = -X^T (y_k - p_k)
    - Logistic: -X^T (y - Φ(XB))
    - The probability vector replaces the sigmoid term
  - y_j is a column vector with all entries 0 except the j-th

  27. Softmax Gradient Descent
  - Algorithm largely unchanged
    1. Initialize learning rate η
    2. Randomly choose B_1, ..., B_K
    3. B_k <- B_k + (η/N) X^T (y_k - p_k) for each class k
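
A sketch of the softmax gradient descent loop, assuming the labels arrive as a one-hot matrix `Y_onehot` of shape N x K; all K coefficient vectors are stored as columns of B and updated at once. The names and hyperparameter values are my own choices.

```python
import numpy as np

def fit_softmax(X, Y_onehot, eta=0.1, n_iter=5000):
    """Batch gradient descent: B_k <- B_k + (eta/N) X^T (y_k - p_k) for every class k."""
    rng = np.random.default_rng(0)
    N, d = X.shape
    K = Y_onehot.shape[1]
    B = rng.normal(scale=0.01, size=(d, K))        # one coefficient column per class
    for _ in range(n_iter):
        scores = X @ B
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)          # N x K class probabilities
        B = B + (eta / N) * (X.T @ (Y_onehot - P)) # all K updates at once
    return B
```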

  28. Last Notes
  - Learning Rate η:
    - Small values take a long time to converge
    - Large values, and convergence may not happen
    - Important to monitor the loss function on each iteration
    - Also need to make sure you normalize so values don't get too large
  - Gradient descent algorithms across all of these models are very similar

  29. Summary
  - Math Behind Logistic Regression
  - Visualizing Logistic Regression
  - Loss Function
  - Minimizing Log Likelihood Function
  - Batch/Full Logistic Regression
  - Gradient Descent for OLS
  - Gradient Descent for Logistic Regression
  - Comparing OLS and Logistic Regression
  - Multi-Class Logistic Regression Using Softmax
