6. Linear & logistic regressions

  1. Foundations of Machine Learning, CentraleSupélec — Fall 2017. 6. Linear & logistic regressions. Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech. chloe-agathe.azencott@mines-paristech.fr

  2. Learning objectives ● Density estimation: – Define parametric methods. – Define the maximum likelihood estimator and compute it for Bernoulli, multinomial and Gaussian densities. – Define the Bayes estimator and compute it for normal priors. ● Supervised learning: – Compute the maximum likelihood estimator / least-squares fit solution for linear regression. – Compute the maximum likelihood estimator for logistic regression.

  3. Density estimation

  4. Parametric methods ● Parametric estimation: – assume a parametric form for p(x|θ), e.g. a Gaussian N(μ, σ²); – goal: estimate θ using the sample X; – usually assume that the samples are independent and identically distributed (iid).

  5. Maximum likelihood estimation ● Find θ such that X is the most likely to have been drawn. ● Likelihood of θ given the iid sample X: l(θ|X) = p(X|θ) = Π_t p(x^t|θ). ● Log likelihood: L(θ|X) = log l(θ|X) = Σ_t log p(x^t|θ). ● Maximum likelihood estimation (MLE): θ_ML = argmax_θ L(θ|X).

  6. Bernoulli density ● Two states: failure / success; x ∈ {0, 1} and P(x) = p_0^x (1 − p_0)^(1−x). Compute the MLE estimate of p_0.

  7. Bernoulli density ● Two states: failure / success; P(x) = p_0^x (1 − p_0)^(1−x). Compute the MLE estimate of p_0. ● Log likelihood: ?

  8. Bernoulli density ● Two states: failure / success; P(x) = p_0^x (1 − p_0)^(1−x). Compute the MLE estimate of p_0. ● Log likelihood: L(p_0|X) = Σ_t [ x^t log p_0 + (1 − x^t) log(1 − p_0) ]. ● Maximize the likelihood: ?

  9. Bernoulli density ● Two states: failure / success; P(x) = p_0^x (1 − p_0)^(1−x). Compute the MLE estimate of p_0. ● Log likelihood: L(p_0|X) = Σ_t [ x^t log p_0 + (1 − x^t) log(1 − p_0) ]. ● Maximize the likelihood: set its gradient to 0. ?

  11. Bernoulli density ● Two states: failure / success; P(x) = p_0^x (1 − p_0)^(1−x). MLE estimate of p_0: ● Log likelihood: L(p_0|X) = Σ_t [ x^t log p_0 + (1 − x^t) log(1 − p_0) ]. ● Maximize the likelihood: set its gradient to 0, which gives p̂_0 = (1/N) Σ_t x^t, the fraction of successes in the sample.
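
A minimal numerical check of this result (not from the slides; the sample and variable names are illustrative): the grid point that maximizes the Bernoulli log likelihood coincides with the sample mean.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.binomial(1, 0.3, size=1000)      # illustrative iid Bernoulli(0.3) sample

    def bernoulli_log_likelihood(p0, x):
        """Log likelihood of p0 given a binary sample x."""
        return np.sum(x * np.log(p0) + (1 - x) * np.log(1 - p0))

    # Evaluate the log likelihood on a grid of candidate p0 and take the argmax.
    grid = np.linspace(0.001, 0.999, 999)
    ll = np.array([bernoulli_log_likelihood(p0, x) for p0 in grid])

    print("grid argmax: ", grid[np.argmax(ll)])   # ~ sample mean
    print("sample mean: ", x.mean())              # closed-form MLE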

  12. Multinomial density ● Consider K mutually exclusive and exhaustive classes – each class occurs with probability p_k; – x_1, x_2, …, x_K are indicator variables: x_k = 1 if the outcome is class k and 0 otherwise. ● The MLE of p_k is p̂_k = (1/N) Σ_t x_k^t, the fraction of samples that fall in class k.

  13. Gaussian distribution ● Gaussian distribution = normal distribution: p(x|μ, σ) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) ). Compute the MLE estimates of μ and σ.

  15. Gaussian distribution ● Gaussian distribution = normal distribution: p(x|μ, σ) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) ). MLE estimates of μ and σ: μ̂ = (1/N) Σ_t x^t and σ̂² = (1/N) Σ_t (x^t − μ̂)².

  16. Bias-variance tradeoff ● Mean squared error of an estimator θ̂ of θ_0: MSE(θ̂) = E[(θ̂ − θ_0)²] = (E[θ̂] − θ_0)² + Var(θ̂), i.e. bias² + variance. A biased estimator may achieve better MSE than an unbiased one. (Figure: the bias is the gap between E[θ̂] and θ_0; the variance is the spread of θ̂ around E[θ̂].)
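
A small Monte Carlo illustration of this point (not from the slides; all values are illustrative): for Gaussian data, the MLE variance estimator, which divides by N and is biased, typically has a lower MSE than the unbiased estimator, which divides by N − 1.

    import numpy as np

    rng = np.random.default_rng(0)
    true_var = 4.0                 # illustrative true variance
    n, n_repeats = 10, 100_000

    # Draw many small Gaussian samples and estimate the variance two ways.
    samples = rng.normal(0.0, np.sqrt(true_var), size=(n_repeats, n))
    var_mle = samples.var(axis=1, ddof=0)        # divide by n (biased, MLE)
    var_unbiased = samples.var(axis=1, ddof=1)   # divide by n-1 (unbiased)

    print("MSE, biased MLE estimator:", np.mean((var_mle - true_var) ** 2))
    print("MSE, unbiased estimator:  ", np.mean((var_unbiased - true_var) ** 2))  # larger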

  17. Bayes estimator ● Treat θ as a random variable with prior p(θ). ● Bayes rule: p(θ|X) = p(X|θ) p(θ) / p(X), i.e. posterior = likelihood × prior / evidence. ● Density estimation at x: p(x|X) = ∫ p(x|θ) p(θ|X) dθ.

  18. Bayes estimator ● Treat θ as a random variable with prior p(θ). ● Bayes rule: p(θ|X) = p(X|θ) p(θ) / p(X). ● Density estimation at x: p(x|X) = ∫ p(x|θ) p(θ|X) dθ. ● Maximum likelihood estimate (MLE): θ_ML = argmax_θ p(X|θ). ● Bayes estimate: ?

  20. Bayes estimator: Normal prior ● n data points (iid): x^t ~ N(θ, σ_0²), with known data variance σ_0². ● Prior: θ ~ N(μ, σ²). ● MLE of θ: the sample mean m = (1/n) Σ_t x^t. Compute the Bayes estimator of θ. Hint: compute p(θ|X) and show that it follows a normal distribution.

  24.–25. Bayes estimator: Normal prior ● n data points (iid): x^t ~ N(θ, σ_0²); prior θ ~ N(μ, σ²). ● MLE of θ: the sample mean m = (1/n) Σ_t x^t. ● p(θ|X) follows a normal distribution with – mean (n σ² m + σ_0² μ) / (n σ² + σ_0²) – variance (σ² σ_0²) / (n σ² + σ_0²)

  26. Bayes estimator: Normal prior ● n data points (iid). ● MLE of θ: the sample mean m. ● Bayes estimator: θ_Bayes = E[θ|X] = [n σ² / (n σ² + σ_0²)] m + [σ_0² / (n σ² + σ_0²)] μ, a weighted average of the sample mean m and the prior mean μ.

  27. Bayes estimator: Normal prior ● n data points (iid). ● MLE of θ: the sample mean m. ● Bayes estimator: θ_Bayes = [n σ² / (n σ² + σ_0²)] m + [σ_0² / (n σ² + σ_0²)] μ (weighted average of the sample mean and the prior mean). ? Which weight dominates when n is large? ? Which weight dominates when σ is small?

  28. Bayes estimator: Normal prior ● n data points (iid). ● MLE of θ: the sample mean m. ● Bayes estimator: θ_Bayes = [n σ² / (n σ² + σ_0²)] m + [σ_0² / (n σ² + σ_0²)] μ. ● When n ↗, θ_Bayes gets closer to the sample average (uses information from the sample). ● When σ is small, θ_Bayes gets closer to μ (little uncertainty about the prior).
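
A short numerical sketch of this estimator (not from the slides; the prior parameters mu and sigma and the noise level sigma_0 are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed setting: data x^t ~ N(theta, sigma_0^2), prior theta ~ N(mu, sigma^2).
    theta_true, sigma_0 = 2.0, 1.0
    mu, sigma = 0.0, 0.5          # prior mean and prior standard deviation (illustrative)
    n = 20
    x = rng.normal(theta_true, sigma_0, size=n)

    m = x.mean()                  # MLE of theta (sample mean)

    # The posterior is normal; its mean is the Bayes estimator.
    w_data = n * sigma**2 / (n * sigma**2 + sigma_0**2)
    theta_bayes = w_data * m + (1 - w_data) * mu
    post_var = (sigma**2 * sigma_0**2) / (n * sigma**2 + sigma_0**2)

    print("MLE (sample mean): ", m)
    print("Bayes estimate:    ", theta_bayes)   # pulled toward the prior mean mu
    print("posterior variance:", post_var)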

  29. Linear regression

  30. Linear regression

  31. Linear regression: MLE ● Assume the error is Gaussian distributed: y = βx + β_0 + ε with ε ~ N(0, σ²), so that E[y|x] = βx + β_0. ● Replace g with its estimator f. (Figure: at a given x*, the conditional density p(y|x*) is centred on E[y|x*], on the regression line.)

  32. MLE under Gaussian noise ● Maximize the (log) likelihood: L(β, β_0 | X) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_t (y^t − β x^t − β_0)², where the first term is independent of β.

  33. MLE under Gaussian noise ● Maximize the (log) likelihood: L(β, β_0 | X) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_t (y^t − β x^t − β_0)², where the first term is independent of β. ?

  34. MLE under Gaussian noise ● Maximize the (log) likelihood: L(β, β_0 | X) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_t (y^t − β x^t − β_0)², where the first term is independent of β. ● Assuming Gaussian error, maximizing the likelihood is therefore equivalent to minimizing the sum of squared residuals Σ_t (y^t − β x^t − β_0)².

  35. Linear regression: least-squares fit ● Minimize the residual sum of squares RSS(β) = Σ_t (y^t − β^T x^t)² = ‖y − Xβ‖².

  36. Linear regression: least-squares fit ● Minimize the residual sum of squares RSS(β) = ‖y − Xβ‖². Historically: – Carl Friedrich Gauss (to predict the location of Ceres); – Adrien-Marie Legendre.

  37. Linear regression: least-squares fit ● Minimize the residual sum of squares RSS(β) = ‖y − Xβ‖². Estimate β. What condition do you need to verify?

  38. Linear regression: least-squares fit ● Minimize the residual sum of squares RSS(β) = ‖y − Xβ‖². ● Assuming X has full column rank (and hence X^T X invertible): β̂ = (X^T X)^{-1} X^T y.

  39. Linear regression: least-squares fit ● Minimize the residual sum of squares RSS(β) = ‖y − Xβ‖². ● Assuming X has full column rank (and hence X^T X invertible): β̂ = (X^T X)^{-1} X^T y. ● If X is rank-deficient, use a pseudo-inverse. A pseudo-inverse of A is a matrix G s.t. AGA = A.
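
A minimal sketch of this fit in code (not from the slides; the data are synthetic): the closed-form solution via the normal equations, and the pseudo-inverse route that also handles rank-deficient X.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: y = X beta + Gaussian noise.
    n, p = 100, 3
    X = rng.normal(size=(n, p))
    beta_true = np.array([1.5, -2.0, 0.5])
    y = X @ beta_true + rng.normal(scale=0.1, size=n)

    # Closed-form least-squares fit, assuming X^T X is invertible.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Pseudo-inverse version: also works when X is rank-deficient.
    beta_pinv = np.linalg.pinv(X) @ y

    print(beta_hat)
    print(beta_pinv)   # identical here, since X has full column rank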

  40. Gauss-Markov Theorem ● Under the assumption that the errors have zero mean, equal variance and are uncorrelated (E[ε] = 0, Var(ε) = σ² I), the least-squares estimator of β is its (unique) best linear unbiased estimator.

  41.–43. Gauss-Markov Theorem ● Under the assumption that the errors have zero mean, equal variance and are uncorrelated (E[ε] = 0, Var(ε) = σ² I), the least-squares estimator β̂ of β is its (unique) best linear unbiased estimator. ● Best Linear Unbiased Estimator (BLUE): Var(β̂) ≤ Var(β*), in the sense that Var(β*) − Var(β̂) is positive semi-definite, for any β* that is a linear unbiased estimator of β.

  44. Gauss-Markov Theorem ● Under the assumption that E[ε] = 0 and Var(ε) = σ² I, the least-squares estimator β̂ of β is its (unique) best linear unbiased estimator. ● Best Linear Unbiased Estimator (BLUE): Var(β̂) ≤ Var(β*) for any β* that is a linear unbiased estimator of β; the difference Var(β*) − Var(β̂) is psd and minimal for D = 0.

  45.–48. Proof sketch: write any linear estimator as β* = C y with C = (X^T X)^{-1} X^T + D. Unbiasedness requires E[β*] = C X β = β to hold true for all β, hence D X = 0. Then Var(β*) = σ² C C^T = σ² (X^T X)^{-1} + σ² D D^T = Var(β̂) + σ² D D^T, which is psd and minimal for D = 0.

  49. Correlated variables ● If the variables are decorrelated: – each coefficient can be estimated separately; – interpretation is easy: “a change of 1 in x_j is associated with a change of β_j in Y, while everything else stays the same.” ● Correlations between variables cause problems: – the variance of all coefficients tends to increase; – interpretation is much harder: when x_j changes, so does everything else.
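
A brief simulation of this effect (not from the slides; all values are illustrative): the sampling variability of the fitted coefficients grows when two features are strongly correlated.

    import numpy as np

    rng = np.random.default_rng(0)
    beta_true = np.array([1.0, 1.0])
    n, n_repeats = 50, 2000

    def coef_std(rho):
        """Std of the fitted coefficients over repeated samples, for feature correlation rho."""
        cov = np.array([[1.0, rho], [rho, 1.0]])
        coefs = []
        for _ in range(n_repeats):
            X = rng.multivariate_normal(np.zeros(2), cov, size=n)
            y = X @ beta_true + rng.normal(scale=1.0, size=n)
            coefs.append(np.linalg.lstsq(X, y, rcond=None)[0])
        return np.std(coefs, axis=0)

    print("decorrelated (rho=0.0): ", coef_std(0.0))
    print("correlated   (rho=0.95):", coef_std(0.95))   # noticeably larger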

  50. Logistic regression

  51. What about classification?

  52. What about classification? ● Model P(Y=1|x) as a linear function? ?

  53. What about classification? ● Model P(Y=1|x) as a linear function? – Problem: P(Y=1|x) must be between 0 and 1. – Non-linearity: ● if P(Y=1|x) is close to 1 or 0, x must change a lot for it to change; ● if P(Y=1|x) is close to 0.5, that's not the case. – Hence: use a logit transformation and model the log-odds linearly, log[ p / (1 − p) ] = f(x) with p = P(Y=1|x) → Logistic regression.
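
A tiny sketch of the resulting model (not from the slides; w and b are illustrative parameters): the log-odds are modelled as a linear function of x, so P(Y=1|x) is the logistic (sigmoid) of that linear function.

    import numpy as np

    def sigmoid(z):
        """Inverse of the logit transform: maps log-odds back to a probability in (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))

    # Illustrative parameters of the linear log-odds model.
    w, b = np.array([2.0, -1.0]), 0.5

    def predict_proba(x):
        """P(Y=1|x) under the logistic regression model logit(p) = w.x + b."""
        return sigmoid(x @ w + b)

    x = np.array([0.3, 1.2])
    p = predict_proba(x)
    print(p, "log-odds:", np.log(p / (1 - p)))   # the log-odds equal w.x + b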

  54. Maximum likelihood estimation of logistic regression coefficients ● Log likelihood for n observations: ?

  56. Maximum likelihood estimation of logistic regression coefficients ● Log likelihood for n observations: L(β, β_0) = Σ_{i=1}^{n} [ y_i log p(x_i) + (1 − y_i) log(1 − p(x_i)) ], with p(x_i) = P(Y=1|x_i) = 1 / (1 + exp(−β^T x_i − β_0)).

  57. Maximum likelihood estimation of logistic regression coefficients ● Gradient of the log likelihood: ?

  59. Maximum likelihood estimation of logistic regression coefficients ● Gradient of the log likelihood: ∇_β L = Σ_{i=1}^{n} ( y_i − p(x_i) ) x_i. ● To maximize the likelihood: – set the gradient to 0; – this cannot be solved analytically; – −L is convex, so we can use gradient descent (no local minima).
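
A compact sketch of that optimization (not from the slides; the step size and iteration count are illustrative): plain gradient ascent on the logistic log likelihood, equivalently gradient descent on −L.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    # Synthetic binary classification data.
    n, p = 200, 2
    X = rng.normal(size=(n, p))
    beta_true = np.array([2.0, -1.0])
    y = (rng.uniform(size=n) < sigmoid(X @ beta_true)).astype(float)

    beta = np.zeros(p)
    lr = 0.1                                   # illustrative step size
    for _ in range(2000):
        grad = X.T @ (y - sigmoid(X @ beta))   # gradient of the log likelihood
        beta += lr * grad / n                  # ascend the likelihood
    print(beta)                                # roughly close to beta_true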

  61. Summary ● MAP estimate: θ_MAP = argmax_θ p(θ|X). ● MLE: θ_ML = argmax_θ p(X|θ). ● Bayes estimate: θ_Bayes = E[θ|X]. ● Assuming Gaussian error, maximizing the likelihood is equivalent to minimizing the RSS. ● Linear regression MLE: β̂ = (X^T X)^{-1} X^T y. ● Logistic regression MLE: no closed form; solve with gradient descent.

  62. References ● A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf – Least-squares regression: Chap. 7.6 ● The Elements of Statistical Learning. http://web.stanford.edu/~hastie/ElemStatLearn/ – Least-squares regression: Chap. 2.2.1, 3.1, 3.2.1 – Gauss-Markov theorem: Chap. 3.2.3

  63. class GradientDescentOptimizer():
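
Slides 63–69 name lab-exercise skeletons. As a rough idea of what the first one might look like (a guess under assumed method names, not the course's actual skeleton), a generic gradient descent optimizer could be:

    import numpy as np

    class GradientDescentOptimizer:
        """Minimize a function, given its gradient, by plain gradient descent.

        This is only a plausible sketch of the exercise, not the official solution.
        """

        def __init__(self, learning_rate=0.01, n_iter=1000):
            self.learning_rate = learning_rate
            self.n_iter = n_iter

        def minimize(self, grad_fn, x0):
            """Run n_iter steps of x <- x - learning_rate * grad_fn(x)."""
            x = np.asarray(x0, dtype=float)
            for _ in range(self.n_iter):
                x = x - self.learning_rate * grad_fn(x)
            return x

    # Example: minimize f(x) = ||x||^2, whose gradient is 2x.
    opt = GradientDescentOptimizer(learning_rate=0.1, n_iter=100)
    print(opt.minimize(lambda x: 2 * x, np.array([3.0, -4.0])))   # ~ [0, 0]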

  64. class LeastSquaresRegr()

  65.–67. class seq_LeastSquaresRegr()

  68.–69. class LogisticRegr()
