

  1. Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang

  2. Review: machine learning basics

  3. Math formulation
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Find $y = f(x) \in \mathcal{H}$ that minimizes $\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)$
  • s.t. the expected loss is small: $L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$

  4. Machine learning 1-2-3
  • Collect data and extract features
  • Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$
  • Optimization: minimize the empirical loss

  5. Machine learning 1-2-3
  • Collect data and extract features (Experience)
  • Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$ (Prior knowledge)
  • Optimization: minimize the empirical loss

  6. Example: Linear regression
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ (linear model $\mathcal{H}$) that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$ ($l_2$ loss)

  7. Why $l_2$ loss
  • Why not choose another loss: $l_1$ loss, hinge loss, exponential loss, ...
  • Empirical: easy to optimize; for the linear case, $w = (X^T X)^{-1} X^T y$
  • Theoretical: a way to encode prior knowledge
  • Questions: What kind of prior knowledge? Is there a principled way to derive the loss?
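
The closed-form solution above is easy to verify numerically. Below is a minimal NumPy sketch; the synthetic data and dimensions are assumptions for illustration, not part of the lecture.

```python
import numpy as np

# Synthetic data (assumed for illustration): n = 100 points, d = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

# Closed-form least-squares solution w = (X^T X)^{-1} X^T y,
# computed with a linear solve rather than an explicit matrix inverse
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # should be close to true_w
```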

  8. Maximum likelihood Estimation

  9. Maximum likelihood Estimation (MLE)
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
  • Would like to pick $\theta$ so that $P_\theta(x, y)$ fits the data well

  10. Maximum likelihood Estimation (MLE)
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
  • "Fitness" of $\theta$ to one data point $(x_i, y_i)$: likelihood$(\theta; x_i, y_i) := P_\theta(x_i, y_i)$

  11. Maximum likelihood Estimation (MLE)
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
  • "Fitness" of $\theta$ to the i.i.d. data points $\{(x_i, y_i)\}$: likelihood$(\theta; \{x_i, y_i\}) := P_\theta(\{x_i, y_i\}) = \prod_i P_\theta(x_i, y_i)$

  12. Maximum likelihood Estimation (MLE)
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
  • MLE: maximize the "fitness" of $\theta$ to the i.i.d. data points $\{(x_i, y_i)\}$:
    $\theta_{ML} = \arg\max_{\theta \in \Theta} \prod_i P_\theta(x_i, y_i)$

  13. Maximum likelihood Estimation (MLE)
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
  • MLE: maximize the "fitness" of $\theta$ to the i.i.d. data points $\{(x_i, y_i)\}$; since $\log$ is strictly increasing, maximizing the log-likelihood gives the same $\theta$:
    $\theta_{ML} = \arg\max_{\theta \in \Theta} \log\left[\prod_i P_\theta(x_i, y_i)\right] = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(x_i, y_i)$

  14. Maximum likelihood Estimation (MLE)
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
  • MLE as a negative log-likelihood loss:
    $\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(x_i, y_i)$
    $l(P_\theta, x_i, y_i) = -\log P_\theta(x_i, y_i)$
    $\hat{L}(P_\theta) = -\sum_i \log P_\theta(x_i, y_i)$
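
To make the negative log-likelihood loss concrete, here is a minimal sketch that recovers the mean of a one-dimensional Gaussian by minimizing the NLL over a grid; the data, the known variance, and the grid are assumptions for illustration, not part of the lecture.

```python
import numpy as np

# Toy data assumed drawn from a 1-D Gaussian with unknown mean (illustrative only)
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)

def neg_log_likelihood(mu, x, sigma=1.0):
    """Negative log-likelihood of i.i.d. Gaussian data with known sigma."""
    return np.sum(0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi))

# Minimizing the NLL over candidate means = maximizing the likelihood
grid = np.linspace(0.0, 4.0, 401)
nll = [neg_log_likelihood(mu, data) for mu in grid]
mu_ml = grid[int(np.argmin(nll))]
print(mu_ml, data.mean())  # the MLE matches the sample mean (up to grid resolution)
```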

  15. MLE: conditional log-likelihood
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Let $\{P_\theta(y \mid x) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$ (we only care about predicting $y$ from $x$; we do not care about $p(x)$)
  • MLE: negative conditional log-likelihood loss
    $\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(y_i \mid x_i)$
    $l(P_\theta, x_i, y_i) = -\log P_\theta(y_i \mid x_i)$
    $\hat{L}(P_\theta) = -\sum_i \log P_\theta(y_i \mid x_i)$

  16. MLE: conditional log-likelihood
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Let $\{P_\theta(y \mid x) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
  • $P(y \mid x)$: discriminative; $P(x, y)$: generative
  • MLE: negative conditional log-likelihood loss
    $\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(y_i \mid x_i)$
    $l(P_\theta, x_i, y_i) = -\log P_\theta(y_i \mid x_i)$
    $\hat{L}(P_\theta) = -\sum_i \log P_\theta(y_i \mid x_i)$

  17. Example: $l_2$ loss
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Find $f_\theta(x)$ that minimizes $\hat{L}(f_\theta) = \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$

  18. Example: $l_2$ loss
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Find $f_\theta(x)$ that minimizes $\hat{L}(f_\theta) = \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$
  • $l_2$ loss: Normal + MLE
  • Define $P_\theta(y \mid x) = \text{Normal}(y; f_\theta(x), \sigma^2)$
  • $\log P_\theta(y_i \mid x_i) = -\frac{1}{2\sigma^2} (f_\theta(x_i) - y_i)^2 - \log \sigma - \frac{1}{2} \log(2\pi)$
  • $\theta_{ML} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$
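
A quick numerical check of the derivation above: under a Gaussian $P_\theta(y \mid x)$ with fixed $\sigma$, the negative conditional log-likelihood equals the sum of squared errors scaled by $1/(2\sigma^2)$ plus a constant independent of $\theta$. The predictions and labels below are made-up values, not from the lecture.

```python
import numpy as np

# Made-up predictions f_theta(x_i) and labels y_i (illustrative only)
preds = np.array([0.2, 1.5, -0.3, 2.0])
labels = np.array([0.0, 1.0, -1.0, 2.5])
sigma = 1.0

# Negative conditional log-likelihood under P(y|x) = Normal(y; f_theta(x), sigma^2)
nll = np.sum(0.5 * ((preds - labels) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi))

# Sum of squared errors (the l2 empirical loss, up to the 1/n factor)
sse = np.sum((preds - labels) ** 2)

# The NLL equals sse / (2 sigma^2) plus a constant that does not depend on theta
print(nll, sse / (2 * sigma**2) + len(preds) * (np.log(sigma) + 0.5 * np.log(2 * np.pi)))
```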

  19. Linear classification

  20. Example 1: image classification
  • Classify an image as indoor or outdoor

  21. Example 2: Spam detection

                   #"$"   #"Mr."   #"sale"   ...   Spam?
      Email 1       2       1        1       ...   Yes
      Email 2       0       1        0       ...   No
      Email 3       1       1        1       ...   Yes
      ...
      Email n       0       0        0       ...   No
      New email     0       0        1       ...   ??

  22. Why classification
  • Classification: a kind of summary
  • Easy to interpret
  • Easy for making decisions

  23. Linear classification
  • Decision boundary: $w^T x = 0$
  • Class 1 where $w^T x > 0$; Class 0 where $w^T x < 0$

  24. Linear classification: natural attempt
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Hypothesis $f_w(x) = w^T x$ (linear model $\mathcal{H}$)
  • $y = 1$ if $w^T x > 0$; $y = 0$ if $w^T x < 0$
  • Prediction: $y = \text{step}(f_w(x)) = \text{step}(w^T x)$
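
A minimal sketch of the step-function prediction rule; the weights and inputs below are hypothetical values chosen only to show the call.

```python
import numpy as np

def predict_step(w, X):
    """Predict 0/1 labels with the step rule: y = 1 if w^T x > 0, else 0."""
    return (X @ w > 0).astype(int)

# Hypothetical weights and inputs (illustrative only)
w = np.array([1.0, -1.0])
X = np.array([[2.0, 0.5], [0.1, 3.0]])
print(predict_step(w, X))  # [1 0]
```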

  25. Linear classification: natural attempt
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ to minimize the 0-1 loss $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}[\text{step}(w^T x_i) \ne y_i]$
  • Drawback: difficult to optimize (NP-hard in the worst case)

  26. Linear classification: simple approach
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$
  • Reduce to linear regression; ignore the fact that $y \in \{0, 1\}$ (see the sketch below)
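
A minimal sketch of this reduction. The two-dimensional data, the added bias feature, and the 0.5 threshold on the regression output are assumptions chosen for illustration, not stated on the slide.

```python
import numpy as np

# Made-up 2-D data: class 1 roughly in the upper-right, class 0 in the lower-left
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Append a constant feature so the linear model has a bias term (an assumption)
Xb = np.hstack([X, np.ones((100, 1))])

# Least-squares fit on the 0/1 labels, then threshold the fitted value at 0.5 (one common choice)
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
pred = (Xb @ w > 0.5).astype(int)
print("training accuracy:", (pred == y).mean())
```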

  27. Linear classification: simple approach
  • Drawback: not robust to "outliers"
  (Figure borrowed from Pattern Recognition and Machine Learning, Bishop)

  28. Compare the two
  • $y = w^T x$ vs. $y = \text{step}(w^T x)$, plotted as functions of $w^T x$

  29. Between the two
  • Prediction bounded in [0, 1]
  • Smooth
  • Sigmoid: $\sigma(a) = \frac{1}{1 + \exp(-a)}$
  (Figure borrowed from Pattern Recognition and Machine Learning, Bishop)

  30. Linear classification: sigmoid prediction
  • Squash the output of the linear function: $\text{Sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
  • Find $w$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (\sigma(w^T x_i) - y_i)^2$
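
For reference, a tiny sketch of the sigmoid showing how it squashes an unbounded linear score into (0, 1); the example scores are arbitrary values.

```python
import numpy as np

def sigmoid(a):
    """Squash a real number into (0, 1): sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# The linear score w^T x can be any real number; the sigmoid output is always in (0, 1)
scores = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(scores))  # [~0.00005, 0.269, 0.5, 0.731, ~0.99995]
```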

  31. Linear classification: logistic regression
  • Squash the output of the linear function: $\text{Sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
  • A better approach: interpret the output as a probability
    $P_w(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
    $P_w(y = 0 \mid x) = 1 - P_w(y = 1 \mid x) = 1 - \sigma(w^T x)$

  32. Linear classification: logistic regression
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Find $w$ that minimizes the negative conditional log-likelihood
    $\hat{L}(w) = -\frac{1}{n} \sum_{i=1}^{n} \log P_w(y_i \mid x_i)$
    $\hat{L}(w) = -\frac{1}{n} \sum_{i: y_i = 1} \log \sigma(w^T x_i) - \frac{1}{n} \sum_{i: y_i = 0} \log[1 - \sigma(w^T x_i)]$
  • Logistic regression: MLE with sigmoid

  33. Linear classification: logistic regression
  • Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
  • Find $w$ that minimizes
    $\hat{L}(w) = -\frac{1}{n} \sum_{i: y_i = 1} \log \sigma(w^T x_i) - \frac{1}{n} \sum_{i: y_i = 0} \log[1 - \sigma(w^T x_i)]$
  • No closed-form solution; need to use gradient descent (a sketch follows below)
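
A minimal sketch of gradient descent on this loss; the data, step size, and iteration count are assumptions chosen for illustration, not values from the lecture. The gradient of the average negative log-likelihood is $\frac{1}{n} \sum_i (\sigma(w^T x_i) - y_i)\, x_i$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up, roughly separable 2-D data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w = np.zeros(2)
lr = 0.1  # step size (assumed)
for _ in range(1000):
    p = sigmoid(X @ w)              # P_w(y = 1 | x_i) for every example
    grad = X.T @ (p - y) / len(y)   # gradient of the average negative log-likelihood
    w -= lr * grad

pred = (sigmoid(X @ w) > 0.5).astype(int)
print("training accuracy:", (pred == y).mean())
```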
