Naïve Bayes


  1. Naïve Bayes Jia-Bin Huang Virginia Tech Spring 2019 ECE-5424G / CS-5824

  2. Administrative • HW 1 out today. Please start early! • Office hours • Chen: Wed 4pm-5pm • Shih-Yang: Fri 3pm-4pm • Location: Whittemore 266

  3. Linear Regression • Model representation: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$ • Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$ • Gradient descent for linear regression: repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}$ } • Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial) • Normal equation: $\theta = (X^\top X)^{-1} X^\top y$
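
The gradient descent update above is only a few lines of NumPy. Below is a minimal sketch (not from the slides); the function name, the default learning rate alpha, and the iteration count are illustrative choices, and X is assumed to already contain the $x_0 = 1$ column.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix whose first column is all ones (x_0 = 1).
    y: (m,) target vector.  alpha is the learning rate.
    """
    m, d = X.shape                           # d = n + 1 parameters, including theta_0
    theta = np.zeros(d)
    for _ in range(num_iters):
        errors = X @ theta - y               # h_theta(x^(i)) - y^(i) for all i at once
        theta -= alpha * (X.T @ errors) / m  # simultaneous update of every theta_j
    return theta
```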

  4. Example, with $x_0 = 1$ for every row:

       x_0   Size in feet^2 (x_1)   Bedrooms (x_2)   Floors (x_3)   Age in years (x_4)   Price in $1000's (y)
       1     2104                   5                1              45                   460
       1     1416                   3                2              40                   232
       1     1534                   3                2              30                   315
       1     852                    2                1              36                   178
       ...

     $y = [460, 232, 315, 178]^\top$; the normal equation gives $\theta = (X^\top X)^{-1} X^\top y$. Slide credit: Andrew Ng
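
To make the table concrete, here is a small sketch that loads the four rows above into NumPy and solves the least-squares problem. With only four examples and five parameters, $X^\top X$ is singular, so np.linalg.lstsq (a pseudo-inverse solution) stands in for the explicit inverse in the normal equation.

```python
import numpy as np

# Columns: x_0, size in feet^2, bedrooms, floors, age in years; targets: price in $1000's
X = np.array([
    [1, 2104, 5, 1, 45],
    [1, 1416, 3, 2, 40],
    [1, 1534, 3, 2, 30],
    [1,  852, 2, 1, 36],
], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

# Least-squares solution; equivalent to the normal equation when X^T X is invertible
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)        # fitted parameters
print(X @ theta)    # reproduces y exactly here, since there are fewer equations than unknowns
```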

  5. Least squares solution • $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 = \frac{1}{2m} \sum_{i=1}^{m} \big(\theta^\top x^{(i)} - y^{(i)}\big)^2 = \frac{1}{2m} \lVert X\theta - y \rVert_2^2$ • Set the gradient to zero: $\frac{\partial}{\partial \theta} J(\theta) = 0$ • Solution: $\theta = (X^\top X)^{-1} X^\top y$

  6. Justification/interpretation 1 • Geometric interpretation: $X\theta$ lies in the column space of $X$ • $X$ stacks the training examples as rows, $X = \begin{bmatrix} 1 & \leftarrow (x^{(1)})^\top \rightarrow \\ \vdots & \vdots \\ 1 & \leftarrow (x^{(m)})^\top \rightarrow \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_n \end{bmatrix}$ with columns $\mathbf{x}_j$ • $X\theta$ ranges over the column space of $X$, i.e., $\mathrm{span}(\{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n\})$ • The residual $X\theta - y$ is orthogonal to the column space of $X$ • $X^\top (X\theta - y) = 0 \;\Rightarrow\; (X^\top X)\theta = X^\top y$
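
A short numerical check of the orthogonality claim on synthetic data (the sizes and seed are arbitrary): after solving the normal equation, the residual $X\theta - y$ has (numerically) zero inner product with every column of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])  # x_0 = 1 column plus n features
y = rng.normal(size=m)

theta = np.linalg.solve(X.T @ X, X.T @ y)   # normal equation
residual = X @ theta - y

# Orthogonality of the residual to the column space of X
print(np.abs(X.T @ residual).max())         # on the order of 1e-13
```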

  7. Justification/interpretation 2 • Probabilistic model: assume a linear model with Gaussian errors, $p_\theta(y^{(i)} \mid x^{(i)}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\big(-\frac{1}{2\sigma^2} (y^{(i)} - \theta^\top x^{(i)})^2\big)$ • Solving maximum likelihood: $\hat{\theta} = \operatorname*{argmax}_\theta \prod_{i=1}^{m} p_\theta(y^{(i)} \mid x^{(i)}) = \operatorname*{argmax}_\theta \log \prod_{i=1}^{m} p_\theta(y^{(i)} \mid x^{(i)}) = \operatorname*{argmin}_\theta \frac{1}{2\sigma^2} \sum_{i=1}^{m} \big(\theta^\top x^{(i)} - y^{(i)}\big)^2$ Image credit: CS 446@UIUC
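
A quick numerical illustration (synthetic data; the names, noise level sigma, and seed are arbitrary) that the Gaussian negative log-likelihood equals the sum of squared errors scaled by $1/(2\sigma^2)$ plus a constant, so maximizing the likelihood and minimizing least squares pick the same $\theta$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m, sigma = 100, 0.5
X = np.column_stack([np.ones(m), rng.normal(size=m)])
y = X @ np.array([1.0, 2.0]) + sigma * rng.normal(size=m)

theta = rng.normal(size=2)                  # any candidate parameter vector

# Negative log-likelihood under the Gaussian error model
nll = -norm.logpdf(y, loc=X @ theta, scale=sigma).sum()

# Constant + (1 / 2 sigma^2) * sum of squared errors
sse = np.sum((y - X @ theta) ** 2)
const = 0.5 * m * np.log(2 * np.pi * sigma ** 2)
print(np.isclose(nll, const + sse / (2 * sigma ** 2)))   # True
```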

  8. Justification/interpretation 3 • Loss minimization: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 = \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{ls}}\big(h_\theta(x^{(i)}), y^{(i)}\big)$ • $\ell_{\mathrm{ls}}(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$: least squares loss • Empirical Risk Minimization (ERM): $\frac{1}{m} \sum_{i=1}^{m} \ell\big(y^{(i)}, \hat{y}^{(i)}\big)$
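
A tiny sketch of the ERM view (the function names are just for illustration): the cost $J(\theta)$ is the empirical average of a per-example loss, here the least-squares loss $\frac{1}{2}(y - \hat{y})^2$, and swapping in a different loss changes the learner without changing the framework.

```python
import numpy as np

def least_squares_loss(y, y_hat):
    # l_ls(y, y_hat) = 1/2 * (y - y_hat)^2
    return 0.5 * (y - y_hat) ** 2

def empirical_risk(loss, theta, X, y):
    # (1/m) * sum_i loss(y^(i), h_theta(x^(i)))
    return np.mean(loss(y, X @ theta))
```

With least_squares_loss plugged in, empirical_risk is exactly the cost $J(\theta)$ from the earlier slides.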

  9. $m$ training examples, $n$ features • Gradient descent: need to choose $\alpha$; needs many iterations; works well even when $n$ is large • Normal equation: no need to choose $\alpha$; no need to iterate; must compute $(X^\top X)^{-1}$; slow if $n$ is very large Slide credit: Andrew Ng
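
A rough sketch of the trade-off on synthetic data (sizes, learning rate, and iteration count are arbitrary): the normal equation is a single linear solve whose cost grows roughly cubically in the number of features, while gradient descent takes many cheap passes but needs a step size $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 1000, 20
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])
y = X @ rng.normal(size=n + 1) + 0.1 * rng.normal(size=m)

# Normal equation: a single solve, no alpha to tune, but roughly O(n^3) in the feature count
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: cheap O(mn) iterations, but alpha and the iteration count matter
theta_gd = np.zeros(n + 1)
alpha = 0.01
for _ in range(5000):
    theta_gd -= alpha * X.T @ (X @ theta_gd - y) / m

print(np.max(np.abs(theta_ne - theta_gd)))   # small once gradient descent has converged
```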

  10. Things to remember • Model representation: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$ • Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$ • Gradient descent for linear regression: repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}$ } • Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial) • Normal equation: $\theta = (X^\top X)^{-1} X^\top y$

  11. Today’s plan • Probability basics • Estimating parameters from data • Maximum likelihood (ML) • Maximum a posteriori estimation (MAP) • Naïve Bayes

  12. Today’s plan • Probability basics • Estimating parameters from data • Maximum likelihood (ML) • Maximum a posteriori estimation (MAP) • Naive Bayes

  13. Random variables • Outcome space S • Space of possible outcomes • Random variables • Functions that map outcomes to real numbers • Event E • Subset of S

  14. Visualizing probability • Sample space: total area = 1, split into the region where A is true and the region where A is false • $P(A)$ = area of the region where A is true (the blue circle)

  15. Visualizing probability • $P(A) + P(\lnot A) = 1$

  16. Visualizing probability • Split the A region into $A \wedge B$ and $A \wedge \lnot B$ • $P(A) = P(A, B) + P(A, \lnot B)$

  17. Visualizing conditional probability • $P(A \mid B) = P(A, B) / P(B)$ • Corollary, the chain rule: $P(A, B) = P(A \mid B)\, P(B)$

  18. Bayes rule (Thomas Bayes) • $P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A)\, P(A)}{P(B)}$ • Corollary, the chain rule: $P(A, B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$

  19. Other forms of Bayes rule • $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$ • $P(A \mid B, X) = \frac{P(B \mid A, X)\, P(A, X)}{P(B, X)}$ • $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid \lnot A)\, P(\lnot A)}$

  20. Applying Bayes rule • $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid \lnot A)\, P(\lnot A)}$ • A = you have the flu, B = you just coughed • Assume: $P(A) = 0.05$, $P(B \mid A) = 0.8$, $P(B \mid \lnot A) = 0.2$ • What is P(flu | cough) = P(A | B)? $P(A \mid B) = \frac{0.8 \times 0.05}{0.8 \times 0.05 + 0.2 \times 0.95} \approx 0.17$ Slide credit: Tom Mitchell
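
The arithmetic of the flu/cough example as a few lines of Python, transcribing the numbers from the slide:

```python
p_flu = 0.05            # P(A): prior probability of having the flu
p_cough_flu = 0.8       # P(B | A): probability of coughing given flu
p_cough_no_flu = 0.2    # P(B | ~A): probability of coughing given no flu

# Bayes rule with the denominator expanded over A and ~A
p_flu_given_cough = (p_cough_flu * p_flu) / (
    p_cough_flu * p_flu + p_cough_no_flu * (1 - p_flu)
)
print(p_flu_given_cough)   # ~0.174
```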

  21. Why are we learning this? • Hypothesis: $x \rightarrow h \rightarrow y$ • Learn $P(Y \mid X)$

  22. Joint distribution • Making a joint distribution of M variables: 1. Make a truth table listing all combinations of values 2. For each combination of values, say how probable it is 3. The probabilities must sum to 1

       A  B  C  Prob
       0  0  0  0.30
       0  0  1  0.05
       0  1  0  0.10
       0  1  1  0.05
       1  0  0  0.05
       1  0  1  0.10
       1  1  0  0.25
       1  1  1  0.10

     Slide credit: Tom Mitchell

  23. Using the joint distribution • Can ask for any logical expression involving these variables • $P(E) = \sum_{\text{rows matching } E} P(\text{row})$ • $P(E_1 \mid E_2) = \frac{\sum_{\text{rows matching } E_1 \text{ and } E_2} P(\text{row})}{\sum_{\text{rows matching } E_2} P(\text{row})}$ Slide credit: Tom Mitchell
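
A small sketch that encodes the eight-row joint distribution from slide 22 as a Python dict and answers queries by summing matching rows; the helper names and the example queries are illustrative.

```python
# Joint distribution over Boolean variables (A, B, C), rows taken from slide 22
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9     # probabilities sum to 1

def prob(event):
    """P(E) = sum of P(row) over the rows matching event E."""
    return sum(p for row, p in joint.items() if event(row))

def cond_prob(event1, event2):
    """P(E1 | E2) = P(E1 and E2) / P(E2)."""
    return prob(lambda row: event1(row) and event2(row)) / prob(event2)

print(prob(lambda r: r[0] == 1))                            # P(A)     = 0.50
print(cond_prob(lambda r: r[0] == 1, lambda r: r[1] == 1))  # P(A | B) = 0.35 / 0.50 = 0.70
```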

  24. The solution to learn $P(Y \mid X)$? • Main problem: learning $P(Y \mid X)$ may require more data than we have • Say we learn a joint distribution with 100 attributes • # of rows in this table? $2^{100} \geq 10^{30}$ • # of people on earth? About $10^9$ Slide credit: Tom Mitchell

  25. What should we do? 1. Be smart about how we estimate probabilities from sparse data • Maximum likelihood estimates (ML) • Maximum a posteriori estimates (MAP) 2. Be smart about how to represent joint distributions • Bayes network, graphical models (more on this later) Slide credit: Tom Mitchell

  26. Today’s plan • Probability basics • Estimating parameters from data • Maximum likelihood (ML) • Maximum a posteriori (MAP) • Naive Bayes

  27. Estimating the probability • Coin flip outcomes: heads is $X = 1$, tails is $X = 0$ • Flip the coin repeatedly, observing: it turns up heads $\alpha_1$ times and tails $\alpha_0$ times • Your estimate for $P(X = 1)$ is? • Case A: 100 flips: 51 heads ($X = 1$), 49 tails ($X = 0$); $P(X = 1) = ?$ • Case B: 3 flips: 2 heads ($X = 1$), 1 tail ($X = 0$); $P(X = 1) = ?$ Slide credit: Tom Mitchell

  28. Two principles for estimating parameters • Maximum Likelihood Estimate (MLE): choose $\theta$ that maximizes the probability of the observed data, $\hat{\theta}_{\mathrm{MLE}} = \operatorname*{argmax}_\theta P(\mathrm{Data} \mid \theta)$ • Maximum a posteriori estimation (MAP): choose $\theta$ that is most probable given the prior probability and the data, $\hat{\theta}_{\mathrm{MAP}} = \operatorname*{argmax}_\theta P(\theta \mid \mathrm{Data}) = \operatorname*{argmax}_\theta \frac{P(\mathrm{Data} \mid \theta)\, P(\theta)}{P(\mathrm{Data})}$ Slide credit: Tom Mitchell

  29. Two principles for estimating parameters • Maximum Likelihood Estimate (MLE): choose $\theta$ that maximizes $P(\mathrm{Data} \mid \theta)$: $\hat{\theta}_{\mathrm{MLE}} = \frac{\alpha_1}{\alpha_1 + \alpha_0}$ • Maximum a posteriori estimation (MAP): choose $\theta$ that maximizes $P(\theta \mid \mathrm{Data})$: $\hat{\theta}_{\mathrm{MAP}} = \frac{\alpha_1 + \#\text{hallucinated 1s}}{(\alpha_1 + \#\text{hallucinated 1s}) + (\alpha_0 + \#\text{hallucinated 0s})}$ Slide credit: Tom Mitchell

  30. Maximum likelihood estimate • Each flip yields a Boolean value for $X$ (heads: $X = 1$, tails: $X = 0$) • $X \sim \mathrm{Bernoulli}(\theta)$: $P(X = 1) = \theta$, $P(X = 0) = 1 - \theta$, i.e., $P(X) = \theta^X (1 - \theta)^{1 - X}$ • A data set $D$ of independent, identically distributed (iid) flips produces $\alpha_1$ ones and $\alpha_0$ zeros: $P(D \mid \theta) = P(\alpha_1, \alpha_0 \mid \theta) = \theta^{\alpha_1} (1 - \theta)^{\alpha_0}$ • $\hat{\theta} = \operatorname*{argmax}_\theta P(D \mid \theta) = \frac{\alpha_1}{\alpha_1 + \alpha_0}$ Slide credit: Tom Mitchell
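
A quick numerical check (a grid search over $\theta$ using the Case A counts from slide 27; the grid resolution is an arbitrary choice) that the likelihood $\theta^{\alpha_1}(1-\theta)^{\alpha_0}$ really peaks at $\alpha_1 / (\alpha_1 + \alpha_0)$:

```python
import numpy as np

alpha1, alpha0 = 51, 49                    # heads and tails from Case A
thetas = np.linspace(0.001, 0.999, 999)    # grid of candidate theta values

# Work in log space: log P(D | theta) = alpha1*log(theta) + alpha0*log(1 - theta)
log_lik = alpha1 * np.log(thetas) + alpha0 * np.log(1 - thetas)

print(thetas[np.argmax(log_lik)])          # ~0.51
print(alpha1 / (alpha1 + alpha0))          # 0.51, the closed-form MLE
```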

  31. Beta prior distribution $P(\theta)$ • $P(\theta) = \mathrm{Beta}(\beta_1, \beta_0) = \frac{1}{B(\beta_1, \beta_0)} \theta^{\beta_1 - 1} (1 - \theta)^{\beta_0 - 1}$ Slide credit: Tom Mitchell

  32. Maximum a posteriori (MAP) estimate • A data set $D$ of iid flips produces $\alpha_1$ ones and $\alpha_0$ zeros: $P(D \mid \theta) = P(\alpha_1, \alpha_0 \mid \theta) = \theta^{\alpha_1} (1 - \theta)^{\alpha_0}$ • Assume a prior (a conjugate prior: it gives a closed-form representation of the posterior): $P(\theta) = \mathrm{Beta}(\beta_1, \beta_0) = \frac{1}{B(\beta_1, \beta_0)} \theta^{\beta_1 - 1} (1 - \theta)^{\beta_0 - 1}$ • $\hat{\theta} = \operatorname*{argmax}_\theta P(D \mid \theta)\, P(\theta) = \frac{\alpha_1 + \beta_1 - 1}{(\alpha_1 + \beta_1 - 1) + (\alpha_0 + \beta_0 - 1)}$ Slide credit: Tom Mitchell
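
A short sketch contrasting the MLE with the MAP estimate for Case B from slide 27 (2 heads, 1 tail); the Beta(3, 3) prior is an illustrative choice, and its hyperparameters act like "hallucinated" extra flips.

```python
alpha1, alpha0 = 2, 1        # observed heads and tails (Case B: 3 flips)
beta1, beta0 = 3, 3          # Beta prior hyperparameters (illustrative choice)

theta_mle = alpha1 / (alpha1 + alpha0)
theta_map = (alpha1 + beta1 - 1) / ((alpha1 + beta1 - 1) + (alpha0 + beta0 - 1))

print(theta_mle)   # 0.667: trusts the 3 observed flips completely
print(theta_map)   # 0.571: pulled toward 0.5 by the prior's pseudo-counts
```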
