  1. Probabilistic Models. Shan-Hung Wu (shwu@cs.nthu.edu.tw), Department of Computer Science, National Tsing Hua University, Taiwan. Machine Learning.

  2. Outline
     1. Probabilistic Models
     2. Maximum Likelihood Estimation
        - Linear Regression
        - Logistic Regression
     3. Maximum A Posteriori Estimation
     4. Bayesian Estimation**

  3. Predictions based on Probability
     - In supervised learning we are given a training set X = {(x^{(i)}, y^{(i)})}_{i=1}^{N}.
     - Model F: a collection of functions parametrized by Θ.
     - Goal: train a function f such that, given a new data point x_0, the output ŷ = f(x_0; Θ) is as close as possible to the correct label y_0.
     - Examples in X are usually assumed to be i.i.d. samples of random variables (x, y) following some data-generating distribution P(x, y).
     - In probabilistic models, f is replaced by P(y = y | x = x_0), and a prediction is made by ŷ = argmax_y P(y = y | x = x_0; Θ) (sketched after this slide).
     - How do we find Θ?
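The slides do not fix a particular parameterization of P(y | x; Θ) at this point, so the following is only a minimal sketch of the ŷ = argmax_y rule, assuming a softmax model over a discrete label set; `Theta`, the class count, and the toy input are illustrative assumptions.

```python
# Minimal sketch of y_hat = argmax_y P(y = y | x = x_0; Theta), assuming a
# softmax parameterization of P(y | x; Theta) (an assumption, not from the slides).
import numpy as np

def predict(x_0, Theta):
    """Return the label that maximizes P(y | x = x_0; Theta).

    Theta is assumed to be an (n_classes, n_features) weight matrix.
    """
    scores = Theta @ x_0                    # unnormalized log-probabilities
    probs = np.exp(scores - scores.max())   # numerically stable softmax
    probs /= probs.sum()                    # P(y = k | x = x_0; Theta) for each k
    return int(np.argmax(probs))            # y_hat = argmax_y P(y | x = x_0; Theta)

# Toy usage with made-up parameters: 2 classes, 2 features.
Theta = np.array([[1.0, -0.5],
                  [-1.0, 0.5]])
x_0 = np.array([0.3, 0.8])
print(predict(x_0, Theta))
```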

  4. Function (Θ) as Point Estimate
     - Regard Θ (or f) as an estimate of the "true" Θ* (or f*), mapped from the training set X.
     - Maximum a posteriori (MAP) estimation: argmax_Θ P(Θ | X) = argmax_Θ P(X | Θ) P(Θ), by Bayes' rule (P(X) does not depend on Θ, so it can be dropped). Solve for Θ first, then use it as a constant in P(y | x; Θ) to get ŷ.
     - Maximum likelihood (ML) estimation: argmax_Θ P(X | Θ). This assumes a uniform P(Θ), i.e., no particular Θ is preferred a priori. A small numeric contrast follows below.
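As a concrete (and entirely hypothetical) contrast between the two estimators, the sketch below estimates the mean θ of unit-variance Gaussian data, once by ML and once by MAP with a N(0, τ²) prior; with a flat prior the log-prior term is constant and the two coincide. The data, prior, and grid search are illustrative assumptions, not part of the slides.

```python
# Hypothetical example: ML vs. MAP for the mean theta of N(theta, 1) data,
# with a N(0, tau^2) prior on theta. MAP maximizes P(data | theta) * P(theta),
# i.e. it adds a log-prior penalty to the ML objective.
import numpy as np

def neg_log_likelihood(theta, data):
    return 0.5 * np.sum((data - theta) ** 2)      # -log P(data | theta) + const

def neg_log_prior(theta, tau=1.0):
    return 0.5 * theta ** 2 / tau ** 2            # -log P(theta) + const

data = np.array([0.9, 1.1, 1.3, 0.7])
grid = np.linspace(-2.0, 3.0, 1001)               # simple grid search over theta
theta_ml  = grid[np.argmin([neg_log_likelihood(t, data) for t in grid])]
theta_map = grid[np.argmin([neg_log_likelihood(t, data) + neg_log_prior(t) for t in grid])]
print(theta_ml, theta_map)   # ML is the sample mean; MAP is pulled toward the prior mean 0
```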

  5. Probability Interpretation
     - Assumption: y = f*(x) + ε, where ε ~ N(0, β^{-1}) (a data-generation sketch follows below).
     - The unknown deterministic function is defined as f*(x; w*) = w*^T x. All variables are z-normalized, so there is no bias term b.
     - It follows that (y | x) ~ N(w*^T x, β^{-1}).
     - Our goal is to find w as close to w* as possible; the prediction at a test point x is then ŷ = argmax_y P(y = y | x = x; w) = w^T x. Note that ŷ does not depend on β, so we do not need to estimate β.
     - ML estimation: argmax_w P(X | w).
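A short sketch of the assumed data-generating process y = w*^T x + ε with ε ~ N(0, β^{-1}); the true weights, the precision β, and the sample sizes below are made-up values for illustration.

```python
# Sketch of the assumed generating process: y = w_true^T x + eps, eps ~ N(0, 1/beta).
# w_true, beta, N, and D are illustrative choices, not values from the slides.
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
beta = 4.0                                   # noise precision; variance is 1/beta
w_true = rng.normal(size=D)                  # plays the role of the unknown w*
X = rng.normal(size=(N, D))                  # z-normalized features, so no bias term
y = X @ w_true + rng.normal(scale=np.sqrt(1.0 / beta), size=N)

# For a new point x_0, the model predicts y_hat = w^T x_0: the mode (and mean)
# of N(w^T x_0, 1/beta), which does not depend on beta.
x_0 = rng.normal(size=D)
print(w_true @ x_0)
```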

  6. ML Estimation I
     - Problem: argmax_w P(X | w).
     - Since the samples are assumed i.i.d., the likelihood factorizes:
       P(X | w) = ∏_{i=1}^{N} P(x^{(i)}, y^{(i)} | w)
                = ∏_{i=1}^{N} P(y^{(i)} | x^{(i)}, w) P(x^{(i)} | w)
                = ∏_{i=1}^{N} P(y^{(i)} | x^{(i)}, w) P(x^{(i)})          (x does not depend on w)
                = ∏_{i=1}^{N} N(y^{(i)}; w^T x^{(i)}, β^{-1}) P(x^{(i)})
                = ∏_{i=1}^{N} √(β / 2π) exp(-(β/2)(y^{(i)} - w^T x^{(i)})^2) P(x^{(i)}).
     - To make the problem tractable, we prefer "sums" over "products", i.e., we work with the log-likelihood instead (see the sketch below).
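Taking the log turns the product into a sum; the log P(x^{(i)}) terms do not involve w and drop out of the argmax, so maximizing the log-likelihood amounts to minimizing Σ_i (y^{(i)} - w^T x^{(i)})². The sketch below checks this on synthetic data; the data, β, and sizes are illustrative assumptions.

```python
# Sketch: for this Gaussian-noise model, the ML estimate is the least-squares solution.
# Synthetic data; w_true, beta, N, and D are made-up values.
import numpy as np

rng = np.random.default_rng(1)
N, D, beta = 200, 3, 4.0
w_true = np.array([0.5, -1.0, 2.0])
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=np.sqrt(1.0 / beta), size=N)

def log_likelihood(w):
    # sum_i log N(y_i; w^T x_i, 1/beta); the log P(x_i) terms are omitted
    # because they do not depend on w.
    resid = y - X @ w
    return 0.5 * N * np.log(beta / (2 * np.pi)) - 0.5 * beta * np.sum(resid ** 2)

w_ml = np.linalg.solve(X.T @ X, X.T @ y)                # least-squares / normal equations
print(log_likelihood(w_ml) >= log_likelihood(w_true))   # True: w_ml maximizes the likelihood
```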
