 
Probabilistic Models
Shan-Hung Wu
shwu@cs.nthu.edu.tw
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning
Shan-Hung Wu (CS, NTHU) Probabilistic Models Machine Learning 1 / 25

Outline
1 Probabilistic Models
2 Maximum Likelihood Estimation
    Linear Regression
    Logistic Regression
3 Maximum A Posteriori Estimation
4 Bayesian Estimation**

Predictions based on Probability
In supervised learning, we are given a training set $X = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$
Model $F$: a collection of functions parametrized by $\Theta$
Goal: to train a function $f$ such that, given a new data point $x'$, the output value $\hat{y} = f(x'; \Theta)$ is closest to the correct label $y'$
Examples in $X$ are usually assumed to be i.i.d. samples of random variables $(\mathrm{x}, \mathrm{y})$ following some data generating distribution $P(\mathrm{x}, \mathrm{y})$
In probabilistic models, $f$ is replaced by $P(\mathrm{y} = y \mid \mathrm{x} = x')$ and a prediction is made by:
    $\hat{y} = \arg\max_{y} P(\mathrm{y} = y \mid \mathrm{x} = x'; \Theta)$
How to find $\Theta$?
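
The argmax prediction rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the softmax form of the conditional distribution, the labels "a", "b", "c", the parameter values in theta, and the function names are all hypothetical and do not come from the slides.

```python
import math

def conditional_probs(x, theta):
    """Return an assumed P(y | x; theta) for every label, as a dict.

    Assumption for illustration: a softmax over linear scores theta[label] . x.
    """
    scores = {label: sum(w_j * x_j for w_j, x_j in zip(w, x))
              for label, w in theta.items()}
    top = max(scores.values())  # subtract the max score for numerical stability
    exp_scores = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exp_scores.values())
    return {label: e / total for label, e in exp_scores.items()}

def predict(x, theta):
    """Prediction rule: pick the label y that maximizes P(y | x; theta)."""
    probs = conditional_probs(x, theta)
    return max(probs, key=probs.get)

# Hypothetical parameters (one weight vector per label) and a new data point.
theta = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, -1.0]}
x_new = [0.2, 1.5]

probs = conditional_probs(x_new, theta)
y_hat = predict(x_new, theta)
print(probs, y_hat)
```

Here theta is treated as already known; the remaining slides are about how to estimate it from the training set.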

Function ($\Theta$) as Point Estimate
Regard $\Theta$ ($f$) as an estimate of the "true" $\Theta^*$ ($f^*$)
    Mapped from the training set $X$
Maximum a posteriori (MAP) estimation:
    $\arg\max_{\Theta} P(\Theta \mid X) = \arg\max_{\Theta} P(X \mid \Theta)\, P(\Theta)$
    By Bayes' rule ($P(X)$ is irrelevant because it does not depend on $\Theta$)
    Solves for $\Theta$ first, then uses it as a constant in $P(y \mid x; \Theta)$ to get $\hat{y}$
Maximum likelihood (ML) estimation:
    $\arg\max_{\Theta} P(X \mid \Theta)$
    Assumes a uniform $P(\Theta)$ and does not prefer any particular $\Theta$
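
The relation between MAP and ML estimation can be checked numerically on a toy problem. The sketch below is an assumption-laden illustration, not part of the slides: it uses made-up coin-flip data with a single Bernoulli parameter, a coarse grid search over that parameter, and a hypothetical prior (an unnormalized Beta(3, 3)) that prefers values near 0.5. Logarithms are used because they do not change the argmax.

```python
import math

data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]      # made-up coin flips: 7 ones, 3 zeros
grid = [i / 100 for i in range(1, 100)]    # candidate values of theta in (0, 1)

def log_likelihood(theta, xs):
    """log P(X | theta) for i.i.d. Bernoulli samples."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in xs)

def log_prior_uniform(theta):
    """Uniform prior: the same (log) value for every theta."""
    return 0.0

def log_prior_centered(theta):
    """Hypothetical prior preferring theta near 0.5 (unnormalized Beta(3, 3))."""
    return 2.0 * math.log(theta) + 2.0 * math.log(1.0 - theta)

def map_estimate(log_prior):
    """argmax over the grid of log P(X | theta) + log P(theta)."""
    return max(grid, key=lambda t: log_likelihood(t, data) + log_prior(t))

theta_ml = max(grid, key=lambda t: log_likelihood(t, data))
theta_map_uniform = map_estimate(log_prior_uniform)
theta_map_centered = map_estimate(log_prior_centered)
print(theta_ml, theta_map_uniform, theta_map_centered)
```

With the uniform prior the MAP estimate coincides with the ML estimate, while the centered prior pulls the estimate from the sample frequency 7/10 toward 0.5.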

Probability Interpretation
Assumption: $\mathrm{y} = f^*(\mathrm{x}) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \beta^{-1})$
The unknown deterministic function is defined as $f^*(x; w^*) = w^{*\top} x$
    All variables are z-normalized, so there is no bias term ($b$)
We have $(\mathrm{y} \mid x) \sim \mathcal{N}(w^{*\top} x, \beta^{-1})$
So, our goal is to find $w$ as close to $w^*$ as possible such that:
    $\hat{y} = \arg\max_{y} P(y \mid \mathrm{x} = x; w) = w^\top x$
    Note that $\hat{y}$ does not depend on $\beta$, so we don't need to solve for $\beta$
ML estimation:
    $\arg\max_{w} P(X \mid w)$
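
The claim that the prediction equals the Gaussian mean and does not depend on the precision can be checked with a small grid search. This is a minimal sketch under assumptions: the weight vector, the input, the two precision values, and the grid of candidate outputs are hypothetical values chosen for illustration.

```python
import math

def gaussian_pdf(y, mean, beta):
    """Density of N(mean, 1/beta) at y, where beta is the precision."""
    return math.sqrt(beta / (2.0 * math.pi)) * math.exp(-0.5 * beta * (y - mean) ** 2)

# Hypothetical weights and input; the conditional mean is w^T x.
w = [0.5, -1.0]
x = [2.0, 0.25]
mean = sum(w_j * x_j for w_j, x_j in zip(w, x))   # 0.5*2.0 - 1.0*0.25 = 0.75

# Candidate outputs y on a grid from -3.00 to 3.00 in steps of 0.01.
candidates = [i / 100 for i in range(-300, 301)]

def predict(beta):
    """argmax over candidate y of the assumed P(y | x; w) with precision beta."""
    return max(candidates, key=lambda y: gaussian_pdf(y, mean, beta))

y_hat_low_precision = predict(0.5)
y_hat_high_precision = predict(20.0)
print(mean, y_hat_low_precision, y_hat_high_precision)
```

Changing the precision only makes the density narrower or wider around the same mode, so both calls return the grid point equal to the mean.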

ML Estimation I
Problem: $\arg\max_{w} P(X \mid w)$
Since we assume i.i.d. samples, we have

$$
\begin{aligned}
P(X \mid w) &= \prod_{i=1}^{N} P(x^{(i)}, y^{(i)} \mid w) = \prod_{i=1}^{N} P(y^{(i)} \mid x^{(i)}, w)\, P(x^{(i)} \mid w) \\
&= \prod_{i=1}^{N} P(y^{(i)} \mid x^{(i)}, w)\, P(x^{(i)}) \\
&= \prod_{i=1}^{N} \mathcal{N}(y^{(i)}; w^\top x^{(i)}, \sigma^2)\, P(x^{(i)}) \\
&= \prod_{i=1}^{N} \sqrt{\frac{\beta}{2\pi}} \exp\!\left(-\frac{\beta}{2}\,\bigl(y^{(i)} - w^\top x^{(i)}\bigr)^2\right) P(x^{(i)})
\end{aligned}
$$

where $\sigma^2 = \beta^{-1}$, and $P(x^{(i)} \mid w) = P(x^{(i)})$ because the inputs do not depend on $w$.
To make the problem tractable, we prefer "sums" over "products"
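
One practical reason to prefer sums over products is numerical: a product of many densities underflows in floating point, while the sum of their logarithms stays finite. The sketch below illustrates this on synthetic data. The sample size, precision, weights, random seed, and function names are assumptions made for the example, and the factors $P(x^{(i)})$ are dropped because they do not depend on $w$.

```python
import math
import random

random.seed(0)
beta = 1.0                 # assumed noise precision
w_true = [0.5, -1.0]       # assumed "true" weights used to generate the data
N = 2000

def dot(a, b):
    return sum(a_j * b_j for a_j, b_j in zip(a, b))

# Synthetic training set: y = w_true^T x + Gaussian noise with variance 1/beta.
xs = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(N)]
ys = [dot(w_true, x) + random.gauss(0.0, 1.0 / math.sqrt(beta)) for x in xs]

def density(y, x, w):
    """N(y; w^T x, 1/beta)."""
    return math.sqrt(beta / (2.0 * math.pi)) * math.exp(-0.5 * beta * (y - dot(w, x)) ** 2)

def likelihood_product(w):
    """Product of the N densities; underflows to 0.0 when N is large."""
    p = 1.0
    for x, y in zip(xs, ys):
        p *= density(y, x, w)
    return p

def log_likelihood_sum(w):
    """Sum of the N log-densities; same argmax as the product, but stays finite."""
    const = 0.5 * math.log(beta / (2.0 * math.pi))
    return sum(const - 0.5 * beta * (y - dot(w, x)) ** 2 for x, y in zip(xs, ys))

product_at_true = likelihood_product(w_true)
log_lik_at_true = log_likelihood_sum(w_true)
log_lik_at_other = log_likelihood_sum([1.5, 0.0])
print(product_at_true, log_lik_at_true, log_lik_at_other)
```

Because the logarithm is monotonically increasing, maximizing the sum of log-densities yields the same $w$ as maximizing the product, and the sum can still be used to compare candidate weights after the product has collapsed to zero.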