

  1. Machine Learning 2007: Lecture 11
     Instructor: Tim van Erven (Tim.van.Erven@cwi.nl)
     Website: www.cwi.nl/~erven/teaching/0708/ml/
     November 28, 2007

  2. Overview
     ● Organisational Matters
     ● Models
     ● Maximum Likelihood Parameter Estimation
     ● Probability Theory
     ● Bayesian Learning
       ✦ The Bayesian Distribution
       ✦ From Prior to Posterior
       ✦ MAP Parameter Estimation
       ✦ Bayesian Predictions
       ✦ Discussion
       ✦ Advanced Issues

  3. Organisational Matters
     Guest lecture:
     ● Next week, Peter Grünwald will give a special guest lecture about minimum description length (MDL) learning.
     This Lecture versus Mitchell:
     ● Chapter 6 up to section 6.5.0 about Bayesian learning.
     ● I present things in a better order.
     ● Mitchell also covers the connection between MAP parameter estimation and least squares linear regression: it is good for you to study this, but I will not ask an exam question about it.

  4. Overview (outline repeated from slide 2)

  5. Prediction Example without Noise
     Training data:
     D = (y_1, ..., y_8) = (0, 1, 0, 1, 0, 1, 0, 1)
     Hypothesis space: H = {h_1, h_2, h_3}
     ● h_1: y_n = 0
     ● h_2: y_n = 0 if n is odd, 1 if n is even
     ● h_3: y_n = 1

  6. Prediction Example without Noise (continued)
     By simple list-then-eliminate:
     ● Only h_2 is consistent with the training data.
     ● Therefore we predict, in accordance with h_2, that y_9 = 0.
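A minimal Python sketch of the list-then-eliminate reasoning above (the data and hypotheses are taken from slide 5; the helper names are my own, not from the lecture):

```python
D = [0, 1, 0, 1, 0, 1, 0, 1]  # y_1, ..., y_8

# Each hypothesis maps the position n (starting at 1) to its predicted label.
hypotheses = {
    "h1": lambda n: 0,
    "h2": lambda n: 0 if n % 2 == 1 else 1,
    "h3": lambda n: 1,
}

# List-then-eliminate: keep only hypotheses that reproduce every training label.
consistent = {
    name: h
    for name, h in hypotheses.items()
    if all(h(n) == y for n, y in enumerate(D, start=1))
}

print(list(consistent))        # ['h2']
print(consistent["h2"](9))     # predicted y_9 = 0
```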

  7. Turning Hypotheses into Distributions
     Models:
     ● We may view each hypothesis as a probability distribution that gives probability 1 to a certain outcome.
     ● A hypothesis space that contains such probabilistic hypotheses is called a (statistical) model.
     The previous hypotheses as distributions: M = {P_1, P_2, P_3}
     ● P_1: P_1(y_n = 0) = 1
     ● P_2: P_2(y_n = 0) = 1 if n is odd, 0 if n is even
     ● P_3: P_3(y_n = 1) = 1

  8. Turning Hypotheses into Distributions (continued)
     List-then-eliminate still works:
     ● A probabilistic hypothesis is consistent with the data if it gives positive probability to the data.

  9. Prediction Example with Noise
     Noise:
     ● Using probabilistic hypotheses is natural when there is noise in the data.
     ● Suppose we observe a measurement error with some (small) probability ε.
     This is easy to incorporate: M = {P_1, P_2, P_3}
     ● P_1: P_1(y_n = 0) = 1 − ε
     ● P_2: P_2(y_n = 0) = 1 − ε if n is odd, ε if n is even
     ● P_3: P_3(y_n = 1) = 1 − ε

  10. Prediction Example with Noise (continued)
      List-then-eliminate does not work any more:
      ● For example, P_1(D = 0, 1, 0, 1, 0, 1, 0, 1) = ε^4 (1 − ε)^4.
      ● Typically many or all probabilistic hypotheses in our model will be consistent with the data.
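A small sketch of why elimination fails under noise: every hypothesis now assigns positive probability to the observed data, so none can be discarded. The noise rate below is an arbitrary illustrative value, not one fixed by the slide:

```python
eps = 0.1  # illustrative noise rate; the slide only assumes eps is small
D = [0, 1, 0, 1, 0, 1, 0, 1]

def p1(n, y):
    """P_1: y_n = 0 with probability 1 - eps."""
    return 1 - eps if y == 0 else eps

def p2(n, y):
    """P_2: y_n = 0 with probability 1 - eps if n is odd, eps if n is even."""
    prob_zero = 1 - eps if n % 2 == 1 else eps
    return prob_zero if y == 0 else 1 - prob_zero

def p3(n, y):
    """P_3: y_n = 1 with probability 1 - eps."""
    return 1 - eps if y == 1 else eps

for name, p in [("P1", p1), ("P2", p2), ("P3", p3)]:
    likelihood = 1.0
    for n, y in enumerate(D, start=1):
        likelihood *= p(n, y)
    # P1 and P3 give eps^4 (1-eps)^4, P2 gives (1-eps)^8: all positive,
    # so no hypothesis is eliminated.
    print(name, likelihood)
```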

  11. Overview (outline repeated from slide 2)

  12. Parameters
      Parameters index the elements of a hypothesis space:
      H = {h_1, h_2, h_3}  ⇔  H = {h_θ | θ ∈ {1, 2, 3}}

  13. Parameters (continued)
      Usually in a convenient way:
      Hypotheses are often expressed in terms of the parameters. In linear regression, for example:
      H = {h_w | w ∈ R^2}  where  h_w: y = w_0 + w_1·x.

  14. Parameters (continued)
      Example where the hypothesis space is a model:
      For example, in prediction of binary outcomes:
      M = {P_θ | θ ∈ {1/4, 1/2, 3/4}}  where  P_θ(y_n = 1) = θ.
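As a rough illustration of parameters as indices, the two parameterised families on these slides can be written as factories that map a parameter to a hypothesis (the Python names are mine, chosen only for illustration):

```python
def h(w0, w1):
    """Linear-regression hypothesis h_w with h_w(x) = w0 + w1 * x."""
    return lambda x: w0 + w1 * x

def P(theta):
    """Bernoulli hypothesis P_theta with P_theta(y_n = 1) = theta."""
    return lambda y: theta if y == 1 else 1 - theta

# One member of H = {h_w | w in R^2} and the three-element model on the slide.
some_h = h(0.5, 2.0)
M = {theta: P(theta) for theta in (0.25, 0.5, 0.75)}

print(some_h(3.0))     # 0.5 + 2.0 * 3.0 = 6.5
print(M[0.75](1))      # P_{3/4}(y_n = 1) = 0.75
```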

  15. Maximum Likelihood Parameter Estimation
      Training data and model:
      D = (y_1, ..., y_8) = (0, 1, 1, 1, 0, 1, 1, 1)
      M = {P_θ | θ ∈ {1/4, 1/2, 3/4}}  where  P_θ(y_n = 1) = θ.
      Likelihood:
      ● θ = 1/4:  P_θ(D) = (1/4)^6 (3/4)^2 = 9/65536
      ● θ = 1/2:  P_θ(D) = (1/2)^8 = 256/65536
      ● θ = 3/4:  P_θ(D) = (3/4)^6 (1/4)^2 = 729/65536
      Maximum likelihood parameter estimation:  θ̂ = arg max_θ P_θ(D) = 3/4
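The likelihood table and the arg max above can be reproduced in a few lines of Python; this is only a sketch of the computation on the slide, with illustrative names:

```python
from math import prod

D = [0, 1, 1, 1, 0, 1, 1, 1]      # six ones, two zeros
thetas = [1/4, 1/2, 3/4]          # the candidate parameters in M

def likelihood(theta, data):
    """P_theta(D) under independent outcomes with P_theta(y_n = 1) = theta."""
    return prod(theta if y == 1 else 1 - theta for y in data)

for theta in thetas:
    print(theta, likelihood(theta, D))   # 9/65536, 256/65536, 729/65536

theta_hat = max(thetas, key=lambda t: likelihood(t, D))
print("ML estimate:", theta_hat)          # 0.75
```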

  16. Overview (outline repeated from slide 2)

  17. Relating Unions and Intersections
      For any two events A and B:
      P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
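A quick numeric check of this identity on a small uniform sample space (the space and events below are my own choice, used only to exercise the formula):

```python
from fractions import Fraction

# Uniform probability on a six-element sample space.
omega = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3}
B = {3, 4}

def P(event):
    return Fraction(len(event & omega), len(omega))

assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))   # 2/3
```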

  18. The Law of Total Probability
      [Figure: the sample space Ω drawn as a set of points a–g]
      ● Suppose Ω = {a, b, c, d, e, f, g}.

  19. The Law of Total Probability (continued)
      ● A partition of Ω cuts it into parts:
        ✦ Let the parts be A_1 = {a, b}, A_2 = {c, d, e} and A_3 = {f, g}.
        ✦ The parts do not overlap, and together cover Ω.

  20. The Law of Total Probability (continued)
      ● B = {b, d, f}

  21. The Law of Total Probability (continued)
      Law of Total Probability:
      P(B) = Σ_{i=1}^{3} P(B ∩ A_i) = Σ_{i=1}^{3} P(B | A_i) P(A_i)
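A short sketch that verifies the law on the slide's partition and event B. The slide does not fix a distribution on Ω, so a uniform one is assumed here purely for illustration:

```python
from fractions import Fraction

omega = {"a", "b", "c", "d", "e", "f", "g"}
partition = [{"a", "b"}, {"c", "d", "e"}, {"f", "g"}]   # A_1, A_2, A_3
B = {"b", "d", "f"}

def P(event):
    """Uniform probability of an event (assumed for illustration)."""
    return Fraction(len(event & omega), len(omega))

def P_cond(event, given):
    """Conditional probability P(event | given)."""
    return P(event & given) / P(given)

lhs = P(B)
rhs = sum(P_cond(B, A_i) * P(A_i) for A_i in partition)
assert lhs == rhs
print(lhs, rhs)   # both 3/7
```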
