

  1. Machine Learning 2007: Lecture 11
     Instructor: Tim van Erven (Tim.van.Erven@cwi.nl)
     Website: www.cwi.nl/~erven/teaching/0708/ml/
     November 28, 2007

  2. Overview
     ● Organisational Matters
     ● Models
     ● Maximum Likelihood Parameter Estimation
     ● Probability Theory
     ● Bayesian Learning
       ✦ The Bayesian Distribution
       ✦ From Prior to Posterior
       ✦ MAP Parameter Estimation
       ✦ Bayesian Predictions
       ✦ Discussion
       ✦ Advanced Issues

  3. Organisational Matters
     Guest lecture:
     ● Next week, Peter Grünwald will give a special guest lecture about minimum description length (MDL) learning.
     This Lecture versus Mitchell:
     ● Chapter 6 up to section 6.5.0 about Bayesian learning.
     ● I present things in a better order.
     ● Mitchell also covers the connection between MAP parameter estimation and least squares linear regression: it is good for you to study this, but I will not ask an exam question about it.

  4. Overview (outline repeated from slide 2)

  5. Prediction Example without Noise
     Training data:
     D = (y_1, ..., y_8) = (0, 1, 0, 1, 0, 1, 0, 1)
     Hypothesis space: H = {h_1, h_2, h_3}
     ● h_1: y_n = 0
     ● h_2: y_n = 0 if n is odd, 1 if n is even
     ● h_3: y_n = 1

  6. Prediction Example without Noise (continued)
     By simple list-then-eliminate:
     ● Only h_2 is consistent with the training data.
     ● Therefore we predict, in accordance with h_2, that y_9 = 0.
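A minimal Python sketch of the list-then-eliminate reasoning above (the data and hypotheses are taken from slide 5; the helper names are my own, not from the lecture):

```python
D = [0, 1, 0, 1, 0, 1, 0, 1]  # y_1, ..., y_8

# Each hypothesis maps the position n (starting at 1) to its predicted label.
hypotheses = {
    "h1": lambda n: 0,
    "h2": lambda n: 0 if n % 2 == 1 else 1,
    "h3": lambda n: 1,
}

# List-then-eliminate: keep only hypotheses that reproduce every training label.
consistent = {
    name: h
    for name, h in hypotheses.items()
    if all(h(n) == y for n, y in enumerate(D, start=1))
}

print(list(consistent))        # ['h2']
print(consistent["h2"](9))     # predicted y_9 = 0
```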

  7. Turning Hypotheses into Distributions
     Models:
     ● We may view each hypothesis as a probability distribution that gives probability 1 to a certain outcome.
     ● A hypothesis space that contains such probabilistic hypotheses is called a (statistical) model.
     The previous hypotheses as distributions: M = {P_1, P_2, P_3}
     ● P_1: P_1(y_n = 0) = 1
     ● P_2: P_2(y_n = 0) = 1 if n is odd, 0 if n is even
     ● P_3: P_3(y_n = 1) = 1

  8. Turning Hypotheses into Distributions (continued)
     List-then-eliminate still works:
     ● A probabilistic hypothesis is consistent with the data if it gives positive probability to the data.

  9. Prediction Example with Noise
     Noise:
     ● Using probabilistic hypotheses is natural when there is noise in the data.
     ● Suppose we observe a measurement error with some (small) probability ε.
     This is easy to incorporate: M = {P_1, P_2, P_3}
     ● P_1: P_1(y_n = 0) = 1 − ε
     ● P_2: P_2(y_n = 0) = 1 − ε if n is odd, ε if n is even
     ● P_3: P_3(y_n = 1) = 1 − ε

  10. Prediction Example with Noise (continued)
      List-then-eliminate does not work any more:
      ● For example, P_1(D = 0, 1, 0, 1, 0, 1, 0, 1) = ε^4 (1 − ε)^4.
      ● Typically many or all probabilistic hypotheses in our model will be consistent with the data.
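A small sketch of why elimination fails under noise: every hypothesis now assigns positive probability to the observed data, so none can be discarded. The noise rate below is an arbitrary illustrative value, not one fixed by the slide:

```python
eps = 0.1  # illustrative noise rate; the slide only assumes eps is small
D = [0, 1, 0, 1, 0, 1, 0, 1]

def p1(n, y):
    """P_1: y_n = 0 with probability 1 - eps."""
    return 1 - eps if y == 0 else eps

def p2(n, y):
    """P_2: y_n = 0 with probability 1 - eps if n is odd, eps if n is even."""
    prob_zero = 1 - eps if n % 2 == 1 else eps
    return prob_zero if y == 0 else 1 - prob_zero

def p3(n, y):
    """P_3: y_n = 1 with probability 1 - eps."""
    return 1 - eps if y == 1 else eps

for name, p in [("P1", p1), ("P2", p2), ("P3", p3)]:
    likelihood = 1.0
    for n, y in enumerate(D, start=1):
        likelihood *= p(n, y)
    # P1 and P3 give eps^4 (1-eps)^4, P2 gives (1-eps)^8: all positive,
    # so no hypothesis is eliminated.
    print(name, likelihood)
```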

  11. Overview (outline repeated from slide 2)

  12. Parameters
      Parameters index the elements of a hypothesis space:
      H = {h_1, h_2, h_3}  ⇔  H = {h_θ | θ ∈ {1, 2, 3}}

  13. Parameters (continued)
      Usually in a convenient way:
      Hypotheses are often expressed in terms of the parameters. In linear regression, for example:
      H = {h_w | w ∈ R^2}  where  h_w: y = w_0 + w_1·x.

  14. Parameters (continued)
      Example where the hypothesis space is a model:
      For example, in prediction of binary outcomes:
      M = {P_θ | θ ∈ {1/4, 1/2, 3/4}}  where  P_θ(y_n = 1) = θ.
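As a rough illustration of parameters as indices, the two parameterised families on these slides can be written as factories that map a parameter to a hypothesis (the Python names are mine, chosen only for illustration):

```python
def h(w0, w1):
    """Linear-regression hypothesis h_w with h_w(x) = w0 + w1 * x."""
    return lambda x: w0 + w1 * x

def P(theta):
    """Bernoulli hypothesis P_theta with P_theta(y_n = 1) = theta."""
    return lambda y: theta if y == 1 else 1 - theta

# One member of H = {h_w | w in R^2} and the three-element model on the slide.
some_h = h(0.5, 2.0)
M = {theta: P(theta) for theta in (0.25, 0.5, 0.75)}

print(some_h(3.0))     # 0.5 + 2.0 * 3.0 = 6.5
print(M[0.75](1))      # P_{3/4}(y_n = 1) = 0.75
```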

  15. Maximum Likelihood Parameter Estimation
      Training data and model:
      D = (y_1, ..., y_8) = (0, 1, 1, 1, 0, 1, 1, 1)
      M = {P_θ | θ ∈ {1/4, 1/2, 3/4}}  where  P_θ(y_n = 1) = θ.
      Likelihood:
      ● θ = 1/4:  P_θ(D) = (1/4)^6 (3/4)^2 = 9/65536
      ● θ = 1/2:  P_θ(D) = (1/2)^8 = 256/65536
      ● θ = 3/4:  P_θ(D) = (3/4)^6 (1/4)^2 = 729/65536
      Maximum likelihood parameter estimation:  θ̂ = arg max_θ P_θ(D) = 3/4
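The likelihood table and the arg max above can be reproduced in a few lines of Python; this is only a sketch of the computation on the slide, with illustrative names:

```python
from math import prod

D = [0, 1, 1, 1, 0, 1, 1, 1]      # six ones, two zeros
thetas = [1/4, 1/2, 3/4]          # the candidate parameters in M

def likelihood(theta, data):
    """P_theta(D) under independent outcomes with P_theta(y_n = 1) = theta."""
    return prod(theta if y == 1 else 1 - theta for y in data)

for theta in thetas:
    print(theta, likelihood(theta, D))   # 9/65536, 256/65536, 729/65536

theta_hat = max(thetas, key=lambda t: likelihood(t, D))
print("ML estimate:", theta_hat)          # 0.75
```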

  16. Overview (outline repeated from slide 2)

  17. Relating Unions and Intersections
      For any two events A and B:
      P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
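A quick numeric check of this identity on a small uniform sample space (the space and events below are my own choice, used only to exercise the formula):

```python
from fractions import Fraction

# Uniform probability on a six-element sample space.
omega = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3}
B = {3, 4}

def P(event):
    return Fraction(len(event & omega), len(omega))

assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))   # 2/3
```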

  18. The Law of Total Probability
      [Figure: the sample space Ω drawn as a set of points a–g]
      ● Suppose Ω = {a, b, c, d, e, f, g}.

  19. The Law of Total Probability (continued)
      ● A partition of Ω cuts it into parts:
        ✦ Let the parts be A_1 = {a, b}, A_2 = {c, d, e} and A_3 = {f, g}.
        ✦ The parts do not overlap, and together cover Ω.

  20. The Law of Total Probability (continued)
      ● B = {b, d, f}

  21. The Law of Total Probability (continued)
      Law of Total Probability:
      P(B) = Σ_{i=1}^{3} P(B ∩ A_i) = Σ_{i=1}^{3} P(B | A_i) P(A_i)
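A short sketch that verifies the law on the slide's partition and event B. The slide does not fix a distribution on Ω, so a uniform one is assumed here purely for illustration:

```python
from fractions import Fraction

omega = {"a", "b", "c", "d", "e", "f", "g"}
partition = [{"a", "b"}, {"c", "d", "e"}, {"f", "g"}]   # A_1, A_2, A_3
B = {"b", "d", "f"}

def P(event):
    """Uniform probability of an event (assumed for illustration)."""
    return Fraction(len(event & omega), len(omega))

def P_cond(event, given):
    """Conditional probability P(event | given)."""
    return P(event & given) / P(given)

lhs = P(B)
rhs = sum(P_cond(B, A_i) * P(A_i) for A_i in partition)
assert lhs == rhs
print(lhs, rhs)   # both 3/7
```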
