Probability Theory for Machine Learning
Chris Cremer September 2015
Outline
- Motivation
- Probability Definitions and Rules
- Probability Distributions
- MLE for Gaussian Parameter Estimation
- MLE and Least Squares
Motivation
- Probability theory provides a consistent framework for the quantification and manipulation of uncertainty.
- It allows us to make optimal predictions given all the information available to us, even though that information may be incomplete or ambiguous.
The sample space Ω is the set of possible outcomes of an experiment. Points ω in Ω are called sample outcomes, realizations, or elements. Subsets of Ω are called events.
Example: if we toss a coin twice, then Ω = {HH, HT, TH, TT}. The event that the first toss is heads is A = {HH, HT}.
Events Ai and Aj are disjoint (mutually exclusive) if Ai ∩ Aj = {} for i ≠ j.
We assign a real number P(A) to every event A, called the probability of A.
Events A and B are conditionally independent given C if P(AB | C) = P(A | C)P(B | C); that is, they become independent after conditioning on C.
If disjoint, are events A and B also independent? No: if P(A) > 0 and P(B) > 0, then P(AB) = 0 ≠ P(A)P(B).
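Disjointness vs. independence can be checked directly by enumerating a small sample space. A minimal sketch (not from the slides) using two fair coin tosses:

```python
from fractions import Fraction

# Sample space for two fair coin tosses; each outcome has probability 1/4.
omega = ["HH", "HT", "TH", "TT"]
p = {w: Fraction(1, 4) for w in omega}

A = {"HH", "HT"}   # first toss is heads
B = {"TH", "TT"}   # first toss is tails -> disjoint from A

def prob(event):
    return sum(p[w] for w in event)

P_A, P_B = prob(A), prob(B)
P_AB = prob(A & B)   # A and B are disjoint, so this is 0

# Disjoint events with positive probability are NOT independent:
# P(AB) = 0, while P(A)P(B) = 1/4.
disjoint = (A & B == set())
independent = (P_AB == P_A * P_B)
```

Knowing that A occurred tells you B did not occur, which is the opposite of independence.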
Example: suppose we know what percentage of the class passed both the final and the midterm, and what percentage passed the final. What percentage of the students who passed the final also passed the midterm?

By the definition of conditional probability:
P(midterm | final) = P(midterm, final) / P(final)
P(A|B) = P(AB) / P(B)                    (definition of conditional probability)
P(A|B) = P(B|A)P(A) / P(B)               (product rule: P(AB) = P(B|A)P(A))
P(A|B) = P(B|A)P(A) / Σ_A P(B|A)P(A)     (law of total probability for P(B))
Example: you test positive for a disease. The test correctly detects the disease 95% of the time, P(T|D=1) = .95, but also gives false positives 10% of the time, P(T|D=0) = .1. Only 1% of the population has the disease: P(D=1) = .01 is the background (prior) probability of the disease. What is the probability that you actually have the disease?

Bayes' Rule:
P(D=1|T) = P(T|D=1)P(D=1) / P(T)

Law of Total Probability:
P(T) = P(T|D=1)P(D=1) + P(T|D=0)P(D=0) = .95×.01 + .1×.99 = .1085

P(D=1|T) = .95 × .01 / .1085 ≈ .087

The probability that you have the disease given that you tested positive is only 8.7%.
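The calculation above is easy to check in code; a minimal sketch using the same numbers (.95 sensitivity, .1 false-positive rate, .01 prior):

```python
# Bayes' rule for the disease-test example:
p_t_given_d1 = 0.95   # P(T|D=1), sensitivity
p_t_given_d0 = 0.10   # P(T|D=0), false-positive rate
p_d1 = 0.01           # P(D=1), prior probability of disease
p_d0 = 1 - p_d1

# Law of total probability: P(T) = P(T|D=1)P(D=1) + P(T|D=0)P(D=0)
p_t = p_t_given_d1 * p_d1 + p_t_given_d0 * p_d0

# Bayes' rule: P(D=1|T) = P(T|D=1)P(D=1) / P(T)
p_d1_given_t = p_t_given_d1 * p_d1 / p_t
```

The posterior is small because the .01 prior heavily outweighs the test's accuracy.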
A random variable X is a function that assigns a real number X(ω) to each outcome ω. Example: flip a coin 10 times and let X(ω) be the number of heads in the sequence ω. If ω = HHTHHTHHTT, then X(ω) = 6.
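The coin-flip example can be made concrete by enumerating every length-10 sequence; a small sketch (the name X follows the text):

```python
from itertools import product

# A random variable assigns a number to each outcome: here X(w) counts
# the heads in a sequence w of 10 coin tosses.
def X(omega):
    return omega.count("H")

# Enumerating all 2^10 equally likely sequences gives the distribution of X.
counts = {}
for seq in product("HT", repeat=10):
    k = X("".join(seq))
    counts[k] = counts.get(k, 0) + 1
p_X = {k: c / 2**10 for k, c in counts.items()}
```

For a fair coin, P(X = 6) comes out to C(10, 6)/1024.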
Random variables can be discrete or continuous:
- Univariate discrete random variable: takes values in a countable set; described by a probability mass function.
- Univariate continuous random variable: takes values in a continuum; described by a probability density function.
Bernoulli Distribution (Discrete)
P(x | μ) = μ^x (1 − μ)^(1−x), x ∈ {0, 1}
μ = probability of success

Example: probability of flipping heads (x = 1) with an unfair coin where μ = .6:
P(x=1 | μ=.6) = .6^1 (1 − .6)^0 = .6
Binomial Distribution (Discrete)
P(m | N, μ) = (N choose m) μ^m (1 − μ)^(N−m)
Example: probability of flipping heads m times out of N flips, when each flip lands heads with probability 0.2.
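Both distributions are a few lines of code; a sketch using Python's `math.comb` (the specific values m = 3, N = 10 are made-up choices for illustration):

```python
from math import comb

# Bernoulli pmf: P(x | mu) = mu^x (1 - mu)^(1 - x), x in {0, 1}
def bernoulli_pmf(x, mu):
    return mu**x * (1 - mu)**(1 - x)

# Binomial pmf: P(m | N, mu) = C(N, m) mu^m (1 - mu)^(N - m)
def binomial_pmf(m, N, mu):
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

# Heads (x = 1) with the unfair coin from the Bernoulli example, mu = 0.6
p_heads = bernoulli_pmf(1, 0.6)

# Probability of m = 3 heads in N = 10 flips, heads probability 0.2
p_3_of_10 = binomial_pmf(3, 10, 0.2)
```

Summing the binomial pmf over all m from 0 to N gives 1, as any pmf must.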
Multinomial Distribution (Discrete)
Generalizes the Bernoulli distribution to K categories instead of just binary (success/fail). A single draw lands in exactly one of the K categories:
P(x | μ) = Π_k μ_k^(x_k), where x is a binary vector with a single 1 marking the chosen category
μ = (μ1, …, μK): probability of success for each category, with Σ_k μ_k = 1
For N independent draws, the counts (m1, …, mK) for each category follow
P(m1, …, mK | N, μ) = (N! / (m1! ⋯ mK!)) Π_k μ_k^(m_k)
μ = (μ1, …, μK): probability of success for each category, with Σ_k μ_k = 1

Example: rolling a 2 on a fair die 5 times out of 10 rolls:
m = [0, 5, 0, 0, 0, 0] for the five 2s, N = 10, μ = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]. Tracking only the face '2' against 'anything else', this reduces to the binomial case: (10 choose 5)(1/6)^5(5/6)^5.
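A hedged sketch of the multinomial pmf; to match the event in the die example, the five non-'2' faces are lumped into one "other" category, which reduces the calculation to the binomial case:

```python
from math import comb, factorial, prod

# Multinomial pmf: P(m1..mK | N, mu) = N!/(m1!...mK!) * prod_k mu_k^m_k,
# with N = sum(m) and sum(mu) = 1.
def multinomial_pmf(m, mu):
    N = sum(m)
    coef = factorial(N)
    for mk in m:
        coef //= factorial(mk)
    return coef * prod(mu_k ** m_k for mu_k, m_k in zip(mu, m))

# "Exactly five 2s in 10 rolls": lump the other faces into one category,
# so m = [5, 5] (five 2s, five others) and mu = [1/6, 5/6].
p_five_twos = multinomial_pmf([5, 5], [1/6, 5/6])
```

With K = 2 the multinomial coefficient N!/(m1! m2!) is exactly (N choose m1), so the result agrees with the binomial formula.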
Gaussian (Normal) Distribution (Continuous)
A univariate Gaussian over x, with mean μ and variance σ², takes the form
N(x | μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))
The factor 1/√(2πσ²) is the normalization constant; the exponential term penalizes the squared distance of x from the mean, scaled by the variance.
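A quick sketch of the density, plus a numerical check that the normalization constant really makes it integrate to 1 (the values μ = 0, σ² = 2 are arbitrary choices for the check):

```python
from math import exp, pi, sqrt

# Univariate Gaussian density N(x | mu, sigma^2)
def gaussian_pdf(x, mu, var):
    return exp(-(x - mu)**2 / (2 * var)) / sqrt(2 * pi * var)

# Riemann-sum check that the density integrates to ~1 over a wide interval.
mu, var = 0.0, 2.0
dx = 0.001
area = sum(gaussian_pdf(mu + i * dx, mu, var) * dx
           for i in range(-20000, 20000))
```

The density peaks at x = μ with height 1/√(2πσ²), which is exactly what the normalization constant contributes.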
See ‘Gibbs Sampling for the Uninitiated’ for a straightforward introduction to parameter estimation: http://www.umiacs.umd.edu/~resnik/pubs/LAMP-TR-153.pdf
A set of random variables is independent and identically distributed (IID) if each variable has the same probability distribution as the others and all are mutually independent.
MLE for Gaussian Parameter Estimation
Suppose we observe data X = {x1, …, xN} drawn from a Gaussian. What are the distribution's mean (μ) and variance (σ²)? The likelihood of the data for a certain μ and σ² (assuming IID) is
Likelihood = p(X | μ, σ²) = Π_n N(xn | μ, σ²)
Recall: if IID, P(ABC) = P(A)P(B)P(C).
Taking the log makes the maximization easier: the product becomes a sum.
Log Likelihood = ln p(X | μ, σ²) = −(1 / 2σ²) Σ_n (xn − μ)² − (N/2) ln σ² − (N/2) ln(2π)
Maximizing with respect to μ and σ² gives the maximum likelihood estimates
μ_ML = (1/N) Σ_n xn    and    σ²_ML = (1/N) Σ_n (xn − μ_ML)²
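The closed-form estimates can be sanity-checked numerically; a sketch with made-up data, comparing the log likelihood at the ML estimates against nearby parameter values:

```python
from math import log, pi

# MLE for a Gaussian: the sample mean and (biased) sample variance.
# The data values below are made up for illustration.
data = [2.1, 1.9, 2.4, 2.0, 1.6, 2.5, 1.8, 2.2]
N = len(data)

mu_ml = sum(data) / N
var_ml = sum((x - mu_ml)**2 for x in data) / N

def log_likelihood(mu, var):
    # ln p(X|mu,var) = -1/(2 var) sum (xn - mu)^2 - N/2 ln var - N/2 ln(2 pi)
    return (-sum((x - mu)**2 for x in data) / (2 * var)
            - N / 2 * log(var) - N / 2 * log(2 * pi))

best = log_likelihood(mu_ml, var_ml)
```

Perturbing either parameter away from its ML estimate can only lower the log likelihood.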
To see proofs for these derivations: http://www.statlect.com/normal_distribution_maximum_likelihood.htm
Least Squares
Suppose you are given a sequence of data points (x1, t1), …, (xn, tn) and asked to find the "best fit" line passing through those points. To answer this, we must say precisely how to tell whether one line is "fitter" than another. A standard measure is the sum-of-squares error:
E(w) = ½ Σ_n (y(xn, w) − tn)²
For a good discussion of Maximum likelihood estimators and least squares see http://people.math.gatech.edu/~ecroot/3225/maximum_likelihood.pdf
MLE and Least Squares
The model y(x, w) is estimating the target t. We can express our uncertainty about t using a probability distribution: assume that, given x, the target t follows a Gaussian distribution with a mean equal to the value y(x, w):
p(t | x, w, β) = N(t | y(x, w), β⁻¹)
where β is the precision parameter (inverse variance).
[Figure: the red line shows the fitted curve y(x, w); the green lines illustrate the Gaussian noise around it.]
We determine the values of the unknown parameters w and β by maximum likelihood. The log likelihood is
ln p(t | X, w, β) = −(β/2) Σ_n (y(xn, w) − tn)² + (N/2) ln β − (N/2) ln(2π)
When maximizing with respect to w, the last two terms do not depend on w, so they can be omitted. Likewise, scaling by the positive constant β/2 does not alter the location of the maximum with respect to w, so it can be ignored. Maximizing the likelihood is therefore equivalent, as far as w is concerned, to minimizing the sum-of-squares error function. The sum-of-squares error arises naturally from maximizing likelihood under the assumption of a Gaussian noise distribution.
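A sketch of the equivalence for a straight-line fit, with made-up data: minimizing the sum-of-squares error gives w, and the ML precision β is then the inverse of the average squared residual:

```python
# Fitting y(x, w) = w0 + w1*x by least squares, which is the maximum
# likelihood solution under Gaussian noise. Data values are made up.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ts = [1.1, 2.9, 5.2, 6.8, 9.1]
N = len(xs)

x_mean = sum(xs) / N
t_mean = sum(ts) / N

# Closed-form least-squares solution for a straight line
w1 = (sum((x - x_mean) * (t - t_mean) for x, t in zip(xs, ts))
      / sum((x - x_mean)**2 for x in xs))
w0 = t_mean - w1 * x_mean

def sse(w0_, w1_):
    return sum((w0_ + w1_ * x - t)**2 for x, t in zip(xs, ts))

# MLE of the precision: 1/beta = average squared residual
beta_ml = N / sse(w0, w1)
```

Any perturbation of (w0, w1) away from the least-squares solution increases the sum-of-squares error, hence decreases the likelihood.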
Model selection: split your data into a Training Set, a Validation Set, and Held Out Data (a test set).

Model                  | Training Set Error | Validation Set Error
Linear                 | ++++               | +++++
Quadratic              | +++                | ++++++
Cubic                  | ++                 | +++++++
4th degree polynomial  | +                  | ++++++++

As model complexity grows, training error falls while validation error grows: the more complex models are overfitting.
How well your model generalizes to new data is what matters!
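The table's pattern can be reproduced with a toy sketch (all data values made up): a degree-4 polynomial interpolates five training points exactly, driving training error to zero, but a straight line does better on held-out validation points:

```python
# Overfitting demo: interpolating polynomial vs. least-squares line.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_t = [0.2, 1.3, 1.8, 3.4, 3.9]   # roughly t = x, plus noise
val_x = [0.5, 1.5, 2.5, 3.5]
val_t = [0.6, 1.4, 2.6, 3.4]

def lagrange_fit(xs, ts):
    # Degree-(n-1) Lagrange polynomial through every training point
    def y(x):
        total = 0.0
        for i, (xi, ti) in enumerate(zip(xs, ts)):
            term = ti
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return y

def linear_fit(xs, ts):
    # Closed-form least-squares straight line
    n = len(xs)
    xm, tm = sum(xs) / n, sum(ts) / n
    w1 = (sum((x - xm) * (t - tm) for x, t in zip(xs, ts))
          / sum((x - xm)**2 for x in xs))
    w0 = tm - w1 * xm
    return lambda x: w0 + w1 * x

def sse(model, xs, ts):
    return sum((model(x) - t)**2 for x, t in zip(xs, ts))

poly4 = lagrange_fit(train_x, train_t)
line = linear_fit(train_x, train_t)

poly4_train = sse(poly4, train_x, train_t)   # zero: it interpolates
poly4_val = sse(poly4, val_x, val_t)         # large: oscillates between points
line_train = sse(line, train_x, train_t)     # nonzero
line_val = sse(line, val_x, val_t)           # small: generalizes better
```

The simpler model loses on training error but wins where it matters: on data it has not seen.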