Introduction to Probabilistic Machine Learning
Piyush Rai
Dept. of CSE, IIT Kanpur
(Mini-course 1) Nov 03, 2015
Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1
Machine Learning

Detecting trends/patterns in the data
Making predictions about future data

Two schools of thought:
- Learning as optimization: fit a model to minimize some loss function
- Learning as inference: infer the parameters of the data-generating distribution

The two are not really completely disjoint ways of thinking about learning.
A series of 4 talks:
- Introduction to Probabilistic and Bayesian Machine Learning (today)
- Case Study: Bayesian Linear Regression, Approx. Bayesian Inference (Nov 5)
- Nonparametric Bayesian modeling for function approximation (Nov 7)
Assume data X = {x_1, . . . , x_N} generated from a probabilistic model; the data are usually assumed i.i.d. (independent and identically distributed):

x_1, . . . , x_N ∼ p(x|θ)

For i.i.d. data, the probability of the observed data X given model parameters θ is

p(X|θ) = p(x_1, . . . , x_N|θ) = ∏_{n=1}^{N} p(x_n|θ)

Here p(x_n|θ) denotes the likelihood w.r.t. data point n; the form of p(x_n|θ) depends on the type/characteristics of the data.
We wish to estimate the parameters θ from the observed data {x_1, . . . , x_N}. MLE does this by finding the θ that maximizes the (log-)likelihood p(X|θ):

θ̂ = arg max_θ log p(X|θ) = arg max_θ log ∏_{n=1}^{N} p(x_n|θ) = arg max_θ Σ_{n=1}^{N} log p(x_n|θ)

MLE now reduces to solving an optimization problem w.r.t. θ. MLE has some nice theoretical properties (e.g., consistency as N → ∞).
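As a concrete sketch of the optimization view (with hypothetical data, not an example from the slides), the Bernoulli log-likelihood of a sequence of coin tosses can be maximized numerically; for this model the MLE is known in closed form to be the sample mean, which the numerical answer should match.

```python
import math

# A minimal sketch (hypothetical data): MLE for i.i.d. Bernoulli coin tosses
# by direct maximization of the log-likelihood over a grid of theta values.
# The closed-form MLE for this model is the sample mean.

def log_likelihood(data, theta):
    # sum_n log p(x_n | theta) for the Bernoulli likelihood
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in data)

def mle_grid(data, grid_size=9999):
    # evaluate theta on a grid strictly inside (0, 1) and keep the maximizer
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(grid, key=lambda t: log_likelihood(data, t))

data = [1, 1, 0, 1, 0, 1, 1, 0]     # 5 heads out of 8 tosses
theta_hat = mle_grid(data)
print(round(theta_hat, 3))          # matches the sample mean 5/8 = 0.625
```

The grid search stands in for whatever optimizer the model calls for; with a differentiable likelihood one would normally use gradient-based methods instead.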
Often, we might a priori know something about the parameters. A prior distribution p(θ) can encode/specify this knowledge, and Bayes' rule gives us the posterior distribution over θ, which reflects our updated knowledge about θ after observing the data:

p(θ|X) = p(X|θ)p(θ) / p(X) = p(X|θ)p(θ) / ∫ p(X|θ)p(θ) dθ

Note: θ is now a random variable.
MAP estimation finds the θ that maximizes the posterior p(θ|X) ∝ p(X|θ)p(θ):

θ̂ = arg max_θ log ∏_{n=1}^{N} p(x_n|θ) p(θ) = arg max_θ [ Σ_{n=1}^{N} log p(x_n|θ) + log p(θ) ]

MAP now reduces to solving an optimization problem w.r.t. θ. The objective function is very similar to MLE's, except for the extra log p(θ) term; in some sense, MAP is just a "regularized" MLE.
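A small sketch (with hypothetical numbers) of how the log p(θ) term changes the answer for the Bernoulli coin model with a Beta(a, b) prior: maximizing the penalized objective gives the closed-form mode (#heads + a − 1) / (N + a + b − 2).

```python
# A minimal sketch (hypothetical numbers): MAP estimation for the Bernoulli
# coin model with a Beta(a, b) prior. Maximizing sum_n log p(x_n|theta) +
# log p(theta) gives the closed-form mode (#heads + a - 1) / (N + a + b - 2).

def map_estimate(data, a, b):
    heads, n = sum(data), len(data)
    return (heads + a - 1) / (n + a + b - 2)

data = [1, 1, 0, 1, 0, 1, 1, 0]      # 5 heads out of 8 tosses
print(map_estimate(data, a=1, b=1))  # Beta(1,1) is flat, so MAP = MLE = 0.625
print(map_estimate(data, a=3, b=3))  # a stronger prior pulls the estimate toward 0.5
```

With the flat Beta(1, 1) prior the log p(θ) term is constant and MAP coincides with MLE; stronger priors act exactly like a regularizer.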
Both MLE and MAP only give a point estimate (a single best answer) of θ. How can we capture/quantify the uncertainty in θ? We need to infer the full posterior distribution:

p(θ|X) = p(X|θ)p(θ) / p(X) = p(X|θ)p(θ) / ∫ p(X|θ)p(θ) dθ

This requires doing "fully Bayesian" inference, which is sometimes a somewhat easy problem and sometimes a (very) hard one.
We want to estimate a coin's bias θ ∈ (0, 1) based on N tosses.

The likelihood model: {x_1, . . . , x_N} ∼ Bernoulli(θ), with

p(x_n|θ) = θ^{x_n} (1 − θ)^{1−x_n}

The prior: θ ∼ Beta(a, b), with

p(θ|a, b) = [Γ(a + b) / (Γ(a)Γ(b))] θ^{a−1} (1 − θ)^{b−1}

The posterior:

p(θ|X) ∝ ∏_{n=1}^{N} p(x_n|θ) p(θ|a, b)
       ∝ ∏_{n=1}^{N} θ^{x_n} (1 − θ)^{1−x_n} θ^{a−1} (1 − θ)^{b−1}
       = θ^{a + Σ_{n=1}^{N} x_n − 1} (1 − θ)^{b + N − Σ_{n=1}^{N} x_n − 1}

Thus the posterior is Beta(a + Σ_{n=1}^{N} x_n, b + N − Σ_{n=1}^{N} x_n). Here, the posterior has the same form as the prior (both Beta); this also makes it very easy to perform online inference.
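The online-inference point can be made concrete: because the posterior stays Beta, each toss simply increments one of the two parameters. A minimal sketch, with a hypothetical Beta(2, 2) prior:

```python
# A minimal sketch of the conjugate update above, done online: each Bernoulli
# observation just increments one of the two Beta parameters.

def update_beta(a, b, toss):
    # Beta(a, b) prior + one observation x -> Beta(a + x, b + 1 - x) posterior
    return a + toss, b + 1 - toss

a, b = 2, 2                              # a (hypothetical) Beta(2, 2) prior
for toss in [1, 1, 0, 1, 0, 1, 1, 0]:    # 5 heads out of 8 tosses
    a, b = update_beta(a, b, toss)

print(a, b)          # posterior Beta(2 + 5, 2 + 3) = Beta(7, 5)
print(a / (a + b))   # posterior mean of theta: 7/12
```

Processing the tosses one at a time gives exactly the same posterior as the batch formula Beta(a + Σ x_n, b + N − Σ x_n).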
Recall

p(θ|X) = p(X|θ)p(θ) / p(X)

Given some data distribution (likelihood) p(X|θ) and a prior p(θ) = π(θ|α), the prior is conjugate if the posterior also has the same form, i.e.,

p(θ|α, X) = p(X|θ)π(θ|α) / p(X) = π(θ|α*)

Several pairs of distributions are conjugate to each other, e.g.:
- Gaussian-Gaussian
- Beta-Bernoulli
- Beta-Binomial
- Gamma-Poisson
- Dirichlet-Multinomial
- ..
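Each pair in the list gives an update as simple as the Beta-Bernoulli one. A minimal sketch of Gamma-Poisson (hypothetical counts and hyperparameters): with x_n ∼ Poisson(λ) and a Gamma(α, β) prior on λ in the shape-rate parameterization, the posterior is Gamma(α + Σ_n x_n, β + N).

```python
# A minimal sketch of another pair from the list, Gamma-Poisson: with
# x_n ~ Poisson(lam) and a Gamma(alpha, beta) prior on lam (shape-rate
# parameterization), the posterior is Gamma(alpha + sum_n x_n, beta + N).

def gamma_poisson_posterior(alpha, beta, counts):
    return alpha + sum(counts), beta + len(counts)

alpha, beta = 2.0, 1.0           # hypothetical prior hyperparameters
counts = [3, 0, 2, 4, 1]         # observed Poisson counts
a_post, b_post = gamma_poisson_posterior(alpha, beta, counts)
print(a_post, b_post)            # Gamma(12.0, 6.0)
print(a_post / b_post)           # posterior mean of lam: 2.0
```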
We want to learn a classifier θ for predicting the label y ∈ {−1, +1} of a point x. Assume a logistic likelihood model for the labels:

p(y_n|θ) = 1 / (1 + exp(−y_n θ⊤x_n))

The prior: θ ∼ Normal(µ, Σ) (a Gaussian, not conjugate to the logistic):

p(θ|µ, Σ) ∝ exp(−(1/2)(θ − µ)⊤Σ⁻¹(θ − µ))

The posterior p(θ|X) ∝ ∏_{n=1}^{N} p(y_n|θ) p(θ|µ, Σ) does not have a closed form. Approximate Bayesian inference is needed in such cases:
- Sampling-based approximations: MCMC methods
- Optimization-based approximations: Variational Bayes, Laplace approximation, etc.
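One of the optimization-based approximations above, the Laplace approximation, can be sketched for a one-dimensional θ (hypothetical data, not the lecture's implementation): find the posterior mode by gradient ascent, then approximate the posterior by a Gaussian whose variance is the inverse negative curvature of the log-posterior at that mode.

```python
import math

# A minimal sketch (1-D theta, hypothetical data) of the Laplace approximation
# for Bayesian logistic regression with a zero-mean Gaussian prior: find the
# posterior mode by gradient ascent, then use the inverse negative curvature
# at the mode as the approximate posterior variance.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def laplace_logistic(xs, ys, var_prior=4.0, lr=0.1, steps=2000):
    theta = 0.0
    for _ in range(steps):
        # gradient of the log posterior:
        # sum_n y_n x_n (1 - sigmoid(y_n theta x_n)) - theta / var_prior
        grad = sum(y * x * (1 - sigmoid(y * theta * x)) for x, y in zip(xs, ys))
        grad -= theta / var_prior
        theta += lr * grad
    # negative second derivative (precision) of the log posterior at the mode
    prec = sum(x * x * sigmoid(theta * x) * (1 - sigmoid(theta * x)) for x in xs)
    prec += 1.0 / var_prior
    return theta, 1.0 / prec   # approximate posterior mean and variance

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [-1, -1, -1, 1, 1, 1]         # labels in {-1, +1}
mean, var = laplace_logistic(xs, ys)
print(mean > 0, var > 0)
```

Note that this data is linearly separable, so the MLE would diverge; the Gaussian prior keeps the mode finite, which is another way of seeing the "regularized MLE" interpretation.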
Our estimate of θ is not a single value (a "point") but a distribution. We can model and quantify the uncertainty (or "variance") in θ via p(θ|X), and can use this uncertainty in various tasks such as diagnosis and predictions, e.g., making predictions by averaging over all possible values of θ:

p(y|x, X, Y) = E_{p(θ|X,Y)}[p(y|x, θ)] = ∫ p(y|x, θ) p(θ|X, Y) dθ

This also allows quantifying the uncertainty in the predictions.
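For the coin model this averaging has a simple form: the posterior-predictive probability of heads is E_{p(θ|X)}[θ], which for a Beta(a, b) posterior is a / (a + b). A minimal Monte Carlo sketch (hypothetical posterior Beta(7, 5)):

```python
import random

# A minimal sketch of averaging over theta for the coin model: the posterior-
# predictive probability of heads is E_{p(theta|X)}[theta], approximated here
# by Monte Carlo samples from a (hypothetical) Beta(7, 5) posterior.
# The exact answer is 7 / (7 + 5) = 7/12.

def predictive_heads(a, b, n_samples=100000, seed=0):
    rng = random.Random(seed)
    samples = (rng.betavariate(a, b) for _ in range(n_samples))
    return sum(samples) / n_samples

p_heads = predictive_heads(7, 5)
print(round(p_heads, 2))    # close to 7/12 ≈ 0.58
```

Here the integral is tractable, so sampling is overkill; the same Monte Carlo recipe is what one falls back on when the posterior has no closed form.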
Hierarchical model construction: the parameters can themselves depend on hyperparameters, and the hyperparameters need not be hand-tuned but can be inferred from the data, e.g., by maximizing the marginal likelihood p(X|α).

This provides robustness, e.g., when learning the sparsity hyperparameter in sparse regression, or learning kernel hyperparameters in kernel methods.
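A crude sketch (hypothetical data and grid) of maximizing the marginal likelihood, for the Beta-Bernoulli model where p(X|a, b) is available in closed form as a ratio of Beta functions:

```python
import math

# A minimal sketch (hypothetical data and grid) of hyperparameter estimation
# by maximizing the marginal likelihood. For the Beta-Bernoulli model,
# p(X | a, b) = B(a + h, b + t) / B(a, b), where h and t are the numbers of
# heads and tails and B is the Beta function.

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(data, a, b):
    h = sum(data)
    t = len(data) - h
    return log_beta(a + h, b + t) - log_beta(a, b)

data = [1, 1, 0, 1, 0, 1, 1, 0]
# crude grid search over symmetric hyperparameters a = b
grid = [0.5, 1.0, 2.0, 5.0, 10.0]
best = max(grid, key=lambda c: log_marginal(data, c, c))
print(best)
```

In richer models the marginal likelihood has no closed form and this "empirical Bayes" maximization itself requires approximate inference.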
We can also introduce "local parameters" (latent variables) associated with each data point and infer those as well. This is used in many problems: Gaussian mixture models, probabilistic principal component analysis, factor analysis, topic models.
This enables a modular architecture: simple models can be neatly combined to solve more complex problems. It also allows jointly learning across multiple data sets (sometimes known as multitask learning or transfer learning).
Nonparametric Bayesian modeling offers a principled way to learn the model size, e.g., how many clusters (Gaussian mixture model or graph clustering), how many basis vectors (PCA) or dictionary elements (sparse coding or dictionary learning), how many topics (topic models such as LDA), etc. Nonparametric Bayesian modeling allows the model size to grow with the data.
Sequential data acquisition or "active learning": we can check how confident the learned model is w.r.t. a new data point. For Bayesian linear regression:

p(θ|λ) = Normal(θ|0, λ²I)                  (prior)
p(y|x, θ) = Normal(y|θ⊤x, σ²)              (likelihood)
p(θ|Y, X) = Normal(θ|µ_θ, Σ_θ)             (posterior)
p(y_0|x_0, Y, X) = Normal(y_0|µ_0, σ_0²)   (predictive dist.)
µ_0 = µ_θ⊤ x_0                             (predictive mean)
σ_0² = σ² + x_0⊤ Σ_θ x_0                   (predictive variance)

This gives a strategy to choose data points sequentially, for improved learning under a budget on the amount of data available.
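The quantities above can be sketched in one dimension (scalar θ, hypothetical data): the predictive variance σ² + x_0² Σ_θ grows for inputs far from the training data, which is exactly the signal active learning exploits.

```python
# A minimal sketch (1-D weight theta, hypothetical data) of the Bayesian
# linear regression quantities above: posterior over theta with a zero-mean
# prior, then the predictive mean and variance at a new input x0.

def posterior(xs, ys, sigma2=0.25, lam2=1.0):
    # posterior precision, variance, and mean for scalar theta
    prec = sum(x * x for x in xs) / sigma2 + 1.0 / lam2
    var = 1.0 / prec
    mean = var * sum(x * y for x, y in zip(xs, ys)) / sigma2
    return mean, var

def predictive(x0, mean, var, sigma2=0.25):
    # predictive mean mu_0 = mean * x0, variance sigma^2 + x0^2 * var
    return mean * x0, sigma2 + x0 * x0 * var

xs = [0.5, 1.0, 1.5]
ys = [1.1, 2.0, 2.9]                  # roughly y = 2x plus noise
m, v = posterior(xs, ys)
mu_near, var_near = predictive(1.0, m, v)
mu_far, var_far = predictive(5.0, m, v)
print(var_far > var_near)             # more uncertainty far from the data
```

An active learner would query the label of whichever candidate point currently has the largest predictive variance.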
- Case study on Bayesian sparse linear regression
- Hyperparameter estimation
- Introduction to approximate Bayesian inference