SLIDE 1

Introduction to Probabilistic Machine Learning

Piyush Rai

  • Dept. of CSE, IIT Kanpur

(Mini-course 1) Nov 03, 2015


SLIDES 2-4

Machine Learning

  • Detecting trends/patterns in the data
  • Making predictions about future data

Two schools of thought:

  • Learning as optimization: fit a model to minimize some loss function
  • Learning as inference: infer the parameters of the data-generating distribution

The two are not really completely disjoint ways of thinking about learning.

SLIDE 5

Plan for the mini-course

A series of 4 talks:

  • Introduction to Probabilistic and Bayesian Machine Learning (today)
  • Case Study: Bayesian Linear Regression, Approx. Bayesian Inference (Nov 5)
  • Nonparametric Bayesian modeling for function approximation (Nov 7)
  • Nonparam. Bayesian modeling for clustering/dimensionality reduction (Nov 8)

SLIDES 6-8

Machine Learning via Probabilistic Modeling

Assume data X = {x_1, . . . , x_N} generated from a probabilistic model, with the data usually assumed i.i.d. (independent and identically distributed):

x_1, . . . , x_N ∼ p(x|θ)

For i.i.d. data, the probability of the observed data X given model parameters θ factorizes:

p(X|θ) = p(x_1, . . . , x_N|θ) = ∏_{n=1}^{N} p(x_n|θ)

  • p(x_n|θ) denotes the likelihood w.r.t. data point n
  • The form of p(x_n|θ) depends on the type/characteristics of the data
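To make the i.i.d. factorization concrete, here is a minimal Python sketch. It assumes, purely for illustration, a Gaussian p(x|θ) with known unit variance and θ the mean (the slide leaves the form of p(x_n|θ) open) and evaluates log p(X|θ) as a sum of per-point log-likelihoods:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(X, theta, sigma=1.0):
    # log p(X|theta) = sum_n log p(x_n|theta) for i.i.d. data
    return np.sum(norm.logpdf(X, loc=theta, scale=sigma))

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=100)   # toy data with true theta = 2
print(log_likelihood(X, theta=2.0))            # high near the true parameter
print(log_likelihood(X, theta=-1.0))           # much lower for a bad theta
```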

SLIDE 9

Some common probability distributions

SLIDES 10-14

Maximum Likelihood Estimation (MLE)

  • We wish to estimate the parameters θ from observed data {x_1, . . . , x_N}
  • MLE does this by finding the θ that maximizes the (log-)likelihood p(X|θ):

θ̂ = arg max_θ log p(X|θ) = arg max_θ log ∏_{n=1}^{N} p(x_n|θ) = arg max_θ ∑_{n=1}^{N} log p(x_n|θ)

  • MLE now reduces to solving an optimization problem w.r.t. θ (see the sketch below)
  • MLE has some nice theoretical properties (e.g., consistency as N → ∞)
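A minimal sketch of MLE as numerical optimization, again assuming (for illustration only) a Gaussian likelihood with known unit variance, so the result can be checked against the known closed form, the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
X = rng.normal(loc=3.0, scale=1.0, size=500)   # toy data, true mean 3

# arg max_theta log p(X|theta) == arg min_theta of the negative log-likelihood
nll = lambda theta: -np.sum(norm.logpdf(X, loc=theta, scale=1.0))

theta_hat = minimize_scalar(nll).x
print(theta_hat, X.mean())   # numerical MLE matches the sample mean
```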

SLIDE 15

Injecting Prior Knowledge

  • Often, we might a priori know something about the parameters
  • A prior distribution p(θ) can encode/specify this knowledge
  • Bayes' rule gives us the posterior distribution over θ, which reflects our updated knowledge about θ after observing the data:

p(θ|X) = p(X|θ)p(θ) / p(X) = p(X|θ)p(θ) / ∫_θ p(X|θ)p(θ) dθ ∝ Likelihood × Prior

  • Note: θ is now a random variable
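A minimal numerical illustration of this update, approximating the evidence p(X) in the denominator by a Riemann sum over a grid of θ values; the Beta(2, 2) prior and the coin-toss data are hypothetical (a coin example is worked out analytically on a later slide):

```python
import numpy as np
from scipy.stats import beta, bernoulli

X = np.array([1, 0, 1, 1, 0, 1, 1, 1])        # hypothetical coin tosses
grid = np.linspace(0.001, 0.999, 999)          # grid over theta
dtheta = grid[1] - grid[0]

prior = beta.pdf(grid, 2, 2)                   # assumed Beta(2, 2) prior
lik = np.array([bernoulli.pmf(X, t).prod() for t in grid])

unnorm = lik * prior                           # Likelihood x Prior
posterior = unnorm / (unnorm.sum() * dtheta)   # denominator approximates p(X)
print(grid[posterior.argmax()])                # posterior mode
```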

SLIDE 16

Maximum-a-Posteriori (MAP) Estimation

MAP estimation finds the θ that maximizes the posterior p(θ|X) ∝ p(X|θ)p(θ):

θ̂ = arg max_θ log ∏_{n=1}^{N} p(x_n|θ)p(θ) = arg max_θ [∑_{n=1}^{N} log p(x_n|θ) + log p(θ)]

  • MAP now reduces to solving an optimization problem w.r.t. θ
  • The objective function is very similar to MLE's, except for the log p(θ) term
  • In some sense, MAP is just a “regularized” MLE (see the sketch below)
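The same sketch as for MLE with the extra log p(θ) term added, here a hypothetical Normal(0, τ²) prior on the Gaussian mean. The MAP estimate comes out shrunk from the MLE toward the prior mean, which is the "regularized MLE" effect:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(2)
X = rng.normal(loc=3.0, scale=1.0, size=20)   # small N, so the prior matters
tau = 1.0                                     # hypothetical prior std-dev

def neg_log_posterior(theta):
    log_lik = np.sum(norm.logpdf(X, loc=theta, scale=1.0))
    log_prior = norm.logpdf(theta, loc=0.0, scale=tau)   # the log p(theta) term
    return -(log_lik + log_prior)

theta_map = minimize_scalar(neg_log_posterior).x
print(X.mean(), theta_map)   # MAP is shrunk from the MLE toward the prior mean 0
```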

SLIDES 17-19

Bayesian Learning

  • Both MLE and MAP only give a point estimate (a single best answer) for θ
  • How can we capture/quantify the uncertainty in θ?
  • We need to infer the full posterior distribution:

p(θ|X) = p(X|θ)p(θ) / p(X) = p(X|θ)p(θ) / ∫_θ p(X|θ)p(θ) dθ ∝ Likelihood × Prior

  • This requires doing “fully Bayesian” inference
  • Inference is sometimes a fairly easy and sometimes a (very) hard problem

SLIDE 20

A Simple Example of Bayesian Inference

  • We want to estimate a coin’s bias θ ∈ (0, 1) based on N tosses
  • The likelihood model: x_1, . . . , x_N ∼ Bernoulli(θ), with

p(x_n|θ) = θ^{x_n}(1 − θ)^{1−x_n}

  • The prior: θ ∼ Beta(a, b), with

p(θ|a, b) = [Γ(a + b) / (Γ(a)Γ(b))] θ^{a−1}(1 − θ)^{b−1}

  • The posterior:

p(θ|X) ∝ ∏_{n=1}^{N} p(x_n|θ) p(θ|a, b)
       ∝ ∏_{n=1}^{N} θ^{x_n}(1 − θ)^{1−x_n} θ^{a−1}(1 − θ)^{b−1}
       = θ^{a+∑_{n=1}^{N} x_n−1}(1 − θ)^{b+N−∑_{n=1}^{N} x_n−1}

  • Thus the posterior is Beta(a + ∑_{n=1}^{N} x_n, b + N − ∑_{n=1}^{N} x_n)
  • Here, the posterior has the same form as the prior (both Beta)
  • This also makes it very easy to perform online inference (see the sketch below)
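A minimal sketch of this conjugate update; the tosses and the uniform Beta(1, 1) prior are hypothetical. Online inference is simply feeding the current posterior back in as the prior for the next batch:

```python
import numpy as np

def beta_bernoulli_update(a, b, tosses):
    # Posterior is Beta(a + sum_n x_n, b + N - sum_n x_n)
    tosses = np.asarray(tosses)
    return a + tosses.sum(), b + len(tosses) - tosses.sum()

a, b = 1.0, 1.0                                   # uniform Beta(1, 1) prior
a, b = beta_bernoulli_update(a, b, [1, 1, 0, 1])  # first batch of tosses
a, b = beta_bernoulli_update(a, b, [0, 1])        # online: posterior -> new prior
print(a, b, a / (a + b))                          # posterior mean of the bias
```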

SLIDE 21

Conjugate Priors

Recall p(θ|X) = p(X|θ)p(θ) / p(X).

Given some data distribution (likelihood) p(X|θ) and a prior p(θ) = π(θ|α), the prior is conjugate if the posterior also has the same form, i.e.,

p(θ|α, X) = p(X|θ)π(θ|α) / p(X) = π(θ|α*)

Several pairs of distributions are conjugate to each other, e.g.:

  • Gaussian-Gaussian
  • Beta-Bernoulli
  • Beta-Binomial
  • Gamma-Poisson
  • Dirichlet-Multinomial
  • ...
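As one more worked pair from the list above, a minimal sketch of the Gamma-Poisson update (the prior and the counts are hypothetical): a Gamma(shape, rate) prior on a Poisson rate θ yields a Gamma posterior with updated hyperparameters.

```python
import numpy as np

def gamma_poisson_update(shape, rate, counts):
    # Gamma(shape, rate) prior on a Poisson rate theta gives the posterior
    # Gamma(shape + sum_n x_n, rate + N)
    counts = np.asarray(counts)
    return shape + counts.sum(), rate + len(counts)

shape, rate = 2.0, 1.0                                    # hypothetical prior
shape, rate = gamma_poisson_update(shape, rate, [3, 5, 4, 2])
print(shape / rate)                                       # posterior mean rate
```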

SLIDES 22-25

A Non-Conjugate Case

  • We want to learn a classifier θ for predicting a label y ∈ {−1, +1} for a point x
  • Assume a logistic likelihood model for the labels:

p(y_n|θ) = 1 / (1 + exp(−y_n θ⊤x_n))

  • The prior: θ ∼ Normal(µ, Σ) (Gaussian, not conjugate to the logistic):

p(θ|µ, Σ) ∝ exp(−(1/2)(θ − µ)⊤Σ⁻¹(θ − µ))

  • The posterior p(θ|X) ∝ ∏_{n=1}^{N} p(y_n|θ) p(θ|µ, Σ) does not have a closed form
  • Approximate Bayesian inference is needed in such cases:
  • Sampling-based approximations: MCMC methods (see the sketch below)
  • Optimization-based approximations: Variational Bayes, Laplace, etc.
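A minimal sketch of the sampling route: random-walk Metropolis (one simple MCMC method; the slides do not prescribe a specific sampler) targeting the logistic-likelihood, Gaussian-prior posterior above, on toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))                        # toy inputs
y = np.where(X @ np.array([1.5, -2.0]) > 0, 1, -1)   # toy labels in {-1, +1}

def log_post(theta):
    # log p(theta|X, y) up to a constant: logistic likelihood + Normal(0, I) prior
    log_lik = -np.sum(np.log1p(np.exp(-y * (X @ theta))))
    log_prior = -0.5 * theta @ theta
    return log_lik + log_prior

theta, samples = np.zeros(2), []
for _ in range(5000):                                # random-walk Metropolis
    proposal = theta + 0.1 * rng.normal(size=2)
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal                             # accept the move
    samples.append(theta)
print(np.mean(samples[1000:], axis=0))               # posterior mean after burn-in
```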

SLIDES 26-28

Benefits of Bayesian Modeling

  • Our estimate of θ is not a single value (a “point”) but a distribution
  • We can model and quantify the uncertainty (or “variance”) in θ via p(θ|X)
  • We can use this uncertainty in various tasks such as diagnosis and prediction, e.g., making predictions by averaging over all possible values of θ:

p(y|x, X, Y) = E_{p(θ|·)}[p(y|x, θ)] = ∫ p(y|x, θ) p(θ|X, Y) dθ

  • This also allows quantifying the uncertainty in the predictions (see the sketch below)
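A minimal sketch of this predictive averaging, replacing the integral with a Monte Carlo average over posterior samples. The samples below are stand-in Gaussian draws playing the role of draws from p(θ|X, Y) (e.g., from an MCMC run), used only to keep the sketch self-contained:

```python
import numpy as np

rng = np.random.default_rng(4)
# Stand-in draws playing the role of samples from p(theta|X, Y)
samples = rng.normal(loc=[1.5, -2.0], scale=0.2, size=(2000, 2))

def predict_proba(x_new, samples):
    # Monte Carlo estimate of p(y=+1|x, X, Y) = E_{p(theta|.)}[sigmoid(theta^T x)]
    logits = samples @ x_new
    return np.mean(1.0 / (1.0 + np.exp(-logits)))

print(predict_proba(np.array([0.5, 0.5]), samples))
# The spread of sigmoid(theta^T x) across samples quantifies predictive uncertainty
```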

SLIDES 29-30

Other Benefits of Bayesian Modeling

  • Hierarchical model construction: parameters can depend on hyperparameters
  • Hyperparameters need not be tuned by hand but can be inferred from the data, e.g., by maximizing the marginal likelihood p(X|α) (see the sketch below)
  • Provides robustness: e.g., learning the sparsity hyperparameter in sparse regression, or learning kernel hyperparameters in kernel methods
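A minimal sketch of this idea for the Beta-Bernoulli model, where the marginal likelihood has a closed form, p(X|a, b) = B(a + s, b + N − s) / B(a, b) with s = ∑_n x_n; the data and the search grid are hypothetical:

```python
import numpy as np
from scipy.special import betaln

X = np.array([1, 1, 0, 1, 1, 1, 0, 1])   # hypothetical coin tosses
s, N = X.sum(), len(X)

def log_marginal(a, b):
    # log p(X|a, b) = log B(a + s, b + N - s) - log B(a, b)
    return betaln(a + s, b + N - s) - betaln(a, b)

grid = np.linspace(0.1, 10.0, 100)
best = max((log_marginal(a, b), a, b) for a in grid for b in grid)
print(best[1], best[2])                   # (a, b) chosen by evidence maximization
```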

SLIDE 31

Other Benefits of Probabilistic/Bayesian Modeling

  • Can introduce “local parameters” (latent variables) associated with each data point and infer those as well
  • Used in many problems: Gaussian mixture models, probabilistic principal component analysis, factor analysis, topic models

SLIDE 32

Other Benefits of Probabilistic/Bayesian Modeling

  • Enables a modular architecture: simple models can be neatly combined to solve more complex problems
  • Allows jointly learning across multiple data sets (sometimes also known as multitask learning or transfer learning)

SLIDE 33

Other Benefits of Bayesian Modeling

  • Nonparametric Bayesian modeling: a principled way to learn the model size
  • E.g., how many clusters (Gaussian mixture models or graph clustering), how many basis vectors (PCA) or dictionary elements (sparse coding or dictionary learning), how many topics (topic models such as LDA), etc.
  • NPBayes modeling allows the model size to grow with the data

SLIDES 34-35

Other Benefits of Bayesian Modeling

  • Sequential data acquisition, or “active learning”
  • We can check how confident the learned model is w.r.t. a new data point. For Bayesian linear regression:

Prior:               p(θ|λ) = Normal(θ|0, λ²)
Likelihood:          p(y|x, θ) = Normal(y|θ⊤x, σ²)
Posterior:           p(θ|Y, X) = Normal(θ|µ_θ, Σ_θ)
Predictive dist.:    p(y₀|x₀, Y, X) = Normal(y₀|µ₀, σ₀²)
Predictive mean:     µ₀ = µ_θ⊤x₀
Predictive variance: σ₀² = σ² + x₀⊤Σ_θx₀

  • This gives a strategy to choose data points sequentially for improved learning under a budget on the amount of available data (see the sketch below)
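A minimal sketch of that strategy for the model above, assuming the prior is Normal(0, λ²I) and using toy data: compute the closed-form posterior, then query the candidate point with the largest predictive variance:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, lam2, D = 0.25, 1.0, 3                 # noise var, prior var, input dim
X = rng.normal(size=(30, D))                   # toy labeled inputs
y = X @ np.array([1.0, -1.0, 0.5]) + np.sqrt(sigma2) * rng.normal(size=30)

# Closed-form posterior Normal(mu, Sigma) under the prior Normal(0, lam2 * I)
Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(D) / lam2)
mu = Sigma @ X.T @ y / sigma2

pool = rng.normal(size=(200, D))               # unlabeled candidate points
pred_var = sigma2 + np.einsum('nd,de,ne->n', pool, Sigma, pool)  # sigma_0^2
print(pool[pred_var.argmax()])                 # query the most uncertain candidate
```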

SLIDE 36

Next Talk

  • Case study on Bayesian sparse linear regression
  • Hyperparameter estimation
  • Introduction to approximate Bayesian inference

SLIDE 37

Thanks! Questions?