SLIDE 1

Introduction to Probabilistic Machine Learning

Piyush Rai

  • Dept. of CSE, IIT Kanpur

(Mini-course 1) Nov 03, 2015


SLIDES 2-4

Machine Learning

  • Detecting trends/patterns in the data
  • Making predictions about future data

Two schools of thought:

  • Learning as optimization: fit a model to minimize some loss function
  • Learning as inference: infer the parameters of the data-generating distribution

The two are not really completely disjoint ways of thinking about learning.

SLIDE 5

Plan for the mini-course

A series of 4 talks:

  • Introduction to Probabilistic and Bayesian Machine Learning (today)
  • Case Study: Bayesian Linear Regression, Approx. Bayesian Inference (Nov 5)
  • Nonparametric Bayesian modeling for function approximation (Nov 7)
  • Nonparam. Bayesian modeling for clustering/dimensionality reduction (Nov 8)

SLIDES 6-8

Machine Learning via Probabilistic Modeling

Assume data X = {x_1, . . . , x_N} generated from a probabilistic model, with the data usually assumed i.i.d. (independent and identically distributed):

x_1, . . . , x_N ∼ p(x|θ)

For i.i.d. data, the probability of the observed data X given model parameters θ factorizes:

p(X|θ) = p(x_1, . . . , x_N|θ) = ∏_{n=1}^{N} p(x_n|θ)

  • p(x_n|θ) denotes the likelihood w.r.t. data point n
  • The form of p(x_n|θ) depends on the type/characteristics of the data
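To make the i.i.d. factorization concrete, here is a minimal Python sketch. It assumes, purely for illustration, a Gaussian p(x|θ) with known unit variance and θ the mean (the slide leaves the form of p(x_n|θ) open) and evaluates log p(X|θ) as a sum of per-point log-likelihoods:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(X, theta, sigma=1.0):
    # log p(X|theta) = sum_n log p(x_n|theta) for i.i.d. data
    return np.sum(norm.logpdf(X, loc=theta, scale=sigma))

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=100)   # toy data with true theta = 2
print(log_likelihood(X, theta=2.0))            # high near the true parameter
print(log_likelihood(X, theta=-1.0))           # much lower for a bad theta
```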

SLIDE 9

Some common probability distributions

SLIDES 10-14

Maximum Likelihood Estimation (MLE)

  • We wish to estimate the parameters θ from observed data {x_1, . . . , x_N}
  • MLE does this by finding the θ that maximizes the (log-)likelihood p(X|θ):

θ̂ = arg max_θ log p(X|θ) = arg max_θ log ∏_{n=1}^{N} p(x_n|θ) = arg max_θ ∑_{n=1}^{N} log p(x_n|θ)

  • MLE now reduces to solving an optimization problem w.r.t. θ (see the sketch below)
  • MLE has some nice theoretical properties (e.g., consistency as N → ∞)
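A minimal sketch of MLE as numerical optimization, again assuming (for illustration only) a Gaussian likelihood with known unit variance, so the result can be checked against the known closed form, the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
X = rng.normal(loc=3.0, scale=1.0, size=500)   # toy data, true mean 3

# arg max_theta log p(X|theta) == arg min_theta of the negative log-likelihood
nll = lambda theta: -np.sum(norm.logpdf(X, loc=theta, scale=1.0))

theta_hat = minimize_scalar(nll).x
print(theta_hat, X.mean())   # numerical MLE matches the sample mean
```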

SLIDE 15

Injecting Prior Knowledge

  • Often, we might a priori know something about the parameters
  • A prior distribution p(θ) can encode/specify this knowledge
  • Bayes' rule gives us the posterior distribution over θ, which reflects our updated knowledge about θ after observing the data:

p(θ|X) = p(X|θ)p(θ) / p(X) = p(X|θ)p(θ) / ∫_θ p(X|θ)p(θ) dθ ∝ Likelihood × Prior

  • Note: θ is now a random variable
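A minimal numerical illustration of this update, approximating the evidence p(X) in the denominator by a Riemann sum over a grid of θ values; the Beta(2, 2) prior and the coin-toss data are hypothetical (a coin example is worked out analytically on a later slide):

```python
import numpy as np
from scipy.stats import beta, bernoulli

X = np.array([1, 0, 1, 1, 0, 1, 1, 1])        # hypothetical coin tosses
grid = np.linspace(0.001, 0.999, 999)          # grid over theta
dtheta = grid[1] - grid[0]

prior = beta.pdf(grid, 2, 2)                   # assumed Beta(2, 2) prior
lik = np.array([bernoulli.pmf(X, t).prod() for t in grid])

unnorm = lik * prior                           # Likelihood x Prior
posterior = unnorm / (unnorm.sum() * dtheta)   # denominator approximates p(X)
print(grid[posterior.argmax()])                # posterior mode
```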

SLIDE 16

Maximum-a-Posteriori (MAP) Estimation

MAP estimation finds the θ that maximizes the posterior p(θ|X) ∝ p(X|θ)p(θ):

θ̂ = arg max_θ log ∏_{n=1}^{N} p(x_n|θ)p(θ) = arg max_θ [∑_{n=1}^{N} log p(x_n|θ) + log p(θ)]

  • MAP now reduces to solving an optimization problem w.r.t. θ
  • The objective function is very similar to MLE's, except for the log p(θ) term
  • In some sense, MAP is just a “regularized” MLE (see the sketch below)
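The same sketch as for MLE with the extra log p(θ) term added, here a hypothetical Normal(0, τ²) prior on the Gaussian mean. The MAP estimate comes out shrunk from the MLE toward the prior mean, which is the "regularized MLE" effect:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(2)
X = rng.normal(loc=3.0, scale=1.0, size=20)   # small N, so the prior matters
tau = 1.0                                     # hypothetical prior std-dev

def neg_log_posterior(theta):
    log_lik = np.sum(norm.logpdf(X, loc=theta, scale=1.0))
    log_prior = norm.logpdf(theta, loc=0.0, scale=tau)   # the log p(theta) term
    return -(log_lik + log_prior)

theta_map = minimize_scalar(neg_log_posterior).x
print(X.mean(), theta_map)   # MAP is shrunk from the MLE toward the prior mean 0
```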

SLIDES 17-19

Bayesian Learning

  • Both MLE and MAP only give a point estimate (a single best answer) for θ
  • How can we capture/quantify the uncertainty in θ?
  • We need to infer the full posterior distribution:

p(θ|X) = p(X|θ)p(θ) / p(X) = p(X|θ)p(θ) / ∫_θ p(X|θ)p(θ) dθ ∝ Likelihood × Prior

  • This requires doing “fully Bayesian” inference
  • Inference is sometimes a fairly easy and sometimes a (very) hard problem

SLIDE 20

A Simple Example of Bayesian Inference

  • We want to estimate a coin’s bias θ ∈ (0, 1) based on N tosses
  • The likelihood model: x_1, . . . , x_N ∼ Bernoulli(θ), with

p(x_n|θ) = θ^{x_n}(1 − θ)^{1−x_n}

  • The prior: θ ∼ Beta(a, b), with

p(θ|a, b) = [Γ(a + b) / (Γ(a)Γ(b))] θ^{a−1}(1 − θ)^{b−1}

  • The posterior:

p(θ|X) ∝ ∏_{n=1}^{N} p(x_n|θ) p(θ|a, b)
       ∝ ∏_{n=1}^{N} θ^{x_n}(1 − θ)^{1−x_n} θ^{a−1}(1 − θ)^{b−1}
       = θ^{a+∑_{n=1}^{N} x_n−1}(1 − θ)^{b+N−∑_{n=1}^{N} x_n−1}

  • Thus the posterior is Beta(a + ∑_{n=1}^{N} x_n, b + N − ∑_{n=1}^{N} x_n)
  • Here, the posterior has the same form as the prior (both Beta)
  • This also makes it very easy to perform online inference (see the sketch below)
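A minimal sketch of this conjugate update; the tosses and the uniform Beta(1, 1) prior are hypothetical. Online inference is simply feeding the current posterior back in as the prior for the next batch:

```python
import numpy as np

def beta_bernoulli_update(a, b, tosses):
    # Posterior is Beta(a + sum_n x_n, b + N - sum_n x_n)
    tosses = np.asarray(tosses)
    return a + tosses.sum(), b + len(tosses) - tosses.sum()

a, b = 1.0, 1.0                                   # uniform Beta(1, 1) prior
a, b = beta_bernoulli_update(a, b, [1, 1, 0, 1])  # first batch of tosses
a, b = beta_bernoulli_update(a, b, [0, 1])        # online: posterior -> new prior
print(a, b, a / (a + b))                          # posterior mean of the bias
```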

SLIDE 21

Conjugate Priors

Recall p(θ|X) = p(X|θ)p(θ) / p(X).

Given some data distribution (likelihood) p(X|θ) and a prior p(θ) = π(θ|α), the prior is conjugate if the posterior also has the same form, i.e.,

p(θ|α, X) = p(X|θ)π(θ|α) / p(X) = π(θ|α*)

Several pairs of distributions are conjugate to each other, e.g.:

  • Gaussian-Gaussian
  • Beta-Bernoulli
  • Beta-Binomial
  • Gamma-Poisson
  • Dirichlet-Multinomial
  • ...
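As one more worked pair from the list above, a minimal sketch of the Gamma-Poisson update (the prior and the counts are hypothetical): a Gamma(shape, rate) prior on a Poisson rate θ yields a Gamma posterior with updated hyperparameters.

```python
import numpy as np

def gamma_poisson_update(shape, rate, counts):
    # Gamma(shape, rate) prior on a Poisson rate theta gives the posterior
    # Gamma(shape + sum_n x_n, rate + N)
    counts = np.asarray(counts)
    return shape + counts.sum(), rate + len(counts)

shape, rate = 2.0, 1.0                                    # hypothetical prior
shape, rate = gamma_poisson_update(shape, rate, [3, 5, 4, 2])
print(shape / rate)                                       # posterior mean rate
```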

SLIDES 22-25

A Non-Conjugate Case

  • We want to learn a classifier θ for predicting a label y ∈ {−1, +1} for a point x
  • Assume a logistic likelihood model for the labels:

p(y_n|θ) = 1 / (1 + exp(−y_n θ⊤x_n))

  • The prior: θ ∼ Normal(µ, Σ) (Gaussian, not conjugate to the logistic):

p(θ|µ, Σ) ∝ exp(−(1/2)(θ − µ)⊤Σ⁻¹(θ − µ))

  • The posterior p(θ|X) ∝ ∏_{n=1}^{N} p(y_n|θ) p(θ|µ, Σ) does not have a closed form
  • Approximate Bayesian inference is needed in such cases:
  • Sampling-based approximations: MCMC methods (see the sketch below)
  • Optimization-based approximations: Variational Bayes, Laplace, etc.
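A minimal sketch of the sampling route: random-walk Metropolis (one simple MCMC method; the slides do not prescribe a specific sampler) targeting the logistic-likelihood, Gaussian-prior posterior above, on toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))                        # toy inputs
y = np.where(X @ np.array([1.5, -2.0]) > 0, 1, -1)   # toy labels in {-1, +1}

def log_post(theta):
    # log p(theta|X, y) up to a constant: logistic likelihood + Normal(0, I) prior
    log_lik = -np.sum(np.log1p(np.exp(-y * (X @ theta))))
    log_prior = -0.5 * theta @ theta
    return log_lik + log_prior

theta, samples = np.zeros(2), []
for _ in range(5000):                                # random-walk Metropolis
    proposal = theta + 0.1 * rng.normal(size=2)
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal                             # accept the move
    samples.append(theta)
print(np.mean(samples[1000:], axis=0))               # posterior mean after burn-in
```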

SLIDES 26-28

Benefits of Bayesian Modeling

  • Our estimate of θ is not a single value (a “point”) but a distribution
  • We can model and quantify the uncertainty (or “variance”) in θ via p(θ|X)
  • We can use this uncertainty in various tasks such as diagnosis and prediction, e.g., making predictions by averaging over all possible values of θ:

p(y|x, X, Y) = E_{p(θ|·)}[p(y|x, θ)] = ∫ p(y|x, θ) p(θ|X, Y) dθ

  • This also allows quantifying the uncertainty in the predictions (see the sketch below)
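A minimal sketch of this predictive averaging, replacing the integral with a Monte Carlo average over posterior samples. The samples below are stand-in Gaussian draws playing the role of draws from p(θ|X, Y) (e.g., from an MCMC run), used only to keep the sketch self-contained:

```python
import numpy as np

rng = np.random.default_rng(4)
# Stand-in draws playing the role of samples from p(theta|X, Y)
samples = rng.normal(loc=[1.5, -2.0], scale=0.2, size=(2000, 2))

def predict_proba(x_new, samples):
    # Monte Carlo estimate of p(y=+1|x, X, Y) = E_{p(theta|.)}[sigmoid(theta^T x)]
    logits = samples @ x_new
    return np.mean(1.0 / (1.0 + np.exp(-logits)))

print(predict_proba(np.array([0.5, 0.5]), samples))
# The spread of sigmoid(theta^T x) across samples quantifies predictive uncertainty
```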

SLIDES 29-30

Other Benefits of Bayesian Modeling

  • Hierarchical model construction: parameters can depend on hyperparameters
  • Hyperparameters need not be tuned by hand but can be inferred from the data, e.g., by maximizing the marginal likelihood p(X|α) (see the sketch below)
  • Provides robustness: e.g., learning the sparsity hyperparameter in sparse regression, or learning kernel hyperparameters in kernel methods
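A minimal sketch of this idea for the Beta-Bernoulli model, where the marginal likelihood has a closed form, p(X|a, b) = B(a + s, b + N − s) / B(a, b) with s = ∑_n x_n; the data and the search grid are hypothetical:

```python
import numpy as np
from scipy.special import betaln

X = np.array([1, 1, 0, 1, 1, 1, 0, 1])   # hypothetical coin tosses
s, N = X.sum(), len(X)

def log_marginal(a, b):
    # log p(X|a, b) = log B(a + s, b + N - s) - log B(a, b)
    return betaln(a + s, b + N - s) - betaln(a, b)

grid = np.linspace(0.1, 10.0, 100)
best = max((log_marginal(a, b), a, b) for a in grid for b in grid)
print(best[1], best[2])                   # (a, b) chosen by evidence maximization
```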

SLIDE 31

Other Benefits of Probabilistic/Bayesian Modeling

  • Can introduce “local parameters” (latent variables) associated with each data point and infer those as well
  • Used in many problems: Gaussian mixture models, probabilistic principal component analysis, factor analysis, topic models

SLIDE 32

Other Benefits of Probabilistic/Bayesian Modeling

  • Enables a modular architecture: simple models can be neatly combined to solve more complex problems
  • Allows jointly learning across multiple data sets (sometimes also known as multitask learning or transfer learning)

SLIDE 33

Other Benefits of Bayesian Modeling

  • Nonparametric Bayesian modeling: a principled way to learn the model size
  • E.g., how many clusters (Gaussian mixture models or graph clustering), how many basis vectors (PCA) or dictionary elements (sparse coding or dictionary learning), how many topics (topic models such as LDA), etc.
  • NPBayes modeling allows the model size to grow with the data

SLIDES 34-35

Other Benefits of Bayesian Modeling

  • Sequential data acquisition, or “active learning”
  • We can check how confident the learned model is w.r.t. a new data point. For Bayesian linear regression:

Prior:               p(θ|λ) = Normal(θ|0, λ²)
Likelihood:          p(y|x, θ) = Normal(y|θ⊤x, σ²)
Posterior:           p(θ|Y, X) = Normal(θ|µ_θ, Σ_θ)
Predictive dist.:    p(y₀|x₀, Y, X) = Normal(y₀|µ₀, σ₀²)
Predictive mean:     µ₀ = µ_θ⊤x₀
Predictive variance: σ₀² = σ² + x₀⊤Σ_θx₀

  • This gives a strategy to choose data points sequentially for improved learning under a budget on the amount of available data (see the sketch below)
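A minimal sketch of that strategy for the model above, assuming the prior is Normal(0, λ²I) and using toy data: compute the closed-form posterior, then query the candidate point with the largest predictive variance:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, lam2, D = 0.25, 1.0, 3                 # noise var, prior var, input dim
X = rng.normal(size=(30, D))                   # toy labeled inputs
y = X @ np.array([1.0, -1.0, 0.5]) + np.sqrt(sigma2) * rng.normal(size=30)

# Closed-form posterior Normal(mu, Sigma) under the prior Normal(0, lam2 * I)
Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(D) / lam2)
mu = Sigma @ X.T @ y / sigma2

pool = rng.normal(size=(200, D))               # unlabeled candidate points
pred_var = sigma2 + np.einsum('nd,de,ne->n', pool, Sigma, pool)  # sigma_0^2
print(pool[pred_var.argmax()])                 # query the most uncertain candidate
```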

SLIDE 36

Next Talk

  • Case study on Bayesian sparse linear regression
  • Hyperparameter estimation
  • Introduction to approximate Bayesian inference

SLIDE 37

Thanks! Questions?