
Probabilistic Models

Shan-Hung Wu

shwu@cs.nthu.edu.tw

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning


Outline

1. Probabilistic Models
2. Maximum Likelihood Estimation
   - Linear Regression
   - Logistic Regression
3. Maximum A Posteriori Estimation
4. Bayesian Estimation**

1. Probabilistic Models


Predictions based on Probability

In supervised learning, we are given a training set $\mathbb{X} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$.

Model $\mathcal{F}$: a collection of functions parametrized by $\Theta$.

Goal: to train a function $f$ such that, given a new data point $\mathbf{x}_0$, the output value $\hat{y} = f(\mathbf{x}_0; \Theta)$ is closest to the correct label $y_0$.

Examples in $\mathbb{X}$ are usually assumed to be i.i.d. samples of random variables $(\mathbf{x}, \mathrm{y})$ following some data-generating distribution $P(\mathbf{x}, \mathrm{y})$.

In probabilistic models, $f$ is replaced by $P(\mathrm{y} = y \mid \mathbf{x} = \mathbf{x}_0)$ and a prediction is made by
$$\hat{y} = \arg\max_{y} P(\mathrm{y} = y \mid \mathbf{x} = \mathbf{x}_0; \Theta).$$

How do we find $\Theta$? (A sketch of this prediction rule follows below.)
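To make the prediction rule concrete, here is a minimal sketch of $\hat{y} = \arg\max_y P(y \mid \mathbf{x} = \mathbf{x}_0; \Theta)$ over a finite label set. The conditional model `cond_prob` is a hypothetical stand-in for whatever $P(y \mid \mathbf{x}; \Theta)$ is fitted later in these slides.

```python
import numpy as np

def predict(x0, cond_prob, labels):
    """Return argmax_y P(y | x = x0; Theta) over a finite label set."""
    probs = [cond_prob(y, x0) for y in labels]
    return labels[int(np.argmax(probs))]

# Hypothetical conditional model: a logistic-style P(y | x) with fixed Theta = w.
w = np.array([1.0, -2.0])
cond_prob = lambda y, x: 1.0 / (1.0 + np.exp(-y * (w @ x)))  # P(+1|x) + P(-1|x) = 1

print(predict(np.array([0.5, 0.1]), cond_prob, labels=[-1, +1]))  # -> 1
```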


Function (Θ) as Point Estimate

Regard $\Theta$ (equivalently, $f$) as an estimate of the "true" $\Theta^*$ ($f^*$), mapped from the training set $\mathbb{X}$.

Maximum a posteriori (MAP) estimation:
$$\arg\max_{\Theta} P(\Theta \mid \mathbb{X}) = \arg\max_{\Theta} P(\mathbb{X} \mid \Theta)\, P(\Theta),$$
by Bayes' rule ($P(\mathbb{X})$ is irrelevant). MAP solves for $\Theta$ first, then uses it as a constant in $P(y \mid \mathbf{x}; \Theta)$ to get $\hat{y}$.

Maximum likelihood (ML) estimation: $\arg\max_{\Theta} P(\mathbb{X} \mid \Theta)$. It assumes a uniform $P(\Theta)$ and does not prefer any particular $\Theta$.
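To spell out the Bayes'-rule step connecting the two estimators:
$$\arg\max_{\Theta} P(\Theta \mid \mathbb{X}) = \arg\max_{\Theta} \frac{P(\mathbb{X} \mid \Theta)\, P(\Theta)}{P(\mathbb{X})} = \arg\max_{\Theta} P(\mathbb{X} \mid \Theta)\, P(\Theta),$$
since $P(\mathbb{X})$ does not depend on $\Theta$. When $P(\Theta)$ is uniform (a constant), the objective further reduces to $\arg\max_{\Theta} P(\mathbb{X} \mid \Theta)$, i.e., ML is MAP with a flat prior.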

2. Maximum Likelihood Estimation: Linear Regression


Probability Interpretation

Assumption: $y = f^*(\mathbf{x}) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \beta^{-1})$.

The unknown deterministic function is defined as $f^*(\mathbf{x}; \mathbf{w}^*) = \mathbf{w}^{*\top}\mathbf{x}$. (All variables are z-normalized, so there is no bias term $b$.)

We then have $(y \mid \mathbf{x}) \sim \mathcal{N}(\mathbf{w}^{*\top}\mathbf{x}, \beta^{-1})$.

So our goal is to find $\mathbf{w}$ as close to $\mathbf{w}^*$ as possible such that
$$\hat{y} = \arg\max_{y} P(y \mid \mathbf{x} = \mathbf{x}; \mathbf{w}) = \mathbf{w}^\top\mathbf{x}.$$
Note that $\hat{y}$ is irrelevant to $\beta$, so we do not need to solve for $\beta$.

ML estimation: $\arg\max_{\mathbf{w}} P(\mathbb{X} \mid \mathbf{w})$.


ML Estimation I

Problem: $\arg\max_{\mathbf{w}} P(\mathbb{X} \mid \mathbf{w})$.

Since we assume i.i.d. samples, we have
$$P(\mathbb{X} \mid \mathbf{w}) = \prod_{i=1}^{N} P(\mathbf{x}^{(i)}, y^{(i)} \mid \mathbf{w}) = \prod_{i=1}^{N} P(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w})\, P(\mathbf{x}^{(i)} \mid \mathbf{w})$$
$$= \prod_{i=1}^{N} P(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w})\, P(\mathbf{x}^{(i)}) = \prod_{i} \mathcal{N}(y^{(i)}; \mathbf{w}^\top\mathbf{x}^{(i)}, \beta^{-1})\, P(\mathbf{x}^{(i)})$$
$$= \prod_{i} \sqrt{\tfrac{\beta}{2\pi}}\, \exp\!\Big(-\tfrac{\beta}{2}\big(y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)}\big)^2\Big)\, P(\mathbf{x}^{(i)}).$$

To make the problem tractable, we prefer "sums" over "products". We can instead maximize the log likelihood
$$\arg\max_{\mathbf{w}} \log P(\mathbb{X} \mid \mathbf{w}) = \arg\max_{\mathbf{w}} \log \prod_{i} \sqrt{\tfrac{\beta}{2\pi}}\, \exp\!\Big(-\tfrac{\beta}{2}\big(y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)}\big)^2\Big)\, P(\mathbf{x}^{(i)})$$
$$= \arg\max_{\mathbf{w}} \; N\log\sqrt{\tfrac{\beta}{2\pi}} - \tfrac{\beta}{2}\sum_{i}\big(y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)}\big)^2 + \sum_{i}\log P(\mathbf{x}^{(i)}).$$
The optimal point does not change since $\log$ is monotonically increasing.


ML Estimation II

$$\arg\max_{\mathbf{w}} \; N\log\sqrt{\tfrac{\beta}{2\pi}} - \tfrac{\beta}{2}\sum_{i}\big(y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)}\big)^2 + \sum_{i}\log P(\mathbf{x}^{(i)})$$

Ignoring terms irrelevant to $\mathbf{w}$, we have
$$\arg\min_{\mathbf{w}} \sum_{i}\big(y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)}\big)^2.$$

In other words, we seek $\mathbf{w}$ by minimizing the SSE (sum of squared errors), as we have done before, e.g., by the stochastic gradient descent algorithm.

This new perspective explains our ad hoc choice of SSE for empirical risk minimization: checking the assumptions helps us understand when a model works best. (A numerical check of this equivalence follows below.)

It also motivates new models. What would a probabilistic model for classification look like?
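The equivalence above is easy to verify numerically. A minimal sketch, assuming NumPy (the names `w_star` and the synthetic sizes are illustrative, not from the slides): generate data under the slides' assumption $y = \mathbf{w}^{*\top}\mathbf{x} + \varepsilon$, then recover the ML estimate as the SSE minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data under the slides' assumption: y = w*^T x + eps, eps ~ N(0, beta^{-1}).
N, D, beta = 500, 3, 25.0
w_star = rng.normal(size=D)                       # the "true" w*
X = rng.normal(size=(N, D))                       # z-normalized-style features, no bias term
y = X @ w_star + rng.normal(scale=beta ** -0.5, size=N)

# ML estimate = SSE minimizer, here via the ordinary least-squares solver.
w_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.abs(w_ml - w_star).max())                # small, and shrinks as N grows (consistency)
```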

2. Maximum Likelihood Estimation: Logistic Regression


Probabilistic Models for Binary Classification

Probabilistic models: $\hat{y} = \arg\max_{y} P(y \mid \mathbf{x}; \Theta)$.

In regression, we assume $(y \mid \mathbf{x}) \sim \mathcal{N}$ (based on $y = f^*(\mathbf{x}) + \varepsilon$). However, the Gaussian distribution is not applicable to binary classification: the values of $y$ should concentrate on either $-1$ or $1$.

Which distribution should we assume? Coin flipping: $(y \mid \mathbf{x}) \sim \mathrm{Bernoulli}(\rho)$, where
$$P(y \mid \mathbf{x}; \rho) = \rho^{y'}(1-\rho)^{(1-y')}, \qquad y' = \frac{y+1}{2}.$$

How do we relate $\mathbf{x}$ to $\rho$?
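As a quick sanity check of the label mapping $y' = (y+1)/2$:
$$P(y = 1 \mid \mathbf{x}; \rho) = \rho^{1}(1-\rho)^{0} = \rho, \qquad P(y = -1 \mid \mathbf{x}; \rho) = \rho^{0}(1-\rho)^{1} = 1-\rho,$$
so $\rho$ is exactly the probability of the positive class.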


Logistic Function

Recall that the logistic function
$$\sigma(z) = \frac{\exp(z)}{\exp(z)+1} = \frac{1}{1+\exp(-z)}$$
is commonly used as a parametrizing function of the Bernoulli distribution. We have $P(y \mid \mathbf{x}; z) = \sigma(z)^{y'}\,(1-\sigma(z))^{(1-y')}$: the larger $z$, the higher the chance of a "positive flip".

How do we relate $\mathbf{x}$ to $z$?
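A small sketch of the logistic function in code (the piecewise form is a standard trick to avoid overflow for large $|z|$; the helper name `sigmoid` is ours, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = exp(z) / (exp(z) + 1) = 1 / (1 + exp(-z)), computed stably."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # safe here: exp(-z) <= 1
    ez = np.exp(z[~pos])                       # safe here: exp(z) <= 1
    out[~pos] = ez / (1.0 + ez)
    return out

print(sigmoid(np.array([-6.0, 0.0, 6.0])))     # ~[0.0025, 0.5, 0.9975]: larger z, higher chance of a positive flip
```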


Logistic Regression

In logistic regression, we let $z = \mathbf{w}^\top\mathbf{x}$. Basically, $z$ is the projection of $\mathbf{x}$ along the direction $\mathbf{w}$.

We have $P(y \mid \mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^\top\mathbf{x})^{y'}\,[1-\sigma(\mathbf{w}^\top\mathbf{x})]^{(1-y')}$.

Prediction: $\hat{y} = \arg\max_{y} P(y \mid \mathbf{x}; \mathbf{w}) = \mathrm{sign}(\mathbf{w}^\top\mathbf{x})$.

How do we learn $\mathbf{w}$ from $\mathbb{X}$? ML estimation: $\arg\max_{\mathbf{w}} P(\mathbb{X} \mid \mathbf{w})$.


ML Estimation

Log-likelihood:
$$\log P(\mathbb{X} \mid \mathbf{w}) = \log \prod_{i=1}^{N} P(\mathbf{x}^{(i)}, y^{(i)} \mid \mathbf{w}) = \log \prod_{i} P(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w})\, P(\mathbf{x}^{(i)} \mid \mathbf{w})$$
$$\propto \log \prod_{i} \sigma(\mathbf{w}^\top\mathbf{x}^{(i)})^{y'^{(i)}}\,[1-\sigma(\mathbf{w}^\top\mathbf{x}^{(i)})]^{(1-y'^{(i)})} \quad \text{(dropping the additive } \textstyle\sum_{i}\log P(\mathbf{x}^{(i)}) \text{ term)}$$
$$= \sum_{i} \Big( y'^{(i)}\,\mathbf{w}^\top\mathbf{x}^{(i)} - \log\big(1+e^{\mathbf{w}^\top\mathbf{x}^{(i)}}\big) \Big). \; \text{[Homework]}$$

Unlike in linear regression, we cannot solve for $\mathbf{w}$ analytically in closed form via
$$\nabla_{\mathbf{w}} \log P(\mathbb{X} \mid \mathbf{w}) = \sum_{i=1}^{N} \big[y'^{(i)} - \sigma(\mathbf{w}^\top\mathbf{x}^{(i)})\big]\,\mathbf{x}^{(i)} = \mathbf{0}.$$

However, we can still evaluate $\nabla_{\mathbf{w}} \log P(\mathbb{X} \mid \mathbf{w})$ and use iterative methods to solve for $\mathbf{w}$, e.g., stochastic gradient descent. It can be shown that $\log P(\mathbb{X} \mid \mathbf{w})$ is concave in $\mathbf{w}$ [1], so such iterative algorithms converge (see the sketch below).
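A minimal sketch of this iterative scheme, using full-batch gradient ascent rather than SGD for brevity (the data sizes and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Synthetic binary data; y' = (y + 1) / 2 in {0, 1} as on the slides.
N, D = 500, 2
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(N, D))
y_prime = (rng.uniform(size=N) < sigmoid(X @ w_true)).astype(float)

# Ascend the concave log-likelihood: grad = sum_i [y'^(i) - sigma(w^T x^(i))] x^(i).
w, lr = np.zeros(D), 0.5
for _ in range(1000):
    w += lr * ((y_prime - sigmoid(X @ w)) @ X) / N   # averaged gradient step
print(w)                                             # approaches w_true as N grows
```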

3. Maximum A Posteriori Estimation


MAP Estimation

So far, we have solved for $\mathbf{w}$ by ML estimation: $\arg\max_{\mathbf{w}} P(\mathbb{X} \mid \mathbf{w})$.

In MAP estimation, we instead solve
$$\arg\max_{\mathbf{w}} P(\mathbf{w} \mid \mathbb{X}) = \arg\max_{\mathbf{w}} P(\mathbb{X} \mid \mathbf{w})\, P(\mathbf{w}),$$
where $P(\mathbf{w})$ models our preference or prior knowledge about $\mathbf{w}$.


MAP Estimation for Linear Regression

MAP estimation in linear regression: $\arg\max_{\mathbf{w}} \log[P(\mathbb{X} \mid \mathbf{w})\, P(\mathbf{w})]$.

If we assume that $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \beta^{-1}\mathbf{I})$, then
$$\log[P(\mathbb{X} \mid \mathbf{w})\, P(\mathbf{w})] = \log P(\mathbb{X} \mid \mathbf{w}) + \log P(\mathbf{w})$$
$$\propto -\sum_{i}\big(y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)}\big)^2 + \log\!\left( \sqrt{\tfrac{1}{(2\pi)^D \det(\beta^{-1}\mathbf{I})}}\, \exp\!\Big[-\tfrac{1}{2}(\mathbf{w}-\mathbf{0})^\top(\beta^{-1}\mathbf{I})^{-1}(\mathbf{w}-\mathbf{0})\Big] \right)$$
$$\propto -\sum_{i}\big(y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)}\big)^2 - \beta\,\mathbf{w}^\top\mathbf{w}.$$

$P(\mathbf{w})$ corresponds to the weight-decay term in ridge regression (see the sketch below). MAP estimation thus provides a way to design complicated yet interpretable regularization terms: e.g., we get LASSO by letting $P(\mathbf{w}) \sim \mathrm{Laplace}(0, b)$ [Proof], and we can also let $P(\mathbf{w})$ be a mixture of Gaussians.
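A sketch of the resulting estimator. Up to constants, MAP here is ridge regression, $\arg\min_{\mathbf{w}} \sum_i (y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)})^2 + \lambda\,\mathbf{w}^\top\mathbf{w}$, whose minimizer has the closed form below (the weight `lam` absorbs the noise and prior precisions; the toy sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Small-N regression problem, where the Gaussian prior w ~ N(0, beta^{-1} I) helps.
N, D, lam = 20, 5, 1.0
w_star = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ w_star + rng.normal(scale=0.3, size=N)

# MAP / ridge closed form: w = (X^T X + lam * I)^{-1} X^T y.
w_map = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(w_map)          # letting lam -> 0 recovers the ML (plain SSE) solution
```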


Remarks on ML and MAP Estimation

Theorem (Consistency). The ML estimator $\Theta_{\mathrm{ML}}$ is consistent, i.e., $\Theta_{\mathrm{ML}} \xrightarrow{\Pr} \Theta^*$ as $N \to \infty$, as long as the "true" $P(y \mid \mathbf{x}; \Theta^*)$ lies within our model $\mathcal{F}$.

Theorem (Cramér-Rao Lower Bound [2]). At a fixed (large) number $N$ of examples, no consistent estimator of $\Theta^*$ has a lower expected MSE (mean squared error) than the ML estimator $\Theta_{\mathrm{ML}}$. That is, $\Theta_{\mathrm{ML}}$ has a low sample complexity (or is statistically efficient).

ML estimation is popular due to its consistency and efficiency. When $N$ is small enough that overfitting occurs, we can use MAP estimation to introduce bias and reduce variance.

4. Bayesian Estimation**


Bayesian Estimation

In ML/MAP estimation, we solve for $\Theta$ first, then use it as a constant to make predictions: $\hat{y} = \arg\max_{y} P(y \mid \mathbf{x}; \Theta)$.

Bayesian estimation treats $\Theta$ as a random variable:
$$\hat{y} = \arg\max_{y} P(y \mid \mathbf{x}, \mathbb{X}) = \arg\max_{y} \int P(y, \Theta \mid \mathbf{x}, \mathbb{X})\, d\Theta.$$
It makes predictions by considering all $\Theta$'s, weighted by their chances.

Bayesian estimation usually generalizes much better when the size $N$ of the training set is small. (A worked linear-regression sketch follows below.)
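For the linear-Gaussian model of the earlier slides, the integral is tractable. A sketch under the standard conjugate setup, with an assumed prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$ and noise precision $\beta$ (the hyperparameter values are illustrative): the posterior $P(\mathbf{w} \mid \mathbb{X})$ stays Gaussian, and the predictive distribution at a new $\mathbf{x}_0$ averages over all $\mathbf{w}$'s in closed form.

```python
import numpy as np

rng = np.random.default_rng(3)

alpha, beta, N, D = 1.0, 25.0, 20, 3
w_star = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ w_star + rng.normal(scale=beta ** -0.5, size=N)

# Posterior P(w | X) = N(m, S): Gaussian prior x Gaussian likelihood stays Gaussian.
S = np.linalg.inv(alpha * np.eye(D) + beta * X.T @ X)   # posterior covariance
m = beta * S @ X.T @ y                                  # posterior mean

# Predictive P(y | x0, X) = N(m^T x0, 1/beta + x0^T S x0): every w contributes,
# weighted by its posterior probability, instead of a single point estimate.
x0 = rng.normal(size=D)
print(m @ x0, 1.0 / beta + x0 @ S @ x0)
```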


Bayesian vs. ML Estimation

Example: polynomial regression. [Figure: the red line shows predictions by the Bayesian-estimation regressor; the shaded area shows predictions by ML/MAP-estimation regressors.]


Bayesian vs. MAP Estimation

MAP gains some of the benefit of the Bayesian approach by incorporating the prior, trading $\mathrm{bias}(\Theta_{\mathrm{MAP}})$ for a reduced $\mathrm{Var}_{\mathbb{X}}(\Theta_{\mathrm{MAP}})$ when the training set is small.

However, this does not work if $\Theta_{\mathrm{MAP}}$ is unrepresentative of the majority of $\Theta$'s in $\int P(y, \Theta \mid \mathbf{x}, \mathbb{X})\, d\Theta$, e.g., when $P(\Theta \mid \mathbb{X})$ is a mixture of Gaussians.


Remarks

Bayesian estimation:
$$\hat{y} = \arg\max_{y} P(y \mid \mathbf{x}, \mathbb{X}) = \arg\max_{y} \int P(y, \Theta \mid \mathbf{x}, \mathbb{X})\, d\Theta.$$

It usually generalizes much better given a small training set. Unfortunately, the solution may not be tractable in many applications; even when tractable, it incurs a high computation cost and is not suitable for large-scale learning tasks.

References

[1] Deepak Roy Chittajallu. Why is the error function minimized in logistic regression convex? http://mathgotchas.blogspot.tw/2011/10/why-is-error-function-minimized-in.html, 2011.

[2] Harald Cramér. Mathematical Methods of Statistics. Princeton University Press, 1946.