Probabilistic Modeling
Subhransu Maji
CMPSCI 689: Machine Learning
3 March 2015 / 5 March 2015

Administrivia
Mini-project 1 is due Thursday, March 05. Turn in a hard copy:
➡ in the next class, or
➡ in the CS main office reception area by 4:00pm.
Clearly write your name and student ID on the front page.
Late submissions:
Overview
So far the models and algorithms you have learned about have been relatively disconnected. The probabilistic modeling framework unites them: learning can be viewed as statistical inference.
We will see two kinds of probability models (generative and conditional) and two kinds of data models (parametric and non-parametric).
The data generating distribution
The data is generated according to a distribution D:
$$(x, y) \sim D(x, y), \qquad y \in \{0, 1\}$$
The expected loss of a predictor $\hat{y}$ is:
$$\epsilon(\hat{y}) = \mathbb{E}_{(x,y)\sim D}\left[\ell(y, \hat{y})\right], \qquad \ell(y, \hat{y}) = \begin{cases} 1 & \text{if } y \neq \hat{y} \\ 0 & \text{otherwise} \end{cases}$$
The Bayes optimal classifier,
$$\hat{y} = \arg\max_y D(\hat{x}, y),$$
achieves the minimal expected loss among all classifiers.
Estimation
This suggests that one way to learn a classifier is to estimate D. Under the independent and identically distributed (i.i.d.) assumption, the training data are samples
$$(x_1, y_1) \sim D, \ (x_2, y_2) \sim D, \ \ldots, \ (x_n, y_n) \sim D,$$
from which we form an estimate $\hat{D}$. If we assume a parametric distribution, e.g. a Gaussian $N(\mu, \sigma^2)$, estimating $\hat{D}$ reduces to estimating the parameters of the distribution.
Maximum likelihood estimation
Coin toss: observed sequence {H, T, H, H}. Let $\beta$ denote the probability of H. What value of $\beta$ best explains the observed data?
Maximum likelihood principle (MLE): pick the parameters of the distribution that maximize the likelihood of the observed data.
Likelihood of the data, using the i.i.d. assumption:
$$p_\beta(\text{data}) = p_\beta(H,T,H,H) = p_\beta(H)\,p_\beta(T)\,p_\beta(H)\,p_\beta(H) = \beta \times (1-\beta) \times \beta \times \beta = \beta^3(1-\beta)$$
Setting the derivative to zero:
$$\frac{d\,p_\beta(\text{data})}{d\beta} = \frac{d\,\beta^3(1-\beta)}{d\beta} = 3\beta^2(1-\beta) + \beta^3(-1) = 0 \implies \beta = \frac{3}{4}$$
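As a sanity check, here is a minimal numerical sketch (my own illustration, not from the slides) that scans candidate values of $\beta$ and confirms the likelihood $\beta^3(1-\beta)$ peaks at $\beta = 3/4$:

```python
import numpy as np

betas = np.linspace(0.0, 1.0, 1001)      # candidate parameter values
likelihood = betas**3 * (1.0 - betas)    # p_beta(H,T,H,H) = beta^3 (1 - beta)
print(betas[np.argmax(likelihood)])      # -> 0.75, matching the closed form
```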
It is convenient to maximize the logarithm of the likelihood instead. Log-likelihood of the observed data:
$$\log p_\beta(\text{data}) = \log p_\beta(H,T,H,H) = \log p_\beta(H) + \log p_\beta(T) + \log p_\beta(H) + \log p_\beta(H) = 3\log\beta + \log(1-\beta)$$
More generally, the log-likelihood of observing H-many heads and T-many tails is:
$$\log p_\beta(\text{data}) = H\log\beta + T\log(1-\beta)$$
$$\frac{d\,[H\log\beta + T\log(1-\beta)]}{d\beta} = \frac{H}{\beta} - \frac{T}{1-\beta} = 0 \implies \beta = \frac{H}{H+T}$$
Example: a k-sided die
Suppose you are rolling a k-sided die with parameters $\theta_1, \theta_2, \ldots, \theta_k$. You observe counts $x_1, x_2, \ldots, x_k$ of each outcome. Log-likelihood of the data:
$$\log p(\text{data}) = \sum_k x_k \log\theta_k$$
Naively setting the derivative to zero fails:
$$\frac{d\log p(\text{data})}{d\theta_k} = \frac{x_k}{\theta_k} = 0 \implies \theta_k = \infty,$$
because it ignores the constraint $\sum_k \theta_k = 1$.
Constrained optimization:
$$\max_{\theta_1,\theta_2,\ldots,\theta_k} \sum_k x_k \log\theta_k \qquad \text{subject to:} \quad \sum_k \theta_k = 1$$
Introduce a Lagrange multiplier $\lambda$:
$$\min_\lambda \max_{\{\theta_1,\theta_2,\ldots,\theta_k\}} \sum_k x_k \log\theta_k + \lambda\left(1 - \sum_k \theta_k\right)$$
Setting the derivative with respect to $\theta_k$ to zero:
$$\frac{x_k}{\theta_k} = \lambda \implies \theta_k = \frac{x_k}{\lambda}, \qquad \lambda = \sum_k x_k \implies \theta_k = \frac{x_k}{\sum_k x_k}$$
The MLE is simply the observed fraction of each outcome.
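A one-line numerical companion (the counts are hypothetical, not from the slides) confirming that the constrained MLE is just the empirical frequency:

```python
import numpy as np

counts = np.array([3, 1, 2, 4, 0, 2])   # hypothetical observed counts for a 6-sided die
theta_mle = counts / counts.sum()       # theta_k = x_k / sum_k x_k
print(theta_mle)                        # frequencies that sum to 1
```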
Naive Bayes models
Consider the binary prediction problem. Let the data be distributed according to a probability distribution:
$$p_\theta(y, \mathbf{x}) = p_\theta(y, x_1, x_2, \ldots, x_D)$$
By the chain rule, this factorizes without any assumptions as:
$$p_\theta(y, \mathbf{x}) = p_\theta(y)\,p_\theta(x_1|y)\,p_\theta(x_2|x_1, y)\cdots = p_\theta(y)\prod_{d=1}^{D} p_\theta(x_d | x_1, x_2, \ldots, x_{d-1}, y)$$
Naive Bayes assumption: the features are conditionally independent given the label,
$$p_\theta(x_d | x_{d'}, y) = p_\theta(x_d | y) \quad \forall d' \neq d,$$
which yields a much simpler distribution:
$$p_\theta(y, \mathbf{x}) = p_\theta(y)\prod_{d=1}^{D} p_\theta(x_d | x_1, x_2, \ldots, x_{d-1}, y) = p_\theta(y)\prod_{d=1}^{D} p_\theta(x_d|y)$$
Case: binary labels and binary features
$$p_\theta(y) = \text{Bernoulli}(\theta_0)$$
$$p_\theta(x_d | y = +1) = \text{Bernoulli}(\theta_d^+) \qquad \text{// label +1}$$
$$p_\theta(x_d | y = -1) = \text{Bernoulli}(\theta_d^-) \qquad \text{// label -1}$$
This model has 1 + 2D parameters. The joint probability is:
$$\begin{aligned}
p_\theta(y, \mathbf{x}) = p_\theta(y)\prod_{d=1}^{D} p_\theta(x_d|y) &= \theta_0^{[y=+1]}(1-\theta_0)^{[y=-1]} \\
&\quad \times \prod_{d=1}^{D} (\theta_d^+)^{[x_d=1,\, y=+1]} (1-\theta_d^+)^{[x_d=0,\, y=+1]} \\
&\quad \times \prod_{d=1}^{D} (\theta_d^-)^{[x_d=1,\, y=-1]} (1-\theta_d^-)^{[x_d=0,\, y=-1]}
\end{aligned}$$
Parameter estimation
Given data, we can estimate the parameters by maximizing the data likelihood. Similar to the coin toss example, the maximum likelihood estimates are:
$$\hat{\theta}_0 = \frac{\sum_n [y_n = +1]}{N} \qquad \text{// fraction of the data with label +1}$$
$$\hat{\theta}_d^+ = \frac{\sum_n [x_{d,n} = 1, y_n = +1]}{\sum_n [y_n = +1]} \qquad \text{// fraction of instances with } x_d = 1 \text{ among label +1}$$
$$\hat{\theta}_d^- = \frac{\sum_n [x_{d,n} = 1, y_n = -1]}{\sum_n [y_n = -1]} \qquad \text{// fraction of instances with } x_d = 1 \text{ among label -1}$$
The conditional independence assumption is the inductive bias of this model.
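A minimal sketch of these counting estimates (function and variable names are mine; it assumes X is an (N, D) binary array and y an (N,) array with labels in {+1, -1}):

```python
import numpy as np

def fit_naive_bayes(X, y):
    pos, neg = (y == +1), (y == -1)
    theta0 = pos.mean()              # fraction of the data with label +1
    theta_pos = X[pos].mean(axis=0)  # per-feature fraction of 1s among label +1
    theta_neg = X[neg].mean(axis=0)  # per-feature fraction of 1s among label -1
    return theta0, theta_pos, theta_neg
# In practice one would smooth the counts to avoid estimates of exactly 0 or 1.
```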
Making predictions
To make predictions, compute the posterior distribution:
$$\hat{y} = \arg\max_y p_\theta(y|\mathbf{x}) = \arg\max_y \frac{p_\theta(y, \mathbf{x})}{p_\theta(\mathbf{x})} = \arg\max_y p_\theta(y, \mathbf{x}) \qquad \text{// Bayes rule; Bayes optimal prediction}$$
Equivalently, threshold the likelihood ratio (LR) or the log-likelihood ratio (LLR):
$$LR = \frac{p_\theta(+1, \mathbf{x})}{p_\theta(-1, \mathbf{x})}, \qquad \hat{y} = \begin{cases} +1 & LR \geq 1 \\ -1 & \text{otherwise} \end{cases}$$
$$LLR = \log p_\theta(+1, \mathbf{x}) - \log p_\theta(-1, \mathbf{x}), \qquad \hat{y} = \begin{cases} +1 & LLR \geq 0 \\ -1 & \text{otherwise} \end{cases}$$
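A sketch of this decision rule (names are mine; parameters are assumed strictly between 0 and 1 so the logs are finite), reusing the estimates from the fitting sketch above:

```python
import numpy as np

def predict(x, theta0, theta_pos, theta_neg):
    # log p(+1, x) and log p(-1, x) under the Bernoulli naive Bayes model
    log_pos = np.log(theta0) + np.sum(x * np.log(theta_pos) + (1 - x) * np.log(1 - theta_pos))
    log_neg = np.log(1 - theta0) + np.sum(x * np.log(theta_neg) + (1 - x) * np.log(1 - theta_neg))
    return +1 if log_pos - log_neg >= 0 else -1   # LLR >= 0 -> predict +1
```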
The naive Bayes classifier has a linear decision boundary!
$$\begin{aligned}
LLR &= \log\left(p_\theta(+1, \mathbf{x})\right) - \log\left(p_\theta(-1, \mathbf{x})\right) \\
&= \log\left(\theta_0 \prod_{d=1}^{D} (\theta_d^+)^{[x_d=1]}(1-\theta_d^+)^{[x_d=0]}\right) - \log\left((1-\theta_0)\prod_{d=1}^{D} (\theta_d^-)^{[x_d=1]}(1-\theta_d^-)^{[x_d=0]}\right) \\
&= \log\theta_0 - \log(1-\theta_0) + \sum_{d=1}^{D} [x_d=1]\left(\log\theta_d^+ - \log\theta_d^-\right) + \sum_{d=1}^{D} [x_d=0]\left(\log(1-\theta_d^+) - \log(1-\theta_d^-)\right) \\
&= \log\left(\frac{\theta_0}{1-\theta_0}\right) + \sum_{d=1}^{D} [x_d=1]\log\left(\frac{\theta_d^+}{\theta_d^-}\right) + \sum_{d=1}^{D} [x_d=0]\log\left(\frac{1-\theta_d^+}{1-\theta_d^-}\right) \\
&= \log\left(\frac{\theta_0}{1-\theta_0}\right) + \sum_{d=1}^{D} x_d\log\left(\frac{\theta_d^+}{\theta_d^-}\right) + \sum_{d=1}^{D} (1-x_d)\log\left(\frac{1-\theta_d^+}{1-\theta_d^-}\right) \\
&= \log\left(\frac{\theta_0}{1-\theta_0}\right) + \sum_{d=1}^{D} x_d\left(\log\left(\frac{\theta_d^+}{\theta_d^-}\right) - \log\left(\frac{1-\theta_d^+}{1-\theta_d^-}\right)\right) + \sum_{d=1}^{D}\log\left(\frac{1-\theta_d^+}{1-\theta_d^-}\right) \\
&= \mathbf{w}^T\mathbf{x} + b
\end{aligned}$$
with $w_d = \log\left(\frac{\theta_d^+}{\theta_d^-}\right) - \log\left(\frac{1-\theta_d^+}{1-\theta_d^-}\right)$ and $b = \log\left(\frac{\theta_0}{1-\theta_0}\right) + \sum_{d=1}^{D}\log\left(\frac{1-\theta_d^+}{1-\theta_d^-}\right)$.
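To make the linearity concrete, here is a small sketch (function and variable names are mine) that converts fitted Bernoulli naive Bayes parameters into the weights derived above:

```python
import numpy as np

def nb_to_linear(theta0, theta_pos, theta_neg):
    # w_d = log(theta_d+/theta_d-) - log((1-theta_d+)/(1-theta_d-))
    w = np.log(theta_pos / theta_neg) - np.log((1 - theta_pos) / (1 - theta_neg))
    # b = log(theta0/(1-theta0)) + sum_d log((1-theta_d+)/(1-theta_d-))
    b = np.log(theta0 / (1 - theta0)) + np.log((1 - theta_pos) / (1 - theta_neg)).sum()
    return w, b

# Prediction is then the LLR rule: y_hat = +1 if w @ x + b >= 0 else -1.
```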
Generative vs. conditional models
Generative models, such as naive Bayes, model the joint distribution $p_\theta(y, \mathbf{x})$.
In most cases we are given x and are only interested in the labels y. Conditional models instead directly model $p_\theta(y | \mathbf{x})$.
Linear regression as a probabilistic model
Assume that y has a linear relationship with x. Generative story of the dataset:
➡ Compute $t_n = \mathbf{w}^T\mathbf{x}_n$
➡ Sample noise $\epsilon_n \sim N(0, \sigma^2)$
➡ Compute $y_n = t_n + \epsilon_n$
This can be written as $y_n \sim N(\mathbf{w}^T\mathbf{x}_n, \sigma^2)$, and
$$p(y_n|\mathbf{x}_n) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-(y_n - \mathbf{w}^T\mathbf{x}_n)^2}{2\sigma^2}\right)$$
$$\log p(D) = \sum_n \frac{-(y_n - \mathbf{w}^T\mathbf{x}_n)^2}{2\sigma^2} + \text{constants}$$
Maximizing the log-likelihood is equivalent to minimizing the squared error.
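A minimal sketch of this equivalence (the data here are synthetic, generated by the story above): the least-squares solver recovers the weights that maximize the Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)     # y_n = w^T x_n + eps_n
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)   # min_w ||y - Xw||^2 = Gaussian MLE
print(w_mle)                                    # close to w_true
```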
Logistic regression
The sigmoid function: $\sigma(z) = \frac{1}{1 + \exp[-z]}$
Generative story of the dataset:
➡ Compute $t_n = \sigma(\mathbf{w}^T\mathbf{x}_n)$
➡ Sample $z_n \sim \text{Bernoulli}(t_n)$
➡ Compute $y_n = 2z_n - 1$
The log-likelihood of the dataset is:
$$\begin{aligned}
\log p(D) &= \sum_n [y_n = +1]\log\sigma(\mathbf{w}^T\mathbf{x}_n) + [y_n = -1]\log(1 - \sigma(\mathbf{w}^T\mathbf{x}_n)) \\
&= \sum_n [y_n = +1]\log\sigma(\mathbf{w}^T\mathbf{x}_n) + [y_n = -1]\log\sigma(-\mathbf{w}^T\mathbf{x}_n) \\
&= \sum_n \log\sigma(y_n\mathbf{w}^T\mathbf{x}_n) \\
&= \sum_n -\log(1 + \exp(-y_n\mathbf{w}^T\mathbf{x}_n)) \\
&= \sum_n -\ell^{(\mathrm{log})}(y_n, \mathbf{w}^T\mathbf{x}_n)
\end{aligned}$$
Maximizing the log-likelihood (ignoring constants) is equivalent to minimizing the logistic loss. This model is also called logistic regression.
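The last line above is exactly the logistic loss. A minimal sketch of the objective being minimized (names are mine; assumes X is an (N, D) array and y has labels in {+1, -1}):

```python
import numpy as np

def logistic_loss(w, X, y):
    margins = y * (X @ w)                      # y_n * w^T x_n
    return np.sum(np.log1p(np.exp(-margins)))  # sum_n log(1 + exp(-y_n w^T x_n))
```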
MLE vs. MAP
MLE can be misleading with little data. Coin toss: {H,H,H,H} → β = 1.
Maximum likelihood estimation (MLE):
$$\arg\max_\theta p(D|\theta) \qquad \text{// likelihood: the probability of the data}$$
Maximum a posteriori (MAP) estimation weighs the likelihood by a prior $p(\theta)$:
$$\arg\max_\theta p(\theta|D) = \arg\max_\theta \frac{p(\theta, D)}{p(D)} = \arg\max_\theta \frac{p(\theta)\,p(D|\theta)}{p(D)}, \qquad p(D) = \int_\theta p(\theta, D)\,d\theta$$
Since $p(D)$ does not depend on $\theta$, this is equivalent to:
$$\arg\max_\theta \left[\log p(\theta) + \log p(D|\theta)\right] \qquad \text{// log-prior + log-likelihood}$$
Beta distribution as a prior on β:
$$\text{Beta}(\beta; a, b) = c\,\beta^{a-1}(1-\beta)^{b-1}, \qquad \text{Mode} = \frac{a-1}{a+b-2}$$
The posterior is again a Beta distribution:
$$p(\beta|D) \propto p(\beta)\,p(D|\beta) \propto \beta^{a-1}(1-\beta)^{b-1}\,\beta^H(1-\beta)^T = \text{Beta}(\beta; a+H, b+T)$$
Comparing the two estimates:
$$\beta_{MAP} = \frac{a + H - 1}{a + H + b + T - 2}, \qquad \beta_{MLE} = \frac{H}{H+T}$$
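A tiny sketch of the MAP estimate, using the mode of the Beta(a + H, b + T) posterior (the prior hyperparameters a = b = 2 are an illustrative assumption): with a proper prior, the all-heads sequence no longer gives β = 1.

```python
def beta_map(H, T, a=2.0, b=2.0):
    # mode of the Beta(a + H, b + T) posterior
    return (a + H - 1) / (a + H + b + T - 2)

print(beta_map(4, 0))   # 5/6 ~ 0.833, versus the MLE 4/4 = 1.0
```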
Conjugate priors
If the prior and posterior are in the same family, then the prior is conjugate to the likelihood. For example, the Dirichlet is conjugate to the multinomial:
$$\text{Dirichlet}(\theta; a_1, a_2, \ldots, a_n) \propto \prod_k \theta_k^{a_k - 1}$$
$$\underbrace{\text{Dirichlet}(\theta; a_1 + k_1, a_2 + k_2, \ldots, a_n + k_n)}_{\text{posterior}} \propto \underbrace{\text{Dirichlet}(\theta; a_1, a_2, \ldots, a_n)}_{\text{prior}} \times \underbrace{\prod_k \theta_k^{k_k}}_{\text{likelihood}}$$
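Conjugacy makes the posterior update a simple addition of counts. A minimal sketch (the prior pseudo-counts and observed counts are hypothetical):

```python
import numpy as np

prior = np.array([2.0, 2.0, 2.0])   # Dirichlet(a_1, a_2, a_3) pseudo-counts
counts = np.array([5, 1, 3])        # observed multinomial counts k_1..k_3
posterior = prior + counts          # Dirichlet(a_1 + k_1, ..., a_3 + k_3)
print(posterior)
```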
MAP estimation for regression
As before, assume that y has a linear relationship with x. Generative story of the dataset:
➡ Compute $t_n = \mathbf{w}^T\mathbf{x}_n$
➡ Sample noise $\epsilon_n \sim N(0, \sigma^2)$
➡ Compute $y_n = t_n + \epsilon_n$, i.e. $y_n \sim N(\mathbf{w}^T\mathbf{x}_n, \sigma^2)$
Assume a Gaussian prior on the weights:
$$p(\mathbf{w}) = N(\mathbf{0}_D, \tau^2 I_D) = c\exp\left(\sum_i \frac{-w_i^2}{2\tau^2}\right)$$
The MAP objective is then:
$$\arg\max_{\mathbf{w}} \sum_i \frac{-w_i^2}{2\tau^2} + \sum_n \frac{-(y_n - \mathbf{w}^T\mathbf{x}_n)^2}{2\sigma^2} + \text{constants}$$
MAP is the same as l2-regularized least-squares regression. Using a Laplace prior instead, $c\exp\left(\sum_i \frac{-|w_i|}{b}\right)$, gives l1 regularization.
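A small sketch of the resulting estimator, under the assumption (mine) that the two variances are folded into a single regularization constant λ = σ²/τ²; the closed form is standard ridge regression:

```python
import numpy as np

def ridge_map(X, y, lam):
    # w = (X^T X + lam I)^{-1} X^T y, the maximizer of the MAP objective
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```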
Non-parametric density estimation
So far we assumed that the probability distribution was parametric, and learning amounted to estimating the parameters of the probability distribution. However, the data distribution can be complicated, and non-parametric density models offer a flexible alternative.
Histograms
The histogram is the simplest example of a non-parametric density model: p(x) is approximated by normalized bin counts. The bin width matters: too large a width over-smooths the density, too small a width makes it noisy.
[Figure: histogram estimates of p(x) with bin widths that are too large and too small.]
Kernel density estimation
Histograms are sums of delta-like functions centered at each point; replacing them with a smoother kernel gives the kernel density estimate:
$$p(x) = \frac{1}{N}\sum_{i=1}^{N} K(x - x_i), \qquad K(x - x_i) = \begin{cases} \frac{1}{b} & |x - x_i| \leq \frac{b}{2} \\ 0 & \text{otherwise} \end{cases}$$
The function K is called the kernel function. These density estimators are also called Parzen window estimators. Set hyperparameters, such as the width b, by cross-validation.
Rectangle kernel:
$$K(x - x_i) = \begin{cases} \frac{1}{b} & |x - x_i| \leq \frac{b}{2} \\ 0 & \text{otherwise} \end{cases}$$
Gaussian kernel:
$$K(x - x_i) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(\frac{-(x - x_i)^2}{2\sigma^2}\right)$$
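A minimal Parzen window sketch covering both kernels above (the data points and bandwidths are illustrative assumptions, not from the slides):

```python
import numpy as np

def kde(x, data, kernel):
    # p(x) = (1/N) sum_i K(x - x_i)
    return np.mean([kernel(x - xi) for xi in data])

def rectangle(u, b=0.5):
    return float(np.abs(u) <= b / 2) / b   # 1/b inside the window, 0 outside

def gaussian(u, sigma=0.3):
    return np.exp(-u**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

data = np.array([0.1, 0.4, 0.5, 0.9])
print(kde(0.5, data, rectangle), kde(0.5, data, gaussian))
```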
Classification with density estimation
Estimate p(x | +1) and p(x | -1) separately, then classify by the likelihood ratio: predict +1 if p(+1)p(x | +1) / p(-1)p(x | -1) ≥ 1. Instead of a fixed kernel width, the width can also be set adaptively, e.g. from the distance to the kth nearest neighbor.
[Figure from Duda et al.: density estimates with small and large kernel widths.]
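A sketch of such a generative classifier built from the Parzen window pieces above (all names are mine): estimate each class-conditional density and compare prior-weighted likelihoods.

```python
import numpy as np

def kde_classify(x, X_pos, X_neg, kernel):
    prior_pos = len(X_pos) / (len(X_pos) + len(X_neg))    # p(+1) from class counts
    lik_pos = np.mean([kernel(x - xi) for xi in X_pos])   # Parzen estimate of p(x | +1)
    lik_neg = np.mean([kernel(x - xi) for xi in X_neg])   # Parzen estimate of p(x | -1)
    return +1 if prior_pos * lik_pos >= (1 - prior_pos) * lik_neg else -1
```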
Summary
Probabilistic modeling views learning as statistical inference.
Two kinds of probability models:
➡ Generative (examples: naive Bayes, kernel density)
➡ Conditional (examples: linear and logistic regression)
Two kinds of data models, with two ways to estimate their parameters:
➡ Parametric: learning by MLE and MAP
➡ Non-parametric: learning by cross-validation
Credits
Figures of logistic and linear regression are from Wikipedia. The figure of the beta distribution is from Wikipedia. Figures for kernel density estimation are from http://www.mglerner.com/blog/?p=28 (the page has an interactive demo). The Parzen window figure is from "Pattern Classification", Duda, Hart & Stork. Some slides are based on the CIML book by Hal Daumé III.