6 Bayesian Kernel Methods
6.1 Gaussians
Alexander Smola, Introduction to Machine Learning 10-701
http://alex.smola.org/teaching/10-701-15
The Normal Distribution
http://www.gaussianprocess.org/gpml/chapters/
(Figure: samples from a Gaussian in R^2)
Univariate normal density:
p(x) = (2πσ²)^{-1/2} exp(-(x − µ)² / (2σ²))
Multivariate normal density, using the eigendecomposition Σ = U^⊤ΛU:
p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp(-½ (U(x − µ))^⊤ Λ^{-1} (U(x − µ)))
or, in general,
p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp(-½ (x − µ)^⊤ Σ^{-1} (x − µ))
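As a quick numerical check, here is a minimal Octave/MATLAB sketch (x, mu, Sigma are assumed given, with x and mu column vectors) that evaluates the log of this density via a Cholesky factor:

% evaluate log N(x | mu, Sigma) for a column vector x in R^d
d = length(x);
L = chol(Sigma, 'lower');                  % Sigma = L*L'
z = L \ (x - mu);                          % whitened residual: z'*z = (x-mu)'*inv(Sigma)*(x-mu)
logp = -0.5*d*log(2*pi) - sum(log(diag(L))) - 0.5*(z'*z);   % -sum(log(diag(L))) = log |Sigma|^(-1/2)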
Examples of sufficient statistics in the exponential family:
φ(x) = x
φ(x) = e_x (e_x is the unit vector for x)
φ(x) = (x, ½ x x^⊤)
Wishart, ...
The maximum likelihood estimate matches expected and empirical sufficient statistics:
−∂_θ log p(X; θ) = m [E[φ(x)] − (1/m) Σ_{i=1}^m φ(x_i)]
With the eigendecomposition Σ = U^⊤ΛU (the rows of U are the principal components), the density factorizes as
p(x) = (2π)^{-d/2} Π_{i=1}^d Λ_ii^{-1/2} exp(-½ (U(x − µ))^⊤ Λ^{-1} (U(x − µ)))
Central limit theorem: averages behave like Gaussians.
The Gaussian is the maximum-entropy distribution for a given mean and covariance.
Empirical estimates:
µ = (1/m) Σ_{i=1}^m x_i   and   Σ = (1/m) Σ_{i=1}^m x_i x_i^⊤ − µµ^⊤
% X: d x m data matrix, m: sample size
mu = (1/m) * sum(X, 2);
sigma = (1/m) * (X * X') - mu * mu';
Sampling from x ∼ N(µ, Σ): draw z ∼ N(0, 1) and set x = µ + Lz, where Σ = LL^⊤ (e.g. via a Cholesky factorization). Check:
E[(x − µ)(x − µ)^⊤] = E[Lzz^⊤L^⊤] = L E[zz^⊤] L^⊤ = LL^⊤ = Σ
Sampling a 2D standard normal (Box-Muller): the density
p(x) = (1/2π) exp(-½ ‖x‖²)
becomes, in polar coordinates (φ, r),
p(φ, r) = (r/2π) exp(-½ r²)
with cumulative distribution function
F(φ, r) = (φ/2π) · [1 − exp(-½ r²)]
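In code, the x = µ + Lz recipe is a one-liner once the Cholesky factor is available. A minimal sketch in the same Octave/MATLAB style as the other snippets (mu and Sigma are assumed given):

d = length(mu);
L = chol(Sigma, 'lower');    % Sigma = L*L'
z = randn(d, 1);             % z ~ N(0, 1), i.i.d. coordinates
x = mu + L*z;                % x ~ N(mu, Sigma)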
Draw the radial and angle components separately:
tmp1 = rand();
tmp2 = rand();
r = sqrt(-2*log(tmp1));
x1 = r*sin(2*pi*tmp2);
x2 = r*cos(2*pi*tmp2);
Why can we use tmp1 instead of 1-tmp1?
Data matrix: rows are instances, columns are features.

      H-WBC    H-RBC    H-Hgb    H-Hct    H-MCV     H-MCH    H-MCHC
A1    8.0000   4.8200   14.1000  41.0000  85.0000   29.0000  34.0000
A2    7.3000   5.0200   14.7000  43.0000  86.0000   29.0000  34.0000
A3    4.3000   4.4800   14.1000  41.0000  91.0000   32.0000  35.0000
A4    7.5000   4.4700   14.9000  45.0000  101.0000  33.0000  33.0000
A5    7.3000   5.5200   15.4000  46.0000  84.0000   28.0000  33.0000
A6    6.9000   4.8600   16.0000  47.0000  97.0000   33.0000  34.0000
A7    7.8000   4.6800   14.7000  43.0000  92.0000   31.0000  34.0000
A8    8.6000   4.8200   15.8000  42.0000  88.0000   33.0000  37.0000
A9    5.1000   4.7100   14.0000  43.0000  92.0000   30.0000  32.0000
(Figure: one plot per person, measurement value vs. measurement index)
(Figure: bivariate scatter plot of C-Triglycerides vs. C-LDH; trivariate scatter plot of C-Triglycerides, C-LDH, M-EPI)
Even 3 dimensions are already difficult. How to extend this?
PCA as the best rank-k projection:
minimize over rank-k projections P
(1/m) Σ_{i=1}^m ‖x_i − P x_i‖²   where (1/m) Σ_{i=1}^m x_i = µ (assume centering, i.e. µ = 0)
Expanding the objective,
(1/m) Σ_{i=1}^m ‖x_i − P x_i‖² = tr[(1/m) Σ_{i=1}^m (x_i x_i^⊤ − P x_i x_i^⊤ P^⊤)] = tr Σ − tr PΣP^⊤
so minimizing the reconstruction error amounts to maximizing tr PΣP^⊤.
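A minimal Octave/MATLAB sketch of this computation (X is the d x m data matrix as in the earlier snippet, k is the target rank; the variable names are illustrative):

mu = mean(X, 2);                          % empirical mean
Xc = bsxfun(@minus, X, mu);               % center the data
Sigma = (1/m) * (Xc * Xc');               % empirical covariance
[V, Lambda] = eig(Sigma);                 % Sigma = V*Lambda*V'
[~, idx] = sort(diag(Lambda), 'descend'); % sort eigenvalues, largest first
Vk = V(:, idx(1:k));                      % top-k principal directions
P = Vk * Vk';                             % rank-k projection maximizing tr(P*Sigma*P')
Xproj = P * Xc;                           % projected (centered) data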
Residual error of the rank-k projection:
Residual = tr Σ − tr PΣP^⊤ = Σ_{i=1}^d σ_i² − Σ_{i=1}^k σ_i² = Σ_{i=k+1}^d σ_i²
(the σ_i² are the eigenvalues of Σ, sorted in decreasing order).
Noise model: x = z + ε where z ∼ N(µ, Σ) and ε ∼ N(0, σ²1).
The observed covariance is Σ + σ²1, so every eigenvalue is shifted to σ_i² + σ².
Discard everything below the noise threshold.
Example: height and weight, assuming a jointly Gaussian (correlated) model:
p(weight | height) = p(height, weight) / p(height) ∝ p(height, weight)
In general, for x = (x₁, x₂) jointly Gaussian,
p(x₂ | x₁) ∝ exp(-½ (x₁ − µ₁, x₂ − µ₂)^⊤ [Σ₁₁ Σ₁₂; Σ₂₁ Σ₂₂]^{-1} (x₁ − µ₁, x₂ − µ₂))
To read off the conditional, keep only the terms of the exponent that are linear and quadratic in x₂.
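For concreteness, a small numeric sketch of this conditioning step for the height/weight example (all numbers are made up for illustration, not fit to data):

mu    = [170; 70];               % illustrative mean [height (cm); weight (kg)]
Sigma = [100 40; 40 36];         % illustrative joint covariance
h     = 180;                     % observed height x1
% p(x2 | x1 = h) is Gaussian with
mu_2given1  = mu(2) + Sigma(2,1) / Sigma(1,1) * (h - mu(1));       % conditional mean
var_2given1 = Sigma(2,2) - Sigma(2,1) / Sigma(1,1) * Sigma(1,2);   % conditional variance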
Correlated Observations
Assume that the random variables t ∈ R^n, t' ∈ R^{n'} are jointly normal with mean (µ, µ') and covariance matrix K:
p(t, t') ∝ exp(-½ (t − µ, t' − µ')^⊤ [K_tt, K_tt'; K_tt'^⊤, K_t't']^{-1} (t − µ, t' − µ'))
Inference
Given t, estimate t' via p(t'|t). Translated into machine learning language: we learn t' from t.
Practical Solution
Since t'|t ∼ N(µ̃, K̃), we only need to collect all terms in p(t, t') depending on t' by matrix inversion, hence
K̃ = K_t't' − K_tt'^⊤ K_tt^{-1} K_tt'   and   µ̃ = µ' + K_tt'^⊤ [K_tt^{-1}(t − µ)]
where the bracketed term is independent of t'.
Handbook of Matrices, Lütkepohl 1997 (a big timesaver).
Putting things together, we use
p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp(-½ (x − µ)^⊤ Σ^{-1} (x − µ))
where samples x ∼ N(µ, Σ) are drawn via x = µ + Lz with z ∼ N(0, 1) and Σ = LL^⊤, the parameters are estimated by
µ = (1/m) Σ_{i=1}^m x_i   and   Σ = (1/m) Σ_{i=1}^m x_i x_i^⊤ − µµ^⊤,
and we condition via
p(x₂ | x₁) ∝ exp(-½ (x₁ − µ₁, x₂ − µ₂)^⊤ [Σ₁₁ Σ₁₂; Σ₂₁ Σ₂₂]^{-1} (x₁ − µ₁, x₂ − µ₂)).
Key Idea
Instead of a fixed set of random variables t, t' we assume a stochastic process t: X → R, e.g. X = R^n. Previously we had X = {age, height, weight, ...}.
Definition of a Gaussian Process
A stochastic process t: X → R where all (t(x₁), ..., t(x_m)) are jointly normally distributed.
Parameters of a GP
Mean µ(x) := E[t(x)]
Covariance function k(x, x') := Cov(t(x), t(x'))
Simplifying Assumption
We assume knowledge of k(x, x') and set µ = 0.
Evaluated at finitely many points x₁, ..., x_m, the process has density
p(t|X) = (2π)^{-m/2} |K|^{-1/2} exp(-½ (t − µ)^⊤ K^{-1} (t − µ))   where K_ij = k(x_i, x_j) and µ_i = µ(x_i).
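To get a feeling for this prior, here is a short sketch that draws a few functions from a zero-mean GP, assuming (purely as an illustration) a Gaussian RBF covariance function k(x, x') = exp(-(x − x')² / (2 ell²)):

n   = 200;
x   = linspace(-3, 3, n)';                          % evaluation points
ell = 0.5;                                          % length scale (assumed)
K   = exp(-bsxfun(@minus, x, x').^2 / (2*ell^2));   % Kij = k(xi, xj)
L   = chol(K + 1e-8*eye(n), 'lower');               % small jitter for numerical stability
t   = L * randn(n, 3);                              % three draws from N(0, K)
plot(x, t);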
Covariance Function
- Function of two arguments
- Leads to a matrix with nonnegative eigenvalues
- Describes the correlation between pairs of observations
Kernel
- Function of two arguments
- Leads to a matrix with nonnegative eigenvalues
- Similarity measure between pairs of observations
Lucky Guess
We suspect that kernels and covariance functions are the same ... and yes, they are!
Example with X = {height, weight}:
p(weight | height) = p(height, weight) / p(height) ∝ p(height, weight)
Good for experimental design.
As before, conditioning the joint
p(t, t') ∝ exp(-½ (t − µ, t' − µ')^⊤ [K_tt, K_tt'; K_tt'^⊤, K_t't']^{-1} (t − µ, t' − µ'))
yields the inference rule
K̃ = K_t't' − K_tt'^⊤ K_tt^{-1} K_tt'   and   µ̃ = µ' + K_tt'^⊤ [K_tt^{-1}(t − µ)].
Linear kernel: k(x, x') = ⟨x, x'⟩, so the kernel matrix is X^⊤X.
Mean and covariance:
K̃ = X'^⊤X' − X'^⊤X(X^⊤X)^{-1}X^⊤X' = X'^⊤(1 − P_X)X'
µ̃ = X'^⊤[X(X^⊤X)^{-1}t]
µ̃ is a linear function of X'.
Problem
The covariance matrix X^⊤X has at most rank n. After n observations (x ∈ R^n) the variance vanishes. This is not realistic: a "flat pancake" or "cigar" distribution.
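A tiny numerical illustration of this degeneracy, using the noise-free conditioning formula above with two generic observations in R^2 (the inputs are arbitrary illustrative values):

X   = [1 0; 0 1]';                   % two training inputs in R^2 (columns)
xs  = [0.3; -0.8];                   % a test input
Ktt = X' * X;                        % training kernel matrix (linear kernel)
Kts = X' * xs;                       % cross-covariances
Kss = xs' * xs;
postvar = Kss - Kts' * (Ktt \ Kts)   % predictive variance: 0 up to round-off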
(Graphical model: input x, latent function value t, noisy observation y; 'fatten up' the covariance.)
t ∼ N(µ, K),   y_i ∼ N(t_i, σ²)
Indirect Model
Instead of observing t(x) we observe y = t(x) + ξ, where ξ is a nuisance term. This yields
p(Y|X) = ∫ Π_{i=1}^m p(y_i|t_i) p(t|X) dt
where we can now find a maximum a posteriori solution for t by maximizing the integrand (we will use this later).
Additive Normal Noise
If ξ ∼ N(0, σ²) then y is the sum of two Gaussian random variables. Means and variances add up: y ∼ N(µ, K + σ²1).
Predictive mean at a test point x:  k(x, X)^⊤ (K(X, X) + σ²1)^{-1} y
Covariance Matrices
Additive noise: K = K_kernel + σ²1.
Predictive mean and variance:
K̃ = K_t't' − K_tt'^⊤ K_tt^{-1} K_tt'   and   µ̃ = K_tt'^⊤ K_tt^{-1} t
Pointwise prediction.
With Noise
K̃ = K_t't' + σ²1 − K_tt'^⊤ [K_tt + σ²1]^{-1} K_tt'   and   µ̃ = µ' + K_tt'^⊤ [K_tt + σ²1]^{-1} (y − µ)
ktrtr = k(xtrain, xtrain) + sigma2 * eye(mtr);   % training covariance (with noise)
ktetr = k(xtest, xtrain);                        % test/train cross-covariance
ktete = k(xtest, xtest);                         % test covariance
alpha = ktrtr \ ytr;           % better if you use a Cholesky factorization
yte = ktetr * alpha;           % predictive mean
sigmate = ktete + sigma2 * eye(mte) - ktetr * (ktrtr \ ktetr');   % predictive covariance
Gaussian Process on Parameters
t ∼ N(µ, K) where K_ij = k(x_i, x_j).
Linear Model in Feature Space
t(x) = ⟨Φ(x), w⟩ + µ(x) where w ∼ N(0, 1).
The covariance between t(x) and t(x') is then
E_w[⟨Φ(x), w⟩⟨w, Φ(x')⟩] = ⟨Φ(x), Φ(x')⟩ = k(x, x')
Conclusion
A linear model in feature space induces a Gaussian process.
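A quick Monte Carlo check of this identity with a toy explicit feature map (the map phi and the test points are arbitrary choices for illustration):

phi = @(x) [x; x.^2 / sqrt(2)];           % toy feature map (assumption)
x1 = 0.7; x2 = -1.3;
W  = randn(length(phi(x1)), 1e6);         % samples of w ~ N(0, 1)
t1 = phi(x1)' * W;  t2 = phi(x2)' * W;    % t(x) = <phi(x), w> for each sample of w
cov_mc  = mean(t1 .* t2)                  % Monte Carlo estimate of Cov(t(x1), t(x2))
cov_ker = phi(x1)' * phi(x2)              % k(x1, x2); agrees up to Monte Carlo error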
With additive observation noise: µ → µ and K → K + σ²1.
Regression: t ∼ N(µ, K) and y_i ∼ N(t_i, σ²), i.e.
p(y_i|t_i) = (2πσ²)^{-1/2} exp(-(y_i − t_i)² / (2σ²))
Classification: t ∼ N(µ, K) and
p(y_i|t_i) = 1 / (1 + exp(−y_i t_i))
This is the logistic function. As an exponential family with φ(x) = x and x ∈ {−1, 1}:
g(θ) = log[e^{−1·θ} + e^{1·θ}] = log 2 cosh θ
p(x|θ) = exp(x·θ − g(θ)) = e^{xθ} / (e^{−θ} + e^{θ}) = 1 / (1 + e^{−2xθ})
Regression: t ∼ N(µ, K) and y_i ∼ N(t_i, σ²), hence y ∼ N(µ, K + σ²1); here we can integrate out the latent variable t.
Classification: t ∼ N(µ, K) and y_i ∼ Logistic(t_i); a closed-form solution is not possible (we cannot solve the integral over t).
Posterior over the latent function values:
p(t|y, x) ∝ p(t|x) Π_{i=1}^m p(y_i|t_i) ∝ exp(-½ t^⊤ K^{-1} t) Π_{i=1}^m 1 / (1 + e^{−y_i t_i})
Exact inference in this model is very, very expensive (e.g. MCMC).
MAP approximation: instead of the exact predictive distribution
p(y'|y, x, x') = ∫ d(t, t') p(y'|t') p(y|t) p(t, t'|x, x')
we plug in the mode:
t̂ := argmax_t p(y|t) p(t|x)
t̂'(x') := argmax_{t'} p(t̂, t'|x, x')
y'|y, x, x' ∼ Logistic(t̂'(x'))
Taking negative logarithms, the MAP estimate solves
minimize_t   ½ t^⊤ K^{-1} t + Σ_{i=1}^m log(1 + exp(−y_i t_i))
For prediction, compute
t' = K_tt'^⊤ K_tt^{-1} t   (precompute K_tt^{-1} t)
and then
p(y'|t') = 1 / (1 + e^{−y' t'})
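A minimal gradient-descent sketch for this MAP problem (K is the m x m kernel matrix and y is an m-vector of +1/-1 labels; the step size and iteration count are ad hoc choices, and in practice one would take Newton steps instead):

alpha = zeros(m, 1);                     % parametrize t = K*alpha, i.e. alpha = inv(K)*t
eta   = 0.1 / m;                         % step size (assumption)
for iter = 1:5000
  t     = K * alpha;
  s     = 1 ./ (1 + exp(y .* t));        % sigma(-y_i * t_i)
  grad  = K * (alpha - y .* s);          % gradient of 1/2 a'Ka + sum_i log(1+exp(-y_i t_i))
  alpha = alpha - eta * grad;
end
t_map = K * alpha;                       % MAP estimate of t at the training points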
Compare the support vector machine and GP classification objectives, written with t = Kα, i.e. α = K^{-1}t:
SVM:   minimize_α   ½ α^⊤Kα + Σ_{i=1}^m max(0, 1 − y_i [Kα]_i)
GP classification (MAP):   minimize_α   ½ α^⊤Kα + Σ_{i=1}^m log(1 + exp(−y_i [Kα]_i))
The losses compared as a function of the margin f(x):
Quadratic soft margin:  0 if f(x) > 1;  ½(1 − f(x))² if f(x) ∈ [0, 1];  ½ − f(x) if f(x) < 0
Hinge (soft margin):  max(0, 1 − f(x))
Logistic:  log[1 + e^{−f(x)}], which is (asymptotically) linear for f(x) → −∞ and (asymptotically) 0 for f(x) → +∞
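To compare the three losses numerically (or plot them), a small sketch evaluating them on a grid of margin values f:

f = linspace(-3, 3, 601)';
hinge    = max(0, 1 - f);                           % soft-margin (hinge) loss
logistic = log(1 + exp(-f));                        % logistic loss
quadsoft = 0.5*(1 - f).^2 .* (f >= 0 & f <= 1) ...  % piecewise-quadratic soft margin,
         + (0.5 - f) .* (f < 0);                    % zero for f > 1
plot(f, [hinge logistic quadsoft]);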
(with many hacks to make it scale)
(different loss and parametrization)
α = K−1t
Further reading:
- Support Vector Machines: ftp://publications.ai.mit.edu/ai-publications/pdf/AIM-1606.pdf
- Regularization operators and support vector kernels: http://alex.smola.org/teaching/berkeley2012/slides/Smola1998connection.pdf
- Kernel eigenvalue problem: http://www.mitpressjournals.org/doi/abs/10.1162/089976698300017467
- alex.smola.org/papers/2001/SchHerSmo01.pdf
- http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2007_1047.pdf
- http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.3414