Exponential Families and Kernels Lecture 1
Alexander J. Smola, Alex.Smola@nicta.com.au
Machine Learning Program, National ICT Australia
RSISE, The Australian National University
Outline
Exponential Families
  - Maximum likelihood and Fisher information
  - Priors (conjugate and normal)
Conditioning and Feature Spaces
  - Conditional distributions and inner products
  - Clifford Hammersley decomposition
Applications
  - Classification and novelty detection
  - Regression
Applications
  - Conditional random fields
  - Intractable models and semidefinite approximations
Lecture 1
Model
  - Log-partition function
  - Expectations and derivatives
  - Maximum entropy formulation
Examples
  - Normal distribution
  - Discrete events
  - Laplace distribution
  - Poisson distribution
  - Beta distribution
Estimation
  - Maximum likelihood estimator
  - Fisher information matrix and Cramer Rao theorem
  - Normal priors and conjugate priors
The Exponential Family
Definition
A family of probability distributions which satisfy
  p(x; θ) = exp(⟨φ(x), θ⟩ − g(θ))
Details
  - φ(x) is called the sufficient statistics of x.
  - X is the domain from which x is drawn (x ∈ X).
  - g(θ) is the log-partition function; it ensures that the distribution integrates out to 1:
      g(θ) = log ∫_X exp(⟨φ(x), θ⟩) dx
Example: Binomial Distribution
Tossing coins
With probability p we see heads and with probability 1 − p we see tails. So we have
  p(x) = p^x (1 − p)^(1−x)   where x ∈ {0, 1} =: X
Massaging the math
  p(x) = exp(log p(x)) = exp(x log p + (1 − x) log(1 − p))
       = exp(⟨(x, 1 − x), (log p, log(1 − p))⟩)
with sufficient statistics φ(x) = (x, 1 − x) and natural parameter θ = (log p, log(1 − p)).
The normalization
Once we relax the restriction and allow arbitrary θ ∈ R², we need g(θ), which yields
  g(θ) = log(e^θ1 + e^θ2)
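As a quick numerical illustration (my own sketch, not part of the slides), the following Python snippet maps p to the natural parameter θ = (log p, log(1 − p)), computes g(θ) = log(e^θ1 + e^θ2), and checks that exp(⟨φ(x), θ⟩ − g(θ)) reproduces p^x (1 − p)^(1−x):

```python
import numpy as np

def phi(x):
    # sufficient statistics of the coin toss: (x, 1 - x)
    return np.array([x, 1 - x])

def g(theta):
    # log-partition function once theta ranges over all of R^2
    return np.log(np.exp(theta).sum())

p = 0.3
theta = np.array([np.log(p), np.log(1 - p)])   # natural parameter

for x in (0, 1):
    exp_family = np.exp(phi(x) @ theta - g(theta))
    direct = p**x * (1 - p)**(1 - x)
    print(x, exp_family, direct)               # the two columns agree
```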
Example: Laplace Distribution
Atomic decay
At any time, with probability θ dx an atom will decay in the time interval [x, x + dx] if it still exists. Consulting your physics book tells us that this gives the density
  p(x) = θ exp(−θx)   where x ∈ [0, ∞) =: X
Massaging the math
  p(x) = exp(⟨−x, θ⟩ − (− log θ))
with φ(x) = −x and g(θ) = − log θ.
Example: Normal Distribution
Engineer's favorite
  p(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))   where x ∈ R =: X
Massaging the math
  p(x) = exp(−x²/(2σ²) + (µ/σ²) x − µ²/(2σ²) − ½ log(2πσ²))
       = exp(⟨(x, x²), θ⟩ − g(θ))
with φ(x) = (x, x²) and g(θ) = µ²/(2σ²) + ½ log(2πσ²).
Finally we need to solve (µ, σ²) for θ. Tedious algebra yields θ1 := µσ⁻² and θ2 := −½σ⁻². We have
  g(θ) = −θ1²/(4θ2) + ½ log 2π − ½ log(−2θ2)
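A small sanity check (my own sketch, not from the slides): convert (µ, σ²) to the natural parameters θ1 = µ/σ², θ2 = −1/(2σ²), evaluate g(θ) as above, and compare the exponential-family density against the textbook Gaussian.

```python
import numpy as np

def gaussian_natural_params(mu, sigma2):
    return np.array([mu / sigma2, -0.5 / sigma2])

def g(theta):
    # g(theta) = -theta1^2 / (4 theta2) + 1/2 log 2 pi - 1/2 log(-2 theta2)
    return (-theta[0]**2 / (4 * theta[1])
            + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-2 * theta[1]))

mu, sigma2 = 1.5, 0.7                         # illustrative values
theta = gaussian_natural_params(mu, sigma2)

x = np.linspace(-2, 5, 7)
phi = np.stack([x, x**2], axis=1)             # sufficient statistics (x, x^2)
p_expfam = np.exp(phi @ theta - g(theta))
p_direct = np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
print(np.allclose(p_expfam, p_direct))        # True
```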
Example: Multinomial Distribution
Many discrete events
Assume that we have disjoint events [1..n] =: X, each of which may occur with a certain probability p_x.
Guessing the answer
Use the map φ : x → e_x, that is, e_x is an element of the canonical basis (0, ..., 0, 1, 0, ..., 0). This gives
  p(x) = exp(⟨e_x, θ⟩ − g(θ))
where the normalization is
  g(θ) = log Σ_{i=1}^n exp(θ_i)
Example: Poisson Distribution
Limit of the Binomial distribution
Probability of observing x ∈ N₀ events which are all independent (e.g. raindrops per square meter, crimes per day, cancer incidents):
  p(x) = exp(xθ − log Γ(x + 1) − exp(θ))
Hence φ(x) = x and g(θ) = e^θ.
Differences
  - We have a normalization dependent on x alone, namely Γ(x + 1). This leaves the rest of the theory unchanged.
  - The domain is countably infinite. Effectively this assumes the measure 1/x! on the domain N₀.
Example: Beta Distribution
Usage
Often used as a prior on Binomial distributions (it is a conjugate prior, as we will see later).
Mathematical form
  p(x) = exp(⟨(log x, log(1 − x)), (θ1, θ2)⟩ − log B(θ1 + 1, θ2 + 1))
where the domain is x ∈ [0, 1] and
  g(θ) = log B(θ1 + 1, θ2 + 1) = log Γ(θ1 + 1) + log Γ(θ2 + 1) − log Γ(θ1 + θ2 + 2)
Here B(α, β) is the Beta function.
Example: Gamma Distribution
Usage
  - Popular as a prior on coefficients
  - Obtained from the integral over waiting times in a Poisson process
Mathematical form
  p(x) = exp(⟨(log x, x), (θ1, θ2)⟩ − g(θ))
where the domain is x ∈ [0, ∞) and
  g(θ) = log Γ(θ1 + 1) − (θ1 + 1) log(−θ2)
Note that θ ∈ [0, ∞) × (−∞, 0).
Zoology of Exponential Families
Name         φ(x)                   Domain              Measure
Binomial     (x, 1 − x)             {0, 1}              discrete
Multinomial  e_x                    {1, ..., n}         discrete
Poisson      x                      N₀                  discrete
Laplace      x                      [0, ∞)              Lebesgue
Normal       (x, x²)                R                   Lebesgue
Beta         (log x, log(1 − x))    [0, 1]              Lebesgue
Gamma        (log x, x)             [0, ∞)              Lebesgue
Wishart      (log |X|, X)           X ≻ 0               Lebesgue
Dirichlet    log x                  x ∈ Rⁿ₊, ‖x‖₁ = 1   Lebesgue
Recall
Definition
A family of probability distributions which satisfy
  p(x; θ) = exp(⟨φ(x), θ⟩ − g(θ))
Details
  - φ(x) is called the sufficient statistics of x.
  - X is the domain from which x is drawn (x ∈ X).
  - g(θ) is the log-partition function; it ensures that the distribution integrates out to 1:
      g(θ) = log ∫_X exp(⟨φ(x), θ⟩) dx
Benefits: Log-partition function is nice
g(θ) generates cumulants:
  g(θ) = log ∫ exp(⟨φ(x), θ⟩) dx
Taking the derivative with respect to θ we see that
  ∂_θ g(θ) = ∫ φ(x) exp(⟨φ(x), θ⟩) dx / ∫ exp(⟨φ(x), θ⟩) dx = E_{x∼p(x;θ)}[φ(x)]
  ∂²_θ g(θ) = Cov_{x∼p(x;θ)}[φ(x)]
... and so on for higher order cumulants ...
Corollary: g(θ) is convex.
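A quick numerical check of the first identity for the multinomial family, where g(θ) = log Σ_i e^{θ_i} (my own illustration; the values of θ and the finite-difference step are arbitrary): the gradient of g is the softmax probability vector, which is exactly E[φ(x)] = E[e_x].

```python
import numpy as np

def g(theta):
    # log-partition function of the multinomial family
    return np.log(np.exp(theta).sum())

theta = np.array([0.2, -1.0, 0.5, 1.3])
probs = np.exp(theta - g(theta))            # p(x = i) = exp(theta_i - g(theta))

# finite-difference gradient of g
h = 1e-6
grad = np.array([(g(theta + h * e) - g(theta - h * e)) / (2 * h)
                 for e in np.eye(len(theta))])

print(np.allclose(grad, probs, atol=1e-5))  # True: grad g(theta) = E[phi(x)]
```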
Benefits: Simple Estimation
Likelihood of a set
Given X := {x1, ..., xm} we get
  p(X; θ) = ∏_{i=1}^m p(xi; θ) = exp(Σ_{i=1}^m ⟨φ(xi), θ⟩ − m g(θ))
Maximum likelihood
We want to minimize the negative log-likelihood, i.e.
  minimize_θ  g(θ) − (1/m) Σ_{i=1}^m ⟨φ(xi), θ⟩
  ⇒  E_{x∼p(x;θ)}[φ(x)] = ∂_θ g(θ) = (1/m) Σ_{i=1}^m φ(xi) =: µ
Solving the maximum likelihood problem is easy.
Application: Laplace distribution
Estimate the decay constant of an atom: we use the exponential family notation where
  p(x; θ) = exp(⟨−x, θ⟩ − (− log θ))
Computing µ
Since φ(x) = −x, all we need to do is average over all observed decay times.
Solving for maximum likelihood
The maximum likelihood condition yields
  µ = ∂_θ g(θ) = ∂_θ(− log θ) = −1/θ
This leads to θ = −1/µ.
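In code the estimator is a single line. The sketch below is mine, with a made-up true decay rate: draw decay times, average the sufficient statistic φ(x) = −x, and solve θ = −1/µ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.5                                           # assumed decay rate, for illustration
x = rng.exponential(scale=1.0 / theta_true, size=10_000)   # simulated decay times

mu = np.mean(-x)           # empirical mean of the sufficient statistic phi(x) = -x
theta_hat = -1.0 / mu      # maximum likelihood estimate
print(theta_hat)           # close to 2.5
```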
Benefits: Maximum Entropy Estimate
Entropy
Basically it is the number of bits needed to encode a random variable. It is defined as
  H(p) = −∫ p(x) log p(x) dx   where we set 0 log 0 := 0
Maximum entropy density
The density p(x) satisfying E[φ(x)] ≥ η with maximum entropy is exp(⟨φ(x), θ⟩ − g(θ)).
Corollary: The most vague density with a given variance is the Gaussian distribution.
Corollary: The most vague density with a given mean is the Laplace distribution.
Using it
Observe
Data x1, ..., xm drawn from the distribution p(x|θ).
Compute likelihood
  p(X|θ) = ∏_{i=1}^m exp(⟨φ(xi), θ⟩ − g(θ))
Maximize it
Take the negative log and minimize, which leads to
  ∂_θ g(θ) = (1/m) Σ_{i=1}^m φ(xi)
This can be solved analytically or (whenever this is impossible or we are lazy) by Newton's method.
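When ∂_θ g(θ) = µ has no convenient closed form, Newton's method works directly on the negative log-likelihood: the gradient is ∂_θ g(θ) − µ and the Hessian is ∂²_θ g(θ) = Cov[φ(x)]. The sketch below is mine and uses the multinomial family (where everything is actually available in closed form) just to show the mechanics; the small ridge term handles the fact that θ is only identified up to an additive constant.

```python
import numpy as np

def g(theta):
    return np.log(np.exp(theta).sum())

def newton_mle(mu, steps=20):
    """Solve grad g(theta) = mu for the multinomial family by Newton's method."""
    theta = np.zeros_like(mu)
    for _ in range(steps):
        p = np.exp(theta - g(theta))           # grad g(theta) = E[phi(x)]
        hess = np.diag(p) - np.outer(p, p)     # Hessian = Cov[phi(x)]
        theta -= np.linalg.solve(hess + 1e-8 * np.eye(len(p)), p - mu)
    return theta

counts = np.array([3, 6, 2, 1, 4, 4])
mu = counts / counts.sum()                     # empirical mean of phi(x) = e_x
theta = newton_mle(mu)
print(np.exp(theta - g(theta)))                # recovers the relative frequencies
```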
Application: Discrete Events
Simple data
Discrete random variables (e.g. tossing a die).
  Outcome        1     2     3     4     5     6
  Counts         3     6     2     1     4     4
  Probabilities  0.15  0.30  0.10  0.05  0.20  0.20
Maximum likelihood solution
Count the number of outcomes and use the relative frequency of occurrence as the estimate for the probability:
  p_emp(x) = #x / m
Problems
  - Bad idea if we have few data.
  - Bad idea if we have continuous random variables.
Tossing a die
Fisher Information and Efficiency
Fisher score
  V_θ(x) := ∂_θ log p(x; θ)
This tells us the influence of x on estimating θ. Its expected value vanishes, since
  E[∂_θ log p(X; θ)] = ∫ p(X; θ) ∂_θ log p(X; θ) dX = ∂_θ ∫ p(X; θ) dX = 0.
Fisher information matrix
It is the covariance matrix of the Fisher scores, that is
  I := Cov[V_θ(x)]
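To make the definition concrete, here is a small Monte Carlo sketch of mine for a normal distribution with known variance (the parameter values are invented): it checks that the score has zero mean and that its variance matches the analytic Fisher information 1/σ² for the mean parameter.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.5, 2.0                       # illustrative values
x = rng.normal(mu, sigma, size=200_000)

# Fisher score for the mean parameter of N(mu, sigma^2):
# V_mu(x) = d/dmu log p(x; mu) = (x - mu) / sigma^2
scores = (x - mu) / sigma**2

print(scores.mean())   # approximately 0: the expected score vanishes
print(scores.var())    # approximately 1 / sigma^2 = 0.25: the Fisher information
```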
Cramer Rao Theorem
Efficiency
Covariance of the estimator θ̂(X), rescaled by I:
  1/e := det(Cov[θ̂(X)] · Cov[∂_θ log p(X; θ)])
Theorem
The efficiency of unbiased estimators is never better (i.e. larger) than 1. Equality is achieved for MLEs.
Proof (scalar case only)
By Cauchy-Schwarz we have
  (E_θ[(V_θ(X) − E_θ[V_θ(X)]) (θ̂(X) − E_θ[θ̂(X)])])² ≤ E_θ[(V_θ(X) − E_θ[V_θ(X)])²] · E_θ[(θ̂(X) − E_θ[θ̂(X)])²] = I B
where B = Cov[θ̂(X)].
Cramer Rao Theorem
Proof
At the same time, E_θ[V_θ(X)] = 0 implies that
  E_θ[(V_θ(X) − E_θ[V_θ(X)]) (θ̂(X) − E_θ[θ̂(X)])] = E_θ[V_θ(X) θ̂(X)]
    = ∫ p(X|θ) ∂_θ log p(X|θ) θ̂(X) dX
    = ∂_θ ∫ p(X|θ) θ̂(X) dX = ∂_θ θ = 1.
Cautionary note
This does not imply that a biased estimator might not have lower variance.
Fisher and Exponential Families
Fisher score
  V_θ(x) = ∂_θ log p(x; θ) = φ(x) − ∂_θ g(θ)
Fisher information
  I = Cov[V_θ(x)] = Cov[φ(x) − ∂_θ g(θ)] = ∂²_θ g(θ)
The efficiency of an estimator can be obtained directly from the log-partition function.
Outer product matrix
It is given (up to an offset) by ⟨φ(x), φ(x′)⟩. This leads to Kernel PCA ...
Priors
Problems with maximum likelihood
With not enough data, parameter estimates will be bad.
Prior to the rescue
Often we know where the solution should be. So we encode the latter by means of a prior p(θ).
Normal prior
Simply set p(θ) ∝ exp(−(1/(2σ²)) ‖θ‖²).
Posterior
  p(θ|X) ∝ exp(Σ_{i=1}^m ⟨φ(xi), θ⟩ − m g(θ) − (1/(2σ²)) ‖θ‖²)
Tossing a die with priors
Conjugate Priors
Problem with the normal prior
The posterior looks different from the likelihood, so many of the maximum likelihood optimization algorithms may not work ...
Idea
What if we had a prior which looked like additional data, that is p(θ|X) ∼ p(X|θ)? For exponential families this is easy. Simply set
  p(θ|a) ∝ exp(⟨θ, m0 a⟩ − m0 g(θ))
Posterior
  p(θ|X) ∝ exp((m + m0) [⟨(m µ + m0 a)/(m + m0), θ⟩ − g(θ)])
Example: Multinomial Distribution
Laplace rule
A conjugate prior with parameters (a, m0) in the multinomial family could be to set a = (1/n, 1/n, ..., 1/n). This is often also called the Dirichlet prior. It leads to
  p(x) = (#x + m0/n) / (m + m0)   instead of   p(x) = #x / m
Example
  Outcome         1     2     3     4     5     6
  Counts          3     6     2     1     4     4
  MLE             0.15  0.30  0.10  0.05  0.20  0.20
  MAP (m0 = 6)    0.15  0.27  0.12  0.08  0.19  0.19
  MAP (m0 = 100)  0.16  0.19  0.16  0.15  0.17  0.17
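The table can be reproduced in a few lines (my own sketch): with a = (1/n, ..., 1/n) the conjugate-prior estimate is just count smoothing by m0/n pseudo-counts per outcome.

```python
import numpy as np

def map_estimate(counts, m0):
    # conjugate (Dirichlet) prior with a = (1/n, ..., 1/n): add m0/n pseudo-counts
    n = len(counts)
    return (counts + m0 / n) / (counts.sum() + m0)

counts = np.array([3, 6, 2, 1, 4, 4])
print(np.round(map_estimate(counts, 0), 2))     # MLE: relative frequencies
print(np.round(map_estimate(counts, 6), 2))     # MAP with m0 = 6
print(np.round(map_estimate(counts, 100), 2))   # MAP with m0 = 100
```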
Optimization Problems
Maximum likelihood
  minimize_θ  Σ_{i=1}^m [g(θ) − ⟨φ(xi), θ⟩]   ⇒   ∂_θ g(θ) = (1/m) Σ_{i=1}^m φ(xi)
Normal prior
  minimize_θ  Σ_{i=1}^m [g(θ) − ⟨φ(xi), θ⟩] + (1/(2σ²)) ‖θ‖²
Conjugate prior
  minimize_θ  Σ_{i=1}^m [g(θ) − ⟨φ(xi), θ⟩] + m0 g(θ) − m0 ⟨µ̃, θ⟩
equivalently solve
  ∂_θ g(θ) = (1/(m + m0)) Σ_{i=1}^m φ(xi) + (m0/(m + m0)) µ̃
Summary
Model
  - Log-partition function
  - Expectations and derivatives
  - Maximum entropy formulation
  - A zoo of densities
Estimation
  - Maximum likelihood estimator
  - Fisher information matrix and Cramer Rao theorem
  - Normal priors and conjugate priors
  - Fisher information and log-partition function
Exponential Families and Kernels Lecture 2
Alexander J. Smola, Alex.Smola@nicta.com.au
Machine Learning Program, National ICT Australia
RSISE, The Australian National University
Outline
Exponential Families
  - Maximum likelihood and Fisher information
  - Priors (conjugate and normal)
Conditioning and Feature Spaces
  - Conditional distributions and inner products
  - Clifford Hammersley decomposition
Applications
  - Classification and novelty detection
  - Regression
Applications
  - Conditional random fields
  - Intractable models and semidefinite approximations
Lecture 2
Clifford Hammersley Theorem and Graphical Models
  - Decomposition results
  - Key connection
Conditional Distributions
  - Log-partition function
  - Expectations and derivatives
  - Inner product formulation and kernels
  - Gaussian processes
Applications
  - Classification and regression
  - Conditional random fields
  - Spatial Poisson models
Graphical Model
Conditional independence
x, x′ are conditionally independent given c if
  p(x, x′|c) = p(x|c) p(x′|c)
Distributions can be simplified greatly by conditional independence assumptions.
Markov network
Given a graph G(V, E) with vertices V and edges E, associate a random variable x ∈ R^|V| with G. Subsets of random variables x_S, x_S′ are conditionally independent given x_C if removing the vertices C from G(V, E) decomposes the graph into disjoint subsets containing S and S′.
Conditional Independence
Cliques
Definition
  - A subset of the graph which is fully connected.
  - Maximal cliques (they define the graph).
Advantage
  - Easy to specify dependencies between variables.
  - Use graph algorithms for inference.
Hammersley Clifford Theorem
Problem
Specify p(x) with conditional independence properties.
Theorem
  p(x) = (1/Z) exp(Σ_{c∈C} ψ_c(x_c))
whenever p(x) is nonzero on the entire domain.
Application
Apply the decomposition to exponential families where p(x) = exp(⟨φ(x), θ⟩ − g(θ)).
Corollary
The sufficient statistics φ(x) decompose according to
  φ(x) = (..., φ_c(x_c), ...)   ⇒   ⟨φ(x), φ(x′)⟩ = Σ_c ⟨φ_c(x_c), φ_c(x′_c)⟩
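A minimal sketch of the corollary (mine; the clique structure and the per-clique RBF kernels are illustrative assumptions): the joint kernel is just the sum of per-clique kernels evaluated on the corresponding sub-vectors.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # a simple per-clique base kernel (illustrative choice)
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b))**2))

# cliques given as index sets into x, e.g. a chain x1 - x2 - x3
cliques = [(0, 1), (1, 2)]

def k_decomposed(x, xp):
    # k(x, x') = sum_c k_c(x_c, x'_c)
    return sum(rbf(np.take(x, c), np.take(xp, c)) for c in cliques)

x  = np.array([0.3, -1.2, 0.7])
xp = np.array([0.1,  0.4, 0.5])
print(k_decomposed(x, xp))
```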
Proof
Step 1: Obtain a linear functional
Combining the exponential setting with the CH theorem:
  ⟨Φ(x), θ⟩ = Σ_{c∈C} ψ_c(x_c) − log Z + g(θ)   for all x, θ.
Step 2: Orthonormal basis in θ
Pick an orthonormal basis and swallow Z, g. This gives
  ⟨Φ(x), e_i⟩ = Σ_{c∈C} η^i_c(x_c)   for some η^i_c(x_c).
Step 3: Reconstruct sufficient statistics
  Φ_c(x_c) := (η¹_c(x_c), η²_c(x_c), ...)
which allows us to compute
  ⟨Φ(x), θ⟩ = Σ_{c∈C} Σ_i θ_i Φ^i_c(x_c).
Example: Normal Distributions
Sufficient statistics
Recall that for normal distributions φ(x) = (x, xx⊤).
Clifford Hammersley application
  - φ(x) must decompose into subsets involving only variables from each maximal clique.
  - The linear term x is OK by default.
  - The only nonzero terms coupling x_i x_j are those corresponding to an edge in the graph G(V, E).
Inverse covariance matrix
The natural parameter aligned with xx⊤ is the inverse covariance matrix. Its sparsity mirrors G(V, E). Hence a sparse inverse kernel matrix corresponds to a graphical model!
Example: Normal Distributions
Density
  p(x|θ) = exp(Σ_{i=1}^n x_i θ_{1i} + Σ_{i,j=1}^n x_i x_j θ_{2ij} − g(θ))
Here θ2 is (up to a factor of −1/2) the inverse covariance matrix Σ⁻¹. We have (Σ⁻¹)_{ij} ≠ 0 only if (i, j) share an edge.
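To see the sparsity claim numerically, here is a small sketch of mine (the chain and its edge weights are made up): a Gaussian Markov chain x1 - x2 - ... - x5 has a tridiagonal inverse covariance, i.e. its zero pattern mirrors the edges of the graph, while the covariance itself is dense.

```python
import numpy as np

n = 5
# precision matrix (inverse covariance) of a Gaussian Markov chain x1 - ... - x5:
# nonzeros only on the diagonal and for neighbouring pairs
P = 2.0 * np.eye(n)
for i in range(n - 1):
    P[i, i + 1] = P[i + 1, i] = -0.8       # one entry per edge of the chain

Sigma = np.linalg.inv(P)

np.set_printoptions(precision=2, suppress=True)
print(Sigma)                 # dense: every pair of variables is correlated
print(np.linalg.inv(Sigma))  # sparse (tridiagonal): zeros for non-edges
```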
Conditional Distributions
Conditional density
  p(x|θ) = exp(⟨φ(x), θ⟩ − g(θ))
  p(y|x, θ) = exp(⟨φ(x, y), θ⟩ − g(θ|x))
Log-partition function
  g(θ|x) = log ∫_Y exp(⟨φ(x, y), θ⟩) dy
Sufficient criterion
p(x, y|θ) is a member of the exponential family itself.
Key idea
Avoid computing φ(x, y) directly; only evaluate inner products via
  k((x, y), (x′, y′)) := ⟨φ(x, y), φ(x′, y′)⟩
Conditional Distributions
Maximum a posteriori estimation
  −log p(θ|X) = Σ_{i=1}^m −⟨φ(xi), θ⟩ + m g(θ) + (1/(2σ²)) ‖θ‖² + c
  −log p(θ|X, Y) = Σ_{i=1}^m [−⟨φ(xi, yi), θ⟩ + g(θ|xi)] + (1/(2σ²)) ‖θ‖² + c
Solving the problem
  - The problem is strictly convex in θ.
  - Direct solution is impossible if we cannot compute φ(x, y) directly.
  - Solve the convex problem in expansion coefficients.
  - Expand θ in a linear combination of φ(xi, y).
Joint Feature Map
Representer Theorem
Objective function
  −log p(θ|X, Y) = Σ_{i=1}^m [−⟨φ(xi, yi), θ⟩ + g(θ|xi)] + (1/(2σ²)) ‖θ‖² + c
Decomposition
Decompose θ into θ = θ∥ + θ⊥ where
  θ∥ ∈ span{φ(xi, y) : 1 ≤ i ≤ m and y ∈ Y}
Both g(θ|xi) and ⟨φ(xi, yi), θ⟩ are independent of θ⊥.
Theorem
−log p(θ|X, Y) is minimized for θ⊥ = 0, hence θ = θ∥.
Consequence
If span{φ(xi, y) : 1 ≤ i ≤ m and y ∈ Y} is finite dimensional, we have a parametric optimization problem.
Using It
Expansion
  θ = Σ_{i=1}^m Σ_{y∈Y} α_{iy} φ(xi, y)
Inner product
  ⟨φ(x, y), θ⟩ = Σ_{i=1}^m Σ_{ȳ∈Y} α_{iȳ} k((x, y), (xi, ȳ))
Norm
  ‖θ‖² = Σ_{i,j=1}^m Σ_{y,y′∈Y} α_{iy} α_{jy′} k((xi, y), (xj, y′))
Log-partition function
  g(θ|x) = log Σ_{y∈Y} exp(⟨φ(x, y), θ⟩)
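Once θ is expanded in the joint kernel, g(θ|x) is just a log-sum-exp over the labels. A minimal sketch of mine for a finite label set (the data, kernel, and coefficients α are placeholders, and the joint kernel k(x, x′) δ_{yy′} anticipates the multiclass choice used later in the lecture):

```python
import numpy as np
from scipy.special import logsumexp

def k_joint(x, y, xp, yp, gamma=0.5):
    # k((x, y), (x', y')) = k(x, x') * delta_{y y'}
    return np.exp(-gamma * np.sum((x - xp)**2)) * (y == yp)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))         # training inputs x_1, ..., x_m (placeholder)
Y = [0, 1, 2]                       # finite label set
alpha = rng.normal(size=(4, 3))     # expansion coefficients alpha_{i y} (placeholder)

def t(x, y):
    # <phi(x, y), theta> = sum_{i, ybar} alpha_{i ybar} k((x, y), (x_i, ybar))
    return sum(alpha[i, yb] * k_joint(x, y, X[i], yb)
               for i in range(len(X)) for yb in Y)

def g_cond(x):
    # g(theta | x) = log sum_y exp(<phi(x, y), theta>)
    return logsumexp([t(x, y) for y in Y])

x_new = np.array([0.2, -0.4])
print(g_cond(x_new))
print([np.exp(t(x_new, y) - g_cond(x_new)) for y in Y])   # p(y | x, theta), sums to 1
```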
The Gaussian Process Link
Normal prior on θ ...
  θ ∼ N(0, σ²1)
... yields a normal prior on t(x, y) = ⟨φ(x, y), θ⟩, since the projection of a Gaussian is Gaussian.
The mean vanishes:
  E_θ[t(x, y)] = ⟨φ(x, y), E_θ[θ]⟩ = 0
The covariance yields
  Cov[t(x, y), t(x′, y′)] = E_θ[⟨φ(x, y), θ⟩⟨θ, φ(x′, y′)⟩] = σ² ⟨φ(x, y), φ(x′, y′)⟩ =: k((x, y), (x′, y′))
... so we have a Gaussian process on x ... with kernel k((x, y), (x′, y′)) = σ² ⟨φ(x, y), φ(x′, y′)⟩.
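Sample paths like the ones on the following slides can be drawn with a few lines (my own sketch; the kernel width, grid, and jitter are arbitrary choices): evaluate the kernel matrix K on a grid and draw t ∼ N(0, K).

```python
import numpy as np

def rbf_kernel(X, Xp, gamma=10.0):
    return np.exp(-gamma * (X[:, None] - Xp[None, :])**2)

x = np.linspace(0, 1, 100)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability

rng = np.random.default_rng(0)
# draw a few functions t(.) from the Gaussian process prior N(0, K)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)
print(samples.shape)   # (5, 100); plot each row against x to get sample paths
```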
Linear Covariance
Laplacian Covariance
Gaussian Covariance
Polynomial (Order 3)
B3-Spline Covariance
Sample from Gaussian RBF
Sample from linear kernel
General Strategy
Choose a suitable sufficient statistic φ(x, y)
  - A conditionally multinomial distribution leads to the Gaussian process multiclass estimator: we have a distribution over n classes which depends on x.
  - A conditionally Gaussian distribution leads to Gaussian process regression: we have a normal distribution over a random variable which depends on the location. Note: we estimate mean and variance.
  - Conditionally Poisson distributions yield locally varying Poisson processes. This has no name yet ...
Solve the optimization problem
  - This is typically convex.
The bottom line
Instead of choosing k(x, x′), choose k((x, y), (x′, y′)).
Example: GP Classification
Sufficient statistic
We pick φ(x, y) = φ(x) ⊗ e_y, that is
  k((x, y), (x′, y′)) = k(x, x′) δ_{yy′}   where y, y′ ∈ {1, ..., n}
Kernel expansion
By the representer theorem we get
  θ = Σ_{i=1}^m Σ_y α_{iy} φ(xi, y)
Optimization problem
Big mess ... but convex.
A Toy Example
Noisy Data
Summary
Clifford Hammersley Theorem and Graphical Models
  - Decomposition results
  - Key connection
  - Normal distribution
Conditional Distributions
  - Log-partition function
  - Expectations and derivatives
  - Inner product formulation and kernels
  - Gaussian processes
Applications
  - Generalized kernel trick
  - Conditioning gives existing estimation methods back
Exponential Families and Kernels Lecture 3
Alexander J. Smola, Alex.Smola@nicta.com.au
Machine Learning Program, National ICT Australia
RSISE, The Australian National University
Outline
Exponential Families
  - Maximum likelihood and Fisher information
  - Priors (conjugate and normal)
Conditioning and Feature Spaces
  - Conditional distributions and inner products
  - Clifford Hammersley decomposition
Applications
  - Classification and novelty detection
  - Regression
Applications
  - Conditional random fields
  - Intractable models and semidefinite approximations
Lecture 3
Novelty Detection
  - Density estimation
  - Thresholding and likelihood ratio
Classification
  - Log-partition function
  - Optimization problem
  - Examples
  - Clustering and transduction
Regression
  - Conditional normal distribution
  - Estimating the covariance
  - Heteroscedastic estimators
Density Estimation
Maximum a posteriori
  minimize_θ  Σ_{i=1}^m [g(θ) − ⟨φ(xi), θ⟩] + (1/(2σ²)) ‖θ‖²
Advantages
  - Convex optimization problem
  - Concentration of measure
Problems
  - The normalization g(θ) may be painful to compute.
  - For novelty detection we need no normalized p(x|θ).
  - No need to perform particularly well in high density regions.
Novelty Detection
Optimization problem
MAP:
  Σ_{i=1}^m −log p(xi|θ) + (1/(2σ²)) ‖θ‖²
Novelty detection:
  Σ_{i=1}^m max(−log [p(xi|θ) / exp(ρ − g(θ))], 0) + (1/2) ‖θ‖²
  = Σ_{i=1}^m max(ρ − ⟨φ(xi), θ⟩, 0) + (1/2) ‖θ‖²
Advantages
  - No normalization g(θ) needed
  - No need to perform particularly well in high density regions (the estimator focuses on low-density regions)
  - Quadratic program
Geometric Interpretation
Idea
Find a hyperplane that has maximum distance from the origin, yet is still closer to the origin than the observations.
Hard margin
  minimize (1/2) ‖θ‖²
  subject to ⟨θ, xi⟩ ≥ 1
Soft margin
  minimize (1/2) ‖θ‖² + C Σ_{i=1}^m ξi
  subject to ⟨θ, xi⟩ ≥ 1 − ξi and ξi ≥ 0
Dual Optimization Problem
Primal problem
  minimize (1/2) ‖θ‖² + C Σ_{i=1}^m ξi
  subject to ⟨θ, xi⟩ − 1 + ξi ≥ 0 and ξi ≥ 0
Lagrange function
We construct a Lagrange function L by subtracting the constraints, multiplied by Lagrange multipliers (αi and ηi), from the primal objective function. L has a saddle point at the optimal solution.
  L = (1/2) ‖θ‖² + C Σ_{i=1}^m ξi − Σ_{i=1}^m αi (⟨θ, xi⟩ − 1 + ξi) − Σ_{i=1}^m ηi ξi
where αi, ηi ≥ 0. For instance, if ξi < 0 we could increase L without bound via ηi.
Dual Problem, Part II
Optimality conditions
  ∂θ L = θ − Σ_{i=1}^m αi xi = 0   ⇒   θ = Σ_{i=1}^m αi xi
  ∂ξi L = C − αi − ηi = 0   ⇒   αi ∈ [0, C]
Now we substitute the two optimality conditions back into L and eliminate the primal variables.
Dual problem
  minimize (1/2) Σ_{i,j=1}^m αi αj ⟨xi, xj⟩ − Σ_{i=1}^m αi
  subject to αi ∈ [0, C]
Convexity ensures uniqueness of the optimum.
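Because the dual has only box constraints, even plain projected gradient descent solves it. The sketch below is mine (toy data, arbitrary step count; a serious implementation would call an off-the-shelf QP solver): it minimizes ½ α⊤Kα − 1⊤α over [0, C]^m and recovers θ = Σi αi xi.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, size=(50, 2))    # toy observations, well away from the origin
K = X @ X.T                              # linear kernel <x_i, x_j>
C = 0.5

alpha = np.zeros(len(X))
eta = 1.0 / np.linalg.norm(K, 2)         # step size from the largest eigenvalue of K
for _ in range(2000):
    grad = K @ alpha - 1.0               # gradient of 1/2 a'Ka - 1'a
    alpha = np.clip(alpha - eta * grad, 0.0, C)   # project onto the box [0, C]

theta = X.T @ alpha                      # theta = sum_i alpha_i x_i
# most points end up on the correct side of the hyperplane <theta, x> = 1
print(theta, (X @ theta >= 1 - 1e-3).mean())
```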
The ν-Trick
Problem
  - Depending on how we choose C, the number of points selected as lying on the "wrong" side of the hyperplane H := {x | ⟨θ, x⟩ = 1} will vary.
  - We would like to specify a certain fraction ν beforehand.
  - We want to make the setting more adaptive to the data.
Solution
Use an adaptive hyperplane that separates the data from the origin, i.e. find
  H := {x | ⟨θ, x⟩ = ρ}
where the threshold ρ is adaptive.
The ν-Trick
Primal problem
  minimize (1/2) ‖θ‖² + Σ_{i=1}^m ξi − mνρ
  subject to ⟨θ, xi⟩ − ρ + ξi ≥ 0 and ξi ≥ 0
Dual problem
  minimize (1/2) Σ_{i,j=1}^m αi αj ⟨xi, xj⟩
  subject to αi ∈ [0, 1] and Σ_{i=1}^m αi = νm
Difference to before
The Σ_i αi term vanishes from the objective function, but we get one more constraint, namely Σ_i αi = νm.
The ν-Property
Optimization problem
  minimize (1/2) ‖θ‖² + Σ_{i=1}^m ξi − mνρ
  subject to ⟨θ, xi⟩ − ρ + ξi ≥ 0 and ξi ≥ 0
Theorem
  - At most a fraction ν of the points will lie on the "wrong" side of the margin, i.e. yi f(xi) < 1.
  - At most a fraction 1 − ν of the points will lie on the "right" side of the margin, i.e. yi f(xi) > 1.
  - In the limit, those fractions become exact.
Proof idea
At the optimum, shift ρ slightly: only the active constraints will have an influence on the objective function.
Classification
Maximum a posteriori estimation
  −log p(θ|X, Y) = Σ_{i=1}^m [−⟨φ(xi, yi), θ⟩ + g(θ|xi)] + (1/(2σ²)) ‖θ‖² + c
Domain
  - Finite set of observations Y = {1, ..., n}
  - The log-partition function g(θ|x) is easy to compute.
  - Optional centering: φ(x, y) → φ(x, y) + c leaves p(y|x, θ) unchanged (it offsets both terms).
Gaussian process connection
The inner product t(x, y) = ⟨φ(x, y), θ⟩ is drawn from a Gaussian process, so this is the same setting as in the literature.
Classification
Sufficient statistic
We pick φ(x, y) = φ(x) ⊗ e_y, that is
  k((x, y), (x′, y′)) = k(x, x′) δ_{yy′}   where y, y′ ∈ {1, ..., n}
Kernel expansion
By the representer theorem we get
  θ = Σ_{i=1}^m Σ_y α_{iy} φ(xi, y)
Optimization problem
Big mess ... but convex. Solve by Newton or block-Jacobi methods.
A Toy Example
Noisy Data
SVM Connection
Problems with GP classification
  - We optimize even where classification is already good.
  - Only the sign of the classification is needed.
  - Only the "strongest" wrong class matters.
  - We want to classify with a margin.
Optimization problem
MAP:
  Σ_{i=1}^m −log p(yi|xi, θ) + (1/(2σ²)) ‖θ‖²
SVM:
  Σ_{i=1}^m max(ρ − log [p(yi|xi, θ) / max_{y≠yi} p(y|xi, θ)], 0) + (1/2) ‖θ‖²
  = Σ_{i=1}^m max(ρ − ⟨φ(xi, yi), θ⟩ + max_{y≠yi} ⟨φ(xi, y), θ⟩, 0) + (1/2) ‖θ‖²
Binary Classification
Sufficient statistics
  - The offset in φ(x, y) can be arbitrary.
  - Pick it such that φ(x, y) = y φ(x) where y ∈ {±1}.
  - The kernel matrix becomes K_ij = k((xi, yi), (xj, yj)) = yi yj k(xi, xj).
Optimization problem
The max over the other class becomes
  max_{y≠yi} ⟨φ(xi, y), θ⟩ = −yi ⟨φ(xi), θ⟩
Overall problem
  Σ_{i=1}^m max(ρ − 2 yi ⟨φ(xi), θ⟩, 0) + (1/2) ‖θ‖²
Geometrical Interpretation
Minimize (1/2) ‖θ‖² subject to yi (⟨θ, xi⟩ + b) ≥ 1 for all i.
Optimization Problem
Linear function
  f(x) = ⟨θ, x⟩ + b
Mathematical programming setting
If we require error-free classification with a margin, i.e. y f(x) ≥ 1, we obtain:
  minimize (1/2) ‖θ‖²
  subject to yi (⟨θ, xi⟩ + b) − 1 ≥ 0 for all 1 ≤ i ≤ m
Result
The dual of the optimization problem is a simple quadratic program (more later ...).
Connection back to conditional probabilities
The offset b takes care of a bias towards one of the classes.
Regression
Maximum a posteriori estimation
  −log p(θ|X, Y) = Σ_{i=1}^m [−⟨φ(xi, yi), θ⟩ + g(θ|xi)] + (1/(2σ²)) ‖θ‖² + c
Domain
  - Continuous domain of observations Y = R.
  - The log-partition function g(θ|x) is easy to compute in closed form, as for a normal distribution.
Gaussian process connection
The inner product t(x, y) = ⟨φ(x, y), θ⟩ is drawn from a Gaussian process; in particular also the rescaled mean and covariance.
Regression
Sufficient statistic (standard model)
We pick φ(x, y) = (y φ(x), y²), that is
  k((x, y), (x′, y′)) = k(x, x′) y y′ + y² y′²   where y, y′ ∈ R
Traditionally the variance is fixed, that is θ2 = const.
Sufficient statistic (fancy model)
We pick φ(x, y) = (y φ1(x), y² φ2(x)), that is
  k((x, y), (x′, y′)) = k1(x, x′) y y′ + k2(x, x′) y² y′²   where y, y′ ∈ R
We estimate mean and variance simultaneously.
Kernel expansion
By the representer theorem (and more algebra) we get
  θ = (Σ_{i=1}^m αi1 φ1(xi), Σ_{i=1}^m αi2 φ2(xi))
Training Data
Mean
  k(x)⊤ (K + σ²1)⁻¹ y
Variance
  k(x, x) + σ² − k(x)⊤ (K + σ²1)⁻¹ k(x)
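These two formulas translate directly into code. A minimal sketch of mine (the training data, kernel width, and noise level are invented) computing the predictive mean and variance at a few test points:

```python
import numpy as np

def rbf(A, B, gamma=10.0):
    return np.exp(-gamma * (A[:, None] - B[None, :])**2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 30)                       # training inputs (toy data)
y = np.sin(4 * X) + 0.1 * rng.normal(size=30)   # noisy targets
sigma2 = 0.01                                   # noise variance

K = rbf(X, X)
L = np.linalg.cholesky(K + sigma2 * np.eye(len(X)))   # factor of (K + sigma^2 1)

x_star = np.linspace(0, 1, 5)
k_star = rbf(X, x_star)                         # k(x) for each test point, shape (30, 5)

alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean = k_star.T @ alpha                         # k(x)' (K + sigma^2 1)^{-1} y
v = np.linalg.solve(L, k_star)
var = 1.0 + sigma2 - np.sum(v * v, axis=0)      # k(x, x) = 1 for the RBF kernel
print(mean)
print(var)
```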
Putting everything together . . .
Another Example
Adaptive Variance Method
Optimization problem
  minimize_α  Σ_{i=1}^m [ −(1/4) (Σ_{j=1}^m α1j k1(xi, xj))⊤ (Σ_{j=1}^m α2j k2(xi, xj))⁻¹ (Σ_{j=1}^m α1j k1(xi, xj))
                          − (1/2) log det(−2 Σ_{j=1}^m α2j k2(xi, xj))
                          − Σ_{j=1}^m (yi⊤ α1j k1(xi, xj) + (yi⊤ α2j yi) k2(xi, xj)) ]
              + (1/(2σ²)) Σ_{i,j} [ α1i⊤ α1j k1(xi, xj) + tr(α2i α2j⊤) k2(xi, xj) ]
  subject to  Σ_{i=1}^m α2i k2(xi, x) ≺ 0
Properties of the problem
  - The problem is convex.
  - The log-determinant from the normalization of the Gaussian acts as a barrier function.
  - We get a semidefinite program.
Heteroscedastic Regression
Natural Parameters
Lecture 3
Novelty Detection
  - Density estimation
  - Thresholding and likelihood ratio
Classification
  - Log-partition function
  - Optimization problem
  - Examples
  - Clustering and transduction
Regression
  - Conditional normal distribution
  - Estimating the covariance
  - Heteroscedastic estimators
Exponential Families and Kernels Lecture 4
Alexander J. Smola, Alex.Smola@nicta.com.au
Machine Learning Program, National ICT Australia
RSISE, The Australian National University
Outline
Exponential Families
  - Maximum likelihood and Fisher information
  - Priors (conjugate and normal)
Conditioning and Feature Spaces
  - Conditional distributions and inner products
  - Clifford Hammersley decomposition
Applications
  - Classification and novelty detection
  - Regression
Applications
  - Conditional random fields
  - Intractable models and semidefinite approximations
Lecture 4
Conditional Random Fields
  - Structured random variables
  - Subspace representer theorem and decomposition
  - Derivatives and conditional expectations
Inference and Message Passing
  - Dynamic programming
  - Message passing and junction trees
  - Intractable cases
Semidefinite Relaxations
  - Marginal polytopes
  - Fenchel duality and entropy
  - Relaxations for conditional random fields
Hammersley Clifford Corollary
Decomposition
The sufficient statistics φ(x) decompose according to
  φ(x) = (..., φ_c(x_c), ...)
Consequently we can write the kernel via
  k(x, x′) = ⟨φ(x), φ(x′)⟩ = Σ_c ⟨φ_c(x_c), φ_c(x′_c)⟩ = Σ_c k_c(x_c, x′_c)
Conditional Random Fields
Key points
  - Cliques are (x_t, y_t), (x_t, x_{t+1}), and (y_t, y_{t+1}).
  - We can drop the cliques in (x_t, x_{t+1}): they do not affect p(y|x, θ):
      p(y|x, θ) = exp(Σ_t [⟨φ_xy(x_t, y_t), θ_{xy,t}⟩ + ⟨φ_yy(y_t, y_{t+1}), θ_{yy,t}⟩ + ⟨φ_xx(x_t, x_{t+1}), θ_{xx,t}⟩] − g(θ|x))
Computational Issues
Key points
  - Compute g(θ|x) via dynamic programming.
  - Assume stationarity of the model, that is, θ_c does not depend on the position of the clique.
Dynamic programming
  g(θ|x) = log Σ_{y1,...,yT} ∏_t exp(⟨φ_xy(x_t, y_t), θ_xy⟩ + ⟨φ_yy(y_t, y_{t+1}), θ_yy⟩)
         = log Σ_{y1} Σ_{y2} M_1(y_1, y_2) Σ_{y3} M_2(y_2, y_3) ... Σ_{yT} M_{T−1}(y_{T−1}, y_T)
where M_t(y_t, y_{t+1}) := exp(⟨φ_xy(x_t, y_t), θ_xy⟩ + ⟨φ_yy(y_t, y_{t+1}), θ_yy⟩).
So we can compute g(θ|x), p(y_t|x, θ) and p(y_t, y_{t+1}|x, θ) via dynamic programming.
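The nested sums are just matrix products in log space. A minimal sketch of mine (the clique potentials M_t are random placeholders standing in for the φ_xy and φ_yy terms) computes g(θ|x) for a chain by a forward sweep and verifies it against brute-force enumeration:

```python
import numpy as np
from itertools import product
from scipy.special import logsumexp

rng = np.random.default_rng(0)
T, S = 6, 3                              # chain length and number of label states (toy sizes)
# log M_t(y_t, y_{t+1}) = <phi_xy(x_t, y_t), theta_xy> + <phi_yy(y_t, y_{t+1}), theta_yy>
logM = rng.normal(size=(T - 1, S, S))    # placeholder clique potentials

# forward pass: log alpha_t(y_t) = log sum over y_1, ..., y_{t-1} of the partial products
log_alpha = np.zeros(S)
for t in range(T - 1):
    log_alpha = logsumexp(log_alpha[:, None] + logM[t], axis=0)

g = logsumexp(log_alpha)                 # log-partition function g(theta | x)

# brute-force check over all S^T label sequences
brute = logsumexp([sum(logM[t, y[t], y[t + 1]] for t in range(T - 1))
                   for y in product(range(S), repeat=T)])
print(np.allclose(g, brute))             # True
```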
Forward Backward Algorithm
Key idea
  - Store the sum over all y_1, ..., y_{t−1} (forward pass) and over all y_{t+1}, ..., y_T (backward pass) as intermediate values.
  - We get those values for all positions t in one sweep.
  - Extend this to message passing (when we have trees).
Minimization
Objective function
  −log p(θ|X, Y) = Σ_{i=1}^m [−⟨φ(xi, yi), θ⟩ + g(θ|xi)] + (1/(2σ²)) ‖θ‖² + c
  ∂_θ [−log p(θ|X, Y)] = Σ_{i=1}^m [−φ(xi, yi) + E[φ(xi, y)|xi]] + (1/σ²) θ
We only need E[φ_xy(x_it, y_it)|xi] and E[φ_yy(y_it, y_i(t+1))|xi].
Kernel trick
Conditional expectations of φ(x_it, y_it) cannot be computed explicitly, but inner products can:
  ⟨φ_xy(x′_t, y′_t), E[φ_xy(x_t, y_t)|x]⟩ = E[k((x′_t, y′_t), (x_t, y_t))|x]
We only need the marginals p(y_t|x, θ) and p(y_t, y_{t+1}|x, θ), which we get via dynamic programming.
Subspace Representer Theorem
Representer theorem
Solutions of the MAP problem are given by
  θ ∈ span{φ(xi, y) for all y ∈ Y and 1 ≤ i ≤ m}
Big problem
|Y| could be huge, e.g. 2^n for sequence annotation.
Solution
  - Exploit the decomposition of φ(x, y) into sufficient statistics on cliques.
  - The restriction of Y to cliques is much smaller:
      θ_c ∈ span{φ_c(x_ci, y_c) for all y_c ∈ Y_c and 1 ≤ i ≤ m}
  - Rather than 2^n label configurations we now only need 2^|c| per clique.
CRFs and HMMs
Conditional Random Field: maximize p(y|x, θ) — chain of observations x_{t−2}, ..., x_{t+2} and labels y_{t−2}, ..., y_{t+2}.
Hidden Markov Model: maximize p(x, y|θ) — the same chain of variables, modelled generatively.
Equivalence Theorem
Theorem
CRFs and HMMs yield identical probability estimates for p(y|x, θ) if the sets of functions are equally expressive.
Proof
Write out p_CRF(y|x, θ) and p_HMM(x, y|θ), and show that they differ only in the normalization. This disappears when computing p_HMM(y|x, θ).
Consequence
Differential training for current HMM implementations.
Message Passing
Idea
Extend the forward-backward idea to trees.
Algorithm
  - Given clique potentials M_ij(y_i, y_j)
  - Initialize messages µ_ij(y_j) = 1
  - Update outgoing messages by
      µ_ij(y_j) = Σ_{yi∈Yi} ∏_{k≠j} µ_ki(y_i) M_ij(y_i, y_j)
    Here (i, k) is an edge in the graph.
Theorem
The message passing algorithm converges after n iterations (n is the diameter of the graph).
Hack
Use this for graphs with loops and hope ...
Junction Trees
Stock-standard algorithms are available to transform a graph into a junction tree. Now we can use message passing ...
Junction Tree Algorithm
Idea
Messages involve the variables in the separator sets.
Algorithm
  - Given clique potentials M_c(y_c) and separator sets s
  - Initialize messages µ_{c,s}(y_s) = 1
  - Update outgoing messages by
      µ_{c,s}(y_s) = Σ_{y_c \ y_s} ∏_{s′≠s} µ_{c′,s′}(y_{s′}) M_c(y_c)
    Here s′ is a separator set connecting c with c′.
Theorem
The message passing algorithm converges after n iterations (n is the diameter of the hypergraph).
Hack
Use this for graphs with loops and hope ...
Example
Problems
Scaling
  The algorithm scales exponentially in the treewidth. Messages are of size d^|Ys|.
Convergence with loops
  Use of message passing may or may not converge. No real proof available.
Workaround
  Use a subset of the graph and solve the inference problem with this. Average over spanning trees.
Workaround
  Use sampling methods for inference.
A Better Way
Fenchel duality
Compute the dual of the log-partition function via
  g*(µ) = sup_{θ∈Θ} ⟨µ, θ⟩ − g(θ)   (Θ is a convex domain)
Entropy and expectation parameters
The maximum of the optimization problem is obtained for µ = ∂_θ g(θ). This leads to
  H = −∫ log p(x|θ) p(x|θ) dx = −⟨µ(θ), θ⟩ + g(θ) = −g*(µ)
Strong duality
Dualizing again leads to
  g(θ) = sup_{µ∈M} ⟨θ, µ⟩ + H(µ)
Semidefinite Relaxation
Optimization problem
  g(θ) = sup_{µ∈M} ⟨θ, µ⟩ + H(µ)
Here M is the set of all possible marginals (the marginal polytope).
Relaxations of M
The polytope M is convex (by duality), but it is hard to characterize (as hard as computing g(θ)). So we relax it to M̃ by imposing constraints on higher order moments, such as
  - Interval and linear inequality constraints.
  - SDP constraints on the covariance matrix.
Upper bound on H(µ)
Gaussian bound on the entropy via the covariance, G(µ). So we get
  g(θ) ≤ sup_{µ∈M̃} ⟨θ, µ⟩ + G(µ)
Application to CRFs
Optimization problem
  −log p(θ|X, Y) = Σ_{i=1}^m [−⟨φ(xi, yi), θ⟩ + g(θ|xi)] + (1/(2σ²)) ‖θ‖² + c
                 ≤ Σ_{i=1}^m sup_{µi∈M̃i} [⟨θ, µi − φ(xi, yi)⟩ + H(µi)] + (1/(2σ²)) ‖θ‖²
Technical details
  - Minimization over θ and maximization over µi can be swapped (saddle-point property of a convex-concave problem) to obtain a dual problem in θ.
  - Map from µ to moments in y|x via an invertible sufficient statistics map.
  - Constrained max-det problem.
Summary
Conditional Random Fields
  - Structured random variables
  - Subspace representer theorem and decomposition
  - Derivatives and conditional expectations
Inference and Message Passing
  - Dynamic programming
  - Message passing and junction trees
  - Intractable cases
Semidefinite Relaxations
  - Marginal polytopes
  - Fenchel duality and entropy
  - Relaxations for conditional random fields
Shameless Plugs