Exponential Families and Kernels Lecture 1
Alexander J. Smola, Alex.Smola@nicta.com.au
Machine Learning Program, National ICT Australia
RSISE, The Australian National University
Outline
Exponential Families
  - Maximum likelihood and Fisher information
  - Priors (conjugate and normal)
Conditioning and Feature Spaces
  - Conditional distributions and inner products
  - Clifford Hammersley decomposition
Applications
  - Classification and novelty detection
  - Regression
Applications
  - Conditional random fields
  - Intractable models and semidefinite approximations
Lecture 1
Model
  - Log-partition function
  - Expectations and derivatives
  - Maximum entropy formulation
Examples
  - Normal distribution
  - Discrete events
  - Laplace distribution
  - Poisson distribution
  - Beta distribution
Estimation
  - Maximum likelihood estimator
  - Fisher information matrix and Cramer Rao theorem
  - Normal priors and conjugate priors
The Exponential Family
Definition
A family of probability distributions which satisfy
  p(x; θ) = exp(⟨φ(x), θ⟩ − g(θ))
Details
  - φ(x) is called the sufficient statistics of x.
  - X is the domain from which x is drawn (x ∈ X).
  - g(θ) is the log-partition function; it ensures that the distribution integrates out to 1:
      g(θ) = log ∫_X exp(⟨φ(x), θ⟩) dx
Example: Binomial Distribution
Tossing coins
With probability p we see heads and with probability 1 − p we see tails. So we have
  p(x) = p^x (1 − p)^(1−x)   where x ∈ {0, 1} =: X
Massaging the math
  p(x) = exp(log p(x)) = exp(x log p + (1 − x) log(1 − p))
       = exp(⟨(x, 1 − x), (log p, log(1 − p))⟩)
with sufficient statistics φ(x) = (x, 1 − x) and natural parameter θ = (log p, log(1 − p)).
The normalization
Once we relax the restriction and allow arbitrary θ ∈ R², we need g(θ), which yields
  g(θ) = log(e^θ1 + e^θ2)
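As a quick numerical illustration (my own sketch, not part of the slides), the following Python snippet maps p to the natural parameter θ = (log p, log(1 − p)), computes g(θ) = log(e^θ1 + e^θ2), and checks that exp(⟨φ(x), θ⟩ − g(θ)) reproduces p^x (1 − p)^(1−x):

```python
import numpy as np

def phi(x):
    # sufficient statistics of the coin toss: (x, 1 - x)
    return np.array([x, 1 - x])

def g(theta):
    # log-partition function once theta ranges over all of R^2
    return np.log(np.exp(theta).sum())

p = 0.3
theta = np.array([np.log(p), np.log(1 - p)])   # natural parameter

for x in (0, 1):
    exp_family = np.exp(phi(x) @ theta - g(theta))
    direct = p**x * (1 - p)**(1 - x)
    print(x, exp_family, direct)               # the two columns agree
```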
Example: Laplace Distribution
Atomic decay
At any time, with probability θ dx an atom will decay in the time interval [x, x + dx] if it still exists. Consulting your physics book tells us that this gives the density
  p(x) = θ exp(−θx)   where x ∈ [0, ∞) =: X
Massaging the math
  p(x) = exp(⟨−x, θ⟩ − (− log θ))
with φ(x) = −x and g(θ) = − log θ.
Example: Normal Distribution
Engineer's favorite
  p(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))   where x ∈ R =: X
Massaging the math
  p(x) = exp(−x²/(2σ²) + (µ/σ²) x − µ²/(2σ²) − ½ log(2πσ²))
       = exp(⟨(x, x²), θ⟩ − g(θ))
with φ(x) = (x, x²) and g(θ) = µ²/(2σ²) + ½ log(2πσ²).
Finally we need to solve (µ, σ²) for θ. Tedious algebra yields θ1 := µσ⁻² and θ2 := −½σ⁻². We have
  g(θ) = −θ1²/(4θ2) + ½ log 2π − ½ log(−2θ2)
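A small sanity check (my own sketch, not from the slides): convert (µ, σ²) to the natural parameters θ1 = µ/σ², θ2 = −1/(2σ²), evaluate g(θ) as above, and compare the exponential-family density against the textbook Gaussian.

```python
import numpy as np

def gaussian_natural_params(mu, sigma2):
    return np.array([mu / sigma2, -0.5 / sigma2])

def g(theta):
    # g(theta) = -theta1^2 / (4 theta2) + 1/2 log 2 pi - 1/2 log(-2 theta2)
    return (-theta[0]**2 / (4 * theta[1])
            + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-2 * theta[1]))

mu, sigma2 = 1.5, 0.7                         # illustrative values
theta = gaussian_natural_params(mu, sigma2)

x = np.linspace(-2, 5, 7)
phi = np.stack([x, x**2], axis=1)             # sufficient statistics (x, x^2)
p_expfam = np.exp(phi @ theta - g(theta))
p_direct = np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
print(np.allclose(p_expfam, p_direct))        # True
```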
Example: Multinomial Distribution
Many discrete events
Assume that we have disjoint events [1..n] =: X, each of which may occur with a certain probability p_x.
Guessing the answer
Use the map φ : x → e_x, that is, e_x is an element of the canonical basis (0, ..., 0, 1, 0, ..., 0). This gives
  p(x) = exp(⟨e_x, θ⟩ − g(θ))
where the normalization is
  g(θ) = log Σ_{i=1}^n exp(θ_i)
Example: Poisson Distribution
Limit of the Binomial distribution
Probability of observing x ∈ N₀ events which are all independent (e.g. raindrops per square meter, crimes per day, cancer incidents):
  p(x) = exp(xθ − log Γ(x + 1) − exp(θ))
Hence φ(x) = x and g(θ) = e^θ.
Differences
  - We have a normalization dependent on x alone, namely Γ(x + 1). This leaves the rest of the theory unchanged.
  - The domain is countably infinite. Effectively this assumes the measure 1/x! on the domain N₀.
Example: Beta Distribution
Usage
Often used as a prior on Binomial distributions (it is a conjugate prior, as we will see later).
Mathematical form
  p(x) = exp(⟨(log x, log(1 − x)), (θ1, θ2)⟩ − log B(θ1 + 1, θ2 + 1))
where the domain is x ∈ [0, 1] and
  g(θ) = log B(θ1 + 1, θ2 + 1) = log Γ(θ1 + 1) + log Γ(θ2 + 1) − log Γ(θ1 + θ2 + 2)
Here B(α, β) is the Beta function.
Example: Gamma Distribution
Usage
  - Popular as a prior on coefficients
  - Obtained from the integral over waiting times in a Poisson process
Mathematical form
  p(x) = exp(⟨(log x, x), (θ1, θ2)⟩ − g(θ))
where the domain is x ∈ [0, ∞) and
  g(θ) = log Γ(θ1 + 1) − (θ1 + 1) log(−θ2)
Note that θ ∈ [0, ∞) × (−∞, 0).
Zoology of Exponential Families
Name         φ(x)                   Domain              Measure
Binomial     (x, 1 − x)             {0, 1}              discrete
Multinomial  e_x                    {1, ..., n}         discrete
Poisson      x                      N₀                  discrete
Laplace      x                      [0, ∞)              Lebesgue
Normal       (x, x²)                R                   Lebesgue
Beta         (log x, log(1 − x))    [0, 1]              Lebesgue
Gamma        (log x, x)             [0, ∞)              Lebesgue
Wishart      (log |X|, X)           X ≻ 0               Lebesgue
Dirichlet    log x                  x ∈ Rⁿ₊, ‖x‖₁ = 1   Lebesgue
Recall
Definition
A family of probability distributions which satisfy
  p(x; θ) = exp(⟨φ(x), θ⟩ − g(θ))
Details
  - φ(x) is called the sufficient statistics of x.
  - X is the domain from which x is drawn (x ∈ X).
  - g(θ) is the log-partition function; it ensures that the distribution integrates out to 1:
      g(θ) = log ∫_X exp(⟨φ(x), θ⟩) dx
Benefits: Log-partition function is nice
g(θ) generates cumulants:
  g(θ) = log ∫ exp(⟨φ(x), θ⟩) dx
Taking the derivative with respect to θ we see that
  ∂_θ g(θ) = ∫ φ(x) exp(⟨φ(x), θ⟩) dx / ∫ exp(⟨φ(x), θ⟩) dx = E_{x∼p(x;θ)}[φ(x)]
  ∂²_θ g(θ) = Cov_{x∼p(x;θ)}[φ(x)]
... and so on for higher order cumulants ...
Corollary: g(θ) is convex.
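A quick numerical check of the first identity for the multinomial family, where g(θ) = log Σ_i e^{θ_i} (my own illustration; the values of θ and the finite-difference step are arbitrary): the gradient of g is the softmax probability vector, which is exactly E[φ(x)] = E[e_x].

```python
import numpy as np

def g(theta):
    # log-partition function of the multinomial family
    return np.log(np.exp(theta).sum())

theta = np.array([0.2, -1.0, 0.5, 1.3])
probs = np.exp(theta - g(theta))            # p(x = i) = exp(theta_i - g(theta))

# finite-difference gradient of g
h = 1e-6
grad = np.array([(g(theta + h * e) - g(theta - h * e)) / (2 * h)
                 for e in np.eye(len(theta))])

print(np.allclose(grad, probs, atol=1e-5))  # True: grad g(theta) = E[phi(x)]
```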
Benefits: Simple Estimation
Likelihood of a set
Given X := {x1, ..., xm} we get
  p(X; θ) = ∏_{i=1}^m p(xi; θ) = exp(Σ_{i=1}^m ⟨φ(xi), θ⟩ − m g(θ))
Maximum likelihood
We want to minimize the negative log-likelihood, i.e.
  minimize_θ  g(θ) − (1/m) Σ_{i=1}^m ⟨φ(xi), θ⟩
  ⇒  E_{x∼p(x;θ)}[φ(x)] = ∂_θ g(θ) = (1/m) Σ_{i=1}^m φ(xi) =: µ
Solving the maximum likelihood problem is easy.
Application: Laplace distribution
Estimate the decay constant of an atom: we use the exponential family notation where
  p(x; θ) = exp(⟨−x, θ⟩ − (− log θ))
Computing µ
Since φ(x) = −x, all we need to do is average over all observed decay times.
Solving for maximum likelihood
The maximum likelihood condition yields
  µ = ∂_θ g(θ) = ∂_θ(− log θ) = −1/θ
This leads to θ = −1/µ.
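In code the estimator is a single line. The sketch below is mine, with a made-up true decay rate: draw decay times, average the sufficient statistic φ(x) = −x, and solve θ = −1/µ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.5                                           # assumed decay rate, for illustration
x = rng.exponential(scale=1.0 / theta_true, size=10_000)   # simulated decay times

mu = np.mean(-x)           # empirical mean of the sufficient statistic phi(x) = -x
theta_hat = -1.0 / mu      # maximum likelihood estimate
print(theta_hat)           # close to 2.5
```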
Benefits: Maximum Entropy Estimate
Entropy
Basically it is the number of bits needed to encode a random variable. It is defined as
  H(p) = −∫ p(x) log p(x) dx   where we set 0 log 0 := 0
Maximum entropy density
The density p(x) satisfying E[φ(x)] ≥ η with maximum entropy is exp(⟨φ(x), θ⟩ − g(θ)).
Corollary: The most vague density with a given variance is the Gaussian distribution.
Corollary: The most vague density with a given mean is the Laplace distribution.
Using it
Observe
Data x1, ..., xm drawn from the distribution p(x|θ).
Compute likelihood
  p(X|θ) = ∏_{i=1}^m exp(⟨φ(xi), θ⟩ − g(θ))
Maximize it
Take the negative log and minimize, which leads to
  ∂_θ g(θ) = (1/m) Σ_{i=1}^m φ(xi)
This can be solved analytically or (whenever this is impossible or we are lazy) by Newton's method.
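When ∂_θ g(θ) = µ has no convenient closed form, Newton's method works directly on the negative log-likelihood: the gradient is ∂_θ g(θ) − µ and the Hessian is ∂²_θ g(θ) = Cov[φ(x)]. The sketch below is mine and uses the multinomial family (where everything is actually available in closed form) just to show the mechanics; the small ridge term handles the fact that θ is only identified up to an additive constant.

```python
import numpy as np

def g(theta):
    return np.log(np.exp(theta).sum())

def newton_mle(mu, steps=20):
    """Solve grad g(theta) = mu for the multinomial family by Newton's method."""
    theta = np.zeros_like(mu)
    for _ in range(steps):
        p = np.exp(theta - g(theta))           # grad g(theta) = E[phi(x)]
        hess = np.diag(p) - np.outer(p, p)     # Hessian = Cov[phi(x)]
        theta -= np.linalg.solve(hess + 1e-8 * np.eye(len(p)), p - mu)
    return theta

counts = np.array([3, 6, 2, 1, 4, 4])
mu = counts / counts.sum()                     # empirical mean of phi(x) = e_x
theta = newton_mle(mu)
print(np.exp(theta - g(theta)))                # recovers the relative frequencies
```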
Application: Discrete Events
Simple data
Discrete random variables (e.g. tossing a die).
  Outcome        1     2     3     4     5     6
  Counts         3     6     2     1     4     4
  Probabilities  0.15  0.30  0.10  0.05  0.20  0.20
Maximum likelihood solution
Count the number of outcomes and use the relative frequency of occurrence as the estimate for the probability:
  p_emp(x) = #x / m
Problems
  - Bad idea if we have few data.
  - Bad idea if we have continuous random variables.
Tossing a die
Fisher Information and Efficiency
Fisher score
  V_θ(x) := ∂_θ log p(x; θ)
This tells us the influence of x on estimating θ. Its expected value vanishes, since
  E[∂_θ log p(X; θ)] = ∫ p(X; θ) ∂_θ log p(X; θ) dX = ∂_θ ∫ p(X; θ) dX = 0.
Fisher information matrix
It is the covariance matrix of the Fisher scores, that is
  I := Cov[V_θ(x)]
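To make the definition concrete, here is a small Monte Carlo sketch of mine for a normal distribution with known variance (the parameter values are invented): it checks that the score has zero mean and that its variance matches the analytic Fisher information 1/σ² for the mean parameter.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.5, 2.0                       # illustrative values
x = rng.normal(mu, sigma, size=200_000)

# Fisher score for the mean parameter of N(mu, sigma^2):
# V_mu(x) = d/dmu log p(x; mu) = (x - mu) / sigma^2
scores = (x - mu) / sigma**2

print(scores.mean())   # approximately 0: the expected score vanishes
print(scores.var())    # approximately 1 / sigma^2 = 0.25: the Fisher information
```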
Cramer Rao Theorem
Efficiency
Covariance of the estimator θ̂(X), rescaled by I:
  1/e := det(Cov[θ̂(X)] · Cov[∂_θ log p(X; θ)])
Theorem
The efficiency of unbiased estimators is never better (i.e. larger) than 1. Equality is achieved for MLEs.
Proof (scalar case only)
By Cauchy-Schwarz we have
  (E_θ[(V_θ(X) − E_θ[V_θ(X)]) (θ̂(X) − E_θ[θ̂(X)])])² ≤ E_θ[(V_θ(X) − E_θ[V_θ(X)])²] · E_θ[(θ̂(X) − E_θ[θ̂(X)])²] = I B
where B = Cov[θ̂(X)].
Cramer Rao Theorem
Proof
At the same time, E_θ[V_θ(X)] = 0 implies that
  E_θ[(V_θ(X) − E_θ[V_θ(X)]) (θ̂(X) − E_θ[θ̂(X)])] = E_θ[V_θ(X) θ̂(X)]
    = ∫ p(X|θ) ∂_θ log p(X|θ) θ̂(X) dX
    = ∂_θ ∫ p(X|θ) θ̂(X) dX = ∂_θ θ = 1.
Cautionary note
This does not imply that a biased estimator might not have lower variance.
Fisher and Exponential Families
Fisher score
  V_θ(x) = ∂_θ log p(x; θ) = φ(x) − ∂_θ g(θ)
Fisher information
  I = Cov[V_θ(x)] = Cov[φ(x) − ∂_θ g(θ)] = ∂²_θ g(θ)
The efficiency of an estimator can be obtained directly from the log-partition function.
Outer product matrix
It is given (up to an offset) by ⟨φ(x), φ(x′)⟩. This leads to Kernel PCA ...
Priors
Problems with maximum likelihood
With not enough data, parameter estimates will be bad.
Prior to the rescue
Often we know where the solution should be. So we encode the latter by means of a prior p(θ).
Normal prior
Simply set p(θ) ∝ exp(−(1/(2σ²)) ‖θ‖²).
Posterior
  p(θ|X) ∝ exp(Σ_{i=1}^m ⟨φ(xi), θ⟩ − m g(θ) − (1/(2σ²)) ‖θ‖²)
Tossing a die with priors
Conjugate Priors
Problem with the normal prior
The posterior looks different from the likelihood, so many of the maximum likelihood optimization algorithms may not work ...
Idea
What if we had a prior which looked like additional data, that is p(θ|X) ∼ p(X|θ)? For exponential families this is easy. Simply set
  p(θ|a) ∝ exp(⟨θ, m0 a⟩ − m0 g(θ))
Posterior
  p(θ|X) ∝ exp((m + m0) [⟨(m µ + m0 a)/(m + m0), θ⟩ − g(θ)])
Example: Multinomial Distribution
Laplace rule
A conjugate prior with parameters (a, m0) in the multinomial family could be to set a = (1/n, 1/n, ..., 1/n). This is often also called the Dirichlet prior. It leads to
  p(x) = (#x + m0/n) / (m + m0)   instead of   p(x) = #x / m
Example
  Outcome         1     2     3     4     5     6
  Counts          3     6     2     1     4     4
  MLE             0.15  0.30  0.10  0.05  0.20  0.20
  MAP (m0 = 6)    0.15  0.27  0.12  0.08  0.19  0.19
  MAP (m0 = 100)  0.16  0.19  0.16  0.15  0.17  0.17
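The table can be reproduced in a few lines (my own sketch): with a = (1/n, ..., 1/n) the conjugate-prior estimate is just count smoothing by m0/n pseudo-counts per outcome.

```python
import numpy as np

def map_estimate(counts, m0):
    # conjugate (Dirichlet) prior with a = (1/n, ..., 1/n): add m0/n pseudo-counts
    n = len(counts)
    return (counts + m0 / n) / (counts.sum() + m0)

counts = np.array([3, 6, 2, 1, 4, 4])
print(np.round(map_estimate(counts, 0), 2))     # MLE: relative frequencies
print(np.round(map_estimate(counts, 6), 2))     # MAP with m0 = 6
print(np.round(map_estimate(counts, 100), 2))   # MAP with m0 = 100
```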
Optimization Problems
Maximum likelihood
  minimize_θ  Σ_{i=1}^m [g(θ) − ⟨φ(xi), θ⟩]   ⇒   ∂_θ g(θ) = (1/m) Σ_{i=1}^m φ(xi)
Normal prior
  minimize_θ  Σ_{i=1}^m [g(θ) − ⟨φ(xi), θ⟩] + (1/(2σ²)) ‖θ‖²
Conjugate prior
  minimize_θ  Σ_{i=1}^m [g(θ) − ⟨φ(xi), θ⟩] + m0 g(θ) − m0 ⟨µ̃, θ⟩
equivalently solve
  ∂_θ g(θ) = (1/(m + m0)) Σ_{i=1}^m φ(xi) + (m0/(m + m0)) µ̃
Summary
Model
  - Log-partition function
  - Expectations and derivatives
  - Maximum entropy formulation
  - A zoo of densities
Estimation
  - Maximum likelihood estimator
  - Fisher information matrix and Cramer Rao theorem
  - Normal priors and conjugate priors
  - Fisher information and log-partition function
Exponential Families and Kernels Lecture 2
Alexander J. Smola, Alex.Smola@nicta.com.au
Machine Learning Program, National ICT Australia
RSISE, The Australian National University
Outline
Exponential Families
  - Maximum likelihood and Fisher information
  - Priors (conjugate and normal)
Conditioning and Feature Spaces
  - Conditional distributions and inner products
  - Clifford Hammersley decomposition
Applications
  - Classification and novelty detection
  - Regression
Applications
  - Conditional random fields
  - Intractable models and semidefinite approximations
Lecture 2
Clifford Hammersley Theorem and Graphical Models
  - Decomposition results
  - Key connection
Conditional Distributions
  - Log-partition function
  - Expectations and derivatives
  - Inner product formulation and kernels
  - Gaussian processes
Applications
  - Classification and regression
  - Conditional random fields
  - Spatial Poisson models
Graphical Model
Conditional independence
x, x′ are conditionally independent given c if
  p(x, x′|c) = p(x|c) p(x′|c)
Distributions can be simplified greatly by conditional independence assumptions.
Markov network
Given a graph G(V, E) with vertices V and edges E, associate a random variable x ∈ R^|V| with G. Subsets of random variables x_S, x_S′ are conditionally independent given x_C if removing the vertices C from G(V, E) decomposes the graph into disjoint subsets containing S and S′.
Conditional Independence
Cliques
Definition
  - A subset of the graph which is fully connected.
  - Maximal cliques (they define the graph).
Advantage
  - Easy to specify dependencies between variables.
  - Use graph algorithms for inference.
Hammersley Clifford Theorem
Problem
Specify p(x) with conditional independence properties.
Theorem
  p(x) = (1/Z) exp(Σ_{c∈C} ψ_c(x_c))
whenever p(x) is nonzero on the entire domain.
Application
Apply the decomposition to exponential families where p(x) = exp(⟨φ(x), θ⟩ − g(θ)).
Corollary
The sufficient statistics φ(x) decompose according to
  φ(x) = (..., φ_c(x_c), ...)   ⇒   ⟨φ(x), φ(x′)⟩ = Σ_c ⟨φ_c(x_c), φ_c(x′_c)⟩
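A minimal sketch of the corollary (mine; the clique structure and the per-clique RBF kernels are illustrative assumptions): the joint kernel is just the sum of per-clique kernels evaluated on the corresponding sub-vectors.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # a simple per-clique base kernel (illustrative choice)
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b))**2))

# cliques given as index sets into x, e.g. a chain x1 - x2 - x3
cliques = [(0, 1), (1, 2)]

def k_decomposed(x, xp):
    # k(x, x') = sum_c k_c(x_c, x'_c)
    return sum(rbf(np.take(x, c), np.take(xp, c)) for c in cliques)

x  = np.array([0.3, -1.2, 0.7])
xp = np.array([0.1,  0.4, 0.5])
print(k_decomposed(x, xp))
```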
Proof
Step 1: Obtain a linear functional
Combining the exponential setting with the CH theorem:
  ⟨Φ(x), θ⟩ = Σ_{c∈C} ψ_c(x_c) − log Z + g(θ)   for all x, θ.
Step 2: Orthonormal basis in θ
Pick an orthonormal basis and swallow Z, g. This gives
  ⟨Φ(x), e_i⟩ = Σ_{c∈C} η^i_c(x_c)   for some η^i_c(x_c).
Step 3: Reconstruct sufficient statistics
  Φ_c(x_c) := (η¹_c(x_c), η²_c(x_c), ...)
which allows us to compute
  ⟨Φ(x), θ⟩ = Σ_{c∈C} Σ_i θ_i Φ^i_c(x_c).
Example: Normal Distributions
Sufficient statistics
Recall that for normal distributions φ(x) = (x, xx⊤).
Clifford Hammersley application
  - φ(x) must decompose into subsets involving only variables from each maximal clique.
  - The linear term x is OK by default.
  - The only nonzero terms coupling x_i x_j are those corresponding to an edge in the graph G(V, E).
Inverse covariance matrix
The natural parameter aligned with xx⊤ is the inverse covariance matrix. Its sparsity mirrors G(V, E). Hence a sparse inverse kernel matrix corresponds to a graphical model!
Example: Normal Distributions
Density
  p(x|θ) = exp(Σ_{i=1}^n x_i θ_{1i} + Σ_{i,j=1}^n x_i x_j θ_{2ij} − g(θ))
Here θ2 is (up to a factor of −1/2) the inverse covariance matrix Σ⁻¹. We have (Σ⁻¹)_{ij} ≠ 0 only if (i, j) share an edge.
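To see the sparsity claim numerically, here is a small sketch of mine (the chain and its edge weights are made up): a Gaussian Markov chain x1 - x2 - ... - x5 has a tridiagonal inverse covariance, i.e. its zero pattern mirrors the edges of the graph, while the covariance itself is dense.

```python
import numpy as np

n = 5
# precision matrix (inverse covariance) of a Gaussian Markov chain x1 - ... - x5:
# nonzeros only on the diagonal and for neighbouring pairs
P = 2.0 * np.eye(n)
for i in range(n - 1):
    P[i, i + 1] = P[i + 1, i] = -0.8       # one entry per edge of the chain

Sigma = np.linalg.inv(P)

np.set_printoptions(precision=2, suppress=True)
print(Sigma)                 # dense: every pair of variables is correlated
print(np.linalg.inv(Sigma))  # sparse (tridiagonal): zeros for non-edges
```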
Conditional Distributions
Conditional density
  p(x|θ) = exp(⟨φ(x), θ⟩ − g(θ))
  p(y|x, θ) = exp(⟨φ(x, y), θ⟩ − g(θ|x))
Log-partition function
  g(θ|x) = log ∫_Y exp(⟨φ(x, y), θ⟩) dy
Sufficient criterion
p(x, y|θ) is a member of the exponential family itself.
Key idea
Avoid computing φ(x, y) directly; only evaluate inner products via
  k((x, y), (x′, y′)) := ⟨φ(x, y), φ(x′, y′)⟩
Conditional Distributions
Maximum a posteriori estimation
  −log p(θ|X) = Σ_{i=1}^m −⟨φ(xi), θ⟩ + m g(θ) + (1/(2σ²)) ‖θ‖² + c
  −log p(θ|X, Y) = Σ_{i=1}^m [−⟨φ(xi, yi), θ⟩ + g(θ|xi)] + (1/(2σ²)) ‖θ‖² + c
Solving the problem
  - The problem is strictly convex in θ.
  - Direct solution is impossible if we cannot compute φ(x, y) directly.
  - Solve the convex problem in expansion coefficients.
  - Expand θ in a linear combination of φ(xi, y).
Joint Feature Map
Representer Theorem
Objective function
  −log p(θ|X, Y) = Σ_{i=1}^m [−⟨φ(xi, yi), θ⟩ + g(θ|xi)] + (1/(2σ²)) ‖θ‖² + c
Decomposition
Decompose θ into θ = θ∥ + θ⊥ where
  θ∥ ∈ span{φ(xi, y) : 1 ≤ i ≤ m and y ∈ Y}
Both g(θ|xi) and ⟨φ(xi, yi), θ⟩ are independent of θ⊥.
Theorem
−log p(θ|X, Y) is minimized for θ⊥ = 0, hence θ = θ∥.
Consequence
If span{φ(xi, y) : 1 ≤ i ≤ m and y ∈ Y} is finite dimensional, we have a parametric optimization problem.
Using It
Expansion
  θ = Σ_{i=1}^m Σ_{y∈Y} α_{iy} φ(xi, y)
Inner product
  ⟨φ(x, y), θ⟩ = Σ_{i=1}^m Σ_{ȳ∈Y} α_{iȳ} k((x, y), (xi, ȳ))
Norm
  ‖θ‖² = Σ_{i,j=1}^m Σ_{y,y′∈Y} α_{iy} α_{jy′} k((xi, y), (xj, y′))
Log-partition function
  g(θ|x) = log Σ_{y∈Y} exp(⟨φ(x, y), θ⟩)
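Once θ is expanded in the joint kernel, g(θ|x) is just a log-sum-exp over the labels. A minimal sketch of mine for a finite label set (the data, kernel, and coefficients α are placeholders, and the joint kernel k(x, x′) δ_{yy′} anticipates the multiclass choice used later in the lecture):

```python
import numpy as np
from scipy.special import logsumexp

def k_joint(x, y, xp, yp, gamma=0.5):
    # k((x, y), (x', y')) = k(x, x') * delta_{y y'}
    return np.exp(-gamma * np.sum((x - xp)**2)) * (y == yp)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))         # training inputs x_1, ..., x_m (placeholder)
Y = [0, 1, 2]                       # finite label set
alpha = rng.normal(size=(4, 3))     # expansion coefficients alpha_{i y} (placeholder)

def t(x, y):
    # <phi(x, y), theta> = sum_{i, ybar} alpha_{i ybar} k((x, y), (x_i, ybar))
    return sum(alpha[i, yb] * k_joint(x, y, X[i], yb)
               for i in range(len(X)) for yb in Y)

def g_cond(x):
    # g(theta | x) = log sum_y exp(<phi(x, y), theta>)
    return logsumexp([t(x, y) for y in Y])

x_new = np.array([0.2, -0.4])
print(g_cond(x_new))
print([np.exp(t(x_new, y) - g_cond(x_new)) for y in Y])   # p(y | x, theta), sums to 1
```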
The Gaussian Process Link
Normal prior on θ ...
  θ ∼ N(0, σ²1)
... yields a normal prior on t(x, y) = ⟨φ(x, y), θ⟩, since the projection of a Gaussian is Gaussian.
The mean vanishes:
  E_θ[t(x, y)] = ⟨φ(x, y), E_θ[θ]⟩ = 0
The covariance yields
  Cov[t(x, y), t(x′, y′)] = E_θ[⟨φ(x, y), θ⟩⟨θ, φ(x′, y′)⟩] = σ² ⟨φ(x, y), φ(x′, y′)⟩ =: k((x, y), (x′, y′))
... so we have a Gaussian process on x ... with kernel k((x, y), (x′, y′)) = σ² ⟨φ(x, y), φ(x′, y′)⟩.
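Sample paths like the ones on the following slides can be drawn with a few lines (my own sketch; the kernel width, grid, and jitter are arbitrary choices): evaluate the kernel matrix K on a grid and draw t ∼ N(0, K).

```python
import numpy as np

def rbf_kernel(X, Xp, gamma=10.0):
    return np.exp(-gamma * (X[:, None] - Xp[None, :])**2)

x = np.linspace(0, 1, 100)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability

rng = np.random.default_rng(0)
# draw a few functions t(.) from the Gaussian process prior N(0, K)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)
print(samples.shape)   # (5, 100); plot each row against x to get sample paths
```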
Linear Covariance
Laplacian Covariance
Gaussian Covariance
Polynomial (Order 3)
B3-Spline Covariance
Sample from Gaussian RBF
Sample from linear kernel
General Strategy
Choose a suitable sufficient statistic φ(x, y)
  - A conditionally multinomial distribution leads to the Gaussian process multiclass estimator: we have a distribution over n classes which depends on x.
  - A conditionally Gaussian distribution leads to Gaussian process regression: we have a normal distribution over a random variable which depends on the location. Note: we estimate mean and variance.
  - Conditionally Poisson distributions yield locally varying Poisson processes. This has no name yet ...
Solve the optimization problem
  - This is typically convex.
The bottom line
Instead of choosing k(x, x′), choose k((x, y), (x′, y′)).
Example: GP Classification
Sufficient statistic
We pick φ(x, y) = φ(x) ⊗ e_y, that is
  k((x, y), (x′, y′)) = k(x, x′) δ_{yy′}   where y, y′ ∈ {1, ..., n}
Kernel expansion
By the representer theorem we get
  θ = Σ_{i=1}^m Σ_y α_{iy} φ(xi, y)
Optimization problem
Big mess ... but convex.
A Toy Example
Noisy Data
Summary
Clifford Hammersley Theorem and Graphical Models
  - Decomposition results
  - Key connection
  - Normal distribution
Conditional Distributions
  - Log-partition function
  - Expectations and derivatives
  - Inner product formulation and kernels
  - Gaussian processes
Applications
  - Generalized kernel trick
  - Conditioning gives existing estimation methods back
Exponential Families and Kernels Lecture 3
Alexander J. Smola, Alex.Smola@nicta.com.au
Machine Learning Program, National ICT Australia
RSISE, The Australian National University
Outline
Exponential Families
  - Maximum likelihood and Fisher information
  - Priors (conjugate and normal)
Conditioning and Feature Spaces
  - Conditional distributions and inner products
  - Clifford Hammersley decomposition
Applications
  - Classification and novelty detection
  - Regression
Applications
  - Conditional random fields
  - Intractable models and semidefinite approximations
Lecture 3
Novelty Detection
  - Density estimation
  - Thresholding and likelihood ratio
Classification
  - Log-partition function
  - Optimization problem
  - Examples
  - Clustering and transduction
Regression
  - Conditional normal distribution
  - Estimating the covariance
  - Heteroscedastic estimators
Density Estimation
Maximum a posteriori
  minimize_θ  Σ_{i=1}^m [g(θ) − ⟨φ(xi), θ⟩] + (1/(2σ²)) ‖θ‖²
Advantages
  - Convex optimization problem
  - Concentration of measure
Problems
  - The normalization g(θ) may be painful to compute.
  - For novelty detection we need no normalized p(x|θ).
  - No need to perform particularly well in high density regions.
Novelty Detection
Optimization problem
MAP:
  Σ_{i=1}^m −log p(xi|θ) + (1/(2σ²)) ‖θ‖²
Novelty detection:
  Σ_{i=1}^m max(−log [p(xi|θ) / exp(ρ − g(θ))], 0) + (1/2) ‖θ‖²
  = Σ_{i=1}^m max(ρ − ⟨φ(xi), θ⟩, 0) + (1/2) ‖θ‖²
Advantages
  - No normalization g(θ) needed
  - No need to perform particularly well in high density regions (the estimator focuses on low-density regions)
  - Quadratic program
Geometric Interpretation
Idea
Find a hyperplane that has maximum distance from the origin, yet is still closer to the origin than the observations.
Hard margin
  minimize (1/2) ‖θ‖²
  subject to ⟨θ, xi⟩ ≥ 1
Soft margin
  minimize (1/2) ‖θ‖² + C Σ_{i=1}^m ξi
  subject to ⟨θ, xi⟩ ≥ 1 − ξi and ξi ≥ 0
Dual Optimization Problem
Primal problem
  minimize (1/2) ‖θ‖² + C Σ_{i=1}^m ξi
  subject to ⟨θ, xi⟩ − 1 + ξi ≥ 0 and ξi ≥ 0
Lagrange function
We construct a Lagrange function L by subtracting the constraints, multiplied by Lagrange multipliers (αi and ηi), from the primal objective function. L has a saddle point at the optimal solution.
  L = (1/2) ‖θ‖² + C Σ_{i=1}^m ξi − Σ_{i=1}^m αi (⟨θ, xi⟩ − 1 + ξi) − Σ_{i=1}^m ηi ξi
where αi, ηi ≥ 0. For instance, if ξi < 0 we could increase L without bound via ηi.
Dual Problem, Part II
Optimality conditions
  ∂θ L = θ − Σ_{i=1}^m αi xi = 0   ⇒   θ = Σ_{i=1}^m αi xi
  ∂ξi L = C − αi − ηi = 0   ⇒   αi ∈ [0, C]
Now we substitute the two optimality conditions back into L and eliminate the primal variables.
Dual problem
  minimize (1/2) Σ_{i,j=1}^m αi αj ⟨xi, xj⟩ − Σ_{i=1}^m αi
  subject to αi ∈ [0, C]
Convexity ensures uniqueness of the optimum.
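Because the dual has only box constraints, even plain projected gradient descent solves it. The sketch below is mine (toy data, arbitrary step count; a serious implementation would call an off-the-shelf QP solver): it minimizes ½ α⊤Kα − 1⊤α over [0, C]^m and recovers θ = Σi αi xi.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, size=(50, 2))    # toy observations, well away from the origin
K = X @ X.T                              # linear kernel <x_i, x_j>
C = 0.5

alpha = np.zeros(len(X))
eta = 1.0 / np.linalg.norm(K, 2)         # step size from the largest eigenvalue of K
for _ in range(2000):
    grad = K @ alpha - 1.0               # gradient of 1/2 a'Ka - 1'a
    alpha = np.clip(alpha - eta * grad, 0.0, C)   # project onto the box [0, C]

theta = X.T @ alpha                      # theta = sum_i alpha_i x_i
# most points end up on the correct side of the hyperplane <theta, x> = 1
print(theta, (X @ theta >= 1 - 1e-3).mean())
```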
The ν-Trick
Problem
  - Depending on how we choose C, the number of points selected as lying on the "wrong" side of the hyperplane H := {x | ⟨θ, x⟩ = 1} will vary.
  - We would like to specify a certain fraction ν beforehand.
  - We want to make the setting more adaptive to the data.
Solution
Use an adaptive hyperplane that separates the data from the origin, i.e. find
  H := {x | ⟨θ, x⟩ = ρ}
where the threshold ρ is adaptive.
The ν-Trick
Primal problem
  minimize (1/2) ‖θ‖² + Σ_{i=1}^m ξi − mνρ
  subject to ⟨θ, xi⟩ − ρ + ξi ≥ 0 and ξi ≥ 0
Dual problem
  minimize (1/2) Σ_{i,j=1}^m αi αj ⟨xi, xj⟩
  subject to αi ∈ [0, 1] and Σ_{i=1}^m αi = νm
Difference to before
The Σ_i αi term vanishes from the objective function, but we get one more constraint, namely Σ_i αi = νm.
The ν-Property
Optimization problem
  minimize (1/2) ‖θ‖² + Σ_{i=1}^m ξi − mνρ
  subject to ⟨θ, xi⟩ − ρ + ξi ≥ 0 and ξi ≥ 0
Theorem
  - At most a fraction ν of the points will lie on the "wrong" side of the margin, i.e. yi f(xi) < 1.
  - At most a fraction 1 − ν of the points will lie on the "right" side of the margin, i.e. yi f(xi) > 1.
  - In the limit, those fractions become exact.
Proof idea
At the optimum, shift ρ slightly: only the active constraints will have an influence on the objective function.
Classification
Maximum a posteriori estimation
  −log p(θ|X, Y) = Σ_{i=1}^m [−⟨φ(xi, yi), θ⟩ + g(θ|xi)] + (1/(2σ²)) ‖θ‖² + c
Domain
  - Finite set of observations Y = {1, ..., n}
  - The log-partition function g(θ|x) is easy to compute.
  - Optional centering: φ(x, y) → φ(x, y) + c leaves p(y|x, θ) unchanged (it offsets both terms).
Gaussian process connection
The inner product t(x, y) = ⟨φ(x, y), θ⟩ is drawn from a Gaussian process, so this is the same setting as in the literature.
Classification
Sufficient statistic
We pick φ(x, y) = φ(x) ⊗ e_y, that is
  k((x, y), (x′, y′)) = k(x, x′) δ_{yy′}   where y, y′ ∈ {1, ..., n}
Kernel expansion
By the representer theorem we get
  θ = Σ_{i=1}^m Σ_y α_{iy} φ(xi, y)
Optimization problem
Big mess ... but convex. Solve by Newton or block-Jacobi methods.
A Toy Example
Noisy Data
SVM Connection
Problems with GP classification
  - We optimize even where classification is already good.
  - Only the sign of the classification is needed.
  - Only the "strongest" wrong class matters.
  - We want to classify with a margin.
Optimization problem
MAP:
  Σ_{i=1}^m −log p(yi|xi, θ) + (1/(2σ²)) ‖θ‖²
SVM:
  Σ_{i=1}^m max(ρ − log [p(yi|xi, θ) / max_{y≠yi} p(y|xi, θ)], 0) + (1/2) ‖θ‖²
  = Σ_{i=1}^m max(ρ − ⟨φ(xi, yi), θ⟩ + max_{y≠yi} ⟨φ(xi, y), θ⟩, 0) + (1/2) ‖θ‖²
Binary Classification
Sufficient statistics
  - The offset in φ(x, y) can be arbitrary.
  - Pick it such that φ(x, y) = y φ(x) where y ∈ {±1}.
  - The kernel matrix becomes K_ij = k((xi, yi), (xj, yj)) = yi yj k(xi, xj).
Optimization problem
The max over the other class becomes
  max_{y≠yi} ⟨φ(xi, y), θ⟩ = −yi ⟨φ(xi), θ⟩
Overall problem
  Σ_{i=1}^m max(ρ − 2 yi ⟨φ(xi), θ⟩, 0) + (1/2) ‖θ‖²
Geometrical Interpretation
Minimize (1/2) ‖θ‖² subject to yi (⟨θ, xi⟩ + b) ≥ 1 for all i.
Optimization Problem
Linear function
  f(x) = ⟨θ, x⟩ + b
Mathematical programming setting
If we require error-free classification with a margin, i.e. y f(x) ≥ 1, we obtain:
  minimize (1/2) ‖θ‖²
  subject to yi (⟨θ, xi⟩ + b) − 1 ≥ 0 for all 1 ≤ i ≤ m
Result
The dual of the optimization problem is a simple quadratic program (more later ...).
Connection back to conditional probabilities
The offset b takes care of a bias towards one of the classes.
Regression
Maximum a posteriori estimation
  −log p(θ|X, Y) = Σ_{i=1}^m [−⟨φ(xi, yi), θ⟩ + g(θ|xi)] + (1/(2σ²)) ‖θ‖² + c
Domain
  - Continuous domain of observations Y = R.
  - The log-partition function g(θ|x) is easy to compute in closed form, as for a normal distribution.
Gaussian process connection
The inner product t(x, y) = ⟨φ(x, y), θ⟩ is drawn from a Gaussian process; in particular also the rescaled mean and covariance.
Regression
Sufficient statistic (standard model)
We pick φ(x, y) = (y φ(x), y²), that is
  k((x, y), (x′, y′)) = k(x, x′) y y′ + y² y′²   where y, y′ ∈ R
Traditionally the variance is fixed, that is θ2 = const.
Sufficient statistic (fancy model)
We pick φ(x, y) = (y φ1(x), y² φ2(x)), that is
  k((x, y), (x′, y′)) = k1(x, x′) y y′ + k2(x, x′) y² y′²   where y, y′ ∈ R
We estimate mean and variance simultaneously.
Kernel expansion
By the representer theorem (and more algebra) we get
  θ = (Σ_{i=1}^m αi1 φ1(xi), Σ_{i=1}^m αi2 φ2(xi))
Training Data
Mean
  k(x)⊤ (K + σ²1)⁻¹ y
Variance
  k(x, x) + σ² − k(x)⊤ (K + σ²1)⁻¹ k(x)
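These two formulas translate directly into code. A minimal sketch of mine (the training data, kernel width, and noise level are invented) computing the predictive mean and variance at a few test points:

```python
import numpy as np

def rbf(A, B, gamma=10.0):
    return np.exp(-gamma * (A[:, None] - B[None, :])**2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 30)                       # training inputs (toy data)
y = np.sin(4 * X) + 0.1 * rng.normal(size=30)   # noisy targets
sigma2 = 0.01                                   # noise variance

K = rbf(X, X)
L = np.linalg.cholesky(K + sigma2 * np.eye(len(X)))   # factor of (K + sigma^2 1)

x_star = np.linspace(0, 1, 5)
k_star = rbf(X, x_star)                         # k(x) for each test point, shape (30, 5)

alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean = k_star.T @ alpha                         # k(x)' (K + sigma^2 1)^{-1} y
v = np.linalg.solve(L, k_star)
var = 1.0 + sigma2 - np.sum(v * v, axis=0)      # k(x, x) = 1 for the RBF kernel
print(mean)
print(var)
```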
Putting everything together . . .
Another Example
Adaptive Variance Method
Optimization problem
  minimize_α  Σ_{i=1}^m [ −(1/4) (Σ_{j=1}^m α1j k1(xi, xj))⊤ (Σ_{j=1}^m α2j k2(xi, xj))⁻¹ (Σ_{j=1}^m α1j k1(xi, xj))
                          − (1/2) log det(−2 Σ_{j=1}^m α2j k2(xi, xj))
                          − Σ_{j=1}^m (yi⊤ α1j k1(xi, xj) + (yi⊤ α2j yi) k2(xi, xj)) ]
              + (1/(2σ²)) Σ_{i,j} [ α1i⊤ α1j k1(xi, xj) + tr(α2i α2j⊤) k2(xi, xj) ]
  subject to  Σ_{i=1}^m α2i k2(xi, x) ≺ 0
Properties of the problem
  - The problem is convex.
  - The log-determinant from the normalization of the Gaussian acts as a barrier function.
  - We get a semidefinite program.
Heteroscedastic Regression
Natural Parameters
Lecture 3
Novelty Detection
  - Density estimation
  - Thresholding and likelihood ratio
Classification
  - Log-partition function
  - Optimization problem
  - Examples
  - Clustering and transduction
Regression
  - Conditional normal distribution
  - Estimating the covariance
  - Heteroscedastic estimators
Exponential Families and Kernels Lecture 4
Alexander J. Smola, Alex.Smola@nicta.com.au
Machine Learning Program, National ICT Australia
RSISE, The Australian National University
Outline
Exponential Families
  - Maximum likelihood and Fisher information
  - Priors (conjugate and normal)
Conditioning and Feature Spaces
  - Conditional distributions and inner products
  - Clifford Hammersley decomposition
Applications
  - Classification and novelty detection
  - Regression
Applications
  - Conditional random fields
  - Intractable models and semidefinite approximations
Lecture 4
Conditional Random Fields
  - Structured random variables
  - Subspace representer theorem and decomposition
  - Derivatives and conditional expectations
Inference and Message Passing
  - Dynamic programming
  - Message passing and junction trees
  - Intractable cases
Semidefinite Relaxations
  - Marginal polytopes
  - Fenchel duality and entropy
  - Relaxations for conditional random fields
Hammersley Clifford Corollary
Decomposition
The sufficient statistics φ(x) decompose according to
  φ(x) = (..., φ_c(x_c), ...)
Consequently we can write the kernel via
  k(x, x′) = ⟨φ(x), φ(x′)⟩ = Σ_c ⟨φ_c(x_c), φ_c(x′_c)⟩ = Σ_c k_c(x_c, x′_c)
Conditional Random Fields
Key points
  - Cliques are (x_t, y_t), (x_t, x_{t+1}), and (y_t, y_{t+1}).
  - We can drop the cliques in (x_t, x_{t+1}): they do not affect p(y|x, θ):
      p(y|x, θ) = exp(Σ_t [⟨φ_xy(x_t, y_t), θ_{xy,t}⟩ + ⟨φ_yy(y_t, y_{t+1}), θ_{yy,t}⟩ + ⟨φ_xx(x_t, x_{t+1}), θ_{xx,t}⟩] − g(θ|x))
Computational Issues
Key points
  - Compute g(θ|x) via dynamic programming.
  - Assume stationarity of the model, that is, θ_c does not depend on the position of the clique.
Dynamic programming
  g(θ|x) = log Σ_{y1,...,yT} ∏_t exp(⟨φ_xy(x_t, y_t), θ_xy⟩ + ⟨φ_yy(y_t, y_{t+1}), θ_yy⟩)
         = log Σ_{y1} Σ_{y2} M_1(y_1, y_2) Σ_{y3} M_2(y_2, y_3) ... Σ_{yT} M_{T−1}(y_{T−1}, y_T)
where M_t(y_t, y_{t+1}) := exp(⟨φ_xy(x_t, y_t), θ_xy⟩ + ⟨φ_yy(y_t, y_{t+1}), θ_yy⟩).
So we can compute g(θ|x), p(y_t|x, θ) and p(y_t, y_{t+1}|x, θ) via dynamic programming.
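The nested sums are just matrix products in log space. A minimal sketch of mine (the clique potentials M_t are random placeholders standing in for the φ_xy and φ_yy terms) computes g(θ|x) for a chain by a forward sweep and verifies it against brute-force enumeration:

```python
import numpy as np
from itertools import product
from scipy.special import logsumexp

rng = np.random.default_rng(0)
T, S = 6, 3                              # chain length and number of label states (toy sizes)
# log M_t(y_t, y_{t+1}) = <phi_xy(x_t, y_t), theta_xy> + <phi_yy(y_t, y_{t+1}), theta_yy>
logM = rng.normal(size=(T - 1, S, S))    # placeholder clique potentials

# forward pass: log alpha_t(y_t) = log sum over y_1, ..., y_{t-1} of the partial products
log_alpha = np.zeros(S)
for t in range(T - 1):
    log_alpha = logsumexp(log_alpha[:, None] + logM[t], axis=0)

g = logsumexp(log_alpha)                 # log-partition function g(theta | x)

# brute-force check over all S^T label sequences
brute = logsumexp([sum(logM[t, y[t], y[t + 1]] for t in range(T - 1))
                   for y in product(range(S), repeat=T)])
print(np.allclose(g, brute))             # True
```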
Forward Backward Algorithm
Key idea
  - Store the sum over all y_1, ..., y_{t−1} (forward pass) and over all y_{t+1}, ..., y_T (backward pass) as intermediate values.
  - We get those values for all positions t in one sweep.
  - Extend this to message passing (when we have trees).
Minimization
Objective function
  −log p(θ|X, Y) = Σ_{i=1}^m [−⟨φ(xi, yi), θ⟩ + g(θ|xi)] + (1/(2σ²)) ‖θ‖² + c
  ∂_θ [−log p(θ|X, Y)] = Σ_{i=1}^m [−φ(xi, yi) + E[φ(xi, y)|xi]] + (1/σ²) θ
We only need E[φ_xy(x_it, y_it)|xi] and E[φ_yy(y_it, y_i(t+1))|xi].
Kernel trick
Conditional expectations of φ(x_it, y_it) cannot be computed explicitly, but inner products can:
  ⟨φ_xy(x′_t, y′_t), E[φ_xy(x_t, y_t)|x]⟩ = E[k((x′_t, y′_t), (x_t, y_t))|x]
We only need the marginals p(y_t|x, θ) and p(y_t, y_{t+1}|x, θ), which we get via dynamic programming.
Subspace Representer Theorem
Representer theorem
Solutions of the MAP problem are given by
  θ ∈ span{φ(xi, y) for all y ∈ Y and 1 ≤ i ≤ m}
Big problem
|Y| could be huge, e.g. 2^n for sequence annotation.
Solution
  - Exploit the decomposition of φ(x, y) into sufficient statistics on cliques.
  - The restriction of Y to cliques is much smaller:
      θ_c ∈ span{φ_c(x_ci, y_c) for all y_c ∈ Y_c and 1 ≤ i ≤ m}
  - Rather than 2^n label configurations we now only need 2^|c| per clique.
CRFs and HMMs
Conditional Random Field: maximize p(y|x, θ) — chain of observations x_{t−2}, ..., x_{t+2} and labels y_{t−2}, ..., y_{t+2}.
Hidden Markov Model: maximize p(x, y|θ) — the same chain of variables, modelled generatively.
Equivalence Theorem
Theorem
CRFs and HMMs yield identical probability estimates for p(y|x, θ) if the sets of functions are equally expressive.
Proof
Write out p_CRF(y|x, θ) and p_HMM(x, y|θ), and show that they differ only in the normalization. This disappears when computing p_HMM(y|x, θ).
Consequence
Differential training for current HMM implementations.
Message Passing
Idea
Extend the forward-backward idea to trees.
Algorithm
  - Given clique potentials M_ij(y_i, y_j)
  - Initialize messages µ_ij(y_j) = 1
  - Update outgoing messages by
      µ_ij(y_j) = Σ_{yi∈Yi} ∏_{k≠j} µ_ki(y_i) M_ij(y_i, y_j)
    Here (i, k) is an edge in the graph.
Theorem
The message passing algorithm converges after n iterations (n is the diameter of the graph).
Hack
Use this for graphs with loops and hope ...
Junction Trees
Stock-standard algorithms are available to transform a graph into a junction tree. Now we can use message passing ...
Junction Tree Algorithm
Idea
Messages involve the variables in the separator sets.
Algorithm
  - Given clique potentials M_c(y_c) and separator sets s
  - Initialize messages µ_{c,s}(y_s) = 1
  - Update outgoing messages by
      µ_{c,s}(y_s) = Σ_{y_c \ y_s} ∏_{s′≠s} µ_{c′,s′}(y_{s′}) M_c(y_c)
    Here s′ is a separator set connecting c with c′.
Theorem
The message passing algorithm converges after n iterations (n is the diameter of the hypergraph).
Hack
Use this for graphs with loops and hope ...
Example
Problems
Scaling
  The algorithm scales exponentially in the treewidth. Messages are of size d^|Ys|.
Convergence with loops
  Use of message passing may or may not converge. No real proof available.
Workaround
  Use a subset of the graph and solve the inference problem with this. Average over spanning trees.
Workaround
  Use sampling methods for inference.
A Better Way
Fenchel duality
Compute the dual of the log-partition function via
  g*(µ) = sup_{θ∈Θ} ⟨µ, θ⟩ − g(θ)   (Θ is a convex domain)
Entropy and expectation parameters
The maximum of the optimization problem is obtained for µ = ∂_θ g(θ). This leads to
  H = −∫ log p(x|θ) p(x|θ) dx = −⟨µ(θ), θ⟩ + g(θ) = −g*(µ)
Strong duality
Dualizing again leads to
  g(θ) = sup_{µ∈M} ⟨θ, µ⟩ + H(µ)
Semidefinite Relaxation
Optimization problem
  g(θ) = sup_{µ∈M} ⟨θ, µ⟩ + H(µ)
Here M is the set of all possible marginals (the marginal polytope).
Relaxations of M
The polytope M is convex (by duality), but it is hard to characterize (as hard as computing g(θ)). So we relax it to M̃ by imposing constraints on higher order moments, such as
  - Interval and linear inequality constraints.
  - SDP constraints on the covariance matrix.
Upper bound on H(µ)
Gaussian bound on the entropy via the covariance, G(µ). So we get
  g(θ) ≤ sup_{µ∈M̃} ⟨θ, µ⟩ + G(µ)
Application to CRFs
Optimization problem
  −log p(θ|X, Y) = Σ_{i=1}^m [−⟨φ(xi, yi), θ⟩ + g(θ|xi)] + (1/(2σ²)) ‖θ‖² + c
                 ≤ Σ_{i=1}^m sup_{µi∈M̃i} [⟨θ, µi − φ(xi, yi)⟩ + H(µi)] + (1/(2σ²)) ‖θ‖²
Technical details
  - Minimization over θ and maximization over µi can be swapped (saddle-point property of a convex-concave problem) to obtain a dual problem in θ.
  - Map from µ to moments in y|x via an invertible sufficient statistics map.
  - Constrained max-det problem.
Summary
Conditional Random Fields
  - Structured random variables
  - Subspace representer theorem and decomposition
  - Derivatives and conditional expectations
Inference and Message Passing
  - Dynamic programming
  - Message passing and junction trees
  - Intractable cases
Semidefinite Relaxations
  - Marginal polytopes
  - Fenchel duality and entropy
  - Relaxations for conditional random fields
Shameless Plugs