From Maxent to Machine Learning and Back. T. Sears, ANU, March 2007 (PowerPoint PPT presentation).



SLIDE 1

From Maxent to Machine Learning and Back

T. Sears (ANU)

March 2007

Maxent 2007, slide 1 / 36

SLIDE 2

50 Years Ago . . .

"The principles and mathematical methods of statistical mechanics are seen to be of much more general applicability. . . . In the problem of prediction, the maximization of entropy is not an application of a law of physics, but merely a method of reasoning which ensures that no unconscious arbitrary assumptions have been introduced." (E.T. Jaynes, 1957)

SLIDE 3

". . . a method of reasoning . . . "

[Cartoon] "Jenkins, if I want another yes-man I'll build one."

SLIDE 4

Outline

1. Generalizing Maxent
2. Two Examples
3. Broader Comparisons
4. Extensions/Conclusions

SLIDE 5

Generalizing Maxent

You are here: 1. Generalizing Maxent

SLIDE 6

Generalizing Maxent

The Classic Maxent Problem

Minimize negative entropy subject to linear constraints:

  min_p S(p) := Σ_{i=1}^{N} p_i log(p_i)
  subject to  Ap = b,  p_i ≥ 0

A is M × N with M < N, a "wide" matrix. b is a data vector. A := [B ; 1^T] contains a normalization constraint.

SLIDES 7-13

Generalizing Maxent

Extending the Classic Maxent Problem (built up in steps)

Original problem:

  min_p S(p)  subject to  Ap = b

Convert constraints to a convex function (an indicator):

  min_p S(p) + δ_{0}(Ap − b)

Use any norm . . .

  min_p S(p) + δ_{0}(||Ap − b||_P)

. . . and relax constraints (ε-ball in the P-norm):

  min_p S(p) + δ_{εB_P}(Ap − b)

Generalize SBG entropy to a Bregman divergence:

  min_p Δ_F(p, p0) + δ_{εB_P}(Ap − b)

Find the Fenchel dual problem to solve:

  min_μ F*(A^T μ + p0*) + ⟨μ, b⟩ + ε ||μ||_Q

This is a more general form of the MAP problem: F*(A^T μ + p0*) + ⟨μ, b⟩ plays the role of a "likelihood", and ε ||μ||_Q plays the role of a "prior".
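To make the dual route concrete: for F(p) = Σ_i (p_i log p_i − p_i) the conjugate is F*(s) = Σ_i e^{s_i} with ∇F* = exp, and with ε = 0 the dual is smooth and unconstrained. A sketch, not code from the talk (variable names are mine, and the sign convention takes μ → −μ relative to the slide so that stationarity gives A p̄ = b):

```python
import numpy as np
from scipy.optimize import minimize

# Moment constraints for a 6-faced die: mean 4.5, plus normalization.
A = np.vstack([np.arange(1, 7), np.ones(6)])   # 2 x 6; last row is the normalization "feature"
b = np.array([4.5, 1.0])

# F(p) = sum(p log p - p) has conjugate F*(s) = sum(exp(s)), with grad F* = exp.
def dual(mu):
    return np.exp(A.T @ mu).sum() - b @ mu

def dual_grad(mu):
    return A @ np.exp(A.T @ mu) - b           # zero gradient <=> A p = b

res = minimize(dual, x0=np.zeros(2), jac=dual_grad, method="BFGS")
p_bar = np.exp(A.T @ res.x)                   # recover primal: p = grad F*(A^T mu)
print(p_bar, A @ p_bar)                       # A p_bar reproduces b at the optimum
```

Solving a 2-variable smooth convex dual instead of a 6-variable constrained primal is exactly the computational payoff the slide is pointing at.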

SLIDES 14-18

Generalizing Maxent

Characterizing the solution (compare to statistical models)

After solving for μ̄ we can recover the optimal primal solution:

  p̄ = ∇F*( A^T μ̄ + p0* )

where ∇F* plays the role of a "Family" and A^T μ̄ + p0* plays the role of a "Score".

• p̄ comes from a family of distributions.
• The entropy function (F) determines the family (∇F*).
• SBG entropy → exponential family.
• Any "nice" F → some family.

SLIDE 19

Generalizing Maxent

Generalizing the Exponential Family: the q-Exponential

  exp_q(p) := [1 + (1 − q)p]_+^{1/(1−q)}   for q ≠ 1
  exp_q(p) := exp(p)                        for q = 1

[Figure: exp_q curves for q = 0.5, 1, 1.5; the q = 1.5 curve has a vertical asymptote.]
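A minimal implementation of the definition above (the helper name `expq` is mine, not from the talk):

```python
import numpy as np

def expq(p, q):
    """Tsallis q-exponential: [1 + (1-q) p]_+^(1/(1-q)), with exp(p) at q = 1."""
    p = np.asarray(p, dtype=float)
    if np.isclose(q, 1.0):
        return np.exp(p)
    base = np.maximum(1.0 + (1.0 - q) * p, 0.0)   # the [.]_+ truncation
    return base ** (1.0 / (1.0 - q))

print(expq(0.0, 0.5), expq(0.0, 1.5))   # both 1.0: exp_q(0) = 1 for every q
```

For q = 1.5 this is (1 − p/2)^{-2}, which blows up at p = 2, matching the asymptote shown in the figure.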

SLIDE 20

Generalizing Maxent

Tail Behavior

[Figure: tails of exp_q for several q.]

• q > 1 naturally gives fat tails.
• q < 1 truncates the tail.

SLIDE 21

Two Examples

You are here: 2. Two Examples

SLIDES 22-27

Two Examples

Loaded Die Example: Setup

• A die with 6 faces.
• Expected value of 4.5, instead of 3.5 for a "fair die".
• For this problem:

  A = [ 1 2 3 4 5 6 ; 1 1 1 1 1 1 ]   and   b = [ 4.5 ; 1 ]

• Find p, assuming S → S_q, p0 is uniform, ε = 0.
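For the classical instance q = 1 (SBG entropy) this setup can be solved directly as a constrained minimization; a sketch using scipy (my code, not from the talk):

```python
import numpy as np
from scipy.optimize import minimize

faces = np.arange(1, 7)
A = np.vstack([faces, np.ones(6)])            # mean constraint plus normalization
b = np.array([4.5, 1.0])

neg_entropy = lambda p: np.sum(p * np.log(p))  # S(p), the q = 1 case
res = minimize(neg_entropy,
               x0=np.full(6, 1 / 6),           # start from the fair die
               bounds=[(1e-9, 1.0)] * 6,       # keep p_i > 0 inside the log
               constraints=[{"type": "eq", "fun": lambda p: A @ p - b}],
               method="SLSQP")
p = res.x
print(np.round(p, 4))                          # weight shifts toward the high faces
```

The solution has the expected exponential-family form p_i proportional to exp(λ i), so the probabilities increase monotonically in the face value.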

SLIDE 28

Two Examples

Loaded Die Example: Sensitivity of Each Event Varies

[Figure: solution probabilities for q = 0.1, 1, 1.9.]

• Higher q raises the weight on face 1 and face 6; the opposite for faces 3, 4, 5.
• Task: make a two-way market on each die face. Which is easiest?

SLIDES 29-32

Two Examples

Example: The Dantzig Selector (Entropy Function as Prior Information)

Background: consider a variation on linear regression, y = Xβ. Choose β via

  min_β ||β||_1 + δ_{εB_∞}( X^T(Xβ − y) )

• The non-zero entries of the solution can exactly identify the correct set of regressors with high probability under special conditions (Candès and Tao, Ann. Stat. 2007).
• Special conditions: low noise, sparse "true" model β.
• Application area: compressed sensing.
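Because the objective is a polyhedral norm and the constraint set is polyhedral, the Dantzig selector is a linear program. A sketch using the "+/-" split β = u − v with u, v ≥ 0 (my formulation and variable names, not code from the talk; noiseless data so the support is recovered exactly):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, eps = 30, 8, 1e-6
X = rng.normal(size=(m, n))
beta_true = np.zeros(n); beta_true[[1, 5]] = [2.0, -1.5]   # sparse "true" model
y = X @ beta_true                                          # noiseless for the demo

G, g = X.T @ X, X.T @ y
# min 1'(u + v)   s.t.   -eps <= G(u - v) - g <= eps,   u, v >= 0
c = np.ones(2 * n)
A_ub = np.block([[G, -G], [-G, G]])
b_ub = np.concatenate([eps + g, eps - g])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
beta_hat = res.x[:n] - res.x[n:]
print(np.round(beta_hat, 3))   # recovers the sparse truth here
```

With eps tiny and G invertible the constraint pins β̂ to (nearly) the least-squares solution, which equals beta_true in this noiseless toy; the interesting regime on the slide is m < n, where the L1 objective does real work.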

SLIDES 33-35

Two Examples

Dantzig Selector: Connection

• Change of variables ("+/-" trick): β = [ I | −I ] p.
• ||β||_1 can be approached using S_q with q → 0.
• The entropy function S_q captures part of the prior knowledge.

SLIDE 36

Broader Comparisons

You are here: 3. Broader Comparisons

SLIDES 37-40

Broader Comparisons

Making Broader Comparisons: Value Regularization

Problem: model "preferences" over parameters can't be easily compared.
Solution: compare outputs instead (Rifkin and Lippert, JMLR, 2007). Many methods can be viewed as also solving

  min_y R(y) + L(y − b)

• The regularizer, R, wants smooth outputs, y.
• The loss, L, wants a close fit to the data, b (e.g. match labels).
• These goals typically compete.
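A toy instance with both pieces quadratic, using a squared loss rather than a hinge so the minimizer has a closed form (everything here, including the kernel matrix K, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 5, 0.5
M = rng.normal(size=(n, n))
K = M @ M.T + n * np.eye(n)          # a positive definite "kernel" matrix
b = rng.normal(size=n)

# min_y  (1/(2*lam)) y' K^{-1} y  +  (1/2) ||y - b||^2
# Stationarity: K^{-1} y / lam + (y - b) = 0   =>   (K^{-1}/lam + I) y = b
Kinv = np.linalg.inv(K)
y = np.linalg.solve(Kinv / lam + np.eye(n), b)

obj = lambda v: v @ Kinv @ v / (2 * lam) + 0.5 * np.sum((v - b) ** 2)
grad = Kinv @ y / lam + (y - b)      # vanishes at the minimizer
print(np.linalg.norm(grad))
```

The solution sits strictly between the loss's favorite (y = b) and the regularizer's favorite (y = 0), which is the competition the slide describes.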

SLIDES 41-43

Broader Comparisons

Generalized Maxent and Value Regularization

To apply this idea to maxent:

• Change variables y = Ap.
• The regularizer corresponds to an image function:

  R(y) = AS(y) = min_p S(p) + δ_{0}(Ap − y)

• Loss is straightforward: L(y) = δ_{εB_P}(y − b)

SLIDES 44-48

Broader Comparisons

SVMs and Value Regularization

• The Support Vector Machine (SVM, Vapnik) is one of the best known machine learning algorithms.
• Loss function is the soft-margin hinge loss: (1/2) max(0, 1 − by).
• The regularizer uses a data-dependent positive definite matrix K.
• In value regularization terms the objective function is:

  (1/(2λ)) y^T K^{-1} y  +  Σ_i hingeloss(y_i, b_i)

  with the first term acting as R and the sum as L.

• Compare to the generalized maxent objective function:

  AS(y)  +  δ_{εB_P}(y − b)

  again with the first term as R and the second as L.

SLIDE 49

Extensions/Conclusions

You are here: 4. Extensions/Conclusions

SLIDES 50-55

Extensions/Conclusions

Other Models, Briefly

• Many NLP models owe a direct debt; the connection is easily seen. Conditional models, graphical models.
• These use the exponential family (SBG entropy), almost always.
• They often replace marginal distributions with empirical counterparts. Strong assumption, big simplification.
• Non-probabilistic models: relax normalization, use the +/- trick.
• Continuous/mixed models: p becomes a function, A becomes an operator. Call in the mathematicians and approximation theory.

SLIDES 56-62

Extensions/Conclusions

Summary

• There is a class of models based on convex functions, which have interchangeable parts.
• Strong/exact connection to MAP estimation.
• Fenchel duality permits a quick switch of model assumptions.
• Benefit: the modular approach allows exploration of model space, by the modeler or the computer.
• Key required tool: flexible, non-smooth optimization tools.
• Harder: characterize the "prior knowledge" represented in the choice of regularizer and loss.
• Harder: incorporate/factor out knowledge of the task(s) to be performed with the model.

SLIDE 63

The End

Thank You

SLIDE 64

Appendix

You are here: 5. Appendix (Generalizing the Maxent Problem; The Consequences of Normalization; Phi-Exponential Families; p̄ as a Projection)

SLIDES 65-70

Appendix

Software for Experiments

• Apply a quasi-Newton method (LMVM) to the dual problem.
• The objective function requires a matrix-vector multiplication (A^T · v_{M×1}).
• The gradient requires an additional matrix-vector multiplication (A · v_{N×1}).
• Built on PETSc/TAO/Elefant. Will run single or parallel (MPI) with a simple switch.
• Additional features to accommodate non-smooth duals to constraint relaxations.
• Possible synergy: Choon-Hui, Alex, and Vishy announce a high-performance non-smooth optimization package.

SLIDE 71

Appendix: Generalizing the Maxent Problem

Classic Maxent Solution: Exponential Family Distribution

The constraint is equivalent to:

  Ap = [ B ; 1^T ] p = [ b_B ; 1 ]

Normalization is "just another feature". Try to "hide" its existence in the solution:

  p̄ = exp[A^T μ̄]
    = exp[B^T μ̄_B + 1 μ̄_1]
    = exp[B^T μ̄_B − 1 T(μ̄_B)]
    = (1/Z(μ̄_B)) exp[B^T μ̄_B]

T is the log-partition function; Z is the partition function.

SLIDE 72

Appendix: Generalizing the Maxent Problem

Convex Analysis Recap (a quick detour)

The convex conjugate of a convex function:

  F*(p*) := sup_{p ∈ dom(F)} { ⟨p*, p⟩ − F(p) }

F is Legendre if:

1. C = int(dom F) is non-empty
2. F is differentiable on C
3. ||∇F(p)|| → ∞ as p → bdry(dom F)

For Legendre functions (in int(dom F)) we have p = ∇F*(p*).
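For F(p) = p log p − p on p > 0 the conjugate works out to F*(s) = e^s, and ∇F* = exp inverts ∇F = log, as the Legendre property promises. A quick numerical check of the sup definition (my example, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize_scalar

F = lambda p: p * np.log(p) - p      # Legendre on p > 0; grad F(p) = log(p)

def conjugate(s):
    # F*(s) = sup_p { s*p - F(p) }, computed numerically over a safe bracket
    res = minimize_scalar(lambda p: -(s * p - F(p)),
                          bounds=(1e-12, 50.0), method="bounded")
    return -res.fun

for s in [-1.0, 0.0, 1.5]:
    print(conjugate(s), np.exp(s))   # the two columns agree
```

The supremum is attained at p = e^s, i.e. at p = ∇F*(s), which is the inversion identity the slide states.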

SLIDE 73

Appendix: Generalizing the Maxent Problem

A More General Objective Function: Bregman Divergence

  Δ_F(p, q) := F(p) − F(q) − ⟨∇F(q), p − q⟩

Let q be uniform (q_i = 1/N) and let S be SBG entropy. Then

  Δ_S(p, q) = Σ_i [ p_i log(p_i) − q_i log(q_i) − (1 + log(q_i))(p_i − q_i) ]
            = −Σ_i p_i log(1/p_i) + Σ_i p_i log(N) − Σ_i p_i + Σ_i q_i
            = S(p) + log(N).

Δ_S is relative entropy when q is not uniform. But we are not restricted to SBG entropy . . . .
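The identity is easy to verify numerically; the helper names below are mine:

```python
import numpy as np

def bregman(F, gradF, p, q):
    """Bregman divergence D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    return F(p) - F(q) - gradF(q) @ (p - q)

S = lambda p: np.sum(p * np.log(p))   # negative SBG entropy
gradS = lambda p: 1.0 + np.log(p)

N = 4
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.full(N, 1.0 / N)               # uniform reference distribution

lhs = bregman(S, gradS, p, q)
print(lhs, S(p) + np.log(N))          # equal, per the slide
print(lhs, np.sum(p * np.log(p / q))) # also equals the relative entropy KL(p||q)
```

The linear term drops out because p − q sums to zero, which is why the divergence collapses to S(p) + log(N) against a uniform reference.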

SLIDE 74

Appendix: Generalizing the Maxent Problem

A More General Maxent Problem: New Objective Function

  min_{p ∈ R^n} Δ_F(p, p0)  subject to  Ap = b and p_i ≥ 0.

Solve it by using the Fenchel dual

  max_{μ ∈ dom F*} −F*(A^T μ + p0*) + ⟨b, μ⟩

where (if F is Legendre)  p̄ = ∇F*( A^T μ̄ + p0* ).

SLIDE 75

Appendix: Generalizing the Maxent Problem

Solution to the Problem: New Distribution Families

This solution is more general but similar to the exponential family:

  p̄ = ∇F*( B^T μ̄_B + p0* + 1 μ̄_1 )
    = ∇F*( B^T μ̄_B + p0* − 1 T(μ̄_B) )

Here T(μ_B) is defined implicitly via

  1^T ∇F*( B^T μ̄_B + p0* − 1 T(μ̄_B) ) = 1

SLIDE 76

Appendix: The Consequences of Normalization

Scale Function Properties: Analog to the Partition Function

T is not simple to calculate. But we can deduce that T is convex, and use implicit differentiation to calculate its gradient:

  0 = ( B − ∇T(μ_B) 1^T ) ∇²F*( B^T μ_B + p0* − 1 T(μ_B) ) 1

which on rearrangement gives

  ∇T(μ_B) = B ∇²F*(p̄*) 1 / ( 1^T ∇²F*(p̄*) 1 ) = B q

SLIDE 77

Appendix: The Consequences of Normalization

Escort Distribution

When F is additively separable, q is indeed a probability distribution. (Can you see why?)

  q := ∇²F*(p̄*) 1 / ( 1^T ∇²F*(p̄*) 1 )

So B q is an expectation. When does p = q?

SLIDES 78-83

Appendix: Phi-Exponential Families

A Concrete Class of Entropies, based on φ-logarithms

Usual construction:

  log(p) = ∫_1^p (1/x) dx

Deformed log:

  log_φ(p) = ∫_1^p (1/φ(x)) dx

• Any positive increasing φ will do.
• Apply a scaling/smoothing normalization operation to obtain another such function: ψ(p).
• Form the negative entropy term: s_φ(p) = −p log_ψ(1/p).
• This leads to:
  • a convenient gradient: s_φ′(p) = log_φ(p) + k_φ
  • the φ-exponential family: p̄ = exp_φ[ A^T μ̄ + p0* − k_φ ]

SLIDES 84-88

Appendix: Phi-Exponential Families

Example from the Physics Literature: φ(x) = x^q

• Try this: pick q (between 0 and 2) and let φ(p) = p^q.
• This yields the q-logarithm from the non-extensive thermodynamics literature:

  log_q(x) := ( x^{1−q} − 1 ) / ( 1 − q )

• Apply the scaling/smoothing operation to φ to obtain ψ; in this case the operation only scales and reparameterizes φ to yield ψ.
• Use this log to form the negative entropy: −p log_ψ(1/p).
• Only Legendre for q > 1. Why?

[Figures: φ(p) = p^q; log_q vs. log; ψ(p); log_ψ(p); and s_φ(p) = −p log_q(p), each plotted for q = 0.5, 1, 1.5.]
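The q-logarithm above and the q-exponential from the main talk are inverses of one another, which is what makes the φ-exponential family usable in the dual recovery step. A small check (function names are mine):

```python
import numpy as np

def logq(x, q):
    """q-logarithm: (x^(1-q) - 1)/(1 - q), with log(x) at q = 1."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def expq(p, q):
    """q-exponential: [1 + (1-q) p]_+^(1/(1-q)), with exp(p) at q = 1."""
    if np.isclose(q, 1.0):
        return np.exp(p)
    return np.maximum(1.0 + (1.0 - q) * p, 0.0) ** (1.0 / (1.0 - q))

for q in [0.5, 1.0, 1.5]:
    print(expq(logq(2.0, q), q))   # each line prints 2.0: exp_q inverts log_q
```

As q → 1 both functions recover the ordinary log and exp, so the SBG/exponential-family case sits inside this family as the q = 1 member.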

SLIDES 89-92

Appendix: p̄ as a Projection

Looking At Projections: q Examples

[Figure: p0 projected onto the constraint set Ap = b, for several q.]

• q → 0: same as orthogonal projection.
• q = 0.6: a curved projection.
• Usual normalization (oblique projection): actually relates directly to projection under SBG entropy.
• q = 1.6: curved again.

SLIDES 93-96

Appendix: p̄ as a Projection

Four Views of Optimality

• Solution to the primal problem (Bregman projection of p0 onto Ap = b).
• Intersection of e-flat and m-flat manifolds.
• Reverse-distance solution. Non-convex!
• Orthogonality conditions. Sometimes used in algorithm design.

[Figures: Bregman Projection; Manifold Intersection; Smallest Reverse Distance; Pseudo-Orthogonality.]