SLIDE 1

Another Walkthrough of Variational Bayes

Bevan Jones
ML for NLP Reading Group
The University of Edinburgh
18th of October, 2012

SLIDE 2

Variational Bayes?

  • Bayes ← Bayes’ Theorem
  • But the integral is intractable!
    – Sampling
      • Gibbs, Metropolis-Hastings, Slice Sampling, Particle Filters…
    – Variational Bayes
      • Change the equations, replacing intractable integrals
      • This involves searching for a good approximation
      • Variational ← Calculus of Variations
        – A way of searching through a space of functions for the “best” one

SLIDE 3–4

Useful Concepts

  • Probability/Information Theory
    – Bayes’ Theorem
    – Expectations
    – Jensen’s Inequality
    – KL Divergence
    – Conjugacy
  • Calculus
    – Functionals & Functional Derivatives
    – Lagrange Multipliers
  • Logarithms

SLIDE 5

Outline

  • Part I: Principles of Variational Bayes
    – The posterior and its approximation
    – Finding the optimal approximation
      • Defining “optimal”
      • Doing the math
    – The Mean Field Assumption
    – An inference procedure
    – The lower bound
  • Part II: Dirichlet-multinomial case study
    – Intractability
    – The Mean Field Assumption
    – Dirichlet-multinomial math

SLIDE 6
The (Log) Likelihood

  • We have some observed data: x
  • We have a model relating latent variable z to the data: p(x, z)
  • To guess z, the problem is one of computing the posterior p(z|x)
  • Or, just as good, the (log) likelihood
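The quantities these bullets refer to, written out (a reconstruction; the slide’s own equations did not survive extraction):

    p(z \mid x) = \frac{p(x, z)}{p(x)},
    \qquad
    p(x) = \int p(x, z)\, dz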

SLIDE 7

Approximating p(z|x)

  • The integral in the expression for p(x) may not be easily computed
  • But we might be able to get by with an approximation for p(x, z)
  • We’ll focus on approximating only part of it

SLIDE 8

Choosing q

  • How to choose q?
  • Ideally, we want the q that is closest to p
  • Define a lower bound on p
    – Make this a “function” of q
  • Maximize the lower bound to make it as tight as possible
    – Choose q accordingly

SLIDE 9–12

Choosing q

  • Define a lower bound F on p
    – Make this a “function” of q
  • Choose q to maximize the lower bound
SLIDE 13–18

Bounding the Log Likelihood w/ Jensen’s Inequality

  • Jensen’s Inequality: f(E[y]) ≥ E[f(y)] where f is concave
  • Apply Jensen’s to the log likelihood
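The derivation these build slides step through, reconstructed in standard notation (the intermediate equations were lost in extraction, but this is the usual Jensen’s-inequality argument):

    \log p(x) = \log \int p(x,z)\,dz
              = \log \int q(z)\,\frac{p(x,z)}{q(z)}\,dz
              \ge \int q(z)\,\log\frac{p(x,z)}{q(z)}\,dz \;=\; F[q]
    \quad \text{(Jensen, since log is concave)}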

SLIDE 19
The Lower Bound

  • We can’t calculate the log likelihood, but we can compute the lower bound
  • Maximizing F tightens the lower bound on the likelihood
  • What q maximizes F?
  • If q were a variable we could do this by taking derivatives and solving for q

SLIDE 20

Functionals: the “Variational” in VB

  • Functional: a kind of “meta-function” that takes a function as input
  • F[q] is a functional of q
  • Functionals can be optimized like functions:
    – take the derivative of F[q] with respect to q,
    – set the derivative to 0, and
    – solve for q

SLIDE 21–29

Functional Derivatives / Useful Derivatives

  • The change in a functional as we change its function argument
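Two standard functional-derivative rules of the kind these slides presumably listed; they are the ones the VB derivation below actually needs (reconstructed, not recovered from the slides):

    \frac{\delta}{\delta q(z)} \int q(z')\,f(z')\,dz' = f(z),
    \qquad
    \frac{\delta}{\delta q(z)} \int q(z')\,\log q(z')\,dz' = \log q(z) + 1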

SLIDE 30

Calculating q

  • Use Lagrange multipliers to enforce the normalization constraint on q
  • 1. Find the derivative w.r.t. q
  • 2. Find the derivative w.r.t. the multiplier λ
  • 3. Solve for q and λ
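The Lagrangian and its stationary point, reconstructed using the derivative rules above:

    \mathcal{L}[q] = \int q(z)\,\log\frac{p(x,z)}{q(z)}\,dz
                   + \lambda\Big(\int q(z)\,dz - 1\Big)

    \frac{\delta\mathcal{L}}{\delta q(z)} = \log p(x,z) - \log q(z) - 1 + \lambda = 0
    \;\Rightarrow\; q(z) \propto p(x,z)
    \;\Rightarrow\; q(z) = \frac{p(x,z)}{p(x)} = p(z \mid x)

So the unconstrained optimum recovers the exact posterior, which is exactly the point slide 48 makes.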
SLIDE 31–43

Calculating q

  • Use Lagrange multipliers with the normalization constraint, worked step by step

SLIDE 44–46

KL Divergence: An Alternative View

  • Maximizing F is minimizing the KL divergence between q(z) and p(z|x)
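The identity behind the alternative view (standard): since log p(x) does not depend on q, maximizing F is exactly minimizing the KL term.

    \log p(x) = F[q] + \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big),
    \qquad
    \mathrm{KL}(q\,\|\,p) = \int q(z)\,\log\frac{q(z)}{p(z \mid x)}\,dz \;\ge\; 0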

SLIDE 47

Optimal q

SLIDE 48

Where are we?

  • We’ve bounded the likelihood (Jensen’s Ineq.)
  • Made this bound tight (Lagrange Multipliers)
  • But the best approximation is no approximation at all!
  • We need to constrain q so that it’s tractable

SLIDE 49

Optimal q in an Imperfect World

  • We can’t compute q(z) = p(z|x) directly
  • Instead, constrain the domain of F[q] to some set of more tractable functions
  • This is usually done by making independence assumptions
    – The mean field assumption: cut all dependencies

SLIDE 50
Mean Field Assumption

  • We have some observed data: x
  • We have a model relating latent variables z and θ to the data: p(x, z, θ)
  • To guess z and θ we need p(z, θ | x)
  • But the integral is hard!
  • Why?

SLIDE 51
Mean Field Assumption

  • z and θ are not independent
  • Often p(z) is defined in terms of θ
  • Example:
    – z = draws from a multinomial (words, POS tags, CFG rules, …)
    – θ = the weights of those words, POS tags, CFG rules
  • But perhaps things would be tractable if they were independent

SLIDE 52

Mean Field Assumption
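The assumption itself, as it is usually written (the equation on this slide did not survive extraction, but this is the standard mean-field form):

    q(z, \theta) = q_z(z)\, q_\theta(\theta)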

SLIDE 53–55

The Lower Bound (Again)

SLIDE 56–59

The New Lower Bound

Apply the independence assumption:
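The factorized bound these slides arrive at, in its standard form (reconstructed):

    F[q_z, q_\theta]
      = \iint q_z(z)\, q_\theta(\theta)\,
        \log\frac{p(x,z,\theta)}{q_z(z)\, q_\theta(\theta)}\,dz\,d\theta
      = \mathbb{E}_{q}[\log p(x,z,\theta)]
        - \mathbb{E}_{q_z}[\log q_z(z)]
        - \mathbb{E}_{q_\theta}[\log q_\theta(\theta)]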

SLIDE 60
The Benefit of Independence

  • The integrals get simpler
  • Even simpler after taking derivatives

SLIDE 61

Optimizing the Lower Bound

SLIDE 62

Optimal qθ(θ)

  • Use Lagrange multipliers with the normalization constraint

SLIDE 63

Optimal qz(z)

  • Use Lagrange multipliers with the normalization constraint
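The solutions these two slides derive are the standard coordinate-wise mean-field updates (reconstructed; the slide equations did not survive):

    q_\theta(\theta) \propto \exp\big(\mathbb{E}_{q_z}[\log p(x, z, \theta)]\big),
    \qquad
    q_z(z) \propto \exp\big(\mathbb{E}_{q_\theta}[\log p(x, z, \theta)]\big)

Each update holds the other factor fixed, so iterating them is coordinate ascent on F.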

SLIDE 64–65

q ≠ p

SLIDE 66

Estimating Parameters

  • Now we have our approximation q
  • We need to compute the expectations
  • Use an EM-like procedure, alternating between the two (see the sketch below)
    – It was hard to do this for p(z,θ|x)
    – It’s (hopefully) easy for q(z,θ)
      • if we’ve defined p to make use of conjugacy
      • and if we’ve chosen the right constraint for q
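A minimal sketch of the alternating scheme. The helpers update_q_z, update_q_theta, and lower_bound are hypothetical placeholders standing in for the model-specific formulas derived above, not functions from the slides:

    # Mean-field coordinate ascent (a VB analogue of EM).
    def variational_inference(x, q_theta, q_z, tol=1e-6, max_iter=1000):
        prev_f = float("-inf")
        for _ in range(max_iter):
            # "E-step": update q_z given expectations under q_theta
            q_z = update_q_z(x, q_theta)
            # "M-step": update q_theta given expectations under q_z
            q_theta = update_q_theta(x, q_z)
            # F never decreases; stop when it stops improving
            f = lower_bound(x, q_z, q_theta)
            if f - prev_f < tol:
                break
            prev_f = f
        return q_z, q_theta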

SLIDE 67

Calculating F

SLIDE 68
Calculating F

  • As a side effect of inference, we already have the log of the normalization constant for q(z)
  • So, we really only need two more expectations

SLIDE 69

Uses for F

  • We can often use F in cases where we would normally use the log likelihood
    – Measuring convergence
      • No guarantee to maximize likelihood, but we do have F
      • Others
    – Model selection
      • Choose the model with the highest lower bound
    – Selecting the number of clusters
      • Pick the number that gives us the highest lower bound
    – Parameter optimization
      • Again, optimize the lower bound w.r.t. the parameters

SLIDE 70

Conclusion to Part 1

  • VB is conceptually very simple
    – Optimization (like 1st-year calculus)
  • The challenge
    – Choosing the space of approximations to:
      • avoid intractability
      • closely fit the true distribution

SLIDE 71

Part II

Case Study: Dirichlet-Multinomial Mixture Model

SLIDE 72

Dirichlet-Multinomial Mixture Model

[Plate diagram: hyperparameters α and β; parameters φ and π (plate over K); latent assignment z and observation x (plate over N)]

SLIDE 73–74

Dirichlet-Multinomial Mixture Model

[Same plate diagram as above]

  • Quantity of interest
SLIDE 75–76

Model

  • Generative model
  • Observed data: x
  • What we’re interested in (the posterior)
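The generative-model equations on these slides did not survive extraction. One reading consistent with the plate diagram and with the later slides (a single Dirichlet over mixing weights φ, and K Dirichlet-multinomial emission distributions π; treat the exact symbol assignment as an assumption):

    \varphi \sim \mathrm{Dirichlet}(\alpha)                 % mixing weights
    \pi_k \sim \mathrm{Dirichlet}(\beta), \; k = 1,\dots,K  % per-class word distributions
    z_n \mid \varphi \sim \mathrm{Multinomial}(\varphi)     % class of item n
    x_n \mid z_n, \pi \sim \mathrm{Multinomial}(\pi_{z_n})  % observation n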

SLIDE 77–79

The Intractable Integral

  • The posterior probability requires the marginal likelihood: an intractable integral
  • We can compute the joint
  • But what about the integral over the parameters?

SLIDE 80
The Intractable Integral

  • The marginal likelihood

[Figure: two graphs over z0, z1, z2, …, one with φ explicit and one with φ integrated out]

SLIDE 81
The Intractable Integral

  • The marginal likelihood
  • Integrating out φ induces dependencies between the z’s
  • There is a similar relationship between π and the z’s
  • Calculations become intractable
  • But if we cut the dependencies between z, φ, and π…

SLIDE 82
The Mean Field Assumption

  • The approximation: q(z, φ, π) = q_z(z) q_φ(φ) q_π(π)
  • Then our lower bound becomes:
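The bound takes the same factorized form as in Part I, now with three factors (a reconstruction; the slide’s equation was lost):

    F = \mathbb{E}_q[\log p(x, z, \varphi, \pi)]
        - \mathbb{E}_{q_z}[\log q_z(z)]
        - \mathbb{E}_{q_\varphi}[\log q_\varphi(\varphi)]
        - \mathbb{E}_{q_\pi}[\log q_\pi(\pi)]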

SLIDE 83

Optimizing F

  • Apply Lagrange multipliers just like before
  • In this case, we have simply replaced z, x, and θ with vectors
  • The math is exactly the same
  • But we need to find the expectations we skipped before
    – Plug in the Dirichlet and multinomial distributions

SLIDE 84

Optimal q(z, θ)

  • Borrowed from the mean field example (see slides 62–63)
  • All we need to do is apply the particulars of the mixture model

SLIDE 85

Optimal qθ(θ)

  • Factorize qθ(θ) to get qφ(φ) qπ(π)

SLIDE 86–87

Optimal qφ(φ): The Expectation

SLIDE 88

Dirichlet Distribution

  • Notable for being the conjugate prior of the multinomial
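The density itself, since several later slides lean on its form (“Def. of Dirichlet”), along with the conjugacy fact being cited:

    \mathrm{Dirichlet}(\varphi \mid \alpha)
      = \frac{\Gamma\big(\sum_k \alpha_k\big)}{\prod_k \Gamma(\alpha_k)}
        \prod_k \varphi_k^{\alpha_k - 1}

    \mathrm{Dirichlet}(\varphi \mid \alpha) \times \prod_k \varphi_k^{n_k}
      \;\propto\; \mathrm{Dirichlet}(\varphi \mid \alpha + n)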

SLIDE 89–94

Optimal qφ(φ): The Numerator

  • Apply the definition of the Dirichlet, step by step

SLIDE 95–99

Optimal qφ(φ): The Normalization

  • Apply the definition of the Dirichlet, using the result from the numerator

SLIDE 100–106

Optimal qφ(φ): Conjugacy Helps

  • Apply the definition of the Dirichlet
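What “conjugacy helps” amounts to here, reconstructed under the same symbol reading as above (expected counts taken under q_z): the numerator is a Dirichlet kernel again, so the normalizer is a known constant rather than a new intractable integral.

    q_\varphi(\varphi)
      \propto \exp\big(\mathbb{E}_{q_z, q_\pi}[\log p(x, z, \varphi, \pi)]\big)
      \propto \prod_k \varphi_k^{\alpha_k + \mathbb{E}_{q_z}[n_k] - 1}

    \text{i.e. } q_\varphi = \mathrm{Dirichlet}\big(\alpha + \mathbb{E}_{q_z}[n]\big),
    \qquad \mathbb{E}_{q_z}[n_k] = \sum_n q(z_n = k)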
SLIDE 107

Optimal qπ(π)

  • q(π) is essentially the same as q(φ)
  • The only difference is that there are multiple π’s
  • So, q(π) should be a product of Dirichlets

SLIDE 108

Optimal qπ(π): The Expectation

SLIDE 109–111

Optimal qπ(π): The Numerator

  • Apply the model definition and the previous work

SLIDE 112–116

Optimal qπ(π): The Denominator

  • Apply the definition of the Dirichlet

SLIDE 117–121

Optimal qπ(π): Putting Them Together

  • Apply the definition of the Dirichlet
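The shape of the result, reconstructed under the same symbol-assignment assumption as above, with expected topic-word counts taken under q_z:

    q_\pi(\pi) = \prod_{j=1}^{K}
      \mathrm{Dirichlet}\big(\pi_j \mid \beta + \mathbb{E}_{q_z}[n_j]\big),
    \qquad
    \mathbb{E}_{q_z}[n_{jk}] = \sum_{n} q(z_n = j)\,[x_n = k]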

SLIDE 122

Optimal qz(z)

  • Again, from our Mean Field Assumption example (see slide 63)
  • Just apply the assumptions of our model
SLIDE 123

Optimal qz(z)

  • First, let’s work with the simpler multinomial distribution
  • Side effect: a kind of estimate for the multinomial parameter vector

SLIDE 124

A Useful Standard Result

  • The expectation, under a Dirichlet, of the log of an individual scalar component of a Dirichlet random vector
  • The digamma function
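The result in question, which is standard:

    \mathbb{E}_{\mathrm{Dirichlet}(\varphi \mid \alpha)}[\log \varphi_k]
      = \psi(\alpha_k) - \psi\big(\textstyle\sum_j \alpha_j\big),
    \qquad
    \psi(a) = \tfrac{d}{da}\log\Gamma(a)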

SLIDE 125–129

Optimal qz(z): The Expectations

  • Apply the standard result above, step by step

SLIDE 130
Optimal qz(z): The Expectations

  • Now, let’s work with the product of multinomials
  • Side effect: a kind of set of multinomial parameter vectors
  • This is essentially the same math required for HMMs and PCFGs
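Putting the two expectations together gives the usual closed-form posterior over each assignment (reconstructed, in the notation of the inserts above):

    q(z_n = j) \;\propto\;
      \exp\big(\mathbb{E}_{q_\varphi}[\log \varphi_j]
             + \mathbb{E}_{q_\pi}[\log \pi_{j, x_n}]\big)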

SLIDE 131

Optimal qz(z): The Expectations

SLIDE 132–134

Optimal qz(z): Putting It Together

SLIDE 135

Implications of Assumption

  • We should get the same result with an even weaker assumption

SLIDE 136

Inference

  • “E-Step”: compute expected counts
    – Topic counts
    – Topic-word pair counts
  • “M-Step”: compute ratios
    – Topic j
    – Topic-word pair j-k
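A runnable sketch of this procedure for the mixture model, under the symbol reading assumed above (φ = mixing weights, π_j = per-topic word distributions). The function name and structure are illustrative, not from the slides; the E/M labels above map onto the two halves of the loop.

    import numpy as np
    from scipy.special import digamma

    def vb_mixture(x, K, V, alpha=1.0, beta=1.0, n_iter=100, seed=0):
        """x: length-N array of word ids in [0, V). Returns q(z) as an (N, K) array."""
        rng = np.random.default_rng(seed)
        N = len(x)
        q_z = rng.dirichlet(np.ones(K), size=N)            # responsibilities q(z_n = j)
        for _ in range(n_iter):
            # "M-step": expected counts give the Dirichlet posteriors over phi and pi
            a = alpha + q_z.sum(axis=0)                    # q(phi)  = Dirichlet(a)
            b = beta + np.zeros((K, V))
            np.add.at(b.T, x, q_z)                         # b[j, k] = beta + E[n_jk]
            # "E-step": exp-digamma expectations replace EM's ratios of counts
            e_log_phi = digamma(a) - digamma(a.sum())
            e_log_pi = digamma(b) - digamma(b.sum(axis=1, keepdims=True))
            log_q = e_log_phi[None, :] + e_log_pi[:, x].T  # (N, K)
            log_q -= log_q.max(axis=1, keepdims=True)      # stabilize before exp
            q_z = np.exp(log_q)
            q_z /= q_z.sum(axis=1, keepdims=True)
        return q_z

    # Example: q = vb_mixture(np.array([0, 1, 1, 3, 2]), K=2, V=4)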

SLIDE 137

Calculating F

  • Also borrowed from the mean field example (see slides 67–68)
  • But we adapt it for the mixture model
SLIDE 138

Calculating F

SLIDE 139

Calculating F: The Normalization Constant

  • A by-product of computing q(z)
SLIDE 140

Conclusion to Part 2

  • Many of the most popular models in NLP are based on multinomials
    – n-grams, HMMs, PCFGs, …
  • The same math will work there
  • In fact, if you can implement EM, VB only requires changing a few lines of code
  • Similar to other smoothing techniques, but with a solid statistical grounding
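A sketch of the “few lines” in question, under the standard formulation (illustrative names): where EM’s M-step normalizes expected counts, VB takes exp-digamma ratios of counts plus the Dirichlet prior, which also acts like a principled smoother.

    import numpy as np
    from scipy.special import digamma

    counts = np.array([3.0, 1.0, 0.0])   # expected counts from an E-step
    prior = 0.5                          # symmetric Dirichlet hyperparameter

    theta_em = counts / counts.sum()     # EM M-step: normalize counts
    theta_vb = np.exp(digamma(counts + prior)
                      - digamma((counts + prior).sum()))  # VB: exp-digamma ratios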
