SLIDE 1

Another Walkthrough of Variational Bayes

Bevan Jones
ML for NLP Reading Group
The University of Edinburgh
18th of October, 2012

SLIDE 2

Variational Bayes?

  • Bayes ← Bayes’ Theorem
  • But the integral is intractable!
    – Sampling
      • Gibbs, Metropolis-Hastings, Slice Sampling, Particle Filters…
    – Variational Bayes
      • Change the equations, replacing intractable integrals
      • This involves searching for a good approximation
      • Variational ← Calculus of Variations
        – A way of searching through a space of functions for the “best” one

SLIDE 3–4

Useful Concepts

  • Probability/Information Theory
    – Bayes’ Theorem
    – Expectations
    – Jensen’s Inequality
    – KL Divergence
    – Conjugacy
  • Calculus
    – Functionals & Functional Derivatives
    – Lagrange Multipliers
  • Logarithms

SLIDE 5

Outline

  • Part I: Principles of Variational Bayes
    – The posterior and its approximation
    – Finding the optimal approximation
      • Defining “optimal”
      • Doing the math
    – The Mean Field Assumption
    – An inference procedure
    – The lower bound
  • Part II: Dirichlet-multinomial case study
    – Intractability
    – The Mean Field Assumption
    – Dirichlet-multinomial math

SLIDE 6
The (Log) Likelihood

  • We have some observed data: x
  • We have a model relating latent variable z to the data: p(x, z)
  • To guess z, the problem is one of computing the posterior p(z|x)
  • Or, just as good, the (log) likelihood
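The quantities these bullets refer to, written out (a reconstruction; the slide’s own equations did not survive extraction):

    p(z \mid x) = \frac{p(x, z)}{p(x)},
    \qquad
    p(x) = \int p(x, z)\, dz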

SLIDE 7

Approximating p(z|x)

  • The integral in the expression for p(x) may not be easily computed
  • But we might be able to get by with an approximation for p(x, z)
  • We’ll focus on approximating only part of it

SLIDE 8

Choosing q

  • How to choose q?
  • Ideally, we want the q that is closest to p
  • Define a lower bound on p
    – Make this a “function” of q
  • Maximize the lower bound to make it as tight as possible
    – Choose q accordingly

SLIDE 9–12

Choosing q

  • Define a lower bound F on p
    – Make this a “function” of q
  • Choose q to maximize the lower bound
SLIDE 13–18

Bounding the Log Likelihood w/ Jensen’s Inequality

  • Jensen’s Inequality: f(E[y]) ≥ E[f(y)] where f is concave
  • Apply Jensen’s to the log likelihood
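The derivation these build slides step through, reconstructed in standard notation (the intermediate equations were lost in extraction, but this is the usual Jensen’s-inequality argument):

    \log p(x) = \log \int p(x,z)\,dz
              = \log \int q(z)\,\frac{p(x,z)}{q(z)}\,dz
              \ge \int q(z)\,\log\frac{p(x,z)}{q(z)}\,dz \;=\; F[q]
    \quad \text{(Jensen, since log is concave)}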

SLIDE 19
The Lower Bound

  • We can’t calculate the log likelihood, but we can compute the lower bound
  • Maximizing F tightens the lower bound on the likelihood
  • What q maximizes F?
  • If q were a variable we could do this by taking derivatives and solving for q

SLIDE 20

Functionals: the “Variational” in VB

  • Functional: a kind of “meta-function” that takes a function as input
  • F[q] is a functional of q
  • Functionals can be optimized like functions:
    – take the derivative of F[q] with respect to q,
    – set the derivative to 0, and
    – solve for q

SLIDE 21–29

Functional Derivatives / Useful Derivatives

  • The change in a functional as we change its function argument
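Two standard functional-derivative rules of the kind these slides presumably listed; they are the ones the VB derivation below actually needs (reconstructed, not recovered from the slides):

    \frac{\delta}{\delta q(z)} \int q(z')\,f(z')\,dz' = f(z),
    \qquad
    \frac{\delta}{\delta q(z)} \int q(z')\,\log q(z')\,dz' = \log q(z) + 1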

SLIDE 30

Calculating q

  • Use Lagrange multipliers to enforce the normalization constraint on q
  • 1. Find the derivative w.r.t. q
  • 2. Find the derivative w.r.t. the multiplier λ
  • 3. Solve for q and λ
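The Lagrangian and its stationary point, reconstructed using the derivative rules above:

    \mathcal{L}[q] = \int q(z)\,\log\frac{p(x,z)}{q(z)}\,dz
                   + \lambda\Big(\int q(z)\,dz - 1\Big)

    \frac{\delta\mathcal{L}}{\delta q(z)} = \log p(x,z) - \log q(z) - 1 + \lambda = 0
    \;\Rightarrow\; q(z) \propto p(x,z)
    \;\Rightarrow\; q(z) = \frac{p(x,z)}{p(x)} = p(z \mid x)

So the unconstrained optimum recovers the exact posterior, which is exactly the point slide 48 makes.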
SLIDE 31–43

Calculating q

  • Use Lagrange multipliers with the normalization constraint, worked step by step

SLIDE 44–46

KL Divergence: An Alternative View

  • Maximizing F is minimizing the KL divergence between q(z) and p(z|x)
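The identity behind the alternative view (standard): since log p(x) does not depend on q, maximizing F is exactly minimizing the KL term.

    \log p(x) = F[q] + \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big),
    \qquad
    \mathrm{KL}(q\,\|\,p) = \int q(z)\,\log\frac{q(z)}{p(z \mid x)}\,dz \;\ge\; 0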

SLIDE 47

Optimal q

SLIDE 48

Where are we?

  • We’ve bounded the likelihood (Jensen’s Ineq.)
  • Made this bound tight (Lagrange Multipliers)
  • But the best approximation is no approximation at all!
  • We need to constrain q so that it’s tractable

SLIDE 49

Optimal q in an Imperfect World

  • We can’t compute q(z) = p(z|x) directly
  • Instead, constrain the domain of F[q] to some set of more tractable functions
  • This is usually done by making independence assumptions
    – The mean field assumption: cut all dependencies

SLIDE 50
Mean Field Assumption

  • We have some observed data: x
  • We have a model relating latent variables z and θ to the data: p(x, z, θ)
  • To guess z and θ we need p(z, θ | x)
  • But the integral is hard!
  • Why?

SLIDE 51
Mean Field Assumption

  • z and θ are not independent
  • Often p(z) is defined in terms of θ
  • Example:
    – z = draws from a multinomial (words, POS tags, CFG rules, …)
    – θ = the weights of those words, POS tags, CFG rules
  • But perhaps things would be tractable if they were independent

SLIDE 52

Mean Field Assumption
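The assumption itself, as it is usually written (the equation on this slide did not survive extraction, but this is the standard mean-field form):

    q(z, \theta) = q_z(z)\, q_\theta(\theta)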

SLIDE 53–55

The Lower Bound (Again)

SLIDE 56–59

The New Lower Bound

Apply the independence assumption:
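The factorized bound these slides arrive at, in its standard form (reconstructed):

    F[q_z, q_\theta]
      = \iint q_z(z)\, q_\theta(\theta)\,
        \log\frac{p(x,z,\theta)}{q_z(z)\, q_\theta(\theta)}\,dz\,d\theta
      = \mathbb{E}_{q}[\log p(x,z,\theta)]
        - \mathbb{E}_{q_z}[\log q_z(z)]
        - \mathbb{E}_{q_\theta}[\log q_\theta(\theta)]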

SLIDE 60
The Benefit of Independence

  • The integrals get simpler
  • Even simpler after taking derivatives

SLIDE 61

Optimizing the Lower Bound

SLIDE 62

Optimal qθ(θ)

  • Use Lagrange multipliers with the normalization constraint

SLIDE 63

Optimal qz(z)

  • Use Lagrange multipliers with the normalization constraint
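The solutions these two slides derive are the standard coordinate-wise mean-field updates (reconstructed; the slide equations did not survive):

    q_\theta(\theta) \propto \exp\big(\mathbb{E}_{q_z}[\log p(x, z, \theta)]\big),
    \qquad
    q_z(z) \propto \exp\big(\mathbb{E}_{q_\theta}[\log p(x, z, \theta)]\big)

Each update holds the other factor fixed, so iterating them is coordinate ascent on F.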

SLIDE 64–65

q ≠ p

SLIDE 66

Estimating Parameters

  • Now we have our approximation q
  • We need to compute the expectations
  • Use an EM-like procedure, alternating between the two (see the sketch below)
    – It was hard to do this for p(z,θ|x)
    – It’s (hopefully) easy for q(z,θ)
      • if we’ve defined p to make use of conjugacy
      • and if we’ve chosen the right constraint for q
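A minimal sketch of the alternating scheme. The helpers update_q_z, update_q_theta, and lower_bound are hypothetical placeholders standing in for the model-specific formulas derived above, not functions from the slides:

    # Mean-field coordinate ascent (a VB analogue of EM).
    def variational_inference(x, q_theta, q_z, tol=1e-6, max_iter=1000):
        prev_f = float("-inf")
        for _ in range(max_iter):
            # "E-step": update q_z given expectations under q_theta
            q_z = update_q_z(x, q_theta)
            # "M-step": update q_theta given expectations under q_z
            q_theta = update_q_theta(x, q_z)
            # F never decreases; stop when it stops improving
            f = lower_bound(x, q_z, q_theta)
            if f - prev_f < tol:
                break
            prev_f = f
        return q_z, q_theta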

SLIDE 67

Calculating F

SLIDE 68
Calculating F

  • As a side effect of inference, we already have the log of the normalization constant for q(z)
  • So, we really only need two more expectations

SLIDE 69

Uses for F

  • We can often use F in cases where we would normally use the log likelihood
    – Measuring convergence
      • No guarantee to maximize likelihood, but we do have F
      • Others
    – Model selection
      • Choose the model with the highest lower bound
    – Selecting the number of clusters
      • Pick the number that gives us the highest lower bound
    – Parameter optimization
      • Again, optimize the lower bound w.r.t. the parameters

SLIDE 70

Conclusion to Part 1

  • VB is conceptually very simple
    – Optimization (like 1st-year calculus)
  • The challenge
    – Choosing the space of approximations to:
      • avoid intractability
      • closely fit the true distribution

SLIDE 71

Part II

Case Study: Dirichlet-Multinomial Mixture Model

SLIDE 72

Dirichlet-Multinomial Mixture Model

[Plate diagram: hyperparameters α and β; parameters φ and π (plate over K); latent assignment z and observation x (plate over N)]

SLIDE 73–74

Dirichlet-Multinomial Mixture Model

[Same plate diagram as above]

  • Quantity of interest
SLIDE 75–76

Model

  • Generative model
  • Observed data: x
  • What we’re interested in (the posterior)
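The generative-model equations on these slides did not survive extraction. One reading consistent with the plate diagram and with the later slides (a single Dirichlet over mixing weights φ, and K Dirichlet-multinomial emission distributions π; treat the exact symbol assignment as an assumption):

    \varphi \sim \mathrm{Dirichlet}(\alpha)                 % mixing weights
    \pi_k \sim \mathrm{Dirichlet}(\beta), \; k = 1,\dots,K  % per-class word distributions
    z_n \mid \varphi \sim \mathrm{Multinomial}(\varphi)     % class of item n
    x_n \mid z_n, \pi \sim \mathrm{Multinomial}(\pi_{z_n})  % observation n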

SLIDE 77–79

The Intractable Integral

  • The posterior probability requires the marginal likelihood: an intractable integral
  • We can compute the joint
  • But what about the integral over the parameters?

SLIDE 80
The Intractable Integral

  • The marginal likelihood

[Figure: two graphs over z0, z1, z2, …, one with φ explicit and one with φ integrated out]

SLIDE 81
The Intractable Integral

  • The marginal likelihood
  • Integrating out φ induces dependencies between the z’s
  • There is a similar relationship between π and the z’s
  • Calculations become intractable
  • But if we cut the dependencies between z, φ, and π…

SLIDE 82
The Mean Field Assumption

  • The approximation: q(z, φ, π) = q_z(z) q_φ(φ) q_π(π)
  • Then our lower bound becomes:
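The bound takes the same factorized form as in Part I, now with three factors (a reconstruction; the slide’s equation was lost):

    F = \mathbb{E}_q[\log p(x, z, \varphi, \pi)]
        - \mathbb{E}_{q_z}[\log q_z(z)]
        - \mathbb{E}_{q_\varphi}[\log q_\varphi(\varphi)]
        - \mathbb{E}_{q_\pi}[\log q_\pi(\pi)]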

SLIDE 83

Optimizing F

  • Apply Lagrange multipliers just like before
  • In this case, we have simply replaced z, x, and θ with vectors
  • The math is exactly the same
  • But we need to find the expectations we skipped before
    – Plug in the Dirichlet and multinomial distributions

SLIDE 84

Optimal q(z, θ)

  • Borrowed from the mean field example (see slides 62–63)
  • All we need to do is apply the particulars of the mixture model

SLIDE 85

Optimal qθ(θ)

  • Factorize qθ(θ) to get qφ(φ) qπ(π)

SLIDE 86–87

Optimal qφ(φ): The Expectation

SLIDE 88

Dirichlet Distribution

  • Notable for being the conjugate prior of the multinomial
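The density itself, since several later slides lean on its form (“Def. of Dirichlet”), along with the conjugacy fact being cited:

    \mathrm{Dirichlet}(\varphi \mid \alpha)
      = \frac{\Gamma\big(\sum_k \alpha_k\big)}{\prod_k \Gamma(\alpha_k)}
        \prod_k \varphi_k^{\alpha_k - 1}

    \mathrm{Dirichlet}(\varphi \mid \alpha) \times \prod_k \varphi_k^{n_k}
      \;\propto\; \mathrm{Dirichlet}(\varphi \mid \alpha + n)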

SLIDE 89–94

Optimal qφ(φ): The Numerator

  • Apply the definition of the Dirichlet, step by step

SLIDE 95–99

Optimal qφ(φ): The Normalization

  • Apply the definition of the Dirichlet, using the result from the numerator

SLIDE 100–106

Optimal qφ(φ): Conjugacy Helps

  • Apply the definition of the Dirichlet
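What “conjugacy helps” amounts to here, reconstructed under the same symbol reading as above (expected counts taken under q_z): the numerator is a Dirichlet kernel again, so the normalizer is a known constant rather than a new intractable integral.

    q_\varphi(\varphi)
      \propto \exp\big(\mathbb{E}_{q_z, q_\pi}[\log p(x, z, \varphi, \pi)]\big)
      \propto \prod_k \varphi_k^{\alpha_k + \mathbb{E}_{q_z}[n_k] - 1}

    \text{i.e. } q_\varphi = \mathrm{Dirichlet}\big(\alpha + \mathbb{E}_{q_z}[n]\big),
    \qquad \mathbb{E}_{q_z}[n_k] = \sum_n q(z_n = k)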
SLIDE 107

Optimal qπ(π)

  • q(π) is essentially the same as q(φ)
  • The only difference is that there are multiple π’s
  • So, q(π) should be a product of Dirichlets

SLIDE 108

Optimal qπ(π): The Expectation

SLIDE 109–111

Optimal qπ(π): The Numerator

  • Apply the model definition and the previous work

SLIDE 112–116

Optimal qπ(π): The Denominator

  • Apply the definition of the Dirichlet

SLIDE 117–121

Optimal qπ(π): Putting Them Together

  • Apply the definition of the Dirichlet
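The shape of the result, reconstructed under the same symbol-assignment assumption as above, with expected topic-word counts taken under q_z:

    q_\pi(\pi) = \prod_{j=1}^{K}
      \mathrm{Dirichlet}\big(\pi_j \mid \beta + \mathbb{E}_{q_z}[n_j]\big),
    \qquad
    \mathbb{E}_{q_z}[n_{jk}] = \sum_{n} q(z_n = j)\,[x_n = k]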

SLIDE 122

Optimal qz(z)

  • Again, from our Mean Field Assumption example (see slide 63)
  • Just apply the assumptions of our model
SLIDE 123

Optimal qz(z)

  • First, let’s work with the simpler multinomial distribution
  • Side effect: a kind of estimate for the multinomial parameter vector

SLIDE 124

A Useful Standard Result

  • The expectation, under a Dirichlet, of the log of an individual scalar component of a Dirichlet random vector
  • The digamma function
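The result in question, which is standard:

    \mathbb{E}_{\mathrm{Dirichlet}(\varphi \mid \alpha)}[\log \varphi_k]
      = \psi(\alpha_k) - \psi\big(\textstyle\sum_j \alpha_j\big),
    \qquad
    \psi(a) = \tfrac{d}{da}\log\Gamma(a)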

SLIDE 125–129

Optimal qz(z): The Expectations

  • Apply the standard result above, step by step

SLIDE 130
Optimal qz(z): The Expectations

  • Now, let’s work with the product of multinomials
  • Side effect: a kind of set of multinomial parameter vectors
  • This is essentially the same math required for HMMs and PCFGs
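Putting the two expectations together gives the usual closed-form posterior over each assignment (reconstructed, in the notation of the inserts above):

    q(z_n = j) \;\propto\;
      \exp\big(\mathbb{E}_{q_\varphi}[\log \varphi_j]
             + \mathbb{E}_{q_\pi}[\log \pi_{j, x_n}]\big)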

SLIDE 131

Optimal qz(z): The Expectations

SLIDE 132–134

Optimal qz(z): Putting It Together

SLIDE 135

Implications of Assumption

  • We should get the same result with an even weaker assumption

SLIDE 136

Inference

  • “E-Step”: compute expected counts
    – Topic counts
    – Topic-word pair counts
  • “M-Step”: compute ratios
    – Topic j
    – Topic-word pair j-k
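A runnable sketch of this procedure for the mixture model, under the symbol reading assumed above (φ = mixing weights, π_j = per-topic word distributions). The function name and structure are illustrative, not from the slides; the E/M labels above map onto the two halves of the loop.

    import numpy as np
    from scipy.special import digamma

    def vb_mixture(x, K, V, alpha=1.0, beta=1.0, n_iter=100, seed=0):
        """x: length-N array of word ids in [0, V). Returns q(z) as an (N, K) array."""
        rng = np.random.default_rng(seed)
        N = len(x)
        q_z = rng.dirichlet(np.ones(K), size=N)            # responsibilities q(z_n = j)
        for _ in range(n_iter):
            # "M-step": expected counts give the Dirichlet posteriors over phi and pi
            a = alpha + q_z.sum(axis=0)                    # q(phi)  = Dirichlet(a)
            b = beta + np.zeros((K, V))
            np.add.at(b.T, x, q_z)                         # b[j, k] = beta + E[n_jk]
            # "E-step": exp-digamma expectations replace EM's ratios of counts
            e_log_phi = digamma(a) - digamma(a.sum())
            e_log_pi = digamma(b) - digamma(b.sum(axis=1, keepdims=True))
            log_q = e_log_phi[None, :] + e_log_pi[:, x].T  # (N, K)
            log_q -= log_q.max(axis=1, keepdims=True)      # stabilize before exp
            q_z = np.exp(log_q)
            q_z /= q_z.sum(axis=1, keepdims=True)
        return q_z

    # Example: q = vb_mixture(np.array([0, 1, 1, 3, 2]), K=2, V=4)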

SLIDE 137

Calculating F

  • Also borrowed from the mean field example (see slides 67–68)
  • But we adapt it for the mixture model
SLIDE 138

Calculating F

SLIDE 139

Calculating F: The Normalization Constant

  • A by-product of computing q(z)
SLIDE 140

Conclusion to Part 2

  • Many of the most popular models in NLP are based on multinomials
    – n-grams, HMMs, PCFGs, …
  • The same math will work there
  • In fact, if you can implement EM, VB only requires changing a few lines of code
  • Similar to other smoothing techniques, but with a solid statistical grounding
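A sketch of the “few lines” in question, under the standard formulation (illustrative names): where EM’s M-step normalizes expected counts, VB takes exp-digamma ratios of counts plus the Dirichlet prior, which also acts like a principled smoother.

    import numpy as np
    from scipy.special import digamma

    counts = np.array([3.0, 1.0, 0.0])   # expected counts from an E-step
    prior = 0.5                          # symmetric Dirichlet hyperparameter

    theta_em = counts / counts.sum()     # EM M-step: normalize counts
    theta_vb = np.exp(digamma(counts + prior)
                      - digamma((counts + prior).sum()))  # VB: exp-digamma ratios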
