
Topic Modeling
Lecture 9: October 9, 2013

CS886-2 Natural Language Understanding
University of Waterloo

CS886 Lecture Slides (c) 2013 P. Poupart

Information Retrieval Example #1


Information Retrieval Example #2


Information Retrieval Example #3


Latent Semantic Analysis

  • Idea: singular value decomposition

– Infer latent space in which documents or words can be described more succinctly

  • Issues:

– How do we interpret this latent space?
– How many dimensions should it have?
– How can we represent uncertainty/ambiguities?
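As a concrete illustration of the SVD idea, here is a minimal numpy-only sketch; the toy term-document counts and the choice of k = 2 latent dimensions are assumptions for illustration, not the lecture's data.

import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 3, 1],
    [0, 1, 2, 2],
], dtype=float)

k = 2                                         # number of latent dimensions (picked by hand)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T        # documents described in the k-dim latent space

def cosine(a, b):
    # Cosine similarity between two latent-space vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(doc_vecs[0], doc_vecs[1]))       # similarity of documents 0 and 1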



Latent Dirichlet Allocation

  • Idea: probabilistic generative model for documents

– Latent variables often correspond to topics
– Some machine learning techniques can automatically infer the # of topics
– Probabilistic framework allows us to quantify uncertainty/ambiguities
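The generative story can be sketched in a few lines of numpy; the topic count, vocabulary size, and symmetric hyperparameters below are assumed toy values, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
T, V, D, N = 3, 20, 5, 50          # topics, vocabulary size, documents, words per document
alpha, beta = 0.1, 0.01            # symmetric Dirichlet hyperparameters

phi = rng.dirichlet([beta] * V, size=T)        # one word distribution per topic
docs = []
for _ in range(D):
    theta = rng.dirichlet([alpha] * T)         # per-document topic mixture
    z = rng.choice(T, size=N, p=theta)         # latent topic assignment for each word
    w = [rng.choice(V, p=phi[t]) for t in z]   # word drawn from its assigned topic
    docs.append(w)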


Graphical Model

  • Picture



Plate Model

  • Picture


Dirichlet

  • Definition:

    Dir(p; α_h, α_t) ∝ p^(α_h − 1) (1 − p)^(α_t − 1)

  • p: probability of head
  • α_h − 1: # of heads
  • α_t − 1: # of tails
  • Mean: α_h / (α_h + α_t)

[Figure: Pr(p) as a function of p for Dir(p; 1, 1), Dir(p; 2, 8), Dir(p; 20, 80)]
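A quick way to check the three densities named in the figure is scipy's Beta distribution, which is the two-outcome Dirichlet (assuming scipy is available; this is only a sanity-check sketch).

from scipy.stats import beta

for a_h, a_t in [(1, 1), (2, 8), (20, 80)]:
    d = beta(a_h, a_t)   # Dir(p; a_h, a_t) over a single probability p
    print(f"Dir(p; {a_h}, {a_t}): mean = {d.mean():.3f}, density at p = 0.2: {d.pdf(0.2):.3f}")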


Conjugate Prior

  • Bayesian learning

– Prior: Pr(p) = Dir(p; α_h, α_t)
– Posterior: Pr(p | data)

  • Bayes theorem:

Pr(p | data) ∝ Pr(data | p) Pr(p)
            ∝ p^(N_h) (1 − p)^(N_t) · p^(α_h − 1) (1 − p)^(α_t − 1)
            ∝ Dir(p; α_h + N_h, α_t + N_t)

e.g., a Dir(p; 1, 1) prior updated with two observed heads gives Dir(p; 3, 1)
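Because the posterior stays in the Dirichlet family, the Bayesian update is just hyperparameter addition. A minimal sketch of the coin example (the helper function name is made up; the two-heads data follows the slide's example; plain Python, no libraries):

def dirichlet_update(alpha_h, alpha_t, n_heads, n_tails):
    # Posterior hyperparameters after observing n_heads heads and n_tails tails.
    return alpha_h + n_heads, alpha_t + n_tails

post_h, post_t = dirichlet_update(1, 1, 2, 0)   # Dir(p; 1, 1) prior, two observed heads
print(post_h, post_t)                           # 3 1  ->  Dir(p; 3, 1)
print(post_h / (post_h + post_t))               # posterior mean of p = 0.75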


Topic Modeling

  • Task

– Infer topics and parameters: Pr(z, θ, φ | w), i.e., the topic assignments z, the per-document topic mixtures θ, and the per-topic word distributions φ, given the observed words w

  • Two common approaches

– Gibbs sampling

  • Simple, but stochastic and slow

– Variational Bayes (variant of EM)

  • Complex, but deterministic and fast



Sampling Techniques

  • Direct sampling
  • Rejection sampling
  • Likelihood weighting
  • Importance sampling
  • Markov chain Monte Carlo (MCMC)

– Gibbs Sampling
– Metropolis-Hastings

  • Sequential Monte Carlo sampling (a.k.a. particle filtering)


Approximate Inference by Sampling

  • Expectation: E[f(X)] = ∫ Pr(x) f(x) dx

    – Approximate integral by sampling:

      E[f(X)] ≈ (1/n) Σ_{i=1..n} f(x^(i))   where x^(i) ~ Pr(X)

  • Inference query: Pr(X | e) = Σ_y Pr(X, y | e) = Σ_y Pr(X | y, e) Pr(y | e)

    – Approximate exponentially large sum by sampling:

      Pr(X | e) ≈ (1/n) Σ_{i=1..n} Pr(X | y^(i), e)   where y^(i) ~ Pr(Y | e)
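A tiny Monte Carlo sketch of the expectation approximation; the target is an assumed toy example, X ~ Uniform(0, 1) with f(x) = x², whose exact expectation is 1/3.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(0.0, 1.0, size=n)   # samples x^(i) ~ Pr(X)
estimate = (x ** 2).mean()          # (1/n) Σ f(x^(i))
print(estimate)                     # ≈ 0.333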



Direct Sampling (a.k.a. forward sampling)

  • Unconditional inference queries only (i.e., Pr(X))

  • Bayesian networks only

– Idea: sample each variable given the values of its parents according to the topological order of the graph.


Direct Sampling Algorithm

Sort the variables X_1, …, X_N by topological order
For i = 1 to n do (sample n particles)
    For each variable X_j (in topological order) do
        Sample x_j^(i) ~ Pr(X_j | parents(X_j) = their values in particle i)

  • Approximation: Pr(X = x) ≈ (1/n) Σ_{i=1..n} 1(x^(i) = x)
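A minimal forward-sampling sketch for a hand-made two-variable Bayes net Rain → WetGrass; the CPT numbers are assumptions for illustration, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
counts = {}
for _ in range(n):
    rain = rng.random() < 0.2                  # Pr(Rain = true) = 0.2
    p_wet = 0.9 if rain else 0.1               # Pr(Wet = true | Rain)
    wet = rng.random() < p_wet                 # sample child given its sampled parent
    counts[(rain, wet)] = counts.get((rain, wet), 0) + 1

for assignment, c in sorted(counts.items()):
    print(assignment, c / n)                   # ≈ Pr(Rain, Wet)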


Example


Analysis

  • Complexity: O(nN) where N = # of variables and n = # of samples (each particle samples every variable once)
  • Accuracy

– Absolute error ε:  Pr( P̂(v) ∉ [P(v) − ε, P(v) + ε] ) ≤ 2 e^(−2nε²)

  • Sample size: n ≥ ln(2/δ) / (2ε²) to keep this probability below δ

– Relative error ε:  Pr( P̂(v) ∉ [P(v)(1 − ε), P(v)(1 + ε)] ) ≤ 2 e^(−n P(v) ε²/3)

  • Sample size: n ≥ 3 ln(2/δ) / (P(v) ε²)
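For example, plugging assumed target values ε = 0.01 and δ = 0.05 into the absolute-error bound gives the required number of particles:

import math

eps, delta = 0.01, 0.05
n = math.log(2 / delta) / (2 * eps ** 2)   # n ≥ ln(2/δ) / (2ε²)
print(math.ceil(n))                        # 18445 samples for ±0.01 error with prob. ≥ 0.95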


Markov Chain Monte Carlo

  • Iterative sampling technique that converges to the desired distribution in the limit

  • Idea: set up a Markov chain such that its stationary distribution is the desired distribution


Markov Chain

  • Definition: A Markov chain is a linear chain Bayesian network with a stationary conditional distribution known as the transition function

  • Initial distribution: Pr(x_0)
  • Transition distribution: Pr(x_{t+1} | x_t)



Asymptotic Behaviour

  • Let Pr_t(x_t) be the distribution at time step t:

    Pr_t(x_t) = Σ_{x_0..x_{t−1}} Pr(x_0) ∏_{i=0..t−1} Pr(x_{i+1} | x_i)
              = Σ_{x_{t−1}} Pr_{t−1}(x_{t−1}) Pr(x_t | x_{t−1})

  • In the limit (i.e., when t → ∞), the Markov chain may converge to a stationary distribution Pr_∞:

    Pr_∞(x') = Σ_x Pr_∞(x) Pr(x' | x)


Stationary distribution

  • Let T(x' | x) = Pr(x_{t+1} = x' | x_t = x) be a matrix that represents the transition function

  • If we think of Pr_∞ as a column vector, then Pr_∞ is an eigenvector of T with eigenvalue 1:  T Pr_∞ = Pr_∞
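This eigenvector view can be checked numerically with numpy; the 2-state transition matrix below is an assumed toy example.

import numpy as np

# Columns sum to 1: T[x_next, x_now] = Pr(x_next | x_now).
T = np.array([[0.9, 0.5],
              [0.1, 0.5]])

eigvals, eigvecs = np.linalg.eig(T)
idx = np.argmin(np.abs(eigvals - 1.0))   # locate the eigenvalue-1 eigenvector
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()                       # normalize into a probability vector
print(pi)                                # stationary distribution ≈ [0.833, 0.167]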



Ergodic Markov Chain

  • Definition: A Markov chain is ergodic when there is a non-zero probability of reaching any state from any state in a finite number of steps

  • When the Markov chain is ergodic, there is a unique stationary distribution

  • Sufficient condition: detailed balance

Pr(x) Pr(x' | x) = Pr(x') Pr(x | x')

Detailed balance + ergodicity ⇒ unique stationary distribution


Markov Chain Monte Carlo

  • Idea: set up an ergodic Markov chain such that the unique stationary distribution is the desired distribution

  • Since the Markov chain is a linear chain Bayes net, we can use direct sampling (forward sampling) to obtain a sample of the stationary distribution



Generic MCMC Algorithm

Sample x_0 ~ Pr(X_0)
For t = 1 to n do (sample n particles)
    Sample x_t ~ Pr(X_t | x_{t−1})

  • Approximation: Pr(X = x) ≈ (1/n) Σ_{t=1..n} 1(x_t = x)
  • In practice, ignore the first k samples for a better estimate (burn-in period):

    Pr(X = x) ≈ 1/(n − k) Σ_{t=k+1..n} 1(x_t = x)
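A minimal sketch of this loop on an assumed toy 2-state chain (same transition-matrix convention as above), including the burn-in discard:

import numpy as np

rng = np.random.default_rng(0)
T = np.array([[0.9, 0.5],                 # T[x_next, x_now] = Pr(x_next | x_now)
              [0.1, 0.5]])
n, k = 20_000, 1_000                      # total particles and burn-in length

x = rng.integers(0, 2)                    # x_0 ~ initial distribution (uniform here)
samples = []
for t in range(n):
    x = rng.choice(2, p=T[:, x])          # x_t ~ Pr(X_t | x_{t-1})
    samples.append(x)

kept = samples[k:]                        # drop the burn-in prefix
print(np.bincount(kept) / len(kept))      # ≈ stationary distribution [0.833, 0.167]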


Choosing a Markov Chain

  • Different Markov chains lead to different algorithms

– Gibbs sampling
– Metropolis-Hastings



Gibbs Sampling

  • Suppose Pr(X) is defined by a graphical model (Bayes net or Markov net)

  • Inference query: Pr(Y | e)? where Y ⊆ X
  • Idea: randomly assign values to all non-evidence variables, then repeatedly sample each non-evidence variable given the assigned values for all other variables


Gibbs Sampling Algorithm

Randomly assign values x_j^(0) to all non-evidence variables X_j
For i = 1 to n do (sample n particles)
    For each non-evidence variable X_j do
        Sample x_j^(i) ~ Pr(X_j | x_{~j}, e)   (all other variables held at their current values)

  • Approximation: Pr(Y = y | e) ≈ (1/n) Σ_{i=1..n} 1(y^(i) = y)
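A minimal Gibbs-sampling sketch on an assumed toy target, a bivariate Gaussian with correlation ρ = 0.8, where both full conditionals Pr(X | y) and Pr(Y | x) are known in closed form (exactly the ingredient Gibbs sampling needs):

import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
n, k = 20_000, 1_000                 # particles and burn-in length
sd = np.sqrt(1 - rho ** 2)           # conditional standard deviation

x, y = 0.0, 0.0                      # arbitrary initial assignment
samples = []
for i in range(n):
    x = rng.normal(rho * y, sd)      # x ~ Pr(X | y), all other variables fixed
    y = rng.normal(rho * x, sd)      # y ~ Pr(Y | x)
    samples.append((x, y))

xs, ys = np.array(samples[k:]).T
print(np.corrcoef(xs, ys)[0, 1])     # ≈ 0.8, the target correlation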



Example


Practical Consideration

  • Burn-in period: ignore first k samples:

    Pr(Y = y | e) ≈ 1/(n − k) Σ_{i=k+1..n} 1(y^(i) = y)

  • Use most recent values to sample:

    x_j^(i) ~ Pr(X_j | x_1^(i), …, x_{j−1}^(i), x_{j+1}^(i−1), …, x_N^(i−1), e)

  • Use conditional independence to restrict the parent variables to the Markov blanket:

    x_j^(i) ~ Pr(X_j | x_k for all X_k ∈ MB(X_j))


Convergence

  • Let Pr(x'_j | x_{~j}, e) be the transition function of the Markov chain associated with Gibbs sampling

  • Theorem: Gibbs sampling converges to Pr(X | e) when all potentials are strictly positive.

  • Proof: Pr(x'_j | x_{~j}, e) satisfies detailed balance, i.e.,

    Pr(x | e) Pr(x'_j | x_{~j}, e) = Pr(x' | e) Pr(x_j | x'_{~j}, e)
