


Non-parametric Bayesian Statistics

Graham Neubig

2011-12-22


Overview

  • About Bayesian Non-parametrics
  • Basic theory
  • Inference using sampling
  • Learning an HMM with sampling
  • From the finite HMM to the infinite HMM
  • Recent developments (in sampling and modeling)
  • Applications to speech and language processing
  • Focus on unsupervised learning for discrete distributions


Non-parametric Bayes

  • The number of parameters is not decided in advance (i.e. it is infinite)
  • Put a prior on the parameters and consider their distribution


Types of Statistical Models

Model                   | Prior on Parameters | # of Parameters (Classes) | Discrete Distribution                          | Continuous Distribution
Maximum Likelihood      | No                  | Finite                    | Multinomial                                    | Gaussian
Bayesian Parametric     | Yes                 | Finite                    | Multinomial + Dirichlet Prior                  | Gaussian + Gaussian Prior
Bayesian Non-parametric | Yes                 | Infinite                  | Multinomial + Dirichlet Process (covered here) | Gaussian Process


Bayesian Basics


Maximum Likelihood (ML)

  • We have an observed sample

X = 1 2 4 5 2 1 4 4 1 4

  • Gather counts
  • Divide counts to get probabilities

C = {c_1, c_2, c_3, c_4, c_5} = {3, 2, 0, 4, 1}

P_x = {0.3, 0.2, 0, 0.4, 0.1}    (a multinomial distribution)

P(x=i) = c_i / Σ_j c_j


Bayesian Inference

  • ML is weak against sparse data
  • Don't actually know parameters
  • Bayesian statistics don't pick one probability
  • Use the expectation instead

P(x=i) = ∫ θ_i P(θ|X) dθ

If c_x = {3, 2, 0, 4, 1}, we could have θ = {0.3, 0.2, 0, 0.4, 0.1},
or we could have θ = {0.35, 0.05, 0.05, 0.35, 0.2}

Calculating Parameter Distributions

  • Decompose with Bayes' law
  • likelihood: easily calculated according to the model
  • prior: chosen according to our belief about probable parameter values
  • regularization requires difficult integration...
  • ... but conjugate priors make things easier

P∣X = PX∣P 

∫ PX∣P d 

likelihood prior regularization coefficient


Conjugate Priors

  • Definition: the product of the likelihood and the prior takes the same form as the prior
  • Because the form is known, there is no need to take the integral to regularize

Multinomial Likelihood × Dirichlet Prior = Dirichlet Posterior
Gaussian Likelihood × Gaussian Prior = Gaussian Posterior
(prior and posterior take the same form)


Dirichlet Distribution/Process

  • Assigns probabilities to multinomial distributions, e.g.:
      P({0.3, 0.2, 0.01, 0.4, 0.09}) = 0.000512
      P({0.35, 0.05, 0.05, 0.35, 0.2}) = 0.0000963
  • Defined over the space of proper probability distributions:
      θ = {θ_1, ..., θ_n},   ∀i: 0 ≤ θ_i ≤ 1,   Σ_{i=1}^n θ_i = 1
  • The Dirichlet process is a generalization of the Dirichlet distribution
  • Can assign probabilities to infinite spaces


Dirichlet Process (DP)

  • α is the "concentration parameter": a larger value means more data is needed to diverge from the prior
  • P_base is the "base measure": the expectation of θ

P(θ; α, P_base) = (1/Z) ∏_{i=1}^n θ_i^(α P_base(x=i) − 1)

Z = ∏_{i=1}^n Γ(α P_base(x=i)) / Γ(Σ_{i=1}^n α P_base(x=i))    (regularization coefficient, Γ = gamma function)

α_i = α P_base(x=i)    (α_i: way of writing in the Dirichlet distribution; α, P_base: way of writing in the Dirichlet process)


Examples of Probability Densities

[Example density plots (from Wikipedia): α = 15, P_base = {0.2, 0.47, 0.33};  α = 9, P_base = {0.22, 0.33, 0.44};  α = 10, P_base = {0.6, 0.2, 0.2};  α = 14, P_base = {0.43, 0.14, 0.43}]


Why is the Dirichlet Conjugate?

  • Likelihood is product of multinomial probabilities
  • Combine multiple instances into a single count
  • Take product of likelihood and prior

Data: x_1=1, x_2=5, x_3=2, x_4=5    →    c(x=i) = {1, 1, 0, 0, 2}

P(X|θ) = p(x=1|θ) p(x=5|θ) p(x=2|θ) p(x=5|θ) = θ_1 θ_5 θ_2 θ_5 = θ_1 θ_2 θ_5^2 = ∏_{i=1}^n θ_i^c(x=i)

∏_{i=1}^n θ_i^c(x=i) × (1/Z_prior) ∏_{i=1}^n θ_i^(α_i − 1)  →  (1/Z_post) ∏_{i=1}^n θ_i^(c(x=i) + α_i − 1)


Expectation of θ in the DP

  • When N = 2 (two dimensions):

E[θ_1] = ∫_0^1 θ_1 (1/Z) θ_1^(α_1 − 1) (1 − θ_1)^(α_2 − 1) dθ_1
       = (1/Z) ∫_0^1 θ_1^α_1 (1 − θ_1)^(α_2 − 1) dθ_1

Integration by parts (∫ u dv = uv − ∫ v du) with:
    u = θ_1^α_1                        du = α_1 θ_1^(α_1 − 1) dθ_1
    dv = (1 − θ_1)^(α_2 − 1) dθ_1      v = −(1 − θ_1)^α_2 / α_2

E[θ_1] = (1/Z) [−θ_1^α_1 (1 − θ_1)^α_2 / α_2]_0^1 − (1/Z) ∫_0^1 (−(1 − θ_1)^α_2 / α_2) α_1 θ_1^(α_1 − 1) dθ_1
       = 0 + (α_1 / α_2) (1/Z) ∫_0^1 θ_1^(α_1 − 1) (1 − θ_1)^α_2 dθ_1
       = (α_1 / α_2) E[θ_2] = (α_1 / α_2) (1 − E[θ_1])

⇒  E[θ_1] = α_1 / (α_1 + α_2)


Multi-Dimensional Expectation

  • Posterior distribution for multinomial with DP prior:

Px=i =∫

1

i 1 Z post ∏i=1

n

i

c x=i i −1

=cx=i ∗P basex=i  c ・ E [i ] = i

∑i =1

n

i = Pbasex=i   = Pbasex=i 

Observed Counts Base Measure Concentration Parameter

  • Same as additive smoothing

Marginal Probability

  • Calculate prob. of observed data using the chain rule

X = 1 2 1 3 1,   α = 1,   P_base(x=1,2,3,4) = 0.25

P(x_i) = (c(x_i) + α P_base(x_i)) / (c(·) + α)

P(x_1=1)               = (0 + 1·0.25) / (0 + 1) = 0.25      c = {0, 0, 0, 0}
P(x_2=2 | x_1)         = (0 + 1·0.25) / (1 + 1) = 0.125     c = {1, 0, 0, 0}
P(x_3=1 | x_{1,2})     = (1 + 1·0.25) / (2 + 1) = 0.417     c = {1, 1, 0, 0}
P(x_4=3 | x_{1,2,3})   = (0 + 1·0.25) / (3 + 1) = 0.063     c = {2, 1, 0, 0}
P(x_5=1 | x_{1,2,3,4}) = (2 + 1·0.25) / (4 + 1) = 0.45      c = {2, 1, 1, 0}

Marginal probability: P(X) = 0.25 · 0.125 · 0.417 · 0.063 · 0.45
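A minimal Python sketch of this chain-rule computation (function name and structure are mine, not from the slide):

```python
def marginal_probability(X, alpha, p_base):
    """Chain-rule marginal probability under a DP prior with base measure p_base."""
    counts = {}
    total, prob = 0, 1.0
    for x in X:
        p = (counts.get(x, 0) + alpha * p_base[x]) / (total + alpha)
        prob *= p
        counts[x] = counts.get(x, 0) + 1
        total += 1
    return prob

# The example above: X = 1 2 1 3 1, alpha = 1, P_base = 0.25 for each value
print(marginal_probability([1, 2, 1, 3, 1], 1.0, {1: .25, 2: .25, 3: .25, 4: .25}))
# ≈ 0.25 * 0.125 * 0.417 * 0.063 * 0.45 ≈ 0.00037
```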


Chinese Restaurant Process

  • A way of expressing the DP and other stochastic processes
  • A Chinese restaurant with an infinite number of tables
  • Each customer enters the restaurant and takes an action:
      P(sits at table i) ∝ c_i        P(sits at a new table) ∝ α
  • When the first customer sits at a table, the food served there is chosen according to P_base

Example: X = 1 2 1 3 1,  α = 1,  N = 4 customers already seated (at tables serving 1, 2, 1, 3)
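A small Python sketch of one CRP seating step (a sketch; function and variable names are mine):

```python
import random

def crp_seat(table_counts, alpha):
    """Sample which table the next customer sits at under a CRP.
    table_counts[i] = number of customers already at table i;
    returning len(table_counts) means a new table is opened."""
    weights = table_counts + [alpha]   # existing tables ∝ c_i, new table ∝ α
    r = random.uniform(0, sum(weights))
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(table_counts)

# Seat 5 customers, as in the example above (α = 1)
tables = []
for _ in range(5):
    i = crp_seat(tables, alpha=1.0)
    if i == len(tables):
        tables.append(0)
    tables[i] += 1
print(tables)   # e.g. [3, 1, 1]
```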


Sampling Basics


Sampling Basics

  • Generate a sample from probability distribution:
  • Count the samples and calculate probabilities
  • More samples = better approximation

Distribution: P(Noun) = 0.5, P(Verb) = 0.3, P(Preposition) = 0.2

Sample: Verb Verb Prep. Noun Noun Prep. Noun Verb Verb Noun ...
Estimates: P(Noun) = 4/10 = 0.4, P(Verb) = 4/10 = 0.4, P(Preposition) = 2/10 = 0.2

[Plot: estimated probabilities of Noun/Verb/Prep. converging to the true values as the number of samples grows from 1 to 10^6]


Actual Algorithm

SampleOne(probs[]):
    z = sum(probs)                 # calculate the sum of probs
    remaining = rand(z)            # generate a number from a uniform distribution over [0, z)
    for each i in 1:probs.size     # iterate over all probabilities
        remaining -= probs[i]      # subtract the current prob. value
        if remaining <= 0          # if it goes below zero,
            return i               # return the current index as the answer

Bug check: beware of overflow!
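A directly runnable Python version of the same routine (a sketch; in Python one could also use random.choices with weights):

```python
import random

def sample_one(probs):
    """Draw an index i with probability proportional to probs[i]."""
    z = sum(probs)
    remaining = random.uniform(0, z)
    for i, p in enumerate(probs):
        remaining -= p
        if remaining <= 0:
            return i
    return len(probs) - 1   # guard against floating-point round-off

print(sample_one([0.5, 0.3, 0.2]))   # 0, 1, or 2
```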


Gibbs Sampling

  • Want to sample a 2-variable distribution P(A,B)
  • … but cannot sample directly from P(A,B)
  • … but can sample from P(A|B) and P(B|A)
  • Gibbs sampling samples variables one-by-one to recover the true distribution
  • Each iteration:
      Leave A fixed, sample B from P(B|A)
      Leave B fixed, sample A from P(A|B)


Example of Gibbs Sampling

  • A parent A and a child B are shopping; what are their sexes?

      P(Mother|Daughter) = 5/6 = 0.833      P(Daughter|Mother) = 2/3 = 0.667
      P(Mother|Son) = 5/8 = 0.625           P(Daughter|Father) = 2/5 = 0.4

  • Initial state: Mother/Daughter
      Sample the parent from P(Mother|Daughter) = 0.833 → chose Mother
      Sample the child  from P(Daughter|Mother) = 0.667 → chose Son;       c(Mother, Son)++
      Sample the parent from P(Mother|Son) = 0.625      → chose Mother
      Sample the child  from P(Daughter|Mother) = 0.667 → chose Daughter;  c(Mother, Daughter)++
      ...
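A runnable Python sketch of this Gibbs sampler, using the conditional probabilities from the slide (structure and names are mine):

```python
import random
from collections import Counter

# P(parent | child) and P(child | parent) from the slide
p_mother_given = {"Daughter": 0.833, "Son": 0.625}
p_daughter_given = {"Mother": 0.667, "Father": 0.4}

parent, child = "Mother", "Daughter"   # initial state
counts = Counter()
for _ in range(100000):
    parent = "Mother" if random.random() < p_mother_given[child] else "Father"
    child = "Daughter" if random.random() < p_daughter_given[parent] else "Son"
    counts[(parent, child)] += 1       # count the sampled joint state

total = sum(counts.values())
print({k: round(v / total, 3) for k, v in counts.items()})
```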


Try it Out:

  • In this case, we can confirm this result by hand

[Plot: estimated probabilities of Mother/Daughter, Mother/Son, Father/Daughter, and Father/Son converging as the number of samples grows from 1 to 10^6]


Learning a Hidden Markov Model Part-of-Speech Tagger with Sampling


Unsupervised Learning

  • Observed Training Data X
  • e.g.: A corpus of natural language text
  • Hidden Variables Y
  • e.g.: States of the HMM = Parts of Speech of words
  • Unobserved Parameters θ
  • Generally probabilities

Task: Unsupervised POS Induction

  • Input: Collection of word strings X

the boats row in a row

  • Output: Collection of clusters Y

1   2     3   4  1   2      (1→Determiner, 2→Noun, 3→Verb, 4→Preposition)

the boats row in a row
Det N     V   P  Det N


Model: HMM

  • Variables Y correspond to hidden states
  • State transition probability:
  • Generate each word from a hidden state
  • Word emission probability:

the boats row in a row
 1    2    3  4  1  2

P_T(1|0) P_T(2|1) P_T(3|2) ...      P_E(the|1) P_E(boats|2) P_E(row|3) ...

P_T(y_i|y_{i-1}) = θ_{T, y_i, y_{i-1}}        P_E(x_i|y_i) = θ_{E, y_i, x_i}


Sampling the HMM

  • Initialize Y randomly
  • Sample each element of Y using Gibbs sampling

the boats row in a row
 1    2    3  4  1  2        (sample one tag at a time, e.g. the tag of "in")


Sampling the HMM

  • Probabilities affected by a single tag
  • Transition from previous tag: PT(yi|yi-1)
  • Transition to next tag: PT(yi+1|yi)
  • Emission probability: PE(xi|yi)
  • Sample the tag value according to these probabilities
  • All variables that have an effect form the "Markov blanket"

the boats row in a row
 1    2    3  4  1  2        (Markov blanket of a tag: the previous tag, the next tag, and the emitted word)


Calculating HMM Probabilities with DP Priors

  • Transition probability:
  • Emission probability:

PT y i∣y i−1=c y i−1 y iT∗PbaseT y i  c y i−1T PE xi∣y i=cy i , x iE∗PbaseE xi cy iE


Sampling Algorithm for One Tag

SampleTag(yi):
    c(yi-1 yi)--; c(yi yi+1)--; c(yi→xi)--              # subtract the current tag's counts
    for each tag in S (all POS tags)                    # calculate the probability of every possible tag
        p[tag] = PT(tag|yi-1) * PT(yi+1|tag) * PE(xi|tag)
    yi = SampleOne(p)                                   # choose a new tag
    c(yi-1 yi)++; c(yi yi+1)++; c(yi→xi)++              # add the new tag's counts


Sampling Algorithm for All Tags

SampleCorpus():
    initialize Y randomly             # randomly initialize the tags
    for N iterations
        for each yi in the corpus     # sample all the tags
            SampleTag(yi)
        save parameters               # save a sample of θ
    average parameters                # average the saved samples of θ


Choosing Hyperparameters

  • Must choose α properly to get the desired effect
  • Small α (< 0.1) creates sparse distributions
      – If we want each word to have one POS tag, we can set α_E of the emission distribution P_E to be small
  • Most distributions are sparse, so α is often set small
  • Best to confirm through experiments
  • Can also give the hyperparameters a prior and sample them as well


From the Finite HMM to the Infinite HMM


Base Measure and Dimensionality

  • Using a uniform distribution as the base measure

[Bar charts: a uniform base measure over 6 parts of speech (each POS gets probability 1/6) and over 20 parts of speech (each POS gets probability 1/20); x-axis: POS number, y-axis: probability]


In the Limit...

  • As the number of POSs goes to infinity:
  • The probability of each individual POS under P_base goes to zero
  • But the total probability of P_base stays the same:

      lim_{N→∞} Σ_{i=1}^N 1/N = 1        (N = number of POSs)

      P(y_i|y_{i-1}) = (c(y_{i-1} y_i) + α P_base(y_i)) / (c(y_{i-1}) + α)

[Bar charts: a uniform base measure over 100 parts of speech and over 1 million parts of speech; each individual probability shrinks toward zero]

Finite HMM and Infinite HMM

  • Finite HMM
      Probability of emitting POS y_i (after y_{i-1}):
          P(y_i|y_{i-1}) = (c(y_{i-1} y_i) + α P_base(y_i)) / (c(y_{i-1}) + α)
  • Infinite HMM
      Probability of emitting an existing POS y_i (after y_{i-1}):
          P(y_i|y_{i-1}) = c(y_{i-1} y_i) / (c(y_{i-1}) + α)
      Probability of emitting a new POS (after y_{i-1}):
          P(y_i = new|y_{i-1}) = α / (c(y_{i-1}) + α)


Example

  • Assume c(y_{i-1}=1 → y_i=1) = 1 and c(y_{i-1}=1 → y_i=2) = 1

When there are 2 possible POSs:
    P(y_i=1 | y_{i-1}=1)    = (1 + α·1/2) / (2 + α)
    P(y_i=2 | y_{i-1}=1)    = (1 + α·1/2) / (2 + α)
    P(y_i≠1,2 | y_{i-1}=1)  = (α·0) / (2 + α)

When there are 20 possible POSs:
    P(y_i=1 | y_{i-1}=1)    = (1 + α·1/20) / (2 + α)
    P(y_i=2 | y_{i-1}=1)    = (1 + α·1/20) / (2 + α)
    P(y_i≠1,2 | y_{i-1}=1)  = (α·18/20) / (2 + α)

When there are infinitely many possible POSs:
    P(y_i=1 | y_{i-1}=1)    = (1 + α·1/∞) / (2 + α)
    P(y_i=2 | y_{i-1}=1)    = (1 + α·1/∞) / (2 + α)
    P(y_i≠1,2 | y_{i-1}=1)  = (α·1) / (2 + α)
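A tiny Python sketch of this calculation for a varying number of POS tags (a sketch; names are mine):

```python
def ihmm_probs(c_existing, alpha, n_pos):
    """Probabilities of the seen tags and of any unseen tag after y_{i-1}=1,
    given the counts c_existing and a uniform base measure over n_pos tags
    (n_pos = float('inf') gives the infinite-HMM case)."""
    total = sum(c_existing.values())
    p_base = 0.0 if n_pos == float('inf') else 1.0 / n_pos
    p_seen = {y: (c + alpha * p_base) / (total + alpha) for y, c in c_existing.items()}
    p_unseen = alpha * (1.0 - p_base * len(c_existing)) / (total + alpha)
    return p_seen, p_unseen

# c(1->1) = c(1->2) = 1 as above, for 2, 20, and infinitely many POSs
for n in (2, 20, float('inf')):
    print(n, ihmm_probs({1: 1, 2: 1}, alpha=1.0, n_pos=n))
```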


Sampling Algorithm

SampleTag(yi):
    c(yi-1 yi)--; c(yi yi+1)--; c(yi→xi)--                   # remove the counts for the current tag
    for each tag in S (existing POSs)                        # calculate the existing POS probabilities
        p[tag] = PT(tag|yi-1) * PT(yi+1|tag) * PE(xi|tag)
    p[|S|+1] = PT(new|yi-1) * PT(yi+1|new) * PE(xi|new)      # calculate the new POS probability
    yi = SampleOne(p)                                        # pick a single value
    c(yi-1 yi)++; c(yi yi+1)++; c(yi→xi)++                   # add the new counts


Non-Uniform Base Measures

  • The previous slides assumed uniform base measures, but this is not required
  • Example: a language model's unknown word model
  • Split each word into characters to give some probability to all words:

      P(word) = (c(word) + α P_base(word)) / (c(·) + α)
      P_base("word") = P_len(4) · P_char(w) · P_char(o) · P_char(r) · P_char(d)

  • The probability is not equal for every word, but every member of an infinite collection gets some probability


Implementation Tips

  • Zero-count classes remain → wasted memory
  • When new classes are made, re-use old class numbers:
      Counts:  c(y_1)=5  c(y_2)=0  c(y_3)=1
      Dumb:    c(y_1)=5  c(y_2)=0  c(y_3)=1  c(y_4)=1
      Smart:   c(y_1)=5  c(y_2)=1  c(y_3)=1
  • When c(y)=0, the probability of reviving that POS becomes 0
  • This model doesn't handle new POSs well: a new POS can only appear after one type of POS
  • Can fix this with a hierarchical model:
      P_T(y_i|y_{i-1}) ~ DP(α, P_T(y_i))      (transition prob.)
      P_T(y_i) ~ DP(α, P_base(y_i))           (POS prob.)


Debugging

  • Unit tests! Unit tests! Unit tests!
  • Remove bugs in both the implementation and the conceptualization
  • Create fail-safe functions for adding/subtracting counts that terminate if a count goes below zero
  • When the program finishes, remove all samples and make sure the counts are exactly zero
  • The likelihood will not always go up, but if it consistently goes down something is probably wrong
  • Set the random seed to a fixed value (srand)

Recent Topics


Block Sampling

  • Often hidden variables depend on each other strongly
  • For example, variables close in time and space
  • Block sampling samples multiple hidden variables at a time, considering their dependence
  • HMMs use forward filtering / backward sampling
  • Context-free grammars, etc. are also possible

[Figure: the tag sequence 1 2 3 4 1 2 for "the boats row in a row"; the sampled tags depend on their neighbors]


Forward Filtering

  • Forward filtering adds up probabilities starting from an initial state

[Figure: a lattice with states s0 ... s5 and edges p(s1|s0), p(s2|s0), p(s3|s1), p(s3|s2), p(s4|s1), p(s4|s2), p(s5|s3), p(s5|s4)]

Forward filtering calculates forward probabilities f:
    f(s0) = 1
    f(s1) = p(s1|s0)·f(s0)
    f(s2) = p(s2|s0)·f(s0)
    f(s3) = p(s3|s1)·f(s1) + p(s3|s2)·f(s2)
    f(s4) = p(s4|s1)·f(s1) + p(s4|s2)·f(s2)
    f(s5) = p(s5|s3)·f(s3) + p(s5|s4)·f(s4)


Backward Sampling

  • Backward sampling starts at the accepting state and samples edges in backwards order, considering edge probabilities and forward probabilities:

    e(s5→x):  p(x=s3) ∝ p(s5|s3)·f(s3)      p(x=s4) ∝ p(s5|s4)·f(s4)
    e(s3→x):  p(x=s1) ∝ p(s3|s1)·f(s1)      p(x=s2) ∝ p(s3|s2)·f(s2)

[Figure: the same lattice, with one sampled path, e.g. s0 → s2 → s3 → s5]
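A compact Python sketch of forward filtering / backward sampling for a discrete HMM (a sketch of the general technique, not code from the slides; names are mine):

```python
import numpy as np

def ffbs(trans, emit, obs, init):
    """Sample a state sequence from P(Y | X) for a discrete HMM.
    trans[i, j] = P(y_t = j | y_{t-1} = i), emit[i, o] = P(x_t = o | y_t = i),
    init[i] = P(y_1 = i), obs = list of observation indices."""
    T, K = len(obs), len(init)
    f = np.zeros((T, K))
    f[0] = init * emit[:, obs[0]]                       # forward filtering
    for t in range(1, T):
        f[t] = (f[t - 1] @ trans) * emit[:, obs[t]]
    y = np.zeros(T, dtype=int)                          # backward sampling
    y[T - 1] = np.random.choice(K, p=f[T - 1] / f[T - 1].sum())
    for t in range(T - 2, -1, -1):
        w = f[t] * trans[:, y[t + 1]]
        y[t] = np.random.choice(K, p=w / w.sum())
    return y

# toy example with 2 states and 2 observation symbols
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])
print(ffbs(trans, emit, obs=[0, 0, 1, 1], init=np.array([0.5, 0.5])))
```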


Type-Based Sampling

  • Sample variables that have the same Markov blanket at once
  • Here, the Markov blanket is "3, in, 1":

      the boats row in a row          he will jump in the pool
       1    2    3  4  1  2            2   5    3   4  1   2


Type-Based Sampling

  • Models based on Dirichlet distributions tend to assign the same tag to similar values (rich-gets-richer)
  • Good for modeling: induces a consistent, compact model
  • Bad for inference: creates "valleys" in the posterior probability
  • We are on the right side of the valley; the left side has more probability, but reaching it requires several variable changes
  • Possible to escape, but it takes a very long time

[Plot: posterior probability as a function of the number of x=1 values, with two modes separated by a low-probability valley]


Type-based Sampling

  • For each type, sample the number of instances with x=1
  • Here, "x=1" has one instance
  • The Markov blankets are identical, so the probabilities are as well
  • Can set one instance to x=1 at random and all the others to x=2

[Plot: probability as a function of the number of x=1 values]


Hierarchical Models

  • Multiple levels using the hierarchical Dirichlet process:

      Transition prob.:      P(y_i|y_{i-1}) = (c(y_{i-1} y_i) + α P_base(y_i)) / (c(y_{i-1}) + α)
      Shared base measure:   P_base(y_i) = (c_base(y_i) + α · 1/N) / (c_base(·) + α)

[Figure: the transition distributions P(y_i|y_{i-1}=1), P(y_i|y_{i-1}=2), P(y_i|y_{i-1}=3) all share the base measure P_base(y_i)]


Counting cbase

  • Use the Chinese restaurant process
  • Add customers to the top level for each data point; add customers to the bottom level for each table in the top level

[Figure: for context y_{i-1}=1 the data y_i = 1 2 1 3 1 are seated at tables serving 1, 2, 1, 3; for context y_{i-1}=2 the data y_i = 1 4 2 2 4 are seated at tables serving 1, 4, 2; the base restaurant receives one customer for each of these tables, at tables serving 1, 2, 3, 4]


Pitman-Yor Process

  • Similar to the Dirichlet process, but adds a table discount d:

      P(x_i) = (c(x_i) − d·t(x_i) + (α + d·t(·)) · P_base(x_i)) / (c(·) + α)

      e.g. for the restaurant above (c(x=1)=3 over t(x=1)=2 tables, t(·)=4 tables, c(·)=5 customers, P_base=0.25):
      P(x_i=1) = (3 − d·2 + (α + d·4)·0.25) / (5 + α)

  • Similar to absolute discounting for language models
  • Able to model power-law distributions, which are common in language


Examples from Speech and Language Processing


Topic Models

  • Latent Dirichlet Allocation (LDA) [Blei+ 03]
  • Infinite topic models [Teh+ 06]
  • Applications to computer vision, document clustering, language modeling (e.g. [Heidel+ 07])

For each document in a collection, generate a multinomial topic distribution (with a Dirichlet prior), e.g. over
{Politics, Entertainment, Sport, Economy, Society, Science}: {0.4, 0.05, 0.3, 0.2, 0.01, 0.04}.
Generate each word's topic from the topic distribution, then generate each word from that topic's word distribution:

    1    1       4    3   3       3
    Bill Clinton buys the Detroit Tigers


Language Models

  • Hierarchical Pitman-Yor language model [Teh 06]: bigram distributions share a unigram base measure
  • Improvements to modeling accuracy from using the Pitman-Yor process
  • Similar accuracy to Kneser-Ney smoothing
  • Used in speech recognition [Huang&Renals 07]

[Figure: the bigram distributions P(w_i|w_{i-1}=1), P(w_i|w_{i-1}=2), P(w_i|w_{i-1}=3) share the unigram base measure P_base(w_i)]


Unsupervised Word Segmentation

  • Generate word sequences from 1-gram or 2-gram models [Goldwater+ 09]
  • Improvements using block sampling and a Pitman-Yor language model [Mochihashi+ 09]

Example (Japanese これは単語です, "this is a word"): sampling chooses between the segmentations

    これ は 単語 で す        or        これ は 単 語 で す

by comparing P(単語) against P(単)·P(語).

Learning a Language Model from Continuous Speech

  • Use a Pitman-Yor language model to learn a language model and word dictionary from speech [Neubig+ 10]
  • Use forward filtering / backward sampling over phoneme lattices
  • Can be used for:
      Learning models for languages with no written text
      Learning models faithful to spoken language

[Figure: speech → acoustic model → phoneme lattice → learning → spoken language model]


Learning Various Types of Linguistic Information

  • POS tags using the infinite HMM [Beal+ 02]
  • CFGs [Johnson+ 07] and the infinite CFG [Liang+ 07]
  • Word and phrase alignment for machine translation [DeNero+ 08, Blunsom+ 09, Neubig+ 11]
  • Non-parametric extensions of unsupervised semantic parsing [Poon+ 09, Titov+ 11]


References

  • M.J. Beal, Z. Ghahramani, and C.E. Rasmussen. 2002. The infinite hidden Markov model. Proceedings of the 16th Annual Conference on Neural Information Processing Systems, 1:577–584.
  • David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
  • Phil Blunsom, Trevor Cohn, Chris Dyer, and Miles Osborne. 2009. A Gibbs sampler for phrasal synchronous grammar induction. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics, pages 782–790.
  • John DeNero, Alex Bouchard-Côté, and Dan Klein. 2008. Sampling alignment structure under a Bayesian translation model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 314–323.
  • Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54.
  • A. Heidel, H. Chang, and L. Lee. 2007. Language model adaptation using latent Dirichlet allocation and an efficient topic inference algorithm. In Proceedings of the 8th Annual Conference of the International Speech Communication Association (InterSpeech).
  • S. Huang and S. Renals. 2007. Hierarchical Pitman-Yor language models for ASR in meetings. In Proceedings of the 2007 IEEE Automatic Speech Recognition and Understanding Workshop, pages 124–129.
  • Mark Johnson, Thomas Griffiths, and Sharon Goldwater. 2007. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proceedings of the Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 139–146.


References

  • P. Liang, S. Petrov, M. Jordan, and D. Klein. 2007. The infinite PCFG using hierarchical Dirichlet processes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 688–697.
  • Daichi Mochihashi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian unsupervised word segmentation with nested Pitman-Yor modeling. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics.
  • Graham Neubig, Masato Mimura, Shinsuke Mori, and Tatsuya Kawahara. 2010. Learning a language model from continuous speech. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (InterSpeech), Makuhari, Japan.
  • Graham Neubig, Taro Watanabe, Eiichiro Sumita, Shinsuke Mori, and Tatsuya Kawahara. 2011. An unsupervised model for joint phrase alignment and extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oregon, USA.
  • H. Poon and P. Domingos. 2009. Unsupervised semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1–10.
  • Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
  • Yee Whye Teh. 2006. A Bayesian interpretation of interpolated Kneser-Ney. Technical report, School of Computing, National Univ. of Singapore.
  • Ivan Titov and Alexandre Klementiev. 2011. A Bayesian model for unsupervised semantic parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics.