The Sequence Memoizer
Frank Wood, Cédric Archambeau, Jan Gasthaus, Yee Whye Teh (Gatsby Unit, UCL) and Lancelot James (HKUST)
– Smoothing Markov model of discrete sequences
– Extension of the hierarchical Pitman-Yor process [Teh 2006]
– Linear-time suffix-tree graphical model identification and construction
– Standard Chinese restaurant franchise sampler
– Maximum contextual information used during inference
– Competitive language modelling results
– Same computational cost as a Bayesian interpolating 5-gram language model
(Figure: conditional distributions indexed by context — e.g. c|ao, a|ca, c|ac — for unigram, bigram, trigram, up to 4-gram models. Increasing context length / order of the Markov model means a decreasing number of observations per context, an increasing number of conditional distributions to estimate (indexed by context), and increasing power of the model.)
P(x_{1:N}) = ∏_{i=1}^{N} P(x_i | x_1, …, x_{i−1}) ≈ ∏_{i=1}^{N} P(x_i | x_{i−n+1}, …, x_{i−1}),  n = 2 (bigram):
P(oacac) = P(o) P(a|o) P(c|a) P(a|c) P(c|a) = G_ε(o) G_[o](a) G_[a](c) G_[c](a) G_[a](c)
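A minimal sketch of the bigram factorization above, estimating each conditional by maximum-likelihood counts from the training string itself and multiplying along the chain (the function name and the unsmoothed MLE estimator are illustrative, not the smoothed estimators the talk develops next):

```python
from collections import Counter

def bigram_prob(seq):
    """P(seq) under a (non-smoothed) bigram Markov model, with the
    conditionals P(x_i | x_{i-1}) estimated by MLE from seq itself."""
    unigrams = Counter(seq)             # symbol counts
    contexts = Counter(seq[:-1])        # counts of symbols used as a context
    bigrams = Counter(zip(seq, seq[1:]))
    # P(x_1) * prod_i count(x_{i-1} x_i) / count(x_{i-1} as context)
    p = unigrams[seq[0]] / len(seq)
    for prev, cur in zip(seq, seq[1:]):
        p *= bigrams[(prev, cur)] / contexts[prev]
    return p
```

For "oacac" every observed bigram conditional is deterministic under MLE, so the probability reduces to P(o) = 1/5.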
– Training sequence x_{1:N}
– Predictive inference
– Non-smoothed unigram model (G_ε)
(Figure: plate diagram, observations x_i, i = 1 : N, drawn from G_ε.)
– Inference is "smoothed" w.r.t. uncertainty about the unknown distribution
– Smoothed unigram (G_ε)
(Figure: plate diagram, G_ε ~ PY(d, c, G_0); observations x_i, i = 1 : N, drawn from G_ε.)
– c: concentration, d: discount, G_0: base distribution
– (equal when c = ∞ or d = 1)
– Different (power-law) properties
– Better for text [Teh, 2006] and images [Sudderth and Jordan, 2009]
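The power-law behaviour can be illustrated by simulating the Pitman-Yor Chinese restaurant process: with discount d > 0 the number of occupied tables (unique "types") grows polynomially in the number of customers, versus logarithmically in the Dirichlet-process case d = 0. A hedged simulation sketch (not from the source; function name illustrative):

```python
import random

def crp_tables(n, c, d, seed=0):
    """Seat n customers in a Pitman-Yor CRP with concentration c and
    discount d; return the number of occupied tables at the end."""
    rng = random.Random(seed)
    tables = []  # customer count per table
    for i in range(n):  # i customers already seated
        K = len(tables)
        if rng.random() < (c + d * K) / (c + i):
            tables.append(1)                  # open a new table
        else:
            # join table k with probability proportional to (m_k - d)
            r = rng.uniform(0, i - d * K)
            for k, m in enumerate(tables):
                r -= m - d
                if r <= 0:
                    tables[k] = m + 1
                    break
    return len(tables)
```

With 1000 customers, d = 0.8 yields hundreds of tables while d = 0 yields only a handful — the heavy-tailed type growth that matches text.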
P(X_{N+1} | x_{1:N}; c, d) ≈ Σ_k (m_k − d)/(c + N) · 1[φ_k = X_{N+1}] + (c + dK)/(c + N) · G_σ(X_{N+1})
where m_k is the number of customers at table k (serving dish φ_k), K is the number of occupied tables, and G_σ is the base distribution.
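The predictive rule above translates directly into code; a minimal sketch assuming the restaurant state is given as (dish, customer-count) pairs and `base` is the base distribution G_σ (names are illustrative):

```python
def py_predictive(x, tables, c, d, base):
    """Pitman-Yor posterior predictive in Chinese-restaurant form:
    sum of (m_k - d)/(c + N) over tables serving dish x,
    plus (c + d*K)/(c + N) times the base probability of x.
    `tables` is a list of (dish, customer_count) pairs."""
    N = sum(m for _, m in tables)
    K = len(tables)
    seated = sum(m - d for dish, m in tables if dish == x)
    return seated / (c + N) + (c + d * K) / (c + N) * base(x)
```

Note that the discounted mass d·K is exactly what gets recycled through the base distribution — that is the smoothing.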
– Observations in one context affect inference in other contexts
– Statistical strength is shared between similar contexts
– Smoothingbi0gram( ǫ ∈ Σ)
Θ = {G, G, G}, ( = σ() = σ() P(Θ|xN) ∝ P(xN|Θ)P(Θ)
P(XN|xN) =
j = 1 : N i = 1 : N
G_[the United States] ∼ PY(d, c, G_[United States])
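Hierarchical smoothing of this kind can be sketched recursively: the base distribution for a context is the predictive distribution of its suffix (drop the oldest symbol), bottoming out in a uniform distribution past the empty context. This is an illustrative CRP-form approximation under assumed names, not the authors' sampler:

```python
def hpy_predictive(x, context, restaurants, c, d, vocab_size):
    """Hierarchical Pitman-Yor predictive: the base distribution for
    context u is the predictive of its suffix sigma(u); the empty
    context backs off to a uniform distribution over the vocabulary.
    `restaurants` maps a context tuple to a list of (dish, count) tables."""
    if context is None:                       # past the empty context
        return 1.0 / vocab_size
    tables = restaurants.get(context, [])
    N = sum(m for _, m in tables)
    K = len(tables)
    parent = context[1:] if context else None  # sigma(u): drop oldest symbol
    base = hpy_predictive(x, parent, restaurants, c, d, vocab_size)
    if N == 0:                                # no observations: pure back-off
        return base
    seated = sum(m - d for dish, m in tables if dish == x)
    return seated / (c + N) + (c + d * K) / (c + N) * base
```

A query in an unseen context thus falls through to progressively shorter suffixes, which is how statistical strength is shared between similar contexts.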
(Figure panels: Conditional Distributions · Posterior Predictive Probabilities · Observations · CPU.)
Sequentially related predictive conditional distributions:
– Estimates of highly specific conditional distributions are coupled with other, related ones through a single common, more general shared ancestor
(Figure: tree of distributions G linked by suffix relationships.)
– [Goldwater et al., '05; Teh, '06; Teh, Kurihara, Welling, '08]
Increasing context length; always a single observation per context.
Foreshadowing: all suffixes of the string "cacao".
(Figure: contexts a|o, c|ao, a|cao, c|acao.)
P(x_{1:N}) = ∏_{i=1}^{N} P(x_i | x_1, …, x_{i−1}) = P(x_1) P(x_2|x_1) P(x_3|x_1, x_2) P(x_4|x_1, x_2, x_3) …
P(oacac) = P(o) P(a|o) P(c|oa) P(a|oac) P(c|oaca)
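The conditioning contexts in this untruncated factorization, read most-recent-symbol-first, are exactly the suffixes of the reversed string ("oacac" reversed is "cacao"); a small sketch (illustrative function name) makes that concrete:

```python
def factorization_contexts(seq):
    """Conditioning contexts in the full chain-rule factorization
    P(x1) P(x2|x1) ... : symbol i is conditioned on x1..x_{i-1}."""
    return [seq[:i] for i in range(len(seq))]

# Each context, reversed, is a suffix of the reversed sequence,
# e.g. context "oaca" reversed is "acao", a suffix of "cacao".
```

This is why the model's graphical structure over contexts is a suffix tree.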
Graphical model identification:
(Figure: observations a|o, c|ao, a|cao, c|acao; latent conditional distributions with Pitman-Yor priors / stochastic memoizers; all suffixes of the string "cacao".)
[Pitman '99; Ho, James, Lau '06; Wood, Archambeau, Gasthaus, James, Teh '09]
HPYP computational complexity exceeds that of the SM.
Model                                      Perplexity
[Mnih & Hinton, 2009]                      112.1
[Bengio et al., 2003]                      109.0
4-gram Modified Kneser-Ney [Teh, 2006]     102.4
4-gram HPYP [Teh, 2006]                    101.9
Sequence Memoizer (SM)                      96.9