Frank Wood, Gatsby UCL - PowerPoint PPT Presentation

Frank Wood (Gatsby, UCL), Cedric Archambeau (Gatsby, UCL), Jan Gasthaus (Gatsby, UCL), Lancelot James (HKUST), Yee Whye Teh (Gatsby, UCL)


slide-1
SLIDE 1
  • Frank Wood (Gatsby, UCL)

Cedric Archambeau (Gatsby, UCL), Jan Gasthaus (Gatsby, UCL), Lancelot James (HKUST), Yee Whye Teh (Gatsby, UCL)

slide-2
SLIDE 2
  • Model

– Smoothing Markov model of discrete sequences
– Extension of the hierarchical Pitman-Yor process [Teh 2006]

  • Unbounded depth (context length)
  • Algorithms and estimation

– Linear-time suffix-tree graphical model identification and construction
– Standard Chinese restaurant franchise sampler

  • Results

– Maximum contextual information used during inference
– Competitive language modelling results

  • Limit of n-gram language model as n → ∞

– Same computational cost as a Bayesian interpolating 5-gram language model

slide-3
SLIDE 3
  • Uses

– Any situation in which a low-order Markov model of discrete sequences is insufficient
– Drop-in replacement for a smoothing Markov model

  • Name?

– “A Stochastic Memoizer for Sequence Data” → Sequence Memoizer (SM)

  • Describes posterior inference [Goodman et al. ’08]
slide-4
SLIDE 4
  • Sequence Markov models are usually constructed by treating a sequence as a set of (exchangeable) observations in fixed-length contexts (contexts written most-recent-symbol-first)

  • oacac → o|[] a|[] c|[] a|[] c|[]   (unigram)
  • oacac → a|o c|a a|c c|a   (bigram)
  • oacac → c|ao a|ca c|ac   (trigram)
  • oacac → a|cao c|aca   (4-gram)

Increasing context length / order of Markov model: decreasing number of observations; increasing number of conditional distributions to estimate (indexed by context); increasing power of model.
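The fixed-length-context construction above can be sketched in a few lines (the function name and the most-recent-symbol-first context convention are ours, chosen to match the slide's notation):

```python
def context_pairs(seq, n):
    """(symbol | context) pairs for an n-gram model of seq.

    Only positions with a full-length context are kept, which is why
    higher-order models see fewer observations. Matching the slide's
    notation, contexts are written most-recent-symbol-first.
    """
    return [(seq[i], seq[i - n + 1:i][::-1]) for i in range(n - 1, len(seq))]

print(context_pairs("oacac", 2))  # bigram: [('a', 'o'), ('c', 'a'), ('a', 'c'), ('c', 'a')]
print(context_pairs("oacac", 3))  # trigram: [('c', 'ao'), ('a', 'ca'), ('c', 'ac')]
```

Raising n shortens the list of observation pairs while multiplying the number of distinct contexts to estimate, exactly the trade-off listed above.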

slide-5
SLIDE 5
  • Example

P(x_{1:N}) = ∏_{i=1}^{N} P(x_i | x_1, …, x_{i−1}) ≈ ∏_{i=1}^{N} P(x_i | x_{i−n+1}, …, x_{i−1}),   n = 2

P(oacac) = P(o) P(a|o) P(c|a) P(a|c) P(c|a) = G_[](o) G_o(a) G_a(c) G_c(a) G_a(c)

slide-6
SLIDE 6
  • Discrete distribution ↔ vector of parameters
  • Counting / maximum-likelihood estimation

– Training sequence x_{1:N}
– Predictive inference

  • Example

– Non-smoothed unigram model (context ǫ)

G → x_i,   i = 1 : N

Ĝ(X = k) = π̂_k = #{i : x_i = k} / N

P(X_{N+1} | x_1 … x_N) = Ĝ(X_{N+1}),   Ĝ = [π̂_1, …, π̂_K],   K = |Σ|
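The counting estimator on this slide is a one-liner; a minimal sketch (helper name ours):

```python
from collections import Counter

def ml_unigram(seq):
    """Maximum-likelihood unigram estimate: pi_k = count(k) / N.

    Symbols never seen in training get probability zero, which is the
    failure mode that motivates the smoothing on the next slide.
    """
    counts = Counter(seq)
    return {k: c / len(seq) for k, c in counts.items()}

print(ml_unigram("oacac"))  # {'o': 0.2, 'a': 0.4, 'c': 0.4}
```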

slide-7
SLIDE 7

  • Estimation
  • Predictive inference
  • Priors over distributions
  • Net effect

– Inference is “smoothed” w.r.t. uncertainty about the unknown distribution

  • Example

– Smoothed unigram (context ǫ)

U → G → x_i,   i = 1 : N

P(G | x_{1:N}) ∝ P(x_{1:N} | G) P(G)

P(X_{N+1} | x_{1:N}) = ∫ P(X_{N+1} | G) P(G | x_{1:N}) dG

G ∼ Dirichlet(U)   or   G ∼ PY(d, c, U)
slide-8
SLIDE 8

"#

  • Toolfortyingtogetherrelateddistributionsinhierarchicalmodels
  • Measureovermeasures
  • Basemeasureisthe“mean” measure
  • AdistributiondrawnfromaPitmanYorprocessisrelatedtoits base

distribution

– (equalwhen =∞ or =1)

G ∼ PY(d, c, Gσ) xi ∼ G

concentration discount base distribution

E[G(dx)] = Gσ(dx)

slide-9
SLIDE 9

  • Generalization of the Dirichlet process (d = 0)

– Different (power-law) properties
– Better for text [Teh, 2006] and images [Sudderth and Jordan, 2009]

  • Posterior predictive distribution
  • Forms the basis for straightforward, simple samplers
  • Rule for stochastic memoization

P(X_{N+1} | x_{1:N}; c, d) ≈ ∫ P(X_{N+1} | G) P(G | x_{1:N}; c, d) dG

= Σ_{k=1}^{K} ((m_k − d) / (c + N)) I(φ_k = X_{N+1}) + ((c + dK) / (c + N)) G_σ(X_{N+1})

(m_k customers at table k serving dish φ_k, with K tables in the Chinese restaurant representation)

  • Can’t actually do this integral this way
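The predictive rule above is exactly the seating rule of the Pitman-Yor Chinese restaurant process, which is why simple samplers fall out of it. A self-contained sketch (our own illustrative code, not the paper's sampler):

```python
import random

def py_crp_sample(base_draw, n_draws, d=0.5, c=1.0, seed=0):
    """Chinese-restaurant sampler for G ~ PY(d, c, base).

    Customer N+1 sits at existing table k with probability
    (m_k - d) / (c + N) and at a new table, drawing a fresh dish from
    the base distribution, with probability (c + d*K) / (c + N),
    where m_k is the table's occupancy and K the number of tables.
    """
    rng = random.Random(seed)
    tables = []  # occupancy count per table
    dishes = []  # dish (symbol) served at each table
    draws = []
    for _ in range(n_draws):
        N, K = sum(tables), len(tables)
        weights = [m - d for m in tables] + [c + d * K]  # sums to c + N
        r = rng.uniform(0, sum(weights))
        for k, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        if k == len(tables):          # new table: draw a dish from the base
            tables.append(1)
            dishes.append(base_draw(rng))
        else:                         # join existing table k
            tables[k] += 1
        draws.append(dishes[k])
    return draws

draws = py_crp_sample(lambda rng: rng.choice("oac"), 20)
print(draws)
```

Repeated dishes become more probable as their tables fill, producing the power-law reuse that makes PY priors a good fit for text.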
slide-10
SLIDE 10

  • Estimation
  • Predictive inference
  • Naturally related distributions tied together
  • Net effect

– Observations in one context affect inference in other contexts
– Statistical strength is shared between similar contexts

  • Example

– Smoothing bi-gram (contexts ǫ and s ∈ Σ)

Θ = {G_ǫ, G_a, G_c},   ǫ = σ(a) = σ(c)

P(Θ | x_{1:N}) ∝ P(x_{1:N} | Θ) P(Θ)

P(X_{N+1} | x_{1:N}) = ∫ P(X_{N+1} | Θ) P(Θ | x_{1:N}) dΘ

[Figure: graphical model with base U, G_ǫ, and per-context distributions over plates j = 1 : N, i = 1 : N]

G_[the United States] ∼ PY(d, c, G_[United States])

slide-11
SLIDE 11

[Figure: graphical model relating the base measure U, conditional distributions G, posterior predictive probabilities, and observations]

slide-12
SLIDE 12

[Figure: the same graphical model, with the computation (CPU) for one conditional distribution highlighted]

slide-13
SLIDE 13

[Figure: the same graphical model, with computations (CPUs) for further conditional distributions highlighted]

slide-14
SLIDE 14

  • Share statistical strength between sequentially related predictive conditional distributions

– Estimates of highly specific conditional distributions
– are coupled with others that are related
– through a single common, more general shared ancestor

  • Corresponds intuitively to back-off

[Figure: tree of conditional distributions G coupled through shared ancestors]

slide-15
SLIDE 15

  • Bayesian generalization of the smoothing n-gram Markov model
  • Language model: outperforms interpolated Kneser-Ney (KN) smoothing
  • Efficient inference algorithms exist

– [Goldwater et al. ’05; Teh ’06; Teh, Kurihara, Welling ’08]

  • Sharing between contexts that differ in the most distant symbol only
  • Finite depth

G_ǫ | d_0, U ∼ PY(d_0, 0, U)
G_u | d_{|u|}, G_{σ(u)} ∼ PY(d_{|u|}, 0, G_{σ(u)})   ∀ u ∈ Σ^{≤n−1}
x_i | x_{i−n+1:i−1} = u ∼ G_u,   i = 1, …, T

(σ(u) is u with its most distant symbol removed)

slide-16
SLIDE 16

"

  • Asequencecanbecharacterizedbyasetofsingle
  • bservationsinuniquecontextsofgrowinglength

Increasingcontextlength Alwaysasingleobservation Foreshadowing:allsuffixesofthestring“cacao”

  • acac →

              

  • |[]

a|o c|ao a|cao c|acao
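In code, the growing-context view is the complement of the fixed-length one (hypothetical function name; contexts written most-recent-symbol-first as on the slide):

```python
def growing_context_pairs(seq):
    """Each symbol paired with its full preceding context, written
    most-recent-symbol-first: a single observation per unique context."""
    return [(seq[i], seq[:i][::-1]) for i in range(len(seq))]

print(growing_context_pairs("oacac"))
# [('o', ''), ('a', 'o'), ('c', 'ao'), ('a', 'cao'), ('c', 'acao')]
```

Because the contexts are reversed, the contexts of "oacac" are exactly suffixes of "cacao", which is what the suffix-tree slides exploit.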

slide-17
SLIDE 17
  • Example
  • Smoothing essential

– Only one observation in each context!

  • Solution

– Hierarchical sharing a la HPYP

P(x_{1:N}) = ∏_{i=1}^{N} P(x_i | x_1, …, x_{i−1})

P(oacac) = P(o) P(a|o) P(c|oa) P(a|oac) P(c|oaca)

slide-18
SLIDE 18
  • Eliminates Markov order selection
  • Always uses the full context when making predictions
  • Linear time, linear space (in length of observation sequence) graphical model identification
  • Performance is the limit of the n-gram as n → ∞
  • Same or less overall cost as a 5-gram interpolating Kneser-Ney model

G_ǫ | d_0, U ∼ PY(d_0, 0, U)
G_u | d_{|u|}, G_{σ(u)} ∼ PY(d_{|u|}, 0, G_{σ(u)})   ∀ u ∈ Σ*
x_i | x_{1:i−1} = u ∼ G_u,   i = 1, …, T

slide-19
SLIDE 19

Observations

  • oacac → o|[] a|o c|ao a|cao c|acao

Latent conditional distributions with Pitman-Yor priors / stochastic memoizers

slide-20
SLIDE 20

  • oacac → o|[] a|o c|ao a|cao c|acao

All suffixes of the string “cacao”

slide-21
SLIDE 21

  • Deterministic finite automaton that recognizes all suffixes of an input string
  • Requires O(N²) time and space to build and store [Ukkonen, ’95]
  • Too intensive for any practical sequence modelling application

slide-22
SLIDE 22

  • Deterministic finite automaton that recognizes all suffixes of an input string
  • Uses path compression to reduce storage and construction computational complexity
  • Requires only O(N) time and space to build and store [Ukkonen, ’95]
  • Practical for large-scale sequence modelling applications
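The saving from path compression is easy to see on a toy string. The sketch below builds the quadratic suffix trie naively and then splices out non-branching nodes; Ukkonen's algorithm instead builds the compressed tree directly in linear time (all names here are ours):

```python
def trie_of_suffixes(s):
    """Uncompressed suffix trie: one node per distinct substring of s."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.values())

def compress(node):
    """Path compression: splice out every non-branching internal node,
    merging each unary chain into a single multi-character edge label."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:
            (ch, grandchild), = child.items()
            label += ch
            child = grandchild
        out[label] = compress(child)
    return out

trie = trie_of_suffixes("cacao")
print(count_nodes(trie))            # 13 nodes: one per distinct substring, O(N^2)
print(count_nodes(compress(trie)))  # 8 nodes: only root, branching nodes, leaves
```

After compression the node count is bounded by the number of leaves (one per suffix), which is what makes the structure linear in N.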

slide-23
SLIDE 23

[Figure-only slide]

slide-24
SLIDE 24

[Figure-only slide]

slide-25
SLIDE 25

  • This is a graphical model transformation under the covers
  • These compressed paths require being able to analytically marginalize out nodes from the graphical model
  • The result of this marginalization can be thought of as providing a different set of caching rules to memoizers on the path-compressed edges
slide-26
SLIDE 26
  • Theorem 1: Coagulation

If G_2 | G_1 ∼ PY(d_1, 0, G_1) and G_3 | G_2 ∼ PY(d_2, 0, G_2), then G_3 | G_1 ∼ PY(d_1 d_2, 0, G_1) with G_2 marginalized out.

[Pitman ’99; Ho, James, Lau ’06; W., Archambeau, Gasthaus, James, Teh ’09]

slide-27
SLIDE 27

[Figure-only slide]

slide-28
SLIDE 28

[Figure-only slide]

slide-29
SLIDE 29

  • Given a single input sequence

– Ukkonen’s linear-time suffix tree construction algorithm is run on its reverse to produce a prefix tree
– This identifies the nodes in the graphical model we need to represent
– The tree is traversed, and path-compressed parameters for the Pitman-Yor processes are assigned to each remaining Pitman-Yor process
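The steps above can be sketched end-to-end. Everything here is illustrative: the per-depth discount values are made up, the helper names are ours, and the trie is built naively where Ukkonen's algorithm would run in linear time; Theorem 1 is what licenses multiplying the discounts along each compressed edge.

```python
from math import prod

def reversed_context_trie(seq):
    """Step 1: prefix tree over contexts = suffix trie of the reversed
    sequence."""
    root = {}
    rev = seq[::-1]
    for i in range(len(rev)):
        node = root
        for ch in rev[i:]:
            node = node.setdefault(ch, {})
    return root

def assign_discounts(node, discounts, depth=0):
    """Steps 2-3: path-compress the tree and give each surviving
    Pitman-Yor node the product of the per-depth discounts along its
    incoming edge (coagulation, Theorem 1)."""
    out = {}
    for label, child in node.items():
        end = depth + 1
        while len(child) == 1:          # splice out non-branching nodes
            (ch, grandchild), = child.items()
            label += ch
            child = grandchild
            end += 1
        d = prod(discounts[depth + 1:end + 1])
        out[label] = (d, assign_discounts(child, discounts, end))
    return out

# Hypothetical per-depth discounts d_1 .. d_5:
discounts = [None, 0.62, 0.69, 0.74, 0.80, 0.95]
tree = assign_discounts(reversed_context_trie("oacac"), discounts)
print(tree)
```

For "oacac" the surviving context nodes hang off edges "ca", "a", and "o"; the "ca" edge, spanning depths 1 and 2, receives the coagulated discount d_1 · d_2.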

slide-30
SLIDE 30

[Figure-only slide]

slide-31
SLIDE 31

[Figure-only slide]

slide-32
SLIDE 32

[Figure: computational cost comparison - HPYP exceeds SM computational complexity]

slide-33
SLIDE 33

AP News Test Perplexity

[Mnih & Hinton, 2009]                    112.1
[Bengio et al., 2003]                    109.0
4-gram Modified Kneser-Ney [Teh, 2006]   102.4
4-gram HPYP [Teh, 2006]                  101.9
Sequence Memoizer (SM)                    96.9

slide-34
SLIDE 34
  • The Sequence Memoizer is a deep (unbounded) smoothing Markov model
  • It can be used to learn a joint distribution over discrete sequences in time and space linear in the length of a single observation sequence
  • It is equivalent to a smoothing ∞-gram but costs no more to compute than a 5-gram