Statistical NLP

Spring 2010

Lecture 3: LMs II / Text Categorization

Dan Klein – UC Berkeley

Language Models

  • In general, we want to place a distribution over sentences
  • Basic/classic solution: n-gram models

  • Question: how do we estimate conditional probabilities?
  • Problems:
      Known words in unseen contexts
      Entirely unknown words
  • Many systems ignore this – why?
      Often just lump all new words into a single UNK type (a minimal sketch
      follows below)

  the cat <s>    the cat <s>    the dog <s>
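
To make the UNK approach concrete, here is a minimal sketch (my own illustration; the count threshold and token name are arbitrary choices, not from the lecture):

```python
from collections import Counter

def apply_unk(sentences, min_count=2, unk="<UNK>"):
    """Replace words seen fewer than min_count times with a single UNK type."""
    counts = Counter(w for sent in sentences for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    return [[w if w in vocab else unk for w in sent] for sent in sentences]

# At test time, any word outside the training vocabulary is also mapped to
# <UNK>, so it receives the probability mass estimated for unknown words.
```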

Held-Out Reweighting

  • What’s wrong with add-d smoothing?
  • Let’s look at some real bigram counts [Church and Gale 91]:
  • Big things to notice:
      Add-one vastly overestimates the fraction of new bigrams
      Add-anything vastly underestimates the ratio 2*/1* (see the
      “Ratio of 2/1” row below)
  • One solution: use held-out data to predict the map from c to c*
    (sketched below)

  Count in 22M Words   Actual c* (Next 22M)   Add-one’s c*   Add-0.0000027’s c*
  1                    0.448                  2/7e-10        ~1
  2                    1.25                   3/7e-10        ~2
  3                    2.24                   4/7e-10        ~3
  4                    3.23                   5/7e-10        ~4
  5                    4.21                   6/7e-10        ~5
  Mass on New          9.2%                   ~100%          9.2%
  Ratio of 2/1         2.8                    1.5            ~2
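
A minimal sketch of the held-out idea (my own illustration; the corpus splits and names are assumptions): for each training count c, average the count that those same bigrams receive in a held-out sample.

```python
from collections import Counter, defaultdict

def heldout_cstar(train_bigrams, heldout_bigrams):
    """Map each training count c to c*: the average held-out count of
    bigrams that occurred exactly c times in training."""
    train_counts = Counter(train_bigrams)
    held_counts = Counter(heldout_bigrams)
    totals, types = defaultdict(float), defaultdict(int)
    for bigram, c in train_counts.items():
        totals[c] += held_counts[bigram]   # Counter gives 0 for unseen
        types[c] += 1
    return {c: totals[c] / types[c] for c in totals}
```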

Good-Turing Reweighting I

  • We’d like to not need held-out data (why?)
  • Idea: leave-one-out validation
      N_k: number of types which occur k times in the entire corpus
      Take each of the c tokens out of the corpus in turn
      c “training” sets of size c−1, “held-out” sets of size 1
      How many “held-out” tokens are unseen in “training”?   N_1
      How many held-out tokens are seen k times in training?   (k+1) N_{k+1}
      There are N_k words with training count k
      Each should occur with expected count   (k+1) N_{k+1} / N_k
      Each should occur with probability:   (k+1) N_{k+1} / (c N_k)

  [Diagram: “Training” vs. “Held-Out” count classes, pairing each held-out
  mass with the training class below it: N_1/N_0, 2N_2/N_1, 3N_3/N_2, …,
  4417·N_4417/N_4416, 3511·N_3511/N_3510]

Good-Turing Reweighting II

  • Problem: what about “the”? (say k = 4417)
      For small k, N_k > N_{k+1}
      For large k, too jumpy; zeros wreck estimates
  • Simple Good-Turing [Gale and Sampson]: replace the empirical N_k with a
    best-fit power law once count counts get unreliable

  [Diagram: the same count-class mapping as before, with the unreliable
  empirical N_k (e.g. N_4417, N_3511) replaced by values from the fitted
  power law]

Good-Turing Reweighting III

  • Hypothesis: counts of k should be k* = (k+1) N_{k+1} / N_k (sketched
    after the table)
  • Katz Smoothing:
      Use GT discounted counts (roughly – Katz left large counts alone)
      Whatever mass is left over goes to the empirical unigram

  Count in 22M Words   Actual c* (Next 22M)   GT’s c*
  1                    0.448                  0.446
  2                    1.25                   1.26
  3                    2.24                   2.24
  4                    3.23                   3.24
  Mass on New          9.2%                   9.2%
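
A minimal sketch of the Good-Turing count estimate (my own illustration; real implementations smooth the N_k curve first, as the Simple Good-Turing bullet above notes):

```python
from collections import Counter

def good_turing_counts(tokens):
    """Compute k* = (k+1) * N_{k+1} / N_k from raw token counts.
    Returns None where N_{k+1} = 0, which is exactly why Simple
    Good-Turing fits a power law to the count counts."""
    counts = Counter(tokens)
    nk = Counter(counts.values())        # N_k: number of types with count k
    return {k: (k + 1) * nk[k + 1] / nk[k] if nk[k + 1] else None
            for k in nk}
```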


Kneser-Ney: Discounting

  • Kneser-Ney smoothing: very successful estimator using two ideas
  • Idea 1: observed n-grams occur more in training than they will later:
      Absolute Discounting (formula below the table)
      Save ourselves some time and just subtract 0.75 (or some d)
      Maybe have a separate value of d for very low counts

  Count in 22M Words   Avg in Next 22M   Good-Turing c*
  1                    0.448             0.446
  2                    1.25              1.26
  3                    2.24              2.24
  4                    3.23              3.24
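
The standard absolute-discounting estimator that these bullets describe (the formula itself is not spelled out on the slide; d is the discount, α(w′) is the leftover mass chosen so the distribution normalizes, and P̂ is the lower-order back-off distribution):

  P_AD(w | w′) = max(c(w′ w) − d, 0) / c(w′) + α(w′) · P̂(w)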

Kneser-Ney: Continuation

  • Idea 2: Type-based fertility rather than token counts
      Shannon game: There was an unexpected ____?
          delay?   Francisco?
      “Francisco” is more common than “delay”
      … but “Francisco” always follows “San”
      … so it’s less “fertile”
  • Solution: type-continuation probabilities
      In the back-off model, we don’t want the probability of w as a unigram
      Instead, we want the probability that w completes a novel bigram
      For each word, count the number of bigram types it completes

Kneser-Ney

  • Kneser-Ney smoothing combines these two ideas (sketched below):
      Absolute discounting
      Lower-order continuation probabilities
  • KN smoothing has repeatedly proven effective (ASR, MT, …)
  • [Teh, 2006] shows KN smoothing is a kind of approximate inference in a
    hierarchical Pitman-Yor process (and better approximations are superior
    to basic KN)
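
Putting the two ideas together, a minimal interpolated-KN bigram sketch (my own illustration, not code from the course; a single discount d is used, without the separate low-count discounts mentioned above):

```python
from collections import Counter, defaultdict

def make_kn_bigram(tokens, d=0.75):
    """Interpolated Kneser-Ney for bigrams: absolute discounting plus
    type-based continuation probabilities."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_total = defaultdict(int)   # c(w'): total count of context w'
    followers = defaultdict(set)       # distinct words seen after w'
    histories = defaultdict(set)       # distinct words seen before w
    for (w1, w2), c in bigrams.items():
        context_total[w1] += c
        followers[w1].add(w2)
        histories[w2].add(w1)
    num_bigram_types = len(bigrams)

    def p_kn(w, prev):
        # continuation probability: fraction of bigram types that w completes
        p_cont = len(histories[w]) / num_bigram_types
        if context_total[prev] == 0:            # unseen context: back off fully
            return p_cont
        discounted = max(bigrams[(prev, w)] - d, 0) / context_total[prev]
        lam = d * len(followers[prev]) / context_total[prev]  # leftover mass
        return discounted + lam * p_cont

    return p_kn

# p = make_kn_bigram("the cat sat on the mat".split()); p("cat", "the")
```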

What Actually Works?

  • Trigrams and beyond:
      Unigrams, bigrams generally useless
      Trigrams much better (when there’s enough data)
      4-, 5-grams really useful in MT, but not so much for speech
  • Discounting:
      Absolute discounting, Good-Turing, held-out estimation, Witten-Bell
  • Context counting:
      Kneser-Ney construction of lower-order models
  • See the [Chen+Goodman] reading for tons of graphs!

  [Graphs from Joshua Goodman]

Data >> Method?

  • Having more data is better…
  • … but so is using a better estimator
  • Another issue: N > 3 has huge costs in speech recognizers

  [Graph: test entropy vs. n-gram order (1–20) for Katz and Kneser-Ney
  models trained on 100,000, 1,000,000, 10,000,000, and all tokens]

Tons of Data?

  [Brants et al., 2007]


Large Scale Methods

  • Language models get big, fast:
      English Gigaword corpus: 2G tokens, 0.3G trigrams, 1.2G 5-grams
      Need to access entries very often, ideally in memory
  • What do you do when language models get too big?
      Distributing LMs across machines
      Quantizing probabilities
      Random hashing (e.g. Bloom filters) [Talbot and Osborne 07]
      (sketched below)
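
To make the random-hashing idea concrete, a minimal Bloom-filter sketch (my illustration, far simpler than the Talbot and Osborne construction, which also encodes quantized counts):

```python
import hashlib

class BloomFilter:
    """Set membership with no false negatives and tunable false positives;
    stores n-grams in a bit array instead of an exact hash table."""
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):          # k independent hash positions
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# bf = BloomFilter(); bf.add("the quick fox"); "the quick fox" in bf  -> True
```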

Beyond N-Gram LMs

  • Lots of ideas we won’t have time to discuss:
      Caching models: recent words more likely to appear again
      Trigger models: recent words trigger other words
      Topic models
  • A few recent ideas:
      Syntactic models: use tree models to capture long-distance syntactic
      effects [Chelba and Jelinek, 98]
      Discriminative models: set n-gram weights to improve final task
      accuracy rather than fit training set density [Roark, 05, for ASR;
      Liang et al., 06, for MT]
      Structural zeros: some n-grams are syntactically forbidden; keep
      estimates at zero if they look like real zeros [Mohri and Roark, 06]
      Bayesian document and IR models [Daumé 06]

Overview

  • So far: language models give P(s)
      Help model fluency for various noisy-channel processes (MT, ASR, etc.)
      N-gram models don’t represent any deep variables involved in language
      structure or meaning
      Usually we want to know something about the input other than how
      likely it is (syntax, semantics, topic, etc.)
  • Next: Naïve-Bayes models
      We introduce a single new global variable
      Still a very simplistic model family
      Lets us model hidden properties of text, but only very non-local ones…
      In particular, we can only model properties which are largely
      invariant to word order (like topic)

Text Categorization

  • Want to classify documents into broad semantic topics (e.g. politics,
    sports, etc.)
  • Which one is the politics document? (And how much deep processing did
    that decision take?)
  • One approach: bag-of-words and Naïve-Bayes models
  • Another approach later…
  • Usually begin with a labeled corpus containing examples of each class

  “Obama is hoping to rally support for his $825 billion stimulus package
  on the eve of a crucial House vote. Republicans have expressed
  reservations about the proposal, calling for more tax cuts and less
  spending. GOP representatives seemed doubtful that any deals would be
  made.”

  “California will open the 2009 season at home against Maryland Sept. 5
  and will play a total of six games in Memorial Stadium in the final
  football schedule announced by the Pacific-10 Conference Friday. The
  original schedule called for 12 games over 12 weekends.”

Naïve-Bayes Models

  • Idea: pick a topic, then generate a document using a language model for
    that topic
  • Naïve-Bayes assumption: all words are independent given the topic:

      P(c, x_1 … x_n) = P(c) ∏_i P(x_i | c)

  • Compare to a unigram language model:

      P(x_1 … x_n) = ∏_i P(x_i),   with the final generated symbol
      x_{n+1} = STOP
Using NB for Classification

  • We have a joint model of topics and documents
  • Gives the posterior likelihood of a topic given a document:

      P(c | x_1 … x_n) = P(c) ∏_i P(x_i | c) / Σ_c′ P(c′) ∏_i P(x_i | c′)

  • What about totally unknown words?
  • Can work shockingly well for textcat (especially in the wild)
  • How can unigram models be so terrible for language modeling, but
    class-conditional unigram models work for textcat?
  • Numerical/speed issues (see the log-space sketch below)
  • How about NB for spam detection?


Two NB Formulations

  • Two NB event models for text categorization:
  • The class-conditional unigram model, a.k.a. multinomial model:
      One node per word in the document
      Driven by words which are present
      Multiple occurrences, multiple evidence
      Better overall – plus, we know how to smooth
  • The binomial (binary) model:
      One node for each word in the vocabulary
      Incorporates explicit negative correlations
      Know how to do feature selection (e.g. keep words with high mutual
      information with the class variable)

  [Graph: accuracy vs. vocabulary size]

Example: Barometers

  NB FACTORS:
      P(s) = 1/2      P(−|s) = 1/4      P(−|r) = 3/4

  PREDICTIONS:
      P(r, −, −) = (1/2)(3/4)(3/4)
      P(s, −, −) = (1/2)(1/4)(1/4)
      P(r | −, −) = 9/10
      P(s | −, −) = 1/10

Example: Stoplights

  NB FACTORS:
      P(w) = 6/7      P(r|w) = 1/2      P(g|w) = 1/2
      P(b) = 1/7      P(r|b) = 1        P(g|b) = 0

  P(b | r, r) = 4/10   (what happened? checked numerically below)
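
These posteriors are easy to verify numerically (a quick check, with the two observations from each example hard-coded):

```python
def posterior(prior, likelihoods):
    """P(class | obs) for two classes via Bayes' rule; likelihoods are
    per-class products over the independent observations."""
    joint = {c: prior[c] * likelihoods[c] for c in prior}
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}

# Barometers: two independent '−' readings
print(posterior({"r": 0.5, "s": 0.5}, {"r": 0.75**2, "s": 0.25**2}))  # r: 0.9
# Stoplights: two red lights observed
print(posterior({"w": 6/7, "b": 1/7}, {"w": 0.5**2, "b": 1.0**2}))    # b: 0.4
```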

(Non-)Independence Issues

  • Mild Non-Independence:
      Evidence all points in the right direction
      Observations just not entirely independent
      Results: inflated confidence, deflated priors
      What to do? Boost priors or attenuate evidence (a hack)
  • Severe Non-Independence:
      Words viewed independently are misleading
      Interactions have to be modeled (“not bad”)
      What to do? Change your model!

Language Identification

  • How can we tell what language a document is in?
  • How to tell the French from the English?
  • Treat it as word-level textcat?
      Overkill, and requires a lot of training data
      You don’t actually need to know about words!
  • Option: build a character-level language model (sketched below)

  “The 38th Parliament will meet on Monday, October 4, 2004, at 11:00 a.m.
  The first item of business will be the election of the Speaker of the
  House of Commons. Her Excellency the Governor General will open the First
  Session of the 38th Parliament on October 5, 2004, with a Speech from the
  Throne.”

  “La 38e législature se réunira à 11 heures le lundi 4 octobre 2004, et la
  première affaire à l’ordre du jour sera l’élection du président de la
  Chambre des communes. Son Excellence la Gouverneure générale ouvrira la
  première session de la 38e législature avec un discours du Trône le mardi
  5 octobre 2004.”

  Σύμφωνο σταθερότητας και ανάπτυξης
  Patto di stabilità e di crescita
  (both name the EU’s Stability and Growth Pact)
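
A character-level class-conditional model for language ID might look like this (a sketch under my own choices: character unigrams and add-alpha smoothing; an actual solution would likely use character n-grams):

```python
import math
from collections import Counter

def train_char_lm(text, alpha=1.0):
    """Add-alpha smoothed character unigram log-probabilities."""
    counts = Counter(text)
    total = sum(counts.values()) + alpha * (len(counts) + 1)
    lm = {ch: math.log((k + alpha) / total) for ch, k in counts.items()}
    lm[None] = math.log(alpha / total)   # shared cell for unseen characters
    return lm

def identify(doc, lms):
    """Pick the language whose character model scores the document highest."""
    def score(lm):
        return sum(lm.get(ch, lm[None]) for ch in doc)
    return max(lms, key=lambda lang: score(lms[lang]))

# lms = {"en": train_char_lm(english_text), "fr": train_char_lm(french_text)}
# identify("la première session", lms)  -> "fr"
```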

Class-Conditional LMs

  • Can add a topic variable to other language models:

      P(c, x_1 … x_n) = P(c) ∏_i P(x_i | x_{i−1}, c)

  • Could be characters instead of words, used for language ID (HW2)
  • Could sum out the topic variable and use as a language model (see the
    sketch below)
  • How might a class-conditional n-gram language model behave differently
    from a standard n-gram model?
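
Summing out the topic variable turns the class-conditional model into a mixture language model. A minimal sketch (my illustration; class_lms, a map from class to a bigram-probability function, is a hypothetical interface):

```python
import math

def mixture_lm_logprob(tokens, priors, class_lms):
    """log P(x) = log Σ_c P(c) ∏_i P(x_i | x_{i-1}, c): each class scores
    the whole sentence with its own bigram model, weighted by its prior."""
    per_class = []
    for c, prior in priors.items():
        logp = math.log(prior)
        for prev, w in zip(["<s>"] + tokens, tokens):
            logp += math.log(class_lms[c](w, prev))  # hypothetical interface
        per_class.append(logp)
    m = max(per_class)                               # log-sum-exp for stability
    return m + math.log(sum(math.exp(p - m) for p in per_class))
```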