
Statistical NLP
Spring 2010

Lecture 2: Language Models

Dan Klein – UC Berkeley

Speech in a Slide

  • Frequency gives pitch; amplitude gives volume
  • Frequencies at each time slice processed into observation vectors
  • [Figure: waveform and spectrogram, with the frame sequence rendered as acoustic observation symbols … a12 a13 a12 a14 a14 …]


TheNoisy-ChannelModel

Wewanttopredictasentencegivenacoustics: Thenoisychannelapproach:

Acousticmodel:HMMsover wordpositionswithmixtures

  • fGaussiansasemissions

Languagemodel: Distributionsoversequences

  • fwords(sentences)

Acoustically Scored Hypotheses

      the station signs are in deep in english          14732
      the stations signs are in deep in english         14735
      the station signs are in deep into english        14739
      the station's signs are in deep in english        14740
      the station signs are in deep in the english      14741
      the station signs are indeed in english           14757
      the station's signs are indeed in english         14760
      the station signs are indians in english          14790
      the station signs are indian in english           14799
      the stations signs are indians in english         14807
      the stations signs are indians and english        14815
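The hypotheses above are scored by the acoustic model alone; in the noisy-channel setup the decoder also adds a language-model score and takes the argmax. A minimal rescoring sketch (the scoring functions and hypothesis list are illustrative placeholders, not part of any real recognizer):

```python
def rescore(hypotheses, acoustic_logprob, lm_logprob):
    """Pick the word sequence w maximizing log P(a | w) + log P(w)."""
    return max(hypotheses, key=lambda w: acoustic_logprob(w) + lm_logprob(w))

# With a language model in the mix, a hypothesis like
# "the station's signs are indeed in english" can win even though
# its acoustic-only score is slightly worse.
```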

ASR System Components

  • [Figure: language model, acoustic model, and decoder]

Translation: Codebreaking?

  "Also knowing nothing official about, but having guessed and inferred considerable about, the powerful new mechanized methods in cryptography—methods which I believe succeed even when one does not know what language has been coded—one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'"

  Warren Weaver (1955:18, quoting a letter he wrote in 1947)


MT Overview

MT System Components

Other Noisy-Channel Processes

  • Spelling correction
  • Handwriting recognition
  • OCR
  • More…

Probabilistic Language Models

  • Goal: assign useful probabilities P(x) to sentences x
      Input: many observations of training sentences x
      Output: system capable of computing P(x)
  • Probabilities should broadly indicate plausibility of sentences
      P(I saw a van) >> P(eyes awe of an)
      P(artichokes intimidate zippers) ≈ 0
      In principle, "plausible" depends on the domain, context, speaker…
  • One option: empirical distribution over training sentences?
      Problem: doesn't generalize (at all)
  • Two aspects of generalization
      Decomposition: break sentences into small pieces which can be recombined in new ways (conditional independence)
      Smoothing: allow for the possibility of unseen pieces


N-GramModelDecomposition

Chainrule:breaksentenceprobabilitydown Impracticaltoconditiononeverythingbefore

P(???|Turntopage134andlookatthepictureofthe)?

N-grammodels:assumeeachworddependsonlyona shortlinearhistory Example:
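A small sketch of this factorization, assuming a hypothetical cond_logprob(word, history) estimator (any smoothed n-gram estimate would do):

```python
def sentence_logprob(words, cond_logprob, n=3):
    """Chain rule with a truncated history: sum of log P(w_i | previous n-1 words),
    including a final STOP event."""
    padded = ["<s>"] * (n - 1) + list(words) + ["STOP"]
    total = 0.0
    for i in range(n - 1, len(padded)):
        history = tuple(padded[i - n + 1:i])
        total += cond_logprob(padded[i], history)
    return total
```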

N-Gram Model Parameters

  • The parameters of an n-gram model:
      The actual conditional probability estimates; we'll call them θ
      Obvious estimate: relative frequency, e.g. P_ML(w | h) = c(h, w) / c(h)
  • General approach
      Take a training set X and a test set X'
      Compute an estimate θ from X
      Use it to assign probabilities to other sentences, such as those in X'

  Training Counts:
      198015222   the first
      194623024   the same
      168504105   the following
      158562063   the world
      …
       14112454   the door
      total: 23135851162   the *
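A sketch of the obvious (relative-frequency) estimate built from counts like those above; the corpus format and names are illustrative:

```python
from collections import Counter

def train_mle_bigram(sentences):
    """Maximum-likelihood bigram estimate: P(w | h) = c(h, w) / c(h)."""
    pair_counts, history_counts = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["STOP"]
        for h, w in zip(tokens, tokens[1:]):
            pair_counts[h, w] += 1
            history_counts[h] += 1
    def prob(w, h):
        return pair_counts[h, w] / history_counts[h] if history_counts[h] else 0.0
    return prob
```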


Higher Order N-grams?

  the *:
      198015222   the first
      194623024   the same
      168504105   the following
      158562063   the world
      …
       14112454   the door
      total: 23135851162   the *

  close the *:
      197302   close the window
      191125   close the door
      152500   close the gap
      116451   close the thread
       87298   close the deal
      total: 3785230   close the *

  please close the *:
      3380   please close the door
      1601   please close the window
      1164   please close the new
      1159   please close the gate
       900   please close the browser
      total: 13951   please close the *

  • Please close the door
  • Please close the first window on the left

Unigram Models

  • Simplest case: unigrams
      P(w_1 w_2 … w_n) = ∏_i P(w_i)
  • Generative process: pick a word, pick a word, … until you pick STOP
  • As a graphical model: [figure: independent word nodes w_1, w_2, …, STOP]
  • Examples:
      [fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass.]
      [thrift, did, eighty, said, hard, 'm, july, bullish]
      [that, or, limited, the]
      []
      [after, any, on, consistently, hospital, lake, of, of, other, and, factors, raised, analyst, too, allowed, mexico, never, consider, fall, bungled, davison, that, obtain, price, lines, the, to, sass, the, the, further, board, a, details, machinists, the, companies, which, rivals, an, because, longer, oakes, percent, a, they, three, edward, it, currier, an, within, in, three, wrote, is, you, s., longer, institute, dentistry, pay, however, said, possible, to, rooms, hiding, eggs, approximate, financial, canada, the, so, workers, advancers, half, between, nasdaq]
      …
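A minimal sketch of this generative process; the vocabulary and probabilities are placeholders (a real model would estimate them from counts):

```python
import random

def generate_unigram(vocab, probs):
    """Repeatedly draw independent words from P(w) until STOP is drawn."""
    words = []
    while True:
        w = random.choices(vocab, weights=probs, k=1)[0]
        if w == "STOP":
            return words
        words.append(w)

# generate_unigram(["the", "of", "inflation", "STOP"], [0.5, 0.3, 0.1, 0.1])
```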

Bigram Models

  • Big problem with unigrams: P(the the the the) >> P(I like ice cream)!
  • Condition on the previous single word:
      P(w_1 w_2 … w_n) = ∏_i P(w_i | w_{i-1})
  • Obvious that this should help – in probabilistic terms, we're using weaker conditional independence assumptions (what's the cost?)
  • Any better?
      [texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]
      [outside, new, car, parking, lot, of, the, agreement, reached]
      [although, common, shares, rose, forty, six, point, four, hundred, dollars, from, thirty, seconds, at, the, greatest, play, disingenuous, to, be, reset, annually, the, buy, out, of, american, brands, vying, for, mr., womack, currently, sharedata, incorporated, believe, chemical, prices, undoubtedly, will, be, as, much, is, scheduled, to, conscientious, teaching]
      [this, would, be, a, record, november]
Regular Languages?

  • N-gram models are (weighted) regular languages
  • Many linguistic arguments that language isn't regular:
      Long-distance effects: "The computer which I had just put into the machine room on the fifth floor ___." (crashed)
      Recursive structure
  • Why CAN we often get away with n-gram models?
  • PCFG LM (later):
      [This, quarter, 's, surprisingly, independent, attack, paid, off, the, risk, involving, IRS, leaders, and, transportation, prices, .]
      [It, could, be, announced, sometime, .]
      [Mr., Toseland, believes, the, average, defense, economy, is, drafted, from, slightly, more, than, 12, stocks, .]


More N-Gram Examples

Measuring Model Quality

  • The game isn't to pound out fake sentences!
      Obviously, generated sentences get "better" as we increase the model order
      More precisely: using ML estimators, higher order always gives better likelihood on train, but not on test
  • What we really want to know is:
      Will our model prefer good sentences to bad ones?
      Bad ≠ ungrammatical!
      Bad ≈ unlikely
      Bad = sentences that our acoustic model really likes but aren't the correct answer


Measuring Model Quality

  • The Shannon Game: how well can we predict the next word?
      When I eat pizza, I wipe off the ____
      Many children are allergic to ____
      I saw a ____
  • Unigrams are terrible at this game. (Why?)
  • "Entropy": per-word test log likelihood (misnamed)

  A predictive distribution for "wipe off the ____":
      grease   0.5
      sauce    0.4
      dust     0.05
      …
      mice     0.0001
      …
      the      1e-100

  Training counts:
      3516   wipe off the excess
      1034   wipe off the dust
       547   wipe off the sweat
       518   wipe off the mouthpiece
      …
       120   wipe off the grease
         0   wipe off the sauce
         0   wipe off the mice
      total: 28048   wipe off the *

Measuring Model Quality

  • Problem with "entropy": 0.1 bits of improvement doesn't sound so good
  • "Solution": perplexity, i.e. P(test data)^(−1/N), the exponentiated per-word cross-entropy
      Interpretation: average branching factor in the model
  • Important notes:
      It's easy to get bogus perplexities by having bogus probabilities that sum to more than one over their event spaces. 30% of you will do this on HW1.
      Even though our models require a stop step, averages are per actual word, not per derivation step.


Measuring Model Quality

  • Word Error Rate (WER)
      WER = (insertions + deletions + substitutions) / true sentence size
      Correct answer:     Andy saw a part of the movie
      Recognizer output:  And he saw apart of the movie
  • The "right" measure:
      Task error driven
      For speech recognition
      For a specific recognizer!
  • Common issue: intrinsic measures like perplexity are easier to use, but extrinsic ones are more credible

  • [Figure: WER (0.2–1.0) as a function of training set size (200,000–1,000,000 words) for unigram, bigram, and rule-based models]
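A sketch of the WER computation above via a standard word-level edit distance (the function name is mine, not a particular toolkit's API):

```python
def word_error_rate(reference, hypothesis):
    """(insertions + deletions + substitutions) / true sentence length,
    found with dynamic-programming edit distance over words."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution / match
                          d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1)                           # insertion
    return d[len(r)][len(h)] / len(r)

# word_error_rate("Andy saw a part of the movie",
#                 "And he saw apart of the movie")   # -> 4/7 ≈ 0.57
```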

Sparsity

  • Problems with n-gram models:
      New words appear all the time:
          Synaptitute
          132,701.03
          multidisciplinarization
      New bigrams: even more often
      Trigrams or more – still worse!

Zipf's Law

  • Types (words) vs. tokens (word occurrences)
  • Broadly: most word types are rare ones
  • Specifically:
      Rank word types by token frequency
      Frequency is inversely proportional to rank
  • Not special to language: randomly generated character strings have this property (try it!)
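A quick way to try that claim; the alphabet, text length, and output format are arbitrary choices for the experiment:

```python
import random
from collections import Counter

def random_text_rank_frequency(num_chars=500_000, alphabet="abcdefg ", seed=0):
    """Generate random characters (space included), split into 'words', and
    print rank, frequency, and rank*frequency for the most common types,
    to eyeball how closely frequency tracks 1/rank."""
    rng = random.Random(seed)
    text = "".join(rng.choice(alphabet) for _ in range(num_chars))
    counts = Counter(w for w in text.split() if w)
    for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
        print(f"{rank:>3}  {word:<10} {freq:>7}  {rank * freq}")
```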


Parameter Estimation

  • Maximum likelihood estimates won't get us very far
  • Need to smooth these estimates
  • General method (procedurally)
      Take your empirical counts
      Modify them in various ways to improve estimates
  • General method (mathematically)
      Often can give estimators a formal statistical interpretation
      … but not always
      Approaches that are mathematically obvious aren't always what works

      3516   wipe off the excess
      1034   wipe off the dust
       547   wipe off the sweat
       518   wipe off the mouthpiece
      …
       120   wipe off the grease
         0   wipe off the sauce
         0   wipe off the mice
      total: 28048   wipe off the *

Smoothing

  • We often want to make estimates from sparse statistics:
      P(w | denied the):
          3   allegations
          2   reports
          1   claims
          1   request
          7   total
  • Smoothing flattens spiky distributions so they generalize better:
      P(w | denied the), smoothed:
          2.5   allegations
          1.5   reports
          0.5   claims
          0.5   request
          2     other
          7     total
  • Very important all over NLP, but easy to do badly!
  • We'll illustrate with bigrams today (h = previous word, could be anything).


Puzzle: Unknown Words

  • Imagine we look at 1M words of text
      We'll see many thousands of word types
      Some will be frequent, others rare
      Could turn into an empirical P(w)
  • Questions:
      What fraction of the next 1M words will be new words?
      How many total word types exist?

Language Models

  • In general, we want to place a distribution over sentences
  • Basic/classic solution: n-gram models
      Question: how to estimate conditional probabilities?
      Problems:
          Known words in unseen contexts
          Entirely unknown words
              Many systems ignore this – why?
              Often just lump all new words into a single UNK type
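A minimal sketch of the "single UNK type" trick; the count threshold and names are arbitrary assumptions:

```python
from collections import Counter

def build_vocab_and_remap(train_sentences, min_count=2, unk="<UNK>"):
    """Keep word types seen at least min_count times; map everything else to UNK
    so unseen test words still receive (shared) probability mass."""
    counts = Counter(w for sent in train_sentences for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    def remap(sentence):
        return [w if w in vocab else unk for w in sentence]
    return vocab, remap
```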


Smoothing: Add-One, Etc.

  • Classic solution: add counts (Laplace smoothing / Dirichlet prior)
      P_add-λ(w | w') = (c(w', w) + λ) / (c(w') + λV)
      Add-one smoothing (λ = 1) especially often talked about
  • For a bigram distribution, can add counts shaped like the unigram distribution
  • Can consider hierarchical formulations: the trigram is recursively centered on the smoothed bigram estimate, etc. [MacKay and Peto, 94]
  • Can be derived from Dirichlet/multinomial conjugacy: the prior shape shows up as pseudo-counts
  • Problem: works quite poorly!
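A sketch of the add-λ estimate for bigrams (λ = 1 gives classic add-one); the count dictionaries are assumed to come from something like the earlier MLE training sketch:

```python
def add_lambda_bigram(pair_counts, history_counts, vocab_size, lam=1.0):
    """P(w | h) = (c(h, w) + lam) / (c(h) + lam * V): every event in the
    vocabulary gets lam pseudo-counts."""
    def prob(w, h):
        return (pair_counts.get((h, w), 0) + lam) / (history_counts.get(h, 0) + lam * vocab_size)
    return prob
```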

LinearInterpolation

Problem:issupportedbyfewcounts Classicsolution:mixturesofrelated,denserhistories,e.g.: ThemixtureapproachtendstoworkbetterthantheDirichlet priorapproachforseveralreasons

Canflexiblyincludemultipleback-offcontexts,notjustachain Oftenmultipleweights,dependingonbucketedcounts GoodwaysoflearningthemixtureweightswithEM(later) Notentirelyclearwhyitworkssomuchbetter

Allthedetailsyoucouldeverwant:[ChenandGoodman,98]
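A sketch of the interpolation with fixed weights; in practice the λs are usually bucketed by count and tuned on held-out data (e.g. with EM), and the component estimators are whatever trigram/bigram/unigram models you already have:

```python
def interpolated_trigram(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Mix the sparse trigram estimate with denser bigram and unigram back-offs.
    The weights must be non-negative and sum to 1 to keep a proper distribution."""
    l3, l2, l1 = lambdas
    def prob(w, u, v):            # history is (u, v) = the two previous words
        return l3 * p_tri(w, u, v) + l2 * p_bi(w, v) + l1 * p_uni(w)
    return prob
```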


Held-Out Data

  • Important tool for optimizing how models generalize:
      Set a small number of hyperparameters that control the degree of smoothing by maximizing the (log-)likelihood of held-out data
      Can use any optimization technique (line search or EM usually easiest)
  • Examples: [Figure: corpus split into Training Data | Held-Out Data | Test Data, with smoothing hyperparameters such as k tuned on the held-out portion]

Held-Out Reweighting

  • What's wrong with add-d smoothing?
  • Let's look at some real bigram counts [Church and Gale 91]:

      Count in 22M words    Actual c* (next 22M)    Add-one's c*    Add-0.0000027's c*
      1                     0.448                   2/7e-10         ~1
      2                     1.25                    3/7e-10         ~2
      3                     2.24                    4/7e-10         ~3
      4                     3.23                    5/7e-10         ~4
      5                     4.21                    6/7e-10         ~5
      Mass on new           9.2%                    ~100%           9.2%
      Ratio of 2/1          2.8                     1.5             ~2

  • Big things to notice:
      Add-one vastly overestimates the fraction of new bigrams
      Add-0.0000027 vastly underestimates the ratio 2*/1*
  • One solution: use held-out data to predict the map from c to c*


Good-Turing Reweighting I

  • We'd like to not need held-out data (why?)
  • Idea: leave-one-out validation
      N_k: number of types which occur k times in the entire corpus
      Take each of the c tokens out of the corpus in turn
      c "training" sets of size c−1, "held-out" sets of size 1
      How many held-out tokens are unseen in training?  N_1
      How many held-out tokens are seen k times in training?  (k+1) N_{k+1}
  • There are N_k words with training count k
      Each should occur with expected count (k+1) N_{k+1} / N_k
      Each should occur with probability (k+1) N_{k+1} / (c N_k)
  • [Figure: "Training" vs. "Held-Out" columns pairing each count class with the mass it predicts: N_1/N_0, 2N_2/N_1, 3N_3/N_2, …, 3511N_3511/N_3510, 4417N_4417/N_4416, …]
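A sketch of the raw Good-Turing count adjustment; no power-law fit is applied, so it misbehaves exactly where the next slide warns (jumpy or zero N_{k+1}):

```python
from collections import Counter

def good_turing_adjusted(counts):
    """Map each observed count k to k* = (k + 1) * N_{k+1} / N_k,
    where N_k is the number of types occurring exactly k times."""
    n = Counter(counts.values())                       # count-of-counts N_k
    return {item: (k + 1) * n.get(k + 1, 0) / n[k]     # 0 whenever N_{k+1} is empty
            for item, k in counts.items()}
```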

Good-Turing Reweighting II

  • Problem: what about "the"? (say k = 4417)
      For small k, N_k > N_{k+1}
      For large k, the counts are too jumpy, and zeros wreck the estimates
  • Simple Good-Turing [Gale and Sampson]: replace the empirical N_k with a best-fit power law once the count-of-count statistics get unreliable
  • [Figure: the same leave-one-out columns, with the jumpy empirical N_k values replaced by smoothed fits]


Good-Turing Reweighting III

  • Hypothesis: counts of k should be k* = (k+1) N_{k+1} / N_k

      Count in 22M words    Actual c* (next 22M)    GT's c*
      1                     0.448                   0.446
      2                     1.25                    1.26
      3                     2.24                    2.24
      4                     3.23                    3.24
      Mass on new           9.2%                    9.2%

  • Katz smoothing:
      Use GT discounted counts (roughly – Katz left large counts alone)
      Whatever mass is left goes to the empirical unigram

Kneser-Ney: Discounting

  • Kneser-Ney smoothing: a very successful estimator using two ideas
  • Idea 1: observed n-grams occur more in training than they will later:

      Count in 22M words    Future c* (next 22M)
      1                     0.448
      2                     1.25
      3                     2.24
      4                     3.23

  • Absolute discounting:
      No need to actually have held-out data; just subtract 0.75 (or some d)
      Maybe have a separate value of d for very low counts
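A sketch of absolute discounting for bigrams, backing off to some lower-order distribution (the empirical unigram here; Kneser-Ney, below, would use continuation probabilities instead). The value d = 0.75 follows the slide; the names are mine:

```python
def absolute_discount_bigram(pair_counts, history_counts, backoff_prob, d=0.75):
    """Subtract d from every seen bigram count; the freed-up mass,
    d * (distinct continuations of h) / c(h), goes to the back-off distribution."""
    continuations = {}
    for (h, _w) in pair_counts:
        continuations[h] = continuations.get(h, 0) + 1   # bigram types starting with h
    def prob(w, h):
        c_h = history_counts.get(h, 0)
        if c_h == 0:
            return backoff_prob(w)
        seen = max(pair_counts.get((h, w), 0) - d, 0) / c_h
        reserved = d * continuations.get(h, 0) / c_h
        return seen + reserved * backoff_prob(w)
    return prob
```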


Kneser-Ney: Continuation

  • Idea 2: type-based fertility
      Shannon game: There was an unexpected ____?
          delay?  Francisco?
      "Francisco" is more common than "delay"
      … but "Francisco" always follows "San"
      … so it's less "fertile"
  • Solution: type-continuation probabilities
      In the back-off model, we don't want the probability of w as a unigram
      Instead, we want the probability that w appears as a novel continuation
      For each word, count the number of bigram types it completes

Kneser-Ney

  • Kneser-Ney smoothing combines these two ideas:
      Absolute discounting
      Lower-order continuation probabilities
  • KN smoothing has repeatedly proven effective
  • [Teh, 2006] shows KN smoothing is a kind of approximate inference in a hierarchical Pitman-Yor process (and better approximations are superior to basic KN)
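A sketch of the continuation probability used for the lower-order model (it counts bigram types, not tokens; the function name is mine):

```python
from collections import defaultdict

def continuation_prob(pair_counts):
    """P_cont(w) = |{h : c(h, w) > 0}| / (number of distinct bigram types):
    how many different contexts w completes.  'Francisco' scores low because
    it essentially only follows 'San', however frequent the token is."""
    left_contexts = defaultdict(set)
    for (h, w) in pair_counts:
        left_contexts[w].add(h)
    total_bigram_types = len(pair_counts)
    return lambda w: len(left_contexts[w]) / total_bigram_types
```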


What Actually Works?

  • Trigrams and beyond:
      Unigrams, bigrams: generally useless
      Trigrams: much better (when there's enough data)
      4-, 5-grams: really useful in MT, but not so much for speech
  • Discounting:
      Absolute discounting, Good-Turing, held-out estimation, Witten-Bell
  • Context counting:
      Kneser-Ney construction of lower-order models
  • See the [Chen+Goodman] reading for tons of graphs!
  • [Graphs from Joshua Goodman]

Data >> Method?

  • Having more data is better…
  • … but so is using a better estimator
  • Another issue: N > 3 has huge costs in speech recognizers
  • [Figure: test entropy (5.5–10) vs. n-gram order (1–20) for Katz and Kneser-Ney smoothing, at training sizes of 100,000, 1,000,000, 10,000,000, and all available words]


Tons of Data?

Beyond N-Gram LMs

  • Lots of ideas we won't have time to discuss:
      Caching models: recent words more likely to appear again
      Trigger models: recent words trigger other words
      Topic models
  • A few recent ideas:
      Syntactic models: use tree models to capture long-distance syntactic effects [Chelba and Jelinek, 98]
      Discriminative models: set n-gram weights to improve final task accuracy rather than fit training set density [Roark, 05, for ASR; Liang et al., 06, for MT]
      Structural zeros: some n-grams are syntactically forbidden, keep estimates at zero [Mohri and Roark, 06]
      Bayesian document and IR models [Daumé 06]


Overview

  • So far: language models give P(s)
      Help model fluency for various noisy-channel processes (MT, ASR, etc.)
      N-gram models don't represent any deep variables involved in language structure or meaning
      Usually we want to know something about the input other than how likely it is (syntax, semantics, topic, etc.)
  • Next: Naïve-Bayes models
      We introduce a single new global variable
      Still a very simplistic model family
      Lets us model hidden properties of text, but only very non-local ones…
      In particular, we can only model properties which are largely invariant to word order (like topic)