
Statistical NLP

Spring 2010

Lecture 4: Text Categorization

Dan Klein – UC Berkeley

Overview

So far: language models give P(s)

  • Help model fluency for various noisy-channel processes (MT, ASR, etc.)
  • N-gram models don't represent any deep variables involved in language structure or meaning
  • Usually we want to know something about the input other than how likely it is (syntax, semantics, topic, etc.)

Next: Naïve Bayes models

  • We introduce a single new global variable
  • Still a very simplistic model family
  • Lets us model hidden properties of text, but only very non-local ones…
  • In particular, we can only model properties which are largely invariant to word order (like topic)

Text Categorization

  • Want to classify documents into broad semantic topics (e.g. politics, sports, etc.)
  • Which one is the politics document? (And how much deep processing did that decision take?)
  • One approach: bag-of-words and Naïve-Bayes models
  • Another approach later…
  • Usually begin with a labeled corpus containing examples of each class

Obama is hoping to rally support for his $825 billion stimulus package on the eve of a crucial House vote. Republicans have expressed reservations about the proposal, calling for more tax cuts and less spending. GOP representatives seemed doubtful that any deals would be made.

California will open the 2009 season at home against Maryland Sept. 5 and will play a total of six games in Memorial Stadium in the final football schedule announced by the Pacific-10 Conference Friday. The original schedule called for 12 games over 12 weekends.

Naïve-Bayes Models

  • Idea: pick a topic, then generate a document using a language model for that topic.
  • Naïve-Bayes assumption: all words are independent given the topic:

      P(c, w1 … wn) = P(c) ∏_i P(wi | c)

  • Compare to a unigram language model:

      P(w1 … wn) = ∏_i P(wi)

    with a final STOP token ending each document (wn = STOP).
Using NB for Classification

  • We have a joint model of topics and documents
  • Gives posterior likelihood of topic given a document:

      P(c | w1 … wn) = P(c) ∏_i P(wi | c) / Σ_c′ P(c′) ∏_i P(wi | c′)

  • What about totally unknown words?
  • Can work shockingly well for text cat (especially in the wild)
  • How can unigram models be so terrible for language modeling, but class-conditional unigram models work for text cat?
  • Numerical/speed issues
  • How about NB for spam detection?

Two NB Formulations

Two NB event models for text categorization:

The class-conditional unigram model, a.k.a. multinomial model
  • One node per word in the document
  • Driven by words which are present
  • Multiple occurrences, multiple evidence
  • Better overall – plus, know how to smooth

The binomial (binary) model
  • One node for each word in the vocabulary
  • Incorporates explicit negative correlations
  • Know how to do feature selection (e.g. keep words with high mutual information with the class variable)

[Plot: text-cat accuracy as a function of vocabulary size]
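The mutual-information feature selection mentioned for the binary model can be made concrete. A sketch under my own assumptions (toy corpus, base-2 logs so the score is in bits):

```python
import math

def mutual_information(docs, labels, word):
    """I(X; Y) in bits, where X = 1 if `word` occurs in a document
    and Y is the document's class."""
    n = len(docs)
    def joint(x, y):
        return sum(1 for d, c in zip(docs, labels)
                   if (word in d.split()) == x and c == y) / n
    mi = 0.0
    for x in (True, False):
        px = sum(joint(x, y) for y in set(labels))
        for y in set(labels):
            py = sum(joint(x2, y) for x2 in (True, False))
            pxy = joint(x, y)
            if pxy > 0:  # 0 log 0 = 0 by convention
                mi += pxy * math.log2(pxy / (px * py))
    return mi
```

A word perfectly correlated with the class carries one full bit; a word spread evenly across classes carries none, so ranking the vocabulary by this score keeps the discriminative features.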


Example: Barometers

NB FACTORS:
  P(s) = 1/2   P(− | s) = 1/4   P(− | r) = 3/4

PREDICTIONS:
  • P(r, −, −) = (½)(¾)(¾)
  • P(s, −, −) = (½)(¼)(¼)
  • P(r | −, −) = 9/10
  • P(s | −, −) = 1/10
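The posterior here is three lines of arithmetic. A quick check (my own sketch; "−" stands for the observed barometer reading):

```python
# Priors and factors from the slide: two conditionally independent "-" readings.
p_s, p_r = 1 / 2, 1 / 2            # P(s), P(r)
p_obs_s, p_obs_r = 1 / 4, 3 / 4    # P(- | s), P(- | r)

joint_r = p_r * p_obs_r * p_obs_r  # P(r, -, -) = (1/2)(3/4)(3/4)
joint_s = p_s * p_obs_s * p_obs_s  # P(s, -, -) = (1/2)(1/4)(1/4)

post_r = joint_r / (joint_r + joint_s)  # P(r | -, -) = 9/10
```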
Example: Stoplights

(w = working light, b = broken light; observations r = red, g = green)

NB FACTORS:
  P(w) = 6/7   P(r | w) = 1/2   P(g | w) = 1/2
  P(b) = 1/7   P(r | b) = 1     P(g | b) = 0

P(b | r, r) = 4/10 (what happened?)
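The same arithmetic explains the "(what happened?)": two red observations move the broken-light posterior from the 1/7 prior up to 4/10 (a quick check, my own sketch):

```python
p_w, p_b = 6 / 7, 1 / 7        # P(working), P(broken)
p_r_w, p_r_b = 1 / 2, 1.0      # P(red | working), P(red | broken)

joint_b = p_b * p_r_b * p_r_b  # P(b, r, r) = (1/7)(1)(1)
joint_w = p_w * p_r_w * p_r_w  # P(w, r, r) = (6/7)(1/2)(1/2)

post_b = joint_b / (joint_b + joint_w)  # P(b | r, r) = 4/10
```

Each extra red reading is treated as fresh evidence, so correlated observations inflate the posterior exactly as the (non-)independence discussion that follows warns.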

(Non-)Independence Issues

Mild Non-Independence
  • Evidence all points in the right direction
  • Observations just not entirely independent
  • Results: inflated confidence, deflated priors
  • What to do? Boost priors or attenuate evidence

Severe Non-Independence
  • Words viewed independently are misleading
  • Interactions have to be modeled
  • What to do? Change your model!

Language Identification

  • How can we tell what language a document is in?
  • How to tell the French from the English?
  • Treat it as word-level text cat?
      Overkill, and requires a lot of training data
      You don't actually need to know about words!
  • Option: build a character-level language model

The 38th Parliament will meet on Monday, October 4, 2004, at 11:00 a.m. The first item of business will be the election of the Speaker of the House of Commons. Her Excellency the Governor General will open the First Session of the 38th Parliament on October 5, 2004, with a Speech from the Throne.

La 38e législature se réunira à 11 heures le lundi 4 octobre 2004, et la première affaire à l'ordre du jour sera l'élection du président de la Chambre des communes. Son Excellence la Gouverneure générale ouvrira la première session de la 38e législature avec un discours du Trône le mardi 5 octobre 2004.

Σύμφωνο σταθερότητας και ανάπτυξης
Patto di stabilità e di crescita
(the Stability and Growth Pact, in Greek and Italian)
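A character-unigram version of this idea fits in a few lines. The two training strings are short fragments of the passages above (lowercased, accents stripped), and the add-one smoothing and `identify` interface are my own assumptions:

```python
import math
from collections import Counter

# Tiny class-conditional character models, one per language.
EN = ("the first item of business will be the election of "
      "the speaker of the house of commons")
FR = ("la premiere affaire a l'ordre du jour sera "
      "l'election du president de la chambre des communes")

VOCAB = set(EN) | set(FR)

def char_model(text):
    """Add-one-smoothed character unigram distribution P(ch | language)."""
    counts = Counter(text)
    return {ch: (counts[ch] + 1) / (len(text) + len(VOCAB)) for ch in VOCAB}

MODELS = {"en": char_model(EN), "fr": char_model(FR)}

def identify(s):
    """argmax over languages of the character-level log-likelihood
    (uniform prior; characters outside the training vocabulary are skipped)."""
    return max(MODELS, key=lambda lang: sum(
        math.log(MODELS[lang][ch]) for ch in s if ch in VOCAB))
```

Even these two short training strings separate the languages: characters like w, k, d, and the apostrophe are strong cues, which is why character-level models need so little data compared to word-level text cat.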

Class-Conditional LMs

  • Can add a topic variable to richer language models:

      P(c, w1 … wn) = P(c) ∏_i P(wi | wi−1, c)

  • Could be characters instead of words, used for language ID (HW2)
  • Could sum out the topic variable and use as a language model
  • How might a class-conditional n-gram language model behave differently from a standard n-gram model?

Clustering / Pattern Detection

Problem 1: There are many patterns in the data, most of which you don't care about.

  Soccer team wins match
  Stocks close up 3%
  Investing in the stock market has…
  The first game of the world series…


Clustering vs. Classification

  • Classification: we specify which pattern we want; features uncorrelated with that pattern are idle
  • Clustering: the clustering procedure locks onto whichever pattern is most salient, statistically
      P(content words | class) will learn topics
      P(length, function words | class) will learn style
      P(characters | class) will learn "language"

         P(w|sports)   P(w|politics)
  the      0.1           0.1
  game     0.02          0.005
  win      0.02          0.01

         P(w|headline)  P(w|story)
  the      0.05          0.1
  game     0.01          0.01
  win      0.01          0.01

Model-Based Clustering

  • Clustering with probabilistic models: build a model of the domain with observed data X and unobserved labels Y
  • Often: find θ to maximize P(X | θ) = Σ_Y P(X, Y | θ)
  • Problem 2: The relationship between the structure of your model and the kinds of patterns it will detect is complex.

  Observed (X):                           Unobserved (Y):
  LONDON – Soccer team wins match         ??
  NEW YORK – Stocks close up 3%           ??
  Investing in the stock market has…      ??
  The first game of the world series…     ??

Learning Models with EM

  • Hard EM: alternate between
      E-step: find best "completions" Y for fixed θ
      M-step: find best parameters θ for fixed Y
  • Example: K-Means
  • Problem 3: Data likelihood (usually) isn't the objective you really care about
  • Problem 4: You can't find global maxima anyway

Hard EM for Naïve-Bayes

Procedure:
  (1) Calculate best completions:  y*_d = argmax_c P(c | d, θ)
  (2) Compute relevant counts from the completed data
  (3) Compute new parameters from these counts (divide)
  (4) Repeat until convergence

Can also do this when some docs are labeled.
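The four steps above can be sketched directly. The tiny corpus, the two labeled seed documents (the slide's "some docs are labeled" case), and the helper names are my own assumptions for illustration:

```python
import math
from collections import Counter

def fit(docs, labels, n_classes, vocab):
    """Steps (2)-(3): counts from completed data, then new parameters
    (add-one smoothing; returns a log P(c, d) scorer)."""
    prior = [(labels.count(c) + 1) / (len(labels) + n_classes)
             for c in range(n_classes)]
    counts = [Counter() for _ in range(n_classes)]
    for d, c in zip(docs, labels):
        counts[c].update(d.split())
    def logprob(d, c):
        total = sum(counts[c].values())
        return math.log(prior[c]) + sum(
            math.log((counts[c][w] + 1) / (total + len(vocab)))
            for w in d.split())
    return logprob

def hard_em(docs, seed_labels, n_classes=2, iters=10):
    """Steps (1) and (4): best completions, refit, repeat,
    initialized from the few labeled seed documents."""
    vocab = {w for d in docs for w in d.split()}
    seeds = [(d, c) for d, c in zip(docs, seed_labels) if c is not None]
    logprob = fit([d for d, _ in seeds], [c for _, c in seeds], n_classes, vocab)
    for _ in range(iters):
        labels = [max(range(n_classes), key=lambda c: logprob(d, c))
                  for d in docs]                                    # (1) E-step
        logprob = fit(docs, labels, n_classes, vocab)               # (2)-(3) M-step
    return labels
```

With the seeds removed and a random initialization, the same loop can lock onto a different or degenerate pattern, which is exactly Problems 3 and 4 above.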

EM: More Formally

Hard EM:
  • Improve completions
  • Improve parameters
  • Each step either does nothing or increases the objective

Soft EM for Naïve-Bayes

Procedure:
  (1) Calculate posteriors (soft completions): P(c | d, θ)
  (2) Compute expected counts under those posteriors
  (3) Compute new parameters from these counts (divide)
  (4) Repeat until convergence


EM in General

  • We'll use EM over and over again to fill in missing data
  • Convenience scenario: we want P(x), including y just makes the model simpler (e.g. mixing weights for language models)
  • Induction scenario: we actually want to know y (e.g. clustering)
  • NLP differs from much of statistics / machine learning in that we often want to interpret or use the induced variables (which is tricky at best)
  • General approach: alternately update y and θ
  • E-step: compute posteriors P(y | x, θ)
      This means scoring all completions with the current parameters
      Usually, we do this implicitly with dynamic programming
  • M-step: fit θ to these completions
      This is usually the easy part – treat the completions as (fractional) complete data
  • Initialization: start with some noisy labelings and the noise adjusts into patterns based on the data and the model
  • We'll see lots of examples in this course
  • EM is only locally optimal (why?)

Heuristic Clustering?

Many methods of clustering have been developed:
  • Most start with a pairwise distance function
  • Most can be interpreted probabilistically (with some effort)
  • Axes: flat / hierarchical, agglomerative / divisive, incremental / iterative, probabilistic / graph-theoretic / linear-algebraic

Examples:
  • Single-link agglomerative clustering
  • Complete-link agglomerative clustering
  • Ward's method
  • Hybrid divisive/agglomerative schemes
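Single-link agglomerative clustering, the first example above, is a short greedy loop. The 1-D points and the choice of absolute difference as the pairwise distance are my own illustrative assumptions:

```python
def single_link(points, k):
    """Agglomerative clustering: start with singletons and repeatedly
    merge the two clusters whose *closest* members are nearest
    (single link), until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters
```

Swapping min for max in the distance line gives complete-link clustering; the rest of the loop is unchanged.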

Document Clustering

  • Typically want to cluster documents by topic
  • Bag-of-words models usually do detect topic
  • It's detecting deeper structure, syntax, etc. where it gets really tricky!
  • All kinds of games to focus the clustering:
      Stop word lists
      Term weighting schemes (from IR, more later)
      Dimensionality reduction (more later)

Word Senses

Words have multiple distinct meanings, or senses:
  • Plant: living plant, manufacturing plant, …
  • Title: name of a work, ownership document, form of address, material at the start of a film, …

Many levels of sense distinctions:
  • Homonymy: totally unrelated meanings (river bank, money bank)
  • Polysemy: related meanings (star in sky, star on TV)
  • Systematic polysemy: productive meaning extensions (metonymy such as organizations to their buildings) or metaphor
  • Sense distinctions can be extremely subtle (or not)

Granularity of senses needed depends a lot on the task.
Why is it important to model word senses? Translation, parsing, information retrieval?

Word Sense Disambiguation

  • Example: living plant vs. manufacturing plant
  • How do we tell these senses apart?
      "context"
      Maybe it's just text categorization: each word sense represents a topic
      Run a Naïve-Bayes classifier?
  • Bag-of-words classification works OK for noun senses:
      90% on classic, shockingly easy examples (line, interest, star)
      80% on SensEval-1 nouns
      70% on SensEval-1 verbs

  The manufacturing plant which had previously sustained the town's economy shut down after an extended labor strike.

Various Approaches to WSD

Unsupervised learning:
  • Bootstrapping (Yarowsky 95)
  • Clustering

Indirect supervision:
  • From thesauri
  • From WordNet
  • From parallel corpora

Supervised learning:
  • Most systems do some kind of supervised learning
  • Many competing classification technologies perform about the same (it's all about the knowledge sources you tap)
  • Problem: training data available for only a few words


Resources

WordNet
  • Hand-built (but large) hierarchy of word senses
  • Basically a hierarchical thesaurus

SensEval → SemEval
  • A WSD competition, of which there have been 3+3 iterations
  • Training/test sets for a wide range of words, difficulties, and parts-of-speech
  • Bake-off where lots of labs tried lots of competing approaches

SemCor
  • A big chunk of the Brown corpus annotated with WordNet senses

Other Resources
  • The Open Mind Word Expert
  • Parallel texts
  • Flat thesauri

Verb WSD

Why are verbs harder?
  • Verbal senses less topical
  • More sensitive to structure, argument choice

Verb Example: "Serve"
  • [function] The tree stump serves as a table
  • [enable] The scandal served to increase his popularity
  • [dish] We serve meals for the homeless
  • [enlist] She served her country
  • [jail] He served six years for embezzlement
  • [tennis] It was Agassi's turn to serve
  • [legal] He was served by the sheriff

Knowledge Sources

So what do we need to model to handle "serve"?

There are distant topical cues:

  …. point … court ………………… serve ……… game …

  P(c, w1 … wn) = P(c) ∏_i P(wi | c)

Weighted Windows with NB

Distance conditioning
  • Some words are important only when they are nearby
      …. as …. point … court ………………… serve ……… game …
      ….………………………………………… serve as ……………..
  • Condition each context word on its relative position i:

      P(c, w−k … w+k) = P(c) ∏_{i = −k … +k} P(wi | c, i)

Distance weighting
  • Nearby words should get a larger vote
      … court …… serve as ……… game …… point
  • Weight each word's contribution by its distance:

      P(c, w−k … w+k) = P(c) ∏_{i = −k … +k} P(wi | c)^{weight(i)}
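The distance-weighting idea can be sketched as a scoring function. The 1/(1+|i|) weight and the toy log-probability in the test are my own assumptions; the slide does not fix a weighting scheme:

```python
import math

def weighted_score(context, log_likelihood,
                   weight=lambda i: 1.0 / (1.0 + abs(i))):
    """Score one sense for a target word: sum over (relative position i, word)
    context pairs of weight(i) * log P(word | sense), so nearby words get a
    larger vote.  The log-prior term is omitted for brevity."""
    return sum(weight(i) * log_likelihood(w) for i, w in context)
```

With weight(i) = 1 this reduces to the plain NB score; the conditioning variant instead indexes the likelihood table by relative position, P(w | c, i).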

Better Features

There are smarter features:
  • Argument selectional preference:
      serve NP[meals] vs. serve NP[papers] vs. serve NP[country]
  • Subcategorization:
      [function] serve PP[as]
      [enable] serve VP[to]
      [tennis] serve <intransitive>
      [food] serve NP {PP[to]}
  • Can capture poorly (but robustly) with local windows
  • … but we can also use a parser and get these features explicitly

Other constraints (Yarowsky 95):
  • One-sense-per-discourse (only true for broad topical distinctions)
  • One-sense-per-collocation (pretty reliable when it kicks in: manufacturing plant, flowering plant)

Complex Features with NB?

Example:

  Washington County jail served 11,166 meals last month, a figure that translates to feeding some 120 people three times daily for 31 days.

So we have a decision to make based on a set of cues:
  • context:jail, context:county, context:feeding, …
  • local-context:jail, local-context:meals
  • subcat:NP, direct-object-head:meals

Not clear how to build a generative derivation for these:
  • Choose topic, then decide on having a transitive usage, then pick "meals" to be the object's head, then generate other words?
  • How about the words that appear in multiple features?
  • Hard to make this work (though maybe possible)
  • No real reason to try