
Statistical NLP
Spring 2010

Lecture 15: Grammar Induction

Dan Klein – UC Berkeley

Supervised Learning

  • Systems duplicate correct analyses from training data
  • Hand-annotation of data:
    • Time-consuming
    • Expensive
    • Hard to adapt for new purposes (tasks, languages, domains, etc.)
  • Corpus availability drives research, not tasks
  • Example: Penn Treebank
    • 50K sentences
    • Hand-parsed over several years

Unsupervised Learning

  • Systems take raw data and automatically detect patterns
  • Why unsupervised learning?
    • More data than annotation
    • Insights into machine learning, clustering
    • Kids learn some aspects of language entirely without supervision
  • Here: unsupervised learning
    • Work purely from the forms of the utterances
    • Neither assume nor exploit prior meaning or grounding [cf. Feldman]

UnsupervisedParsing?

  • Startwithrawtext,learnsyntacticstructure
  • Somehavearguedthatlearningsyntaxfrom

positivedataaloneisimpossible:

Gold,1967:Non'identifiability inthelimit Chomsky,1980:Thepovertyofthestimulus

  • Manyothershavefeltitshouldbepossible:

Lari andYoung,1990 CarrollandCharniak,1992 AlexClark,2001 MarkPaskin,2001 …andmanymore,butitdidn’tworkwell(oratall) untilthepastfewyears

  • Surprisingresult:it’spossibletogetentirely

unsupervisedparsingto(reasonably)workwell!

Learnability

  • Learnability: formal conditions under which a class of languages can be learned in some sense
  • Setup:
    • Class of languages is 𝕃
    • Learner is some algorithm H
    • Learner sees a sequence X of strings x1 … xn
    • H maps sequences X to languages L in 𝕃
  • Question: for what classes 𝕃 do learners exist?

Learnability: [Gold 67]

  • Criterion: identification in the limit
    • A presentation of L is an infinite sequence of x's from L in which each x occurs at least once
    • A learner H identifies L in the limit if, for any presentation of L, from some point n onward, H always outputs L
    • A class 𝕃 is identifiable in the limit if there is some single H which correctly identifies in the limit any L in 𝕃
  • Example: 𝕃 = {{a}, {a,b}} is learnable in the limit
  • Theorem [Gold 67]: any 𝕃 which contains all finite languages and at least one infinite language (i.e., is superfinite) is unlearnable in this sense
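A minimal sketch (Python; the list-of-strings learner interface is an illustrative assumption) of why 𝕃 = {{a}, {a,b}} is identifiable in the limit: guess {a} until a b appears, then guess {a,b} forever.

```python
def learner(strings_so_far):
    """Identifies {{'a'}, {'a','b'}} in the limit: output {'a'} unless
    the string 'b' has been observed."""
    return {'a', 'b'} if 'b' in strings_so_far else {'a'}

# On any presentation of {'a','b'}, 'b' occurs at some finite position,
# after which the learner's output never changes again.
presentation = ['a', 'a', 'b', 'a', 'b']   # a finite prefix of one presentation
for n in range(1, len(presentation) + 1):
    print(learner(presentation[:n]))
```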


Learnability: [Gold 67]

  • Proof sketch:
    • Assume 𝕃 is superfinite
    • There exists a chain L1 ⊂ L2 ⊂ … ⊂ L∞
    • Take any learner H assumed to identify 𝕃
    • Construct the following misleading sequence:
      • Present strings from L1 (cycling through all of it) until H outputs L1
      • Present strings from L2 until H outputs L2
      • …
      (each phase must end: continued forever, it would be a presentation of Lk, which H identifies)
    • This is a presentation of L∞, but H never settles on L∞
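A concrete toy instance of the diagonalization (Python; the chain Lk = {a, aa, …, a^k} with L∞ = {a, aa, …} and the "smallest consistent language" learner are illustrative choices, not from the lecture):

```python
def learner(strings):
    """Guess the smallest L_k = {a, aa, ..., a^k} consistent with the data
    (no finite sample ever forces the guess L_infinity)."""
    k = max(len(s) for s in strings)
    return f'L{k}'

# Adversary: present a^k until the learner outputs L_k, then move on to
# L_{k+1}.  The resulting stream enumerates a, aa, aaa, ... -- a valid
# presentation of L_infinity -- yet the learner's guess changes forever.
stream = []
for k in range(1, 6):
    while True:
        stream.append('a' * k)
        if learner(stream) == f'L{k}':
            break
print(stream, '->', learner(stream))
```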

Learnability:[Horning69]

Problem:IILrequiresthatHsucceedoneach

presentation,eventheweirdones

Anothercriterion:

AssumeadistributionPL(x)foreachL AssumePL(x)putsnon'zeromassonallandonlyxinL AssumeinfinitepresentationXdrawni.i.d.fromPL(x) Hmeasure'oneidentifiesLifprobabilityofdrawinganX

fromwhichHidentifiesLis1

[Horning69]:PCFGscanbeidentifiedinthissense

Note:therecanbemisleadingsequences,theyjusthaveto

be(infinitely)unlikely
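Restated in symbols (LaTeX; a direct transcription of the criterion above):

```latex
% H measure-one identifies L: almost every i.i.d. presentation succeeds
\Pr_{x_1, x_2, \ldots \,\overset{\text{i.i.d.}}{\sim}\, P_L}
  \Big[\, \exists\, n_0 \;\; \forall n \ge n_0 : \; H(x_1, \ldots, x_n) = L \,\Big] = 1
```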

Learnability:[Horning69]

  • Proofsketch
  • Assume
  • isarecursivelyenumerablesetofrecursivelanguages(e.g.the

setofPCFGs)

  • Assumeanorderingonallstringsx1 <x2 <…
  • Define:twosequencesAandB ifforallx<xn,xinA⇔ x

inB

  • Definethe E(L,n,m):
  • AllsequencessuchthatthefirstmelementsdonotagreewithLthroughn
  • ThesearethesequenceswhichcontainearlystringsoutsideofL(can’thappen)
  • rfailtocontainalltheearlystringsinL(happenslessasmincreases)
  • Claim:P(E(L,n,m))goesto0asmgoesto∞
  • LetdL(n)bethesmallestmsuchthatP(E)<2'n
  • Letd(n)bethelargestdL(n)infirstnlanguages
  • Learner:afterd(n)pickfirstLthatagreeswithevidencethroughn
  • CanonlyfailforsequenceXifXkeepsshowingupinE(L,n,d(n)),which

happensinfinitelyoftenwithprobabilityzero(weskippedsomedetails)

Learnability

Gold’sresultsayslittleaboutreallearners

(requirementsofIILarewaytoostrong)

Horning’salgorithmiscompletelyimpractical(needs

astronomicalamountsofdata)

Evenmeasure'oneidentificationdoesn’tsayanything

abouttreestructures(orevendensityoverstrings)

Onlytalksaboutlearninggrammaticalsets Stronggenerativevs weakgenerativecapacity

UnsupervisedTagging?

AKApart'of'speechinduction Task:

Rawsentencesin Taggedsentencesout

Obviousthingtodo:

Startwitha(mostly)uniformHMM RunEM Inspectresults

EM for HMMs: Process

  • Alternate between recomputing distributions over hidden variables (the tags) and reestimating parameters
  • Crucial step: we want to tally up how many (fractional) counts of each kind of transition and emission we have under current params, e.g.

      count(t → t′) = Σ_sentences Σ_i P(T_i = t, T_{i+1} = t′ | sentence)

  • Same quantities we needed to train a CRF!
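A minimal numpy sketch of that tally (illustrative, not the lecture's code): forward-backward posteriors for one sentence, accumulated into fractional transition and emission counts.

```python
import numpy as np

def expected_counts(obs, pi, A, B):
    """Forward-backward for one sentence under the current HMM parameters.
    obs: list of word indices; pi: (K,) initial state probs;
    A: (K,K) transition probs; B: (K,V) emission probs.
    Returns fractional transition and emission counts (the E-step tallies)."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    Z = alpha[T - 1].sum()                 # P(obs) under current params
    gamma = alpha * beta / Z               # gamma[t, k] = P(tag_t = k | obs)
    trans = np.zeros((K, K))
    for t in range(T - 1):                 # xi[j, k] = P(tag_t=j, tag_{t+1}=k | obs)
        xi = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        trans += xi / Z
    emit = np.zeros_like(B)
    for t in range(T):
        emit[:, obs[t]] += gamma[t]
    return trans, emit                     # the M-step just renormalizes rows
```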

Merialdo:Setup

Some(discouraging)experiments[Merialdo 94] Setup:

Youknowthesetofallowabletagsforeachword Learnasupervisedmodelonktrainingsentences

LearnP(w|t)ontheseexamples LearnP(t|t'1,t'2)ontheseexamples

Onn>ksentences,re'estimatewithEM

Note:weknowallowedtagsbutnotfrequencies

Merialdo: Results

[Results table lost in extraction: tagging accuracy for supervised initialization vs. subsequent EM re-estimation.]

Distributional Clustering

  • Three main variants on the same idea:
    • Pairwise similarities and heuristic clustering
      • E.g., [Finch and Chater 92]
      • Produces dendrograms
    • Vector space methods
      • E.g., [Schütze 93]
      • Models of ambiguity
    • Probabilistic methods
      • Various formulations, e.g., [Lee and Pereira 99]

[Figures lost in extraction: nearest-neighbor lists and dendrograms.]
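A minimal sketch of the shared idea (Python; the neighbor-count signature and toy data are illustrative): represent each word by its immediate left/right neighbors and compare words by cosine similarity.

```python
from collections import defaultdict

def neighbor_vectors(sentences):
    """Map each word to counts of (direction, neighbor) context features."""
    vecs = defaultdict(lambda: defaultdict(float))
    for sent in sentences:
        padded = ['<s>'] + sent + ['</s>']
        for i, w in enumerate(padded[1:-1], start=1):
            vecs[w][('L', padded[i - 1])] += 1
            vecs[w][('R', padded[i + 1])] += 1
    return vecs

def cosine(u, v):
    dot = sum(u[f] * v.get(f, 0.0) for f in u)
    nu = sum(x * x for x in u.values()) ** 0.5
    nv = sum(x * x for x in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

sents = [['the', 'cat', 'sat'], ['the', 'dog', 'sat'], ['a', 'dog', 'ran']]
V = neighbor_vectors(sents)
print(cosine(V['cat'], V['dog']))   # distributionally similar words score high
```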


A Probabilistic Version?

[Model equations and cluster figures lost in extraction.]
Weakly Supervised Learning

Example document (a classified ad): "Newly remodeled 2 Bdrms/1 Bath, spacious upper unit, located in Hilltop Mall area. Walking distance to shopping, public transportation, schools and park. Paid water and garbage. No dogs allowed."

Prototype Lists

Information extraction (classified ads):
  FEATURE    kitchen, laundry
  LOCATION   near, close
  TERMS      paid, utilities
  SIZE       large, feet
  RESTRICT   cat, smoking

English POS:
  NN president    VBD said    NNS shares   CC and
  TO to           NNP Mr.     PUNC .       JJ new
  CD million      DET the     VBP are      IN of

From [Haghighi and Klein 06]

Context'FreeGrammars

  • Lookslikeacontext'freegrammar.
  • Canmodelatreeasacollectionofcontext'freerewrites(with

probabilitiesattached).

  • =
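A small sketch of that product (Python; the nested-tuple tree encoding is an illustrative choice):

```python
from math import prod

def tree_prob(tree, rule_probs):
    """Probability of a tree as the product of its rewrite probabilities.
    tree: (label, child, ...) nested tuples, leaves are plain strings;
    rule_probs[(lhs, rhs_tuple)] = P(lhs -> rhs | lhs)."""
    if isinstance(tree, str):          # a word; generated by its parent's rule
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return rule_probs[(label, rhs)] * prod(tree_prob(c, rule_probs) for c in children)

t = ('S', ('NP', 'dogs'), ('VP', 'bark'))
probs = {('S', ('NP', 'VP')): 1.0, ('NP', ('dogs',)): 0.5, ('VP', ('bark',)): 0.2}
print(tree_prob(t, probs))   # 0.1
```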

EarlyApproaches:StructureSearch

  • Incrementalgrammarlearning,chunking[Wolff88,Langley82,

manyothers] Canrecoversyntheticgrammars

  • An(extremelygood/lucky)resultofincrementalstructuresearch:
  • Looksgood,…butcan’tparseinthewild.

Idea: Learn PCFGs with EM

  • Classic experiments on learning PCFGs with Expectation-Maximization [Lari and Young, 1990]
    • Full binary grammar over symbols {X₁, …, Xₙ}
    • Parse uniformly/randomly at first
    • Re-estimate rule expectations off of parses
    • Repeat
  • Their conclusion: it doesn't really work.
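The E-step of those experiments needs inside (and outside) scores; a minimal sketch of the inside pass for a CNF grammar (Python; the dictionary-based grammar encoding is illustrative):

```python
from collections import defaultdict

def inside(words, lex, rules):
    """Inside (CKY sum) probabilities for a PCFG in Chomsky normal form.
    lex[(X, w)] = P(X -> w); rules[(X, Y, Z)] = P(X -> Y Z).
    Returns chart with chart[(i, j, X)] = P(X =>* words[i:j])."""
    n = len(words)
    chart = defaultdict(float)
    for i, w in enumerate(words):
        for (X, ww), p in lex.items():
            if ww == w:
                chart[(i, i + 1, X)] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for (X, Y, Z), p in rules.items():
                for k in range(i + 1, j):
                    chart[(i, j, X)] += p * chart[(i, k, Y)] * chart[(k, j, Z)]
    return chart

# Sentence probability is chart[(0, n, 'S')]; paired with outside scores,
# these give the rule expectations that EM re-estimates.
lex = {('N', 'dogs'): 1.0, ('V', 'bark'): 1.0}
rules = {('S', 'N', 'V'): 1.0}
print(inside(['dogs', 'bark'], lex, rules)[(0, 2, 'S')])   # 1.0
```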
Problem: Model Symmetries

  • Symmetries: how does this relate to trees?
  • [Figure lost in extraction: symmetric analyses of the sequence NOUN VERB ADJ NOUN.]

Other Approaches

  • Evaluation: fraction of nodes in gold trees correctly posited in proposed trees (unlabeled recall)
  • Some recent work in learning constituency:
    • [Adriaans, 99]: language grammars aren't general PCFGs
    • [Clark, 01]: mutual-information filters detect constituents, then an MDL-guided search assembles them
    • [van Zaanen, 00]: finds low edit-distance sentence pairs and extracts their differences

  Method            Unlabeled recall
  Adriaans, 1999    16.8
  Clark, 2001       34.6
  van Zaanen, 2000  35.6

Right'BranchingBaseline

Englishtreestendtoberight'branching,not

balanced

Asimple(English'specific)baselineistochoosethe

rightchainstructureforeachsentence

  • 35.6

vanZaanen,00 46.4 Right'Branch
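A sketch of the baseline (Python; bracketings as nested tuples):

```python
def right_branching(words):
    """Right-chain bracketing: (w1 (w2 (w3 w4))) -- every non-initial
    suffix of the sentence is posited as a constituent."""
    if len(words) <= 1:
        return words[0] if words else None
    return (words[0], right_branching(words[1:]))

print(right_branching(['dogs', 'chase', 'cats', 'here']))
# ('dogs', ('chase', ('cats', 'here')))
```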

Idea: Distributional Syntax?

  • Can we use distributional clustering for learning syntax? [Harris, 51]
  • Key objects: a span (e.g., an NP, PP, VP, or S) and its linear context (the material immediately to its left and right)
  • [Figure lost in extraction: example spans and their contexts.]

Problem:IdentifyingConstituents

  • Distributionalclassesareeasytofind…
  • …butfiguringoutwhichareconstituentsishard.

Principal Component2 Principal Component2 PrincipalComponent1 PrincipalComponent1

A Nested Distributional Model

  • We'd like a model that:
    • Ties spans to linear contexts (like distributional clustering)
    • Considers only proper tree structures (like a PCFG model)
    • Has no symmetries to break (like a dependency model)
  • Constituent-Context Model (CCM)

"#

♦ ♦

  • "

♦♦" " ♦$$" " $$♦" " $$♦"

+ +

# %

  • %

%

  • χ

φ

− −

# %

  • %

%

  • χ

φ
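A small sketch of the model's basic bookkeeping (Python; illustrative): enumerating every span's yield χ and linear context φ, the two quantities the CCM scores as constituent or distituent.

```python
def spans_and_contexts(tags):
    """Enumerate every span of a tag sequence with its yield and its
    linear context ('#' marks the sentence boundary)."""
    padded = ['#'] + list(tags) + ['#']
    n = len(tags)
    out = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            yield_ = tuple(tags[i:j])                # chi: the span itself
            context = (padded[i], padded[j + 1])     # phi: tag before i, tag after j
            out.append(((i, j), yield_, context))
    return out

for span, y, c in spans_and_contexts(['DET', 'N', 'V']):
    print(span, y, c)
# e.g. (0, 2) ('DET', 'N') ('#', 'V')
```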


Results: Constituency

  Method                     Unlabeled recall
  Right-Branch               70.0
  CCM [Klein & Manning 02]   81.6

Spectrum of Systematic Errors

  • Disagreements span a spectrum from "CCM analysis better" to "treebank analysis better" (~25%):

  Analysis      CCM Right?
  Inside NPs    Yes
  Possessives   Maybe
  Verb groups   No

  [Example bracketings lost in extraction.]

SyntacticParsing

Parsingassignsstructurestosentences. Dependencystructuregivesattachments.

WHAT WHEN WHO

Idea:LexicalAffinityModels

Wordsselectotherwordsonsyntacticgrounds Linkuppairswithhighmutualinformation

[Yuret,1998]: Greedylinkage [Paskin,2001]: Iterativere'estimationwithEM

Evaluation:comparelinkedpairstoagoldstandard

  • 39.7

Accuracy Method Paskin,2001 41.7 Random
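A minimal sketch of the affinity score (Python; sentence-internal co-occurrence is one simple choice and not necessarily the papers' exact definition):

```python
import math
from collections import Counter

def pmi_table(sentences):
    """PMI(u, v) = log [ P(u, v) / (P(u) P(v)) ] over words co-occurring
    in the same sentence (unordered pairs)."""
    word, pair = Counter(), Counter()
    n_words = n_pairs = 0
    for sent in sentences:
        for w in sent:
            word[w] += 1
            n_words += 1
        for i in range(len(sent)):
            for j in range(i + 1, len(sent)):
                pair[tuple(sorted((sent[i], sent[j])))] += 1
                n_pairs += 1
    return {
        (u, v): math.log((c / n_pairs) / ((word[u] / n_words) * (word[v] / n_words)))
        for (u, v), c in pair.items()
    }
```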

Problem:Non'SyntacticAffinity

Mutualinformationbetweenwordsdoesnot

necessarilyindicatesyntacticselection.

& ' &

Idea:WordClasses

Individualwordslikecongressareentwinedwith

semanticfactsabouttheworld.

Syntacticclasses,likeNOUN andADVERB are

bleachedofword'specificsemantics.

Automaticwordclassesmorelikelytolooklike

DAYS'OF'WEEKor PERSON'NAME.

Wecouldbuilddependencymodelsoverword

classes.[.CarrollandCharniak,1992] NOUNADVERBVERBDETPARTICIPLENOUN


Problems:WordClassModels

Issues:

Toosimpleamodel– doesn’tworkmuchbettersupervised Norepresentationofvalence(numberofarguments)

NOUNNOUNVERB & NOUN NOUNVERB &

41.7 Random 53.2 44.7 AdjacentWords CarrollandCharniak,92

Local Representations

  Model                   Classes?  Distance?  Local factor
  Paskin 01               no        no         [lost in extraction]
  Carroll & Charniak 92   yes       no         [lost in extraction]
  DMV [K&M 04]            yes       yes        [lost in extraction]

  Method          Accuracy
  Adjacent Words  55.9
  DMV             62.7
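The local factors themselves did not survive extraction; a plausible rendering, reconstructed from the cited papers rather than from the slide (LaTeX):

```latex
% Paskin 01: word-to-word attachment
P(\text{arg} \mid \text{head})
% Carroll & Charniak 92: class-to-class attachment
P(c_{\text{arg}} \mid c_{\text{head}})
% DMV [K&M 04]: class attachment with direction, plus valence via stopping
P(c_{\text{arg}} \mid c_{\text{head}}, \mathit{dir}) \;\times\;
P(\mathit{stop} \mid c_{\text{head}}, \mathit{dir}, \mathit{adj})
```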

CommonErrors:Dependency

DET← N 3474

N'PROP← N'PROP

2096 NUM→ NUM 760 PREP← DET 735 DET← N'PL 696 DET→ PREP 627 DET→ V'PAST 470 DET→ V'PRES 420 DET→ N 3079

N'PROP→ N'PROP

1898 PREP← N 838 N→ V'PRES 714 DET→ N'PL 672 N← PREP 669 NUM← NUM 54 N→ V'PAST 54

Underproposed Dependencies Overproposed Constituents Overproposed Dependencies

Results:Dependencies

Situationsofar:

Task:unstructuredtextin,wordpairsout Previousresultswerebelowbaseline Wemodeledwordclasses[.Carroll&Charniak92] Weaddedamodelofdistance[.Collins99] Resultingmodelissubstantiallyoverbaseline …butwecandomuchbetter

55.9 AdjacentWords 62.7 DMV

Results:CombinedModels

SupervisedPCFGconstituencyrecallisat92.8 Qualitativeimprovements

Subject'verbgroupsgone,modifierplacementimproved

Random 45.6 DMV 62.7 CCM+DMV 64.7 Random 39.4 CCM 81.0 CCM+DMV 88.0

How General Is This?

  Language (sentences)       Random Baseline   CCM+DMV
  English  (7422 sentences)  39.4              88.0
  German   (2175 sentences)  49.6              89.7
  Chinese  (2473 sentences)  35.5              46.7

  [Additional results whose placement was lost in extraction: DMV 54.2, CCM+DMV 60.0.]