 
Weakly-Supervised Bayesian Learning of a CCG Supertagger
Dan Garrette, Chris Dyer, Jason Baldridge, Noah A. Smith
Type-Level Supervision
Type-Level Supervision • Unannotated text • Incomplete tag dictionary: word ↦ {tags}
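As a rough illustration of what an incomplete tag dictionary provides, the sketch below looks up the candidate tags for a word. The fallback for words missing from the dictionary (allow every known tag) is an assumption made for this example, not something stated on the slides, and the names `candidate_tags`, `tag_dict`, and `all_tags` are hypothetical.

```python
from typing import Dict, Set

def candidate_tags(word: str, tag_dict: Dict[str, Set[str]], all_tags: Set[str]) -> Set[str]:
    # Words in the dictionary are limited to their listed tags.
    # Assumption for this sketch: unlisted words may take any known tag.
    return tag_dict.get(word, all_tags)

# Toy dictionary: "sleeps" is deliberately left out to show the fallback.
tag_dict = {"the": {"np/n"}, "dog": {"n"}}
all_tags = {"np/n", "n", "np", "s\\np"}

known = candidate_tags("the", tag_dict, all_tags)
unknown = candidate_tags("sleeps", tag_dict, all_tags)
```

The smaller the dictionary, the more words fall through to the full tag set, which is where a large supertag inventory makes learning hard.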
Type-Level Supervision Used for POS tagging for 20+ years [Kupiec, 1992] [Merialdo, 1994]
Type-Level Supervision Good POS tagger performance even with low supervision [Das & Petrov 2011] [Garrette & Baldridge 2013] [Garrette et al. 2013]
Combinatory Categorial Grammar (CCG)
CCG • Every word token is associated with a category • Categories combine to form the categories of constituents [Steedman, 2000] [Steedman and Baldridge, 2011]
CCG the := np/n, dog := n; np/n n ⇒ np (the dog)
CCG dogs := np, sleep := s\np; np s\np ⇒ s (dogs sleep)
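The two combinations above (np/n n ⇒ np and np s\np ⇒ s) are instances of forward and backward application. The following is a minimal sketch of those two rules only; the `Complex` class and the `parse` and `combine` helpers are names invented for this example, and the other CCG combinators (composition, type-raising) are left out.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Complex:
    result: "Category"
    slash: str          # "/" seeks its argument to the right, "\" to the left
    arg: "Category"

Category = Union[str, Complex]   # an atom such as "np", or a complex category

def parse(text: str) -> Category:
    # Parse a category string such as (s\np)/np; slashes associate to the left.
    text = text.replace(" ", "")
    pos = 0

    def operand() -> Category:
        nonlocal pos
        if text[pos] == "(":
            pos += 1                     # skip "("
            inner = expression()
            pos += 1                     # skip ")"
            return inner
        start = pos
        while pos < len(text) and text[pos] not in "/\\()":
            pos += 1
        return text[start:pos]

    def expression() -> Category:
        nonlocal pos
        left = operand()
        while pos < len(text) and text[pos] in "/\\":
            slash = text[pos]
            pos += 1
            left = Complex(left, slash, operand())
        return left

    return expression()

def combine(left: Category, right: Category) -> Optional[Category]:
    # Forward application: X/Y followed by Y gives X.
    if isinstance(left, Complex) and left.slash == "/" and left.arg == right:
        return left.result
    # Backward application: Y followed by X\Y gives X.
    if isinstance(right, Complex) and right.slash == "\\" and right.arg == left:
        return right.result
    return None

the_dog = combine(parse("np/n"), parse("n"))        # the dog
dogs_sleep = combine(parse("np"), parse("s\\np"))   # dogs sleep
```

`combine` returns `None` when neither rule applies, for example for np/n followed by np.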
POS vs. Supertags • POS: the/DT dog/NN sleeps/VBZ, with constituents NP, VP, S • Supertags: the := np/n, dog := n, sleeps := s\np, with constituents np, s
Supertagging Type-supervised learning for supertagging is much more difficult than for POS • Penn Treebank POS: 48 tags • CCGBank supertags: 1,239 tags
CCG The grammar formalism itself can be used to guide learning
CCG Supertagging
CCG Supertagging • Sequence tagging problem, like POS-tagging • Building block for grammatical parsing
Supertagging “almost parsing” [Bangalore and Joshi 1999]
Why Supertagging? the := np/n, lazy := n/n, dog := n, sleeps := s\np
Why Supertagging? Derivation: n/n n ⇒ n (lazy dog); np/n n ⇒ np (the lazy dog); np s\np ⇒ s (the lazy dog sleeps)
CCG Supertagging the := np/n, lazy := n/n, dog := n, sleeps := s\np
CCG Supertagging the := np/n, lazy := ?, dog := n
Principle #1: Prefer Connections. the := np/n, lazy := np, dog := n is dispreferred (np/n cannot combine with np, and np cannot combine with n)
Supertags vs. POS • Supertags (the := np/n, dog := n, sleeps := s\np): universal, intrinsic grammar properties • POS (the/DT dog/NN sleeps/VBZ): all relationships must be learned
Principle #2: Prefer Simplicity. the := np/n, lazy := (np\(np/n))/n, dog := n
Prefer Simplicity • buy := (s_b\np)/np appears 342 times in CCGbank, e.g. “Opponents don't buy such arguments.” • buy := (((s_b\np)/pp)/pp)/np appears once: “Tele-Communications agreed to buy half of Showtime Networks from Viacom for $225 million.” (“from Viacom” and “for $225 million” are the two pp arguments)
Weighted Tag Grammar (p̄ = 1 − p)
A → a, a ∈ {s, np, n, …}: p_atom(a) × p_term
A → B / B: p̄_term × p_fwd × p_mod
A → B / C: p̄_term × p_fwd × p̄_mod
A → B \ B: p̄_term × p̄_fwd × p_mod
A → B \ C: p̄_term × p̄_fwd × p̄_mod
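To make the effect of the weighted tag grammar concrete, here is a small sketch that scores a category by multiplying rule weights down its structure. It rests on several assumptions that the slide text does not spell out: the non-terminating, backward, and non-modifier cases use the complements 1 − p_term, 1 − p_fwd, and 1 − p_mod; sub-categories are scored recursively; and every numeric value is made up for illustration. The function name `category_prior` is hypothetical.

```python
from typing import Dict, Tuple, Union

# A category is an atom ("np") or a triple (result, slash, argument).
Category = Union[str, Tuple["Category", str, "Category"]]

# Illustrative parameter values only; none of these numbers come from the slides.
P_TERM = 0.7   # probability of stopping at an atomic category
P_FWD = 0.5    # probability that a complex category uses a forward slash
P_MOD = 0.1    # probability that a complex category is a modifier (B/B or B\B)
P_ATOM: Dict[str, float] = {"s": 0.3, "np": 0.3, "n": 0.3, "pp": 0.1}

def category_prior(cat: Category) -> float:
    if isinstance(cat, str):
        return P_TERM * P_ATOM[cat]
    result, slash, arg = cat
    weight = 1.0 - P_TERM
    weight *= P_FWD if slash == "/" else 1.0 - P_FWD
    if result == arg:
        # Modifier: the repeated sub-category is scored once (an assumption).
        return weight * P_MOD * category_prior(result)
    return weight * (1.0 - P_MOD) * category_prior(result) * category_prior(arg)

# The two categories for "buy" from the Prefer Simplicity slide, features omitted.
buy_common = (("s", "\\", "np"), "/", "np")
buy_rare = (((("s", "\\", "np"), "/", "pp"), "/", "pp"), "/", "np")

common_score = category_prior(buy_common)
rare_score = category_prior(buy_rare)
```

Because every extra slash multiplies in more factors below 1, deeper categories such as the rare "buy" entry receive a lower prior than the common one, which is the "prefer simplicity" behaviour.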
CCG Supertagging np np/n (np\(np/n))/n n n/n the lazy dog
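HMM Transition Prior P(t → u) = λ · P(u) + (1 − λ) · P(t → u), where the P(u) term favors simple categories (“simple is good”) and the right-hand P(t → u) term favors categories that can combine with their neighbor (“connecting is good”)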
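The mixture above can be sketched as follows. This is only an illustration under stated assumptions: the connection term is modelled as a normalized weight that is 1 for combinable pairs and a small `penalty` otherwise, the simplicity scores are supplied as a plain dictionary, and the `combinable` table is hand-written for the example. None of these names or values come from the slides.

```python
from typing import Callable, Dict, Sequence

def transition_prior(
    prev_tag: str,
    tags: Sequence[str],
    simplicity: Dict[str, float],
    can_combine: Callable[[str, str], bool],
    lam: float = 0.5,
    penalty: float = 0.1,
) -> Dict[str, float]:
    # "Simple is good": normalize the per-category simplicity scores.
    total_simple = sum(simplicity[u] for u in tags)
    # "Connecting is good": down-weight tags that cannot combine with prev_tag.
    connect = {u: 1.0 if can_combine(prev_tag, u) else penalty for u in tags}
    total_connect = sum(connect.values())
    return {
        u: lam * simplicity[u] / total_simple
        + (1.0 - lam) * connect[u] / total_connect
        for u in tags
    }

# Hypothetical toy inputs for the word after "the" (tagged np/n).
tags = ["n", "np", "n/n"]
simplicity = {"n": 0.4, "np": 0.4, "n/n": 0.2}
combinable = {("np/n", "n"), ("np/n", "n/n")}   # application and composition
prior = transition_prior("np/n", tags, simplicity, lambda t, u: (t, u) in combinable)
```

In this toy setup n and np are equally simple, so the connection term alone is what makes n more likely than np after np/n.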
Type-Supervised Learning • Unlabeled corpus and tag dictionary (same as POS tagging) • Universal properties of the CCG formalism
Training
Posterior Inference Forward-Filter Backward-Sample (FFBS) [Carter and Kohn, 1996]
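For readers unfamiliar with FFBS, the sketch below samples one tag sequence from an HMM with fixed parameters: a forward pass computes filtered probabilities, then tags are sampled from the last position backwards. It covers only this sampling step, not the resampling of model parameters, and the function name `ffbs`, the dictionary-based parameter layout, and all toy numbers are assumptions made for the example.

```python
import random
from typing import Dict, List, Sequence, Tuple

def ffbs(
    words: Sequence[str],
    candidates: Dict[str, List[str]],
    init: Dict[str, float],
    trans: Dict[Tuple[str, str], float],
    emit: Dict[Tuple[str, str], float],
    rng: random.Random,
) -> List[str]:
    # Forward filter: alphas[i][u] is proportional to P(tag_i = u | words[0..i]).
    alphas: List[Dict[str, float]] = []
    for i, word in enumerate(words):
        alpha = {}
        for u in candidates[word]:
            if i == 0:
                inbound = init.get(u, 0.0)
            else:
                inbound = sum(a * trans.get((t, u), 0.0) for t, a in alphas[-1].items())
            alpha[u] = inbound * emit.get((u, word), 0.0)
        norm = sum(alpha.values())
        alphas.append({u: a / norm for u, a in alpha.items()})

    # Backward sample: draw the last tag, then each earlier tag given its successor.
    sampled = [""] * len(words)
    last = alphas[-1]
    sampled[-1] = rng.choices(list(last), weights=list(last.values()))[0]
    for i in range(len(words) - 2, -1, -1):
        weights = {t: a * trans.get((t, sampled[i + 1]), 0.0) for t, a in alphas[i].items()}
        sampled[i] = rng.choices(list(weights), weights=list(weights.values()))[0]
    return sampled

# Toy model: the tag dictionary leaves "dog" ambiguous between n and np.
words = ["the", "dog", "sleeps"]
candidates = {"the": ["np/n"], "dog": ["n", "np"], "sleeps": ["s\\np"]}
init = {"np/n": 1.0}
trans = {("np/n", "n"): 0.9, ("np/n", "np"): 0.1, ("n", "s\\np"): 1.0, ("np", "s\\np"): 1.0}
emit = {("np/n", "the"): 1.0, ("n", "dog"): 0.5, ("np", "dog"): 0.5, ("s\\np", "sleeps"): 1.0}

rng = random.Random(0)
samples = [ffbs(words, candidates, init, trans, emit, rng) for _ in range(200)]
```

Each call returns one tag per word, restricted to the candidates the tag dictionary allows; repeated calls give different sequences in proportion to their posterior probability under the fixed parameters.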
Posterior Inference [Diagram: unlabeled data and a tag dictionary supply candidate supertags (np/n, n/n, n, np, s\np, (s\np)/np, …) for the example words “the”, “lazy”, “wander”, “dogs”]
Posterior Inference [Diagram, shown over several animation steps: priors and the HMM applied to the candidate supertags (np/n, n/n, n, np, s\np, (s\np)/np, …) of the example words “the”, “lazy”, “wander”, “dogs”]
Experiments
Baldridge 2008 Use universal properties of CCG to initialize EM • Simpler definition of category complexity • No corpus-specific information
English Supertagging [Bar chart: tagging accuracy (0 to 100), Baldridge '08 vs. Ours, by tag dictionary pruning cutoff (0.1, 0.01, 0.001, none). Bar labels as extracted: 80, 80, 78, 73, 67, 55, 51, 41]
Chinese Supertagging [Bar chart: tagging accuracy (0 to 100), Baldridge '08 vs. Ours, by tag dictionary pruning cutoff (0.1, 0.01, 0.001, none). Bar labels as extracted: 69, 66, 62, 56, 49, 43, 33, 28]
Italian Supertagging [Bar chart: tagging accuracy (0 to 100), Baldridge '08 vs. Ours, by tag dictionary pruning cutoff (0.1, 0.01, 0.001, none). Bar labels as extracted: 54, 53, 47, 46, 45, 36, 33, 32]
Code Available GitHub repository linked from my website
Conclusion Combining annotation exploitation with universal grammatical knowledge yields good models from weak supervision