

  1. Features of Statistical Parsers. Mark Johnson, Brown Laboratory for Linguistic Information Processing. CoNLL 2005.

  2. Features of Statistical Parsers. Confessions of a bottom-feeder: Dredging in the Statistical Muck. Mark Johnson, Brown Laboratory for Linguistic Information Processing. CoNLL 2005.

  3. Features of Statistical Parsers. Confessions of a bottom-feeder: Dredging in the Statistical Muck. Mark Johnson, Brown Laboratory for Linguistic Information Processing. CoNLL 2005. With much help from Eugene Charniak, Michael Collins and Matt Lease.

  4. Outline
     • Goal: find features for identifying good parses
     • Why is this difficult with generative statistical models?
     • Reranking framework
     • Conditional versus joint estimation
     • Features for parse ranking
     • Estimation procedures
     • Experimental set-up
     • Feature selection and evaluation

  5. Features for accurate parsing
     • Accurate parsing requires good features ⇒ we need a flexible method for evaluating a wide range of features
     • The parse ranking framework is the current best method for doing this:
       + works with virtually any kind of representation
       + features can encode virtually any kind of information (syntactic, lexical semantics, prosody, etc.)
       + can exploit the currently best-available parsers
       − efficient algorithms are hard(er) to design and implement
       − it is something of a fishing expedition

  6. Why not a generative statistical parser?
     • Statistical parsers (Charniak, Collins) generate parses node by node
     • Each step is conditioned on the structure already generated
       (the slide illustrates this with the tree (S (NP (PRP He)) (VP (VBD raised) (NP (DT the) (NN price))) (. .)))
     • Encoding dependencies is as difficult as designing a feature-passing grammar (GPSG)
     • Smoothing interacts in mysterious ways with these encodings
     • Conditional estimation should produce better parsers with our current lousy models

  7. Linear ranking framework
     • Generate n candidate parses T_c(s) for each sentence s with an n-best parser
     • Map each parse t ∈ T_c(s) to a real-valued feature vector f(t) = (f_1(t), ..., f_m(t))
     • Each feature f_j is associated with a weight w_j
     • The highest scoring parse
         t̂ = argmax_{t ∈ T_c(s)} w · f(t)
       is predicted correct
     (The slide depicts the pipeline: sentence s → n-best parser → parses t_1, ..., t_n → apply feature functions → feature vectors f(t_1), ..., f(t_n) → linear combination → parse scores w · f(t_1), ..., w · f(t_n) → argmax → "best" parse for s.)

  8. Linear ranking example
     • Feature weights w = (−1, 2, 1)

       Candidate parse tree t   features f(t)   parse score w · f(t)
       t_1                      (1, 3, 2)       7
       t_2                      (2, 2, 1)       3
       ...                      ...             ...

     • The parser designer specifies the feature functions f = (f_1, ..., f_m)
     • The feature weights w = (w_1, ..., w_m) specify each feature's "importance"
     • The n-best parser produces trees T_c(s) for each sentence s
     • The feature functions f apply to each tree t ∈ T_c(s), producing feature values f(t) = (f_1(t), ..., f_m(t))
     • Return the highest scoring tree
         t̂(s) = argmax_{t} w · f(t) = argmax_{t} Σ_{j=1}^{m} w_j f_j(t)
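
As a concrete illustration, here is a minimal Python sketch of the reranking step above. The function names are our own; the weight vector and feature values are the toy numbers from the slide.

```python
# Minimal sketch of linear reranking: score each candidate parse by the dot
# product w . f(t) and return the highest-scoring one.
def score(w, f_t):
    """Linear score w . f(t) of one candidate parse."""
    return sum(w_j * f_j for w_j, f_j in zip(w, f_t))

def rerank(w, candidates):
    """Return the (parse, feature-vector) pair with the highest score."""
    return max(candidates, key=lambda tf: score(w, tf[1]))

w = (-1, 2, 1)
candidates = [("t1", (1, 3, 2)),   # score = -1*1 + 2*3 + 1*2 = 7
              ("t2", (2, 2, 1))]   # score = -1*2 + 2*2 + 1*1 = 3
best_parse, best_features = rerank(w, candidates)
print(best_parse, score(w, best_features))   # -> t1 7
```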

  9. Linear ranking, statistics and machine learning
     • Many models define the best candidate t̂ in terms of a linear combination of feature values w · f(t):
       – Exponential, log-linear, Gibbs and MaxEnt models
           P(t) = (1/Z) exp(w · f(t)),   Z = Σ_{t ∈ T} exp(w · f(t))   (partition function)
           log P(t) = w · f(t) − log Z
       – Perceptron algorithm (including the averaged version)
       – Support Vector Machines
       – Boosted decision stumps
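
A small sketch (ours, not from the talk) of how linear scores w · f(t) become an exponential-model distribution over a finite candidate set, following the formulas on this slide:

```python
import math

# Turn linear scores into P(t) = exp(w . f(t)) / Z over a finite candidate set.
def log_linear_probs(w, feature_vectors):
    scores = [sum(wj * fj for wj, fj in zip(w, f)) for f in feature_vectors]
    log_Z = math.log(sum(math.exp(s) for s in scores))   # log of the partition function
    return [math.exp(s - log_Z) for s in scores]          # P(t) = exp(w . f(t) - log Z)

# Toy example reusing the numbers from the previous slide; the probabilities sum to 1.
print(log_linear_probs((-1, 2, 1), [(1, 3, 2), (2, 2, 1)]))
```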

  10. PCFGs are exponential models
     • f_j(t) = number of times the j-th rule is used in t
     • w_j = log p_j, where p_j is the probability of the j-th rule
     • Example: for the tree (S (NP rice) (VP grows)) and the rules (S → NP VP, NP → rice, VP → grows, VP → grow, NP → bananas), f(t) = (1, 1, 1, 0, 0)
     • P_PCFG(t) = Π_j p_j^{f_j(t)} = Π_j exp(w_j)^{f_j(t)} = exp(Σ_j w_j f_j(t)) = exp(w · f(t))
     • So a PCFG is just a special kind of exponential model with Z = 1.
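
To make the identity concrete, here is a short sketch with made-up rule probabilities checking that exp(w · f(t)) with w_j = log p_j equals the usual product of rule probabilities:

```python
import math

# A PCFG as an exponential model: one feature per rule (its count in the tree),
# weight w_j = log p_j.  The rule probabilities below are hypothetical.
rules  = ["S -> NP VP", "NP -> rice", "VP -> grows", "VP -> grow", "NP -> bananas"]
probs  = [1.0,          0.3,          0.6,           0.4,          0.7]  # made-up p_j
counts = [1,            1,            1,             0,            0]    # f(t) for "rice grows"

weights   = [math.log(p) for p in probs]                            # w_j = log p_j
p_product = math.prod(p ** c for p, c in zip(probs, counts))        # prod_j p_j^{f_j(t)}
p_exp     = math.exp(sum(w * c for w, c in zip(weights, counts)))   # exp(w . f(t))
assert abs(p_product - p_exp) < 1e-9                                # the two agree
```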

  11. Features in linear ranking models
     • Features can be any real-valued function of the parse t and the sentence s:
       – counts of the number of times a particular structure appears in t
       – log probabilities from other models
         ∗ log P_c(t) is our most useful feature!
         ∗ this generalizes the reference distributions of MaxEnt models
     • Subtracting a constant c(s) from a feature's value doesn't affect the difference between parse scores in a linear model:
         w · (f(t_1) − c(s)) − w · (f(t_2) − c(s)) = w · f(t_1) − w · f(t_2)
       – features that don't vary over T_c(s) are useless
       – subtracting the most frequently occurring value c_j(s) of each feature f_j in sentence s ⇒ sparser feature vectors (see the sketch below)
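
A possible sketch of the centering trick, assuming the features are plain numeric counts; the helper names are illustrative:

```python
from collections import Counter

# For each feature, subtract its most frequent value over the sentence's
# candidate set T_c(s).  Score differences are unchanged, but the vectors
# become sparser because the common value maps to zero.
def center_features(feature_vectors):
    m = len(feature_vectors[0])
    modes = [Counter(f[j] for f in feature_vectors).most_common(1)[0][0]
             for j in range(m)]                     # c_j(s): most frequent value of f_j
    return [tuple(f[j] - modes[j] for j in range(m)) for f in feature_vectors]

print(center_features([(1, 3, 2), (2, 3, 2), (2, 3, 5)]))
# -> [(-1, 0, 0), (0, 0, 0), (0, 0, 3)]
```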

  12. Getting the feature weights

       s            f(t⋆(s))    {f(t) : t ∈ T_c(s), t ≠ t⋆(s)}
       sentence 1   (1, 3, 2)   (2, 2, 3)  (3, 1, 5)  (2, 6, 3)
       sentence 2   (7, 2, 1)   (2, 5, 5)
       sentence 3   (2, 4, 2)   (1, 1, 7)  (7, 2, 1)
       ...          ...         ...

     • The n-best parser produces trees T_c(s) for each sentence s
     • The treebank gives the correct tree t⋆(s) ∈ T_c(s) for sentence s
     • Feature functions f apply to each tree t ∈ T_c(s), producing feature values f(t) = (f_1(t), ..., f_m(t))
     • A machine learning algorithm selects the feature weights w to prefer t⋆(s) (e.g., so that w · f(t⋆(s)) is greater than w · f(t′) for the other t′ ∈ T_c(s)); a perceptron-style sketch follows below
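
Slide 9 lists the (averaged) perceptron among the learners that fit this framework; below is a minimal, non-averaged sketch of how such a learner could pick w from data laid out as in the table above. The names and the training-loop details are ours, not the talk's:

```python
# Structured-perceptron-style reranker training.  Each training item pairs the
# gold tree's feature vector with the feature vectors of all candidates in T_c(s).
def dot(w, f):
    return sum(wj * fj for wj, fj in zip(w, f))

def perceptron_rerank_train(data, m, epochs=10):
    w = [0.0] * m
    for _ in range(epochs):
        for f_gold, candidate_fs in data:
            f_pred = max(candidate_fs, key=lambda f: dot(w, f))  # current best candidate
            if f_pred != f_gold:                                 # mistake: move w toward the gold tree
                w = [wj + gj - pj for wj, gj, pj in zip(w, f_gold, f_pred)]
    return w

# Toy data from the table above (gold vector included among the candidates).
data = [((1, 3, 2), [(1, 3, 2), (2, 2, 3), (3, 1, 5), (2, 6, 3)]),
        ((7, 2, 1), [(7, 2, 1), (2, 5, 5)]),
        ((2, 4, 2), [(2, 4, 2), (1, 1, 7), (7, 2, 1)])]
print(perceptron_rerank_train(data, m=3))
```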

  13. Conditional ML estimation of w
     • Conditional ML estimation selects w to make t⋆(s) as likely as possible compared to the other trees in T_c(s)
     • Same as conditional MaxEnt estimation:
         P_w(t | s) = (1/Z_w(s)) exp(w · f(t))                (exponential model)
         Z_w(s) = Σ_{t′ ∈ T_c(s)} exp(w · f(t′))
         D = ((s_1, t⋆_1), ..., (s_n, t⋆_n))                  (treebank training data)
         L_D(w) = Π_{i=1}^{n} P_w(t⋆_i | s_i)                 (conditional likelihood of D)
         ŵ = argmax_w L_D(w)
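
A sketch of the corresponding training objective, the negative log conditional likelihood computed over the n-best lists; the helper names and toy data are illustrative:

```python
import math

# -log L_D(w) = -sum_i log P_w(t*_i | s_i), where Z_w(s) sums over T_c(s) only.
def dot(w, f):
    return sum(wj * fj for wj, fj in zip(w, f))

def neg_cond_log_likelihood(w, data):
    """data: list of (gold feature vector, list of candidate feature vectors)."""
    nll = 0.0
    for f_gold, candidate_fs in data:
        log_Z = math.log(sum(math.exp(dot(w, f)) for f in candidate_fs))  # Z_w(s)
        nll -= dot(w, f_gold) - log_Z                                     # -log P_w(t* | s)
    return nll

data = [((1, 3, 2), [(1, 3, 2), (2, 2, 3), (3, 1, 5)])]
print(neg_cond_log_likelihood([0.0, 0.0, 0.0], data))   # = log 3 at w = 0
```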

  14. (Joint) MLE for exponential models is hard
         D = (t⋆_1, ..., t⋆_n)
         L_D(w) = Π_{i=1}^{n} P_w(t⋆_i)
         ŵ = argmax_w L_D(w)
         P_w(t) = (1/Z_w) exp(w · f(t)),   Z_w = Σ_{t′ ∈ T} exp(w · f(t′))
     • Joint MLE selects w to make each t⋆_i as likely as possible
     • T is the set of all possible parses for all possible strings
     • T is infinite ⇒ it cannot be enumerated ⇒ Z_w cannot be calculated
     • For a PCFG, Z_w and hence ŵ are easy to calculate, but . . .
     • in general ∂L_D/∂w_j and Z_w are intractable both analytically and numerically
     • Abney (1997) suggests a Monte-Carlo calculation method

  15. Conditional MLE is easier
     • The conditional likelihood of w is the conditional probability of the hidden part of the data (the syntactic structure t⋆) given its visible part (the yield or terminal string s)
     • The conditional likelihood can be numerically optimized because T_c(s) can be enumerated (by a parser)
         D = ((t⋆_1, s_1), ..., (t⋆_n, s_n))
         L_D(w) = Π_{i=1}^{n} P_w(t⋆_i | s_i)
         ŵ = argmax_w L_D(w)
         P_w(t | s) = (1/Z_w(s)) exp(w · f(t)),   Z_w(s) = Σ_{t′ ∈ T_c(s)} exp(w · f(t′))

  16. Conditional vs joint estimation
     • Joint MLE maximizes the probability of the training trees and strings, P(t, s) = P(t | s) P(s)
       – Generative statistical parsers usually use joint MLE
       – Joint MLE is simple to compute (relative frequency)
     • Conditional MLE maximizes the probability of trees given strings, P(t | s)
       – Conditional estimation uses less information from the data
       – it learns nothing from the distribution of strings
       – it ignores unambiguous sentences (!)
     • Joint MLE should be better (lower variance) if your model correctly predicts the distribution of parses and strings
       – Any good probabilistic models of semantics and discourse?

  17. Conditional vs joint MLE for PCFGs
     (The slide shows a toy treebank of trees, including the two PP-attachment analyses of "see people with telescopes", annotated with the products of rule probabilities each analysis receives under the two columns of estimates below.)

       Rule          count   rel freq   better values
       VP → V        100     100/105    4/7
       VP → V NP     3       3/105      1/7
       VP → VP PP    2       2/105      2/7
       NP → N        6       6/7        6/7
       NP → NP PP    1       1/7        1/7
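
The "rel freq" column is just joint MLE by relative frequency: each rule's count divided by the total count of rules with the same left-hand side. A short sketch reproducing that column from the counts (the data structures are our own):

```python
from collections import defaultdict
from fractions import Fraction

# Rule counts from the table above, keyed by (lhs, rhs).
counts = {("VP", ("V",)): 100, ("VP", ("V", "NP")): 3, ("VP", ("VP", "PP")): 2,
          ("NP", ("N",)): 6,   ("NP", ("NP", "PP")): 1}

lhs_totals = defaultdict(int)
for (lhs, rhs), c in counts.items():
    lhs_totals[lhs] += c                       # e.g. VP: 105, NP: 7

rel_freq = {rule: Fraction(c, lhs_totals[rule[0]]) for rule, c in counts.items()}
print(rel_freq[("VP", ("V",))])                # 20/21, i.e. 100/105
print(rel_freq[("NP", ("N",))])                # 6/7
```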

  18. Regularization
     • Overlearning ⇒ add a regularizer R that penalizes "complex" models
     • Useful with a wide range of objective functions:
         ŵ = argmin_w Q(w) + R(w)
         Q(w) = − log L_D(w)               (objective function)
         R(w) = c Σ_j |w_j|^p              (regularizer)
         L_D(w) = Π_i P_w(t⋆_i | s_i)
     • p = 2 is known as the Gaussian prior
     • p = 1 is known as the Laplacian or exponential prior
       – gives sparse solutions
       – requires special care in optimization (Kazama and Tsujii, 2003)
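
A self-contained sketch of the regularized objective Q(w) + R(w); the constant c, the toy data and the helper names are arbitrary choices of ours:

```python
import math

# Q(w) = -log L_D(w) (negative conditional log likelihood over n-best lists),
# R(w) = c * sum_j |w_j|^p.  p = 2 corresponds to a Gaussian prior, p = 1 to a
# Laplacian prior (which tends to give sparse solutions).
def dot(w, f):
    return sum(wj * fj for wj, fj in zip(w, f))

def neg_cond_log_likelihood(w, data):
    nll = 0.0
    for f_gold, candidate_fs in data:
        log_Z = math.log(sum(math.exp(dot(w, f)) for f in candidate_fs))
        nll -= dot(w, f_gold) - log_Z
    return nll

def regularized_objective(w, data, c=1.0, p=2):
    return neg_cond_log_likelihood(w, data) + c * sum(abs(wj) ** p for wj in w)

data = [((1, 3, 2), [(1, 3, 2), (2, 2, 3), (3, 1, 5)])]
print(regularized_objective([0.1, -0.2, 0.0], data, c=0.5, p=1))
```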

  19. If the candidate parses don't include the correct parse
     • If T_c(s) doesn't include t⋆(s), choose the parse t⁺(s) in T_c(s) that is closest to t⋆(s)
     • Maximize the conditional likelihood of (t⁺_1, ..., t⁺_n)
     • Closest parse:
         t⁺_i = argmax_{t ∈ T_c(s_i)} F_{t⋆_i}(t)
       – F_{t⋆}(t) is the f-score of t relative to t⋆
     • w is chosen to maximize the regularized log conditional likelihood of the t⁺_i:
         L_D(w) = Π_i P_w(t⁺_i | s_i)
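
A sketch of selecting t⁺ by f-score, assuming parses are compared as multisets of labelled spans (a common reading of parse f-score; the talk does not spell out the representation, so the span encoding below is our assumption):

```python
from collections import Counter

# Pick the candidate whose labelled-bracket f-score against the gold tree is highest.
def f_score(candidate_spans, gold_spans):
    cand, gold = Counter(candidate_spans), Counter(gold_spans)
    matched = sum((cand & gold).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(cand.values())
    recall = matched / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

def closest_parse(candidates, gold_spans):
    """t+ = the candidate in T_c(s) with the highest f-score against t*."""
    return max(candidates, key=lambda spans: f_score(spans, gold_spans))

gold  = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)]              # hypothetical gold spans
cands = [[("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)],
         [("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)]]
print(closest_parse(cands, gold))                              # the first candidate wins
```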
