  1. Exploiting Syntactic Structure for Language Modeling
  Ciprian Chelba, Frederick Jelinek, The Johns Hopkins University
  • Basic Language Modeling
  ☞ A Structured Language Model:
    – Language Model Requirements
    – Word and Structure Generation
    – Research Issues:
      • Model Component Parameterization
      • Pruning Strategy
      • Word Level Probability Assignment
      • Model Statistics Reestimation
    – Model Performance

  2. • give people an outline so that they know what’s going on (1 min, 1-1)

  3. Basic Language Modeling
  Estimate the source probability P(W), W = w_1, w_2, ..., w_n, from a training corpus: millions of words of text chosen for its similarity to the sentences expected at run-time.
  • Parametric conditional models: P_θ(w_i | w_1 ... w_{i-1}), θ ∈ Θ, w_i ∈ V
  • Perplexity (PPL): PPL(M) = exp( -1/N · Σ_{i=1}^{N} ln[ P_M(w_i | w_1 ... w_{i-1}) ] )
  ✔ different than maximum likelihood estimation: the test data is not seen during the model estimation process;
  ✔ good models are smooth: P_M(w_i | w_1 ... w_{i-1}) > ε
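
  The perplexity definition above can be computed directly. Below is a minimal Python sketch (not from the slides); `model` stands for any hypothetical callable returning P_M(w_i | w_1 ... w_{i-1}), which must be strictly positive (a smooth model) for the log to be defined.

    import math

    def perplexity(model, test_corpus):
        # PPL(M) = exp( -1/N * sum_i ln P_M(w_i | w_1 ... w_{i-1}) )
        log_prob_sum = 0.0
        n_words = 0
        for sentence in test_corpus:
            for i, word in enumerate(sentence):
                p = model(word, sentence[:i])   # conditional probability of the next word
                log_prob_sum += math.log(p)     # requires p > 0, i.e. a smooth model
                n_words += 1
        return math.exp(-log_prob_sum / n_words)

    # Toy check: a uniform model over a 10-word vocabulary has perplexity 10.
    uniform = lambda word, history: 1.0 / 10
    print(perplexity(uniform, [["the", "contract", "ended"]]))   # -> 10.0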

  4. • Source modeling; classical problem in information theory
  • give interpretation for perplexity as the expected average length of a list of equiprobable words; Shannon code-length (3 min, 2-1)

  5. Exploiting Syntactic Structure for Language Modeling
  • Generalize trigram modeling (local) by taking advantage of sentence structure (influence by more distant past)
  • Use exposed heads h (words w and their corresponding non-terminal tags l) for prediction:
    P(w_i | T_i) = P(w_i | h_{-2}(T_i), h_{-1}(T_i))
    where T_i is the partial hidden structure, with head assignment, provided to W_i
  [Figure: partial parse of the word prefix "the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS after ...", with constituent heads ended_VP', with_PP, loss_NP, of_PP, contract_NP, loss_NP]
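
  As a concrete illustration of the conditioning above, here is a small hypothetical Python sketch (not from the slides): the predictor looks at the two most recent exposed heads of the partial parse, represented as (headword, non-terminal tag) pairs, rather than at the two previous surface words as a trigram model would. The table `P_predictor` and the head values are made up for the example.

    # Hypothetical predictor lookup: condition on the exposed heads h_{-2}, h_{-1}
    # of the partial parse instead of on the last two surface words.
    def predict_next_word(exposed_heads, P_predictor, word):
        h_minus2, h_minus1 = exposed_heads[-2], exposed_heads[-1]
        return P_predictor.get((h_minus2, h_minus1), {}).get(word, 0.0)

    # Illustrative values loosely based on the slide's example sentence.
    heads = [("ended", "VP'"), ("loss", "NP"), ("of", "PP")]
    P_predictor = {(("loss", "NP"), ("of", "PP")): {"7": 0.20, "cents": 0.05}}
    print(predict_next_word(heads, P_predictor, "7"))   # -> 0.2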

  6. • point out originality of approach;
  • explain clearly what headwords are; give example with removed constituent again; show that they make intuitively better predictors for the following word;
  • difference between trigram/slm: surface/deep modeling of the source;
  • hidden nature of the parses; cannot decide on a single best parse for a word prefix, not even at the end of sentence;
  • need to weight them according to how likely they are - probabilistic model; (6 min, 3-1)

  7. Language Model Requirements
  • Model must operate left-to-right: P(w_i | w_1 ... w_{i-1})
  • In hypothesizing hidden structure, the model can use only the word prefix W_i = w_0, ..., w_i, i.e., not the complete sentence w_0, ..., w_i, ..., w_{n+1} as all conventional parsers do!
  • Model will assign joint probability to sequences of words and hidden parse structure: P(T_i, W_i)
  • Model complexity must be limited; even the trigram model faces critical data sparseness problems

  8. x (8 min, 4-1)

  9. ended_VP’ with_PP loss_NP of_PP cents_NP contract_NP loss_NP : : : ; null ; predict cents ; POStag cents ; adjoin-right-NP ; adjoin-left-PP ; : : : ; : : : ; the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS adjoin-left-VP’ ; null ; The Johns Hopkins University

  10. ended_VP’ with_PP loss_NP of_PP cents_NP contract_NP loss_NP : : : ; null ; predict cents ; POStag cents ; adjoin-right-NP ; adjoin-left-PP ; : : : ; : : : ; the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN _NNS 7_CD cents adjoin-left-VP’ ; null ;

  11. ended_VP’ with_PP loss_NP of_PP cents_NP contract_NP loss_NP : : : ; null ; predict cents ; POStag cents ; adjoin-right-NP ; adjoin-left-PP ; : : : ; : : : ; the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN _NNS 7_CD cents adjoin-left-VP’ ; null ;

  12. ended_VP’ with_PP loss_NP of_PP cents_NP contract_NP loss_NP : : : ; null ; predict cents ; POStag cents ; adjoin-right-NP ; adjoin-left-PP ; : : : ; : : : ; the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN _NNS 7_CD cents adjoin-left-VP’ ; null ;

  13. [Figure: the same partial parse annotated with the three model components: PREDICTOR (predict word), TAGGER (tag word), and PARSER (adjoin_{left,right} or null), producing the action sequence ...; null; predict cents; POStag cents; adjoin-right-NP; adjoin-left-PP; ...; adjoin-left-VP'; null; ...]

  14. • just one of the possible continuations for one of the possible parses of the prefix;
  • prepare next slide using FSM; explain that it is merely an encoding of the word prefix and the tree structure; (11 min, 5-1)

  15. Word and Structure Generation
  P(T, W) = Π_{i=1}^{n+1} P(w_i | h_{-2}, h_{-1}) · P(g_i | w_i, h_{-1}.tag, h_{-2}.tag) · P(T_i | w_i, g_i, T_{i-1})
                 (predictor)              (tagger)                        (parser)
  • The predictor generates the next word w_i with probability P(w_i = v | h_{-2}, h_{-1})
  • The tagger attaches tag g_i to the most recently generated word w_i with probability P(g_i | w_i, h_{-1}.tag, h_{-2}.tag)
  • The parser builds the partial parse T_i from T_{i-1}, w_i, and g_i in a series of moves ending with null, where a parser move a is made with probability P(a | h_{-2}, h_{-1}), a ∈ {(adjoin-left, NTtag), (adjoin-right, NTtag), null}
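
  A minimal Python sketch of this factorization (assumptions, not the authors' code): the three component distributions are passed in as hypothetical callables, and the exposed-head bookkeeping is simplified; in particular, the convention that adjoin-left keeps the left head's word and adjoin-right the right head's word is one plausible reading of the move names.

    import math

    def adjoin(heads, move):
        # Replace the two rightmost exposed heads with one new head; the
        # non-terminal label comes from the move (assumed convention).
        direction, label = move                        # e.g. ("adjoin-left", "PP")
        h2, h1 = heads[-2], heads[-1]
        word = h2[0] if direction == "adjoin-left" else h1[0]
        return heads[:-2] + [(word, label)]

    def joint_log_prob(words, tags, moves_per_word, p_predictor, p_tagger, p_parser):
        # ln P(T, W) as the product over positions of predictor, tagger, and
        # parser factors; moves_per_word[i] is the move sequence after word i,
        # ending in "null".
        heads = [("<s>", "SB"), ("<s>", "SB")]         # initial exposed heads (illustrative)
        logp = 0.0
        for w, g, moves in zip(words, tags, moves_per_word):
            logp += math.log(p_predictor(w, heads[-2], heads[-1]))        # predictor
            logp += math.log(p_tagger(g, w, heads[-1][1], heads[-2][1]))  # tagger
            heads = heads + [(w, g)]                   # the tagged word becomes an exposed head
            for a in moves:                            # parser moves, the last one is "null"
                logp += math.log(p_parser(a, heads[-2], heads[-1]))
                if a != "null":
                    heads = adjoin(heads, a)
        return logp

    # Toy usage with uniform component models.
    uniform = lambda *args: 0.1
    print(joint_log_prob(["the", "contract"], ["DT", "NN"],
                         [["null"], [("adjoin-right", "NP"), "null"]],
                         uniform, uniform, uniform))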

  16. • we have described an encoding of a word sequence with a parse tree;
  • to get a probabilistic model, assign a probability to each elementary action in the encoding (13 min, 6-1)

  17. Research Issues
  • Model component parameterization: equivalence classifications for the model components
    P(w_i = v | h_{-2}, h_{-1}), P(g_i | w_i, h_{-1}.tag, h_{-2}.tag), P(a | h_{-2}, h_{-1})
  • Huge number of hidden parses: need to prune it by discarding the unlikely ones
  • Word level probability assignment: calculate P(w_i | w_1 ... w_{i-1})
  • Model statistics estimation: unsupervised algorithm for maximizing P(W) (minimizing perplexity)

  18. everything’s on the slide (14 min, 7-1)

  19. Pruning Strategy
  Number of parses {T_k} for a given word prefix W_k is |{T_k}| ~ O(2^k); prune most parses without discarding the most likely ones for a given sentence.
  Synchronous Multi-Stack Pruning Algorithm:
  • the hypotheses are ranked according to ln P(W_k, T_k)
  • each stack contains partial parses constructed by the same number of parser operations
  The width of the pruning is controlled by:
  • maximum number of stack entries
  • log-probability threshold
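
  A minimal Python sketch of the pruning applied to one stack (hypothetical data layout, not the authors' implementation): each hypothesis is a (log-probability, parse) pair, and both knobs from the slide, the maximum number of entries and a log-probability threshold relative to the best hypothesis, are applied.

    def prune_stack(stack, max_entries, log_prob_threshold):
        # Keep at most `max_entries` hypotheses whose ln P(W_k, T_k) is within
        # `log_prob_threshold` of the best hypothesis in this stack.
        if not stack:
            return []
        ranked = sorted(stack, key=lambda h: h[0], reverse=True)   # rank by ln P(W_k, T_k)
        best = ranked[0][0]
        kept = [h for h in ranked if h[0] >= best - log_prob_threshold]
        return kept[:max_entries]

    # Toy usage: three hypotheses, threshold of 5 nats, at most 2 entries survive.
    stack = [(-12.0, "parse A"), (-13.5, "parse B"), (-20.0, "parse C")]
    print(prune_stack(stack, max_entries=2, log_prob_threshold=5.0))
    # -> [(-12.0, 'parse A'), (-13.5, 'parse B')]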

  20. x (15 min, 8-1)

  21. Pruning Strategy
  [Figure: synchronous multi-stack organization. Stacks are indexed by word position (k, k', k+1) and by number of parser operations (0 parser op ... p ... P_k, P_k+1); arrow types mark word predictor and tagger transitions, null parser transitions, and parser adjoin/unary transitions.]

  22. • we want to find the most probable set of parses that are extensions of the ones currently in the stacks
  • there is an upper bound on the number of stacks at a given input position
  • hypotheses in stack 0 differ according to their POS sequences (17 min, 9-1)

  23. Word Level Probability Assignment
  The probability assignment for the word at position k+1 in the input sentence must be made using:
    P(w_{k+1} | W_k) = Σ_{T_k ∈ S_k} P(w_{k+1} | W_k T_k) · ρ(W_k, T_k)
  • S_k is the set of all parses present in the stacks at the current stage
  • the interpolation weights ρ(W_k, T_k) must satisfy Σ_{T_k ∈ S_k} ρ(W_k, T_k) = 1
  • ρ(W_k, T_k) = P(W_k T_k) / Σ_{T_k ∈ S_k} P(W_k T_k), in order to ensure a proper probability over strings W
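
  A minimal Python sketch of this interpolation (hypothetical data layout): the weights ρ(W_k, T_k) are the joint probabilities P(W_k T_k) of the surviving parses, normalized over the set S_k and computed here in the log domain; `p_next_word` stands for P(w_{k+1} | W_k T_k).

    import math

    def next_word_prob(stack_entries, p_next_word, word):
        # P(w_{k+1} | W_k) = sum_{T_k in S_k} P(w_{k+1} | W_k T_k) * rho(W_k, T_k),
        # with rho(W_k, T_k) = P(W_k T_k) / sum_{T_k in S_k} P(W_k T_k).
        max_lp = max(lp for lp, _ in stack_entries)            # for numerical stability
        joint = [(math.exp(lp - max_lp), parse) for lp, parse in stack_entries]
        z = sum(p for p, _ in joint)
        return sum((p / z) * p_next_word(word, parse) for p, parse in joint)

    # Toy usage: two surviving parses with different predictions for "cents".
    entries = [(-10.0, "parse A"), (-11.0, "parse B")]
    p_next = lambda w, parse: {"parse A": 0.30, "parse B": 0.05}[parse]
    print(next_word_prob(entries, p_next, "cents"))   # weighted mix of 0.30 and 0.05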

  24. • point out consistency of estimate: when summing over all parses we get the actual probability value according to our model. (19 min, 10-1)

  25. Model Parameter Reestimation
  Need to re-estimate the model component probabilities P(w_i = v | h_{-2}, h_{-1}), P(g_i | w_i, h_{-1}.tag, h_{-2}.tag), P(a | h_{-2}, h_{-1}) such that we decrease the model perplexity.
  Modified Expectation-Maximization (EM) algorithm:
  • We retain the N “best” parses {T^1, ..., T^N} for the complete sentence W
  • The hidden events in the EM algorithm are restricted to those occurring in the N “best” parses
  • We seed the re-estimation process with statistics gathered from manually parsed sentences
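
  A minimal Python sketch of the restricted E-step (hypothetical structures, not the authors' code): each of the N best parses of a sentence contributes fractional counts for the elementary predictor/tagger/parser events it contains, weighted by the parse's normalized probability; the M-step (not shown) would renormalize these counts per conditioning context into new component probabilities.

    import math
    from collections import defaultdict

    def nbest_fractional_counts(nbest, counts=None):
        # `nbest` is a list of (log_prob, events) pairs, where `events` lists
        # (component, context, outcome) triples for the elementary actions
        # occurring in that parse.
        counts = counts if counts is not None else defaultdict(float)
        max_lp = max(lp for lp, _ in nbest)
        weights = [math.exp(lp - max_lp) for lp, _ in nbest]
        z = sum(weights)
        for w, (_, events) in zip(weights, nbest):
            for event in events:
                counts[event] += w / z                 # fractional count
        return counts

    # Toy usage: two parses of one sentence, each with one predictor event.
    nbest = [
        (-12.0, [("predictor", ("loss_NP", "of_PP"), "cents")]),
        (-13.0, [("predictor", ("ended_VP'", "of_PP"), "cents")]),
    ]
    counts = nbest_fractional_counts(nbest)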

  26. • point out goal of re-estimation;
  • warn about need to know the E-M algorithm;
  • explain what a treebank is and why/how we can initialize from treebank (21 min, 11-1)
