

SLIDE 1

Exploiting Syntactic Structure for Language Modeling Ciprian Chelba, Frederick Jelinek

Basic Language Modeling:

☞ A Structured Language Model:

– Language Model Requirements
– Word and Structure Generation
– Research Issues:
  Model Component Parameterization
  Pruning Strategy
  Word Level Probability Assignment
  Model Statistics Reestimation
– Model Performance

The Johns Hopkins University

SLIDE 2 give people an outline so that they know what’s going on

1 min

1-1

SLIDE 3

Basic Language Modeling

Estimate the source probability

P(W), \quad W = w_1, w_2, \ldots, w_n

from a training corpus — millions of words of text chosen for its similarity to the sentences expected at run-time.

Parametric conditional models:

P_\theta(w_i \mid w_1 \ldots w_{i-1}), \quad \theta \in \Theta, \; w_i \in \mathcal{V}

Perplexity (PPL):

PPL(M) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \ln P_M(w_i \mid w_1 \ldots w_{i-1}) \right)

✔ different from maximum likelihood estimation: the test data is not seen during the model estimation process;

✔ good models are smooth:

P_M(w_i \mid w_1 \ldots w_{i-1}) > \epsilon > 0

The Johns Hopkins University
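
For concreteness, a minimal Python sketch of the PPL computation defined above; the per-word probabilities are made up purely for illustration:

```python
import math

def perplexity(cond_probs):
    """PPL from the per-word conditional probabilities
    P_M(w_i | w_1 ... w_{i-1}) assigned by a model M to a test text."""
    n = len(cond_probs)
    avg_log_prob = sum(math.log(p) for p in cond_probs) / n
    return math.exp(-avg_log_prob)

# Hypothetical probabilities for a 5-word test text; the result can be read
# as the expected size of a list of equiprobable words (see SLIDE 4 notes).
print(perplexity([0.1, 0.02, 0.3, 0.05, 0.2]))
```
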
SLIDE 4 Source modeling; classical problem in information theory; give interpretation for perplexity as expected average length of a list of equiprobable words; Shannon code-length;

3 min

2-1

SLIDE 5

Exploiting Syntactic Structure for Language Modeling

Generalize trigram modeling (local) by taking advantage of sentence structure (influence by more distant past).

Use exposed heads h (words w and their corresponding non-terminal tags l) for prediction:

P(w_i \mid T_i) = P(w_i \mid h_{-2}(T_i), h_{-1}(T_i))

T_i is the partial hidden structure, with head assignment, provided to W_i

[Parse-tree figure: partial parse of "... the contract ended with a loss of 7 cents after ..."; labeled constituents include contract_NP, loss_NP, with_PP, ended_VP', cents_NP, with exposed heads such as ended_VBD and cents_NNS.]

The Johns Hopkins University
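
A minimal sketch (not the authors' code) of how the two most recent exposed heads might be read off a partial parse kept as a stack of (headword, non-terminal tag) pairs; the sentence-start padding head is an assumption:

```python
from collections import namedtuple

# An exposed head pairs a headword with its non-terminal tag.
Head = namedtuple("Head", ["word", "tag"])

def last_two_heads(stack):
    """Return (h_{-2}, h_{-1}), the two most recent exposed heads of a
    partial parse, padding with a hypothetical sentence-start head when
    fewer than two constituents are exposed."""
    padded = [Head("<s>", "SB"), Head("<s>", "SB")] + list(stack)
    return padded[-2], padded[-1]

# Exposed heads for the partial parse in the figure above:
stack = [Head("ended", "VP'"), Head("cents", "NP")]
h2, h1 = last_two_heads(stack)
# the next word is predicted from P(w | h_{-2} = ended_VP', h_{-1} = cents_NP)
```
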

SLIDE 6 point out originality of approach; explain clearly what headwords are; difference between trigram/slm: surface/deep modeling of the source; give example with removed constituent again; show that they make intuitively better predictors for the following word;

hidden nature of the parses; cannot decide on a single best parse for a word prefix, not even at the end of the sentence;

need to weight them according to how likely they are - probabilistic model;

6 min

3-1

SLIDE 7

Language Model Requirements

Model must operate left-to-right:

P(w_i \mid w_1 \ldots w_{i-1})

In hypothesizing hidden structure, the model can use only the word prefix W_i, i.e., not the complete sentence w_1, \ldots, w_i, \ldots, w_{n+1}, as all conventional parsers do!

Model complexity must be limited; even the trigram model faces critical data sparseness problems

Model will assign joint probability to sequences of words and hidden parse structure:

P(T_i, W_i)

The Johns Hopkins University

SLIDE 8

x

8 min

4-1

SLIDE 9

[Parse-tree figure: partial parse of "... the contract ended with a loss of 7 cents ...", with labeled nodes contract_NP, loss_NP, with_PP, ended_VP', cents_NNS, cents_NP.]

...; null; predict cents; POStag cents; adjoin-right-NP; adjoin-left-PP; ...; adjoin-left-VP'; null; ...;

The Johns Hopkins University

SLIDE 10

[Parse-tree figure: the same example at an intermediate step of the derivation of the word "cents".]

...; null; predict cents; POStag cents; adjoin-right-NP; adjoin-left-PP; ...; adjoin-left-VP'; null; ...;
SLIDE 11

[Parse-tree figure: the same example one step further in the derivation of the word "cents".]

...; null; predict cents; POStag cents; adjoin-right-NP; adjoin-left-PP; ...; adjoin-left-VP'; null; ...;
SLIDE 12

[Parse-tree figure: the same example at a further step of the derivation of the word "cents".]

...; null; predict cents; POStag cents; adjoin-right-NP; adjoin-left-PP; ...; adjoin-left-VP'; null; ...;
SLIDE 13

[Parse-tree figure: the same example, shown together with the legend of model components below.]

PREDICTOR: predict word | TAGGER: tag word | PARSER: adjoin_{left,right}, null

...; null; predict cents; POStag cents; adjoin-right-NP; adjoin-left-PP; ...; adjoin-left-VP'; null; ...;
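
The elementary actions spelled out above can be written down as a simple sequence; a sketch of that encoding (names are illustrative, not the authors' implementation), which is exactly what the model assigns probabilities to on the next slide:

```python
# One fragment of the derivation shown above, as (action, argument) pairs.
actions = [
    ("predict", "cents"),       # predictor generates the next word
    ("tag", "NNS"),             # tagger assigns a POS tag to it
    ("adjoin-right", "NP"),     # parser move: new NP headed by the right child
    ("adjoin-left", "PP"),      # parser move: new PP headed by the left child
    # ...
    ("adjoin-left", "VP'"),
    ("null", None),             # parser stops; control returns to the predictor
]
```
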
SLIDE 14 just one of the possible continuations for one of the possible parses of the prefix;

prepare next slide using FSM; explain that it is merely an encoding of the word prefix and the tree structure;

11 min

5-1

SLIDE 15

Word and Structure Generation

P(T_{n+1}, W_{n+1}) = \prod_{i=1}^{n+1} \underbrace{P(w_i \mid h_{-2}, h_{-1})}_{\text{predictor}} \, \underbrace{P(g_i \mid w_i, h_{-1}.tag, h_{-2}.tag)}_{\text{tagger}} \, \underbrace{P(T_i \mid w_i, g_i, T_{i-1})}_{\text{parser}}

The predictor generates the next word w_i with probability P(w_i = v \mid h_{-2}, h_{-1})

The tagger attaches tag g_i to the most recently generated word w_i with probability P(g_i \mid w_i, h_{-1}.tag, h_{-2}.tag)

The parser builds the partial parse T_i from T_{i-1}, w_i, and g_i in a series of moves ending with null, where a parser move a is made with probability P(a \mid h_{-2}, h_{-1}), \; a \in \{(\text{adjoin-left, NTtag}), (\text{adjoin-right, NTtag}), \text{null}\}

The Johns Hopkins University
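
A sketch of the factorization above, assuming the three component models are supplied as plain probability functions (the signatures and data layout are assumptions for illustration):

```python
import math

def log_joint_prob(derivation, predictor, tagger, parser):
    """log P(T, W) for one word-and-parse derivation, factored into
    predictor, tagger and parser moves as in the product above.
    `derivation` is a list of per-word tuples (w, (h2, h1), g, moves),
    where h2/h1 are the exposed heads as (word, tag) pairs and `moves`
    is the list of parser actions taken after tagging w, each paired with
    the exposed heads at that moment, ending with 'null'."""
    logp = 0.0
    for w, (h2, h1), g, moves in derivation:
        logp += math.log(predictor(w, h2, h1))        # P(w_i | h_{-2}, h_{-1})
        logp += math.log(tagger(g, w, h1[1], h2[1]))  # P(g_i | w_i, h_{-1}.tag, h_{-2}.tag)
        for action, (m2, m1) in moves:
            logp += math.log(parser(action, m2, m1))  # P(a | h_{-2}, h_{-1})
    return logp
```
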

SLIDE 16 we have described an encoding of a word sequence with a parse tree; to get a probabilistic model, assign a probability to each elementary action in the encoding
13 min

6-1

SLIDE 17

Research Issues

Model component parameterization — equivalence classifications for model components:

P(w_i = v \mid h_{-2}, h_{-1}); \quad P(g_i \mid w_i, h_{-1}.tag, h_{-2}.tag); \quad P(a \mid h_{-2}, h_{-1})

Huge number of hidden parses — need to prune it by discarding the unlikely ones

Word level probability assignment — calculate P(w_i \mid w_1 \ldots w_{i-1})

Model statistics estimation — unsupervised algorithm for maximizing P(W) (minimizing perplexity)

The Johns Hopkins University

SLIDE 18

everything’s on the slide

14 min

7-1

SLIDE 19

Pruning Strategy

Number of parses T_k for a given word prefix W_k is |\{T_k\}| \sim O(2^k);

Prune most parses without discarding the most likely ones for a given sentence

Synchronous Multi-Stack Pruning Algorithm:

  the hypotheses are ranked according to \ln P(W_k, T_k)

  each stack contains partial parses constructed by the same number of parser operations

The width of the pruning is controlled by:

  maximum number of stack entries

  log-probability threshold

The Johns Hopkins University
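
A rough sketch of the two width controls named above, applied to one stack of hypotheses (function name and data layout are assumptions):

```python
def prune_stack(stack, max_entries, logprob_threshold):
    """Prune one stack of hypotheses.  `stack` is a list of
    (logprob, parse) pairs with logprob = ln P(W_k, T_k).  Width is limited
    both by a maximum number of entries and by a log-probability margin
    below the best surviving hypothesis."""
    stack = sorted(stack, key=lambda h: h[0], reverse=True)[:max_entries]
    if not stack:
        return stack
    best_logprob = stack[0][0]
    return [h for h in stack if h[0] >= best_logprob - logprob_threshold]
```
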

SLIDE 20

x

15 min

8-1

SLIDE 21

Pruning Strategy

[Figure: the synchronous multi-stack pruning lattice. At each input position (k, k', k+1) there is one stack per number of parser operations (0, ..., p, p+1, ..., P_k, P_{k+1}); arrows mark word predictor and tagger transitions, parser adjoin/unary transitions, and null parser transitions.]

The Johns Hopkins University
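
A sketch of how the stacks in the figure might be organized: at each input position, one stack per number of parser operations (the data layout and names are assumptions, not the authors' implementation):

```python
from collections import defaultdict

# stacks[k][p] holds the hypotheses at input position k that were built
# with exactly p parser operations; each hypothesis is (logprob, parse).
stacks = defaultdict(lambda: defaultdict(list))

def add_hypothesis(stacks, k, num_parser_ops, logprob, parse):
    """Insert a hypothesis into the stack it belongs to; pruning
    (see prune_stack above) is then applied per stack."""
    stacks[k][num_parser_ops].append((logprob, parse))
```
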

SLIDE 22 we want to find the most probable set of parses that are extensions of the ones currently in the stacks;

there is an upper bound on the number of stacks at a given input position;

hypotheses in stack 0 differ according to their POS sequences

17 min

9-1

SLIDE 23

Word Level Probability Assignment

The probability assignment for the word at position k+1 in the input sentence must be made using:

P(w_{k+1} \mid W_k) = \sum_{T_k \in S_k} P(w_{k+1} \mid W_k T_k) \cdot \rho(W_k, T_k)

S_k is the set of all parses present in the stacks at the current stage k

the interpolation weights \rho(W_k, T_k) must satisfy \sum_{T_k \in S_k} \rho(W_k, T_k) = 1 in order to ensure a proper probability over strings W:

\rho(W_k, T_k) = P(W_k T_k) \Big/ \sum_{T_k \in S_k} P(W_k T_k)

The Johns Hopkins University
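
A small sketch of the interpolation above; each stack entry is assumed to carry ln P(W_k, T_k) together with a function giving P(w_{k+1} | W_k, T_k):

```python
import math

def next_word_prob(hypotheses, word):
    """P(w_{k+1} | W_k) as a weighted sum over the parses in the stacks.
    `hypotheses` is a list of (logprob, predict) pairs, one per parse T_k
    in S_k, where logprob = ln P(W_k, T_k) and predict(word) returns
    P(w_{k+1} | W_k, T_k).  The weights rho(W_k, T_k) are the parse
    probabilities renormalized over S_k, so they sum to one."""
    total = sum(math.exp(lp) for lp, _ in hypotheses)
    return sum(math.exp(lp) / total * predict(word)
               for lp, predict in hypotheses)
```
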

SLIDE 24 point out consistency of estimate: when summing over all parses we get the actual probability value according to our model.

19 min

10-1

SLIDE 25

Model Parameter Reestimation

Need to re-estimate model component probabilities such that we decrease the model perplexity:

P(w_i = v \mid h_{-2}, h_{-1}); \quad P(g_i \mid w_i, h_{-1}.tag, h_{-2}.tag); \quad P(a \mid h_{-2}, h_{-1})

Modified Expectation-Maximization (EM) algorithm:

  We retain the N “best” parses \{T_1, \ldots, T_N\} for the complete sentence W

  The hidden events in the EM algorithm are restricted to those occurring in the N “best” parses

  We seed the re-estimation process with statistics gathered from manually parsed sentences

The Johns Hopkins University
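
A much-simplified sketch of one reestimation pass, for the predictor component only: counts are collected from the N “best” parses, weighted by their renormalized probabilities, and turned back into conditional probabilities. It glosses over smoothing and the tagger/parser components; all names are illustrative:

```python
import math
from collections import defaultdict

def reestimate_predictor(sentences_nbest):
    """`sentences_nbest` yields, for each training sentence, a list of
    (logprob, events) pairs for its N best parses, where `events` lists
    the predictor events (w, h2, h1) occurring in that parse.  Returns a
    dict mapping (w, h2, h1) to a reestimated P(w | h_{-2}, h_{-1})."""
    counts = defaultdict(float)
    context_totals = defaultdict(float)
    for nbest in sentences_nbest:
        total = sum(math.exp(lp) for lp, _ in nbest)
        for lp, events in nbest:
            weight = math.exp(lp) / total   # posterior of this parse among the N best
            for w, h2, h1 in events:
                counts[(w, h2, h1)] += weight
                context_totals[(h2, h1)] += weight
    return {key: c / context_totals[key[1:]] for key, c in counts.items()}
```
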

SLIDE 26 point out goal of re-estimation; warn about need to know the E-M algorithm; explain what a treebank is and why/how we can initialize from treebank

21 min

11-1

SLIDE 27

Language Model Performance — Perplexity

Training set: UPenn Treebank text; 930K words; manually parsed

Test set: UPenn Treebank text; 82K words

Vocabulary: 10K — out-of-vocabulary words are mapped to <unk>

Incorporate the trigram in the word PREDICTOR:

P(w_i \mid W_i) = (1 - \lambda) \cdot P(w_i \mid h_{-2}, h_{-1}) + \lambda \cdot P(w_i \mid w_{i-1}, w_{i-2}), \quad \lambda = 0.36

L2R Perplexity:

Language Model                                   DEV set   TEST set (no int)   TEST set (3-gram int)
Trigram               P(w_i | w_{i-2}, w_{i-1})   21.20         167.14                167.14
Seeded with Treebank  P(w_i | h_{-2}, h_{-1})     24.70         167.47                152.25
Reestimated           P(w_i | h_{-2}, h_{-1})     20.97         158.28                148.90

The Johns Hopkins University
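
The interpolation with the trigram predictor is a one-liner; a sketch using the weight reported above:

```python
def interpolated_predictor(p_slm, p_trigram, lam=0.36):
    """P(w_i | W_i) = (1 - lambda) * P(w_i | h_{-2}, h_{-1})
                      + lambda * P(w_i | w_{i-1}, w_{i-2}),  lambda = 0.36."""
    return (1.0 - lam) * p_slm + lam * p_trigram
```
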

SLIDE 28 first model that reports a reduction over trigram model by using syntactic structure;

make point about data over-fitting in the trigram case — caused by data sparseness and poor source modeling (surface model);

23 min

12-1

SLIDE 29

Conclusion

✔ original approach to language modeling that takes into account the hierarchical structure in natural language

✔ devised an algorithm to reestimate the model parameters such that the perplexity of the model is decreased

✔ showed improvement in perplexity over current language modeling techniques

Future Work

✘ rescoring of word lattices output by a speech recognizer

The Johns Hopkins University

SLIDE 30 BOW!

24 min

13-1

SLIDE 31

Exploiting Syntactic Structure for Language Modeling Ciprian Chelba, Frederick Jelinek

Acknowledgments:

this research was funded by the NSF grant IRI-9618874 (STIMULATE);

thanks to Eric Brill, William Byrne, Sanjeev Khudanpur, Harry Printz, Eric Ristad, Andreas Stolcke and David Yarowsky for useful comments, discussions on the model and programming support

The Johns Hopkins University