Exploiting Syntactic Structure for Language Modeling

Ciprian Chelba, Frederick Jelinek



  1. Hierarchical Structure in Natural Language
  [Figure: lexicalized parse tree for "the contract ended with a loss of 7 cents", with constituents ended_S, ended_VP, with_PP, loss_NP, of_PP, cents_NP, contract_NP above the tagged words the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS]
  • Words are hierarchically organized in syntactic constituents (tree structure)
  • Part-of-Speech (POS) and Non-Terminal (NT) tags identify the type of constituent
  • Lexicalized annotation of intermediate nodes in the tree
  Identifying the syntactic structure: parsing
  ✔ Automatic parsing of natural language text is an area of active research
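A minimal sketch of how such a lexicalized tree can be represented, built for the "of 7 cents" fragment of the example above; the class and field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Constituent:
    """A node in a lexicalized parse tree: a constituent label plus the head word
    percolated up from its children; leaves carry the word itself as the head."""
    label: str            # NT tag (NP, PP, ...) or POS tag (IN, CD, NNS, ...)
    head: str             # lexical head of the constituent
    children: tuple = ()  # empty for leaves

# Leaves and lexicalized phrases for the "of 7 cents" fragment of the example tree:
of_IN     = Constituent("IN",  "of")
seven_CD  = Constituent("CD",  "7")
cents_NNS = Constituent("NNS", "cents")
cents_NP  = Constituent("NP",  "cents", (seven_CD, cents_NNS))  # cents_NP
of_PP     = Constituent("PP",  "of",    (of_IN, cents_NP))      # of_PP
```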

  2. Exploiting Syntactic Structure for Language Modeling
  Ciprian Chelba, Frederick Jelinek
  • Hierarchical Structure in Natural Language
  • Speech Recognition: Statistical Approach
  • Basic Language Modeling:
    – Measures for Language Model Quality
    – Current Approaches to Language Modeling
  • A Structured Language Model:
    – Language Model Requirements
    – Word and Structure Generation
    – Research Issues
    – Model Performance: Perplexity results on UPenn-Treebank
    – Model Performance: Perplexity and WER results on WSJ/SWB/BN
  • Any Future for the Structured Language Model?
    – Richer Syntactic Dependencies
    – Syntactic Structure Portability
    – Information Extraction from Text

  3. Speech Recognition: Statistical Approach
  [Figure: source-channel diagram. The speaker's mind produces the word string W; the speech producer and acoustic processor (the acoustic channel) turn it into the acoustic signal A; the linguistic decoder in the speech recognizer outputs the hypothesis Ŵ.]
  $\hat{W} = \arg\max_W P(W \mid A) = \arg\max_W P(A \mid W) \cdot P(W)$
  • $P(A \mid W)$: acoustic model (channel probability)
  • $P(W)$: language model (source probability)
  • search for the most likely word string $\hat{W}$
  ✔ Due to the large vocabulary size (tens of thousands of words), an exhaustive search is intractable.
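The decision rule above can be read as rescoring candidate word strings by acoustic and language model scores. A minimal sketch, assuming toy log-probability tables; all names and values below are hypothetical, not from the slides:

```python
# Hypothetical log-probability tables for one observed acoustic signal A:
# log P(A | W) from the acoustic model and log P(W) from the language model.
log_p_acoustic = {"the contract ended": -12.1, "the contract tended": -11.8}
log_p_language = {"the contract ended": -7.3, "the contract tended": -10.9}

def decode(hypotheses):
    """Return argmax_W [ log P(A | W) + log P(W) ], i.e. argmax_W P(A | W) * P(W)."""
    return max(hypotheses, key=lambda w: log_p_acoustic[w] + log_p_language[w])

print(decode(log_p_acoustic))  # -> "the contract ended"
```

In a real recognizer the maximization runs over an enormous hypothesis space, which is why the slide notes that exhaustive search is intractable and approximate search is used instead.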

  4. Basic Language Modeling
  Estimate the source probability $P(W)$, $W = w_1, w_2, \ldots, w_n$, from a training corpus: millions of words of text chosen for its similarity to the expected utterances.
  Parametric conditional models:
  $P_\theta(w_i \mid w_1 \ldots w_{i-1}), \quad \theta \in \Theta, \; w_i \in V$
  • $\Theta$: parameter space
  • $V$: source alphabet (vocabulary)
  ✔ Source Modeling Problem

  5. Measures for Language Model Quality
  Word Error Rate (WER)
  TRN: UP UPSTATE NEW YORK SOMEWHERE UH OVER OVER *** HUGE AREAS
  HYP: ** UPSTATE NEW YORK SOMEWHERE UH ALL  ALL  THE HUGE AREAS
        1    0     0    0      0      0   1    1   1    0     0
  4 errors (1 deletion, 2 substitutions, 1 insertion) per 10 words in the transcription; WER = 40%
  Evaluating WER reduction is computationally expensive.
  Perplexity (PPL)
  $PPL(M) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \ln P_M(w_i \mid w_1 \ldots w_{i-1})\right)$
  ✔ different from maximum likelihood estimation: the test data is not seen during model estimation
  ✔ good models are smooth: $P_M(w_i \mid w_1 \ldots w_{i-1}) > \epsilon$
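A minimal sketch of the perplexity computation defined above, assuming a hypothetical model_logprob(word, history) callable that returns ln P_M(w_i | w_1 ... w_{i-1}):

```python
import math

def perplexity(model_logprob, sentences):
    """PPL(M) = exp(-(1/N) * sum_i ln P_M(w_i | w_1..w_{i-1})) over all test words."""
    total_logprob, n_words = 0.0, 0
    for words in sentences:
        for i, w in enumerate(words):
            total_logprob += model_logprob(w, words[:i])  # ln P_M(w_i | history)
            n_words += 1
    return math.exp(-total_logprob / n_words)

# Example with a (hypothetical) uniform model over a 10,000-word vocabulary:
uniform = lambda w, history: math.log(1.0 / 10000)
print(perplexity(uniform, [["the", "contract", "ended"]]))  # ≈ 10000 (= vocabulary size)
```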

  6. Current Approaches to Language Modeling
  Assume a Markov source of order $n$; equivalence classification of a given context:
  $[w_1 \ldots w_{i-1}] = w_{i-n+1} \ldots w_{i-1} = h$
  Data sparseness (3-gram model $P(w_i \mid w_{i-2}, w_{i-1})$):
  • approx. 70% of the trigrams in the training data have been seen only once
  • the rate of new (unseen) trigrams in test data, relative to those observed in a training corpus of 38 million words, is 32% for a 20,000-word vocabulary
  Smoothing: recursive linear interpolation among relative frequency estimates of different orders $f_k(\cdot), \; k = 0 \ldots n$, using a recursive mixing scheme:
  $P_n(u \mid z_1, \ldots, z_n) = \lambda(z_1, \ldots, z_n) \cdot P_{n-1}(u \mid z_1, \ldots, z_{n-1}) + (1 - \lambda(z_1, \ldots, z_n)) \cdot f_n(u \mid z_1, \ldots, z_n)$
  $P_{-1}(u) = \mathrm{uniform}(V)$
  Parameters: $\Theta = \{\lambda(z_1, \ldots, z_n); \; \mathrm{count}(u \mid z_1, \ldots, z_n), \; \forall (u \mid z_1, \ldots, z_n) \in T\}$
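A minimal sketch of the recursive interpolation above for a trigram model. The relative-frequency tables and weights are hypothetical placeholders, and a single weight per order is used here instead of the per-context weights λ(z_1 ... z_k) the slide defines:

```python
# Hypothetical relative-frequency tables f[k][(context, word)] for orders k = 0..2
# and per-order interpolation weights lam[k] (a simplification of the per-context
# weights lambda(z_1..z_k), which are normally estimated on held-out data).
VOCAB_SIZE = 10000
f = [dict(), dict(), dict()]   # f[k]: relative frequencies of order k
lam = [0.3, 0.5, 0.7]          # lam[k]: weight given to the lower-order model P_{k-1}

def p_interp(k, word, context):
    """P_k(u | z_1..z_k) = lam * P_{k-1}(u | z_1..z_{k-1}) + (1 - lam) * f_k(u | z_1..z_k)."""
    if k < 0:
        return 1.0 / VOCAB_SIZE                 # P_{-1}(u): uniform over the vocabulary
    rel_freq = f[k].get((context, word), 0.0)   # f_k relative frequency estimate
    return lam[k] * p_interp(k - 1, word, context[1:]) + (1.0 - lam[k]) * rel_freq

# Trigram probability P(w_i | w_{i-2}, w_{i-1}):
prob = p_interp(2, "ended", ("the", "contract"))
```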

  7. Exploiting Syntactic Structure for Language Modeling
  • Hierarchical Structure in Natural Language
  • Speech Recognition: Statistical Approach
  • Basic Language Modeling
  ☞ A Structured Language Model:
    – Language Model Requirements
    – Word and Structure Generation
    – Research Issues:
      • Model Component Parameterization
      • Pruning Method
      • Word Level Probability Assignment
      • Model Statistics Reestimation
    – Model Performance: Perplexity results on UPenn-Treebank
    – Model Performance: Perplexity and WER results on WSJ/SWB/BN

  8. A Structured Language Model
  • Generalize trigram modeling (local) by taking advantage of sentence structure (influence by more distant past)
  • Use exposed heads $h$ (words $w$ and their corresponding non-terminal tags $l$) for prediction:
  $P(w_i \mid T_i) = P(w_i \mid h_{-2}(T_i), h_{-1}(T_i))$
  where $T_i$ is the partial hidden structure, with head assignment, provided for the word prefix $W_i$.
  [Figure: partial parse of "the contract ended with a loss of 7 cents" with exposed heads ended_VP', with_PP, loss_NP, of_PP, cents_NP above the tagged words the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS]
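A minimal sketch of prediction from exposed heads rather than from the two preceding surface words. The probability table and its values are hypothetical, purely for illustration:

```python
# Hypothetical table P(w | h_-2, h_-1), keyed on the two most recently exposed heads;
# each head is a (head word, tag) pair. Values are illustrative only.
p_word_given_heads = {
    (("ended", "VP'"), ("with", "IN")): {"a": 0.20, "the": 0.15},
}

def predict_word_prob(exposed_heads, word, table=p_word_given_heads):
    """Return P(word | h_-2(T_i), h_-1(T_i)) for the current partial parse T_i."""
    h_minus_2, h_minus_1 = exposed_heads[-2], exposed_heads[-1]
    return table.get((h_minus_2, h_minus_1), {}).get(word, 0.0)

# The next word is predicted from the exposed heads of the partial parse (e.g. "ended"
# and "with" in the running example) rather than from the last two surface words alone.
```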

  9. Language Model Requirements
  • The model must operate left-to-right: $P(w_i \mid w_1 \ldots w_{i-1})$
  • In hypothesizing hidden structure, the model can use only the word prefix $W_i$, i.e., not the complete sentence $w_0, \ldots, w_i, \ldots, w_{n+1}$ as all conventional parsers do!
  • Model complexity must be limited; even the trigram model faces critical data sparseness problems
  • The model will assign joint probability to sequences of words and hidden parse structure: $P(T_i, W_i)$

  10.–13. [Figure animation: successive stages of building the partial parse for "the contract ended with a loss of 7 cents", with exposed heads ended_VP', with_PP, loss_NP, of_PP, cents_NP, contract_NP above the tagged words the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS]
  Model action sequence: ...; null; predict cents; POStag cents; adjoin-right-NP; adjoin-left-PP; ...; adjoin-left-VP'; null; ...

  14. [Figure: the three model components operating on the same example: PREDICTOR (predict word), TAGGER (tag word), PARSER (adjoin_{left,right}, null)]
  Model action sequence: ...; null; predict cents; POStag cents; adjoin-right-NP; adjoin-left-PP; ...; adjoin-left-VP'; null; ...

  15. Word and Structure Generation
  $P(T, W) = \prod_{i=1}^{n+1} \underbrace{P(w_i \mid h_{-2}, h_{-1})}_{\text{predictor}} \; \underbrace{P(g_i \mid w_i, h_{-1}.\mathrm{tag}, h_{-2}.\mathrm{tag})}_{\text{tagger}} \; \underbrace{P(T_i \mid w_i, g_i, T_{i-1})}_{\text{parser}}$
  • The predictor generates the next word $w_i$ with probability $P(w_i = v \mid h_{-2}, h_{-1})$
  • The tagger attaches tag $g_i$ to the most recently generated word $w_i$ with probability $P(g_i \mid w_i, h_{-1}.\mathrm{tag}, h_{-2}.\mathrm{tag})$
  • The parser builds the partial parse $T_i$ from $T_{i-1}$, $w_i$, and $g_i$ in a series of moves ending with null, where a parser move $a$ is made with probability $P(a \mid h_{-2}, h_{-1})$; $a \in \{(\text{adjoin-left}, \text{NTtag}), (\text{adjoin-right}, \text{NTtag}), \text{null}\}$
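A minimal sketch of evaluating the joint probability of one derivation under the three factors above. The per-word record structure and the component callables are hypothetical; each is assumed to return a probability:

```python
import math

def derivation_logprob(derivation, predictor, tagger, parser):
    """log P(T, W) summed over the per-word predictor, tagger and parser actions.
    `derivation` is a list of hypothetical per-word records carrying the word, its tag,
    the exposed heads (h_-2, h_-1) at prediction time, and the subsequent parser moves,
    each stored together with the heads exposed when that move was made."""
    logp = 0.0
    for step in derivation:
        logp += math.log(predictor(step.word, step.heads))         # P(w_i | h_-2, h_-1)
        logp += math.log(tagger(step.tag, step.word, step.heads))  # P(g_i | w_i, head tags)
        for move, heads in step.parser_moves:                      # sequence ends with null
            logp += math.log(parser(move, heads))                  # P(a | h_-2, h_-1)
    return logp
```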

  16. Research Issues
  • Model component parameterization: equivalence classifications for the model components $P(w_i = v \mid h_{-2}, h_{-1})$, $P(g_i \mid w_i, h_{-1}.\mathrm{tag}, h_{-2}.\mathrm{tag})$, $P(a \mid h_{-2}, h_{-1})$
  • Huge number of hidden parses: need to prune them by discarding the unlikely ones
  • Word level probability assignment: calculate $P(w_i \mid w_1 \ldots w_{i-1})$ (a sketch follows below)
  • Model statistics estimation: unsupervised algorithm for maximizing $P(W)$ (minimizing perplexity)
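A minimal sketch of one way to assign the word-level probability: interpolate the head-based predictions of the surviving partial parses, weighting each parse by its normalized joint probability. The entry structure and function names are hypothetical:

```python
import math

def next_word_prob(word, stack_entries, predictor):
    """P(w_{k+1} | W_k) ≈ sum over surviving partial parses T_k of
       P(w_{k+1} | h_-2(T_k), h_-1(T_k)) * rho(W_k, T_k),
       where rho normalizes each parse's joint probability P(W_k, T_k)
       over the parses kept after pruning."""
    total_joint = sum(math.exp(e.log_joint) for e in stack_entries)
    prob = 0.0
    for entry in stack_entries:
        rho = math.exp(entry.log_joint) / total_joint       # parse weight
        prob += rho * predictor(word, entry.exposed_heads)  # P(w | h_-2, h_-1)
    return prob
```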

  17. Pruning Method
  The number of parses $T_k$ for a given word prefix $W_k$ is $|\{T_k\}| \sim O(2^k)$; prune most parses without discarding the most likely ones for a given sentence.
  Synchronous Multi-Stack Pruning Algorithm
  • the hypotheses are ranked according to $\ln P(W_k, T_k)$
  • each stack contains partial parses constructed by the same number of parser operations
  The width of the pruning is controlled by:
  • maximum number of stack entries
  • log-probability threshold
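A minimal sketch of applying the two pruning controls to one stack of partial-parse hypotheses; the entry structure and names are hypothetical:

```python
def prune_stack(entries, max_entries, logprob_threshold):
    """Keep at most `max_entries` hypotheses, ranked by ln P(W_k, T_k), and drop any
    hypothesis whose log probability falls more than `logprob_threshold` below the best."""
    ranked = sorted(entries, key=lambda e: e.log_joint, reverse=True)
    if not ranked:
        return ranked
    best = ranked[0].log_joint
    kept = [e for e in ranked if e.log_joint >= best - logprob_threshold]
    return kept[:max_entries]

# One such stack is maintained per number of parser operations performed, and all
# stacks are pruned synchronously as each new word is processed.
```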
