Character-level Language Models With Word-level Learning


  1. Character-level Language Models With Word-level Learning
     Arvid Frydenlund
     March 16, 2018

  2. Character-level Language Models
     ◮ Want language models with an open vocabulary
     ◮ Character-level models give this for free
     ◮ Treat the probability of a word as the product of character probabilities (see the sketch after this slide):

        P_w(w = c_1, \ldots, c_m \mid h_i) = \prod_{j=0}^{m} \frac{e^{s_c(c_{j+1},\, j)}}{\sum_{c' \in V_c} e^{s_c(c',\, j)}}    (1)

     ◮ where V_c is the character 'vocabulary'
     ◮ Models are trained to minimize per-character cross entropy
     ◮ Issue: Training focuses on how words look and not on what they mean
     ◮ Solution: Do not define the probability of a word as the product of character probabilities
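
A minimal Python sketch of Eq. (1), under assumptions not in the slides: char_logits is a hypothetical (m+1) x |V_c| array holding the decoder scores s_c(·, j) at each step (the last row scoring an end-of-word symbol), and char_ids lists the target character indices.

```python
import numpy as np

def word_log_prob(char_logits: np.ndarray, char_ids: list[int]) -> float:
    """log P_w(w | h_i) per Eq. (1): a sum of per-step log-softmax terms
    over the character vocabulary V_c. Inputs are hypothetical stand-ins."""
    log_p = 0.0
    for j, c in enumerate(char_ids):
        logits = char_logits[j]              # scores s_c(., j) at step j
        m = logits.max()                     # stabilize the log-sum-exp
        log_z = m + np.log(np.sum(np.exp(logits - m)))
        log_p += float(logits[c] - log_z)    # log-softmax of the target char
    return log_p
```

Minimizing per-character cross entropy then amounts to minimizing -word_log_prob(...) averaged over the characters of each word.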

  3. Globally normalized word probabilities
     ◮ Conditional Random Field objective:

        P_w(w = c_1, \ldots, c_m \mid h_i) = \frac{e^{s_w(w = c_1, \ldots, c_m,\, h_i)}}{\sum_{w' \in V} e^{s_w(w',\, h_i)}}    (2)

     ◮ The normalizing partition function runs over all words in the (open) vocabulary
     ◮ Issue: The partition function is intractable
     ◮ Solution: Use beam search to limit the scope of the elements comprising the partition function (see the sketch after this slide)
     ◮ This can be seen as approximating P(w) by normalizing over only the most probable candidate words
     ◮ Issue: The elements of the partition are words of different lengths
     ◮ The score function and the beam search need to be length-agnostic
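
A sketch of the beam-approximated normalization in Eq. (2), assuming hypothetical inputs: gold_score is s_w(w, h_i) for the target word and beam_scores holds s_w(w', h_i) for the candidate words kept by beam search. Appending the gold word to the candidate set is an assumption made here so that the probability is well defined; the slides do not spell out this detail.

```python
import numpy as np

def approx_word_log_prob(gold_score: float, beam_scores: np.ndarray) -> float:
    """log P_w(w | h_i) with the partition of Eq. (2) restricted to the
    beam candidates plus the gold word (an assumption, see above)."""
    all_scores = np.append(beam_scores, gold_score)
    m = all_scores.max()                     # stabilize the log-sum-exp
    log_z = m + np.log(np.sum(np.exp(all_scores - m)))
    return float(gold_score - log_z)
```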

  4. [Figure: Predicting the next word in the sequence 'the cat'. The beam search uses two beams over three steps and produces the words 'sat' and 'sot' in the top beams. Each beam decodes characters from the word-level hidden state h_{i=2}, projecting and taking an argmax over V_c at every step, and ends with a word score q = s_w(w, h_{i=2}).]
     ◮ Beam search in the backward pass as well (see the sketch after this slide):

        J = \sum_{i=1}^{n} \left( -s_w(w_i, h_i) + \sum_{w' \in B_{\text{top}}(i)} s_w(w', h_i) \right)    (3)
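
Taking Eq. (3) at face value, the objective rewards the gold word's score and penalizes the summed scores of the words surviving in the top beams; if the intended objective is instead a beam-approximated log-likelihood, the inner sum would be a log-sum-exp as in the sketch above. A sketch with hypothetical inputs:

```python
import numpy as np

def beam_loss(gold_scores: list[float], beam_scores: list[np.ndarray]) -> float:
    """J = sum_i ( -s_w(w_i, h_i) + sum_{w' in B_top(i)} s_w(w', h_i) ),
    with gold_scores[i] and beam_scores[i] as hypothetical stand-ins."""
    return sum(-g + float(np.sum(b)) for g, b in zip(gold_scores, beam_scores))
```

Minimizing J pushes the gold word's score up relative to the competing beam words at every position.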

  5. Experiments
     ◮ Toy problem of generating word forms given word embeddings
     ◮ Compare to an LSTM baseline
     ◮ Test accuracy across different score functions (average character score, average character probability, hidden-state score); see the sketch after this list
     ◮ Test accuracy across different beam sizes
     ◮ Eventually a full language model
     ◮ This model has a dynamic vocabulary at every step
     ◮ New evaluation metric for open-vocabulary language models
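
The three score functions named above might look as follows; this is a sketch under the assumption that each beam candidate carries its per-step character scores, its per-character probabilities, and a final decoder hidden state. All names and signatures are illustrative, not taken from the slides.

```python
import numpy as np

# Length-agnostic ways to score a candidate word for s_w(w, h_i);
# all inputs are hypothetical stand-ins.

def avg_char_score(step_scores: np.ndarray) -> float:
    """Average raw character score along the word (a mean, not a sum,
    so words of different lengths stay comparable)."""
    return float(np.mean(step_scores))

def avg_char_prob(step_probs: np.ndarray) -> float:
    """Average per-character probability (averaging log-probabilities
    would be another reasonable reading)."""
    return float(np.mean(step_probs))

def hidden_state_score(final_hidden: np.ndarray, v: np.ndarray) -> float:
    """Read a score off the decoder's final hidden state with a learned
    vector v; word length never enters explicitly."""
    return float(v @ final_hidden)
```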
