

  1. Token-level and sequence-level loss smoothing for RNN language models. Maha Elbayad (1,2), Laurent Besacier (1), and Jakob Verbeek (2). (1) LIG, (2) INRIA, Grenoble, France. ACL 2018, Melbourne, Australia.

  2. Language generation | Equivalence in the target space
• Ground-truth sequences lie in a union of low-dimensional subspaces where sequences convey the same message.
  ◮ France won the world cup for the second time.
  ◮ France captured its second world cup title.
• Some words in the vocabulary share the same meaning.
  ◮ Capture, conquer, win, gain, achieve, accomplish, ...

  3. Contributions
Take into consideration the nature of the target language space with:
• A token-level smoothing for “robust” multi-class classification.
• A sequence-level smoothing to explore relevant alternative sequences.

  4. Maximum likelihood estimation (MLE)
For a pair $(x, y)$, we model the conditional distribution:
$$p_\theta(y \mid x) = \prod_{t=1}^{|y|} p_\theta(y_t \mid y_{<t}, x) \quad (1)$$
Given the ground-truth target sequence $y^\star$:
$$\ell_{\text{MLE}}(y^\star, x) = -\ln p_\theta(y^\star \mid x) = D_{\text{KL}}\big(\delta(y \mid y^\star) \,\|\, p_\theta(y \mid x)\big) \quad (2)$$
$$= \sum_{t=1}^{|y^\star|} D_{\text{KL}}\big(\delta(y_t \mid y^\star_t) \,\|\, p_\theta(y_t \mid y^\star_{<t}, x)\big) \quad (3)$$
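Since the target $\delta$ is a one-hot distribution, each per-token KL term in Eq. (3) reduces to the familiar cross-entropy $-\log p_\theta(y^\star_t \mid y^\star_{<t}, x)$. A minimal NumPy sketch (function and argument names are mine, not from the paper):

```python
# With a one-hot (delta) target, per-token KL = cross-entropy on y*_t.
import numpy as np

def mle_loss(token_log_probs, target_ids):
    """token_log_probs: (T, V) array of log p_theta(y_t = v | y*_{<t}, x).
    target_ids: length-T array of ground-truth token indices y*_t."""
    return -token_log_probs[np.arange(len(target_ids)), target_ids].sum()
```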

  5. Maximum likelihood estimation (MLE)
$$\ell_{\text{MLE}}(y^\star, x) = -\ln p_\theta(y^\star \mid x) = D_{\text{KL}}\big(\delta(y \mid y^\star) \,\|\, p_\theta(y \mid x)\big) \quad (2)$$
$$= \sum_{t=1}^{T} D_{\text{KL}}\big(\delta(y_t \mid y^\star_t) \,\|\, p_\theta(y_t \mid h_t)\big) \quad (3)$$
Issues:
• Zero-one loss: all outputs $y \neq y^\star$ are treated equally.
• Discrepancy at the sentence level between the training objective (1-gram) and the evaluation metric (4-gram).

  6. Loss smoothing
Smooth the delta target $\delta(y^\star)$ into a reward distribution $r_\tau(y \mid y^\star)$:
$$\ell^{\text{seq}}_{\text{RAML}}(y^\star, x) = D_{\text{KL}}\big(r_\tau(y \mid y^\star) \,\|\, p_\theta(y \mid x)\big) \quad \text{(Norouzi et al., 2016)}$$
in place of $D_{\text{KL}}\big(\delta(y \mid y^\star) \,\|\, p_\theta(y \mid x)\big)$.

  7. Loss smoothing
Likewise at the token level, smooth $\delta(y^\star_t)$ into $r_\tau(y_t \mid y^\star_t)$:
$$\ell^{\text{tok}}_{\text{RAML}}(y^\star, x) = \sum_{t=1}^{T} D_{\text{KL}}\big(r_\tau(y_t \mid y^\star_t) \,\|\, p_\theta(y_t \mid h_t)\big)$$
in place of $\sum_{t=1}^{T} D_{\text{KL}}\big(\delta(y_t \mid y^\star_t) \,\|\, p_\theta(y_t \mid h_t)\big)$.

  8. Token-level smoothing

  9. Loss smoothing | Token-level
$$\ell^{\text{tok}}_{\text{RAML}}(y^\star, x) = \sum_{t=1}^{T} D_{\text{KL}}\big(r_\tau(y_t \mid y^\star_t) \,\|\, p_\theta(y_t \mid h_t)\big) \quad (4)$$
• Uniform label smoothing over all words in the vocabulary: $r_\tau(y_t \mid y^\star_t) = \delta(y_t \mid y^\star_t) + \tau \cdot u(\mathcal{V})$ (Szegedy et al., 2016).
• We can leverage word co-occurrence statistics to build a non-uniform, “meaningful” distribution.
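A minimal NumPy sketch of the uniform smoothing above (names are mine; note the mixture is renormalized here so that it sums to one):

```python
# Uniform label smoothing: r = delta(y*_t) + tau * u(V), then renormalized.
import numpy as np

def uniform_smoothed_target(vocab_size, target_id, tau):
    r = np.full(vocab_size, tau / vocab_size)  # tau * u(V)
    r[target_id] += 1.0                        # delta(y_t | y*_t)
    return r / r.sum()                         # normalize to a distribution
```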

  10. Loss smoothing | Token-level
$$\ell^{\text{tok}}_{\text{RAML}}(y^\star, x) = \sum_{t=1}^{T} D_{\text{KL}}\big(r_\tau(y_t \mid y^\star_t) \,\|\, p_\theta(y_t \mid h_t)\big) \quad (4)$$
Prerequisite: a word embedding $w$ in the target space (e.g. GloVe) and a distance $d$.
$$r_\tau(y_t \mid y^\star_t) = \frac{1}{Z} \exp\Big(\frac{-d(w(y_t), w(y^\star_t))}{\tau}\Big),$$
with a temperature $\tau$ such that $r_\tau \xrightarrow{\tau \to 0} \delta$, and $Z$ such that $\sum_{y_t \in \mathcal{V}} r_\tau(y_t \mid y^\star_t) = 1$.
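A sketch of this embedding-based smoothing, assuming a pre-trained `(V, D)` embedding matrix and taking Euclidean distance as one possible choice of $d$ (names are mine):

```python
# r_tau(y_t | y*_t) ∝ exp(-d(w(y_t), w(y*_t)) / tau) over the vocabulary.
import numpy as np

def embedding_smoothed_target(embeddings, target_id, tau):
    # d(w(y_t), w(y*_t)) for every word in the vocabulary.
    d = np.linalg.norm(embeddings - embeddings[target_id], axis=1)
    logits = -d / tau
    logits -= logits.max()      # numerical stability before exponentiating
    r = np.exp(logits)
    return r / r.sum()          # Z normalizes over the vocabulary
```

As $\tau \to 0$ the distribution collapses onto the target token, recovering the MLE delta.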

  11. Loss smoothing | Token-level
[Figure: token-level smoothing distributions $r_\tau$ at temperatures $\tau = 0.12$ and $\tau = 0.70$.]

  12. Loss smoothing | Token-level
$$\ell^{\text{tok}}_{\text{RAML}}(y^\star, x) = \sum_{t=1}^{T} D_{\text{KL}}\big(r_\tau(y_t \mid y^\star_t) \,\|\, p_\theta(y_t \mid h_t)\big) \quad (4)$$
$$= \sum_{t=1}^{T} \sum_{y_t \in \mathcal{V}} r_\tau(y_t \mid y^\star_t) \log \frac{r_\tau(y_t \mid y^\star_t)}{p_\theta(y_t \mid h_t)} \quad (5)$$
We can compute the exact KL divergence for every target token; no approximation needed.
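Because the support of $r_\tau$ is just the vocabulary, Eq. (5) is a finite sum. A NumPy sketch of the exact per-token KL (my own naming):

```python
# Exact token-level KL: sum_t sum_v r(v) * (log r(v) - log p(v)).
import numpy as np

def token_level_kl(r, log_p):
    """r: (T, V) smoothed targets r_tau(y_t | y*_t).
    log_p: (T, V) model log-probabilities log p_theta(y_t | h_t)."""
    safe_log_r = np.log(np.clip(r, 1e-12, None))  # avoid log(0); 0*log(0) -> 0
    return float(np.sum(r * (safe_log_r - log_p)))
```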

  13. Sequence-level smoothing

  14. Loss smoothing | Sequence-level
$$\ell^{\text{seq}}_{\text{RAML}}(y^\star, x) = D_{\text{KL}}\big(r_\tau(y \mid y^\star) \,\|\, p_\theta(y \mid x)\big) \quad (6)$$
Prerequisite: a distance $d$ in the sequence space $\mathcal{V}^n$, $n \in \mathbb{N}$.
$$r_\tau(y \mid y^\star) = \frac{1}{Z} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big), \quad Z \text{ such that } \sum_{y \in \mathcal{V}^n,\, n \in \mathbb{N}} r_\tau(y \mid y^\star) = 1.$$
Possible (pseudo-)distances:
• Hamming
• Edit
• 1 − BLEU
• 1 − CIDEr
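As a concrete instance, a stdlib sketch of the unnormalized reward $\exp(-d(y, y^\star)/\tau)$ with the Hamming pseudo-distance, assuming equal-length sequences (helper names are mine):

```python
# Unnormalized sequence reward for the Hamming distance.
from math import exp

def hamming(y, y_star):
    # Number of positions where the two (equal-length) sequences differ.
    return sum(a != b for a, b in zip(y, y_star))

def seq_reward(y, y_star, tau):
    return exp(-hamming(y, y_star) / tau)
```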

  15. Loss smoothing | Sequence-level
Can we evaluate the partition function $Z$ for a given reward?
$$r_\tau(y \mid y^\star) = \frac{1}{Z} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big), \quad Z = \sum_{y \in \mathcal{V}^n,\, n \in \mathbb{N}} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big)$$
We can approximate $Z$ for the Hamming distance.

  16. Loss smoothing | Sequence-level | Hamming distance
Assumption: consider only sequences of the same length as $y^\star$ (sequences with $|y| \neq |y^\star|$ get zero probability). We partition the set of sequences $\mathcal{V}^T$ w.r.t. their distance to the ground truth $y^\star$:
$$S_d = \{ y \in \mathcal{V}^T \mid d(y, y^\star) = d \}, \quad \mathcal{V}^T = \cup_d S_d, \quad S_d \cap S_{d'} = \emptyset \ \text{for}\ d \neq d'.$$
• The reward is constant within each subset.
• The cardinality of each subset is known: $|S_d| = \binom{T}{d}(|\mathcal{V}| - 1)^d$.
$$Z = \sum_d |S_d| \exp\Big(\frac{-d}{\tau}\Big)$$
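Under this equal-length assumption $Z$ is a short exact sum. A stdlib sketch (names are mine):

```python
# Exact partition function for the Hamming reward:
# Z = sum_d C(T, d) * (|V| - 1)^d * exp(-d / tau).
from math import comb, exp

def hamming_partition(T, vocab_size, tau):
    return sum(comb(T, d) * (vocab_size - 1) ** d * exp(-d / tau)
               for d in range(T + 1))
```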

  17. Loss smoothing | Sequence-level | Hamming distance
We can easily draw from $r_\tau$ with the Hamming distance:
1. Sample a distance $d$ from $\{0, \dots, T\}$ (with probability proportional to $|S_d|\, e^{-d/\tau}$).
2. Pick $d$ positions to change among $\{1, \dots, T\}$.
3. Sample the substitutions from the vocabulary $\mathcal{V}$.

  18. Loss smoothing | Sequence-level | Hamming distance
Sampling from $r_\tau$ as above, we use a Monte Carlo estimate:
$$\ell^{\text{seq}}_{\text{RAML}}(y^\star, x) = D_{\text{KL}}\big(r_\tau(y \mid y^\star) \,\|\, p_\theta(y \mid x)\big) \quad (6)$$
$$= -\mathbb{E}_{r_\tau}[\log p_\theta(\cdot \mid x)] + \text{cst} \quad (7)$$
$$\approx -\frac{1}{L} \sum_{l=1}^{L} \log p_\theta(y^l \mid x), \quad y^l \sim r_\tau \quad (8)$$
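A stdlib sketch of the three-step sampler (names are mine; substitutions are forced to differ from the original token so the resulting sequence is at Hamming distance exactly $d$):

```python
# Draw y ~ r_tau for the Hamming reward.
import random
from math import comb, exp

def sample_hamming(y_star, vocab, tau):
    T = len(y_star)
    # Step 1: draw d with probability proportional to |S_d| * exp(-d/tau).
    weights = [comb(T, d) * (len(vocab) - 1) ** d * exp(-d / tau)
               for d in range(T + 1)]
    d = random.choices(range(T + 1), weights=weights)[0]
    # Step 2: pick d positions to change.
    positions = random.sample(range(T), d)
    # Step 3: substitute each picked position with a different token.
    y = list(y_star)
    for t in positions:
        y[t] = random.choice([w for w in vocab if w != y_star[t]])
    return y
```

Averaging $-\log p_\theta(y^l \mid x)$ over $L$ such draws gives the estimate of Eq. (8).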

  19. Loss smoothing | Sequence-level | Other distances
We cannot “easily” sample from more complicated rewards such as BLEU or CIDEr.
Importance sampling:
$$\ell^{\text{seq}}_{\text{RAML}}(y^\star, x) = -\mathbb{E}_{r_\tau}[\log p_\theta(\cdot \mid x)] \quad (9)$$
$$= -\mathbb{E}_q\Big[\frac{r_\tau}{q} \log p_\theta\Big] \quad (10)$$
$$\approx -\frac{1}{L} \sum_{l=1}^{L} \omega_l \log p_\theta(y^l \mid x), \quad y^l \sim q \quad (11)$$
$$\omega_l \approx \frac{r_\tau(y^l \mid y^\star) / q(y^l \mid y^\star)}{\sum_{k=1}^{L} r_\tau(y^k \mid y^\star) / q(y^k \mid y^\star)}$$
Choose $q$ to be the reward distribution relative to the Hamming distance.
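Since $r_\tau$ for BLEU or CIDEr is known only up to its intractable $Z$, the weights $\omega_l$ are self-normalized, so unnormalized scores suffice. A NumPy sketch (names are mine):

```python
# Self-normalized importance weights of Eq. (11): samples come from the
# tractable Hamming proposal q and are reweighted towards r_tau.
import numpy as np

def importance_weights(r_scores, q_scores):
    """r_scores[l] ∝ r_tau(y^l | y*); q_scores[l] ∝ q(y^l | y*).
    Both may be unnormalized; normalization constants cancel."""
    ratios = np.asarray(r_scores) / np.asarray(q_scores)
    return ratios / ratios.sum()   # omega_l, summing to 1
```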

  20. Loss smoothing | Sequence-level | Support reduction
$$\ell^{\text{seq}}_{\text{RAML}}(y^\star, x) = D_{\text{KL}}\big(r_\tau(y \mid y^\star) \,\|\, p_\theta(y \mid x)\big) \quad (6)$$
Can we reduce the support of $r_\tau$?
$$r_\tau(y \mid y^\star) = \frac{1}{Z} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big), \quad Z = \sum_{y \in \mathcal{V}^T} \exp\Big(\frac{-d(y, y^\star)}{\tau}\Big)$$
Reduce the support from $\mathcal{V}^{|y^\star|}$ to $\mathcal{V}_{\text{sub}}^{|y^\star|}$ where $\mathcal{V}_{\text{sub}} \subset \mathcal{V}$:
• $\mathcal{V}_{\text{sub}} = \mathcal{V}_{\text{batch}}$: tokens occurring in the SGD mini-batch.
• $\mathcal{V}_{\text{sub}} = \mathcal{V}_{\text{refs}}$: tokens occurring in the available references.
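In practice this amounts to restricting the substitution vocabulary before sampling. A small sketch of building $\mathcal{V}_{\text{batch}}$ (names are mine):

```python
# V_batch: the set of tokens that occur in the current SGD mini-batch.
def restricted_vocab(batch_sequences):
    return sorted({tok for seq in batch_sequences for tok in seq})

# Usage with the earlier Hamming sampler sketch:
# y = sample_hamming(y_star, restricted_vocab(batch), tau)
```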

  21. Loss smoothing | Sequence-level | Lazy training
In both settings the loss is the Monte Carlo estimate $\ell^{\text{seq}}_{\text{RAML}}(y^\star, x) \approx -\frac{1}{L} \sum_{l=1}^{L} \log p_\theta(y^l \mid x)$; they differ in how each sample $y^l$ is used.
• Default training: $y^l$ is (1) forwarded in the RNN and (2) used as target, i.e. we score $\log p_\theta(y^l \mid y^l_{<t}, x)$.
• Lazy training: $y^l$ is (1) not forwarded in the RNN, only (2) used as target; it is scored against the ground-truth hidden states, i.e. $\log p_\theta(y^l \mid y^\star_{<t}, x)$.

  22. Loss smoothing | Sequence-level | Lazy training
Complexity: $O(2L\lambda)$ for default training vs. $O((L+1)\lambda)$ for lazy training, where $\lambda = |y| \cdot |\theta_{\text{cell}}|$ and $\theta_{\text{cell}}$ are the RNN cell parameters.
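A minimal NumPy sketch of the lazy variant (my own naming): the ground truth is forwarded once, and the resulting per-step log-probabilities score every sampled sequence.

```python
# Lazy sequence-level smoothing: one forward pass over y* yields
# log p(. | y*_{<t}, x); samples y^l serve only as targets.
import numpy as np

def lazy_smoothed_loss(gt_log_probs, samples):
    """gt_log_probs: (T, V) log-probabilities from forwarding y* once.
    samples: (L, T) integer array of sequences y^l ~ r_tau.
    Returns -1/L * sum_l log p_theta(y^l | y*_{<t}, x)."""
    T = gt_log_probs.shape[0]
    # ll[l, t] = log p(y^l_t | y*_{<t}, x), gathered by broadcasting.
    ll = gt_log_probs[np.arange(T), samples]
    return -ll.sum(axis=1).mean()
```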

  23. Experiments
