Probabilistic FastText for Multi-Sense Word Embeddings - PowerPoint PPT Presentation



1. Probabilistic FastText for Multi-Sense Word Embeddings
   Ben Athiwaratkun, Andrew Gordon Wilson, Anima Anandkumar

2. 2-Min Summary: Probabilistic FastText = FastText + Gaussian Mixture Embeddings
   • Gaussian Mixture Embeddings: words as probability densities; each word is a Gaussian mixture density.
   • Disentangled meanings: separate mixture components capture separate senses.
   [Figure: 2-D embedding space with the two component means of "rock", μ_rock,0 near stone and basalt, μ_rock,1 near music, pop, and jazz.]

3. 2-Min Summary (continued)
   • FastText word embeddings: word vectors are derived from subword (character n-gram) vectors.
   • State of the art on many benchmarks, especially RareWord.
   • Character-based models allow estimating vectors of unseen words and enhancing representations of rare words.
   [Figure: the mixture plot from slide 2, plus the FastText diagram in which ρ_abnormal is built from n-gram vectors z_ab, ..., z_abnor, z_abnorm.]

4. 2-Min Summary (continued): Probabilistic FastText (PFT) = Gaussian Mixture Embeddings + FastText
   [Figure: the two diagrams combined: the multi-sense mixture for "rock" (μ_rock,0, μ_rock,1) together with the FastText subword diagram for ρ_abnormal.]
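
The summary above treats each word as a Gaussian mixture density over the embedding space. As a minimal sketch of that idea (not the authors' code; the 2-D means, weights, and variance below are toy values), evaluating such a spherical-covariance mixture looks like this:

```python
import numpy as np

def mixture_density(x, means, weights, sigma2):
    """Evaluate a spherical-covariance Gaussian mixture density at point x."""
    d = means.shape[1]
    norm = (2.0 * np.pi * sigma2) ** (-d / 2.0)           # per-component normalizer
    sq_dists = np.sum((means - x) ** 2, axis=1)           # ||x - mu_k||^2 for each component
    return float(np.sum(weights * norm * np.exp(-sq_dists / (2.0 * sigma2))))

# Toy 2-D example: "rock" with a geology sense and a music sense.
mu_rock = np.array([[1.0, 0.0],    # mu_rock,0: near "stone", "basalt"
                    [0.0, 1.0]])   # mu_rock,1: near "jazz", "pop"
p_rock = np.array([0.5, 0.5])      # mixture weights
print(mixture_density(np.array([0.9, 0.1]), mu_rock, p_rock, sigma2=0.25))
```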

5. Probabilistic FastText
   • Able to estimate distributions of unseen words: a dictionary-based lookup L[·] has an entry for "cool" but none for "coolz" or "coolzz" (L["coolz"] = ?, L["coolzz"] = ?), whereas the character-based probabilistic embedding f(·) is defined for any string (f("cool"), f("coolz"), f("coolzz")).
   [Figure: dictionary-based embeddings vs character-based probabilistic embeddings, alongside the "rock" mixture plot.]
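
A sketch of that contrast (toy 4-D vectors, hypothetical table names, trigrams only): a dictionary lookup has no entry for an unseen word, while a character-based embedding composes one from shared subword units.

```python
import numpy as np

rng = np.random.default_rng(0)
L = {"cool": rng.normal(size=4)}    # dictionary-based embeddings: fixed lookup table
z = {}                              # character trigram vectors, created on demand

def f(word):
    """Character-based embedding: average of the word's trigram vectors."""
    trigrams = [word[i:i + 3] for i in range(len(word) - 2)]   # "coo", "ool", "olz", ...
    for g in trigrams:
        if g not in z:
            z[g] = rng.normal(size=4)
    return np.mean([z[g] for g in trigrams], axis=0)

print(L.get("coolz"))           # None: no representation for an unseen word
print(f("coolz").round(2))      # defined, and shares "coo"/"ool" with f("cool")
```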

6. Probabilistic FastText (continued)
   • High semantic quality for rare words via root (subword) sharing.
   • Spearman correlation on the RareWord dataset:

       Model      Spearman correlation (RareWord)
       w2gm       0.43
       FastText   0.48
       PFT        0.49

7. Probabilistic FastText (continued)
   • Disentangled meanings: each mixture component has its own nearest neighbors.

       Word   Component   Nearest neighbors (cosine similarity)
       rock   0           rocks:0, rocky:0, mudrock:0, rockscape:0
       rock   1           punk:0, punk-rock:0, indie:0, pop-rock:0

8. Probabilistic FastText (continued)
   • Applicable to foreign languages without any changes in model hyperparameters. Italian example:

       Word      Component / Meaning      Nearest neighbors (English translation)
       secondo   0 / "2nd"                secondo (2nd), terzo (3rd), quinto (5th), primo (first)
       secondo   1 / "according to"       conformità (compliance), attenendosi (following), cui (which)

9. Vector Embeddings & FastText

10. Word Embeddings
   • One-hot vector: dimension = size of vocabulary (~millions), a single 1 and zeros elsewhere.
   • Dense representation: dimension ~50-1000, real-valued entries.
   • Examples: word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014).
   [Figure: one-hot vs dense vectors for words such as abnormal, abnormality, normal, modulation, harmonics, amplitude.]
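
As a toy contrast between the two representations (hypothetical six-word vocabulary, random dense values chosen only for illustration):

```python
import numpy as np

vocab = ["abnormal", "abnormality", "normal", "modulation", "harmonics", "amplitude"]

one_hot = np.zeros(len(vocab))        # dimension = vocabulary size (millions in practice)
one_hot[vocab.index("normal")] = 1.0  # a single 1, zeros elsewhere

dense = np.random.default_rng(0).normal(size=(len(vocab), 50))   # ~50-1000 real-valued dims
print(one_hot, dense[vocab.index("normal")].shape)
```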

11. Dense Representation of Words
   • Meaningful nearest neighbors, e.g. vindicates → vindicate, exculpate, absolve, exonerate; modulation → modulations, harmonics, amplitude.
   • Relationship deduction from vector arithmetic, e.g. China - Beijing ≈ Japan - Tokyo (Mikolov et al., 2013).
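
A sketch of the relationship-deduction example: the vector China - Beijing + Tokyo should land near Japan. The 3-D vectors below are made-up toys chosen so the analogy holds; a trained model learns this structure from data.

```python
import numpy as np

emb = {
    "China":   np.array([1.0, 1.0, 0.0]),
    "Beijing": np.array([1.0, 0.0, 0.0]),
    "Japan":   np.array([0.0, 1.0, 1.0]),
    "Tokyo":   np.array([0.0, 0.0, 1.0]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["China"] - emb["Beijing"] + emb["Tokyo"]   # "capital-of" offset applied to Tokyo
best = max((w for w in emb if w != "Tokyo"), key=lambda w: cos(query, emb[w]))
print(best)   # "Japan" with these toy vectors
```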

12. Char-Model: Subword Representation (FastText, Bojanowski et al., 2017)
   • Word representation = average of the word vector and its character n-gram vectors:
       ρ_w = (1 / (|NG_w| + 1)) · (v_w + Σ_{g ∈ NG_w} z_g)
   • Automatic semantic extraction of stems, prefixes, and suffixes.
   • Example: w = <abnormal>, N-grams(w) = {<ab, abn, ..., <abn, abnor, ...}
   [Figure: ρ_abnormal built from n-gram vectors z_ab, ..., z_abnor, z_abnorm.]
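
A minimal sketch of this representation (toy random vector tables, and the 3-6 character n-gram range is an assumption, not stated on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 10
v = {}   # dictionary-level word vectors v_w
z = {}   # character n-gram vectors z_g

def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of <word> with boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

def rho(word):
    """FastText representation: rho_w = (v_w + sum_g z_g) / (|NG_w| + 1)."""
    grams = char_ngrams(word)
    v_w = v.setdefault(word, rng.normal(size=dim))
    z_sum = sum(z.setdefault(g, rng.normal(size=dim)) for g in grams)
    return (v_w + z_sum) / (len(grams) + 1)

print(char_ngrams("abnormal")[:4])   # ['<ab', 'abn', 'bno', 'nor']
print(rho("abnormal").shape)         # (10,)
```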

13. Char-Model: Subword Representation (continued)
   • The contribution of each n-gram can be inspected via the cosine similarity between the word representation ρ_w and each n-gram vector z_g (shown on the slide for 'abnor' and 'abnorm').

14. Subword Contribution to Overall Semantics
   • Similar n-grams have high contribution; similar words (e.g. abnormal, abnormality) have similar semantics.
   [Figure: cosine similarity between n-gram vectors and the mean vectors of 'abnormal' and 'abnormality'.]

15. FastText with word2gm
   • Augment the Gaussian mixture representation with character structure (FastText): component 0 of each word's mixture uses the subword-based mean
       ρ_w^{(0)} = (1 / (|NG_w| + 1)) · (v_w^{(0)} + Σ_{g ∈ NG_w} z_g^{(0)})
   • Promote independence of senses: the other components use dictionary-level vectors (e.g. μ_rock,1 = v_rock^{(1)}, μ_pop,1 = v_pop^{(1)}).
   [Figure: 'pop rock' example with μ_rock,0 = ρ_rock^{(0)}, μ_pop,0 = ρ_pop^{(0)}, μ_rock,1 = v_rock^{(1)}, μ_pop,1 = v_pop^{(1)}.]
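
A sketch of that combination (toy random parameters, trigrams only for brevity, container names are mine): component 0 gets a subword-based mean, the remaining components stay dictionary-level.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, K = 10, 2
v = {}   # per-word dictionary vectors, one per component: shape (K, dim)
z = {}   # character n-gram vectors z_g, shared across words

def component_means(word):
    w = f"<{word}>"
    grams = [w[i:i + 3] for i in range(len(w) - 2)]          # trigrams only, for brevity
    vw = v.setdefault(word, rng.normal(size=(K, dim)))
    z_sum = sum(z.setdefault(g, rng.normal(size=dim)) for g in grams)
    mu = vw.copy()
    mu[0] = (vw[0] + z_sum) / (len(grams) + 1)               # subword-based mean for component 0
    return mu                                                # components 1..K-1 stay dictionary-level

print(component_means("rock").shape)   # (2, 10): mu_rock,0 and mu_rock,1
```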

16. Similarity Score (Energy) Between Distributions
   [Figure: similarity as an inner product, illustrated in vector space and in function space.]

17. Energy of Two Gaussian Mixtures
   • Closed form: total energy = weighted sum of pairwise partial energies (weighted by the mixture probabilities).
   • Simplified partial energy: ξ_{i,j} = -(α/2) ||μ_{f,i} - μ_{g,j}||²
   [Figure: the four partial energies ξ_{0,0}, ξ_{0,1}, ξ_{1,0}, ξ_{1,1} between the components of 'rock' (rock:0 ~ funk, pop-rock, band; rock:1 ~ basalt, boulder, sand) and 'pop' (pop:0 ~ bang, crack, snap; pop:1 ~ jazz, punk, indie).]
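
A sketch of this closed-form similarity, following the slide's description (toy means, uniform weights, and α are made-up values); it combines the partial energies ξ_{i,j} = -(α/2)||μ_{f,i} - μ_{g,j}||² under the mixture weights and returns the result in log space:

```python
import numpy as np

def log_energy(mu_f, p_f, mu_g, p_g, alpha=1.0):
    """log( sum_{i,j} p_i q_j exp(xi_{i,j}) ) with xi_{i,j} = -(alpha/2) ||mu_f,i - mu_g,j||^2."""
    sq = np.sum((mu_f[:, None, :] - mu_g[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    xi = -0.5 * alpha * sq                                             # partial energies
    w = p_f[:, None] * p_g[None, :]                                    # mixture-probability weights
    return float(np.log(np.sum(w * np.exp(xi))))

mu_rock = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy means: geology sense, music sense
mu_pop  = np.array([[2.0, 2.0], [0.1, 1.1]])   # toy means: "bang/crack" sense, music sense
p = np.array([0.5, 0.5])
print(log_energy(mu_rock, p, mu_pop, p))        # dominated by the music-music pair
```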

18. Word Sampling
   • Training pairs are sampled with a sliding context window: in "I like that rock band", the center word w_i is paired with the surrounding context words w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}.
   • Dataset: ukWaC + WaCkypedia (3.5 billion tokens).
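
A sketch of sampling (word, context) pairs with a symmetric window over the example sentence (the window size of 2 here is an assumption for illustration):

```python
def context_pairs(tokens, window=2):
    """Yield (center word, context word) pairs within a symmetric window."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((w, tokens[j]))
    return pairs

print(context_pairs("I like that rock band".split())[:4])
# [('I', 'like'), ('I', 'that'), ('like', 'I'), ('like', 'that')]
```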

19. Loss Function: Energy-Based Max Margin
   • A true pair of word and context word, e.g. w = "rock" with c = "band", should get high energy E(w, c); a pair with a negatively sampled word, e.g. c' = "dog", should get low energy E(w, c').
   • Minimize the max-margin objective over these pairs.
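
A sketch of an energy-based max-margin objective as described on this slide: the true pair's energy must exceed the negative pair's energy by at least the margin m. The energies here are plain numbers; in the model they come from the Gaussian-mixture energy above.

```python
def max_margin_loss(energy_pos, energy_neg, m=1.0):
    """max(0, m - E(w, c) + E(w, c'))."""
    return max(0.0, m - energy_pos + energy_neg)

print(max_margin_loss(energy_pos=-0.5, energy_neg=-3.0))   # 0.0: already separated by more than m
print(max_margin_loss(energy_pos=-2.0, energy_neg=-1.5))   # 1.5: the negative pair scores higher
```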

20. Multimodal Representation: Mixture of Gaussians
   • Each word is a mixture of Gaussians whose subword-based mean follows
       ρ_w = (1 / (|NG_w| + 1)) · (v_w + Σ_{g ∈ NG_w} z_g)
   • Model parameters: dictionary vectors {v_w^{(i)}}_{i=1..K} for each word w, and character n-gram vectors {z_g}.
   • Model hyperparameters: α (covariance scale) and m (margin).
   [Figure: mixture components for ROCK, STONE, JAZZ in the embedding space.]
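
A sketch of that parameter layout (the sizes and container names below are toy assumptions): per-word dictionary vectors for each of the K components, a shared n-gram table, and the two fixed hyperparameters.

```python
import numpy as np

V, B, K, dim = 1_000, 50_000, 2, 100   # toy sizes: vocab, n-gram table, components, dimensions
rng = np.random.default_rng(0)

params = {
    "dict_vectors":  rng.normal(scale=0.1, size=(V, K, dim)),   # {v_w^(i)}, i = 1..K, per word
    "ngram_vectors": rng.normal(scale=0.1, size=(B, dim)),      # {z_g}, shared n-gram table
}
hyper = {"alpha": 1.0, "m": 1.0}   # covariance scale and margin: fixed, not learned
```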

21.-23. Training Illustration
   • Mixture of Gaussians over the embedding space (components for ROCK, STONE, JAZZ).
   • Model parameters: dictionary vectors {v_w^{(i)}}_{i=1..K} for each word w, and character n-gram vectors {z_g}.
   • Train with the max-margin objective using minibatch SGD (AdaGrad).
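
A sketch of the optimization named on these slides: minibatch SGD with per-coordinate AdaGrad updates on the max-margin objective. Only the AdaGrad step is shown; the placeholder gradient stands in for backpropagating the margin loss through the mixture energy and is not the authors' implementation.

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.05, eps=1e-8):
    """Per-coordinate AdaGrad update; `accum` is the running sum of squared gradients."""
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum

# Toy usage on a single parameter matrix.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3))
accum = np.zeros_like(w)
for step in range(3):
    grad = rng.normal(size=w.shape)        # placeholder gradient of the margin loss
    w, accum = adagrad_step(w, grad, accum)
print(w.shape)
```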

24. Evaluation

25. Qualitative Evaluation: Nearest Neighbors
   [Figure: nearest neighbors of the two components of 'rock': basalt and stone near one component, jazz and pop near the other.]
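
A sketch of how such a nearest-neighbor query can be run: rank every other word's mixture components by cosine similarity to the queried component mean. The embedding table below is a random toy stand-in, purely to show the query mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
words, K, dim = ["rock", "stone", "basalt", "jazz", "pop"], 2, 8
means = {w: rng.normal(size=(K, dim)) for w in words}      # component means per word

def nearest(word, comp, topn=3):
    """Top-n (word:component) neighbors of `word`'s component `comp` by cosine similarity."""
    q = means[word][comp]
    scores = []
    for w in words:
        if w == word:
            continue
        for k in range(K):
            v = means[w][k]
            cos = q @ v / (np.linalg.norm(q) * np.linalg.norm(v))
            scores.append((cos, f"{w}:{k}"))
    return [name for _, name in sorted(scores, reverse=True)[:topn]]

print(nearest("rock", 0))   # with trained embeddings: stone/basalt-like senses
print(nearest("rock", 1))   # with trained embeddings: jazz/pop-like senses
```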
