PROBABILISTIC FASTTEXT FOR MULTI-SENSE WORD EMBEDDINGS
Ben Athiwaratkun, Andrew Gordon Wilson, Anima Anandkumar
2-MIN SUMMARY

Probabilistic FastText (PFT) = FastText + Gaussian Mixture Embeddings

Gaussian Mixture Embeddings
• Words as probability densities: each word = a Gaussian mixture density
• Disentangled meanings: "rock" gets one component mean μ_rock,0 near stone/basalt and another μ_rock,1 near music/pop/jazz

FastText
• Word embeddings: word vectors are derived from subword vectors, e.g. "abnormal" from z_ab, ..., z_abnor, z_abnorm, z_norm, ...
• State of the art on many benchmarks, especially RareWord
• Character-based models allow estimating vectors of unseen words and enhancing vectors of rare words
PROBABILISTIC FASTTEXT

• Able to estimate distributions of unseen words: a dictionary-based lookup gives L["cool"] but L["coolz"] = ? and L["coolzz"] = ?, while a character-based probabilistic embedding yields f("cool"), f("coolz"), and f("coolzz")

• High semantic quality for rare words via root sharing
  Spearman correlation on the RareWord dataset:
  w2gm 0.43 | FastText 0.48 | PFT 0.49

• Disentangled meanings
  Word : Component | Nearest neighbors (cosine similarity)
  rock : 0 | rocks:0, rocky:0, mudrock:0, rockscape:0
  rock : 1 | punk:0, punk-rock:0, indie:0, pop-rock:0

• Applicable to foreign languages without any changes in model hyperparameters!
  Word : Component / Meaning | Nearest neighbors (English translation)
  secondo : 0 / "2nd" | secondo (2nd), terzo (3rd), quinto (5th), primo (first)
  secondo : 1 / "according to" | conformità (compliance), attenendosi (following), cui (which)
VECTOR EMBEDDINGS & FASTTEXT
WORD EMBEDDINGS

From one-hot vectors to dense representations:
• one-hot vector: dimension = size of vocabulary (~millions); e.g. "abnormal" = (0, 0, 1, 0, ..., 0)
• dense representation: dimension ~ 50-1000; e.g. "abnormal" = (0.1, 0.2, -0.1, ..., 0.9, 1.2), with related words (abnormality, normal, modulation, harmonics, amplitude) nearby
• word2vec (Mikolov et al., 2013)
• GloVe (Pennington et al., 2014)
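A minimal sketch of the dimensionality contrast, using a toy six-word vocabulary and random stand-in values for the dense vectors (purely illustrative; real dense vectors are learned from data):

```python
import numpy as np

vocab = ["abnormal", "abnormality", "normal", "modulation", "harmonics", "amplitude"]

def one_hot(word):
    # One-hot: dimension = vocabulary size (millions in practice), all zeros
    # except a single 1; every pair of words is orthogonal.
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

# Dense: low dimension (~50-1000); random here purely to show the shape
# difference -- after training, geometric closeness reflects meaning.
rng = np.random.default_rng(0)
dense = {w: rng.normal(size=4) for w in vocab}

print(one_hot("abnormal"))   # sparse, orthogonal to every other word
print(dense["abnormal"])     # dense, similarity becomes meaningful after training
```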
DENSE REPRESENTATION OF WORDS

• Meaningful nearest neighbors: vindicates → vindicate, exculpate, absolve, exonerate; modulation → modulations, harmonics, amplitude
• Relationship deduction from vector arithmetic, e.g. China - Beijing ~ Japan - Tokyo (Mikolov et al., 2013)
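The analogy above can be sketched with toy hand-picked vectors (not trained; with real word2vec or GloVe vectors the same query works over a full vocabulary):

```python
import numpy as np

# Toy vectors chosen so that capitals differ from countries by a shared offset.
emb = {
    "China":   np.array([0.9, 0.1, 0.0]),
    "Beijing": np.array([0.9, 0.1, 0.8]),
    "Japan":   np.array([0.1, 0.9, 0.0]),
    "Tokyo":   np.array([0.1, 0.9, 0.8]),
    "Paris":   np.array([0.5, 0.5, 0.9]),   # distractor
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(query, exclude):
    # Rank the vocabulary by cosine similarity to the query vector.
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], query))

# China - Beijing ~ Japan - Tokyo  =>  Beijing - China + Japan lands near Tokyo.
query = emb["Beijing"] - emb["China"] + emb["Japan"]
print(nearest(query, exclude={"Beijing", "China", "Japan"}))  # -> Tokyo
```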
CHAR-MODEL: SUBWORD REPRESENTATION

FastText (Bojanowski et al., 2017)
• representation = average of n-gram vectors:
  ρ_w = (1 / (|NG_w| + 1)) (v_w + Σ_{g ∈ NG_w} z_g)
• automatic semantic extraction of stems/prefixes/suffixes
• w = <abnormal> ⇒ N-grams(w) ⊇ {<ab, abn, ..., <abn, abnor, ...}
• ρ_w · z_g: cosine similarity between the word vector and its n-gram vectors (e.g. 'abnor', 'abnorm')
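A minimal sketch of the subword averaging, assuming a plain dictionary of n-gram vectors (real FastText hashes n-grams into a fixed-size bucket table; the vectors here are random stand-ins for learned ones):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
v, z = {}, {}   # dictionary word vectors v_w and char n-gram vectors z_g

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' with boundary markers, FastText-style."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def lookup(table, key):
    # Stand-in for a learned embedding table: initialize on first access.
    if key not in table:
        table[key] = rng.normal(size=dim)
    return table[key]

def fasttext_vector(word):
    # rho_w = (v_w + sum_{g in NG_w} z_g) / (|NG_w| + 1)
    grams = char_ngrams(word)
    total = lookup(v, word) + sum(lookup(z, g) for g in grams)
    return total / (len(grams) + 1)

print(char_ngrams("abnormal")[:4])   # ['<ab', 'abn', 'bno', 'nor']
rho = fasttext_vector("abnormal")
```

Because the n-gram table is shared across the vocabulary, an unseen word still maps to a vector: its n-grams have been trained via other words with the same roots.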
SUBWORD CONTRIBUTION TO OVERALL SEMANTICS

• Similar n-grams have high contribution (cosine similarity between n-gram vectors and mean vectors)
• Similar words (e.g. abnormal, abnormality) have similar semantics
FASTTEXT WITH W2GM

μ_pop,0 = ρ_pop (subword-based), μ_pop,1 = v_pop^(1); likewise μ_rock,0 = ρ_rock, μ_rock,1 = v_rock^(1)

μ_{w,0} = (1 / (|NG_w| + 1)) (v_w + Σ_{g ∈ NG_w} z_g)

• Augment the Gaussian mixture representation with character structure (FastText)
• Promote independence: use dictionary-level vectors for the other components
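A sketch of the component means under this scheme, assuming K = 2 components and random stand-in parameters: component 0 averages a dictionary vector with shared subword vectors, while the remaining components stay dictionary-only, encouraging them to capture a distinct sense:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, K = 8, 2            # embedding dimension, mixture components per word
v, z = {}, {}            # dictionary vectors v_{w,i} and n-gram vectors z_g

def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def lookup(table, key):
    # Stand-in for a learned embedding table: initialize on first access.
    if key not in table:
        table[key] = rng.normal(size=dim)
    return table[key]

def component_means(word):
    # Component 0 mixes the dictionary vector with subword vectors
    # (FastText-style); components 1..K-1 are dictionary-only, which
    # promotes independence between the learned senses.
    grams = char_ngrams(word)
    mu0 = (lookup(v, (word, 0)) + sum(lookup(z, g) for g in grams)) \
          / (len(grams) + 1)
    return [mu0] + [lookup(v, (word, i)) for i in range(1, K)]

mus = component_means("rock")    # [mu_rock_0, mu_rock_1]
```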
SIMILARITY SCORE (ENERGY) BETWEEN DISTRIBUTIONS

vector space → function space
ENERGY OF TWO GAUSSIAN MIXTURES

• closed form! total energy = weighted sum of pairwise partial energies ξ_{i,j}
• simplified partial energy: ξ_{i,j} = -(α/2) ||μ_{f,i} - μ_{g,j}||²
• e.g. between "rock" and "pop":
  rock:0 (funk, pop-rock, band) ↔ pop:0 (bang, crack, snap): ξ_{0,0}
  rock:0 ↔ pop:1 (jazz, punk, indie): ξ_{0,1}
  rock:1 (basalt, boulder, sand) ↔ pop:0: ξ_{1,0}
  rock:1 ↔ pop:1: ξ_{1,1}
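A sketch of the energy computation for spherical components, plugging the slide's simplified partial energy into an exponential (an expected-likelihood-style kernel with constants dropped; the exact closed form in the paper also involves the covariances):

```python
import numpy as np

def total_energy(p, mus_f, q, mus_g, alpha=1.0):
    """Weighted sum over component pairs using the simplified partial
    energy xi_{i,j} = -(alpha/2) * ||mu_{f,i} - mu_{g,j}||^2."""
    E = 0.0
    for p_i, mu_f in zip(p, mus_f):
        for q_j, mu_g in zip(q, mus_g):
            xi = -(alpha / 2.0) * float(np.sum((mu_f - mu_g) ** 2))
            E += p_i * q_j * np.exp(xi)
    return E

# Two-component "rock" vs. "pop": only the aligned (music) senses are close,
# so that pair dominates the energy.
rock = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]   # rock:0 music, rock:1 stone
pop  = [np.array([0.1, 1.0]), np.array([3.0, 3.0])]   # pop:0 music, pop:1 far away
p = q = [0.5, 0.5]
print(total_energy(p, rock, q, pop))
```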
WORD SAMPLING

"I like that rock band"
 w_{i-2} w_{i-1} w_i w_{i+1} w_{i+2}

Dataset: ukWaC + WaCkypedia (3.5 billion tokens)
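The windowed sampling can be sketched as follows (the window size and pairing scheme here are illustrative choices):

```python
def context_pairs(tokens, window=2):
    """Generate (center word w_i, context word) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(context_pairs("I like that rock band".split())[:3])
```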
LOSS FUNCTION

Energy-based max margin
• true pair - word w: "rock", context word c: "band" → want high E(w, c)
• negative pair - word w: "rock", negative context c': "dog" → want low E(w, c')
• Minimize the objective
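A sketch of the hinge objective implied by the slide (in the paper the margin m is a hyperparameter and the energies enter in log form):

```python
def max_margin_loss(E_pos, E_neg, m=1.0):
    """Require the energy of a true (word, context) pair to exceed that of
    a negative pair by at least margin m; zero loss once separated."""
    return max(0.0, m - E_pos + E_neg)

print(max_margin_loss(E_pos=5.0, E_neg=1.0))   # 0.0: separated by > m
print(max_margin_loss(E_pos=1.0, E_neg=1.0))   # 1.0: not separated, penalized
```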
MULTIMODAL REPRESENTATION - MIXTURE OF GAUSSIANS

ρ_w = (1 / (|NG_w| + 1)) (v_w + Σ_{g ∈ NG_w} z_g)

Model parameters:
• dictionary vectors {v_{w,i}}, i = 1, ..., K for each word w
• char n-gram vectors {z_g}

Model hyperparameters:
• α, m (covariance scale, margin)
TRAINING - ILLUSTRATION

Mixture of Gaussians: component means (e.g. for ROCK, STONE, JAZZ) move during training

Model parameters:
• dictionary vectors {v_{w,i}}, i = 1, ..., K for each word w
• char n-gram vectors {z_g}

Train with the max margin objective using minibatch SGD (AdaGrad)
EVALUATION
QUALITATIVE EVALUATION - NEAREST NEIGHBORS

(figure: nearest neighbors of the two components of "rock" - stone, basalt vs. jazz, pop)