Mutual Angular Regularization of Latent Variable Models: Theory, Algorithm and Applications
Pengtao Xie
Joint work with Yuntian Deng and Eric Xing
Carnegie Mellon University
Latent Variable Models (LVMs)
[Diagram: latent variable models, a family of machine learning models, extract patterns from data]
Latent Variable Models
• Topic Models: latent topics behind observed words
• Gaussian Mixture Model: latent groups behind observed feature vectors
• Other examples: Hidden Markov Model, Kalman Filtering, Restricted Boltzmann Machine, Deep Belief Network, Factor Analysis, Neural Network, Sparse Coding, Matrix Factorization, Distance Metric Learning, Principal Component Analysis, etc.
Latent Variable Models
• Latent factors behind data are represented as components in LVMs
• Topics in documents (Topic Models): Politics (Obama, Constitution, Government), Economics (GDP, Bank, Marketing), Education (University, Knowledge, Student)
• Groups in images (Gaussian Mixture Model): Tiger, Car, Food
Motivation I: Popularity of latent factors is skewed
• The popularity of latent factors follows a power-law distribution
• Topics in news: dominant topics such as Politics (Obama, Constitution, Government) and Economics (GDP, Bank, Marketing), alongside many long-tail topics
• Groups in Flickr photos: dominant groups such as Furniture (Sofa, Closet, Curtain) and Flower (Rose, Tulip, Lily), alongside many long-tail groups such as Diamond, Painting, Car, Food
Standard LVMs are insufficient to capture long-tail factors
• Latent Dirichlet Allocation (LDA)
  • “Extremely common words tend to dominate all topics” (Wallach, 2009)
  • Tencent Peacock LDA: “When learning ≥ 10^5 topics, around 20% ∼ 40% topics have duplicates in practice” (Wang, 2015)
• Restricted Boltzmann Machine
  • Ran on the 20-Newsgroups dataset
  • Many duplicate topics (e.g., the three exemplar topics below are all about politics)
  • Common words occur repeatedly across topics, such as iraq, clinton, united, weapons

Topic 1: president, clinton, iraq, united, spkr, house, people, lewinsky, government, white
Topic 2: iraq, united, un, weapons, iraqi, nuclear, india, minister, saddam, military
Topic 3: iraq, un, iraqi, lewinsky, saddam, clinton, baghdad, inspectors, weapons, white
Standard LVMs are insufficient to capture long-tail factors
[Figure: latent factors behind the data vs. components learned by the LVM — the components concentrate on dominant factors and miss the long-tail ones]
Long-tail factors are important
• The number of long-tail factors is large
• Long-tail factors are more important than dominant factors in some applications
• Example: Tencent applied topic models to advertising and showed that long-tail topics such as “lose weight” and “nursing” improve click-through rate by 40% (Jin, 2015)
Diversification
[Figure: with diversification, the components in the LVM spread out to cover both dominant and long-tail latent factors behind the data]
Motivation II: Tradeoff induced by the number of components K
• Tradeoff between expressiveness and complexity
  • Small K: low expressiveness, low complexity
  • Large K: high expressiveness, high complexity
• Can we achieve the best of both worlds: a small K with high expressiveness and low complexity?
Reduce model complexity without sacrificing expressiveness
• Use the components to capture the principal directions of the data point cloud
[Figure: data samples and LVM components, without diversification vs. with diversification]
Mutual Angular Regularization of LVMs
• Goal: encourage the components to diversely spread out, in order to (1) improve the coverage of long-tail latent factors and (2) reduce model complexity without compromising expressiveness
• Approach:
  • Define a score based on mutual angles to measure the diversity of components
  • Use the score to regularize latent variable models and control the geometry of the latent space during learning
Outline
• Mutual Angular Regularizer
• Algorithm
• Applications
• Theory
Mutual Angular Regularizer
• Components are parametrized by vectors
  • In Latent Dirichlet Allocation, each topic is a multinomial vector
  • In Sparse Coding, each dictionary item is a real-valued vector
• Measure the dissimilarity between two vectors
• Measure the diversity of a vector set
Dissimilarity between two vectors
• Desired property: invariant to scale, translation, rotation and orientation of the two vectors
• Euclidean distance, L1 distance: the distance d is variant to scale
• Negative cosine similarity: the score a is variant to orientation (flipping one vector changes a = 0.6 to a = −0.6)
Dissimilarity between two vectors
• Non-obtuse angle θ between two vectors
• Invariant to scale, translation, rotation and orientation of the two vectors
• Definition: θ(x, y) = arccos( |x·y| / (‖x‖ ‖y‖) )
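A minimal numpy sketch of this dissimilarity measure (the function name `non_obtuse_angle` is mine, not from the slides). It checks the invariances claimed above: scaling or flipping a vector leaves the angle unchanged, unlike Euclidean distance or plain cosine similarity.

```python
import numpy as np

def non_obtuse_angle(x, y):
    """Non-obtuse angle between x and y: arccos(|x . y| / (||x|| ||y||)), in [0, pi/2]."""
    cos = abs(np.dot(x, y)) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos, 0.0, 1.0))  # clip guards against round-off

x = np.array([1.0, 2.0, 0.5])
y = np.array([-0.3, 1.0, 2.0])

print(non_obtuse_angle(x, y))        # some angle in [0, pi/2]
print(non_obtuse_angle(3.0 * x, y))  # unchanged: invariant to scale
print(non_obtuse_angle(-x, y))       # unchanged: invariant to orientation (sign flip)
```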
Measure the diversity of a vector set
• Based on the pairwise dissimilarity measure between vectors
• The diversity of a set of vectors A = {a_i}_{i=1..K} is defined as the Mutual Angular Regularizer:

Ω(A) = [1/(K(K−1))] Σ_{i=1..K} Σ_{j≠i} θ_ij − [1/(K(K−1))] Σ_{i=1..K} Σ_{j≠i} ( θ_ij − [1/(K(K−1))] Σ_{p=1..K} Σ_{q≠p} θ_pq )²,
where θ_ij = arccos( |a_i·a_j| / (‖a_i‖ ‖a_j‖) )

• The first term is the mean of the pairwise angles, the second their variance
  • Mean: summarizes how different these vectors are from each other on the whole
  • Variance: encourages the vectors to spread out evenly
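Below is a hedged Python sketch of this score (mean minus variance of the pairwise non-obtuse angles); the function name and the toy vector sets are mine. A set of near-parallel components scores low, while orthonormal components score close to π/2.

```python
import numpy as np

def mutual_angular_regularizer(A):
    """Mean minus variance of the pairwise non-obtuse angles between the
    columns of A (one component per column), following the definition above."""
    K = A.shape[1]
    U = A / np.linalg.norm(A, axis=0)        # unit-length columns
    G = np.abs(U.T @ U)                      # |cos| of pairwise angles
    theta = np.arccos(np.clip(G, 0.0, 1.0))  # pairwise non-obtuse angles
    angles = theta[~np.eye(K, dtype=bool)]   # exclude the i == j entries
    return angles.mean() - angles.var()

A_similar = np.random.randn(10, 1) + 0.01 * np.random.randn(10, 5)  # nearly parallel columns
A_spread = np.linalg.qr(np.random.randn(10, 5))[0]                  # orthonormal columns
print(mutual_angular_regularizer(A_similar))  # small score: low diversity
print(mutual_angular_regularizer(A_spread))   # score near pi/2: high diversity
```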
LVM with Mutual Angular Regularization (MAR-LVM)

max_A  L(D; A) + λ Ω(A)

Ω(A) = [1/(K(K−1))] Σ_{i=1..K} Σ_{j≠i} θ_ij − [1/(K(K−1))] Σ_{i=1..K} Σ_{j≠i} ( θ_ij − [1/(K(K−1))] Σ_{p=1..K} Σ_{q≠p} θ_pq )²

θ_ij = arccos( |a_i·a_j| / (‖a_i‖ ‖a_j‖) )

where L(D; A) is the data objective (e.g., log-likelihood) of the LVM with components A, and λ controls the strength of the diversity regularization.
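As a sketch of how the regularizer plugs into learning (the function and its arguments below are placeholders of mine, not the authors' code), the MAR-LVM objective is simply the model's own data objective plus λ times the diversity score of its components:

```python
def mar_objective(A, data, log_likelihood, lam):
    """Generic MAR-LVM objective: the LVM's own data log-likelihood L(D; A)
    plus lam * Omega(A).  `log_likelihood` is a placeholder for the specific
    model's objective (LDA, GMM, sparse coding, ...); A holds one component per
    column; `mutual_angular_regularizer` is the score sketched earlier."""
    return log_likelihood(data, A) + lam * mutual_angular_regularizer(A)
```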
Algorithm
• Challenge: the mutual angular regularizer is non-smooth and non-convex w.r.t. the parameter vectors A = {a_i}_{i=1..K}
• Derive a smooth lower bound
  • The lower bound is easier to derive if the parameter vectors lie on a sphere
  • Decompose the parameter vectors into magnitudes and directions
• Proved that optimizing the lower bound with a gradient ascent method can increase the mutual angular regularizer in each iteration
Optimization
• Reparametrize: a_i = g_i ã_i with ‖ã_i‖ = 1 and g_i > 0, i.e. A = diag(g) Ã (g: magnitudes, Ã: unit directions)

max_{g, Ã}  L(D; diag(g) Ã) + λ Ω(Ã)
s.t.  ∀ i, ‖ã_i‖ = 1, g_i > 0

• Alternating optimization (a code skeleton follows):
  • Fix g, optimize Ã:  max_Ã  L(D; diag(g) Ã) + λ Ω(Ã)   s.t.  ∀ i, ‖ã_i‖ = 1
  • Fix Ã, optimize g:  max_g  L(D; diag(g) Ã)   s.t.  ∀ i, g_i > 0
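A hedged skeleton of this alternating scheme (all names and update rules are mine; the slides do not spell out the inner updates). Directions stay on the unit sphere by re-normalizing after each step, magnitudes stay positive by clipping, and `step_directions` / `step_magnitudes` stand in for whatever gradient updates the specific LVM uses.

```python
import numpy as np

def decompose(A):
    """Split each component (column of A) into a magnitude g_i > 0 and a unit direction."""
    g = np.linalg.norm(A, axis=0)
    return g, A / g

def alternating_optimization(A0, step_directions, step_magnitudes, n_iters=100):
    """Alternate between the two subproblems.  `step_directions(A_dir, g)` should
    return updated directions (e.g., a gradient step on the data term plus the
    smooth lower bound of the regularizer); `step_magnitudes(A_dir, g)` should
    return updated magnitudes (a step on the data term only, since the regularizer
    depends only on directions).  Both are placeholders for the model at hand."""
    g, A_dir = decompose(A0)
    for _ in range(n_iters):
        A_dir = step_directions(A_dir, g)                # fix g, update the directions
        A_dir = A_dir / np.linalg.norm(A_dir, axis=0)    # project back to the unit sphere
        g = np.maximum(step_magnitudes(A_dir, g), 1e-8)  # fix directions, keep g > 0
    return A_dir * g                                     # recombine into the components A
```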
Optimize Ã

max_Ã  L(D; diag(g) Ã) + λ Ω(Ã)   s.t.  ∀ i, ‖ã_i‖ = 1

• Lower bound:  Ω(Ã) ≥ Γ(Ã) = arcsin( det(Ã Ã⊤) ) − ( π/2 − arcsin( det(Ã Ã⊤) ) )²
• Intuition of the lower bound: det(Ã Ã⊤) measures the volume of the parallelepiped formed by the vectors in Ã (it equals the squared volume). The larger det(Ã Ã⊤) is, the more likely (though not surely) that the vectors in Ã have larger angles. Γ(Ã) is an increasing function of det(Ã Ã⊤), so a larger Γ(Ã) is likely to yield a larger Ω(Ã).
• Optimize the lower bound, which is smooth and much more amenable to optimization:

max_Ã  L(D; diag(g) Ã) + λ Γ(Ã)   s.t.  ∀ i, ‖ã_i‖ = 1
Close Alignment between the Regularizer and its Lower Bound
• If the lower bound is optimized with projected gradient ascent (PGA), the mutual angular regularizer can be increased in each iteration of the PGA procedure
  • Optimizing the lower bound with PGA can increase the mean of the angles in each iteration
  • Optimizing the lower bound with PGA can decrease the variance of the angles in each iteration

Ω(A) = [1/(K(K−1))] Σ_{i=1..K} Σ_{j≠i} θ_ij − [1/(K(K−1))] Σ_{i=1..K} Σ_{j≠i} ( θ_ij − [1/(K(K−1))] Σ_{p=1..K} Σ_{q≠p} θ_pq )²
       (first term: mean of the angles; second term: variance of the angles)
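The sketch below illustrates this behavior numerically. For simplicity it ascends log det of the Gram matrix of the unit directions rather than the exact Γ(Ã) above (a monotone surrogate of the same determinant, chosen by me, not the authors' bound), projecting back to the unit sphere after each step; the printed mean of the pairwise angles rises toward π/2 while their variance shrinks.

```python
import numpy as np

def pairwise_angles(U):
    """Pairwise non-obtuse angles between the unit-norm columns of U."""
    G = np.abs(U.T @ U)
    K = U.shape[1]
    return np.arccos(np.clip(G, 0.0, 1.0))[~np.eye(K, dtype=bool)]

rng = np.random.default_rng(0)
U = rng.normal(size=(8, 6))
U = U / np.linalg.norm(U, axis=0)                # random unit-norm directions

lr = 0.05
for t in range(201):
    if t % 50 == 0:
        ang = pairwise_angles(U)
        print(f"iter {t:3d}  mean angle {ang.mean():.3f}  angle variance {ang.var():.4f}")
    grad = 2.0 * U @ np.linalg.inv(U.T @ U)      # gradient of log det(U^T U) w.r.t. U
    U = U + lr * grad                            # gradient ascent step
    U = U / np.linalg.norm(U, axis=0)            # project each column back to the unit sphere
```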
Geometric Interpretation of the Close Alignment
• The gradient of the lower bound w.r.t. a_i is orthogonal to all the other vectors a_1, …, a_{i−1}, a_{i+1}, …, a_K
• Moving a_i along its gradient direction enlarges its angle with the other vectors
• Example: a_1, a_2, a_3 are parameter vectors; g_1 is the gradient w.r.t. a_1; g_1 is orthogonal to a_2 and a_3; letting â_1 = a_1 + g_1, the angle between â_1 and a_3 is greater than the angle between a_1 and a_3 (and likewise for a_2)
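A quick numerical check of this property (again using the determinant of the Gram matrix of unit-norm columns as a stand-in for the exact lower bound; all names are mine): the gradient with respect to one vector has zero inner product with every other vector, and moving along it increases the angle to the others.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(15, 5))
A = A / np.linalg.norm(A, axis=0)        # unit-norm parameter vectors a_1..a_5 (columns)

grad = 2.0 * A @ np.linalg.inv(A.T @ A)  # gradient of log det(A^T A) w.r.t. A

g1 = grad[:, 0]                          # gradient with respect to a_1
print(A[:, 1:].T @ g1)                   # ~ all zeros: g_1 is orthogonal to a_2, ..., a_5

a1_hat = A[:, 0] + 0.5 * g1              # move a_1 along its gradient direction
a1_hat = a1_hat / np.linalg.norm(a1_hat)
old_angle = np.arccos(abs(A[:, 1] @ A[:, 0]))
new_angle = np.arccos(abs(A[:, 1] @ a1_hat))
print(old_angle, new_angle)              # the angle between a_1 and a_2 increases
```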