Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

Rainer Gemulla
with Peter J. Haas, Yannis Sismanis, Christina Teflioudi, and Faraz Makari

December 17, 2011
Outline

◮ Matrix Factorization
◮ Stochastic Gradient Descent
◮ Distributed SGD with MapReduce
◮ Experiments
◮ Summary
Matrix completion visualized

[Figures: original image; partially observed image; reconstructed image]
Matrix completion for recommender systems

◮ Discover latent factors (r = 1). Factor values in parentheses; predicted ratings next to the known entries:

                      Avatar (2.24)   The Matrix (1.92)   Up (1.18)
    Alice   (1.98)       ?  (4.4)          4  (3.8)        2  (2.3)
    Bob     (1.21)       3  (2.7)          2  (2.3)        ?  (1.4)
    Charlie (2.30)       5  (5.2)          ?  (4.4)        3  (2.7)

◮ Minimize the loss over the observed entries Z:

    \min_{W,H} \sum_{(i,j) \in Z} (V_{ij} - [WH]_{ij})^2

◮ With bias and regularization:

    \min_{W,H,u,m} \sum_{(i,j) \in Z} (V_{ij} - \mu - u_i - m_j - [WH]_{ij})^2
        + \lambda (\|W\|^2 + \|H\|^2 + \|u\|^2 + \|m\|^2)

◮ With bias, regularization, time, ...:

    \min_{W,H,u,m} \sum_{(i,j,t) \in Z_t} (V_{ij} - \mu - u_i(t) - m_j(t) - [W(t)H]_{ij})^2
        + \lambda (\|W(t)\|^2 + \|H\|^2 + \|u(t)\|^2 + \|m(t)\|^2)
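As a quick sanity check of the rank-1 example above: the predicted ratings are just the product of the user factor and the movie factor, i.e., the outer product of the two factor vectors. A minimal NumPy sketch, with the factor values taken from the table:

```python
import numpy as np

# Rank-1 factors from the table above (r = 1: one number per user/movie).
w = np.array([1.98, 1.21, 2.30])   # Alice, Bob, Charlie
h = np.array([2.24, 1.92, 1.18])   # Avatar, The Matrix, Up

# Predictions [WH]_ij = w_i * h_j; for r = 1 this is the outer product.
print(np.round(np.outer(w, h), 1))
# [[4.4 3.8 2.3]
#  [2.7 2.3 1.4]
#  [5.2 4.4 2.7]]
```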
Generalized Matrix Factorization

◮ A general machine learning problem
  ◮ Recommender systems, text indexing, face recognition, ...
◮ Training data
  ◮ V: m × n input matrix (e.g., rating matrix)
  ◮ Z: training set of indexes in V (e.g., subset of known ratings)
◮ Parameter space
  ◮ W: row factors (e.g., m × r latent customer factors)
  ◮ H: column factors (e.g., r × n latent movie factors)
◮ Model
  ◮ L_{ij}(W_{i*}, H_{*j}): loss at element (i, j)
  ◮ Includes prediction error, regularization, auxiliary information, ...
  ◮ Constraints (e.g., non-negativity)
◮ Find the best model:

    \operatorname{argmin}_{W,H} \sum_{(i,j) \in Z} L_{ij}(W_{i*}, H_{*j})
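Since the framework is parameterized only by the element-local loss, one concrete choice makes it tangible. A minimal sketch, assuming squared error plus L2 regularization of the two factor vectors involved (the function name and the weighting are illustrative, not the talk's exact formulation):

```python
import numpy as np

def loss_ij(v_ij, w_i, h_j, lam=0.05):
    """Element-local loss L_ij(W_i*, H_*j): squared prediction error
    at (i, j) plus L2 regularization of the factors it touches."""
    err = v_ij - w_i @ h_j          # prediction error V_ij - [WH]_ij
    return err ** 2 + lam * (w_i @ w_i + h_j @ h_j)
```

The total loss is then the sum of loss_ij over the training set Z, exactly the argmin objective above.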
Successful Applications

◮ Movie recommendation (Netflix, competition papers)
  ◮ >12M users, >20k movies, 2.4B ratings (projected)
  ◮ 36GB data, 9.2GB model (projected)
  ◮ Latent factor model
◮ Website recommendation (Microsoft, WWW10)
  ◮ 51M users, 15M URLs, 1.2B clicks
  ◮ 17.8GB data, 161GB metadata, 49GB model
  ◮ Gaussian non-negative matrix factorization
◮ News personalization (Google, WWW07)
  ◮ Millions of users, millions of stories, ? clicks
  ◮ Probabilistic latent semantic indexing

Distributed processing is necessary!
◮ Big data
◮ Large models
◮ Expensive computations
Stochastic Gradient Descent

◮ Find a minimum θ* of the function L
◮ Pick a starting point θ_0
◮ Approximate the gradient: \hat{L}'(θ_0)
◮ Move "approximately" downhill
◮ Stochastic difference equation:

    \theta_{n+1} = \theta_n - \epsilon_n \hat{L}'(\theta_n)

◮ Under certain conditions, asymptotically approximates (continuous) gradient descent

[Figures: contour plot of L over [-1, 1]^2 with the stochastic descent path; step function q(t) approaching the optimum q*]
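A minimal illustration of the update rule above, assuming a simple quadratic L and gradient estimates corrupted by noise (all names and constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_hat(theta):
    """Noisy gradient estimate of L(theta) = ||theta||^2 / 2."""
    return theta + rng.normal(scale=0.1, size=theta.shape)

theta = np.array([1.0, -0.5])            # starting point theta_0
for n in range(1000):
    eps = 0.1 / (1 + 0.01 * n)           # decreasing step sizes eps_n
    theta = theta - eps * grad_hat(theta)
print(theta)                             # close to the minimizer at (0, 0)
```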
Stochastic Gradient Descent for Matrix Factorization

◮ Set θ = (W, H) and use

    L(\theta) = \sum_{(i,j) \in Z} L_{ij}(W_{i*}, H_{*j})
    L'(\theta) = \sum_{(i,j) \in Z} L'_{ij}(W_{i*}, H_{*j})
    \hat{L}'(\theta, z) = N \, L'_{i_z j_z}(W_{i_z *}, H_{* j_z}),  where N = |Z|

◮ SGD epoch
  1. Pick a random entry z ∈ Z
  2. Compute the approximate gradient \hat{L}'(\theta, z)
  3. Update the parameters: \theta_{n+1} = \theta_n - \epsilon_n \hat{L}'(\theta_n, z)
  4. Repeat N times
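A compact sketch of one such epoch for the plain squared loss L_ij = (V_ij − [WH]_ij)²; a sketch only, with regularization and bias terms omitted and the scaling factor N absorbed into the step size, as is common:

```python
import numpy as np

def sgd_epoch(V, Z, W, H, eps):
    """One SGD epoch: N = |Z| random steps, each reading and updating
    only the row W[i] and the column H[:, j] of the chosen entry."""
    N = len(Z)
    for idx in np.random.randint(N, size=N):
        i, j = Z[idx]
        err = V[i, j] - W[i] @ H[:, j]    # prediction error at (i, j)
        w_i = W[i].copy()                 # read both factors before updating
        W[i]    += eps * 2 * err * H[:, j]
        H[:, j] += eps * 2 * err * w_i
    return W, H
```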
Stochastic Gradient Descent on Netflix Data

[Figure: mean loss vs. epoch (1-60) on Netflix data for LBFGS, SGD, and ALS]
Averaging Techniques

◮ SGD steps depend on each other:

    \theta_{n+1} = \theta_n - \epsilon_n \hat{L}'(\theta_n)

  How to distribute?

◮ Parameter mixing (MSGD)
  ◮ Map: run independent instances of SGD on subsets of the data (until convergence)
  ◮ Reduce: average the results
  ◮ Does not converge to the correct solution!

◮ Iterative parameter mixing (ISGD)
  ◮ Map: run independent instances of SGD on subsets of the data (for some time)
  ◮ Reduce: average the results
  ◮ Repeat
  ◮ Converges slowly!

[Figure: mean loss vs. epoch (1-60) on Netflix data for LBFGS, SGD, ALS, MSGD, and ISGD]
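A single-machine simulation of iterative parameter mixing, reusing the sgd_epoch sketch above (the worker count and round count are illustrative):

```python
import numpy as np

def isgd(V, Z, W, H, eps, workers=4, rounds=50):
    """ISGD: each map task runs SGD on its own subset of Z, starting
    from the shared parameters; reduce averages the results; repeat."""
    parts = np.array_split(np.random.permutation(len(Z)), workers)
    for _ in range(rounds):
        results = [sgd_epoch(V, [Z[k] for k in part], W.copy(), H.copy(), eps)
                   for part in parts]                  # map: parallel on a cluster
        W = np.mean([w for w, _ in results], axis=0)   # reduce: average W
        H = np.mean([h for _, h in results], axis=0)   # reduce: average H
    return W, H
```

MSGD is the special case with a single round run to convergence; as noted above, it does not converge to the correct solution.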
Problem Structure

◮ SGD steps depend on each other:

    \theta_{n+1} = \theta_n - \epsilon_n \hat{L}'(\theta_n)

◮ An SGD step on example z ∈ Z ...
  1. Reads W_{i_z *} and H_{* j_z}
  2. Performs the gradient computation L'_{i_z j_z}(W_{i_z *}, H_{* j_z})
  3. Updates W_{i_z *} and H_{* j_z}
◮ Not all steps are dependent: the next example z_{n+1} conflicts with z_n only if it touches the same row of W or the same column of H
Interchangeability

◮ Two elements z_1, z_2 ∈ Z are interchangeable if they share neither row nor column
◮ When z_n and z_{n+1} are interchangeable, the SGD steps

    \theta_{n+2} = \theta_n - \epsilon \hat{L}'(\theta_n, z_n) - \epsilon \hat{L}'(\theta_{n+1}, z_{n+1})
                 = \theta_n - \epsilon \hat{L}'(\theta_n, z_n) - \epsilon \hat{L}'(\theta_n, z_{n+1})

  become parallelizable!
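The condition is easy to state in code; a two-line sketch:

```python
def interchangeable(z1, z2):
    """Entries z1 = (i1, j1) and z2 = (i2, j2) can be processed in either
    order (or in parallel): they touch distinct rows of W and columns of H."""
    (i1, j1), (i2, j2) = z1, z2
    return i1 != i2 and j1 != j2
```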
Exploitation

◮ Block and distribute the input matrix V: with 3 nodes, V is split into a 3×3 grid of blocks V_11, ..., V_33, and node k holds the factor blocks W_k and H_k
◮ High-level approach (Map only)
  1. Pick a "diagonal": a set of blocks that share no rows or columns, e.g., {V_11, V_22, V_33}, then {V_12, V_23, V_31}, ...
  2. Run SGD on the diagonal (in parallel)
  3. Merge the results
  4. Move on to the next "diagonal"
◮ Steps 1-3 form a cycle
◮ Step 2 simulates sequential SGD:
  ◮ Blocks on a diagonal are interchangeable
  ◮ Throw dice for how many SGD iterations to run in each block
  ◮ Throw dice for which step sizes to use in each block
◮ An instance of "stratified SGD"; provably correct (see the sketch below)
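A single-machine sketch of one DSGD cycle, reusing sgd_epoch from above. Assumptions: a square d×d blocking, with Zblocks[(a, b)] holding the training entries that fall into block (a, b). On a real cluster the inner loop runs as d parallel map tasks, since the blocks of a diagonal touch disjoint rows of W and columns of H:

```python
def dsgd_cycle(V, Zblocks, W, H, eps, d=3):
    """One DSGD cycle: visit d 'diagonals' (strata); within a stratum,
    the d blocks are interchangeable, so their SGD runs can be merged
    as if they had been executed sequentially."""
    for s in range(d):                    # stratum s: blocks {(a, (a+s) % d)}
        for a in range(d):                # runs in parallel on a real cluster
            b = (a + s) % d
            W, H = sgd_epoch(V, Zblocks[(a, b)], W, H, eps)
    return W, H
```

The per-block randomization of iteration counts and step sizes mentioned on the slide is omitted here for brevity.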
How does it work?

◮ Toy example: a stratified loss L = 0.3 L_1 + 0.7 L_2; SGD alternates between steps on stratum L_1 and steps on stratum L_2

[Figures: contour plots of L_1, L_2, and of L with the iterate shown after cycles 0-6 and after cycle 100; the iterate approaches the minimizer of L]
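A toy reconstruction of this experiment; the talk does not give the exact L_1 and L_2, so two simple quadratics are assumed here, and the stratum weights 0.3 and 0.7 are applied explicitly in the updates:

```python
import numpy as np

# Hypothetical strata losses L_s(theta) = ||theta - c_s||^2 (assumed).
c1, c2 = np.array([0.4, 0.2]), np.array([-0.3, -0.1])

theta = np.array([0.6, -0.6])
for n in range(100):                             # one cycle = one step per stratum
    eps = 0.1 / (1 + 0.05 * n)                   # decreasing step sizes
    theta = theta - eps * 0.3 * 2 * (theta - c1) # step on stratum L1
    theta = theta - eps * 0.7 * 2 * (theta - c2) # step on stratum L2

print(theta)                 # approaches the minimizer of L ...
print(0.3 * c1 + 0.7 * c2)   # ... the weighted mean (-0.09, -0.01)
```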
Netflix, NZSL+L2
480k × 18k, 99M nonzeros, rank 50, training loss

[Figure: training loss (×10^9) vs. elapsed time, 0-350 minutes, for SGD and DSGD 1×8, 2×8, 4×8; zoomed view of the first 5-15 minutes]

DSGD converges significantly faster than SGD.

[Figure: the same setup including PSGD, plotted as loss vs. epoch (1-50)]

Stratification is (currently) not for free.
KDD Cup, NZSL+NZL2+Bias
1M × 625k, 253M nonzeros, rank 60, training loss

[Figure: training loss (×10^11) vs. time, 10-60 minutes, for SGD and DSGD 1×7, 2×7, 4×7]
Synthetic data, NZSL+L2
10M × 1M, 1B nonzeros, rank 50, training loss

[Figure: training loss (×10^9, log scale) vs. time, 50-200 minutes, for DALS 8×7, DSGD 8×7, and PSGD 1×7]
Summary

◮ Matrix factorization
  ◮ Widely applicable via customized loss functions
  ◮ Large instances (millions × millions with billions of entries)
◮ Distributed Stochastic Gradient Descent
  ◮ Simple and versatile
  ◮ Avoids averaging via a novel "stratified SGD" variant
  ◮ Achieves fully distributed data/model and fully distributed processing
  ◮ Competitive with alternative algorithms; fast and scalable
◮ Future directions
  ◮ Improved stratification
  ◮ Simultaneous computation & communication
  ◮ Stratification for other models
  ◮ ...
Thank you!