Matrix Factorization (DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis, slide deck)



SLIDE 1

Matrix Factorization

DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis

http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html

Carlos Fernandez-Granda

SLIDE 2

Low-rank models Matrix completion Structured low-rank models

SLIDE 3

Motivation

The quantity y[i, j] depends on two indices i and j. We observe examples and want to predict new instances. In collaborative filtering, y[i, j] is the rating given to movie i by user j.

SLIDE 4

Collaborative filtering

Y :=

                         Bob  Molly  Mary  Larry
The Dark Knight           1     1     5     4
Spiderman 3               2     1     4     5
Love Actually             4     5     2     1
Bridget Jones’s Diary     5     4     2     1
Pretty Woman              4     5     1     2
Superman 2                1     2     5     5

SLIDE 5

Simple model

Assumptions:

◮ Some movies are more popular in general
◮ Some users are more generous in general

y[i, j] ≈ a[i] b[j]

◮ a[i] quantifies the popularity of movie i
◮ b[j] quantifies the generosity of user j

SLIDE 6

Simple model

Problem: Fitting a and b to the data yields a nonconvex problem

Example: 1 movie, 1 user, rating 1 yields the cost function (1 − ab)^2

To fix the scale, set |a| = 1

SLIDE 7

[Plot: the cost function (1 − ab)^2 as a surface over (a, b); it is nonconvex, and the slices a = −1 and a = +1 are marked.]

SLIDE 8

Rank-1 model

Assume all m movies are rated by all n users. The model becomes Y ≈ a b^T. We can fit it by solving

min_{a ∈ R^m, b ∈ R^n} ||Y − a b^T||_F   subject to ||a||_2 = 1

SLIDE 9

Rank-1 model

Assume all m movies are rated by all n users. The model becomes Y ≈ a b^T. We can fit it by solving

min_{a ∈ R^m, b ∈ R^n} ||Y − a b^T||_F   subject to ||a||_2 = 1

Equivalent to

min_{X ∈ R^{m×n}} ||Y − X||_F   subject to rank(X) = 1

SLIDE 10

Best rank-k approximation

Let U S V^T be the SVD of a matrix A ∈ R^{m×n}. The truncated SVD U_{:,1:k} S_{1:k,1:k} V_{:,1:k}^T is the best rank-k approximation:

U_{:,1:k} S_{1:k,1:k} V_{:,1:k}^T = arg min_{Ã : rank(Ã) = k} ||A − Ã||_F
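This Eckart-Young statement can be checked numerically; a minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k, :]

# A rank-2 matrix is recovered exactly by its rank-2 truncation
A = np.outer([1., 2., 3.], [1., 1., 1.]) + np.outer([0., 0., 1.], [1., 2., 3.])
assert np.allclose(best_rank_k(A, 2), A)
```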
SLIDE 11

Rank-1 model

σ_1 u_1 v_1^T = arg min_{X ∈ R^{m×n}} ||Y − X||_F   subject to rank(X) = 1

The solution to

min_{a ∈ R^m, b ∈ R^n} ||Y − a b^T||_F   subject to ||a||_2 = 1

is

a_min = ?
b_min = ?
SLIDE 12

Rank-1 model

σ_1 u_1 v_1^T = arg min_{X ∈ R^{m×n}} ||Y − X||_F   subject to rank(X) = 1

The solution to

min_{a ∈ R^m, b ∈ R^n} ||Y − a b^T||_F   subject to ||a||_2 = 1

is

a_min = u_1
b_min = σ_1 v_1
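This characterization can be sanity-checked numerically; a small sketch comparing the closed-form solution with the rank-1 truncation (the toy matrix is illustrative):

```python
import numpy as np

Y = np.array([[1., 2., 0.],
              [2., 4., 1.],
              [0., 1., 3.]])

U, s, Vt = np.linalg.svd(Y)
a_min, b_min = U[:, 0], s[0] * Vt[0]     # a_min = u_1, b_min = sigma_1 * v_1

# The product a_min b_min^T equals the best rank-1 approximation of Y
best_rank1 = U[:, :1] * s[:1] @ Vt[:1]
assert np.allclose(np.outer(a_min, b_min), best_rank1)
assert abs(np.linalg.norm(a_min) - 1.) < 1e-12   # the scale constraint holds
```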

SLIDE 13

Rank-r model

Certain people like certain movies: r factors

y[i, j] ≈ Σ_{l=1}^{r} a_l[i] b_l[j]

For each factor l:

◮ a_l[i]: movie i is positively (> 0), negatively (< 0) or not (≈ 0) associated to factor l
◮ b_l[j]: user j likes (> 0), hates (< 0) or is indifferent (≈ 0) to factor l

SLIDE 14

Rank-r model

Equivalent to Y ≈ A B, with A ∈ R^{m×r} and B ∈ R^{r×n}. The SVD solves

min_{A ∈ R^{m×r}, B ∈ R^{r×n}} ||Y − A B||_F   subject to ||a_1||_2 = 1, …, ||a_r||_2 = 1

Problem: There are many possible ways of choosing a_1, …, a_r and b_1, …, b_r. The SVD constrains them to be orthogonal.

SLIDE 15

Collaborative filtering

Y :=

                         Bob  Molly  Mary  Larry
The Dark Knight           1     1     5     4
Spiderman 3               2     1     4     5
Love Actually             4     5     2     1
Bridget Jones’s Diary     5     4     2     1
Pretty Woman              4     5     1     2
Superman 2                1     2     5     5

SLIDE 16

SVD

A − μ 1 1^T = U S V^T = U diag(7.79, 1.62, 1.55, 0.62) V^T

where

μ := (1 / (m n)) Σ_{i=1}^{m} Σ_{j=1}^{n} A_{ij}
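These numbers can be reproduced directly from the ratings table; a small numpy check (the singular values should match the diagonal above up to rounding):

```python
import numpy as np

# Ratings matrix from the earlier slide (rows: movies; columns: Bob, Molly, Mary, Larry)
A = np.array([[1, 1, 5, 4],
              [2, 1, 4, 5],
              [4, 5, 2, 1],
              [5, 4, 2, 1],
              [4, 5, 1, 2],
              [1, 2, 5, 5]], dtype=float)

mu = A.mean()                             # average over all m*n entries
U, s, Vt = np.linalg.svd(A - mu, full_matrices=False)
print(np.round(mu, 2), np.round(s, 2))    # mean 3.0; top singular value about 7.79
```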

SLIDE 17

Rank 1 model

Ā + σ_1 u_1 v_1^T =   (where Ā := μ 1 1^T; the original ratings are in parentheses)

                         Bob       Molly     Mary      Larry
The Dark Knight        1.34 (1)  1.19 (1)  4.66 (5)  4.81 (4)
Spiderman 3            1.55 (2)  1.42 (1)  4.45 (4)  4.58 (5)
Love Actually          4.45 (4)  4.58 (5)  1.55 (2)  1.42 (1)
B.J.’s Diary           4.43 (5)  4.56 (4)  1.57 (2)  1.44 (1)
Pretty Woman           4.43 (4)  4.56 (5)  1.57 (1)  1.44 (2)
Superman 2             1.34 (1)  1.19 (2)  4.66 (5)  4.81 (5)

SLIDE 18

Movies

        D. Knight  Sp. 3  Love Act.  B.J.’s Diary  P. Woman  Sup. 2
a_1 = (  −0.45    −0.39     0.39        0.39         0.39    −0.45 )

The coefficients cluster the movies into action (−) and romantic (+)

SLIDE 19

Users

         Bob   Molly   Mary    Larry
b_1 = (  3.74  4.05   −3.74   −4.05 )

The coefficients cluster the people into action (−) and romantic (+)

SLIDE 20

Low-rank models Matrix completion Structured low-rank models

SLIDE 21

Netflix Prize

[Image: a large ratings matrix with mostly unobserved (?) entries]

SLIDE 22

Matrix completion

                         Bob  Molly  Mary  Larry
The Dark Knight           1     ?     5     4
Spiderman 3               ?     1     4     5
Love Actually             4     5     2     ?
Bridget Jones’s Diary     5     4     2     1
Pretty Woman              4     5     1     2
Superman 2                1     2     ?     5

SLIDE 23

Matrix completion as an inverse problem

Example:

Y = [ 1  ?  5 ]
    [ ?  3  2 ]

For a fixed sampling pattern, we obtain an underdetermined system of equations:

[ 1 0 0 0 0 0 ] [ Y11 ]   [ 1 ]
[ 0 0 0 1 0 0 ] [ Y21 ]   [ 3 ]
[ 0 0 0 0 1 0 ] [ Y12 ] = [ 5 ]
[ 0 0 0 0 0 1 ] [ Y22 ]   [ 2 ]
                [ Y13 ]
                [ Y23 ]

SLIDE 24

Isn’t this completely ill posed?

Assumption: the matrix is low rank, so it depends on ≈ r (m + n) parameters. As long as data > parameters, recovery is possible (in principle):

[ 1 1 1 1 ? 1 ]
[ 1 1 1 1 1 1 ]
[ 1 1 1 1 1 1 ]
[ ? 1 1 1 1 1 ]

SLIDE 25

Matrix cannot be sparse

If the matrix is sparse, e.g. zero except for a single entry,

[ 0  0  0  0 ]
[ 0  0 23  0 ]
[ 0  0  0  0 ]

the nonzero entry is very unlikely to be observed, so the matrix cannot be recovered.

SLIDE 26

Singular vectors cannot be sparse

[ 1 ]               [ 0 ]               [ 1 1 1 1 ]
[ 1 ] ( 1 1 1 1 ) + [ 0 ] ( 1 2 3 4 ) = [ 1 1 1 1 ]
[ 1 ]               [ 0 ]               [ 1 1 1 1 ]
[ 1 ]               [ 1 ]               [ 2 3 4 5 ]

SLIDE 27

Incoherence

The matrix must be incoherent: its singular vectors must be spread out. For 1/√n ≤ μ ≤ 1,

max_{1 ≤ i ≤ r, 1 ≤ j ≤ m} |U_ij| ≤ μ
max_{1 ≤ i ≤ r, 1 ≤ j ≤ n} |V_ij| ≤ μ

for the left singular vectors U_1, …, U_r and right singular vectors V_1, …, V_r

SLIDE 28

Measurements

We must see at least one entry in each row/column:

[ 1 1 1 1 ]   [ 1 ]
[ ? ? ? ? ] = [ ? ] ( 1 1 1 1 )
[ 1 1 1 1 ]   [ 1 ]
[ 1 1 1 1 ]   [ 1 ]

Assumption: random sampling (usually does not hold in practice!)
SLIDE 29

Low-rank matrix estimation

First idea:

min_{X ∈ R^{m×n}} rank(X)   such that X_Ω = y

Ω: indices of revealed entries; y: revealed entries

Computationally intractable (the rank is nonconvex)

Tractable alternative:

min_{X ∈ R^{m×n}} ||X||_*   such that X_Ω = y

SLIDE 30

Exact recovery

Guarantees by Gross 2011, Candès and Recht 2008, Candès and Tao 2009:

min_{X ∈ R^{m×n}} ||X||_*   such that X_Ω = y

achieves exact recovery with high probability as long as the number of samples is proportional to r (n + m) up to log terms. The proof is based on the construction of a dual certificate.

SLIDE 31

Low-rank matrix estimation

If the data are noisy, solve instead

min_{X ∈ R^{m×n}} ||X_Ω − y||_2^2 + λ ||X||_*

where λ > 0 is a regularization parameter

SLIDE 32

Matrix completion via nuclear-norm minimization

                         Bob    Molly  Mary   Larry
The Dark Knight           1     2 (1)    5      4
Spiderman 3             2 (2)     1      4      5
Love Actually             4       5      2    2 (1)
Bridget Jones’s Diary     5       4      2      1
Pretty Woman              4       5      1      2
Superman 2                1       2    5 (5)    5

(estimated entries shown with the held-out true ratings in parentheses)

SLIDE 33

Proximal gradient method

Method to solve the optimization problem

minimize f(x) + h(x)

where f is differentiable and prox_h is tractable

Proximal-gradient iteration:

x^(0) = arbitrary initialization
x^(k+1) = prox_{α_k h}( x^(k) − α_k ∇f(x^(k)) )
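A sketch of this iteration in Python. The method on the slide is generic; here it is specialized, as an illustration only, to f(x) = ½||Ax − b||² with h = λ||·||_1, whose prox is soft-thresholding:

```python
import numpy as np

def prox_gradient(grad_f, prox_h, x0, alpha, iters=500):
    """Iterate x <- prox_{alpha h}(x - alpha * grad_f(x))."""
    x = x0
    for _ in range(iters):
        x = prox_h(x - alpha * grad_f(x), alpha)
    return x

# f(x) = 0.5||Ax - b||^2, h(x) = lam*||x||_1 (prox = soft-thresholding)
A = np.array([[1., 0.], [0., 2.]])
b = np.array([1., 0.1])
lam = 0.5
grad_f = lambda x: A.T @ (A @ x - b)
soft = lambda z, a: np.sign(z) * np.maximum(np.abs(z) - a * lam, 0.)
x = prox_gradient(grad_f, soft, np.zeros(2), alpha=0.2)
print(x)   # converges to [0.5, 0.0]
```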
SLIDE 34

Proximal operator of nuclear norm

The solution X to

min_{X ∈ R^{m×n}} (1/2) ||Y − X||_F^2 + τ ||X||_*

is obtained by soft-thresholding the SVD of Y:

X_prox = D_τ(Y)

D_τ(M) := U S_τ(S) V^T, where M = U S V^T

S_τ(S)_ii := S_ii − τ if S_ii > τ, 0 otherwise
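The operator D_τ can be implemented in a few lines; a numpy sketch:

```python
import numpy as np

def svt(Y, tau):
    """Proximal operator of tau*||.||_*: soft-threshold the singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U * np.maximum(s - tau, 0.) @ Vt

# Each singular value is shrunk by tau; those below tau are set to zero
Y = np.diag([3., 1., 0.2])
print(svt(Y, 0.5))   # diag(2.5, 0.5, 0.0)
```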
SLIDE 35

Subdifferential of the nuclear norm

Let X ∈ R^{m×n} be a rank-r matrix with SVD U S V^T, where U ∈ R^{m×r}, V ∈ R^{n×r} and S ∈ R^{r×r}. A matrix G is a subgradient of the nuclear norm at X if and only if

G := U V^T + W

where W satisfies

||W|| ≤ 1,  U^T W = 0,  W V = 0

SLIDE 36

Proximal operator of nuclear norm

The subgradients of (1/2) ||Y − X||_F^2 + τ ||X||_* are of the form

X − Y + τ G

where G is a subgradient of the nuclear norm at X. Hence D_τ(Y) is a minimizer if and only if G = (1/τ)(Y − D_τ(Y)) is a subgradient of the nuclear norm at D_τ(Y).

SLIDE 37

Proximal operator of nuclear norm

Separate the SVD of Y into singular values greater or smaller than τ:

Y = U S V^T = [ U_0  U_1 ] [ S_0  0 ; 0  S_1 ] [ V_0  V_1 ]^T

Then D_τ(Y) = U_0 (S_0 − τ I) V_0^T, so

(1/τ) (Y − D_τ(Y)) = U_0 V_0^T + (1/τ) U_1 S_1 V_1^T

SLIDE 38

Proximal gradient method

Proximal gradient method for the problem

min_{X ∈ R^{m×n}} ||X_Ω − y||_2^2 + λ ||X||_*

X^(0) = arbitrary initialization
M^(k) = X^(k) − α_k (X^(k)_Ω − y)
X^(k+1) = D_{α_k λ}(M^(k))
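Combining the gradient step with D gives the following sketch (the step size, λ, and the toy example are illustrative, not from the slides):

```python
import numpy as np

def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U * np.maximum(s - tau, 0.) @ Vt

def complete(Y_obs, mask, lam=0.1, alpha=0.5, iters=300):
    """Proximal gradient for min_X ||X_Omega - y||_2^2 + lam*||X||_*."""
    X = np.zeros_like(Y_obs)
    for _ in range(iters):
        grad = 2 * mask * (X - Y_obs)       # gradient of the data-fit term
        X = svt(X - alpha * grad, alpha * lam)
    return X

# Rank-1 toy matrix with two hidden entries; the iterate matches the observed
# entries closely and fills in the rest with a low-rank estimate
Y = np.outer([1., 2., 3.], [1., 1., 2.])
mask = np.ones_like(Y, dtype=bool)
mask[0, 2] = mask[2, 0] = False
X = complete(np.where(mask, Y, 0.), mask)
```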
SLIDE 39

Real data

◮ MovieLens dataset
◮ 671 users
◮ 300 movies
◮ Training set: 9 135 ratings
◮ Test set: 1 016 ratings

SLIDE 40

Real data

[Plot: average absolute rating error on the training and test sets as a function of the regularization parameter λ, for λ ranging from 10^−2 to 10^4.]

SLIDE 41

Low-rank matrix completion

Intractable problem:

min_{X ∈ R^{m×n}} rank(X)   such that X_Ω ≈ y

Nuclear norm: convex, but computationally expensive due to SVD computations

SLIDE 42

Alternative

◮ Fix the rank r beforehand
◮ Parametrize the matrix as A B, where A ∈ R^{m×r} and B ∈ R^{r×n}
◮ Solve

min_{A ∈ R^{m×r}, B ∈ R^{r×n}} ||(A B)_Ω − y||_2

by alternating minimization

SLIDE 43

Alternating minimization

Sequence of least-squares problems (much faster than computing SVDs)

◮ To compute A^(k), fix B^(k−1) and solve

min_{A ∈ R^{m×r}} ||(A B^(k−1))_Ω − y||_2

◮ To compute B^(k), fix A^(k) and solve

min_{B ∈ R^{r×n}} ||(A^(k) B)_Ω − y||_2

Theoretical guarantees: Jain, Netrapalli, Sanghavi 2013
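A sketch of these alternating least-squares updates with numpy (the rank, iteration count and toy data are illustrative):

```python
import numpy as np

def alt_min(Y_obs, mask, r=1, iters=50, seed=0):
    """Fit Y ~ A @ B on the observed entries by alternating least squares."""
    rng = np.random.default_rng(seed)
    m, n = Y_obs.shape
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((r, n))
    for _ in range(iters):
        for i in range(m):        # row i of A: least squares on observed row entries
            cols = mask[i]
            A[i] = np.linalg.lstsq(B[:, cols].T, Y_obs[i, cols], rcond=None)[0]
        for j in range(n):        # column j of B: least squares on observed column entries
            rows = mask[:, j]
            B[:, j] = np.linalg.lstsq(A[rows], Y_obs[rows, j], rcond=None)[0]
    return A, B

# Rank-1 toy matrix with one hidden entry
Y = np.outer([1., 2., 3.], [1., 1., 2.])
mask = np.ones_like(Y, dtype=bool)
mask[1, 2] = False
A, B = alt_min(np.where(mask, Y, 0.), mask)
```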

SLIDE 44

Low-rank models Matrix completion Structured low-rank models

SLIDE 45

Nonnegative matrix factorization

Nonnegative atoms/coefficients can make results easier to interpret:

X ≈ A B,  A_{i,j} ≥ 0, B_{i,j} ≥ 0, for all i, j

Nonconvex optimization problem:

minimize ||X − Ã B̃||_F^2
subject to Ã_{i,j} ≥ 0, B̃_{i,j} ≥ 0, for all i, j

where Ã ∈ R^{m×r} and B̃ ∈ R^{r×n}
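One common way to attack this nonconvex problem numerically is Lee and Seung's multiplicative updates; a minimal sketch (this is just one of several NMF algorithms, and the random initialization is illustrative):

```python
import numpy as np

def nmf(X, r, iters=500, seed=0):
    """Multiplicative updates for X ~ A @ B with A, B >= 0 (Lee-Seung)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    A = rng.random((m, r)) + 0.1          # positive initialization
    B = rng.random((r, n)) + 0.1
    for _ in range(iters):
        B *= (A.T @ X) / (A.T @ A @ B + 1e-12)   # updates keep entries nonnegative
        A *= (X @ B.T) / (A @ B @ B.T + 1e-12)
    return A, B

# Nonnegative rank-1 example: the factors stay nonnegative by construction
X = np.outer([1., 2., 4.], [1., 3., 1.])
A, B = nmf(X, r=1)
```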

SLIDE 46

Faces dataset: PCA

SLIDE 47

Faces dataset: NMF

SLIDE 48

Topic modeling

A := matrix of word counts over the articles a–f

Columns: singer, GDP, senate, election, vote, stock, bass, market, band

a: 6 1 1 1 9 8
b: 1 9 5 8 1 1
c: 8 1 1 9 1 7
d: 7 1 9 1 7
e: 5 6 7 5 6 7 2
f: 1 8 5 9 2 1

SLIDE 49

SVD

A = U S V^T = U diag(23.64, 18.82, 14.23, 3.63, 2.03, 1.36) V^T

SLIDE 50

Left singular vectors

          a      b      c      d      e      f
U1 = ( −0.24  −0.47  −0.24  −0.32  −0.58  −0.47 )
U2 = (  0.64  −0.23   0.67  −0.03  −0.18  −0.21 )
U3 = ( −0.08  −0.39  −0.08   0.77   0.28  −0.40 )

SLIDE 51

Right singular vectors

       singer   GDP   senate  election  vote   stock  bass   market  band
V1 = ( −0.18  −0.24  −0.51   −0.38    −0.46  −0.34  −0.2   −0.3   −0.22 )
V2 = (  0.47   0.01  −0.22   −0.15    −0.25  −0.07   0.63  −0.05   0.49 )
V3 = ( −0.13   0.47  −0.3    −0.14    −0.37   0.52  −0.04   0.49  −0.07 )

SLIDE 52

Nonnegative matrix factorization

X ≈ W H,  W_{i,j} ≥ 0, H_{i,j} ≥ 0, for all i, j

SLIDE 53

Right nonnegative factors

Words: singer, GDP, senate, election, vote, stock, bass, market, band

H1 = ( 0.34 3.73 2.54 3.67 0.52 0.35 0.35 )
H2 = ( 2.21 0.21 0.45 2.64 0.21 2.43 0.22 )
H3 = ( 3.22 0.37 0.19 0.2 0.12 4.13 0.13 3.43 )

Interpretations:

◮ Count atoms: the counts for each document are a weighted sum of H1, H2, H3
◮ Coefficients: they cluster the words into politics, music and economics

SLIDE 54

Left nonnegative factors

Articles: a, b, c, d, e, f

W1 = ( 0.03 2.23 1.59 2.24 )
W2 = ( 0.1 0.08 3.13 2.32 )
W3 = ( 2.13 2.22 0.03 )

Interpretations:

◮ Count atoms: the counts for each word are a weighted sum of W1, W2, W3
◮ Coefficients: they cluster the docs into politics, music and economics

SLIDE 55

Sparse PCA

Sparse atoms can make results easier to interpret:

X ≈ A B, with A sparse

Nonconvex optimization problem:

minimize ||X − Ã B̃||_F^2 + λ Σ_{i=1}^{k} ||Ã_i||_1
subject to ||Ã_i||_2 = 1,  1 ≤ i ≤ k

where Ã ∈ R^{m×k} and B̃ ∈ R^{k×n}

SLIDE 56

Faces dataset