  1. Matrix Factorization
DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html
Carlos Fernandez-Granda

  2. Outline: Low-rank models · Matrix completion · Structured low-rank models

  3. Motivation
A quantity y[i, j] depends on indices i and j. We observe some examples and want to predict new instances. In collaborative filtering, y[i, j] is the rating given to movie i by user j.

  4. Collaborative filtering
The ratings matrix Y (rows: movies, columns: users):

                          Bob  Molly  Mary  Larry
  The Dark Knight          1     1     5     4
  Spiderman 3              2     1     4     5
  Love Actually            4     5     2     1
  Bridget Jones's Diary    5     4     2     1
  Pretty Woman             4     5     1     2
  Superman 2               1     2     5     5

  5. Simple model
Assumptions:
◮ Some movies are more popular in general
◮ Some users are more generous in general
Model: y[i, j] ≈ a[i] b[j]
◮ a[i] quantifies the popularity of movie i
◮ b[j] quantifies the generosity of user j

  6. Simple model
Problem: fitting a and b to the data yields a nonconvex problem.
Example: 1 movie, 1 user, rating 1 yields the cost function (1 − ab)².
To fix the scale, set |a| = 1.

  7. [Plot of the cost function (1 − ab)² as a function of b, for a = −1 and a = +1]

  9. Rank-1 model
Assume m movies are all rated by n users. The model becomes Y ≈ a b^T.
We can fit it by solving

  min_{a ∈ R^m, b ∈ R^n} ||Y − a b^T||_F   subject to ||a||_2 = 1

Equivalent to

  min_{X ∈ R^{m×n}} ||Y − X||_F   subject to rank(X) = 1

  10. Best rank-k approximation
Let USV^T be the SVD of a matrix A ∈ R^{m×n}. The truncated SVD U_{:,1:k} S_{1:k,1:k} V^T_{:,1:k} is the best rank-k approximation:

  U_{:,1:k} S_{1:k,1:k} V^T_{:,1:k} = arg min_{Ã : rank(Ã) = k} ||A − Ã||_F
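As a sketch of this fact (my code, not from the slides; the function name `best_rank_k` is an assumption), the truncated SVD is easy to compute with NumPy:

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A in Frobenius norm,
    obtained by truncating the SVD (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular values and the corresponding
    # singular vectors; discard the rest.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```

In particular, applying it with k equal to the true rank of the matrix reproduces the matrix exactly (up to floating-point error).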

  12. Rank-1 model

  σ_1 u_1 v_1^T = arg min_{X ∈ R^{m×n}} ||Y − X||_F   subject to rank(X) = 1

The solution to

  min_{a ∈ R^m, b ∈ R^n} ||Y − a b^T||_F   subject to ||a||_2 = 1

is a_min = u_1, b_min = σ_1 v_1.

  13. Rank-r model
Certain people like certain movies: r factors

  y[i, j] ≈ Σ_{l=1}^{r} a_l[i] b_l[j]

For each factor l:
◮ a_l[i]: movie i is positively (> 0), negatively (< 0), or not (≈ 0) associated with factor l
◮ b_l[j]: user j likes (> 0), hates (< 0), or is indifferent (≈ 0) to factor l

  14. Rank-r model
Equivalent to Y ≈ AB, with A ∈ R^{m×r} and B ∈ R^{r×n}. The SVD solves

  min_{A ∈ R^{m×r}, B ∈ R^{r×n}} ||Y − AB||_F   subject to ||a_1||_2 = 1, ..., ||a_r||_2 = 1

Problem: there are many possible ways of choosing a_1, ..., a_r and b_1, ..., b_r. The SVD constrains them to be orthogonal.

  15. Collaborative filtering
Recall the ratings matrix Y:

                          Bob  Molly  Mary  Larry
  The Dark Knight          1     1     5     4
  Spiderman 3              2     1     4     5
  Love Actually            4     5     2     1
  Bridget Jones's Diary    5     4     2     1
  Pretty Woman             4     5     1     2
  Superman 2               1     2     5     5

  16. SVD
Centering the matrix by its mean entry and taking the SVD:

  A − µ 1 1^T = USV^T,   S = diag(7.79, 1.62, 1.55, 0.62)

where

  µ := (1 / (mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} A_ij

  17. Rank-1 model
Rank-1 approximation Ā + σ_1 u_1 v_1^T, where Ā := µ 1 1^T (true ratings in parentheses):

                          Bob        Molly      Mary       Larry
  The Dark Knight        1.34 (1)   1.19 (1)   4.66 (5)   4.81 (4)
  Spiderman 3            1.55 (2)   1.42 (1)   4.45 (4)   4.58 (5)
  Love Actually          4.45 (4)   4.58 (5)   1.55 (2)   1.42 (1)
  B.J.'s Diary           4.43 (5)   4.56 (4)   1.57 (2)   1.44 (1)
  Pretty Woman           4.43 (4)   4.56 (5)   1.57 (1)   1.44 (2)
  Superman 2             1.34 (1)   1.19 (2)   4.66 (5)   4.81 (5)
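A minimal NumPy sketch (mine, not part of the slides) of this computation: center the ratings matrix by its mean entry, take the SVD, and form the rank-1 approximation.

```python
import numpy as np

# Ratings matrix from the collaborative-filtering example
# (rows: movies, columns: Bob, Molly, Mary, Larry).
A = np.array([[1., 1., 5., 4.],
              [2., 1., 4., 5.],
              [4., 5., 2., 1.],
              [5., 4., 2., 1.],
              [4., 5., 1., 2.],
              [1., 2., 5., 5.]])

mu = A.mean()                       # mean of all entries
U, s, Vt = np.linalg.svd(A - mu)    # SVD of the centered matrix
# Rank-1 approximation: mean matrix + sigma_1 u_1 v_1^T
approx = mu + s[0] * np.outer(U[:, 0], Vt[0])
```

The singular values in `s` round to the values on the slide, and `approx` reproduces the rank-1 prediction table.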

  18. Movies

         D. Knight  Sp. 3  Love Act.  B.J.'s Diary  P. Woman  Sup. 2
  a_1 = (  −0.45    −0.39    0.39        0.39         0.39    −0.45 )

The coefficients cluster movies into romantic (+) and action (−).

  19. Users

          Bob   Molly   Mary   Larry
  b_1 = ( 3.74   4.05  −3.74  −4.05 )

The coefficients cluster users into romantic (+) and action (−).

  20. Outline: Low-rank models · Matrix completion · Structured low-rank models

  21. Netflix Prize
[Ratings matrix with many unobserved entries, shown as question marks]

  22. Matrix completion
The ratings matrix with missing entries:

                          Bob  Molly  Mary  Larry
  The Dark Knight          1     ?     5     4
  Spiderman 3              ?     1     4     5
  Love Actually            4     5     2     ?
  Bridget Jones's Diary    5     4     2     1
  Pretty Woman             4     5     1     2
  Superman 2               1     2     ?     5

  23. Matrix completion as an inverse problem

  Y = [ 1  ?  5 ]
      [ ?  3  2 ]

For a fixed sampling pattern, the observed entries define an underdetermined system of equations in the vectorized (column-stacked) matrix:

  [ 1 0 0 0 0 0 ] [ Y11 ]   [ 1 ]
  [ 0 0 0 1 0 0 ] [ Y21 ]   [ 3 ]
  [ 0 0 0 0 1 0 ] [ Y12 ] = [ 5 ]
  [ 0 0 0 0 0 1 ] [ Y22 ]   [ 2 ]
                  [ Y13 ]
                  [ Y23 ]
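The sampling operator can be written out explicitly as a selection matrix acting on the column-stacked matrix. A NumPy sketch of the 2 × 3 example above (the variable names are mine):

```python
import numpy as np

m, n = 2, 3
# Observed entries as (row, col) -> value; the '?' entries are omitted.
observed = {(0, 0): 1., (1, 1): 3., (0, 2): 5., (1, 2): 2.}

# Column-major (Fortran-order) vectorization: entry (i, j) has index i + m*j.
S = np.zeros((len(observed), m * n))   # selection matrix
y = np.zeros(len(observed))            # revealed entries
for row, ((i, j), val) in enumerate(observed.items()):
    S[row, i + m * j] = 1.
    y[row] = val

# Applying S to the vectorization of any completion consistent with the
# data recovers the observed values.
Y_full = np.array([[1., 4., 5.],
                   [6., 3., 2.]])      # one (of many) consistent completions
```

With only 4 equations for 6 unknowns, the system is underdetermined, which is why an additional assumption (low rank) is needed.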

  24. Isn't this completely ill posed?
Assumption: the matrix is low rank, so it depends on ≈ r(m + n) parameters.
As long as data > parameters, recovery is possible (in principle):

  [ 1  1  1  1  ?  1 ]
  [ 1  1  1  1  1  1 ]
  [ 1  1  1  1  1  1 ]
  [ ?  1  1  1  1  1 ]

  25. Matrix cannot be sparse

  [ 0  0  0   0  0  0 ]
  [ 0  0  0  23  0  0 ]
  [ 0  0  0   0  0  0 ]
  [ 0  0  0   0  0  0 ]

If we miss the entry equal to 23, we cannot distinguish this matrix from the zero matrix.

  26. Singular vectors cannot be sparse

  [ 1  1  1  1 ]   [ 1 ]             [ 0 ]
  [ 1  1  1  1 ] = [ 1 ] [1 1 1 1] + [ 0 ] [1 2 3 4]
  [ 1  1  1  1 ]   [ 1 ]             [ 0 ]
  [ 2  3  4  5 ]   [ 1 ]             [ 1 ]

The second factor is concentrated on the last row; if we miss entries there, we cannot recover it.

  27. Incoherence
The matrix must be incoherent: its singular vectors must be spread out.
For 1/√n ≤ µ ≤ 1,

  max_{1 ≤ i ≤ m, 1 ≤ j ≤ r} |U_ij| ≤ µ
  max_{1 ≤ i ≤ n, 1 ≤ j ≤ r} |V_ij| ≤ µ

for the left singular vectors U_1, ..., U_r and right singular vectors V_1, ..., V_r.
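A small NumPy sketch (mine, not from the slides) of how spread out the singular vectors of a matrix are: the largest-magnitude entry across the singular vectors corresponding to nonzero singular values. The function name is an assumption.

```python
import numpy as np

def max_entry_of_singular_vectors(A, tol=1e-10):
    """Largest absolute entry across the singular vectors of A
    (restricted to those with nonzero singular value)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol))   # numerical rank
    return max(np.abs(U[:, :r]).max(), np.abs(Vt[:r, :]).max())

# A flat (incoherent) rank-1 matrix: singular vectors are spread out.
flat = np.ones((4, 4))                    # u_1 = v_1 = (1/2, 1/2, 1/2, 1/2)
# A spiky (coherent) rank-1 matrix: singular vectors are standard basis vectors.
spiky = np.zeros((4, 4)); spiky[1, 3] = 23.
```

The all-ones matrix attains the small value 1/√4 = 0.5, while the single-spike matrix from slide 25 attains the maximal value 1.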

  28. Measurements
We must see at least one entry in each row/column:

  [ 1  1  1  1  1 ]
  [ ?  ?  ?  ?  ? ]   — a fully unobserved row can never be recovered
  [ 1  1  1  1  1 ]
  [ 1  1  1  1  1 ]

Assumption: random sampling (usually does not hold in practice!)

  29. Low-rank matrix estimation
First idea:

  min_{X ∈ R^{m×n}} rank(X)   such that X_Ω = y

Ω: indices of revealed entries; y: revealed entries.
Rank minimization is computationally intractable. A tractable alternative:

  min_{X ∈ R^{m×n}} ||X||_*   such that X_Ω = y

  30. Exact recovery
Guarantees by Gross 2011, Candès and Recht 2008, Candès and Tao 2009:

  min_{X ∈ R^{m×n}} ||X||_*   such that X_Ω = y

achieves exact recovery with high probability as long as the number of samples is proportional to r(n + m), up to log factors.
The proof is based on the construction of a dual certificate.

  31. Low-rank matrix estimation
If the data are noisy:

  min_{X ∈ R^{m×n}} ||X_Ω − y||_2² + λ ||X||_*

where λ > 0 is a regularization parameter.

  32. Matrix completion via nuclear-norm minimization
Recovered entries, with the true held-out ratings in parentheses:

                          Bob    Molly  Mary   Larry
  The Dark Knight          1     2 (1)   5      4
  Spiderman 3             2 (2)   1      4      5
  Love Actually            4      5      2     2 (1)
  Bridget Jones's Diary    5      4      2      1
  Pretty Woman             4      5      1      2
  Superman 2               1      2     5 (5)   5

  33. Proximal gradient method
Method to solve the optimization problem

  minimize f(x) + h(x)

where f is differentiable and prox_h is tractable.
Proximal-gradient iteration:

  x^(0) = arbitrary initialization
  x^(k+1) = prox_{α_k h} ( x^(k) − α_k ∇f(x^(k)) )

  34. Proximal operator of the nuclear norm
The solution to

  min_{X ∈ R^{m×n}} (1/2) ||Y − X||_F² + τ ||X||_*

is obtained by soft-thresholding the SVD of Y:

  X_prox = D_τ(Y)
  D_τ(M) := U S_τ(S) V^T,   where M = USV^T

  S_τ(S)_ii := S_ii − τ   if S_ii > τ
               0          otherwise
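A direct NumPy translation of D_τ (a sketch; the function name `svd_soft_threshold` is mine):

```python
import numpy as np

def svd_soft_threshold(M, tau):
    """Proximal operator of tau * nuclear norm:
    soft-threshold the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_thresh = np.maximum(s - tau, 0.0)   # S_tau(S): shrink, then clip at zero
    return (U * s_thresh) @ Vt
```

For example, a matrix with singular values (3, 1) is mapped by D_1 to one with singular values (2, 0), so the operator both shrinks and reduces rank.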

  35. Subdifferential of the nuclear norm
Let X ∈ R^{m×n} be a rank-r matrix with SVD USV^T, where U ∈ R^{m×r}, V ∈ R^{n×r}, and S ∈ R^{r×r}.
A matrix G is a subgradient of the nuclear norm at X if and only if

  G := UV^T + W

where W satisfies

  ||W|| ≤ 1,   U^T W = 0,   W V = 0

  36. Proximal operator of the nuclear norm
The subgradients of

  (1/2) ||Y − X||_F² + τ ||X||_*

are of the form

  X − Y + τ G

where G is a subgradient of the nuclear norm at X.
D_τ(Y) is a minimizer if and only if G = (1/τ)(Y − D_τ(Y)) is a subgradient of the nuclear norm at D_τ(Y).
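Putting the pieces together, here is a hedged sketch (mine, not the lecture's code) of matrix completion by proximal gradient on (1/2)||X_Ω − y||² + λ||X||_*: the gradient of the data-fit term is the masked residual, and the proximal step is singular-value soft-thresholding.

```python
import numpy as np

def complete_matrix(Y_obs, mask, lam=0.1, step=1.0, iters=300):
    """Proximal gradient for (1/2)||X_Omega - y||_2^2 + lam * ||X||_*.
    Y_obs holds the revealed entries (arbitrary values elsewhere);
    mask is True where an entry was revealed."""
    X = np.zeros_like(Y_obs)
    for _ in range(iters):
        grad = mask * (X - Y_obs)   # gradient of the data-fit term
        Z = X - step * grad         # gradient step
        # Proximal step D_tau with tau = step * lam:
        # soft-threshold the singular values of Z.
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        X = (U * np.maximum(s - step * lam, 0.0)) @ Vt
    return X

# Hypothetical example: a rank-1 matrix with one hidden entry.
truth = np.outer([1., 2., 3.], [1., 2., 3.])
mask = np.ones((3, 3), dtype=bool); mask[0, 0] = False
X_hat = complete_matrix(np.where(mask, truth, 0.0), mask)
```

On this toy rank-1 example the hidden entry is recovered approximately (the nuclear-norm penalty shrinks the singular values slightly, so recovery is not exact for λ > 0).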
