SLIDE 1

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

François Caron

Department of Statistics, Oxford

STATLEARN 2014, Paris

April 7, 2014

Joint work with Adrien Todeschini, Marie Chavent (INRIA, U. of Bordeaux)

SLIDE 2

Outline

Introduction
Complete case
Matrix completion
Experiments
Conclusion and perspectives

SLIDE 3

Matrix Completion

◮ Netflix prize
◮ 480k users and 18k movies providing 1–5 ratings
◮ 99% of the ratings are missing
◮ Objective: predict missing entries in order to make recommendations

[Users × movies ratings matrix: a few observed ratings (1–5), with × marking missing entries.]

SLIDE 4

Matrix Completion

Objective

Complete a matrix X of size m × n from a subset of its entries

Applications

◮ Recommender systems
◮ Image inpainting
◮ Imputation of missing data

      ✷ × × × ✷ . . . × × × ✷ × . . . ✷ × ✷ × × . . . ✷ ✷ × ✷ × . . . . . . . . . . . . . . . . . . . . .      

SLIDE 5

Matrix Completion

◮ Potentially large matrices (each dimension of order 10⁴–10⁶)
◮ Very sparsely observed (1%–10%)

SLIDE 6

Low rank Matrix Completion

◮ Assume that the complete matrix Z is of low rank:

Z (m × n) ≃ A (m × k) Bᵀ (k × n), with k ≪ min(m, n).

[Diagram: the m × n matrix Z factored as a tall m × k matrix A times a wide k × n matrix Bᵀ.]

SLIDE 7

Low rank Matrix Completion

◮ Assume that the complete matrix Z is of low rank:

Z (m × n) ≃ A (m × k) Bᵀ (k × n), with k ≪ min(m, n).

[Diagram: the users × items matrix Z factored into user features (rows of A) and item features (rows of B), with k latent features.]

SLIDE 8

Low rank Matrix Completion

◮ Let Ω ⊂ {1, …, m} × {1, …, n} be the subset of observed entries.

◮ For (i, j) ∈ Ω,

Xij = Zij + εij,  εij iid∼ N(0, σ²),  where σ² > 0.

SLIDE 9

Low rank Matrix Completion

◮ Optimization problem

minimize over Z:  1/(2σ²) Σ_{(i,j)∈Ω} (Xij − Zij)²  [− log-likelihood]  +  λ rank(Z)  [penalty]

where λ > 0 is some regularization parameter.

◮ Non-convex
◮ Computationally hard for a general subset Ω

SLIDE 10

Low rank Matrix Completion

◮ Matrix completion with nuclear norm penalty

minimize over Z:  1/(2σ²) Σ_{(i,j)∈Ω} (Xij − Zij)²  [− log-likelihood]  +  λ ‖Z‖∗  [penalty]

where ‖Z‖∗ is the nuclear norm of Z, i.e. the sum of the singular values of Z.

◮ Convex relaxation of the rank-penalized problem [Fazel, 2002, Candès and Recht, 2009, Candès and Tao, 2010]

SLIDE 11

Low rank Matrix Completion

Soft-Impute algorithm

◮ Start with an initial matrix Z(0)
◮ At each iteration t = 1, 2, …
  ◮ Replace the missing elements in X with those in Z(t−1)
  ◮ Perform a soft-thresholded SVD on the completed matrix, with shrinkage λ, to obtain the low-rank matrix Z(t)

[Mazumder et al., 2010]
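To make the recursion concrete, here is a minimal NumPy sketch of the Soft-Impute iteration; the function names and the convergence test are illustrative choices, not taken from the slides.

```python
import numpy as np

def soft_threshold_svd(X, lam):
    """Soft-thresholded SVD: shrink every singular value by lam and truncate at zero."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(d - lam, 0.0)) @ Vt

def soft_impute(X, mask, lam, n_iter=100, tol=1e-5):
    """Soft-Impute [Mazumder et al., 2010]: X holds the data (arbitrary values where
    mask is False); mask is a boolean array marking the observed entries."""
    Z = np.zeros_like(X)
    for _ in range(n_iter):
        X_filled = np.where(mask, X, Z)             # fill missing entries with Z(t-1)
        Z_new = soft_threshold_svd(X_filled, lam)   # shrink all singular values by lam
        if np.linalg.norm(Z_new - Z) <= tol * max(np.linalg.norm(Z), 1.0):
            return Z_new
        Z = Z_new
    return Z
```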

SLIDE 12

Low rank Matrix Completion

Soft-Impute algorithm

◮ Soft-thresholded SVD yields a low-rank representation
◮ Each iteration decreases the value of the nuclear norm objective function towards its minimum
◮ Various strategies proposed to scale the algorithm to problems where n, m are of order 10⁶
◮ Same shrinkage applied to all singular values

[Mazumder et al., 2010]

SLIDE 13

Contributions

◮ Probabilistic interpretation of the nuclear norm objective function
  ◮ Maximum a posteriori estimation assuming exponential priors on the singular values
  ◮ Soft-Impute = Expectation-Maximization algorithm
◮ Construction of alternative non-convex objective functions building on hierarchical priors
  ◮ Bridges the gap between the rank penalty and the nuclear norm penalty
  ◮ EM: adaptive algorithm that iteratively adjusts the shrinkage coefficients for each singular value
  ◮ Similar to the adaptive lasso in multivariate regression
◮ Numerical results show the interest of the approach on various datasets

SLIDE 14

Outline

Introduction
Complete case
  Hierarchical adaptive spectral penalty
  EM algorithm for MAP estimation
Matrix completion
Experiments
Conclusion and perspectives

SLIDE 15

Nuclear Norm penalty

◮ Complete matrix X
◮ Nuclear norm objective function

minimize over Z:  1/(2σ²) ‖X − Z‖²_F + λ ‖Z‖∗

where ‖·‖_F is the Frobenius norm.

◮ Global solution given by a soft-thresholded SVD

Ẑ = S_{λσ²}(X)

where S_λ(X) = U D_λ Vᵀ with D_λ = diag((d₁ − λ)₊, …, (d_r − λ)₊) and t₊ = max(t, 0).

[Cai et al., 2010, Mazumder et al., 2010]
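For a fully observed matrix the solution therefore requires a single SVD; a small illustrative check (the sizes and parameter values below are arbitrary, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, q, sigma, lam = 100, 100, 10, 1.0, 20.0

# Rank-q signal plus Gaussian noise, fully observed.
Z_true = rng.standard_normal((m, q)) @ rng.standard_normal((q, n))
X = Z_true + sigma * rng.standard_normal((m, n))

# Global minimizer of the nuclear-norm objective: one soft-thresholded SVD.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z_hat = U @ np.diag(np.maximum(d - lam * sigma**2, 0.0)) @ Vt
print("rank of the estimate:", np.linalg.matrix_rank(Z_hat))
```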

SLIDE 16

Nuclear Norm penalty

◮ Maximum a posteriori (MAP) estimate

Ẑ = arg max_Z [log p(X|Z) + log p(Z)]

under the prior p(Z) ∝ exp(−λ ‖Z‖∗), where Z = U D Vᵀ with D = diag(d₁, d₂, …, d_r), and

U, V ∼ Haar uniform prior on unitary matrices,
di iid∼ Exp(λ)

SLIDE 17

Hierarchical adaptive spectral penalty

◮ Each singular value has its own random shrinkage coefficient
◮ Hierarchical model, for each singular value i = 1, …, r:

di | γi ∼ Exp(γi),  γi ∼ Gamma(a, b)

◮ Marginal distribution over di:

p(di) = ∫₀^∞ Exp(di; γi) Gamma(γi; a, b) dγi = a bᵃ / (di + b)^(a+1),

a Pareto distribution with heavier tails than the exponential distribution.

[Todeschini et al., 2013]
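One quick way to see the marginal is to simulate the hierarchy; the sketch below (illustrative values of a and b) checks the samples against the Lomax/Pareto-II distribution, whose density is exactly a bᵃ / (d + b)^(a+1):

```python
import numpy as np
from scipy import stats

a, b = 2.0, 1.0
rng = np.random.default_rng(0)

# Hierarchical sampling: gamma_i ~ Gamma(a, rate b), then d_i | gamma_i ~ Exp(gamma_i).
gamma = rng.gamma(shape=a, scale=1.0 / b, size=200_000)
d = rng.exponential(scale=1.0 / gamma)

# Empirical CDF of d versus the Lomax CDF 1 - (b / (b + x))**a.
for x in (0.5, 1.0, 2.0, 4.0):
    print(x, round((d <= x).mean(), 3), round(stats.lomax(c=a, scale=b).cdf(x), 3))
```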

SLIDE 18

Hierarchical adaptive spectral penalty

Figure: Marginal distribution p(di) with a = b = β, plotted for β = ∞, β = 2 and β = 0.1.

◮ HASP penalty

pen(Z) = − log p(Z) = Σ_{i=1}^{r} (a + 1) log(b + di)

◮ Admits the nuclear norm penalty λ‖Z‖∗ as a special case when a = λb and b → ∞.
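This limit can be checked numerically: with a = λb, the penalty (up to the additive constant (a + 1) r log b, which does not affect the minimizer) tends to λ Σ di = λ‖Z‖∗ as b grows. A short sketch with arbitrary singular values:

```python
import numpy as np

d = np.array([5.0, 2.0, 0.5])   # singular values of some matrix Z
lam = 1.5

for beta in (1.0, 10.0, 1e4, 1e8):
    a, b = lam * beta, beta
    # HASP penalty with the constant (a + 1) * r * log(b) subtracted.
    pen = np.sum((a + 1) * (np.log(b + d) - np.log(b)))
    print(beta, round(pen, 4), "->", lam * d.sum())
```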

SLIDE 19

Hierarchical adaptive spectral penalty

Figure: Top: manifolds of constant penalty for a symmetric 2 × 2 matrix Z = [x, y; y, z], for (a) the nuclear norm, the hierarchical adaptive spectral penalty with a = b = β for (b) β = 1 and (c) β = 0.1, and (d) the rank penalty. Bottom: contours of constant penalty for a diagonal matrix [x, 0; 0, z], where one recovers the classical (e) lasso, (f)–(g) hierarchical adaptive lasso (HAL, β = 1 and β = 0.1) and (h) ℓ0 penalties.

SLIDE 20

EM algorithm for MAP estimation

Expectation-Maximization (EM) algorithm to obtain a MAP estimate

Ẑ = arg max_Z [log p(X|Z) + log p(Z)],

i.e. to minimize

L(Z) = 1/(2σ²) ‖X − Z‖²_F + Σ_{i=1}^{r} (a + 1) log(b + di)

SLIDE 21

EM algorithm for MAP estimation

◮ Latent variables: γ = (γ₁, …, γ_r)
◮ E step:

Q(Z, Z∗) = E[log p(X, Z, γ) | Z∗, X] = C − 1/(2σ²) ‖X − Z‖²_F − Σ_{i=1}^{r} ωi di,

where ωi = E[γi | d∗i] = (a + 1)/(b + d∗i).

SLIDE 22

EM algorithm for MAP estimation

◮ M step:

minimize over Z:  1/(2σ²) ‖X − Z‖²_F + Σ_{i=1}^{r} ωi di    (1)

(1) is an adaptive spectral penalty regularized optimization problem, with weights ωi = (a + 1)/(b + d∗i). Since

d∗₁ ≥ d∗₂ ≥ … ≥ d∗_r  ⇒  0 ≤ ω₁ ≤ ω₂ ≤ … ≤ ω_r,    (2)

given condition (2), the solution is a weighted soft-thresholded SVD

Ẑ = S_{σ²ω}(X)    (3)

where S_ω(X) = U D_ω Vᵀ with D_ω = diag((d₁ − ω₁)₊, …, (d_r − ω_r)₊).

[Gaïffas and Lecué, 2011]
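A minimal NumPy sketch of the weighted operator S_ω (the function name is illustrative); it relies on the singular values being returned in decreasing order, so that the non-decreasing weights line up with them:

```python
import numpy as np

def weighted_soft_threshold_svd(X, weights):
    """Weighted soft-thresholded SVD: shrink the i-th singular value by weights[i].

    np.linalg.svd returns singular values in decreasing order, matching the
    assumption d*_1 >= ... >= d*_r and 0 <= w_1 <= ... <= w_r; weights must have
    length min(X.shape).
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(d - weights, 0.0)) @ Vt
```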

SLIDE 23

EM algorithm for MAP estimation

Figure: Thresholding rules applied to the singular values di of X, for the nuclear norm and for HASP with β = 2 and β = 0.1.

The weights penalize large singular values less heavily, hence reducing bias.

SLIDE 24

Low rank estimation of complete matrices

Hierarchical Adaptive Soft-Thresholded (HAST) algorithm for low-rank estimation of complete matrices

Initialize Z(0). At iteration t ≥ 1:
  ◮ For i = 1, …, r, compute the weights ωi(t) = (a + 1)/(b + di(t−1))
  ◮ Set Z(t) = S_{σ²ω(t)}(X)
  ◮ If [L(Z(t−1)) − L(Z(t))] / L(Z(t−1)) < ε, then return Ẑ = Z(t)

◮ Admits the soft-thresholded SVD operator as a special case when a = bλ and b = β → ∞.
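A compact NumPy sketch of this loop (the function name, defaults and stopping details are illustrative); since X is complete, its SVD can be computed once and only the shrinkage changes across iterations:

```python
import numpy as np

def hast(X, a, b, sigma2, n_iter=200, eps=1e-9):
    """Sketch of the HAST iteration for a complete matrix X."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)  # X is fixed, so one SVD suffices
    d_z = d.copy()                                    # singular values of Z(0) = X
    L_prev = np.inf
    for _ in range(n_iter):
        w = (a + 1) / (b + d_z)                       # adaptive weights from Z(t-1)
        d_z = np.maximum(d - sigma2 * w, 0.0)         # weighted soft-thresholding of X
        Z = U @ np.diag(d_z) @ Vt
        L = np.sum((X - Z) ** 2) / (2 * sigma2) + (a + 1) * np.sum(np.log(b + d_z))
        if L_prev - L < eps * abs(L_prev):            # relative decrease below tolerance
            break
        L_prev = L
    return Z
```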

SLIDE 25

Settings

◮ Parametrization:
  ◮ We set b = β and a = λβ, where λ and β are tuning parameters that can be chosen by cross-validation.
  ◮ It is also possible to estimate σ within the EM algorithm.
◮ Initialization:
  ◮ Initialize with the soft-thresholded SVD with parameter σ²λ.

SLIDE 26

Outline

Introduction
Complete case
Matrix completion
Experiments
Conclusion and perspectives

SLIDE 27

Matrix completion

◮ Only a subset Ω ⊂ {1, …, m} × {1, …, n} of the entries of the matrix X is observed.
◮ Subset operators:

PΩ(X)(i, j) = Xij if (i, j) ∈ Ω, and 0 otherwise;
PΩ⊥(X)(i, j) = 0 if (i, j) ∈ Ω, and Xij otherwise.

◮ Same prior over Z
◮ MAP estimate obtained by minimizing

L(Z) = 1/(2σ²) ‖PΩ(X) − PΩ(Z)‖²_F + (a + 1) Σ_{i=1}^{r} log(b + di)
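With the matrix stored densely and a boolean mask for Ω, these two projections are one-liners; a minimal sketch (function names illustrative):

```python
import numpy as np

def P_omega(X, mask):
    """Keep X on the observed entries Omega, set everything else to zero."""
    return np.where(mask, X, 0.0)

def P_omega_perp(X, mask):
    """Complementary projection: zero on Omega, keep X on the unobserved entries."""
    return np.where(mask, 0.0, X)
```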

SLIDE 28

EM algorithm for MAP estimation

◮ Latent variables: γ and PΩ⊥(X)

Hierarchical Adaptive Soft-Impute (HASI) algorithm for matrix completion

Initialize Z(0). At iteration t ≥ 1:
  ◮ For i = 1, …, r, compute the weights ωi(t) = (a + 1)/(b + di(t−1))
  ◮ Set Z(t) = S_{σ²ω(t)}(PΩ(X) + PΩ⊥(Z(t−1)))
  ◮ If [L(Z(t−1)) − L(Z(t))] / L(Z(t−1)) < ε, then return Ẑ = Z(t)
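A minimal NumPy sketch of the HASI loop, under the same conventions as the earlier Soft-Impute sketch (boolean mask for Ω; function name and stopping rule illustrative):

```python
import numpy as np

def hasi(X, mask, a, b, sigma2, n_iter=200, eps=1e-9):
    """Sketch of the HASI iteration: X holds the observed ratings (arbitrary values
    where mask is False), mask is a boolean array marking the observed entries."""
    Z = np.zeros_like(X)
    d_z = np.zeros(min(X.shape))
    L_prev = np.inf
    for _ in range(n_iter):
        w = (a + 1) / (b + d_z)                       # adaptive weights from Z(t-1)
        X_filled = np.where(mask, X, Z)               # P_Omega(X) + P_Omega_perp(Z(t-1))
        U, d, Vt = np.linalg.svd(X_filled, full_matrices=False)
        d_z = np.maximum(d - sigma2 * w, 0.0)         # weighted soft-thresholding
        Z = U @ np.diag(d_z) @ Vt
        resid = np.where(mask, X - Z, 0.0)
        L = np.sum(resid ** 2) / (2 * sigma2) + (a + 1) * np.sum(np.log(b + d_z))
        if L_prev - L < eps * abs(L_prev):            # relative decrease below tolerance
            break
        L_prev = L
    return Z
```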

SLIDE 29

EM algorithm for MAP estimation

◮ The HASI algorithm admits the Soft-Impute algorithm as a special case when a = λb and b = β → ∞. In this case, ωi(t) = λ for all i.
◮ When β < ∞, the algorithm adaptively updates the weights so as to penalize large singular values less heavily.

SLIDE 30

Initialization

◮ Non-convex objective function: different initializations may lead to different modes.
◮ We set a = λb and b = β, and initialize the algorithm with the Soft-Impute algorithm with regularization parameter σ²λ.

SLIDE 31

Scaling

◮ As in the Soft-Impute algorithm, the computational bottleneck is the weighted soft-thresholded SVD S_{σ²ω(t)}(PΩ(X) + PΩ⊥(Z(t−1))).
◮ For large matrices, one can resort to the PROPACK algorithm.
  ◮ It efficiently computes the truncated SVD of the "sparse + low rank" matrix

PΩ(X) + PΩ⊥(Z(t−1)) = [PΩ(X) − PΩ(Z(t−1))] (sparse) + Z(t−1) (low rank)

and can thus handle large matrices.

[Larsen, 2004]
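PROPACK itself is a Fortran library; the same "sparse + low rank" trick can be sketched in Python with SciPy's Lanczos-based svds applied to a LinearOperator that never forms the dense completed matrix (all names and sizes below are illustrative):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

def completed_matrix_operator(S_sparse, A, B):
    """Operator for S + A @ B.T, where S = P_Omega(X) - P_Omega(Z) is sparse and
    Z = A @ B.T is the current low-rank iterate; the dense sum is never formed."""
    m, n = S_sparse.shape
    matvec = lambda v: S_sparse @ v + A @ (B.T @ v)
    rmatvec = lambda v: S_sparse.T @ v + B @ (A.T @ v)
    return LinearOperator((m, n), matvec=matvec, rmatvec=rmatvec)

# Toy usage: truncated SVD of the "sparse + low rank" completed matrix.
rng = np.random.default_rng(0)
m, n, k = 500, 400, 5
A, B = rng.standard_normal((m, k)), rng.standard_normal((n, k))
S = sp.random(m, n, density=0.01, random_state=0, format="csr")  # sparse residual term
U, d, Vt = svds(completed_matrix_operator(S, A, B), k=10)
```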

SLIDE 32

Outline

Introduction
Complete case
Matrix completion
Experiments
  Simulated data
  Collaborative filtering examples
Conclusion and perspectives

SLIDE 33

Simulated data

Procedure

◮ We generate matrices Z = A Bᵀ of low rank q, where A and B are Gaussian matrices of size m × q and n × q, with m = n = 100, and add Gaussian noise with σ = 1.
◮ The signal-to-noise ratio is defined as SNR = var(Z)/σ².
◮ For the HASP penalty, we set a = λβ and b = β.
◮ Grid of 50 values of the regularization parameter λ.
◮ Metrics:

err = ‖Ẑ − Z‖²_F / ‖Z‖²_F  and  errΩ⊥ = ‖PΩ⊥(Ẑ) − PΩ⊥(Z)‖²_F / ‖PΩ⊥(Z)‖²_F
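A sketch of this simulation setup and of the two error metrics (the zero matrix is used only as a placeholder estimate Ẑ; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 100
q, sigma, miss_frac = 5, 1.0, 0.5

# Low-rank signal Z = A B^T plus Gaussian noise, observed on a random subset Omega.
A, B = rng.standard_normal((m, q)), rng.standard_normal((n, q))
Z = A @ B.T
X = Z + sigma * rng.standard_normal((m, n))
mask = rng.random((m, n)) > miss_frac          # True = observed entry

# Relative reconstruction errors for some estimate Z_hat.
Z_hat = np.zeros_like(Z)
err = np.sum((Z_hat - Z) ** 2) / np.sum(Z ** 2)
err_unobs = np.sum((Z_hat - Z)[~mask] ** 2) / np.sum(Z[~mask] ** 2)
print(err, err_unobs)
```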

SLIDE 34

Simulated data

Complete case

Figure: Relative error w.r.t. the rank, obtained by varying the value of the regularization parameter λ (SNR = 1, complete matrix, true rank 10; methods: ST, HT, and HAST with β = 100, 10, 1).

◮ The HASP penalty provides a bridge/tradeoff between the nuclear norm and the rank penalty.
◮ For example, β = 10 shows a minimum at the true rank q = 10, as HT does, but with a lower error when the rank is overestimated.

SLIDE 35

Simulated data

Incomplete case

Figure: Test error w.r.t. the rank, obtained by varying the value of the regularization parameter λ and averaged over 50 replications. (a) SNR = 1, 50% missing, true rank 5; (b) SNR = 10, 80% missing, true rank 5. Methods: MMMF, SoftImp, SoftImp+, HardImp, and HASI with β = 100, 10, 1.

◮ Similar behavior is observed, with the HASI algorithm attaining a minimum at the true rank q = 5.

SLIDE 36

Simulated data

Incomplete case

We then remove 20% of the observed entries as a validation set to estimate the regularization parameters. We use the unobserved entries as a test set.

Figure: Boxplots of the test error and of the selected ranks over 50 replications, for HASI, HardImp, SoftImp+, SoftImp and MMMF. (a)–(b) SNR = 1, 50% missing; (c)–(d) SNR = 10, 80% missing.

SLIDE 37

Collaborative filtering examples (Jester)

Procedure

◮ We randomly select two ratings per user as a test set, and two other ratings per user as a validation set to select the parameters λ and β.
◮ The results are computed over four values β = 1000, 100, 10, 1.
◮ We compare the different methods using the Normalized Mean Absolute Error (NMAE):

NMAE = (1 / card(Ωtest)) Σ_{(i,j)∈Ωtest} |Xij − Ẑij| / (max(X) − min(X))
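A direct transcription of this metric in NumPy (a small sketch; the boolean-mask convention follows the earlier snippets):

```python
import numpy as np

def nmae(X, Z_hat, test_mask):
    """Normalized mean absolute error over the test entries."""
    abs_err = np.abs(X - Z_hat)[test_mask].mean()
    return abs_err / (X.max() - X.min())
```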

SLIDE 38

Collaborative filtering examples (Jester)

Table: Results on the Jester datasets, averaged over 10 replications

            Jester 1            Jester 2            Jester 3
            24983 × 100         23500 × 100         24938 × 100
            27.5% miss.         27.3% miss.         75.3% miss.
Method      NMAE      Rank      NMAE      Rank      NMAE      Rank
MMMF        0.161     95        0.162     96        0.183     58
Soft Imp    0.161     100       0.162     100       0.184     78
Soft Imp+   0.169     14        0.171     11        0.184     33
Hard Imp    0.158     7         0.159     6         0.181     4
HASI        0.153     100       0.153     100       0.174     30

SLIDE 39

Collaborative filtering examples (Jester)

Figure: NMAE w.r.t. the rank, obtained by varying the regularization parameter λ, on (a) Jester 1 and (b) Jester 3. Methods: MMMF, SoftImp, SoftImp+, HardImp, and HASI with β = 1000, 100, 10, 1.

SLIDE 40

Collaborative filtering examples (MovieLens)

Procedure

◮ We randomly select 20% of the entries as a test set; the remaining entries are split between a training set (80%) and a validation set (20%).
◮ For all methods, we stop the regularization path as soon as the estimated rank exceeds rmax = 100.
◮ For the larger MovieLens 1M dataset, the precision, maximum number of iterations and maximum rank are decreased to ε = 10⁻⁶, tmax = 100 and rmax = 30.

SLIDE 41

Collaborative filtering examples (MovieLens)

Table: Results on the MovieLens datasets, averaged over 5 replications

            MovieLens 100k      MovieLens 1M
            943 × 1682          6040 × 3952
            93.7% miss.         95.8% miss.
Method      NMAE      Rank      NMAE      Rank
MMMF        0.195     50        0.169     30
Soft Imp    0.197     156       0.176     30
Soft Imp+   0.197     108       0.189     30
Hard Imp    0.190     7         0.175     8
HASI        0.187     35        0.172     27

SLIDE 42

Outline

Introduction
Complete case
Matrix completion
Experiments
Conclusion and perspectives

SLIDE 43

Conclusion and perspectives

◮ Conclusion:
  ◮ Good results compared to several alternative low-rank matrix completion methods.
  ◮ Bridge between nuclear norm and rank regularization algorithms.
  ◮ Can be extended to binary matrices.
  ◮ Non-convex optimization, but experiments show that initializing the algorithm with the Soft-Impute algorithm gives very satisfactory results.
  ◮ Matlab code available online.
◮ Perspectives:
  ◮ Fully Bayesian approach
  ◮ Larger datasets

SLIDE 44

Bibliography I

Cai, J., Candès, E., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982.

Candès, E. and Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772.

Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.

Fazel, M. (2002). Matrix rank minimization with applications. PhD thesis, Stanford University.

Gaïffas, S. and Lecué, G. (2011). Weighted algorithms for compressed sensing and matrix completion. arXiv preprint arXiv:1107.1638.

Larsen, R. M. (2004). PROPACK: software for large and sparse SVD calculations. Available online: http://sun.stanford.edu/rmunk/PROPACK.

SLIDE 45

Bibliography II

Mazumder, R., Hastie, T., and Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322.

Todeschini, A., Caron, F., and Chavent, M. (2013). Probabilistic low-rank matrix completion with adaptive spectral regularization algorithms. In Advances in Neural Information Processing Systems, pages 845–853.
