Fast Coordinate Descent methods for Non-Negative Matrix Factorization
Inderjit S. Dhillon, University of Texas at Austin
SIAM Conference on Applied Linear Algebra, Valencia, Spain, June 19, 2012
Joint work with Cho-Jui Hsieh

Outline

1. Non-negative Matrix Factorization
   - Non-negative Matrix Factorization (NMF)
   - Greedy Coordinate Descent (GCD) for least squares NMF
   - NMF with KL-divergence
   - Non-negative Tensor Factorization (NTF)

Problem Definition

Input: a non-negative matrix $V \in \mathbb{R}^{m \times n}$ and the target rank $k$.
Output: two non-negative matrices $W \in \mathbb{R}^{m \times k}$ and $H \in \mathbb{R}^{n \times k}$ such that $WH^T$ is a good approximation to $V$. Usually $m, n \gg k$.

How to measure goodness of approximation? Two widely used choices:

Least squares NMF:
$$\min_{W,H \ge 0} f(W,H) \equiv \|V - WH^T\|_F^2 = \sum_{i,j} \big(V_{ij} - (WH^T)_{ij}\big)^2$$

KL-divergence NMF:
$$\min_{W,H \ge 0} L(W,H) \equiv \sum_{i,j} \Big( V_{ij} \log\big(V_{ij} / (WH^T)_{ij}\big) - V_{ij} + (WH^T)_{ij} \Big)$$
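As a concrete reading of the two objectives on this slide, the following minimal sketch evaluates both on small nested-list matrices. The function names are illustrative and not from the slides. The KL version assumes every entry of $WH^T$ is positive and uses the convention $0 \log 0 = 0$.

```python
import math

def reconstruct(W, H):
    """Return W H^T as nested lists, for W (m x k) and H (n x k)."""
    return [[sum(w * h for w, h in zip(w_row, h_row)) for h_row in H]
            for w_row in W]

def least_squares_loss(V, W, H):
    """f(W, H) = sum_ij (V_ij - (W H^T)_ij)^2, as written on the slide."""
    A = reconstruct(W, H)
    return sum((v - a) ** 2
               for v_row, a_row in zip(V, A)
               for v, a in zip(v_row, a_row))

def kl_loss(V, W, H):
    """L(W, H) = sum_ij V_ij log(V_ij / A_ij) - V_ij + A_ij, with A = W H^T."""
    A = reconstruct(W, H)
    total = 0.0
    for v_row, a_row in zip(V, A):
        for v, a in zip(v_row, a_row):
            log_term = v * math.log(v / a) if v > 0 else 0.0  # 0 log 0 := 0
            total += log_term - v + a
    return total

V = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.0], [2.0]]   # m = 2, k = 1
H = [[1.0], [2.0]]   # n = 2, k = 1
print(least_squares_loss(V, W, H), kl_loss(V, W, H))
```

In this example $WH^T$ differs from $V$ only in the entry $V_{21} = 3$, which is approximated by 2, so the squared loss is 1 and the KL loss is $3 \log 1.5 - 1$.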

Problem Definition (Cont'd)

Applications: text mining, image processing, ...
NMF can give a more interpretable basis than the SVD.
To achieve better sparsity, researchers have proposed adding L1 regularization terms on $W$ and $H$:
$$(W, H) = \arg\min_{W,H \ge 0} \; \frac{1}{2}\|V - WH^T\|_F^2 + \rho_1 \sum_{i,r} W_{ir} + \rho_2 \sum_{j,r} H_{jr}$$

Existing Optimization Methods

NMF is nonconvex, but it is convex when $W$ or $H$ is fixed.
Recent methods follow the alternating minimization framework: iteratively solve $\min_{W \ge 0} f(W,H)$ and $\min_{H \ge 0} f(W,H)$ until convergence.
For least squares NMF, each sub-problem can be solved exactly or approximately by:
1. Multiplicative rule (Lee and Seung, 2001)
2. Projected gradient method (Lin, 2007)
3. Newton type updates (Kim, Sra and Dhillon, 2007)
4. Active set method (Kim and Park, 2008)
5. Cyclic coordinate descent method (Cichocki and Phan, 2009)
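To illustrate the alternating framework, here is a minimal sketch that uses the first method in the list, the multiplicative rule. The slides do not give the update formula. The version below is the standard Lee and Seung rule for the squared loss, rewritten for the $V \approx WH^T$ convention used here. The helper names and the small `eps` guard in the denominator are assumptions of this sketch.

```python
def matmul(A, B):
    """Product of nested-list matrices A (p x q) and B (q x r)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def scale(X, num, den, eps=1e-12):
    """Entrywise X * num / (den + eps); eps guards against division by zero."""
    return [[x * n / (d + eps) for x, n, d in zip(*rows)]
            for rows in zip(X, num, den)]

def multiplicative_pass(V, W, H):
    """One alternating pass: update W with H fixed, then H with the new W."""
    # W_ir <- W_ir * (V H)_ir / (W H^T H)_ir
    W = scale(W, matmul(V, H), matmul(W, matmul(transpose(H), H)))
    # H_jr <- H_jr * (V^T W)_jr / (H W^T W)_jr
    H = scale(H, matmul(transpose(V), W), matmul(H, matmul(transpose(W), W)))
    return W, H

def loss(V, W, H):
    """Squared Frobenius error ||V - W H^T||_F^2."""
    A = matmul(W, transpose(H))
    return sum((v - a) ** 2 for vr, ar in zip(V, A) for v, a in zip(vr, ar))

V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W = [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0]]   # m = 3, k = 2
H = [[1.0, 0.5], [0.5, 1.0]]               # n = 2, k = 2
for _ in range(200):
    W, H = multiplicative_pass(V, W, H)
print(loss(V, W, H))
```

Because every factor in the update is non-negative, $W$ and $H$ stay non-negative without an explicit projection, which is the main appeal of this rule.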

Coordinate Descent Method

Update one variable at a time until convergence: $(W, H) \leftarrow (W + sE_{ir}, H)$, where $E_{ir}$ is the matrix with a one in position $(i, r)$ and zeros elsewhere.
Get $s$ by solving a one-variable problem:
$$\min_{s:\, W_{ir} + s \ge 0} g^W_{ir}(s) \equiv f(W + sE_{ir}, H).$$
For square loss, $g^W_{ir}$ has a closed form solution:
$$s^* = \max\big(0,\; W_{ir} - g'_{ir}(0) / g''_{ir}(0)\big) - W_{ir},$$
where
$$g'_{ir}(0) = \nabla_{W_{ir}} f(W,H) = (WH^TH - VH)_{ir}, \qquad g''_{ir}(0) = \nabla^2_{W_{ir}} f(W,H) = (H^TH)_{rr}.$$
These derivatives correspond to the objective scaled as $f = \frac{1}{2}\|V - WH^T\|_F^2$, the scaling used in the L1-regularized formulation. Without the factor $\frac{1}{2}$ both derivatives double and $s^*$ is unchanged.
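The closed-form step can be checked on a tiny example. The sketch below computes $s^*$ for a single entry $W_{ir}$ directly from the formulas on this slide. It recomputes the gradient entry from scratch, which costs $O(nk)$ and is meant for clarity only. The function name is illustrative, and the zero-curvature guard is an assumption of this sketch.

```python
def coordinate_step(V, W, H, i, r):
    """Optimal step s* for W[i][r], with H fixed, under the square loss."""
    n, k = len(H), len(H[0])
    # g'(0) = (W H^T H - V H)_{ir} = sum_j ((W H^T)_{ij} - V_{ij}) * H_{jr}
    grad = 0.0
    for j in range(n):
        residual = sum(W[i][t] * H[j][t] for t in range(k)) - V[i][j]
        grad += residual * H[j][r]
    # g''(0) = (H^T H)_{rr}
    curv = sum(H[j][r] ** 2 for j in range(n))
    if curv == 0.0:
        return 0.0  # column r of H is all zeros: the objective ignores W[i][r]
    # Newton step on the one-variable quadratic, clipped so W[i][r] + s >= 0
    return max(0.0, W[i][r] - grad / curv) - W[i][r]

V = [[2.0, 4.0]]
W = [[0.0]]
H = [[1.0], [2.0]]
W[0][0] += coordinate_step(V, W, H, 0, 0)
print(W)  # [[2.0]], so W H^T reproduces V exactly
```

Here the gradient is $-10$ and the curvature is $5$, so the unclipped minimizer $W_{ir} = 2$ is feasible and the step is $s^* = 2$.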

Cyclic Coordinate Descent for Least Squares NMF (FastHals)

Recently, Cichocki and Phan (2009) proposed a cyclic coordinate descent algorithm (FastHals) for least squares NMF.
Fixed update sequence: $W_{1,1}, W_{1,2}, \ldots, W_{1,k}, W_{2,1}, \ldots, W_{m,k}, H_{1,1}, \ldots, H_{n,k}, W_{1,1}, \ldots$
Each update has time complexity $O(k)$.
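The $O(k)$ cost per update comes from keeping the gradient row up to date instead of recomputing it. The following minimal sketch runs one cyclic pass over $W$ in the order given on this slide. It illustrates the update order and the bookkeeping only and is not the FastHals reference implementation. The function name is an assumption, and the pass over $H$ would be the same routine applied to $V^T$ with the roles of $W$ and $H$ swapped.

```python
def cyclic_sweep_W(V, W, H):
    """One cyclic pass W_11, W_12, ..., W_mk with H fixed; W is updated in place."""
    m, n, k = len(W), len(H), len(H[0])
    HtH = [[sum(H[j][a] * H[j][b] for j in range(n)) for b in range(k)]
           for a in range(k)]
    for i in range(m):
        # Row i of the gradient G = W H^T H - V H, computed once per row.
        g = [sum(W[i][t] * HtH[t][r] for t in range(k))
             - sum(V[i][j] * H[j][r] for j in range(n))
             for r in range(k)]
        for r in range(k):
            if HtH[r][r] == 0.0:
                continue  # zero column in H: W[i][r] has no effect
            s = max(0.0, W[i][r] - g[r] / HtH[r][r]) - W[i][r]
            W[i][r] += s
            for j in range(k):  # O(k) work to keep the gradient row current
                g[j] += s * HtH[r][j]
    return W

V = [[2.0, 4.0], [1.0, 2.0]]
H = [[1.0], [2.0]]
print(cyclic_sweep_W(V, [[0.0], [0.0]], H))  # [[2.0], [1.0]]
```

Every variable is visited exactly once per pass, whether or not updating it changes the objective, which is the behavior the next slide questions.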

Variable Selection

FastHals updates variables uniformly. However, an efficient algorithm should update variables with frequency proportional to their “importance”!
We propose a Greedy Coordinate Descent method (GCD) for NMF.

[Figure with panels "The behavior of FastHals", "The behavior of GCD", and "# updates vs obj". Axis labels: variables in H, number of updates, values in the solution, Number of Coordinate Updates, Objective Value. Legend: CD, FastHals.]

Greedy Coordinate Descent (GCD)

Strategy: select the variables which maximally reduce the objective function.
When $W_{ir}$ is selected, the objective function can be reduced by
$$D^W_{ir} \equiv f(W,H) - f(W + s^*E_{ir}, H) = -G^W_{ir}\, s^* - \frac{1}{2}(H^TH)_{rr}(s^*)^2,$$
where $G^W \equiv \nabla_W f(W,H) = WH^TH - VH$ and $s^*$ is the optimal step size.
If $D^W$ can be easily maintained, we can choose the variables with the largest objective function value reduction according to $D^W$.
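To make the selection rule concrete, the sketch below computes the step matrix and the reduction matrix $D^W$ from scratch and picks the entry with the largest reduction. It assumes the $\frac{1}{2}$-scaled squared loss that the formula for $D^W_{ir}$ implies. The function names are illustrative, and computing everything from scratch is for exposition only.

```python
def reduction_matrix(V, W, H):
    """Return (S, D): optimal steps s*_{ir} and reductions D^W_{ir} for all i, r."""
    m, n, k = len(W), len(H), len(H[0])
    HtH = [[sum(H[j][a] * H[j][b] for j in range(n)) for b in range(k)]
           for a in range(k)]
    S, D = [], []
    for i in range(m):
        s_row, d_row = [], []
        for r in range(k):
            # G_ir = (W H^T H - V H)_ir
            g = (sum(W[i][t] * HtH[t][r] for t in range(k))
                 - sum(V[i][j] * H[j][r] for j in range(n)))
            h = HtH[r][r]
            s = max(0.0, W[i][r] - g / h) - W[i][r] if h > 0.0 else 0.0
            s_row.append(s)
            d_row.append(-g * s - 0.5 * h * s * s)  # D_ir = -G_ir s - (1/2) h s^2
        S.append(s_row)
        D.append(d_row)
    return S, D

def best_variable(D):
    """Index (i, r) of the largest entry of D, the greedy choice."""
    return max(((i, r) for i in range(len(D)) for r in range(len(D[0]))),
               key=lambda ir: D[ir[0]][ir[1]])

V = [[2.0, 4.0], [1.0, 2.0]]
W = [[0.0], [0.0]]
H = [[1.0], [2.0]]
S, D = reduction_matrix(V, W, H)
print(S, D, best_variable(D))  # [[2.0], [1.0]] [[10.0], [2.5]] (0, 0)
```

In this example the objective $\frac{1}{2}\|V - WH^T\|_F^2$ starts at 12.5 and drops to 2.5 after updating $W_{11}$, matching $D^W_{11} = 10$.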

How to maintain $D^W$ (objective value reduction)

$s^*$ can be computed from $G^W$ and $H^TH$ (from the one-variable update rule), and
$$D^W_{ir} = -G^W_{ir}\, s^* - \frac{1}{2}(H^TH)_{rr}(s^*)^2, \quad \text{where } G^W = WH^TH - VH.$$
Therefore, we can maintain $D^W$ if $G^W$ and $H^TH$ are known.
When $W_{ir} \leftarrow W_{ir} + s^*$, the $i$th row of $G^W$ is changed:
$$G^W_{ij} \leftarrow G^W_{ij} + s^*(H^TH)_{rj} \quad \forall j = 1, \ldots, k.$$
Therefore, the time for maintaining $D^W$ is only $O(k)$, which is the same time complexity as cyclic coordinate descent!
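The row update can be verified numerically. In the sketch below, `apply_update` takes the optimal step on one entry and refreshes row $i$ of $G^W$ and $D^W$ in $O(k)$, while `full_gradient` recomputes $G^W$ from scratch for comparison. Both names are illustrative, and the in-place list representation is an assumption of this sketch.

```python
def full_gradient(V, W, H):
    """G = W H^T H - V H computed from scratch, used only to check the update."""
    m, n, k = len(W), len(H), len(H[0])
    return [[sum((sum(W[i][t] * H[j][t] for t in range(k)) - V[i][j]) * H[j][r]
                 for j in range(n)) for r in range(k)] for i in range(m)]

def apply_update(W, G, D, HtH, i, r):
    """Take the optimal step on W[i][r], then refresh row i of G and D in O(k)."""
    k = len(HtH)
    h = HtH[r][r]
    s = max(0.0, W[i][r] - G[i][r] / h) - W[i][r] if h > 0.0 else 0.0
    W[i][r] += s
    for j in range(k):
        G[i][j] += s * HtH[r][j]              # G_ij <- G_ij + s* (H^T H)_rj
        hj = HtH[j][j]
        sj = max(0.0, W[i][j] - G[i][j] / hj) - W[i][j] if hj > 0.0 else 0.0
        D[i][j] = -G[i][j] * sj - 0.5 * hj * sj * sj
    return s

V = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.0, 1.0], [1.0, 1.0]]
H = [[1.0, 1.0], [0.0, 1.0]]
HtH = [[1.0, 1.0], [1.0, 2.0]]                # H^T H for the H above
G = full_gradient(V, W, H)
D = [[0.0, 0.0], [0.0, 0.0]]                  # only row 1 is refreshed below
apply_update(W, G, D, HtH, 1, 1)
print(G == full_gradient(V, W, H))  # True
```

Only row $i$ is touched because changing $W_{ir}$ changes only row $i$ of $WH^TH$.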

Greedy Coordinate Descent (GCD)

Following the alternating minimization framework, our algorithm GCD alternately updates variables in $W$ and $H$.
When updating one variable in $W$, we can maintain $D^W$ in $O(k)$ time.
We conduct a sequence of updates on $W$: $W^{(0)}, W^{(1)}, \ldots$ with a corresponding sequence $(D^W)^{(0)}, (D^W)^{(1)}, \ldots$
When to switch from $W$'s updates to $H$'s updates? We update variables in $W$ until the maximum function value decrease is small enough:
$$\max_{j} D^W_{ij} < \epsilon\, p_{init}, \quad \text{where } p_{init} = \max_{i,j} (D^W)^{(0)}_{ij}.$$

Greedy Coordinate Descent (GCD)

Initialize $H^TH$, $W^TW$.
While (not converged)
1. Compute $G^W = W(H^TH) - VH$.
2. Compute $D^W$ according to $G^W$.
3. Compute $p_{init} = \max_{i,r} D^W_{ir}$.
4. For each row $i$ of $W$
   - $q_i = \arg\max_r D^W_{i,r}$
   - While $D^W_{i,q_i} > \epsilon\, p_{init}$
     4.1 Update $W_{i,q_i}$.
     4.2 Update $W^TW$ and $D^W$.
     4.3 $q_i \leftarrow \arg\max_r D^W_{ir}$
5. For updates to $H$, repeat steps analogous to Steps 1 through 4.
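The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The outer "not converged" test is replaced by a fixed iteration count, the Gram matrix is recomputed at the start of each phase instead of being maintained incrementally, and `max_inner` is a safety cap that does not appear on the slides. All function names are hypothetical.

```python
def gram(A):
    """Return A^T A for a nested-list matrix A."""
    cols = list(zip(*A))
    return [[sum(x * y for x, y in zip(a, b)) for b in cols] for a in cols]

def step_and_gain(w, g, h):
    """Optimal step s* and objective decrease for one variable with value w,
    gradient g and curvature h."""
    if h <= 0.0:
        return 0.0, 0.0
    s = max(0.0, w - g / h) - w
    return s, -g * s - 0.5 * h * s * s

def greedy_phase(V, W, H, eps, max_inner=1000):
    """Steps 1 to 4: greedy updates of W (in place) with H fixed."""
    m, k = len(W), len(W[0])
    HtH = gram(H)
    # Step 1: G = W (H^T H) - V H
    G = [[sum(W[i][t] * HtH[t][r] for t in range(k))
          - sum(v * h_row[r] for v, h_row in zip(V[i], H))
          for r in range(k)] for i in range(m)]
    # Step 2: D from G.  Step 3: p_init = max_{i,r} D_ir.
    D = [[step_and_gain(W[i][r], G[i][r], HtH[r][r])[1] for r in range(k)]
         for i in range(m)]
    p_init = max(max(row) for row in D)
    # Step 4: per-row greedy updates
    for i in range(m):
        q = max(range(k), key=lambda r: D[i][r])
        count = 0
        while D[i][q] > eps * p_init and count < max_inner:
            s, _ = step_and_gain(W[i][q], G[i][q], HtH[q][q])
            W[i][q] += s                                   # 4.1
            for j in range(k):                             # 4.2, O(k) work
                G[i][j] += s * HtH[q][j]
                D[i][j] = step_and_gain(W[i][j], G[i][j], HtH[j][j])[1]
            q = max(range(k), key=lambda r: D[i][r])       # 4.3
            count += 1

def gcd_nmf(V, W, H, eps=1e-3, outer_iters=30):
    """Alternate greedy phases on W and H; returns the updated (W, H)."""
    Vt = [list(col) for col in zip(*V)]
    for _ in range(outer_iters):
        greedy_phase(V, W, H, eps)
        greedy_phase(Vt, H, W, eps)  # Step 5: same steps on V^T ~ H W^T
    return W, H

V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]   # rank one
W, H = gcd_nmf(V, [[0.5], [0.5], [0.5]], [[0.5], [0.5], [0.5]])
print(W, H)
```

The phase for $H$ reuses the same routine because $V^T \approx HW^T$ has the same form as $V \approx WH^T$ with the two factors swapped.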

Comparisons

Times are in seconds.

dataset   m        n       k    relative error   GCD     FHals   PGrad   BPivot
Synth03   500      1,000   10   $10^{-4}$        2.3     2.1     1.7     0.6
Synth03   500      1,000   30   $10^{-4}$        9.3     26.6    12.4    4.0
Synth08   500      1,000   10   $10^{-4}$        0.21    0.43    0.53    0.56
Synth08   500      1,000   30   $10^{-4}$        0.43    0.77    2.54    2.86
CBCL      361      2,429   49   0.0410           2.3     4.0     13.5    10.6
CBCL      361      2,429   49   0.0376           8.9     18.0    45.6    30.9
CBCL      361      2,429   49   0.0373           14.6    29.0    84.6    51.5
ORL       10,304   400     25   0.0365           1.8     6.5     9.0     7.4
ORL       10,304   400     25   0.0335           14.1    30.3    98.6    33.9
ORL       10,304   400     25   0.0332           33.0    63.3    256.8   76.5

Comparisons

Results on MNIST ($m = 780$, $n = 60000$, #nz $= 8994156$, $k = 10$).

[Figure: relative function value difference (log scale, $10^{0}$ down to $10^{-5}$) against time in seconds (0 to 350) for GCD, FastHals, and BlockPivot.]
