Data Mining and Matrices 06 Non-Negative Matrix Factorization - PowerPoint PPT Presentation

Data Mining and Matrices 06 – Non-Negative Matrix Factorization Rainer Gemulla, Pauli Miettinen May 23, 2013

Non-Negative Datasets Some datasets are intrinsically non-negative: Counters (e.g., no. occurrences of each word in a text document) Quantities (e.g., amount of each ingredient in a chemical experiment) Intensities (e.g., intensity of each color in an image) The corresponding data matrix D has only non-negative values. Decompositions such as SVD and SDD may involve negative values in factors and components Negative values describe the absence of something Often no natural interpretation Can we find a decomposition that is more natural to non-negative data? 2 / 39

Example (SVD) Consider the following “bridge” matrix and its truncated SVD: 1.5 0 0.3 0.6 0.3 0.6 0.3 1 1 1 1 1 0.8 0.6 0 1.3 0.5 −0.3 0.5 −0.3 0.5 0 1 0 1 0 0.5 −0.5 0 1 0 1 0 0.5 −0.5 V T D = U Σ Here are the corresponding components: 1 1 1 1 1 0.6 1.3 0.6 1.3 0.6 0.4 −0.3 0.4 −0.3 0.4 0 1 0 1 0 0.3 0.8 0.3 0.8 0.3 −0.3 0.2 −0.3 0.2 −0.3 0 1 0 1 0 0.3 0.8 0.3 0.8 0.3 −0.3 0.2 −0.3 0.2 −0.3 U ∗ 1 D 11 V T U ∗ 2 D 22 V T = + D ∗ 1 ∗ 2 Negative values make interpretation unnatural or difficult. 3 / 39

Outline Non-Negative Matrix Factorization 1 Algorithms 2 Probabilistic Latent Semantic Analysis 3 Summary 4 4 / 39

Non-Negative Matrix Factorization (NMF) Definition (Non-negative matrix factorization, basic form) Given a non-negative matrix D ∈ R m × n , a non-negative matrix + factorization of rank k is D ≈ LR , where L ∈ R m × r and R ∈ R r × n are both non-negative. + + Additive decomposition: factors and components non-negative → No cancellation effects Rows of R can be thought as “parts” Row of D obtained by mixing (or “assembling”) parts in L Smallest r such that D = LR exists is called non-negative rank of D rank( D ) ≤ rank + ( D ) ≤ min { m , n } 5 / 39

Example (NMF) Consider the following “bridge” matrix and its rank-2 NMF: 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 0 0 1 0 1 0 0 1 0 1 0 1 0 0 1 = D L R Here are the corresponding components: 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 D = L ∗ 1 R 1 ∗ + L ∗ 2 R 2 ∗ Non-negative matrix decomposition encourage a more natural, part-based representation and (sometimes) sparsity. 6 / 39

Decomposing faces (PCA) D i ∗ (original) [ LR ] i ∗ = L i ∗ R [ U Σ V T ] i ∗ V T = U i ∗ Σ PCA factors are hard to interpret. 7 / 39 Lee and Seung, 1999.

Decomposing faces (NMF) D i ∗ (original) [ LR ] i ∗ = L i ∗ R NMF factors correspond to parts of faces. 8 / 39 Lee and Seung, 1999.

Decomposing digits (NMF) D R NMF factors correspond to parts of digits and “background”. 9 / 39 Cichocki et al., 2009.

Some applications Text mining (more later) Bioinformatics Microarray analysis Mineral exploration Neuroscience Image understanding Air pollution research Chemometrics Spectral data analysis Linear sparse coding Image classification Clustering Neural learning process Sound recognition Remote sensing Object characterization . . . 10 / 39 Cichocki et al., 2009.

Gaussian NMF Gaussian NMF is the most basic form of non-negative factorizations: � D − LR � 2 minimize F L ∈ R m × r s . t . + R ∈ R r × n + Truncated SVD minimizes the same objective (but without non-negativity constraints) Many other variants exist ◮ Different objective functions (e.g., KL-divergence) ◮ Additional regularizations (e.g., L 1 -regularization) ◮ Different constraints (e.g., orthogonality of R ) ◮ Different compositions (e.g., 3 matrices) ◮ multi-layer NMF, semi-NMF, sparse NMF, tri-NMF, symmetric NMF, orthogonal NMF, non-smooth NMF (nsNMF), overlapping NMF, convolutive NMF (CNMF), k-Means, . . . 11 / 39

k-Means can be seen as a variant of NMF Additional constraint: L contains exactly one 1 in each row, rest 0 D i ∗ (original) [ LR ] i ∗ = L i ∗ R k-Means factors correspond to prototypical faces. 12 / 39 Lee and Seung, 1999.

NMF is not unique Factors are not “ordered” 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 0 1 0 = = 0 0 0 0 1 0 1 0 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 0 0 1 1 0 One way of ordering: decreasing Frobenius norm of components (i.e., order by � L ∗ k R k ∗ � F ) Factors/components are not unique 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 = + 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0.5 1 0.5 1 0 0.5 0 0.5 0 = + 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 = + 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 Additional constraints or regularization can encourage uniqueness. 13 / 39

NMF is not hierarchical Rank-1 NMF 1 1 1 1 1 0.6 1.3 0.6 1.3 0.6 0.8 ≈ = 0.7 1.5 0.7 1.5 0.7 0 1 0 1 0 0.3 0.8 0.3 0.8 0.3 0.5 0 1 0 1 0 0.3 0.8 0.3 0.8 0.3 0.5 Rank-2 NMF 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 = + 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 1 1 1 1 0 = 0 1 0 1 0 0 1 0 1 Best rank- k approximation may differ significantly from best rank-( k − 1) approximation Rank influences sparsity, interpretability, and statistical fidelity Optimum choice of rank is not well-studied (often requires experimentation) 14 / 39

Outline Non-Negative Matrix Factorization 1 Algorithms 2 Probabilistic Latent Semantic Analysis 3 Summary 4 15 / 39

NMF is difficult We focus on minimizing L ( L , R ) = � D − LR � 2 F . For varying m , n , and r , problem is NP-hard When rank( D ) = 1 (or r = 1), can be solved in polynomial time Take first non-zero column of D as L m × 1 1 Determine R 1 × n entry by entry (using the fact that D ∗ j = LR 1 j ) 2 Problem is not convex ◮ Local optimum may not correspond to global optimum ◮ Generally little hope to find global optimum But: Problem is biconvex ◮ For fixed R , f ( L ) = � D − LR � 2 F is convex f ( L ) = � i � D i ∗ − L i ∗ R � 2 (chain rule) F ∇ L ik f ( L ) = − 2( D i ∗ − L i ∗ R ) R T (product rule) k ∗ ∇ 2 L ik f ( L ) = 2 R k ∗ R T k ∗ ≥ 0 (does not depend on L ) ◮ For fixed L , f ( R ) = � D − LR � 2 F is convex ◮ Allows for efficient algorithms 16 / 39

General framework Gradient descent generally slow Stochastic gradient descent inappropriate Key approach: alternating minimization 1: Pick starting point L 0 and R 0 2: while not converged do Keep R fixed, optimize L 3: Keep L fixed, optimize R 4: 5: end while Update steps 3 and 4 easier than full problem Also called alternating projections or (block) coordinate descent Starting point ◮ Random ◮ Multi-start initialization: try multiple random starting points, run a few epochs, continue with best ◮ Based on SVD ◮ . . . 17 / 39

Example Ignore non-negativity for now. Consider the regularized least-square error: L ( L , R ) = � D − LR � 2 F + λ ( � L � 2 F + � R � 2 F ) By setting m = n = r = 1, D = (1) and λ = 0 . 05, we obtain L ( l , r ) = (1 − lr ) 2 + 0 . 05( l 2 + r 2 ) ∇ l f ( l ) = − 2 r (1 − lr ) + 0 . 1 l 50 ∇ r f ( r ) = − 2 l (1 − lr ) + 0 . 1 r 40 Local optima: L �� 30 � � � ( l , r 2 19 19 19 19 ) 20 , , − 20 , − 1 20 20 20 0 r − 1 Stationary point: (0,0) 10 − 2 − 2 − 1 0 1 2 l 18 / 39

Example (ALS) f ( l , r ) = (1 − lr ) 2 + 0 . 05( l 2 + r 2 ) 2 r l ← min l f ( l ) = 2 r 2 +0 . 1 2 l 2.5 r ← min r f ( r ) = 2 l 2 +0 . 1 Step l r 2.0 ● ● 0 2 2 ● ● 1 0 . 49 2 1.5 ●● 2 0 . 49 1 . 68 ●● ● ● 3 0 . 58 1 . 68 ● ● ● ● r ● ● ● ● ● ● 1.0 ● ● ● ● 4 0 . 58 1 . 49 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● . . . . . . . . . 0.5 100 0 . 97 0 . 97 Converges to 0.0 local minimum 0.0 0.5 1.0 1.5 2.0 2.5 19 / 39

Example (ALS) f ( l , r ) = (1 − lr ) 2 + 0 . 05( l 2 + r 2 ) 2 r l ← min l f ( l ) = 2 r 2 +0 . 1 2 l 2.5 r ← min r f ( r ) = 2 l 2 +0 . 1 Step l r 2.0 0 2 0 0 0 0 1.5 0 0 0 . . . . . . . . . r 1.0 Converges to stationary point 0.5 0.0 ● ● 0.0 0.5 1.0 1.5 2.0 2.5 20 / 39

Alternating non-negative least squares (ANLS) Uses non-negative least squares approximation of L and R : � D − LR � 2 � D − LR � 2 argmin and argmin F F L ∈ R m × r R ∈ R r × n + + Equivalently: find non-negative least squares solution to LR = D Common approach: Solve unconstrained least squares problems and “remove” negative values. E.g., when columns (rows) of L ( R ) are linearly independent, set L = [ DR † ] ǫ R = [ L † D ] ǫ and where ◮ R † = R T ( RR T ) − 1 is the right pseudo-inverse of R ◮ L † = ( L T L ) − 1 L T is the left pseudo-inverse of L ◮ [ a ] ǫ = max { ǫ, a } for ǫ = 0 or some small constant (e.g., ǫ = 10 − 9 ) Difficult to analyze due to non-linear update steps Often slow convergence to a “bad” local minimum (better when regularized) 21 / 39

Data Mining and Matrices 06 Non-Negative Matrix Factorization - PowerPoint PPT Presentation

Data Mining and Matrices 06 Non-Negative Matrix Factorization Rainer Gemulla, Pauli Miettinen May 23, 2013 Non-Negative Datasets Some datasets are intrinsically non-negative: Counters (e.g., no. occurrences of each word in a text document)

Results for different matrices and comparisons Dense Matrices Rectangular Matrices

MATHEMATICS 1 CONTENTS Matrices Special matrices Operations with matrices Matrix

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Data Mining and Matrices 01 Introduction Rainer Gemulla, Pauli Miettinen April 18, 2013

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

JUST THE MATHS SLIDES NUMBER 9.10 MATRICES 10 (Symmetric matrices & quadratic forms)

Introduction What is data mining? to Data mining functionalities Data Mining Major

Data mining Machine Intelligence Thomas D. Nielsen September 2008 Data mining September 2008

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

JUST THE MATHS SLIDES NUMBER 9.9 MATRICES 9 (Modal & spectral matrices) by

Transformations and Matrices Transformations I Transformations are functions Matrices

Matrices with Application to Page Rank Markov Matrices Pagerank Anil Maheshwari

Structural Matrices in MDOF Systems Evaluation of Structural Matrices Choice of Property

Structural Matrices in MDOF Systems Structural Matrices Evaluation of Structural Giacomo Boffi

Threat Intelligence Jeremy Batterman Global Leader Threat Intelligence GREM, EnCE, GCFA, MBA 3

ECE444: Software Engineering Requirements 4:Risk, Prototypes Shurui Zhou Learning Goals (Last

The ethics of electronic privacy Diana Acosta-Navas PhD Candidate, Harvard Philosophy Department

I nnovative Technology Experiences for Students and Teachers (I TEST) Program Division of

The Teaching Model Sasikumar M Sasikumar M Overview Concerned about how to teach

Why should the THRIVE Framework for system change and the National i-THRIVE Programme be

Cyber Security for Smart Grid Devices Annarita Giani Electrical Engineering & Computer

CSCI 2133 Rapid Programming Techniques for Innovation Week 1 About Course Projects CSCI

Data Mining and Matrices 06 Non-Negative Matrix Factorization - PowerPoint PPT Presentation

Data Mining and Matrices 06 Non-Negative Matrix Factorization Rainer Gemulla, Pauli Miettinen May 23, 2013 Non-Negative Datasets Some datasets are intrinsically non-negative: Counters (e.g., no. occurrences of each word in a text document)

Results for different matrices and comparisons Dense Matrices Rectangular Matrices

MATHEMATICS 1 CONTENTS Matrices Special matrices Operations with matrices Matrix

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Data Mining and Matrices 01 Introduction Rainer Gemulla, Pauli Miettinen April 18, 2013

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

JUST THE MATHS SLIDES NUMBER 9.10 MATRICES 10 (Symmetric matrices &amp; quadratic forms)

Introduction What is data mining? to Data mining functionalities Data Mining Major

Data mining Machine Intelligence Thomas D. Nielsen September 2008 Data mining September 2008

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

JUST THE MATHS SLIDES NUMBER 9.9 MATRICES 9 (Modal &amp; spectral matrices) by

Transformations and Matrices Transformations I Transformations are functions Matrices

Matrices with Application to Page Rank Markov Matrices Pagerank Anil Maheshwari

Structural Matrices in MDOF Systems Evaluation of Structural Matrices Choice of Property

Structural Matrices in MDOF Systems Structural Matrices Evaluation of Structural Giacomo Boffi

Threat Intelligence Jeremy Batterman Global Leader Threat Intelligence GREM, EnCE, GCFA, MBA 3

ECE444: Software Engineering Requirements 4:Risk, Prototypes Shurui Zhou Learning Goals (Last

The ethics of electronic privacy Diana Acosta-Navas PhD Candidate, Harvard Philosophy Department

I nnovative Technology Experiences for Students and Teachers (I TEST) Program Division of

The Teaching Model Sasikumar M Sasikumar M Overview Concerned about how to teach

Why should the THRIVE Framework for system change and the National i-THRIVE Programme be

Cyber Security for Smart Grid Devices Annarita Giani Electrical Engineering &amp; Computer

CSCI 2133 Rapid Programming Techniques for Innovation Week 1 About Course Projects CSCI

JUST THE MATHS SLIDES NUMBER 9.10 MATRICES 10 (Symmetric matrices & quadratic forms)

JUST THE MATHS SLIDES NUMBER 9.9 MATRICES 9 (Modal & spectral matrices) by

Cyber Security for Smart Grid Devices Annarita Giani Electrical Engineering & Computer