New matrix norms for structured matrix estimation
Jean-Philippe Vert
Optimization and Statistical Learning workshop, Les Houches, France, Jan 11-16, 2015
Outline
1. Atomic norms
2. Sparse matrices with disjoint column supports
3. Low-rank matrices with sparse factors
Atomic Norm (Chandrasekaran et al., 2012)
Definition: Given a set of atoms A, the associated atomic norm is
‖x‖_A = inf{ t > 0 : x ∈ t conv(A) }.
NB: this is really a norm if A is centrally symmetric and spans ℝ^p.
Primal and dual form of the norm:
‖x‖_A = inf{ Σ_{a∈A} c_a : x = Σ_{a∈A} c_a a, c_a ≥ 0 ∀a ∈ A }
‖x‖*_A = sup_{a∈A} ⟨a, x⟩
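To make the primal/dual pair concrete: for a finite atom set the dual norm is just a maximum of inner products, and for A = {±e_k} it recovers the ℓ1/ℓ∞ pair. A minimal numpy sketch (illustrative, not from the slides):

```python
import numpy as np

def dual_atomic_norm(x, atoms):
    """||x||*_A = sup_{a in A} <a, x> for a finite atom set (rows of `atoms`)."""
    return np.max(atoms @ x)

# atoms {+/- e_k} generate the l1-norm; its dual is then the l-infinity norm
p = 4
atoms = np.vstack([np.eye(p), -np.eye(p)])
x = np.array([1.0, -3.0, 2.0, 0.5])
print(dual_atomic_norm(x, atoms), np.max(np.abs(x)))  # both 3.0
```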
Examples
- Vector ℓ1-norm: for x ∈ ℝ^p, ‖x‖_1 is the atomic norm with atoms A = { ±e_k : 1 ≤ k ≤ p }.
- Matrix trace norm: for Z ∈ ℝ^{m1×m2}, ‖Z‖_* (the sum of singular values) is the atomic norm with atoms A = { ab^⊤ : a ∈ ℝ^{m1}, b ∈ ℝ^{m2}, ‖a‖_2 = ‖b‖_2 = 1 }.
Group lasso (Yuan and Lin, 2006)
For x ∈ ℝ^p and G = {g_1, ..., g_G} a partition of [1, p], the norm
‖x‖_{1,2} = Σ_{g∈G} ‖x_g‖_2
is the atomic norm associated to the set of atoms
A_G = ∪_{g∈G} { u ∈ ℝ^p : supp(u) = g, ‖u‖_2 = 1 }.
Example: for G = {{1,2}, {3}},
‖x‖_{1,2} = ‖(x_1, x_2)^⊤‖_2 + ‖x_3‖_2 = √(x_1² + x_2²) + √(x_3²).
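For a fixed partition the group lasso norm is straightforward to evaluate; a minimal numpy sketch (the partition is passed in as an assumed input; not from the slides):

```python
import numpy as np

def group_lasso_norm(x, groups):
    """l1/l2 group lasso norm: sum of l2 norms of x over each group.

    groups: list of index lists forming a partition of range(len(x)).
    """
    return sum(np.linalg.norm(x[g]) for g in groups)

x = np.array([3.0, 4.0, -2.0])
print(group_lasso_norm(x, [[0, 1], [2]]))  # ||(3,4)||_2 + ||-2||_2 = 5 + 2 = 7
```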
Group lasso with overlaps
How to generalize the group lasso when the groups overlap? Set features to zero by groups (Jenatton et al., 2011) x 1,2 =
- g∈G
xg 2 Select support as a union of groups (Jacob et al., 2009) x AG, see also MKL (Bach et al., 2004) G = {{1, 2} , {2, 3}}
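For the latent (union-of-supports) variant, ‖x‖_{A_G} decomposes x into one latent vector per group. A hedged cvxpy sketch of that decomposition (the variable names and the two-group example are illustrative, not from the slides):

```python
import cvxpy as cp
import numpy as np

# Latent group lasso ||x||_{A_G}: write x as a sum of latent vectors, each
# supported on one group, and minimize the sum of their l2 norms.
x = np.array([1.0, 2.0, 3.0])
groups = [[0, 1], [1, 2]]  # overlapping groups {1,2} and {2,3} (0-indexed)

v = [cp.Variable(3) for _ in groups]
constraints = [sum(v) == x]
for vg, g in zip(v, groups):
    off = [i for i in range(3) if i not in g]
    constraints.append(vg[off] == 0)  # each latent vector lives on its group
prob = cp.Problem(cp.Minimize(sum(cp.norm(vg, 2) for vg in v)), constraints)
prob.solve()
print(prob.value)  # value of ||x||_{A_G}
```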
Joint work with...
Kevin Vervier, Pierre Mahé, Jean-Baptiste Veyrieras (bioMérieux); Alexandre d'Aspremont (CNRS/ENS)
Columns with disjoint supports
[Figure: a matrix X whose columns have pairwise disjoint supports]
Motivation: multiclass or multitask classification problems where we want to select features specific to each class or task. Examples: recognizing the identity and the emotion of a person from an image (Romera-Paredes et al., 2012), or hierarchical coarse-to-fine classifiers (Xiao et al., 2011; Hwang et al., 2011).
From disjoint supports to orthogonal columns
Two vectors v_1 and v_2 have disjoint supports iff |v_1| and |v_2| are orthogonal. If Ω_ortho(X) is a norm to estimate matrices with orthogonal columns, then
Ω_disjoint(X) = Ω_ortho(|X|) = min_{−W ≤ X ≤ W} Ω_ortho(W)
is a norm to estimate matrices with disjoint column supports. How can we estimate matrices with orthogonal columns? NOTE: this is more general than orthogonal matrices.
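A one-line sanity check of the equivalence (illustrative, not from the slides): two vectors have disjoint supports exactly when their absolute values have zero inner product.

```python
import numpy as np

def disjoint_supports(v1, v2):
    # |v1|^T |v2| = 0 iff no coordinate is nonzero in both vectors
    return np.abs(v1) @ np.abs(v2) == 0

print(disjoint_supports(np.array([1., 0., -2.]), np.array([0., 3., 0.])))  # True
print(disjoint_supports(np.array([1., 0., -2.]), np.array([0., 3., 5.])))  # False
```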
Penalty for orthogonal columns
For X = [x_1, ..., x_p] ∈ ℝ^{n×p} we want x_i^⊤ x_j = 0 for i ≠ j. A natural "relaxation":
Ω(X) = Σ_{i≠j} |x_i^⊤ x_j|
But this is not convex.
Convex penalty for orthogonal columns
Ω_K(X) = Σ_{i=1}^p K_ii ‖x_i‖_2² + Σ_{i≠j} K_ij |x_i^⊤ x_j|
Theorem (Xiao et al., 2011)
If K̄ is positive semidefinite, then Ω_K is convex, where K̄_ij = |K_ii| if i = j and −|K_ij| otherwise.
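A hedged numpy sketch for evaluating Ω_K and testing the theorem's sufficient condition (helper names are mine): since the Gram matrix G = X^⊤X has non-negative diagonal, Ω_K(X) is exactly Σ_ij K_ij |G_ij|.

```python
import numpy as np

def omega_K(X, K):
    """Omega_K(X) = sum_i K_ii ||x_i||_2^2 + sum_{i != j} K_ij |x_i^T x_j|.

    Since G = X^T X satisfies G_ii = ||x_i||^2 >= 0, this equals sum_ij K_ij |G_ij|.
    """
    return np.sum(K * np.abs(X.T @ X))

def omega_K_is_convex(K):
    """Sufficient condition of Xiao et al. (2011): Kbar is PSD, where
    Kbar_ii = |K_ii| and Kbar_ij = -|K_ij| for i != j."""
    Kbar = -np.abs(K)
    np.fill_diagonal(Kbar, np.abs(np.diag(K)))
    return bool(np.all(np.linalg.eigvalsh(Kbar) >= -1e-10))

X = np.random.randn(5, 2)
K = np.ones((2, 2))  # the choice used for p = 2 below
print(omega_K(X, K), omega_K_is_convex(K))  # Kbar = [[1,-1],[-1,1]] is PSD -> True
```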
Can we be tighter?
Ω_K(X) = Σ_{i=1}^p ‖x_i‖_2² + Σ_{i≠j} K_ij |x_i^⊤ x_j|
Let O be the set of matrices of unit Frobenius norm with orthogonal columns:
O = { X ∈ ℝ^{n×p} : X^⊤X is diagonal and Trace(X^⊤X) = 1 }.
Note that ∀X ∈ O, Ω_K(X) = 1. The atomic norm ‖X‖_O associated to O is the tightest convex penalty to recover the atoms in O!
Optimality of Ω_K for p = 2
Theorem (Vervier, Mahé, d'Aspremont, Veyrieras and V., 2014)
For any X ∈ ℝ^{n×2}, ‖X‖_O² = Ω_K(X) with K = (1 1; 1 1), the all-ones matrix.
Case p > 2
Does Ω_K(X) = ‖X‖_O² still hold for p > 2? But sparse combinations of matrices in O may not be interesting anyway...
Theorem (Vervier et al., 2014)
For any p ≥ 2, let K be a symmetric p×p matrix with non-negative entries such that K_ii = Σ_{j≠i} K_ij for all i = 1, ..., p. Then
Ω_K(X) = Σ_{i<j} K_ij ‖(x_i, x_j)‖_O².
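By the p = 2 theorem with the all-ones K, each pairwise term has the closed form ‖(x_i, x_j)‖_O² = ‖x_i‖_2² + ‖x_j‖_2² + 2|x_i^⊤x_j|, so Ω_K can be evaluated explicitly. A small numerical check of the decomposition (illustrative, not from the slides):

```python
import numpy as np

def norm_O_sq_pair(xi, xj):
    # ||(x_i, x_j)||_O^2 = ||x_i||^2 + ||x_j||^2 + 2 |x_i^T x_j| (p = 2 theorem)
    return xi @ xi + xj @ xj + 2 * abs(xi @ xj)

rng = np.random.default_rng(0)
n, p = 5, 4
X = rng.standard_normal((n, p))
K = np.abs(rng.standard_normal((p, p)))
K = (K + K.T) / 2
np.fill_diagonal(K, 0)
np.fill_diagonal(K, K.sum(axis=1))        # enforce K_ii = sum_{j != i} K_ij

direct = np.sum(K * np.abs(X.T @ X))      # Omega_K(X) computed directly
pairwise = sum(K[i, j] * norm_O_sq_pair(X[:, i], X[:, j])
               for i in range(p) for j in range(i + 1, p))
print(np.isclose(direct, pairwise))       # True
```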
Simulations
Regression Y = XW + ε, where W has disjoint column supports; n = p = 10.
[Figure: MSE vs. training set size (10 to 50) for Ridge Regression, LASSO, Xiao and Disjoint Supports, and support disjointness vs. training set size for LASSO and Disjoint Supports.]
Example: multiclass classification of MS spectra
[Figure: features selected from spectra for the classes BAC, LIS, CLO, STR, CIT, ENT, ESH-SHG, YER, HAE, and a multiclass model.]
Joint work with...
Emile Richard (Stanford); Guillaume Obozinski (École des Ponts ParisTech)
Low-rank matrices with sparse factors
[Figure: a matrix X decomposed as a sum of sparse rank-one factors]
X = Σ_{i=1}^r u_i v_i^⊤
The factors are not orthogonal a priori: this is different from assuming that the SVD of X is sparse.
Dictionary Learning / Sparse PCA
min_{A ∈ ℝ^{k×n}, D ∈ ℝ^{p×k}} Σ_{i=1}^n ‖x_i − Dα_i‖_2² + λ Σ_{i=1}^n ‖α_i‖_1 s.t. ∀j, ‖d_j‖_2 ≤ 1
- Dictionary learning: sparse decomposition α [Figure: X^⊤ = Dα], e.g. overcomplete dictionaries for natural images (Elad and Aharon, 2006).
- Sparse PCA: sparse dictionary D [Figure: X^⊤ = αD], e.g. microarray data (Witten et al., 2009; Bach et al., 2008).
Sparsity of the loadings vs. sparsity of the dictionary elements.
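The objective above suggests a simple alternating scheme; a hedged numpy sketch (not the authors' algorithm): ISTA steps on the codes A with D fixed, then a least-squares dictionary update with columns projected back to the unit ball.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def dict_learning(X, k, lam=0.1, n_iter=50, seed=0):
    """Minimize sum_i ||x_i - D a_i||^2 + lam ||a_i||_1 s.t. ||d_j|| <= 1,
    by alternating minimization. X has shape (p, n); returns D (p, k), A (k, n)."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((k, n))
    for _ in range(n_iter):
        # sparse coding: a few ISTA steps on A, step size 1/L with L = ||D^T D||_2
        L = np.linalg.norm(D.T @ D, 2)
        for _ in range(10):
            A = soft_threshold(A - (D.T @ (D @ A - X)) / L, lam / (2 * L))
        # dictionary update: least squares, then project columns onto the unit ball
        D = X @ A.T @ np.linalg.pinv(A @ A.T + 1e-8 * np.eye(k))
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D, A
```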
Applications
Low rank factorization with “community structure" Modeling clusters or community structure in social networks or recommendation systems (Richard et al., 2012). Subspace clustering (Wang et al., 2013) Up to an unknown permutation, X ⊤ =
- X ⊤
1
. . . X ⊤
K
- with Xk low rank, so that there exists a low rank matrix Zk such
that Xk = ZkXk. Finally, X = ZX with Z = BkDiag(Z1, . . . , ZK). Sparse PCA from ˆ Σn Sparse bilinear regression y = x⊤Mx′ + ε
Existing approaches
- Bi-convex formulations: min_{U,V} L(UV^⊤) + λ(‖U‖_1 + ‖V‖_1), with U ∈ ℝ^{n×r}, V ∈ ℝ^{p×r}.
- Convex formulation for sparse and low rank: min_Z L(Z) + λ‖Z‖_1 + µ‖Z‖_* (Doan and Vavasis, 2013; Richard et al., 2012); the factors are not necessarily sparse as r increases.
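A minimal cvxpy sketch of the convex ℓ1 + trace formulation, for a denoising loss L(Z) = ‖Z − Y‖_F² (the loss and parameter values are illustrative assumptions):

```python
import cvxpy as cp
import numpy as np

Y = np.random.randn(20, 20)      # noisy observation (stand-in data)
Z = cp.Variable((20, 20))
lam, mu = 0.1, 1.0
loss = cp.sum_squares(Z - Y)     # L(Z) = ||Z - Y||_F^2
prob = cp.Problem(cp.Minimize(loss + lam * cp.sum(cp.abs(Z)) + mu * cp.normNuc(Z)))
prob.solve()
print(np.linalg.matrix_rank(Z.value, tol=1e-6))  # solution is low rank and sparse
```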
A new formulation for sparse matrix factorization
Assumptions: X = Σ_{i=1}^r a_i b_i^⊤, where all left factors a_i have supports of size k and all right factors b_i have supports of size q.
Goals:
- Propose a convex formulation for sparse matrix factorization that can handle multiple sparse factors, makes it possible to identify the sparse factors themselves, and leads to better statistical performance than the ℓ1 and trace norms.
- Propose algorithms based on this formulation.
The (k, q)-rank of a matrix
Sparse unit vectors: A^n_j = { a ∈ ℝ^n : ‖a‖_0 ≤ j, ‖a‖_2 = 1 }
The (k, q)-rank of an m1 × m2 matrix Z:
r_{k,q}(Z) = min{ r : Z = Σ_{i=1}^r c_i a_i b_i^⊤, (a_i, b_i, c_i) ∈ A^{m1}_k × A^{m2}_q × ℝ_+ }
           = min{ ‖c‖_0 : Z = Σ_{i=1}^∞ c_i a_i b_i^⊤, (a_i, b_i, c_i) ∈ A^{m1}_k × A^{m2}_q × ℝ_+ }
[Figure: a matrix Z made of three k × q blocks, for which r_{k,q}(Z) = 3]
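For intuition, a matrix with small (k, q)-rank is easy to build by summing a few rank-one terms with k-sparse left and q-sparse right unit factors; a toy construction (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
m1, m2, k, q = 12, 10, 3, 2

def sparse_unit(m, s):
    """Draw a random element of A^m_s: s-sparse with unit l2 norm."""
    a = np.zeros(m)
    idx = rng.choice(m, size=s, replace=False)
    a[idx] = rng.standard_normal(s)
    return a / np.linalg.norm(a)

# sum of 3 atoms c_i a_i b_i^T with a_i in A^{m1}_k and b_i in A^{m2}_q,
# so r_{k,q}(Z) <= 3 by construction (and rank(Z) <= 3)
Z = sum(rng.uniform(1, 2) * np.outer(sparse_unit(m1, k), sparse_unit(m2, q))
        for _ in range(3))
print(np.linalg.matrix_rank(Z))  # 3 (almost surely)
```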
The (k, q)-trace norm (Richard et al., 2014)
For a matrix Z ∈ ℝ^{m1×m2}:

                         (1,1)-rank   (k,q)-rank   (m1,m2)-rank
  combinatorial penalty  ‖Z‖_0        r_{k,q}(Z)   rank(Z)
  convex relaxation      ‖Z‖_1        Ω_{k,q}(Z)   ‖Z‖_*

The (k, q)-trace norm Ω_{k,q}(Z) is the atomic norm associated with
A_{k,q} := { ab^⊤ : a ∈ A^{m1}_k, b ∈ A^{m2}_q },
namely
Ω_{k,q}(Z) = inf{ ‖c‖_1 : Z = Σ_{i=1}^∞ c_i a_i b_i^⊤, (a_i, b_i, c_i) ∈ A^{m1}_k × A^{m2}_q × ℝ_+ }.
Some properties of the (k, q)-trace norm
Nesting property: Ω_{m1,m2}(Z) = ‖Z‖_* ≤ Ω_{k,q}(Z) ≤ ‖Z‖_1 = Ω_{1,1}(Z)
Dual norm and reformulation: let ‖·‖_op denote the operator norm and let
G_{k,q} = { (I, J) ⊂ [1, m1] × [1, m2] : |I| = k, |J| = q }.
Given that ‖x‖*_A = sup_{a∈A} ⟨a, x⟩, we have
Ω*_{k,q}(Z) = max_{(I,J)∈G_{k,q}} ‖Z_{I,J}‖_op
and
Ω_{k,q}(Z) = inf{ Σ_{(I,J)∈G_{k,q}} ‖A^{(IJ)}‖_* : Z = Σ_{(I,J)∈G_{k,q}} A^{(IJ)}, supp(A^{(IJ)}) ⊂ I×J }.
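The dual norm is computable by brute force for small sizes: enumerate all k × q blocks and take the largest spectral norm of the corresponding submatrix. A sketch (exponential in k and q, for illustration only):

```python
import numpy as np
from itertools import combinations

def dual_kq_trace_norm(Z, k, q):
    """Omega*_{k,q}(Z) = max over |I| = k, |J| = q of the spectral norm of Z[I, J]."""
    m1, m2 = Z.shape
    best = 0.0
    for I in combinations(range(m1), k):
        for J in combinations(range(m2), q):
            s = np.linalg.norm(Z[np.ix_(I, J)], 2)  # largest singular value
            best = max(best, s)
    return best

Z = np.random.randn(8, 8)
print(dual_kq_trace_norm(Z, 2, 2))
```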
Vector case
When q = m2 = 1, Ω_{k,1}(x) is the k-support norm of Argyriou et al. (2012), i.e., the overlapping group lasso with all groups of size k.
Statistical dimension (Amelunxen et al., 2013)
[Figure: denoising of Y = Z⋆ + ε, with the tangent cone Z⋆ + T_Ω(Z⋆) of the unit ball {Ω(·) ≤ 1}; figure inspired by Amelunxen et al. (2013)]
S(Z, Ω) := E[ ‖Π_{T_Ω(Z)}(G)‖_Fro² ],
where T_Ω(Z) is the tangent cone of the norm ball at Z, Π_{T_Ω(Z)} is the projection onto it, and G is a standard Gaussian matrix.
Nullspace property and S (Chandrasekaran et al., 2012)
[Figure: the affine space x_0 + null(A), the sublevel set {x : f(x) ≤ f(x_0)}, and the descent cone x_0 + D(f, x_0); figure from Amelunxen et al. (2013)]
Exact recovery from random measurements
With X : ℝ^p → ℝ^n a random linear map from the standard Gaussian ensemble,
Ẑ = argmin_Z Ω(Z) s.t. X(Z) = y
is equal to Z⋆ with high probability as soon as n ≥ S(Z⋆, Ω).
Statistical dimension of the (k, q)-trace norm
Theorem (Richard et al., 2014)
Let A = ab^⊤ ∈ A_{k,q} with I_0 = supp(a) and J_0 = supp(b), and let
γ(a, b) := (k min_{i∈I_0} a_i²) ∧ (q min_{j∈J_0} b_j²).
Then
S(A, Ω_{k,q}) ≤ (322/γ²)(k + q + 1) + (160/γ)(k ∨ q) log(m1 ∨ m2).
Case m1 = m2 = m and k = q:
S(A, Ω_{k,q}) ≤ (322/γ²)(2k + 1) + (160/γ) k log m.
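A small helper to evaluate γ and the resulting bound for a given atom; this just plugs numbers into the theorem as reconstructed above, so treat it as a hedged sketch:

```python
import numpy as np

def stat_dim_bound(a, b, m1, m2):
    """Upper bound of Richard et al. (2014) on S(ab^T, Omega_{k,q})."""
    I0, J0 = np.flatnonzero(a), np.flatnonzero(b)
    k, q = len(I0), len(J0)
    gamma = min(k * np.min(a[I0] ** 2), q * np.min(b[J0] ** 2))
    return 322 / gamma**2 * (k + q + 1) + 160 / gamma * max(k, q) * np.log(max(m1, m2))

# flat atom (a_i^2 = 1/k on the support): gamma = 1, the most favorable case
m = 1000; k = q = 5
a = np.zeros(m); a[:k] = 1 / np.sqrt(k)
b = np.zeros(m); b[:q] = 1 / np.sqrt(q)
print(stat_dim_bound(a, b, m, m))
```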
Summary of results for statistical dimension

  Matrix norm    S                           Vector analogue   S
  ℓ1             Θ(kq log(m1 m2 / (kq)))     ℓ1                Θ(k log(p/k))
  trace norm     Θ(m1 + m2)                  ℓ2                p
  ℓ1 + trace     Ω(kq ∧ (m1 + m2))           elastic net       Θ(k log(p/k))
  (k, q)-trace   O((k ∨ q) log(m1 ∨ m2))     k-support         Θ(k log(p/k))

The lower bound for ℓ1 + trace is based on a result of Oymak et al. (2012). Here f = Θ(g) means f = O(g) and g = O(f), and f = Ω(g) means g = O(f).
Working set algorithm
min_Z L(Z) + λ Ω_{k,q}(Z)
Given a working set S of blocks (I, J), solve the restricted problem
min_{Z, (A^{(IJ)})_{(I,J)∈S}} L(Z) + λ Σ_{(I,J)∈S} ‖A^{(IJ)}‖_*
s.t. Z = Σ_{(I,J)∈S} A^{(IJ)}, supp(A^{(IJ)}) ⊂ I×J.
Proposition
The global problem is solved by a solution Z_S of the restricted problem if and only if
∀(I, J) ∈ G_{k,q}: ‖(∇L(Z_S))_{I,J}‖_op ≤ λ.   (⋆)
Working set algorithm
Iterate:
1. Solve the restricted problem by block coordinate descent (Tseng and Yun, 2009).
2. Look for an (I, J) that violates (⋆). If none exists, terminate; else add the found (I, J) to S.
Problem: step 2 requires solving a rank-1 sparse PCA problem, which is NP-hard.
Idea: leverage the work on algorithms that attempt to solve rank-1 sparse PCA, such as convex relaxations or the truncated power iteration method, to heuristically find blocks potentially violating the constraint; see the sketch below.
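A simplified sketch of such a heuristic (illustrative, not the authors' exact procedure): alternate power steps on ∇L(Z_S) with hard truncation of u to its k largest entries and v to its q largest, then test the resulting block against (⋆).

```python
import numpy as np

def truncate(v, s):
    """Keep the s largest-magnitude entries of v, zero the rest, renormalize."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-s:]
    out[idx] = v[idx]
    return out / np.linalg.norm(out)

def find_violating_block(G, k, q, lam, n_iter=100):
    """Heuristically search for (I, J) with ||G[I, J]||_op > lam, G = grad L(Z_S)."""
    m1, m2 = G.shape
    u = truncate(np.random.default_rng(0).standard_normal(m1), k)
    for _ in range(n_iter):        # truncated power iterations on G and G^T
        v = truncate(G.T @ u, q)
        u = truncate(G @ v, k)
    I, J = np.flatnonzero(u), np.flatnonzero(v)
    if np.linalg.norm(G[np.ix_(I, J)], 2) > lam:
        return I, J                # candidate block violating (*)
    return None                    # heuristic: may miss a violating block
```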
Denoising results
Z ∈ ℝ^{1000×1000} with Z = Σ_{i=1}^r a_i b_i^⊤ + σG and a_i b_i^⊤ ∈ A_{k,q}, k = q.
For small σ², MSE ∝ S(ab^⊤, Ω_{k,q}) σ².
[Figure: NMSE vs. k at (k,k)-rank 1 for ℓ1, trace and Ω_{k,q}, and NMSE vs. (k,q)-rank.]
Denoising results
Z ∈ ℝ^{300×300} and σ² small, so MSE ∝ S(ab^⊤, Ω_{k,q}) σ²; r = 3 atoms, with or without overlap.
[Figure: NMSE vs. k for ℓ1, trace and Ω_{k,q}, with no overlap and with 90% overlap between atoms.]
Empirical results for sparse PCA

  Method              Relative error
  Sample covariance   4.20 ± 0.02
  Trace               0.98 ± 0.01
  ℓ1                  2.07 ± 0.01
  Trace + ℓ1          0.96 ± 0.01
  Sequential          0.93 ± 0.08
  Ω_{k,q}             0.59 ± 0.03

Table 3: Relative error of covariance estimation with different methods.
Conclusion
- Atomic norms for structured sparsity.
- Gain in statistical performance at the expense of algorithmic complexity (the penalties are convex but NP-hard to compute).
- The structure of the convex problem may be exploited to devise new efficient heuristics or relaxations.
References
Amelunxen, D., Lotz, M., McCoy, M. B., and Tropp, J. A. (2013). Living on the edge: phase transitions in convex programs with random data. Technical Report 1303.6672, arXiv.
Argyriou, A., Foygel, R., and Srebro, N. (2012). Sparse prediction with the k-support norm. In Adv. Neural. Inform. Process. Syst., volume 25, pages 1457-1465. Curran Associates, Inc.
Bach, F., Mairal, J., and Ponce, J. (2008). Convex sparse matrix factorizations. Technical Report 0812.1869, arXiv.
Bach, F. R., Lanckriet, G. R. G., and Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the Twenty-First International Conference on Machine Learning, page 6. ACM.
Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. (2012). The convex geometry of linear inverse problems. Found. Comput. Math., 12(6):805-849.
Doan, X. V. and Vavasis, S. A. (2013). Finding approximately rank-one submatrices with the nuclear norm and ℓ1 norms. SIAM J. Optimiz., 23(4):2502-2540.
Elad, M. and Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process., 15(12):3736-3745.
Hwang, S. J., Grauman, K., and Sha, F. (2011). Learning a tree of metrics with disjoint visual features. In Advances in Neural Information Processing Systems 24, pages 621-629.
Jacob, L., Obozinski, G., and Vert, J.-P. (2009). Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 433-440. ACM.
Jenatton, R., Audibert, J.-Y., and Bach, F. (2011). Structured variable selection with sparsity-inducing norms. J. Mach. Learn. Res., 12:2777-2824.
Oymak, S., Jalali, A., Fazel, M., Eldar, Y. C., and Hassibi, B. (2012). Simultaneously structured models with application to sparse and low-rank matrices. Technical Report 1212.3753, arXiv.
Richard, E., Obozinski, G., and Vert, J.-P. (2014). Tight convex relaxations for sparse matrix factorization. In Adv. Neural. Inform. Process. Syst.
Richard, E., Savalle, P.-A., and Vayatis, N. (2012). Estimation of simultaneously sparse and low-rank matrices. In Proceedings of the 29th International Conference on Machine Learning (ICML). Omnipress.
Romera-Paredes, B., Argyriou, A., Berthouze, N., and Pontil, M. (2012). Exploiting unrelated tasks in multi-task learning. J. Mach. Learn. Res. - Proceedings Track, 22:951-959.
Tseng, P. and Yun, S. (2009). A coordinate gradient descent method for nonsmooth separable minimization. Math. Program., 117(1-2):387-423.
Vervier, K., Mahé, P., d'Aspremont, A., Veyrieras, J.-B., and Vert, J.-P. (2014). On learning matrices with orthogonal columns or disjoint supports. In Machine Learning and Knowledge Discovery in Databases, volume 8726 of Lecture Notes in Computer Science, pages 274-289. Springer.
Wang, Y.-X., Xu, H., and Leng, C. (2013). Provable subspace clustering: when LRR meets SSC. In Adv. Neural. Inform. Process. Syst., volume 26, pages 64-72. Curran Associates, Inc.
Witten, D. M., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534.
Xiao, L., Zhou, D., and Wu, M. (2011). Hierarchical classification via orthogonal transfer. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 801-808. Omnipress.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B, 68(1):49-67.