Machine learning on the symmetric group
Jean-Philippe Vert
What if inputs are permutations?
Permutation: a bijection σ : [1, N] → [1, N], where σ(i) = rank of item i
Composition: (σ1σ2)(i) = σ1(σ2(i))
SN, the symmetric group, with |SN| = N!
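These definitions are easy to check in code; a minimal sketch in Python (0-indexed, so a permutation of [1, N] is stored as an array over {0, …, N−1}; all names are illustrative):

```python
import numpy as np

def compose(s1, s2):
    """Composition of permutations: (s1 s2)(i) = s1(s2(i)).
    Permutations are stored as arrays with s[i] = rank of item i."""
    return s1[s2]

sigma = np.array([2, 0, 1])   # item 0 -> rank 2, item 1 -> rank 0, ...
tau = np.array([1, 2, 0])     # happens to be the inverse of sigma
identity = compose(sigma, tau)
```

Here `compose(sigma, tau)` returns the identity permutation [0, 1, 2], since `tau` inverts `sigma`.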
Examples
Ranking data
Ranks extracted from data (histogram equalization, quantile normalization, ...)
Examples
Batch effects, calibration of experimental measures
Learning from permutations
Assume your data are permutations and you want to learn f : SN → R. A solution: embed SN into a Euclidean (or Hilbert) space via Φ : SN → Rp and learn a linear function fβ(σ) = β⊤Φ(σ). The corresponding kernel is K(σ1, σ2) = Φ(σ1)⊤Φ(σ2).
How to define the embedding Φ : SN → Rp ?
Should encode interesting features
Should lead to efficient algorithms
Should be invariant to renaming of the items, i.e., the kernel should be right-invariant: ∀σ1, σ2, π ∈ SN, K(σ1π, σ2π) = K(σ1, σ2)
Some attempts
SUQUAN Kendall
(Jiao and Vert, 2015, 2017, 2018; Le Morvan and Vert, 2017)
SUQUAN embedding (Le Morvan and Vert, 2017)
Let Φ(σ) = Πσ, the permutation representation (Serre, 1977):
[Πσ]ij = 1 if σ(j) = i, 0 otherwise.
Right-invariant: ⟨Φ(σ), Φ(σ′)⟩ = Tr(ΠσΠσ′⊤) = Tr(ΠσΠσ′−1) = Tr(Πσσ′−1)
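This right-invariance is easy to verify numerically; a sketch assuming 0-indexed permutations and the convention [Πσ]ij = 1 iff σ(j) = i (illustrative code, not the authors'):

```python
import numpy as np

def perm_matrix(s):
    """Permutation representation: P[i, j] = 1 iff s(j) = i."""
    n = len(s)
    P = np.zeros((n, n))
    P[s, np.arange(n)] = 1.0
    return P

def K(s1, s2):
    """K(s1, s2) = <Pi_s1, Pi_s2> = Tr(Pi_s1 Pi_s2^T)."""
    return np.trace(perm_matrix(s1) @ perm_matrix(s2).T)

rng = np.random.default_rng(0)
n = 5
s1, s2, pi = rng.permutation(n), rng.permutation(n), rng.permutation(n)
# right-invariance: composing both arguments with pi leaves K unchanged
assert np.isclose(K(s1[pi], s2[pi]), K(s1, s2))
```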
Link with quantile normalization (QN)
Take σ(x) = rank(x) with x ∈ RN. Fix a target quantile vector f ∈ RN. "Keep the order of x, change the values to f":
[Ψf(x)]i = fσ(x)(i), i.e., Ψf(x) = Πσ(x)⊤f
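In code, QN just re-labels the values of x by the entries of a sorted target vector f; a minimal sketch (names illustrative, f assumed sorted in increasing order):

```python
import numpy as np

def quantile_normalize(x, f):
    """[Psi_f(x)]_i = f[rank of x_i]: keep the order of x,
    replace its values by the target quantiles f."""
    ranks = np.argsort(np.argsort(x))   # 0-indexed rank of each entry
    return np.asarray(f)[ranks]

x = np.array([0.3, 5.0, 1.2])
f = np.array([-1.0, 0.0, 1.0])          # target quantile function
psi = quantile_normalize(x, f)          # -> [-1., 1., 0.]
```

The smallest entry of x receives the smallest target quantile, and so on, so the ordering of x is preserved exactly.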
How to choose a "good" target distribution?
Supervised QN (SUQUAN)
Standard QN:
1. Fix f arbitrarily
2. QN all samples to get Ψf(x1), . . . , Ψf(xN)
3. Learn a model on the normalized data, e.g.:
min_{w,b} (1/N) Σ_{i=1}^N ℓi(w⊤Ψf(xi) + b) + λΩ(w)

SUQUAN: jointly learn f and the model:
min_{w,b,f} (1/N) Σ_{i=1}^N ℓi(w⊤Ψf(xi) + b) + λΩ(w) + γΩ2(f)
SUQUAN as rank-1 matrix regression over Φ(σ)
Linear SUQUAN therefore solves
min_{w,b,f} (1/N) Σ_{i=1}^N ℓi(w⊤Ψf(xi) + b) + λΩ(w) + γΩ2(f)
= min_{w,b,f} (1/N) Σ_{i=1}^N ℓi(w⊤Πσ(xi)⊤f + b) + λΩ(w) + γΩ2(f)
= min_{w,b,f} (1/N) Σ_{i=1}^N ℓi(⟨Πσ(xi), fw⊤⟩Frobenius + b) + λΩ(w) + γΩ2(f)

A particular linear model to estimate a rank-1 matrix M = fw⊤
Each sample σ ∈ SN is represented by the matrix Πσ ∈ RN×N
Non-convex, but alternating optimization of f and w is easy
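The alternation can be sketched with a squared loss standing in for ℓi and the regularizers dropped (a toy illustration under those assumptions, not the paper's algorithm): fixing f, the problem is an ordinary least squares in w, and fixing w it is an ordinary least squares in f, using ⟨Πσ, fw⊤⟩ = (Πσw)⊤f.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 6, 50                              # features, samples
X = rng.normal(size=(N, n))
y = rng.normal(size=N)

def perm_matrix(x):
    """Pi with Pi[rank(x_j), j] = 1, so Psi_f(x) = Pi^T f."""
    ranks = np.argsort(np.argsort(x))
    P = np.zeros((len(x), len(x)))
    P[ranks, np.arange(len(x))] = 1.0
    return P

Ps = [perm_matrix(x) for x in X]
f = np.sort(rng.normal(size=n))           # initial target quantiles
w = np.zeros(n)
for _ in range(20):
    A = np.stack([P.T @ f for P in Ps])   # fix f: least squares in w
    w = np.linalg.lstsq(A, y, rcond=None)[0]
    B = np.stack([P @ w for P in Ps])     # fix w: least squares in f
    f = np.linalg.lstsq(B, y, rcond=None)[0]

loss = np.mean((np.stack([P.T @ f for P in Ps]) @ w - y) ** 2)
```

Each step solves its subproblem exactly, so the objective is non-increasing along the alternation.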
Experiments: CIFAR-10
Image classification into 10 classes (45 binary problems)
N = 5,000 per class, p = 1,024 pixels
Linear logistic regression on raw pixels
[Figure: test AUC (0.60–0.90) of SUQUAN variants (SVD, BND, SPAV) against fixed target quantiles (cauchy, exponential, uniform, gaussian, median); scatter plots compare test AUC of SUQUAN BND against the median baseline.]
Experiments: CIFAR-10
Example: horse vs. plane Different methods learn different quantile functions
[Figure: quantile functions learned by median, SVD and SUQUAN BND, compared to the original distribution.]
Limits of the SUQUAN embedding
Linear model on Φ(σ) = Πσ ∈ RN×N Captures first-order information of the form "i-th feature ranked at the j-th position" What about higher-order information such as "feature i larger than feature j"?
The Kendall embedding (Jiao and Vert, 2015, 2017)
Φi,j(σ) = 1 if σ(i) < σ(j), 0 otherwise.
Geometry of the embedding
For any two permutations σ, σ′ ∈ SN:
Inner product: Φ(σ)⊤Φ(σ′) = Σ_{1≤i≠j≤n} 1σ(i)<σ(j) 1σ′(i)<σ′(j) = nc(σ, σ′), the number of concordant pairs
Distance: ‖Φ(σ) − Φ(σ′)‖² = Σ_{1≤i≠j≤n} (1σ(i)<σ(j) − 1σ′(i)<σ′(j))² = 2nd(σ, σ′), where nd is the number of discordant pairs
Kendall and Mallows kernels
The Kendall kernel is Kτ(σ, σ′) = nc(σ, σ′). The Mallows kernel is, for any λ ≥ 0, KλM(σ, σ′) = e−λnd(σ,σ′).
Theorem (Jiao and Vert, 2015, 2017)
The Kendall and Mallows kernels are positive definite right-invariant kernels and can be evaluated in O(N log N) time Kernel trick useful with few samples in large dimensions
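A direct O(N²) implementation of both kernels makes the definitions concrete (a sketch; the O(N log N) evaluation in the theorem relies on a merge-sort-style count of discordant pairs in the spirit of Knight, 1966):

```python
import numpy as np

def conc_disc(s, t):
    """Number of concordant (nc) and discordant (nd) pairs
    between two permutations s and t of the same length."""
    n = len(s)
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (s[i] < s[j]) == (t[i] < t[j]):
                nc += 1
            else:
                nd += 1
    return nc, nd

def kendall_kernel(s, t):
    return conc_disc(s, t)[0]                 # K_tau = nc

def mallows_kernel(s, t, lam=1.0):
    return np.exp(-lam * conc_disc(s, t)[1])  # K_M = exp(-lam * nd)
```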
Remark
[Figure: Cayley graph of S4.]
Kondor and Barbosa (2010) proposed the diffusion kernel on the Cayley graph of the symmetric group generated by adjacent transpositions. Computationally intensive (O(N2N)). The Mallows kernel KλM(σ, σ′) = e−λnd(σ,σ′), where nd(σ, σ′) is the shortest-path distance on this Cayley graph, can be computed in O(N log N).
Applications
[Figure: accuracy of SVM and KFD classifiers with Kendall (kdt), linear, polynomial and RBF kernels on ALL or TOP features, compared to TSP, kTSP and APMV baselines.]
Average performance on 10 microarray classification problems (Jiao and Vert, 2017).
Extension: weighted Kendall kernel?
Can we weight differently pairs based on their ranks? This would ensure a right-invariant kernel, i.e., the overall geometry does not change if we relabel the items ∀σ1, σ2, π ∈ SN , K(σ1π, σ2π) = K(σ1, σ2)
Related work
Given a weight function w : [1, n]² → R, many weighted versions of Kendall's τ have been proposed:
- Σ_{1≤i≠j≤n} w(σ(i), σ(j)) 1σ(i)<σ(j) 1σ′(i)<σ′(j) (Shieh, 1998)
- Σ_{1≤i≠j≤n} w(σ(i), σ(j)) · (pσ(i) − pσ′(i))/(σ(i) − σ′(i)) · (pσ(j) − pσ′(j))/(σ(j) − σ′(j)) · 1σ(i)<σ(j) 1σ′(i)<σ′(j) (Kumar and Vassilvitskii, 2010)
- Σ_{1≤i≠j≤n} w(i, j) 1σ(i)<σ(j) 1σ′(i)<σ′(j) (Vigna, 2015)
However, they are either not symmetric (1st and 2nd) or not right-invariant (3rd)
A right-invariant weighted Kendall kernel (Jiao and Vert, 2018)
Theorem
For any matrix U ∈ Rn×n,
KU(σ, σ′) = Σ_{1≤i≠j≤n} Uσ(i),σ(j) Uσ′(i),σ′(j) 1σ(i)<σ(j) 1σ′(i)<σ′(j)
is a right-invariant p.d. kernel on SN.
Examples
Ua,b corresponds to the weight of (items ranked at) positions a and b in a permutation. Interesting choices include:
- Top-k. For some k ∈ [1, n], Ua,b = 1 if a ≤ k and b ≤ k, 0 otherwise.
- Additive. For some u ∈ Rn, take Uij = ui + uj.
- Multiplicative. For some u ∈ Rn, take Uij = uiuj.
Theorem (Kernel trick)
The weighted Kendall kernel can be computed in O(n ln(n)) for the top-k, additive or multiplicative weights.
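A naive O(n²) version of KU with the top-k weights, for intuition only (a sketch, 0-indexed so "top k" means rank < k; the O(n ln n) algorithm is not reproduced here):

```python
import numpy as np

def weighted_kendall(s, t, U):
    """K_U(s, t) = sum over ordered pairs i != j of
    U[s(i), s(j)] * U[t(i), t(j)] * 1{s(i) < s(j)} * 1{t(i) < t(j)}."""
    n = len(s)
    k = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and s[i] < s[j] and t[i] < t[j]:
                k += U[s[i], s[j]] * U[t[i], t[j]]
    return k

def top_k_weights(n, k):
    """U[a, b] = 1 if positions a and b are both in the top k, else 0."""
    U = np.zeros((n, n))
    U[:k, :k] = 1.0
    return U

n = 4
s = np.array([0, 1, 2, 3])
# with all-ones weights, K_U reduces to the unweighted Kendall kernel
```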
Learning the weights (1/2)
KU can be written as KU(σ, σ′) = ΦU(σ)⊤ΦU(σ′) with
ΦU(σ) = (Uσ(i),σ(j) 1σ(i)<σ(j))_{1≤i≠j≤n}
Interesting fact: for any upper triangular matrix U ∈ Rn×n,
ΦU(σ) = Πσ⊤UΠσ, with (Πσ)ij = 1i=σ(j)
Hence a linear model on ΦU can be rewritten as
fβ,U(σ) = ⟨β, ΦU(σ)⟩Frobenius(n×n) = ⟨β, Πσ⊤UΠσ⟩Frobenius(n×n) = ⟨Πσ ⊗ Πσ, vec(U) ⊗ (vec(β))⊤⟩Frobenius(n²×n²)
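The last identity can be checked numerically with column-major vectorization, for which (A ⊗ B)vec(X) = vec(BXA⊤) (an illustrative sketch with random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
sigma = rng.permutation(n)
P = np.zeros((n, n))
P[sigma, np.arange(n)] = 1.0            # (P)_{ij} = 1 iff sigma(j) = i

U = np.triu(rng.normal(size=(n, n)))    # upper triangular weights
beta = rng.normal(size=(n, n))

vec = lambda M: M.flatten(order="F")    # column-major vectorization

lhs = np.sum(beta * (P.T @ U @ P))      # <beta, Pi^T U Pi>_Frobenius
rhs = vec(U) @ np.kron(P, P) @ vec(beta)
```

Both sides equal Tr(β⊤Πσ⊤UΠσ), which is the point of the rewriting.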
Learning the weights (2/2)
fβ,U(σ) = ⟨Πσ ⊗ Πσ, vec(U) ⊗ (vec(β))⊤⟩Frobenius(n²×n²)
This is symmetric in U and β
Instead of fixing the weights U and optimizing β, we can jointly optimize β and U to learn the weights
Same as SUQUAN, with Πσ ⊗ Πσ instead of Πσ
Experiments
Eurobarometer data (Christensen, 2010) >12k individuals rank 6 sources of information Binary classification problem: predict age from ranking (>40y vs <40y)
[Figure: accuracy (0.5–0.7) by type of weighted kernel: standard (top-6), top-5 to top-2, average, additive/multiplicative weights (hb and log), and learned weights (svd, opt).]
Towards higher-order representations
fβ,U(σ) = ⟨Πσ ⊗ Πσ, vec(U) ⊗ (vec(β))⊤⟩Frobenius(n²×n²)
A particular rank-1 linear model for the embedding Σσ = Πσ ⊗ Πσ ∈ {0, 1}^{n²×n²}
Σ is the direct sum of the second-order and first-order permutation representations: Σ ≅ τ(n−2,1,1) ⊕ τ(n−1,1)
This generalizes SUQUAN, which considers the first-order representation Πσ only: hβ,w(σ) = ⟨Πσ, w ⊗ β⊤⟩Frobenius(n×n)
Generalization possible to higher-order information by using higher-order linear representations of the symmetric group, which are the good basis for right-invariant kernels (Bochner theorem)...
Conclusion
Machine learning beyond vectors, strings and graphs
Different embeddings of the symmetric group
Open questions: scalability? Robustness to adversarial attacks? Differentiable embeddings?
Thank you!
References
- R. E. Barlow, D. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical inference under order restrictions; the theory and application of isotonic regression. Wiley, New York, 1972.
- T. Christensen. Eurobarometer 55.2: Science and technology, agriculture, the euro, and internet access, May–June 2001. https://doi.org/10.3886/ICPSR03341.v3, June 2010. ICPSR03341-v3. Cologne, Germany: GESIS / Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2010-06-30.
- Y. Jiao and J.-P. Vert. The Kendall and Mallows kernels for permutations. In Proceedings of The 32nd International Conference on Machine Learning, volume 37 of JMLR:W&CP, pages 1935–1944, 2015. URL http://jmlr.org/proceedings/papers/v37/jiao15.html.
- Y. Jiao and J.-P. Vert. The Kendall and Mallows kernels for permutations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. doi: 10.1109/TPAMI.2017.2719680. URL http://dx.doi.org/10.1109/TPAMI.2017.2719680.
- Y. Jiao and J.-P. Vert. The weighted Kendall and high-order kernels for permutations. Technical Report 1802.08526, arXiv, 2018.
- W. R. Knight. A computer method for calculating Kendall's tau with ungrouped data. J. Am. Stat. Assoc., 61(314):436–439, 1966. URL http://www.jstor.org/stable/2282833.
- R. Kumar and S. Vassilvitskii. Generalized distances between rankings. In Proceedings of the 19th International Conference on World Wide Web (WWW-10), pages 571–580. ACM, 2010. doi: 10.1145/1772690.1772749.
- M. Le Morvan and J.-P. Vert. Supervised quantile normalisation. Technical Report 1706.00244, arXiv, 2017.
References (cont.)
- J.-P. Serre. Linear Representations of Finite Groups. Graduate Texts in Mathematics. Springer-Verlag New York, 1977. doi: 10.1007/978-1-4684-9458-7. URL http://dx.doi.org/10.1007/978-1-4684-9458-7.
- G. S. Shieh. A weighted Kendall's tau statistic. Statistics & Probability Letters, 39(1):17–24, 1998. doi: 10.1016/s0167-7152(98)00006-6. URL http://dx.doi.org/10.1016/S0167-7152(98)00006-6.
- O. Sysoev and O. Burdakov. A smoothed monotonic regression via L2 regularization. Technical Report LiTH-MAT-R–2016/01–SE, Department of Mathematics, Linköping University, 2016. URL http://liu.diva-portal.org/smash/get/diva2:905380/FULLTEXT01.pdf.
- S. Vigna. A weighted correlation index for rankings with ties. In Proceedings of the 24th International Conference on World Wide Web (WWW-15), pages 1166–1176. ACM, 2015. doi: 10.1145/2736277.2741088.
Harmonic analysis on SN
A representation of SN is a matrix-valued function ρ : SN → Cdρ×dρ such that ∀σ1, σ2 ∈ SN, ρ(σ1σ2) = ρ(σ1)ρ(σ2)
A representation is irreducible (irrep) if it is not equivalent to the direct sum of two other representations
SN has a finite number of irreps {ρλ : λ ∈ Λ}, where Λ = {λ ⊢ N}¹ is the set of partitions of N
For any f : SN → R, the Fourier transform of f is: ∀λ ∈ Λ, f̂(ρλ) = Σ_{σ∈SN} f(σ)ρλ(σ)

¹λ ⊢ N iff λ = (λ1, . . . , λr) with λ1 ≥ . . . ≥ λr and Σ_{i=1}^r λi = N
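As a brute-force illustration, the Fourier coefficient at the (reducible) permutation representation ρ(σ) = Πσ can be computed by enumerating all of S3 (a sketch; actual Fourier analysis on SN uses the irreps ρλ):

```python
import itertools
import numpy as np

def perm_matrix(s):
    n = len(s)
    P = np.zeros((n, n))
    P[list(s), np.arange(n)] = 1.0      # P[i, j] = 1 iff s(j) = i
    return P

N = 3
perms = list(itertools.permutations(range(N)))

def fourier_at_perm_rep(f):
    """f_hat(rho) = sum over sigma of f(sigma) * rho(sigma), rho = Pi."""
    return sum(f(s) * perm_matrix(s) for s in perms)

# for f == 1, entry (i, j) counts permutations with sigma(j) = i,
# i.e. (N-1)! of them
F = fourier_at_perm_rep(lambda s: 1.0)
```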