Random Matrix Advances in Machine Learning

(Imaging and Machine Learning) Mathematics Workshop #3, Institut Henri Poincaré

Romain Couillet
CentraleSupélec, L2S, University of Paris-Saclay, France
GSTATS IDEX DataScience Chair, GIPSA-lab, University Grenoble-Alpes, France

April 1st, 2019

Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models
Application to Machine Learning

Basics of Random Matrix Theory / Motivation: Large Sample Covariance Matrices

Context

Baseline scenario: $y_1, \dots, y_n \in \mathbb{C}^p$ (or $\mathbb{R}^p$) i.i.d. with $E[y_1] = 0$, $E[y_1 y_1^*] = C_p$:

◮ If $y_1 \sim \mathcal{N}(0, C_p)$, the ML estimator of $C_p$ is the sample covariance matrix (SCM)
$$\hat{C}_p = \frac{1}{n} Y_p Y_p^* = \frac{1}{n} \sum_{i=1}^n y_i y_i^*, \qquad Y_p = [y_1, \dots, y_n] \in \mathbb{C}^{p \times n}.$$
◮ If $n \to \infty$, then, by the strong law of large numbers, $\hat{C}_p \xrightarrow{a.s.} C_p$, or equivalently, in spectral norm, $\|\hat{C}_p - C_p\| \xrightarrow{a.s.} 0$.

Random Matrix Regime

◮ No longer valid if $p, n \to \infty$ with $p/n \to c \in (0, \infty)$: $\|\hat{C}_p - C_p\| \not\to 0$.
◮ For practical $p, n$ with $p \simeq n$, this leads to dramatically wrong conclusions.
◮ Even for $p = n/100$.
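The contrast between the two regimes is easy to check numerically. Below is a minimal numpy sketch (an added illustration, not part of the original slides): it estimates $\|\hat{C}_p - C_p\|$ for $C_p = I_p$, first with $p$ fixed and $n$ growing (the error vanishes), then with $p/n = 1/4$ fixed (the error stabilizes around a non-zero value).

```python
import numpy as np

rng = np.random.default_rng(0)

def scm_spectral_error(p, n):
    """Spectral-norm error ||C_hat_p - C_p|| for C_p = I_p and Gaussian samples."""
    Y = rng.standard_normal((p, n))      # columns y_i ~ N(0, I_p)
    C_hat = Y @ Y.T / n                  # sample covariance matrix
    return np.linalg.norm(C_hat - np.eye(p), 2)

# Classical regime: p fixed, n -> infinity, the error vanishes
for n in (100, 1000, 10000):
    print(f"p=10,  n={n:5d}: ||C_hat - C|| = {scm_spectral_error(10, n):.3f}")

# Random matrix regime: p/n = 1/4 fixed, the error does not vanish
for p in (50, 200, 800):
    print(f"p={p:3d}, n={4*p:5d}: ||C_hat - C|| = {scm_spectral_error(p, 4*p):.3f}")
```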

The Marčenko–Pastur law

[Figure: Histogram of the eigenvalues of $\hat{C}_p$ for $c = 1/4$, $C_p = I_p$, shown for growing dimensions $(p, n) = (50, 200), (100, 400), (250, 1000), (500, 2000), (1000, 4000)$.]

The Marčenko–Pastur law

Definition (Empirical Spectral Density)
The empirical spectral density (e.s.d.) $\mu_p$ of a Hermitian matrix $A_p \in \mathbb{C}^{p \times p}$ is
$$\mu_p = \frac{1}{p} \sum_{i=1}^{p} \delta_{\lambda_i(A_p)}.$$

Theorem (Marčenko–Pastur Law [Marčenko, Pastur '67])
Let $X_p \in \mathbb{C}^{p \times n}$ have i.i.d. zero mean, unit variance entries. As $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, the e.s.d. $\mu_p$ of $\frac{1}{n} X_p X_p^*$ satisfies $\mu_p \xrightarrow{a.s.} \mu_c$ weakly, where
◮ $\mu_c(\{0\}) = \max\{0, 1 - c^{-1}\}$;
◮ on $(0, \infty)$, $\mu_c$ has a continuous density $f_c$ supported on $[(1 - \sqrt{c})^2, (1 + \sqrt{c})^2]$,
$$f_c(x) = \frac{1}{2 \pi c x} \sqrt{(x - (1 - \sqrt{c})^2)\,((1 + \sqrt{c})^2 - x)}.$$

[Figure: Marčenko–Pastur density $f_c(x)$ for different limit ratios $c = \lim_{p \to \infty} p/n$, here $c = 0.1, 0.2, 0.5$.]
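A short numpy/matplotlib sketch reproducing these pictures (an added illustration; the chosen $p$, $n$ are arbitrary): sample $X_p$ with i.i.d. entries, compute the eigenvalues of $\frac{1}{n} X_p X_p^*$, and overlay the Marčenko–Pastur density $f_c$.

```python
import numpy as np
import matplotlib.pyplot as plt

p, n = 500, 2000                       # c = p/n = 1/4
c = p / n
rng = np.random.default_rng(0)

X = rng.standard_normal((p, n))        # i.i.d. zero mean, unit variance entries
eigs = np.linalg.eigvalsh(X @ X.T / n)

# Marcenko-Pastur density on its support [(1 - sqrt(c))^2, (1 + sqrt(c))^2]
lm, lp = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
x = np.linspace(lm, lp, 500)
f_c = np.sqrt((x - lm) * (lp - x)) / (2 * np.pi * c * x)

plt.hist(eigs, bins=50, density=True, alpha=0.5, label="eigenvalues of $\\hat{C}_p$")
plt.plot(x, f_c, "r", label="Marcenko-Pastur density $f_c$")
plt.legend()
plt.show()
```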

Basics of Random Matrix Theory / Spiked Models

Spiked Models

Small rank perturbation: $C_p = I_p + P$, $P$ of low rank.

[Figure: Eigenvalues of $\frac{1}{n} Y_p Y_p^T$ for $p/n = 1/4, 1/2, 1, 2$ ($p = 500$), with $\mathrm{eig}(C_p) = \{\underbrace{1, \dots, 1}_{p-4}, 2, 3, 4, 5\}$: depending on $p/n$, the four spikes either separate from the bulk or are absorbed by it.]

Spiked Models

Theorem (Eigenvalues [Baik, Silverstein '06])
Let $Y_p = C_p^{1/2} X_p$, with
◮ $X_p$ having i.i.d. zero mean, unit variance entries with $E[|(X_p)_{ij}|^4] < \infty$;
◮ $C_p = I_p + P$, $P = U \Omega U^*$, where, for $K$ fixed, $\Omega = \mathrm{diag}(\omega_1, \dots, \omega_K) \in \mathbb{R}^{K \times K}$, with $\omega_1 \ge \dots \ge \omega_K > 0$.
Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, denoting $\lambda_m = \lambda_m(\frac{1}{n} Y_p Y_p^*)$ (with $\lambda_m > \lambda_{m+1}$),
$$\lambda_m \xrightarrow{a.s.} \begin{cases} 1 + \omega_m + c\,\dfrac{1 + \omega_m}{\omega_m} \;>\; (1 + \sqrt{c})^2, & \omega_m > \sqrt{c}, \\ (1 + \sqrt{c})^2, & \omega_m \in (0, \sqrt{c}]. \end{cases}$$
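A numerical check of this phase transition (an added sketch under the same model, with a single spike along the first canonical basis vector; values fluctuate at finite $p$):

```python
import numpy as np

p, n = 500, 1000                      # c = 1/2, threshold sqrt(c) ~ 0.707
c = p / n
rng = np.random.default_rng(0)

for omega in (0.3, 1.0, 3.0):         # below / above the threshold
    C_sqrt = np.eye(p)
    C_sqrt[0, 0] = np.sqrt(1 + omega)  # C_p = I_p + omega * e_1 e_1^T
    Y = C_sqrt @ rng.standard_normal((p, n))
    lam1 = np.linalg.eigvalsh(Y @ Y.T / n)[-1]
    if omega > np.sqrt(c):
        limit = 1 + omega + c * (1 + omega) / omega
    else:
        limit = (1 + np.sqrt(c)) ** 2  # spike absorbed: bulk edge
    print(f"omega={omega:.1f}: lambda_1 = {lam1:.3f}, predicted limit = {limit:.3f}")
```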

Spiked Models

Theorem (Eigenvectors [Paul '07])
Let $Y_p = C_p^{1/2} X_p$, with
◮ $X_p$ having i.i.d. zero mean, unit variance entries with $E[|(X_p)_{ij}|^4] < \infty$;
◮ $C_p = I_p + P$, $P = U \Omega U^* = \sum_{i=1}^{K} \omega_i u_i u_i^*$, $\omega_1 > \dots > \omega_K > 0$.
Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, for $a, b \in \mathbb{C}^p$ deterministic and $\hat{u}_i$ the eigenvector associated with $\lambda_i(\frac{1}{n} Y_p Y_p^*)$,
$$a^* \hat{u}_i \hat{u}_i^* b \;-\; \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}}\, a^* u_i u_i^* b \cdot 1_{\omega_i > \sqrt{c}} \;\xrightarrow{a.s.}\; 0.$$
In particular,
$$|\hat{u}_i^* u_i|^2 \;\xrightarrow{a.s.}\; \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}} \cdot 1_{\omega_i > \sqrt{c}}.$$

Spiked Models

[Figure: Simulated versus limiting $|\hat{u}_1^T u_1|^2$ for $Y_p = C_p^{1/2} X_p$, $C_p = I_p + \omega_1 u_1 u_1^T$, $p/n = 1/3$, varying $\omega_1$; simulations for $p = 100, 200, 400$ against the limit $\frac{1 - c/\omega_1^2}{1 + c/\omega_1}$.]
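The curve in this figure can be reproduced with a few lines (an added sketch in the same setting; the value of $p$ is a moderate placeholder):

```python
import numpy as np

p, n = 300, 900                         # c = 1/3, as in the figure
c = p / n
rng = np.random.default_rng(1)
u = np.zeros(p); u[0] = 1.0             # population spike direction u_1

for omega in (0.4, 1.0, 2.0, 4.0):
    C_sqrt = np.eye(p)
    C_sqrt[0, 0] = np.sqrt(1 + omega)   # C_p^{1/2} for C_p = I_p + omega u_1 u_1^T
    Y = C_sqrt @ rng.standard_normal((p, n))
    _, vecs = np.linalg.eigh(Y @ Y.T / n)
    u_hat = vecs[:, -1]                 # sample eigenvector of the largest eigenvalue
    align = float(u_hat @ u) ** 2
    limit = (1 - c / omega**2) / (1 + c / omega) if omega > np.sqrt(c) else 0.0
    print(f"omega={omega:.1f}: |u_hat.u1|^2 = {align:.3f}, predicted limit = {limit:.3f}")
```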

Other Spiked Models

Similar results hold for multiple matrix models:
◮ $Y_p = \frac{1}{n} (I + P)^{1/2} X_p X_p^* (I + P)^{1/2}$
◮ $Y_p = \frac{1}{n} X_p X_p^* + P$
◮ $Y_p = \frac{1}{n} X_p^* (I + P) X_p$
◮ $Y_p = \frac{1}{n} (X_p + P)^* (X_p + P)$
◮ etc.

Application to Machine Learning

An adventurous venue

Machine Learning is not "Simple Linear Statistics":
◮ data are data... and are not easily modeled
◮ machine learning algorithms involve non-linear functions, difficult to analyze
◮ recent trends go towards highly complex computer-science oriented methods: deep neural nets.

What can we say about those?
◮ Much more than we think, and actually much more than has been said so far!
◮ Key observation 1: In "non-trivial" (not so) large dimensional settings, machine learning intuitions break down!
◮ Key observation 2: In these "non-trivial" settings, RMT explains a lot of things and can improve algorithms!
◮ Key observation 3: Universality goes a long way: RMT findings are compliant with real data observations!

Takeaway Message 1 “RMT Explains Why Machine Learning Intuitions Collapse in Large Dimensions”

The curse of dimensionality and its consequences

Clustering setting in (not so) large $n, p$:
◮ GMM setting: $x^{(a)}_1, \dots, x^{(a)}_{n_a} \sim \mathcal{N}(\mu_a, C_a)$, $a = 1, \dots, k$
◮ Non-trivial task: $\|\mu_a - \mu_b\| = O(1)$, $\mathrm{tr}(C_a - C_b) = O(\sqrt{p})$, $\mathrm{tr}[(C_a - C_b)^2] = O(p)$ (non-trivial because otherwise too easy or too hard)

Classical method: spectral clustering
◮ Extract and cluster the dominant eigenvectors of $K = \{\kappa(x_i, x_j)\}_{i,j=1}^{n}$, $\kappa(x_i, x_j) = f\left(\frac{1}{p}\|x_i - x_j\|^2\right)$ (see the sketch below).
◮ Why? Finite-dimensional intuition.
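As a concrete illustration of this pipeline (an added sketch, not from the slides; the data model is the two-class example used later in the deck, and the clustering step is reduced to thresholding the informative eigenvector):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 400, 1000
mu = np.zeros(p); mu[0] = 2.0                         # class means +/- mu, ||mu_1 - mu_2|| = O(1)
labels = rng.integers(0, 2, n)
X = rng.standard_normal((n, p)) + np.outer(2 * labels - 1, mu)

# Kernel K_ij = f(||x_i - x_j||^2 / p) with f(t) = exp(-t/2)
G = X @ X.T
sq = (np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G) / p
K = np.exp(-sq / 2)

# Spectral clustering step: dominant eigenvectors of K.
# The top eigenvector is close to the constant vector; the class structure,
# when visible, appears in the next dominant eigenvector.
_, vecs = np.linalg.eigh(K)
v2 = vecs[:, -2]
pred = (v2 > np.median(v2)).astype(int)
acc = max(np.mean(pred == labels), np.mean(pred != labels))
print(f"accuracy from the second dominant eigenvector: {acc:.2f}")
```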

The curse of dimensionality and its consequences (2)

In reality, here is what happens... Kernel $K_{ij} = \exp(-\frac{1}{2p}\|x_i - x_j\|^2)$ and second eigenvector $v_2$ ($x_i \sim \mathcal{N}(\pm\mu, I_p)$, $\mu = (2, 0, \dots, 0)^T \in \mathbb{R}^p$).

Key observation: Under growth rate assumptions,
$$\max_{1 \le i \ne j \le n} \left| \frac{1}{p}\|x_i - x_j\|^2 - \tau \right| \xrightarrow{a.s.} 0, \qquad \tau = \frac{2}{p} \sum_{a=1}^{k} \frac{n_a}{n} \mathrm{tr}\, C_a.$$
◮ this suggests $K \simeq f(\tau) 1_n 1_n^T$!
◮ more importantly, in non-trivial settings, data are neither close, nor far!
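This distance concentration is easy to verify numerically (an added sketch with $C_a = I_p$, so that $\tau = 2$; the point is that the maximal deviation over all pairs shrinks as the dimension grows):

```python
import numpy as np

rng = np.random.default_rng(0)
for p in (100, 400, 1600):
    n = p                                            # keep p/n fixed
    mu = np.zeros(p); mu[0] = 2.0
    labels = rng.integers(0, 2, n)
    X = rng.standard_normal((n, p)) + np.outer(2 * labels - 1, mu)

    G = X @ X.T
    sq = (np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G) / p
    off_diag = sq[~np.eye(n, dtype=bool)]            # all i != j pairs
    tau = 2.0                                        # (2/p) * sum_a (n_a/n) tr C_a, with C_a = I_p
    print(f"p = {p:4d}: max_ij |d_ij^2/p - tau| = {np.abs(off_diag - tau).max():.3f}")
```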

The curse of dimensionality and its consequences (3)

(Major) consequences:
◮ Most machine learning intuitions collapse
◮ But luckily, concentration of distances allows for Taylor expansion, linearization...
◮ This is where RMT kicks back in!

Theorem ([C-Benaych'16] Asymptotic Kernel Behavior)
Under growth rate assumptions, as $p, n \to \infty$,
$$\|K - \hat{K}\| \xrightarrow{a.s.} 0, \qquad \hat{K} \simeq \frac{1}{p} Z Z^T + J A J^T + *,$$
with $J = [j_1, \dots, j_k] \in \mathbb{R}^{n \times k}$, $j_a = (0, 1_{n_a}, 0)^T$ (the clusters!) and $A \in \mathbb{R}^{k \times k}$ a function of:
◮ $f(\tau)$, $f'(\tau)$, $f''(\tau)$
◮ $\|\mu_a - \mu_b\|$, $\mathrm{tr}(C_a - C_b)$, $\mathrm{tr}((C_a - C_b)^2)$, for $a, b \in \{1, \dots, k\}$.

➫ This is a spiked model! We can study it fully! RMT can explain tools ML engineers use every day.

Theoretical Findings versus MNIST

[Figure: Eigenvalues of $K$ (red) and of the equivalent Gaussian-model $\hat{K}$ (white), MNIST data, $p = 784$, $n = 192$.]
slide-71
SLIDE 71

Application to Machine Learning/ 22/41

Theoretical Findings versus MNIST

Figure: Leading four eigenvectors of K for MNIST data (red) and theoretical findings (blue).

22 / 41

[Figure: Leading four eigenvectors of $K$ for MNIST data (red) and theoretical findings (blue).]

[Figure: 2D representation of the eigenvectors of $K$ for the MNIST dataset (Eigenvector 2 vs. Eigenvector 1, Eigenvector 3 vs. Eigenvector 2). Theoretical means and 1- and 2-standard deviations in blue; Class 1 in red, Class 2 in black, Class 3 in green.]

Takeaway Message 2 “RMT Reassesses and Improves Data Processing”

Improving Kernel Spectral Clustering

Thanks to [C-Benaych'16]: possibility to improve kernels:
◮ by "focusing" kernels on the best discriminative statistics: tune $f'(\tau)$, $f''(\tau)$
◮ by "killing" non-discriminative feature directions.

Example: covariance-based discrimination, kernel $f(t) = \exp(-\frac{1}{2}t)$ versus $f(t) = (t - \tau)^2$ (think about the surprising kernel shape!).

Another, more striking, example: Semi-supervised Learning

Semi-supervised learning: a great idea that never worked!
◮ Setting: assume now
  ◮ $x^{(a)}_1, \dots, x^{(a)}_{n_{a,[l]}}$ already labelled (few),
  ◮ $x^{(a)}_{n_{a,[l]}+1}, \dots, x^{(a)}_{n_a}$ unlabelled (a lot).
◮ Machine Learning original idea: find "scores" $F_{ia}$ for $x_i$ to belong to class $\mathcal{C}_a$:
$$F = \operatorname{argmin}_{F \in \mathbb{R}^{n \times k}} \sum_{a=1}^{k} \sum_{i,j} K_{ij} \left( F_{ia} - F_{ja} \right)^2, \qquad F^{[l]}_{ia} = \delta_{\{x_i \in \mathcal{C}_a\}}.$$
◮ Explicit solution:
$$F^{[u]} = \left( I_{n_{[u]}} - D_{[u]}^{-1} K_{[uu]} \right)^{-1} D_{[u]}^{-1} K_{[ul]} F^{[l]},$$
where $D = \mathrm{diag}(K 1_n)$ (degree matrix) and $[ul], [uu], \dots$ denote the blocks associated with labelled/unlabelled data.
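A direct implementation of this closed-form solution (an added sketch; the data are the toy two-class Gaussian mixture of the previous section, and the specific sizes are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, n_l = 400, 1000, 20                 # few labelled points, many unlabelled
mu = np.zeros(p); mu[0] = 2.0
labels = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, p)) + np.outer(2 * labels - 1, mu)

# Gaussian kernel K_ij = exp(-||x_i - x_j||^2 / (2p))
G = X @ X.T
sq = (np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G) / p
K = np.exp(-sq / 2)

# first n_l/2 points of each class are labelled, the rest are unlabelled
idx_l = np.concatenate([np.arange(n_l // 2), n // 2 + np.arange(n_l // 2)])
idx_u = np.setdiff1d(np.arange(n), idx_l)

F_l = np.eye(2)[labels[idx_l]]                     # one-hot labels: F^[l]_{ia} = delta_{x_i in C_a}
D_u_inv = np.diag(1.0 / K[idx_u].sum(axis=1))      # D = diag(K 1_n), unlabelled block
K_uu = K[np.ix_(idx_u, idx_u)]
K_ul = K[np.ix_(idx_u, idx_l)]

F_u = np.linalg.solve(np.eye(len(idx_u)) - D_u_inv @ K_uu, D_u_inv @ K_ul @ F_l)
# As stressed on the next slides, these raw scores come out nearly identical across points
print("mean score per class:", F_u.mean(axis=0), " spread:", F_u.std(axis=0))
```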

The finite-dimensional intuition: What we expect
The reality: What we see!

Setting: $p = 400$, $n = 1000$, $x_i \sim \mathcal{N}(\pm\mu, I_p)$. Kernel $K_{ij} = \exp(-\frac{1}{2p}\|x_i - x_j\|^2)$.
Display: Scores $F_{ik}$ (left) and $F^{\circ}_{ik} = F_{ik} - \frac{1}{2}(F_{i1} + F_{i2})$ (right).

➫ Scores are almost all identical... and do not follow the labelled data!

MNIST Data Example

[Figure: Vectors $[F^{(u)}]_{\cdot,a}$, $a = 1, 2, 3$ (zeros, ones, twos), for 3-class MNIST data, $n = 192$, $p = 784$, $n_l/n = 1/16$, Gaussian kernel.]

Exploiting RMT to resurrect SSL

Consequences of the finite-dimensional "mismatch":
◮ A priori, the algorithm should not work
◮ Indeed, "in general" it does not!
◮ But, luckily, after some (not clearly motivated) renormalization, it works again...
◮ BUT it does not use the unlabelled data efficiently!

Chapelle, Schölkopf, Zien, "Semi-Supervised Learning", Chapter 4, 2009:
"Our concern is this: it is frequently the case that we would be better off just discarding the unlabeled data and employing a supervised method, rather than taking a semi-supervised route. Thus we worry about the embarrassing situation where the addition of unlabeled data degrades the performance of a classifier."

What RMT can do about it:
◮ Asymptotic performance analysis: clear understanding of what we see!
◮ Update the algorithm and provably improve the use of unlabelled data.

Asymptotic Performance Analysis

Theorem ([Mai,C'18] Asymptotic Performance of SSL)
For $x_i \in \mathcal{C}_b$ unlabelled, the score vector $F_{i,\cdot} \in \mathbb{R}^k$ satisfies
$$F_{i,\cdot} - G_b \to 0, \qquad G_b \sim \mathcal{N}(m_b, \Sigma_b),$$
with $m_b \in \mathbb{R}^k$, $\Sigma_b \in \mathbb{R}^{k \times k}$ functions of $f(\tau), f'(\tau), f''(\tau), \mu_1, \dots, \mu_k, C_1, \dots, C_k$.

Most importantly: $m_b$, $\Sigma_b$ are independent of $n_u$ (the number of unlabelled data).

Solution: From RMT calculus (but not from ML intuition!), the solution is to replace $K$ by
$$\tilde{K} \equiv P K P, \qquad P = I_n - \frac{1}{n} 1_n 1_n^T.$$
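In code, this correction is a one-line change on top of the SSL sketch given earlier (again an added sketch; the function name is ours):

```python
import numpy as np

def center_kernel(K):
    """Return K_tilde = P K P with P = I_n - (1/n) 1_n 1_n^T (the RMT-motivated correction)."""
    n = K.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n
    return P @ K @ P

# Usage: compute the SSL scores F^[u] exactly as before, but with center_kernel(K)
# in place of K; per [Mai,C'18], the scores then genuinely benefit from unlabelled data.
```

The resulting classification performances are compared in the tables below.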

Experimental evidence: MNIST

Digits                              (0,8)      (2,7)      (6,9)
nu = 100
  Centered kernel (RMT)             89.5±3.6   89.5±3.4   85.3±5.9
  Iterated centered kernel (RMT)    89.5±3.6   89.5±3.4   85.3±5.9
  Laplacian                         75.5±5.6   74.2±5.8   70.0±5.5
  Iterated Laplacian                87.2±4.7   86.0±5.2   81.4±6.8
  Manifold                          88.0±4.7   88.4±3.9   82.8±6.5
nu = 1000
  Centered kernel (RMT)             92.2±0.9   92.5±0.8   92.6±1.6
  Iterated centered kernel (RMT)    92.3±0.9   92.5±0.8   92.9±1.4
  Laplacian                         65.6±4.1   74.4±4.0   69.5±3.7
  Iterated Laplacian                92.2±0.9   92.4±0.9   92.0±1.6
  Manifold                          91.1±1.7   91.4±1.9   91.4±2.0

Table: Comparison of classification accuracy (%) on MNIST datasets with nl = 10. Computed over 1000 random iterations for nu = 100 and 100 for nu = 1000.

Experimental evidence: Traffic signs (HOG features)

Class ID                            (2,7)       (9,10)      (11,18)
nu = 100
  Centered kernel (RMT)             79.0±10.4   77.5±9.2    78.5±7.1
  Iterated centered kernel (RMT)    85.3±5.9    89.2±5.6    90.1±6.7
  Laplacian                         73.8±9.8    77.3±9.5    78.6±7.2
  Iterated Laplacian                83.7±7.2    88.0±6.8    87.1±8.8
  Manifold                          77.6±8.9    81.4±10.4   82.3±10.8
nu = 1000
  Centered kernel (RMT)             83.6±2.4    84.6±2.4    88.7±9.4
  Iterated centered kernel (RMT)    84.8±3.8    88.0±5.5    96.4±3.0
  Laplacian                         72.7±4.2    88.9±5.7    95.8±3.2
  Iterated Laplacian                83.0±5.5    88.2±6.0    92.7±6.1
  Manifold                          77.7±5.8    85.0±9.0    90.6±8.1

Table: Comparison of classification accuracy (%) on German Traffic Sign datasets with nl = 10. Computed over 1000 random iterations for nu = 100 and 100 for nu = 1000.

Takeaway Message 3 “RMT Also Grasps ‘Real Data’ Processing”

From i.i.d. to concentrated random vectors

Current Problem. Data models based on vectors of i.i.d. entries (or even only Gaussian).
Good news. In RMT, exploitation of the time and feature dimensions brings universality, i.e., only the first moments matter, irrespective of the distribution.
The Solution? Concentrated random vectors go a long way beyond!

Definition (Concentrated Random Vector)
$x \in \mathbb{R}^p$ is a concentrated random vector if, for all Lipschitz $f : \mathbb{R}^p \to \mathbb{R}$, there exists $m_f \in \mathbb{R}$ such that
$$P\big( |f(x) - m_f| > \varepsilon \big) \le e^{-g(\varepsilon)}, \qquad g \text{ an increasing function}.$$

Theorem ([Louart,C'18] [Seddik,C'19] Kernel Universality)
For $x_i \sim \mathcal{L}(\mu_a, C_a)$ concentrated random vectors, under the conditions of [C-Benaych'16],
$$\|K - \hat{K}\| \xrightarrow{a.s.} 0, \qquad \hat{K} \simeq \frac{1}{p} Z Z^T + J A J^T + *,$$
with $A$ only dependent on $f(\tau), f'(\tau), f''(\tau), \mu_1, \dots, \mu_k, C_1, \dots, C_k$.

➫ Same result as [C-Benaych'16]... Universality of the first two moments!
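A minimal illustration of this universality (an added sketch with assumed placeholder choices: the "concentrated" data are an entrywise nonlinearity applied to a linear mixture of Gaussian entries, hence a Lipschitz map of a Gaussian vector, and they are compared with Gaussian data matching their first two moments):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
p, n = 400, 1200
W = rng.standard_normal((p, p)) / np.sqrt(p)

# "Concentrated" data: a Lipschitz map of Gaussian vectors (GAN generators are of this kind)
Z = rng.standard_normal((p, n))
X_conc = np.tanh(W @ Z)

# Gaussian-equivalent data: same mean and covariance as X_conc
m = X_conc.mean(axis=1, keepdims=True)
C = np.cov(X_conc)
X_gauss = m + np.linalg.cholesky(C + 1e-10 * np.eye(p)) @ rng.standard_normal((p, n))

def heat_kernel(X):
    """K_ij = exp(-||x_i - x_j||^2 / (2p)) for data stored as columns of X."""
    G = X.T @ X
    sq = (np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G) / p
    return np.exp(-sq / 2)

# Per the universality theorem, the two kernel spectra should essentially coincide
for name, data in [("concentrated (tanh of Gaussian)", X_conc), ("Gaussian equivalent", X_gauss)]:
    eigs = np.linalg.eigvalsh(heat_kernel(data))
    plt.hist(eigs[:-1], bins=60, density=True, alpha=0.5, label=name)  # drop the top (constant) eigenvalue
plt.legend(); plt.show()
```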

Ok... so what?

Key Finding. Real images are "almost" concentrated random vectors!
Example: GAN-generated images are concentrated random vectors (a GAN generator is a Lipschitz map applied to a Gaussian input, and Lipschitz maps preserve concentration)!

Ok... so what? (2)

Results. [Seddik,C'19]

Conclusion

Reminder of the takeaway messages.

The road ahead:
◮ getting away from GMM models and showing universality results (concentration of measure arguments)
◮ generalizing the approach to problems having non-explicit solutions (such as convex optimization problems)
◮ deep learning, recurrent neural nets... are a very different story!

The End

Thank you!

References

[C-Benaych'16] R. Couillet, F. Benaych-Georges, "Kernel Spectral Clustering of Large Dimensional Data", Electronic Journal of Statistics, vol. 10, no. 1, pp. 1393-1454, 2016.
[Mai,C'18] X. Mai, R. Couillet, "A random matrix analysis and improvement of semi-supervised learning for large dimensional data", Journal of Machine Learning Research, 2017 (in press).
[Louart,C'18] C. Louart, Z. Liao, R. Couillet, "A Random Matrix Approach to Neural Networks", The Annals of Applied Probability, vol. 28, no. 2, pp. 1190-1248, 2018.
[Seddik,C'19] M. Seddik, M. Tamaazousti, R. Couillet, "Kernel Random Matrices of Large Concentrated Data: The Example of GAN-Generated Images", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'19), Brighton, UK, 2019.
H. Tiomoko Ali, R. Couillet, "Improved spectral community detection in large heterogeneous networks", Journal of Machine Learning Research, vol. 18, no. 225, pp. 1-49, 2018.
R. Couillet, M. Tiomoko, S. Zozor, E. Moisan, "Random matrix-improved estimation of covariance matrix distances", Journal of Multivariate Analysis, 2018 (submitted).
Z. Liao, R. Couillet, "A Large Dimensional Analysis of Least Squares Support Vector Machines", IEEE Transactions on Signal Processing, vol. 67, no. 4, pp. 1065-1074, 2018.