
Random Matrix Advances in Machine Learning
(Imaging and Machine Learning) Mathematics Workshop #3, Institut Henri Poincaré
Romain Couillet
CentraleSupélec, L2S, University of Paris-Saclay, France
GSTATS IDEX DataScience Chair, GIPSA-lab


1–2. Basics of Random Matrix Theory / Spiked Models
Spiked Models

Small-rank perturbation: C_p = I_p + P, P of low rank.

Figure: Eigenvalues of (1/n) Y_p Y_p^T for p/n = 1 and p/n = 2 (p = 500), with eig(C_p) = {1, ..., 1, 2, 3, 4, 5} (the eigenvalue 1 with multiplicity p − 4).

3–4. Basics of Random Matrix Theory / Spiked Models
Spiked Models

Theorem (Eigenvalues, [Baik, Silverstein'06]). Let Y_p = C_p^{1/2} X_p, with
- X_p with i.i.d. zero-mean, unit-variance entries, E[|X_p|_{ij}^4] < ∞,
- C_p = I_p + P, P = U Ω U*, where, for K fixed, Ω = diag(ω_1, ..., ω_K) ∈ R^{K×K}, with ω_1 ≥ ... ≥ ω_K > 0.

Then, as p, n → ∞ with p/n → c ∈ (0, ∞), denoting λ_m = λ_m((1/n) Y_p Y_p^*) (with λ_m > λ_{m+1}),

λ_m → 1 + ω_m + c (1 + ω_m)/ω_m > (1 + √c)²  almost surely, if ω_m > √c,
λ_m → (1 + √c)²  almost surely, if ω_m ∈ (0, √c].
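As a quick numerical check of this phase transition, here is a minimal NumPy sketch (not from the slides; the Gaussian entries, p = n = 500, and the spike values are illustrative assumptions) comparing the largest eigenvalues of (1/n) Y_p Y_p^T with the predicted limits:

```python
import numpy as np

# Numerical check (illustrative assumptions: Gaussian entries, p = n = 500, spikes 1..4):
# above the transition omega > sqrt(c), the largest empirical eigenvalues of (1/n) Y_p Y_p^T
# should sit near 1 + omega + c*(1 + omega)/omega; at or below it, near (1 + sqrt(c))^2.
p, n = 500, 500
c = p / n
omegas = np.array([4.0, 3.0, 2.0, 1.0])              # eigenvalues of P, so eig(C_p) = {1,...,1,2,3,4,5}

U, _ = np.linalg.qr(np.random.randn(p, len(omegas))) # orthonormal spike directions
C = np.eye(p) + U @ np.diag(omegas) @ U.T            # C_p = I_p + P
Y = np.linalg.cholesky(C) @ np.random.randn(p, n)    # Y_p = C_p^{1/2} X_p (up to rotation, for Gaussian X_p)
eigs = np.sort(np.linalg.eigvalsh(Y @ Y.T / n))[::-1]

edge = (1 + np.sqrt(c)) ** 2
for m, w in enumerate(omegas):
    predicted = 1 + w + c * (1 + w) / w if w > np.sqrt(c) else edge
    print(f"omega = {w:.0f}: empirical lambda_{m+1} = {eigs[m]:.3f}, predicted = {predicted:.3f}")
```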

5–6. Basics of Random Matrix Theory / Spiked Models
Spiked Models

Theorem (Eigenvectors, [Paul'07]). Let Y_p = C_p^{1/2} X_p, with
- X_p with i.i.d. zero-mean, unit-variance entries, E[|X_p|_{ij}^4] < ∞,
- C_p = I_p + P, P = U Ω U* = Σ_{i=1}^K ω_i u_i u_i^*, with ω_1 > ... > ω_K > 0.

Then, as p, n → ∞ with p/n → c ∈ (0, ∞), for a, b ∈ C^p deterministic and û_i the eigenvector associated with λ_i((1/n) Y_p Y_p^*),

a^* û_i û_i^* b − (1 − c ω_i^{-2})/(1 + c ω_i^{-1}) · a^* u_i u_i^* b · 1_{ω_i > √c} → 0  almost surely.

In particular,

|û_i^* u_i|² → (1 − c ω_i^{-2})/(1 + c ω_i^{-1}) · 1_{ω_i > √c}  almost surely.

7–10. Basics of Random Matrix Theory / Spiked Models
Spiked Models

Figure: Simulated versus limiting |û_1^T u_1|² for Y_p = C_p^{1/2} X_p, C_p = I_p + ω_1 u_1 u_1^T, p/n = 1/3, varying ω_1; empirical curves for p = 100, 200, 400 against the limit (1 − c/ω_1²)/(1 + c/ω_1).
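The figure can be reproduced in a few lines; the sketch below (illustrative assumptions: Gaussian data, p = 200, p/n = 1/3) compares the empirical alignment |û_1^T u_1|² with the limiting value for a few spike strengths ω_1:

```python
import numpy as np

# Illustrative sketch (assumed Gaussian data, p/n = 1/3 as in the figure): empirical
# |u1_hat^T u1|^2 versus the limit (1 - c/omega^2)/(1 + c/omega) * 1_{omega > sqrt(c)}.
p, n = 200, 600
c = p / n
u1 = np.zeros(p); u1[0] = 1.0                        # population spike direction

for omega in [0.25, 0.5, 1.0, 2.0, 4.0]:
    C_half = np.eye(p)
    C_half[0, 0] = np.sqrt(1 + omega)                # C_p^{1/2} for C_p = I_p + omega * u1 u1^T
    Y = C_half @ np.random.randn(p, n)
    vecs = np.linalg.eigh(Y @ Y.T / n)[1]
    u1_hat = vecs[:, -1]                             # eigenvector of the largest eigenvalue
    limit = (1 - c / omega**2) / (1 + c / omega) if omega > np.sqrt(c) else 0.0
    print(f"omega = {omega:4.2f}: empirical {np.dot(u1_hat, u1)**2:.3f}, limit {limit:.3f}")
```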

11. Basics of Random Matrix Theory / Spiked Models
Other Spiked Models

Similar results hold for multiple matrix models:
- Y_p = (1/n) (I + P)^{1/2} X_p X_p^* (I + P)^{1/2}
- Y_p = (1/n) X_p X_p^* + P
- Y_p = (1/n) X_p^* (I + P) X_p
- Y_p = (1/n) (X_p + P)^* (X_p + P)
- etc.

12. Application to Machine Learning
Outline

- Basics of Random Matrix Theory
  - Motivation: Large Sample Covariance Matrices
  - Spiked Models
- Application to Machine Learning

13–21. Application to Machine Learning
An adventurous venue

Machine Learning is not "Simple Linear Statistics":
- data are data... and are not easily modeled
- machine learning algorithms involve non-linear functions, difficult to analyze
- recent trends go towards highly complex, computer-science-oriented methods: deep neural nets.

What can we say about those?
- Much more than we think, and actually much more than has been said so far!
- Key observation 1: In "non-trivial" (not so) large dimensional settings, machine learning intuitions break down!
- Key observation 2: In these "non-trivial" settings, RMT explains a lot of things and can improve algorithms!
- Key observation 3: Universality goes a long way: RMT findings are consistent with real-data observations!

22. Takeaway Message 1: “RMT Explains Why Machine Learning Intuitions Collapse in Large Dimensions”

23–30. Application to Machine Learning
The curse of dimensionality and its consequences

Clustering setting in (not so) large n, p:
- GMM setting: x_1^(a), ..., x_{n_a}^(a) ~ N(μ_a, C_a), a = 1, ..., k
- Non-trivial task: ‖μ_a − μ_b‖ = O(1), tr(C_a − C_b) = O(√p), tr[(C_a − C_b)²] = O(p)
  (non-trivial because otherwise too easy or too hard)

Classical method: spectral clustering
- Extract and cluster the dominant eigenvectors of K = {κ(x_i, x_j)}_{i,j=1}^n, with κ(x_i, x_j) = f(‖x_i − x_j‖²/p).
- Why? Finite-dimensional intuition (see the sketch right after this slide).
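To make the classical pipeline concrete, here is a minimal sketch of kernel spectral clustering (the two-class Gaussian mixture, the choice f(t) = exp(−t/2), and the median-threshold rule are illustrative assumptions, not the slides' exact experiment):

```python
import numpy as np

# Kernel spectral clustering sketch on a two-class GMM (assumed setting).
p, n = 400, 1000
mu = np.zeros(p); mu[0] = 2.0
labels = np.repeat([0, 1], n // 2)
X = np.random.randn(n, p) + np.where(labels[:, None] == 0, -mu, mu)

# K_ij = f(||x_i - x_j||^2 / p) with f(t) = exp(-t/2).
norms = (X ** 2).sum(1)
sq_dists = (norms[:, None] + norms[None, :] - 2 * X @ X.T) / p
K = np.exp(-sq_dists / 2)

# The top eigenvector is nearly constant (see the following slides); cluster with the second one.
eigvecs = np.linalg.eigh(K)[1]
pred = (eigvecs[:, -2] > np.median(eigvecs[:, -2])).astype(int)
acc = max((pred == labels).mean(), ((1 - pred) == labels).mean())
print(f"clustering accuracy from the second dominant eigenvector: {acc:.2f}")
```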

31–34. Application to Machine Learning
The curse of dimensionality and its consequences (2)

In reality, here is what happens... Kernel K_ij = exp(−‖x_i − x_j‖²/(2p)) and second eigenvector v_2, for x_i ~ N(±μ, I_p), μ = (2, 0, ..., 0)^T ∈ R^p.

Key observation: Under growth rate assumptions,

max_{1 ≤ i ≠ j ≤ n} | (1/p)‖x_i − x_j‖² − τ | → 0  almost surely,  where τ = (2/p) Σ_{a=1}^k (n_a/n) tr C_a.

- this suggests K ≃ f(τ) 1_n 1_n^T !
- more importantly, in non-trivial settings, data are neither close, nor far! (A small numerical illustration of this concentration follows.)
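A quick numerical illustration of this concentration (a sketch under the same assumed two-class model, with identity covariances so that τ = 2):

```python
import numpy as np

# Concentration of pairwise distances (sketch under the assumed two-class model with
# identity covariances, so tau = (2/p) * tr(I_p) = 2): the maximal deviation of
# ||x_i - x_j||^2 / p from tau shrinks as p grows (n kept fixed here for simplicity).
def max_deviation(p, n=500):
    mu = np.zeros(p); mu[0] = 2.0
    labels = np.repeat([0, 1], n // 2)
    X = np.random.randn(n, p) + np.where(labels[:, None] == 0, -mu, mu)
    norms = (X ** 2).sum(1)
    sq = (norms[:, None] + norms[None, :] - 2 * X @ X.T) / p    # ||x_i - x_j||^2 / p
    off_diag = ~np.eye(n, dtype=bool)
    return np.abs(sq[off_diag] - 2.0).max()

for p in [100, 400, 1600, 6400]:
    print(f"p = {p:5d}: max_(i != j) | ||x_i - x_j||^2/p - tau | = {max_deviation(p):.3f}")
```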

35–43. Application to Machine Learning
The curse of dimensionality and its consequences (3)

(Major) consequences:
- Most machine learning intuitions collapse
- But luckily, concentration of distances allows for Taylor expansion, linearization...
- This is where RMT kicks back in!

Theorem ([C-Benaych'16], Asymptotic Kernel Behavior). Under growth rate assumptions, as p, n → ∞,

K ≃ K̂ = (1/p) Z Z^T + J A J^T + ∗,  in the sense that ‖K − K̂‖ → 0 almost surely,

with J = [j_1, ..., j_k] ∈ R^{n×k}, j_a = (0, 1_{n_a}, 0)^T (the clusters!) and A ∈ R^{k×k} a function of:
- f(τ), f′(τ), f″(τ)
- ‖μ_a − μ_b‖, tr(C_a − C_b), tr((C_a − C_b)²), for a, b ∈ {1, ..., k}.

➫ This is a spiked model! We can study it fully! RMT can explain tools ML engineers use every day. (A sketch of this low-rank structure follows.)
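As an informal check of this low-rank picture (same assumed two-class model and Gaussian kernel as above), the sketch below looks at how the two dominant eigenvectors of K align with the constant vector and with the class-indicator direction:

```python
import numpy as np

# Informal check of the "spiked" structure (assumed two-class model, Gaussian kernel):
# how close is the dominant eigenvector of K to the constant vector 1_n/sqrt(n), and how
# close is the next one to the class-indicator direction carried by J?
p, n = 400, 1000
mu = np.zeros(p); mu[0] = 2.0
labels = np.repeat([0, 1], n // 2)
X = np.random.randn(n, p) + np.where(labels[:, None] == 0, -mu, mu)
norms = (X ** 2).sum(1)
K = np.exp(-(norms[:, None] + norms[None, :] - 2 * X @ X.T) / (2 * p))

vecs = np.linalg.eigh(K)[1]
ones_dir = np.ones(n) / np.sqrt(n)
class_dir = np.where(labels == 0, 1.0, -1.0) / np.sqrt(n)
print("|<top eigenvector, constant direction>| :", round(abs(vecs[:, -1] @ ones_dir), 3))
print("|<second eigenvector, class direction>| :", round(abs(vecs[:, -2] @ class_dir), 3))
```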

44–45. Application to Machine Learning
Theoretical Findings versus MNIST

Figure: Eigenvalues of K (red) and of K̂, the equivalent Gaussian model (white), MNIST data, p = 784, n = 192.

46–47. Application to Machine Learning
Theoretical Findings versus MNIST

Figure: Leading four eigenvectors of K for MNIST data (red) and theoretical findings (blue).

48–49. Application to Machine Learning
Theoretical Findings versus MNIST

Figure: 2D representation of the eigenvectors of K for the MNIST dataset (Eigenvector 2 vs. Eigenvector 1, and Eigenvector 3 vs. Eigenvector 2). Theoretical means and 1- and 2-standard deviations in blue. Class 1 in red, Class 2 in black, Class 3 in green.

50. Takeaway Message 2: “RMT Reassesses and Improves Data Processing”

51–55. Application to Machine Learning
Improving Kernel Spectral Clustering

Thanks to [C-Benaych'16]: possibility to improve kernels:
- by "focusing" kernels on the best discriminative statistics: tune f′(τ), f″(τ)
- by "killing" non-discriminative feature directions.

Example: covariance-based discrimination, kernel f(t) = exp(−t/2) versus f(t) = (t − τ)² (think about the surprising kernel shape!). A numerical sketch comparing the two follows.
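The comparison below is a rough, self-contained sketch of this idea (the equal-means, trace-differing-covariance model, the empirical proxy for τ, and the alignment score are illustrative assumptions, not the slides' experiment):

```python
import numpy as np

# Two classes with identical means and covariances differing only through their trace
# (assumed setting); compare f(t) = exp(-t/2) with f(t) = (t - tau)^2 by how well the
# top eigenvectors of K align with the class-indicator direction.
p, n = 400, 1000
labels = np.repeat([0, 1], n // 2)
scales = np.where(labels == 0, 1.0, np.sqrt(1 + 4 / np.sqrt(p)))  # tr(C_1 - C_2) = O(sqrt(p))
X = np.random.randn(n, p) * scales[:, None]

norms = (X ** 2).sum(1)
t = (norms[:, None] + norms[None, :] - 2 * X @ X.T) / p           # ||x_i - x_j||^2 / p
tau = t[~np.eye(n, dtype=bool)].mean()                            # empirical proxy for tau
j = np.where(labels == 0, 1.0, -1.0) / np.sqrt(n)                 # class-indicator direction

def alignment(K):
    vecs = np.linalg.eigh(K)[1]
    return max(abs(vecs[:, -1] @ j), abs(vecs[:, -2] @ j))        # best of the two top eigenvectors

print("f(t) = exp(-t/2)  :", round(alignment(np.exp(-t / 2)), 3))
print("f(t) = (t - tau)^2:", round(alignment((t - tau) ** 2), 3))
```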

56–59. Application to Machine Learning
Another, more striking, example: Semi-supervised Learning

Semi-supervised learning: a great idea that never worked!
- Setting: assume now
  - x_1^(a), ..., x_{n_a,[l]}^(a) already labelled (few),
  - x_{n_a,[l]+1}^(a), ..., x_{n_a}^(a) unlabelled (a lot).
- Machine learning original idea: find "scores" F_ia for x_i to belong to class a:

  F = argmin_{F ∈ R^{n×k}} Σ_{a=1}^k Σ_{i,j} K_ij (F_ia − F_ja)²,  subject to F_ia^{[l]} = δ_{x_i ∈ C_a}.

- Explicit solution:

  F_[u] = ( I_{n_[u]} − D_[u]^{−1} K_[uu] )^{−1} D_[u]^{−1} K_[ul] F_[l],

  where D = diag(K 1_n) (the degree matrix) and [ul], [uu], ... denote the blocks associated with labeled/unlabeled data. (A minimal implementation sketch follows.)
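Here is a minimal NumPy sketch of this explicit solution (the two-class Gaussian data, the Gaussian kernel, and the labelled/unlabelled split are illustrative assumptions):

```python
import numpy as np

# Minimal sketch of the explicit solution on this slide:
#   F_[u] = (I - D_[u]^{-1} K_[uu])^{-1} D_[u]^{-1} K_[ul] F_[l],   D = diag(K 1_n).
# Data model, kernel, and split sizes below are illustrative assumptions.
p, n, n_lab = 400, 1000, 64
mu = np.zeros(p); mu[0] = 2.0
labels = np.repeat([0, 1], n // 2)
X = np.random.randn(n, p) + np.where(labels[:, None] == 0, -mu, mu)
norms = (X ** 2).sum(1)
K = np.exp(-(norms[:, None] + norms[None, :] - 2 * X @ X.T) / (2 * p))

lab_idx = np.r_[0:n_lab // 2, n // 2:n // 2 + n_lab // 2]   # a few labelled points per class
unl_idx = np.setdiff1d(np.arange(n), lab_idx)
F_l = np.eye(2)[labels[lab_idx]]                            # one-hot scores on labelled data

D_u_inv = 1.0 / K.sum(1)[unl_idx]                           # D = diag(K 1_n), unlabelled block
A = np.eye(len(unl_idx)) - D_u_inv[:, None] * K[np.ix_(unl_idx, unl_idx)]
F_u = np.linalg.solve(A, D_u_inv[:, None] * (K[np.ix_(unl_idx, lab_idx)] @ F_l))

# Note: as the following slides show, the raw scores concentrate around a constant; for
# two classes, the per-row argmax below is equivalent to comparing the centered scores F°.
acc = (F_u.argmax(1) == labels[unl_idx]).mean()
print(f"unlabelled classification accuracy (argmax of scores): {acc:.2f}")
```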

60–62. Application to Machine Learning
The finite-dimensional intuition: What we expect

63–64. Application to Machine Learning
The reality: What we see!

Setting: p = 400, n = 1000, x_i ~ N(±μ, I_p). Kernel K_ij = exp(−‖x_i − x_j‖²/(2p)).
Display: scores F_ik (left) and F°_ik = F_ik − (1/2)(F_i1 + F_i2) (right).

➫ The scores are almost all identical... and do not follow the labelled data!

65–67. Application to Machine Learning
MNIST Data Example

Figure: Vectors [F^(u)]_{·,a}, a = 1, 2, 3 ([F^(u)]_{·,1}: zeros, [F^(u)]_{·,2}: ones, [F^(u)]_{·,3}: twos), for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, n_l/n = 1/16, Gaussian kernel.

68–75. Application to Machine Learning
Exploiting RMT to resurrect SSL

Consequences of the finite-dimensional "mismatch":
- A priori, the algorithm should not work
- Indeed, "in general" it does not!
- But, luckily, after some (not clearly motivated) renormalization, it works again...
- BUT it does not use the unlabelled data efficiently!

Chapelle, Schölkopf, Zien, "Semi-Supervised Learning", Chapter 4, 2009:
"Our concern is this: it is frequently the case that we would be better off just discarding the unlabeled data and employing a supervised method, rather than taking a semi-supervised route. Thus we worry about the embarrassing situation where the addition of unlabeled data degrades the performance of a classifier."

What RMT can do about it:
- Asymptotic performance analysis: a clear understanding of what we see!
