Random Matrix Advances in Machine Learning

(Imaging and Machine Learning) Mathematics Workshop #3, Institut Henri Poincaré

Romain Couillet
CentraleSupélec, L2S, University of Paris-Saclay, France
GSTATS IDEX DataScience Chair, GIPSA-lab, University Grenoble-Alpes, France

April 1st, 2019

Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models
Application to Machine Learning

Basics of Random Matrix Theory / Motivation: Large Sample Covariance Matrices

Context

Baseline scenario: $y_1, \dots, y_n \in \mathbb{C}^p$ (or $\mathbb{R}^p$) i.i.d. with $E[y_1] = 0$, $E[y_1 y_1^*] = C_p$:

◮ If $y_1 \sim \mathcal{N}(0, C_p)$, the ML estimator of $C_p$ is the sample covariance matrix (SCM)
$$\hat{C}_p = \frac{1}{n} Y_p Y_p^* = \frac{1}{n} \sum_{i=1}^n y_i y_i^*, \qquad Y_p = [y_1, \dots, y_n] \in \mathbb{C}^{p \times n}.$$
◮ If $n \to \infty$, then, by the strong law of large numbers, $\hat{C}_p \xrightarrow{a.s.} C_p$, or equivalently, in spectral norm, $\|\hat{C}_p - C_p\| \xrightarrow{a.s.} 0$.

Random Matrix Regime

◮ No longer valid if $p, n \to \infty$ with $p/n \to c \in (0, \infty)$: $\|\hat{C}_p - C_p\| \not\to 0$.
◮ For practical $p, n$ with $p \simeq n$, this leads to dramatically wrong conclusions.
◮ Even for $p = n/100$.
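The contrast between the two regimes is easy to check numerically. Below is a minimal numpy sketch (an added illustration, not part of the original slides): it estimates $\|\hat{C}_p - C_p\|$ for $C_p = I_p$, first with $p$ fixed and $n$ growing (the error vanishes), then with $p/n = 1/4$ fixed (the error stabilizes around a non-zero value).

```python
import numpy as np

rng = np.random.default_rng(0)

def scm_spectral_error(p, n):
    """Spectral-norm error ||C_hat_p - C_p|| for C_p = I_p and Gaussian samples."""
    Y = rng.standard_normal((p, n))      # columns y_i ~ N(0, I_p)
    C_hat = Y @ Y.T / n                  # sample covariance matrix
    return np.linalg.norm(C_hat - np.eye(p), 2)

# Classical regime: p fixed, n -> infinity, the error vanishes
for n in (100, 1000, 10000):
    print(f"p=10,  n={n:5d}: ||C_hat - C|| = {scm_spectral_error(10, n):.3f}")

# Random matrix regime: p/n = 1/4 fixed, the error does not vanish
for p in (50, 200, 800):
    print(f"p={p:3d}, n={4*p:5d}: ||C_hat - C|| = {scm_spectral_error(p, 4*p):.3f}")
```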

The Marčenko–Pastur law

[Figure: Histogram of the eigenvalues of $\hat{C}_p$ for $c = 1/4$, $C_p = I_p$, shown for growing dimensions $(p, n) = (50, 200), (100, 400), (250, 1000), (500, 2000), (1000, 4000)$.]

The Marčenko–Pastur law

Definition (Empirical Spectral Density)
The empirical spectral density (e.s.d.) $\mu_p$ of a Hermitian matrix $A_p \in \mathbb{C}^{p \times p}$ is
$$\mu_p = \frac{1}{p} \sum_{i=1}^{p} \delta_{\lambda_i(A_p)}.$$

Theorem (Marčenko–Pastur Law [Marčenko, Pastur '67])
Let $X_p \in \mathbb{C}^{p \times n}$ have i.i.d. zero mean, unit variance entries. As $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, the e.s.d. $\mu_p$ of $\frac{1}{n} X_p X_p^*$ satisfies $\mu_p \xrightarrow{a.s.} \mu_c$ weakly, where
◮ $\mu_c(\{0\}) = \max\{0, 1 - c^{-1}\}$;
◮ on $(0, \infty)$, $\mu_c$ has a continuous density $f_c$ supported on $[(1 - \sqrt{c})^2, (1 + \sqrt{c})^2]$,
$$f_c(x) = \frac{1}{2 \pi c x} \sqrt{(x - (1 - \sqrt{c})^2)\,((1 + \sqrt{c})^2 - x)}.$$

[Figure: Marčenko–Pastur density $f_c(x)$ for different limit ratios $c = \lim_{p \to \infty} p/n$, here $c = 0.1, 0.2, 0.5$.]
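A short numpy/matplotlib sketch reproducing these pictures (an added illustration; the chosen $p$, $n$ are arbitrary): sample $X_p$ with i.i.d. entries, compute the eigenvalues of $\frac{1}{n} X_p X_p^*$, and overlay the Marčenko–Pastur density $f_c$.

```python
import numpy as np
import matplotlib.pyplot as plt

p, n = 500, 2000                       # c = p/n = 1/4
c = p / n
rng = np.random.default_rng(0)

X = rng.standard_normal((p, n))        # i.i.d. zero mean, unit variance entries
eigs = np.linalg.eigvalsh(X @ X.T / n)

# Marcenko-Pastur density on its support [(1 - sqrt(c))^2, (1 + sqrt(c))^2]
lm, lp = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
x = np.linspace(lm, lp, 500)
f_c = np.sqrt((x - lm) * (lp - x)) / (2 * np.pi * c * x)

plt.hist(eigs, bins=50, density=True, alpha=0.5, label="eigenvalues of $\\hat{C}_p$")
plt.plot(x, f_c, "r", label="Marcenko-Pastur density $f_c$")
plt.legend()
plt.show()
```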

Basics of Random Matrix Theory / Spiked Models

Spiked Models

Small rank perturbation: $C_p = I_p + P$, $P$ of low rank.

[Figure: Eigenvalues of $\frac{1}{n} Y_p Y_p^T$ for $p/n = 1/4, 1/2, 1, 2$ ($p = 500$), with $\mathrm{eig}(C_p) = \{\underbrace{1, \dots, 1}_{p-4}, 2, 3, 4, 5\}$: depending on $p/n$, the four spikes either separate from the bulk or are absorbed by it.]

Spiked Models

Theorem (Eigenvalues [Baik, Silverstein '06])
Let $Y_p = C_p^{1/2} X_p$, with
◮ $X_p$ having i.i.d. zero mean, unit variance entries with $E[|(X_p)_{ij}|^4] < \infty$;
◮ $C_p = I_p + P$, $P = U \Omega U^*$, where, for $K$ fixed, $\Omega = \mathrm{diag}(\omega_1, \dots, \omega_K) \in \mathbb{R}^{K \times K}$, with $\omega_1 \ge \dots \ge \omega_K > 0$.
Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, denoting $\lambda_m = \lambda_m(\frac{1}{n} Y_p Y_p^*)$ (with $\lambda_m > \lambda_{m+1}$),
$$\lambda_m \xrightarrow{a.s.} \begin{cases} 1 + \omega_m + c\,\dfrac{1 + \omega_m}{\omega_m} \;>\; (1 + \sqrt{c})^2, & \omega_m > \sqrt{c}, \\ (1 + \sqrt{c})^2, & \omega_m \in (0, \sqrt{c}]. \end{cases}$$
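A numerical check of this phase transition (an added sketch under the same model, with a single spike along the first canonical basis vector; values fluctuate at finite $p$):

```python
import numpy as np

p, n = 500, 1000                      # c = 1/2, threshold sqrt(c) ~ 0.707
c = p / n
rng = np.random.default_rng(0)

for omega in (0.3, 1.0, 3.0):         # below / above the threshold
    C_sqrt = np.eye(p)
    C_sqrt[0, 0] = np.sqrt(1 + omega)  # C_p = I_p + omega * e_1 e_1^T
    Y = C_sqrt @ rng.standard_normal((p, n))
    lam1 = np.linalg.eigvalsh(Y @ Y.T / n)[-1]
    if omega > np.sqrt(c):
        limit = 1 + omega + c * (1 + omega) / omega
    else:
        limit = (1 + np.sqrt(c)) ** 2  # spike absorbed: bulk edge
    print(f"omega={omega:.1f}: lambda_1 = {lam1:.3f}, predicted limit = {limit:.3f}")
```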

Spiked Models

Theorem (Eigenvectors [Paul '07])
Let $Y_p = C_p^{1/2} X_p$, with
◮ $X_p$ having i.i.d. zero mean, unit variance entries with $E[|(X_p)_{ij}|^4] < \infty$;
◮ $C_p = I_p + P$, $P = U \Omega U^* = \sum_{i=1}^{K} \omega_i u_i u_i^*$, $\omega_1 > \dots > \omega_K > 0$.
Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, for $a, b \in \mathbb{C}^p$ deterministic and $\hat{u}_i$ the eigenvector associated with $\lambda_i(\frac{1}{n} Y_p Y_p^*)$,
$$a^* \hat{u}_i \hat{u}_i^* b \;-\; \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}}\, a^* u_i u_i^* b \cdot 1_{\omega_i > \sqrt{c}} \;\xrightarrow{a.s.}\; 0.$$
In particular,
$$|\hat{u}_i^* u_i|^2 \;\xrightarrow{a.s.}\; \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}} \cdot 1_{\omega_i > \sqrt{c}}.$$

Spiked Models

[Figure: Simulated versus limiting $|\hat{u}_1^T u_1|^2$ for $Y_p = C_p^{1/2} X_p$, $C_p = I_p + \omega_1 u_1 u_1^T$, $p/n = 1/3$, varying $\omega_1$; simulations for $p = 100, 200, 400$ against the limit $\frac{1 - c/\omega_1^2}{1 + c/\omega_1}$.]
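The curve in this figure can be reproduced with a few lines (an added sketch in the same setting; the value of $p$ is a moderate placeholder):

```python
import numpy as np

p, n = 300, 900                         # c = 1/3, as in the figure
c = p / n
rng = np.random.default_rng(1)
u = np.zeros(p); u[0] = 1.0             # population spike direction u_1

for omega in (0.4, 1.0, 2.0, 4.0):
    C_sqrt = np.eye(p)
    C_sqrt[0, 0] = np.sqrt(1 + omega)   # C_p^{1/2} for C_p = I_p + omega u_1 u_1^T
    Y = C_sqrt @ rng.standard_normal((p, n))
    _, vecs = np.linalg.eigh(Y @ Y.T / n)
    u_hat = vecs[:, -1]                 # sample eigenvector of the largest eigenvalue
    align = float(u_hat @ u) ** 2
    limit = (1 - c / omega**2) / (1 + c / omega) if omega > np.sqrt(c) else 0.0
    print(f"omega={omega:.1f}: |u_hat.u1|^2 = {align:.3f}, predicted limit = {limit:.3f}")
```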

Other Spiked Models

Similar results hold for multiple matrix models:
◮ $Y_p = \frac{1}{n} (I + P)^{1/2} X_p X_p^* (I + P)^{1/2}$
◮ $Y_p = \frac{1}{n} X_p X_p^* + P$
◮ $Y_p = \frac{1}{n} X_p^* (I + P) X_p$
◮ $Y_p = \frac{1}{n} (X_p + P)^* (X_p + P)$
◮ etc.

Application to Machine Learning

An adventurous venue

Machine Learning is not "Simple Linear Statistics":
◮ data are data... and are not easily modeled
◮ machine learning algorithms involve non-linear functions, difficult to analyze
◮ recent trends go towards highly complex computer-science oriented methods: deep neural nets.

What can we say about those?
◮ Much more than we think, and actually much more than has been said so far!
◮ Key observation 1: In "non-trivial" (not so) large dimensional settings, machine learning intuitions break down!
◮ Key observation 2: In these "non-trivial" settings, RMT explains a lot of things and can improve algorithms!
◮ Key observation 3: Universality goes a long way: RMT findings are compliant with real data observations!

Takeaway Message 1 “RMT Explains Why Machine Learning Intuitions Collapse in Large Dimensions”

The curse of dimensionality and its consequences

Clustering setting in (not so) large $n, p$:
◮ GMM setting: $x^{(a)}_1, \dots, x^{(a)}_{n_a} \sim \mathcal{N}(\mu_a, C_a)$, $a = 1, \dots, k$
◮ Non-trivial task: $\|\mu_a - \mu_b\| = O(1)$, $\mathrm{tr}(C_a - C_b) = O(\sqrt{p})$, $\mathrm{tr}[(C_a - C_b)^2] = O(p)$ (non-trivial because otherwise too easy or too hard)

Classical method: spectral clustering
◮ Extract and cluster the dominant eigenvectors of $K = \{\kappa(x_i, x_j)\}_{i,j=1}^{n}$, $\kappa(x_i, x_j) = f\left(\frac{1}{p}\|x_i - x_j\|^2\right)$ (see the sketch below).
◮ Why? Finite-dimensional intuition.
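As a concrete illustration of this pipeline (an added sketch, not from the slides; the data model is the two-class example used later in the deck, and the clustering step is reduced to thresholding the informative eigenvector):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 400, 1000
mu = np.zeros(p); mu[0] = 2.0                         # class means +/- mu, ||mu_1 - mu_2|| = O(1)
labels = rng.integers(0, 2, n)
X = rng.standard_normal((n, p)) + np.outer(2 * labels - 1, mu)

# Kernel K_ij = f(||x_i - x_j||^2 / p) with f(t) = exp(-t/2)
G = X @ X.T
sq = (np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G) / p
K = np.exp(-sq / 2)

# Spectral clustering step: dominant eigenvectors of K.
# The top eigenvector is close to the constant vector; the class structure,
# when visible, appears in the next dominant eigenvector.
_, vecs = np.linalg.eigh(K)
v2 = vecs[:, -2]
pred = (v2 > np.median(v2)).astype(int)
acc = max(np.mean(pred == labels), np.mean(pred != labels))
print(f"accuracy from the second dominant eigenvector: {acc:.2f}")
```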

The curse of dimensionality and its consequences (2)

In reality, here is what happens... Kernel $K_{ij} = \exp(-\frac{1}{2p}\|x_i - x_j\|^2)$ and second eigenvector $v_2$ ($x_i \sim \mathcal{N}(\pm\mu, I_p)$, $\mu = (2, 0, \dots, 0)^T \in \mathbb{R}^p$).

Key observation: Under growth rate assumptions,
$$\max_{1 \le i \ne j \le n} \left| \frac{1}{p}\|x_i - x_j\|^2 - \tau \right| \xrightarrow{a.s.} 0, \qquad \tau = \frac{2}{p} \sum_{a=1}^{k} \frac{n_a}{n} \mathrm{tr}\, C_a.$$
◮ this suggests $K \simeq f(\tau) 1_n 1_n^T$!
◮ more importantly, in non-trivial settings, data are neither close, nor far!
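This distance concentration is easy to verify numerically (an added sketch with $C_a = I_p$, so that $\tau = 2$; the point is that the maximal deviation over all pairs shrinks as the dimension grows):

```python
import numpy as np

rng = np.random.default_rng(0)
for p in (100, 400, 1600):
    n = p                                            # keep p/n fixed
    mu = np.zeros(p); mu[0] = 2.0
    labels = rng.integers(0, 2, n)
    X = rng.standard_normal((n, p)) + np.outer(2 * labels - 1, mu)

    G = X @ X.T
    sq = (np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G) / p
    off_diag = sq[~np.eye(n, dtype=bool)]            # all i != j pairs
    tau = 2.0                                        # (2/p) * sum_a (n_a/n) tr C_a, with C_a = I_p
    print(f"p = {p:4d}: max_ij |d_ij^2/p - tau| = {np.abs(off_diag - tau).max():.3f}")
```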

The curse of dimensionality and its consequences (3)

(Major) consequences:
◮ Most machine learning intuitions collapse
◮ But luckily, concentration of distances allows for Taylor expansion, linearization...
◮ This is where RMT kicks back in!

Theorem ([C-Benaych'16] Asymptotic Kernel Behavior)
Under growth rate assumptions, as $p, n \to \infty$,
$$\|K - \hat{K}\| \xrightarrow{a.s.} 0, \qquad \hat{K} \simeq \frac{1}{p} Z Z^T + J A J^T + *,$$
with $J = [j_1, \dots, j_k] \in \mathbb{R}^{n \times k}$, $j_a = (0, 1_{n_a}, 0)^T$ (the clusters!) and $A \in \mathbb{R}^{k \times k}$ a function of:
◮ $f(\tau)$, $f'(\tau)$, $f''(\tau)$
◮ $\|\mu_a - \mu_b\|$, $\mathrm{tr}(C_a - C_b)$, $\mathrm{tr}((C_a - C_b)^2)$, for $a, b \in \{1, \dots, k\}$.

➫ This is a spiked model! We can study it fully! RMT can explain tools ML engineers use every day.

Theoretical Findings versus MNIST

[Figure: Eigenvalues of $K$ (red) and of the equivalent Gaussian-model $\hat{K}$ (white), MNIST data, $p = 784$, $n = 192$.]
slide-71
SLIDE 71

Application to Machine Learning/ 22/41

Theoretical Findings versus MNIST

Figure: Leading four eigenvectors of K for MNIST data (red) and theoretical findings (blue).

22 / 41

[Figure: Leading four eigenvectors of $K$ for MNIST data (red) and theoretical findings (blue).]

[Figure: 2D representation of the eigenvectors of $K$ for the MNIST dataset (Eigenvector 2 vs. Eigenvector 1, Eigenvector 3 vs. Eigenvector 2). Theoretical means and 1- and 2-standard deviations in blue; Class 1 in red, Class 2 in black, Class 3 in green.]

Takeaway Message 2 “RMT Reassesses and Improves Data Processing”

Improving Kernel Spectral Clustering

Thanks to [C-Benaych'16]: possibility to improve kernels:
◮ by "focusing" kernels on the best discriminative statistics: tune $f'(\tau)$, $f''(\tau)$
◮ by "killing" non-discriminative feature directions.

Example: covariance-based discrimination, kernel $f(t) = \exp(-\frac{1}{2}t)$ versus $f(t) = (t - \tau)^2$ (think about the surprising kernel shape!).

Another, more striking, example: Semi-supervised Learning

Semi-supervised learning: a great idea that never worked!
◮ Setting: assume now
  ◮ $x^{(a)}_1, \dots, x^{(a)}_{n_{a,[l]}}$ already labelled (few),
  ◮ $x^{(a)}_{n_{a,[l]}+1}, \dots, x^{(a)}_{n_a}$ unlabelled (a lot).
◮ Machine Learning original idea: find "scores" $F_{ia}$ for $x_i$ to belong to class $\mathcal{C}_a$:
$$F = \operatorname{argmin}_{F \in \mathbb{R}^{n \times k}} \sum_{a=1}^{k} \sum_{i,j} K_{ij} \left( F_{ia} - F_{ja} \right)^2, \qquad F^{[l]}_{ia} = \delta_{\{x_i \in \mathcal{C}_a\}}.$$
◮ Explicit solution:
$$F^{[u]} = \left( I_{n_{[u]}} - D_{[u]}^{-1} K_{[uu]} \right)^{-1} D_{[u]}^{-1} K_{[ul]} F^{[l]},$$
where $D = \mathrm{diag}(K 1_n)$ (degree matrix) and $[ul], [uu], \dots$ denote the blocks associated with labelled/unlabelled data.
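A direct implementation of this closed-form solution (an added sketch; the data are the toy two-class Gaussian mixture of the previous section, and the specific sizes are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, n_l = 400, 1000, 20                 # few labelled points, many unlabelled
mu = np.zeros(p); mu[0] = 2.0
labels = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, p)) + np.outer(2 * labels - 1, mu)

# Gaussian kernel K_ij = exp(-||x_i - x_j||^2 / (2p))
G = X @ X.T
sq = (np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G) / p
K = np.exp(-sq / 2)

# first n_l/2 points of each class are labelled, the rest are unlabelled
idx_l = np.concatenate([np.arange(n_l // 2), n // 2 + np.arange(n_l // 2)])
idx_u = np.setdiff1d(np.arange(n), idx_l)

F_l = np.eye(2)[labels[idx_l]]                     # one-hot labels: F^[l]_{ia} = delta_{x_i in C_a}
D_u_inv = np.diag(1.0 / K[idx_u].sum(axis=1))      # D = diag(K 1_n), unlabelled block
K_uu = K[np.ix_(idx_u, idx_u)]
K_ul = K[np.ix_(idx_u, idx_l)]

F_u = np.linalg.solve(np.eye(len(idx_u)) - D_u_inv @ K_uu, D_u_inv @ K_ul @ F_l)
# As stressed on the next slides, these raw scores come out nearly identical across points
print("mean score per class:", F_u.mean(axis=0), " spread:", F_u.std(axis=0))
```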

The finite-dimensional intuition: What we expect
The reality: What we see!

Setting: $p = 400$, $n = 1000$, $x_i \sim \mathcal{N}(\pm\mu, I_p)$. Kernel $K_{ij} = \exp(-\frac{1}{2p}\|x_i - x_j\|^2)$.
Display: Scores $F_{ik}$ (left) and $F^{\circ}_{ik} = F_{ik} - \frac{1}{2}(F_{i1} + F_{i2})$ (right).

➫ Scores are almost all identical... and do not follow the labelled data!

MNIST Data Example

[Figure: Vectors $[F^{(u)}]_{\cdot,a}$, $a = 1, 2, 3$ (zeros, ones, twos), for 3-class MNIST data, $n = 192$, $p = 784$, $n_l/n = 1/16$, Gaussian kernel.]

Exploiting RMT to resurrect SSL

Consequences of the finite-dimensional "mismatch":
◮ A priori, the algorithm should not work
◮ Indeed, "in general" it does not!
◮ But, luckily, after some (not clearly motivated) renormalization, it works again...
◮ BUT it does not use the unlabelled data efficiently!

Chapelle, Schölkopf, Zien, "Semi-Supervised Learning", Chapter 4, 2009:
"Our concern is this: it is frequently the case that we would be better off just discarding the unlabeled data and employing a supervised method, rather than taking a semi-supervised route. Thus we worry about the embarrassing situation where the addition of unlabeled data degrades the performance of a classifier."

What RMT can do about it:
◮ Asymptotic performance analysis: clear understanding of what we see!
◮ Update the algorithm and provably improve the use of unlabelled data.

Asymptotic Performance Analysis

Theorem ([Mai,C'18] Asymptotic Performance of SSL)
For $x_i \in \mathcal{C}_b$ unlabelled, the score vector $F_{i,\cdot} \in \mathbb{R}^k$ satisfies
$$F_{i,\cdot} - G_b \to 0, \qquad G_b \sim \mathcal{N}(m_b, \Sigma_b),$$
with $m_b \in \mathbb{R}^k$, $\Sigma_b \in \mathbb{R}^{k \times k}$ functions of $f(\tau), f'(\tau), f''(\tau), \mu_1, \dots, \mu_k, C_1, \dots, C_k$.

Most importantly: $m_b$, $\Sigma_b$ are independent of $n_u$ (the number of unlabelled data).

Solution: From RMT calculus (but not from ML intuition!), the solution is to replace $K$ by
$$\tilde{K} \equiv P K P, \qquad P = I_n - \frac{1}{n} 1_n 1_n^T.$$
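In code, this correction is a one-line change on top of the SSL sketch given earlier (again an added sketch; the function name is ours):

```python
import numpy as np

def center_kernel(K):
    """Return K_tilde = P K P with P = I_n - (1/n) 1_n 1_n^T (the RMT-motivated correction)."""
    n = K.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n
    return P @ K @ P

# Usage: compute the SSL scores F^[u] exactly as before, but with center_kernel(K)
# in place of K; per [Mai,C'18], the scores then genuinely benefit from unlabelled data.
```

The resulting classification performances are compared in the tables below.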

Experimental evidence: MNIST

Digits                              (0,8)      (2,7)      (6,9)
nu = 100
  Centered kernel (RMT)             89.5±3.6   89.5±3.4   85.3±5.9
  Iterated centered kernel (RMT)    89.5±3.6   89.5±3.4   85.3±5.9
  Laplacian                         75.5±5.6   74.2±5.8   70.0±5.5
  Iterated Laplacian                87.2±4.7   86.0±5.2   81.4±6.8
  Manifold                          88.0±4.7   88.4±3.9   82.8±6.5
nu = 1000
  Centered kernel (RMT)             92.2±0.9   92.5±0.8   92.6±1.6
  Iterated centered kernel (RMT)    92.3±0.9   92.5±0.8   92.9±1.4
  Laplacian                         65.6±4.1   74.4±4.0   69.5±3.7
  Iterated Laplacian                92.2±0.9   92.4±0.9   92.0±1.6
  Manifold                          91.1±1.7   91.4±1.9   91.4±2.0

Table: Comparison of classification accuracy (%) on MNIST datasets with nl = 10. Computed over 1000 random iterations for nu = 100 and 100 for nu = 1000.

Experimental evidence: Traffic signs (HOG features)

Class ID                            (2,7)       (9,10)      (11,18)
nu = 100
  Centered kernel (RMT)             79.0±10.4   77.5±9.2    78.5±7.1
  Iterated centered kernel (RMT)    85.3±5.9    89.2±5.6    90.1±6.7
  Laplacian                         73.8±9.8    77.3±9.5    78.6±7.2
  Iterated Laplacian                83.7±7.2    88.0±6.8    87.1±8.8
  Manifold                          77.6±8.9    81.4±10.4   82.3±10.8
nu = 1000
  Centered kernel (RMT)             83.6±2.4    84.6±2.4    88.7±9.4
  Iterated centered kernel (RMT)    84.8±3.8    88.0±5.5    96.4±3.0
  Laplacian                         72.7±4.2    88.9±5.7    95.8±3.2
  Iterated Laplacian                83.0±5.5    88.2±6.0    92.7±6.1
  Manifold                          77.7±5.8    85.0±9.0    90.6±8.1

Table: Comparison of classification accuracy (%) on German Traffic Sign datasets with nl = 10. Computed over 1000 random iterations for nu = 100 and 100 for nu = 1000.

Takeaway Message 3 “RMT Also Grasps ‘Real Data’ Processing”

From i.i.d. to concentrated random vectors

Current Problem. Data models based on vectors of i.i.d. entries (or even only Gaussian).
Good news. In RMT, exploitation of the time and feature dimensions brings universality, i.e., only the first moments matter, irrespective of the distribution.
The Solution? Concentrated random vectors go a long way beyond!

Definition (Concentrated Random Vector)
$x \in \mathbb{R}^p$ is a concentrated random vector if, for all Lipschitz $f : \mathbb{R}^p \to \mathbb{R}$, there exists $m_f \in \mathbb{R}$ such that
$$P\big( |f(x) - m_f| > \varepsilon \big) \le e^{-g(\varepsilon)}, \qquad g \text{ an increasing function}.$$

Theorem ([Louart,C'18] [Seddik,C'19] Kernel Universality)
For $x_i \sim \mathcal{L}(\mu_a, C_a)$ concentrated random vectors, under the conditions of [C-Benaych'16],
$$\|K - \hat{K}\| \xrightarrow{a.s.} 0, \qquad \hat{K} \simeq \frac{1}{p} Z Z^T + J A J^T + *,$$
with $A$ only dependent on $f(\tau), f'(\tau), f''(\tau), \mu_1, \dots, \mu_k, C_1, \dots, C_k$.

➫ Same result as [C-Benaych'16]... Universality of the first two moments!
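A minimal illustration of this universality (an added sketch with assumed placeholder choices: the "concentrated" data are an entrywise nonlinearity applied to a linear mixture of Gaussian entries, hence a Lipschitz map of a Gaussian vector, and they are compared with Gaussian data matching their first two moments):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
p, n = 400, 1200
W = rng.standard_normal((p, p)) / np.sqrt(p)

# "Concentrated" data: a Lipschitz map of Gaussian vectors (GAN generators are of this kind)
Z = rng.standard_normal((p, n))
X_conc = np.tanh(W @ Z)

# Gaussian-equivalent data: same mean and covariance as X_conc
m = X_conc.mean(axis=1, keepdims=True)
C = np.cov(X_conc)
X_gauss = m + np.linalg.cholesky(C + 1e-10 * np.eye(p)) @ rng.standard_normal((p, n))

def heat_kernel(X):
    """K_ij = exp(-||x_i - x_j||^2 / (2p)) for data stored as columns of X."""
    G = X.T @ X
    sq = (np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G) / p
    return np.exp(-sq / 2)

# Per the universality theorem, the two kernel spectra should essentially coincide
for name, data in [("concentrated (tanh of Gaussian)", X_conc), ("Gaussian equivalent", X_gauss)]:
    eigs = np.linalg.eigvalsh(heat_kernel(data))
    plt.hist(eigs[:-1], bins=60, density=True, alpha=0.5, label=name)  # drop the top (constant) eigenvalue
plt.legend(); plt.show()
```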

Ok... so what?

Key Finding. Real images are "almost" concentrated random vectors!
Example: GAN-generated images are concentrated random vectors (a GAN generator is a Lipschitz map applied to a Gaussian input, and Lipschitz maps preserve concentration)!

Ok... so what? (2)

Results. [Seddik,C'19]

Conclusion

Reminder of the takeaway messages.

The road ahead:
◮ getting away from GMM models and showing universality results (concentration of measure arguments)
◮ generalizing the approach to problems having non-explicit solutions (such as convex optimization problems)
◮ deep learning, recurrent neural nets... are a very different story!

The End

Thank you!

References

[C-Benaych'16] R. Couillet, F. Benaych-Georges, "Kernel Spectral Clustering of Large Dimensional Data", Electronic Journal of Statistics, vol. 10, no. 1, pp. 1393-1454, 2016.
[Mai,C'18] X. Mai, R. Couillet, "A random matrix analysis and improvement of semi-supervised learning for large dimensional data", Journal of Machine Learning Research, 2017 (in press).
[Louart,C'18] C. Louart, Z. Liao, R. Couillet, "A Random Matrix Approach to Neural Networks", The Annals of Applied Probability, vol. 28, no. 2, pp. 1190-1248, 2018.
[Seddik,C'19] M. Seddik, M. Tamaazousti, R. Couillet, "Kernel Random Matrices of Large Concentrated Data: The Example of GAN-Generated Images", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'19), Brighton, UK, 2019.
H. Tiomoko Ali, R. Couillet, "Improved spectral community detection in large heterogeneous networks", Journal of Machine Learning Research, vol. 18, no. 225, pp. 1-49, 2018.
R. Couillet, M. Tiomoko, S. Zozor, E. Moisan, "Random matrix-improved estimation of covariance matrix distances", Journal of Multivariate Analysis, 2018 (submitted).
Z. Liao, R. Couillet, "A Large Dimensional Analysis of Least Squares Support Vector Machines", IEEE Transactions on Signal Processing, vol. 67, no. 4, pp. 1065-1074, 2018.