

SLIDE 1

Generalisation error in learning with random features and the hidden manifold model

ICML 2020

  • B. Loureiro (IPhT)
  • M. Mézard (ENS)
  • L. Zdeborová (IPhT)
  • F. Krzakala (ENS)
  • F. Gerace (IPhT)

SLIDE 2

TRAINING A NEURAL NET

EXPECTATIONS

[Figure: |bias| and variance curves vs. model complexity (error).]

SLIDE 3

TRAINING A NEURAL NET

EXPECTATIONS vs. REALITY

[Figure: |bias| and variance vs. model complexity (error), expectations next to the measured behaviour; Geiger et al. '18.]

See also [Geman et al. '92; Opper '95; Neyshabur, Tomioka, Srebro 2015; Advani, Saxe 2017; Belkin, Hsu, Ma, Mandal 2019; Nakkiran et al. 2019].

SLIDE 4

The usual suspects

SLIDE 5

The usual suspects

architecture

SLIDE 6

The usual suspects

architecture algorithms

SLIDE 7

The usual suspects

architecture algorithms data

SLIDE 8

DATA

The two theory cultures

[Figure: what worst-case analysis thinks the data looks like, what typical-case analysis thinks it looks like, and what it really looks like.]

SLIDE 9

Spoiler

[Figure: feature space vs. input space.]


SLIDE 14

Worst-case vs. typical-case: A concrete example

SLIDE 15

Concrete example

[Abbara, Aubin, Krzakala, Zdeborová '19]

Dataset $\mathcal{D} = \{x^\mu, y^\mu\}_{\mu=1}^{n}$ with inputs $x^\mu \sim \mathcal{N}(0, I_d)$ and labels $y^\mu = \operatorname{sign}(x^\mu \cdot \theta_0)$.

Rademacher bound for the function class $f_\theta(x) = \operatorname{sign}(x \cdot \theta)$ vs. out-of-the-box logistic regression (scikit-learn), as sketched below.

[Figure: generalisation error $\epsilon_g$ vs. sample complexity $\alpha$ = # datapoints / # dimensions, comparing the Rademacher bound, the Bayes-optimal error, and logistic regression (simulation, cross-validation).]
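A minimal simulation sketch of this experiment (not the authors' code; it assumes the teacher/student setup above and uses scikit-learn's default logistic regression, with the error estimated on a held-out test set):

```python
# Minimal sketch of the slide's experiment: Gaussian inputs, labels from a sign
# teacher, out-of-the-box logistic regression, generalisation error vs. alpha = n/d.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 200
theta0 = rng.standard_normal(d)            # teacher vector theta_0

def dataset(n):
    X = rng.standard_normal((n, d))        # x^mu ~ N(0, I_d)
    y = np.sign(X @ theta0)                # y^mu = sign(x^mu . theta_0)
    return X, y

X_test, y_test = dataset(20_000)
for alpha in [0.5, 1.0, 2.0, 5.0, 10.0]:
    X_train, y_train = dataset(int(alpha * d))
    clf = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
    eps_g = np.mean(clf.predict(X_test) != y_test)
    print(f"alpha = {alpha:5.1f}   generalisation error ~ {eps_g:.3f}")
```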

SLIDE 16

Can we do better?

SLIDE 17

Hidden Manifold Model

[Goldt, Mézard, Krzakala, Zdeborová ‘19]

Idea: a dataset where both the data points and the labels depend only on a subset of latent variables.

[Figure: feature space vs. input space.]

$$x^\mu = \sigma\!\left(\frac{F^\top c^\mu}{\sqrt{d}}\right)$$
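A short sketch of how such a dataset can be generated (the Gaussian latent distribution and the sign-teacher label rule below are illustrative assumptions; the slide only specifies the input map):

```python
# Sketch of a hidden-manifold dataset: inputs x^mu = sigma(F^T c^mu / sqrt(d)),
# generated from low-dimensional latent vectors c^mu; the Gaussian latents and
# the sign teacher on the latent space are assumptions made for illustration.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d, p, n = 20, 200, 1000                    # latent dim, input dim, samples

F = rng.standard_normal((d, p))            # fixed projection (here: Gaussian i.i.d.)
theta0 = rng.standard_normal(d)            # hypothetical teacher on the latent space

C = rng.standard_normal((n, d))            # latent vectors c^mu (assumed Gaussian)
X = erf(C @ F / np.sqrt(d))                # inputs x^mu = sigma(F^T c^mu / sqrt(d)), sigma = erf
y = np.sign(C @ theta0 / np.sqrt(d))       # assumed labels depending only on the latents
print(X.shape, y.shape)                    # (1000, 200) (1000,)
```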

SLIDE 18

Aim: study classification and regression tasks on this dataset

SLIDE 19

The task

Learn the labels using a linear model trained by empirical risk minimisation, with a loss function $\ell$ and a ridge penalty $\frac{\lambda}{2}\lVert w \rVert_2^2$. Examples (sketched in code after this list):

  • Ridge regression: $\ell(y, x) = \frac{1}{2}(y - x)^2$
  • Logistic regression: $\ell(y, x) = \log(1 + e^{-yx})$
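A hedged sketch of the two example tasks on the dataset from the previous snippet, using scikit-learn's off-the-shelf ridge and l2-penalised logistic solvers in place of a dedicated ERM implementation:

```python
# Empirical risk minimisation with a linear model on hidden-manifold inputs:
# square loss + ridge penalty (Ridge) and logistic loss + l2 penalty (LogisticRegression).
import numpy as np
from scipy.special import erf
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(1)
d, p, n = 20, 200, 1000
F = rng.standard_normal((d, p))
theta0 = rng.standard_normal(d)

def hmm_data(n):
    C = rng.standard_normal((n, d))
    X = erf(C @ F / np.sqrt(d))
    y = np.sign(C @ theta0 / np.sqrt(d))   # assumed sign teacher on the latent space
    return X, y

X_train, y_train = hmm_data(n)
X_test, y_test = hmm_data(5_000)

ridge = Ridge(alpha=1.0).fit(X_train, y_train)                 # square loss + ridge penalty
err_ridge = np.mean(np.sign(ridge.predict(X_test)) != y_test)  # threshold the regressor

logit = LogisticRegression(C=1.0, max_iter=2_000).fit(X_train, y_train)  # logistic loss + l2
err_logit = np.mean(logit.predict(X_test) != y_test)

print(f"ridge regression error ~ {err_ridge:.3f}   logistic regression error ~ {err_logit:.3f}")
```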

SLIDE 20

Two alternative points of view

[Williams '98; Rahimi, Recht '07; Mei, Montanari '19]

Dataset $\mathcal{D} = \{c^\mu, y^\mu\}_{\mu=1}^{n}$

SLIDE 21

Two alternative points of view

[Williams '98; Rahimi, Recht '07; Mei, Montanari '19]

Dataset $\mathcal{D} = \{c^\mu, y^\mu\}_{\mu=1}^{n}$

Feature map: $\Phi_F(c) = \sigma(F^\top c)$

$$\Phi_F(c) \cdot \Phi_F(c') \xrightarrow[\,p \to \infty\,]{} K(c, c') \qquad \text{(Mercer's theorem)}$$
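A quick numerical illustration of the second point of view (an assumed Monte Carlo check, not from the talk): the rescaled random-feature inner product concentrates as the number of features p grows, as the kernel limit suggests.

```python
# Check that (1/p) * Phi_F(c) . Phi_F(c') fluctuates less and less as p grows,
# consistent with convergence to a deterministic kernel K(c, c') when p -> infinity.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d = 50
c, c_prime = rng.standard_normal(d), rng.standard_normal(d)

def rf_inner(p, n_trials=50):
    vals = []
    for _ in range(n_trials):
        F = rng.standard_normal((d, p)) / np.sqrt(d)   # random first layer
        vals.append(erf(F.T @ c) @ erf(F.T @ c_prime) / p)
    return np.mean(vals), np.std(vals)

for p in [10, 100, 1000, 10000]:
    mean, std = rf_inner(p)
    print(f"p = {p:6d}   (1/p) Phi.Phi' ~ {mean:.4f} +/- {std:.4f}")
```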

SLIDE 22

Main result: Asymptotic generalisation error for arbitrary loss and projection F

SLIDE 23

In the high-dimensional limit, consider the unique fixed point of the following system of equations:

$$
\begin{aligned}
\hat{V}_s &= \frac{\alpha}{\gamma}\,\kappa_1^2\, \mathbb{E}_{\xi,y}\!\left[ \mathcal{Z}(y,\omega_0)\, \frac{\partial_\omega \eta(y,\omega_1)}{V} \right], &
\hat{q}_s &= \frac{\alpha}{\gamma}\,\kappa_1^2\, \mathbb{E}_{\xi,y}\!\left[ \mathcal{Z}(y,\omega_0)\, \frac{\left(\eta(y,\omega_1)-\omega_1\right)^2}{V^2} \right], \\
\hat{m}_s &= \frac{\alpha}{\gamma}\,\kappa_1\, \mathbb{E}_{\xi,y}\!\left[ \partial_\omega \mathcal{Z}(y,\omega_0)\, \frac{\eta(y,\omega_1)-\omega_1}{V} \right], &
\hat{V}_w &= \alpha\,\kappa_\star^2\, \mathbb{E}_{\xi,y}\!\left[ \mathcal{Z}(y,\omega_0)\, \frac{\partial_\omega \eta(y,\omega_1)}{V} \right], \\
\hat{q}_w &= \alpha\,\kappa_\star^2\, \mathbb{E}_{\xi,y}\!\left[ \mathcal{Z}(y,\omega_0)\, \frac{\left(\eta(y,\omega_1)-\omega_1\right)^2}{V^2} \right], &&
\end{aligned}
$$

$$
\begin{aligned}
V_s &= \frac{1}{\hat{V}_s}\left(1 - z\, g_\mu(-z)\right),
\qquad
m_s = \frac{\hat{m}_s}{\hat{V}_s}\left(1 - z\, g_\mu(-z)\right), \\
q_s &= \frac{\hat{m}_s^2 + \hat{q}_s}{\hat{V}_s^2}\left[1 - 2 z\, g_\mu(-z) + z^2 g_\mu'(-z)\right]
      + \frac{\hat{q}_w}{(\lambda + \hat{V}_w)\,\hat{V}_s}\left[- z\, g_\mu(-z) + z^2 g_\mu'(-z)\right], \\
V_w &= \frac{\gamma}{\lambda + \hat{V}_w}\left[\frac{1}{\gamma} - 1 + z\, g_\mu(-z)\right], \\
q_w &= \frac{\gamma\, \hat{q}_w}{(\lambda + \hat{V}_w)^2}\left[\frac{1}{\gamma} - 1 + z^2 g_\mu'(-z)\right]
      + \frac{\hat{m}_s^2 + \hat{q}_s}{(\lambda + \hat{V}_w)\,\hat{V}_s}\left[- z\, g_\mu(-z) + z^2 g_\mu'(-z)\right].
\end{aligned}
$$

Definitions:

$$
\eta(y,\omega) = \underset{x \in \mathbb{R}}{\operatorname{argmin}}\left[\frac{(x-\omega)^2}{2V} + \ell(y,x)\right],
\qquad
\mathcal{Z}(y,\omega) = \int \frac{\mathrm{d}x}{\sqrt{2\pi V^0}}\, e^{-\frac{(x-\omega)^2}{2 V^0}}\, \delta\!\left(y - f^0(x)\right),
$$

where

$$
V = \kappa_1^2 V_s + \kappa_\star^2 V_w, \quad
V^0 = \rho - \frac{M^2}{Q}, \quad
Q = \kappa_1^2 q_s + \kappa_\star^2 q_w, \quad
M = \kappa_1 m_s, \quad
\omega_0 = \frac{M}{\sqrt{Q}}\,\xi, \quad
\omega_1 = \sqrt{Q}\,\xi,
$$

$g_\mu$ is the Stieltjes transform of $F F^\top$, and

$$
\kappa_0 = \mathbb{E}[\sigma(z)], \qquad
\kappa_1 = \mathbb{E}[z\,\sigma(z)], \qquad
\kappa_\star^2 = \mathbb{E}[\sigma(z)^2] - \kappa_0^2 - \kappa_1^2,
\qquad z \sim \mathcal{N}(0,1),
\qquad z^\mu \sim \mathcal{N}(0, I_p).
$$

The generalisation error and the training loss then read

$$
\epsilon_{\mathrm{gen}} = \mathbb{E}_{\nu,\lambda}\!\left[\left(f^0(\nu) - \hat{f}(\lambda)\right)^2\right],
\qquad
(\nu, \lambda) \sim \mathcal{N}\!\left(0, \begin{pmatrix} \rho & M^\star \\ M^\star & Q^\star \end{pmatrix}\right),
$$

$$
\mathcal{L}_{\mathrm{training}} = \frac{\lambda}{2\alpha}\, q_w^\star + \mathbb{E}_{\xi,y}\!\left[\mathcal{Z}(y, \omega_0^\star)\, \ell\!\left(y, \eta(y, \omega_1^\star)\right)\right],
\qquad
\omega_0^\star = \frac{M^\star}{\sqrt{Q^\star}}\,\xi, \quad \omega_1^\star = \sqrt{Q^\star}\,\xi.
$$

Agrees with [Mei, Montanari '19], who solved a particular case using random matrix theory: square loss $\ell(x, y) = \lVert x - y \rVert_2^2$, linear teacher $f^0$, and Gaussian random weights $F$.
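For concreteness, a small worked instance (assuming the square loss $\ell(y, x) = \frac{1}{2}(y - x)^2$ used in the ridge setting later in the talk): the proximal map $\eta$ defined above then has a closed form,

$$
\eta(y, \omega)
  = \underset{x \in \mathbb{R}}{\operatorname{argmin}}\left[\frac{(x - \omega)^2}{2V} + \frac{1}{2}(y - x)^2\right]
  = \frac{\omega + V y}{1 + V},
\qquad
\eta(y, \omega) - \omega = \frac{V\,(y - \omega)}{1 + V}.
$$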

SLIDE 24

Technical note: replicated Gaussian Equivalence

An important step in the derivation of this result is the observation that the generalisation and training properties of the dataset $\{x^\mu, y^\mu\}_{\mu=1}^{n}$ are statistically equivalent to those of the dataset $\{\tilde{x}^\mu, y^\mu\}_{\mu=1}^{n}$ with the same labels but

$$
\tilde{x}^\mu = \kappa_1\, \frac{F^\top c^\mu}{\sqrt{d}} + \kappa_\star\, z^\mu,
\qquad z^\mu \sim \mathcal{N}(0, I_p),
$$

where the coefficients $\kappa_1, \kappa_\star$ are chosen to match

$$
\kappa_1 = \mathbb{E}_{\xi}\left[\xi\, \sigma(\xi)\right],
\qquad
\kappa_\star^2 = \mathbb{E}_{\xi}\left[\sigma(\xi)^2\right] - \kappa_1^2,
\qquad \xi \sim \mathcal{N}(0, 1).
$$

Generalisation of an observation in [Mei, Montanari '19; Goldt, Mézard, Krzakala, Zdeborová '19].
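A small numerical sketch of this equivalence (assumed illustration: $\sigma = \operatorname{erf}$, with the matching coefficients estimated by Monte Carlo rather than their closed forms):

```python
# Build the Gaussian-equivalent inputs x~^mu = kappa_1 F^T c^mu / sqrt(d) + kappa_star z^mu
# next to the original hidden-manifold inputs, with kappa_1, kappa_star estimated
# from their defining expectations.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(2)
xi = rng.standard_normal(1_000_000)                        # xi ~ N(0, 1)
kappa1 = np.mean(xi * erf(xi))                             # kappa_1 = E[xi sigma(xi)]
kappa_star = np.sqrt(np.mean(erf(xi) ** 2) - kappa1 ** 2)  # kappa_star^2 = E[sigma(xi)^2] - kappa_1^2

d, p, n = 20, 200, 1000
F = rng.standard_normal((d, p))
C = rng.standard_normal((n, d))                            # latent vectors
Z = rng.standard_normal((n, p))                            # z^mu ~ N(0, I_p)

X = erf(C @ F / np.sqrt(d))                                # hidden-manifold inputs
X_tilde = kappa1 * (C @ F) / np.sqrt(d) + kappa_star * Z   # Gaussian-equivalent inputs
print(f"kappa_1 ~ {kappa1:.3f}, kappa_star ~ {kappa_star:.3f}", X.shape, X_tilde.shape)
```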

SLIDE 25

Drawing the consequences of our formula
SLIDE 26

Learning in the HMM

[Figure: generalisation error vs. # samples / # input dimensions, and vs. # latent / # input dimensions.]

Good generalisation performance for small latent space, even at small sample complexity.

Setting: $f^0 = \hat{f} = \operatorname{sign}$, $\sigma = \operatorname{erf}$, $\ell(x, y) = \frac{1}{2}(x - y)^2$, $d/p = 0.1$, optimal $\lambda$, Gaussian $F$.

SLIDE 27

Classification tasks

Setting: $f^0 = \hat{f} = \operatorname{sign}$, $\sigma = \operatorname{sign}$, Gaussian $F$.

SLIDE 28

Random vs. orthogonal projections

[Figure: two panels, ridge regression and logistic regression.]

First layer $F$: random i.i.d. Gaussian matrix, $F_{i\rho} \sim \mathcal{N}(0, 1/d)$, vs. subsampled Fourier matrix, $F = U^\top D V$ with $U, V \sim$ Haar.
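A brief sketch of the two first-layer ensembles being compared (the flat singular-value spectrum D below is a placeholder assumption, and the Haar matrices are drawn via QR rather than from an actual subsampled Fourier matrix):

```python
# Two choices of projection: i.i.d. Gaussian F_{i rho} ~ N(0, 1/d) versus an
# orthogonally invariant F = U^T D V with U, V Haar-distributed.
import numpy as np

rng = np.random.default_rng(3)
d, p = 128, 512

F_gauss = rng.standard_normal((d, p)) / np.sqrt(d)     # random i.i.d. Gaussian first layer

def haar(m):
    """Haar-distributed orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((m, m)))
    return q * np.sign(np.diag(r))                     # sign fix makes the law Haar

U, V = haar(d), haar(p)
D = np.zeros((d, p))
np.fill_diagonal(D, 1.0)                               # illustrative flat spectrum
F_orth = U.T @ D @ V                                   # structured / orthogonal first layer
print(F_gauss.shape, F_orth.shape)
```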

SLIDE 29

Random vs. orthogonal projections

[Figure: two panels, ridge regression and logistic regression.]

First layer $F$: random i.i.d. Gaussian matrix, $F_{i\rho} \sim \mathcal{N}(0, 1/d)$, vs. subsampled Fourier matrix, $F = U^\top D V$ with $U, V \sim$ Haar [NIPS '17].

SLIDE 30

Separability transition in logistic regression

[Cover '65]

Setting: $\ell(x, y) = \log(1 + e^{-xy})$, $\sigma = \operatorname{erf}$, $f^0 = \hat{f} = \operatorname{sign}$.
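An empirical probe of this transition (a rough sketch under the same assumptions as the earlier snippets, using nearly unregularised logistic regression and checking whether it reaches zero training error as p/n grows):

```python
# When the number of random features p is large enough relative to n, the training
# data become linearly separable in feature space and the training error drops to zero.
import numpy as np
from scipy.special import erf
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
d, n = 20, 400
theta0 = rng.standard_normal(d)
C = rng.standard_normal((n, d))
y = np.sign(C @ theta0 / np.sqrt(d))      # assumed sign teacher on the latent space

for p in [50, 200, 800, 3200]:
    F = rng.standard_normal((d, p))
    X = erf(C @ F / np.sqrt(d))
    clf = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)  # ~no regularisation
    train_err = np.mean(clf.predict(X) != y)
    print(f"p/n = {p / n:5.1f}   training error = {train_err:.3f}")
```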

SLIDE 31

Separability transition in logistic regression

[Figure: phase diagram in the (snr, p/n) plane.]

[Sur & Candès '18]

SLIDE 32

Next steps

Learning F?

SLIDE 33

Thank you for your attention!

Check out our paper @ arXiv:2002.09339 [math.ST]. Contact: brloureiro@gmail.com

SLIDE 34

References in this talk

  • F. Gerace, B. Loureiro, F. Krzakala, M. Mézard, L. Zdeborová, "Generalisation error in learning with random features and the hidden manifold model", arXiv:2002.09339
  • S. Goldt, M. Mézard, F. Krzakala, L. Zdeborová, "Modelling the influence of data structure on learning in neural networks: the hidden manifold model", arXiv:1909.11500
  • A. Abbara, B. Aubin, F. Krzakala, L. Zdeborová, "Rademacher complexity and spin glasses: A link between the replica and statistical theories of learning", arXiv:1912.02729
  • A. Rahimi, B. Recht, "Random Features for Large-Scale Kernel Machines", NIPS '07
  • S. Mei, A. Montanari, "The generalization error of random features regression: Precise asymptotics and double descent curve", arXiv:1908.05355
  • C. Williams, "Computing with infinite networks", NIPS '98
  • K. Choromanski, M. Rowland, A. Weller, "The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings", NIPS '17
  • P. Sur, E. J. Candès, "A modern maximum-likelihood theory for high-dimensional logistic regression", PNAS '19
  • M. Geiger, S. Spigler, S. d'Ascoli, L. Sagun, M. Baity-Jesi, G. Biroli, M. Wyart, "Jamming transition as a paradigm to understand the loss landscape of deep neural networks", Physical Review E 100(1):012115