

SLIDE 1

Generalisation error in learning with random features and the hidden manifold model

ICML 2020

  • B. Loureiro (IPhT)
  • M. Mézard (ENS)
  • L. Zdeborová (IPhT)
  • F. Krzakala (ENS)
  • F. Gerace (IPhT)

SLIDE 2

TRAINING A NEURAL NET

EXPECTATIONS

[Figure: |bias| and variance curves vs. model complexity (error).]

SLIDE 3

TRAINING A NEURAL NET

EXPECTATIONS vs. REALITY

[Figure: |bias| and variance vs. model complexity (error), expectations next to the measured behaviour; Geiger et al. '18.]

See also [Geman et al. '92; Opper '95; Neyshabur, Tomioka, Srebro 2015; Advani, Saxe 2017; Belkin, Hsu, Ma, Mandal 2019; Nakkiran et al. 2019].

SLIDE 4

The usual suspects

SLIDE 5

The usual suspects

architecture

SLIDE 6

The usual suspects

architecture algorithms

SLIDE 7

The usual suspects

architecture algorithms data

SLIDE 8

DATA

The two theory cultures

[Figure: what worst-case analysis thinks the data looks like, what typical-case analysis thinks it looks like, and what it really looks like.]

SLIDE 9

Spoiler

[Figure: feature space vs. input space.]


SLIDE 14

Worst-case vs. typical-case: A concrete example

SLIDE 15

Concrete example

[Abbara, Aubin, Krzakala, Zdeborová '19]

Dataset $\mathcal{D} = \{x^\mu, y^\mu\}_{\mu=1}^{n}$ with inputs $x^\mu \sim \mathcal{N}(0, I_d)$ and labels $y^\mu = \operatorname{sign}(x^\mu \cdot \theta_0)$.

Rademacher bound for the function class $f_\theta(x) = \operatorname{sign}(x \cdot \theta)$ vs. out-of-the-box logistic regression (scikit-learn), as sketched below.

[Figure: generalisation error $\epsilon_g$ vs. sample complexity $\alpha$ = # datapoints / # dimensions, comparing the Rademacher bound, the Bayes-optimal error, and logistic regression (simulation, cross-validation).]
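A minimal simulation sketch of this experiment (not the authors' code; it assumes the teacher/student setup above and uses scikit-learn's default logistic regression, with the error estimated on a held-out test set):

```python
# Minimal sketch of the slide's experiment: Gaussian inputs, labels from a sign
# teacher, out-of-the-box logistic regression, generalisation error vs. alpha = n/d.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 200
theta0 = rng.standard_normal(d)            # teacher vector theta_0

def dataset(n):
    X = rng.standard_normal((n, d))        # x^mu ~ N(0, I_d)
    y = np.sign(X @ theta0)                # y^mu = sign(x^mu . theta_0)
    return X, y

X_test, y_test = dataset(20_000)
for alpha in [0.5, 1.0, 2.0, 5.0, 10.0]:
    X_train, y_train = dataset(int(alpha * d))
    clf = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
    eps_g = np.mean(clf.predict(X_test) != y_test)
    print(f"alpha = {alpha:5.1f}   generalisation error ~ {eps_g:.3f}")
```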

SLIDE 16

Can we do better?

SLIDE 17

Hidden Manifold Model

[Goldt, Mézard, Krzakala, Zdeborová ‘19]

Idea: a dataset where both the data points and the labels depend only on a subset of latent variables.

[Figure: feature space vs. input space.]

$$x^\mu = \sigma\!\left(\frac{F^\top c^\mu}{\sqrt{d}}\right)$$
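A short sketch of how such a dataset can be generated (the Gaussian latent distribution and the sign-teacher label rule below are illustrative assumptions; the slide only specifies the input map):

```python
# Sketch of a hidden-manifold dataset: inputs x^mu = sigma(F^T c^mu / sqrt(d)),
# generated from low-dimensional latent vectors c^mu; the Gaussian latents and
# the sign teacher on the latent space are assumptions made for illustration.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d, p, n = 20, 200, 1000                    # latent dim, input dim, samples

F = rng.standard_normal((d, p))            # fixed projection (here: Gaussian i.i.d.)
theta0 = rng.standard_normal(d)            # hypothetical teacher on the latent space

C = rng.standard_normal((n, d))            # latent vectors c^mu (assumed Gaussian)
X = erf(C @ F / np.sqrt(d))                # inputs x^mu = sigma(F^T c^mu / sqrt(d)), sigma = erf
y = np.sign(C @ theta0 / np.sqrt(d))       # assumed labels depending only on the latents
print(X.shape, y.shape)                    # (1000, 200) (1000,)
```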

SLIDE 18

Aim: study classification and regression tasks on this dataset

SLIDE 19

The task

Learn the labels using a linear model trained by empirical risk minimisation, with a loss function $\ell$ and a ridge penalty $\frac{\lambda}{2}\lVert w \rVert_2^2$. Examples (sketched in code after this list):

  • Ridge regression: $\ell(y, x) = \frac{1}{2}(y - x)^2$
  • Logistic regression: $\ell(y, x) = \log(1 + e^{-yx})$
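A hedged sketch of the two example tasks on the dataset from the previous snippet, using scikit-learn's off-the-shelf ridge and l2-penalised logistic solvers in place of a dedicated ERM implementation:

```python
# Empirical risk minimisation with a linear model on hidden-manifold inputs:
# square loss + ridge penalty (Ridge) and logistic loss + l2 penalty (LogisticRegression).
import numpy as np
from scipy.special import erf
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(1)
d, p, n = 20, 200, 1000
F = rng.standard_normal((d, p))
theta0 = rng.standard_normal(d)

def hmm_data(n):
    C = rng.standard_normal((n, d))
    X = erf(C @ F / np.sqrt(d))
    y = np.sign(C @ theta0 / np.sqrt(d))   # assumed sign teacher on the latent space
    return X, y

X_train, y_train = hmm_data(n)
X_test, y_test = hmm_data(5_000)

ridge = Ridge(alpha=1.0).fit(X_train, y_train)                 # square loss + ridge penalty
err_ridge = np.mean(np.sign(ridge.predict(X_test)) != y_test)  # threshold the regressor

logit = LogisticRegression(C=1.0, max_iter=2_000).fit(X_train, y_train)  # logistic loss + l2
err_logit = np.mean(logit.predict(X_test) != y_test)

print(f"ridge regression error ~ {err_ridge:.3f}   logistic regression error ~ {err_logit:.3f}")
```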

SLIDE 20

Two alternative points of view

[Williams '98; Rahimi, Recht '07; Mei, Montanari '19]

Dataset $\mathcal{D} = \{c^\mu, y^\mu\}_{\mu=1}^{n}$

SLIDE 21

Two alternative points of view

[Williams '98; Rahimi, Recht '07; Mei, Montanari '19]

Dataset $\mathcal{D} = \{c^\mu, y^\mu\}_{\mu=1}^{n}$

Feature map: $\Phi_F(c) = \sigma(F^\top c)$

$$\Phi_F(c) \cdot \Phi_F(c') \xrightarrow[\,p \to \infty\,]{} K(c, c') \qquad \text{(Mercer's theorem)}$$
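A quick numerical illustration of the second point of view (an assumed Monte Carlo check, not from the talk): the rescaled random-feature inner product concentrates as the number of features p grows, as the kernel limit suggests.

```python
# Check that (1/p) * Phi_F(c) . Phi_F(c') fluctuates less and less as p grows,
# consistent with convergence to a deterministic kernel K(c, c') when p -> infinity.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d = 50
c, c_prime = rng.standard_normal(d), rng.standard_normal(d)

def rf_inner(p, n_trials=50):
    vals = []
    for _ in range(n_trials):
        F = rng.standard_normal((d, p)) / np.sqrt(d)   # random first layer
        vals.append(erf(F.T @ c) @ erf(F.T @ c_prime) / p)
    return np.mean(vals), np.std(vals)

for p in [10, 100, 1000, 10000]:
    mean, std = rf_inner(p)
    print(f"p = {p:6d}   (1/p) Phi.Phi' ~ {mean:.4f} +/- {std:.4f}")
```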

SLIDE 22

Main result: Asymptotic generalisation error for arbitrary loss and projection F

SLIDE 23

In the high-dimensional limit, consider the unique fixed point of the following system of equations:

$$
\begin{aligned}
\hat{V}_s &= \frac{\alpha}{\gamma}\,\kappa_1^2\, \mathbb{E}_{\xi,y}\!\left[ \mathcal{Z}(y,\omega_0)\, \frac{\partial_\omega \eta(y,\omega_1)}{V} \right], &
\hat{q}_s &= \frac{\alpha}{\gamma}\,\kappa_1^2\, \mathbb{E}_{\xi,y}\!\left[ \mathcal{Z}(y,\omega_0)\, \frac{\left(\eta(y,\omega_1)-\omega_1\right)^2}{V^2} \right], \\
\hat{m}_s &= \frac{\alpha}{\gamma}\,\kappa_1\, \mathbb{E}_{\xi,y}\!\left[ \partial_\omega \mathcal{Z}(y,\omega_0)\, \frac{\eta(y,\omega_1)-\omega_1}{V} \right], &
\hat{V}_w &= \alpha\,\kappa_\star^2\, \mathbb{E}_{\xi,y}\!\left[ \mathcal{Z}(y,\omega_0)\, \frac{\partial_\omega \eta(y,\omega_1)}{V} \right], \\
\hat{q}_w &= \alpha\,\kappa_\star^2\, \mathbb{E}_{\xi,y}\!\left[ \mathcal{Z}(y,\omega_0)\, \frac{\left(\eta(y,\omega_1)-\omega_1\right)^2}{V^2} \right], &&
\end{aligned}
$$

$$
\begin{aligned}
V_s &= \frac{1}{\hat{V}_s}\left(1 - z\, g_\mu(-z)\right),
\qquad
m_s = \frac{\hat{m}_s}{\hat{V}_s}\left(1 - z\, g_\mu(-z)\right), \\
q_s &= \frac{\hat{m}_s^2 + \hat{q}_s}{\hat{V}_s^2}\left[1 - 2 z\, g_\mu(-z) + z^2 g_\mu'(-z)\right]
      + \frac{\hat{q}_w}{(\lambda + \hat{V}_w)\,\hat{V}_s}\left[- z\, g_\mu(-z) + z^2 g_\mu'(-z)\right], \\
V_w &= \frac{\gamma}{\lambda + \hat{V}_w}\left[\frac{1}{\gamma} - 1 + z\, g_\mu(-z)\right], \\
q_w &= \frac{\gamma\, \hat{q}_w}{(\lambda + \hat{V}_w)^2}\left[\frac{1}{\gamma} - 1 + z^2 g_\mu'(-z)\right]
      + \frac{\hat{m}_s^2 + \hat{q}_s}{(\lambda + \hat{V}_w)\,\hat{V}_s}\left[- z\, g_\mu(-z) + z^2 g_\mu'(-z)\right].
\end{aligned}
$$

Definitions:

$$
\eta(y,\omega) = \underset{x \in \mathbb{R}}{\operatorname{argmin}}\left[\frac{(x-\omega)^2}{2V} + \ell(y,x)\right],
\qquad
\mathcal{Z}(y,\omega) = \int \frac{\mathrm{d}x}{\sqrt{2\pi V^0}}\, e^{-\frac{(x-\omega)^2}{2 V^0}}\, \delta\!\left(y - f^0(x)\right),
$$

where

$$
V = \kappa_1^2 V_s + \kappa_\star^2 V_w, \quad
V^0 = \rho - \frac{M^2}{Q}, \quad
Q = \kappa_1^2 q_s + \kappa_\star^2 q_w, \quad
M = \kappa_1 m_s, \quad
\omega_0 = \frac{M}{\sqrt{Q}}\,\xi, \quad
\omega_1 = \sqrt{Q}\,\xi,
$$

$g_\mu$ is the Stieltjes transform of $F F^\top$, and

$$
\kappa_0 = \mathbb{E}[\sigma(z)], \qquad
\kappa_1 = \mathbb{E}[z\,\sigma(z)], \qquad
\kappa_\star^2 = \mathbb{E}[\sigma(z)^2] - \kappa_0^2 - \kappa_1^2,
\qquad z \sim \mathcal{N}(0,1),
\qquad z^\mu \sim \mathcal{N}(0, I_p).
$$

The generalisation error and the training loss then read

$$
\epsilon_{\mathrm{gen}} = \mathbb{E}_{\nu,\lambda}\!\left[\left(f^0(\nu) - \hat{f}(\lambda)\right)^2\right],
\qquad
(\nu, \lambda) \sim \mathcal{N}\!\left(0, \begin{pmatrix} \rho & M^\star \\ M^\star & Q^\star \end{pmatrix}\right),
$$

$$
\mathcal{L}_{\mathrm{training}} = \frac{\lambda}{2\alpha}\, q_w^\star + \mathbb{E}_{\xi,y}\!\left[\mathcal{Z}(y, \omega_0^\star)\, \ell\!\left(y, \eta(y, \omega_1^\star)\right)\right],
\qquad
\omega_0^\star = \frac{M^\star}{\sqrt{Q^\star}}\,\xi, \quad \omega_1^\star = \sqrt{Q^\star}\,\xi.
$$

Agrees with [Mei, Montanari '19], who solved a particular case using random matrix theory: square loss $\ell(x, y) = \lVert x - y \rVert_2^2$, linear teacher $f^0$, and Gaussian random weights $F$.
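For concreteness, a small worked instance (assuming the square loss $\ell(y, x) = \frac{1}{2}(y - x)^2$ used in the ridge setting later in the talk): the proximal map $\eta$ defined above then has a closed form,

$$
\eta(y, \omega)
  = \underset{x \in \mathbb{R}}{\operatorname{argmin}}\left[\frac{(x - \omega)^2}{2V} + \frac{1}{2}(y - x)^2\right]
  = \frac{\omega + V y}{1 + V},
\qquad
\eta(y, \omega) - \omega = \frac{V\,(y - \omega)}{1 + V}.
$$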

SLIDE 24

Technical note: replicated Gaussian Equivalence

An important step in the derivation of this result is the observation that the generalisation and training properties of the dataset $\{x^\mu, y^\mu\}_{\mu=1}^{n}$ are statistically equivalent to those of the dataset $\{\tilde{x}^\mu, y^\mu\}_{\mu=1}^{n}$ with the same labels but

$$
\tilde{x}^\mu = \kappa_1\, \frac{F^\top c^\mu}{\sqrt{d}} + \kappa_\star\, z^\mu,
\qquad z^\mu \sim \mathcal{N}(0, I_p),
$$

where the coefficients $\kappa_1, \kappa_\star$ are chosen to match

$$
\kappa_1 = \mathbb{E}_{\xi}\left[\xi\, \sigma(\xi)\right],
\qquad
\kappa_\star^2 = \mathbb{E}_{\xi}\left[\sigma(\xi)^2\right] - \kappa_1^2,
\qquad \xi \sim \mathcal{N}(0, 1).
$$

Generalisation of an observation in [Mei, Montanari '19; Goldt, Mézard, Krzakala, Zdeborová '19].
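A small numerical sketch of this equivalence (assumed illustration: $\sigma = \operatorname{erf}$, with the matching coefficients estimated by Monte Carlo rather than their closed forms):

```python
# Build the Gaussian-equivalent inputs x~^mu = kappa_1 F^T c^mu / sqrt(d) + kappa_star z^mu
# next to the original hidden-manifold inputs, with kappa_1, kappa_star estimated
# from their defining expectations.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(2)
xi = rng.standard_normal(1_000_000)                        # xi ~ N(0, 1)
kappa1 = np.mean(xi * erf(xi))                             # kappa_1 = E[xi sigma(xi)]
kappa_star = np.sqrt(np.mean(erf(xi) ** 2) - kappa1 ** 2)  # kappa_star^2 = E[sigma(xi)^2] - kappa_1^2

d, p, n = 20, 200, 1000
F = rng.standard_normal((d, p))
C = rng.standard_normal((n, d))                            # latent vectors
Z = rng.standard_normal((n, p))                            # z^mu ~ N(0, I_p)

X = erf(C @ F / np.sqrt(d))                                # hidden-manifold inputs
X_tilde = kappa1 * (C @ F) / np.sqrt(d) + kappa_star * Z   # Gaussian-equivalent inputs
print(f"kappa_1 ~ {kappa1:.3f}, kappa_star ~ {kappa_star:.3f}", X.shape, X_tilde.shape)
```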

SLIDE 25

Drawing the consequences of our formula
SLIDE 26

Learning in the HMM

[Figure: generalisation error vs. # samples / # input dimensions, and vs. # latent / # input dimensions.]

Good generalisation performance for small latent space, even at small sample complexity.

Setting: $f^0 = \hat{f} = \operatorname{sign}$, $\sigma = \operatorname{erf}$, $\ell(x, y) = \frac{1}{2}(x - y)^2$, $d/p = 0.1$, optimal $\lambda$, Gaussian $F$.

SLIDE 27

Classification tasks

Setting: $f^0 = \hat{f} = \operatorname{sign}$, $\sigma = \operatorname{sign}$, Gaussian $F$.

SLIDE 28

Random vs. orthogonal projections

[Figure: two panels, ridge regression and logistic regression.]

First layer $F$: random i.i.d. Gaussian matrix, $F_{i\rho} \sim \mathcal{N}(0, 1/d)$, vs. subsampled Fourier matrix, $F = U^\top D V$ with $U, V \sim$ Haar.
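A brief sketch of the two first-layer ensembles being compared (the flat singular-value spectrum D below is a placeholder assumption, and the Haar matrices are drawn via QR rather than from an actual subsampled Fourier matrix):

```python
# Two choices of projection: i.i.d. Gaussian F_{i rho} ~ N(0, 1/d) versus an
# orthogonally invariant F = U^T D V with U, V Haar-distributed.
import numpy as np

rng = np.random.default_rng(3)
d, p = 128, 512

F_gauss = rng.standard_normal((d, p)) / np.sqrt(d)     # random i.i.d. Gaussian first layer

def haar(m):
    """Haar-distributed orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((m, m)))
    return q * np.sign(np.diag(r))                     # sign fix makes the law Haar

U, V = haar(d), haar(p)
D = np.zeros((d, p))
np.fill_diagonal(D, 1.0)                               # illustrative flat spectrum
F_orth = U.T @ D @ V                                   # structured / orthogonal first layer
print(F_gauss.shape, F_orth.shape)
```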

SLIDE 29

Random vs. orthogonal projections

[Figure: two panels, ridge regression and logistic regression.]

First layer $F$: random i.i.d. Gaussian matrix, $F_{i\rho} \sim \mathcal{N}(0, 1/d)$, vs. subsampled Fourier matrix, $F = U^\top D V$ with $U, V \sim$ Haar [NIPS '17].

SLIDE 30

Separability transition in logistic regression

[Cover '65]

Setting: $\ell(x, y) = \log(1 + e^{-xy})$, $\sigma = \operatorname{erf}$, $f^0 = \hat{f} = \operatorname{sign}$.
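An empirical probe of this transition (a rough sketch under the same assumptions as the earlier snippets, using nearly unregularised logistic regression and checking whether it reaches zero training error as p/n grows):

```python
# When the number of random features p is large enough relative to n, the training
# data become linearly separable in feature space and the training error drops to zero.
import numpy as np
from scipy.special import erf
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
d, n = 20, 400
theta0 = rng.standard_normal(d)
C = rng.standard_normal((n, d))
y = np.sign(C @ theta0 / np.sqrt(d))      # assumed sign teacher on the latent space

for p in [50, 200, 800, 3200]:
    F = rng.standard_normal((d, p))
    X = erf(C @ F / np.sqrt(d))
    clf = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)  # ~no regularisation
    train_err = np.mean(clf.predict(X) != y)
    print(f"p/n = {p / n:5.1f}   training error = {train_err:.3f}")
```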

SLIDE 31

Separability transition in logistic regression

[Figure: phase diagram in the (snr, p/n) plane.]

[Sur & Candès '18]

SLIDE 32

Next steps

Learning F?

SLIDE 33

Thank you for your attention!

Check out our paper @ arXiv:2002.09339 [math.ST]. Contact: brloureiro@gmail.com

SLIDE 34

References in this talk

  • F. Gerace, B. Loureiro, F. Krzakala, M. Mézard, L. Zdeborová, "Generalisation error in learning with random features and the hidden manifold model", arXiv:2002.09339
  • S. Goldt, M. Mézard, F. Krzakala, L. Zdeborová, "Modelling the influence of data structure on learning in neural networks: the hidden manifold model", arXiv:1909.11500
  • A. Abbara, B. Aubin, F. Krzakala, L. Zdeborová, "Rademacher complexity and spin glasses: A link between the replica and statistical theories of learning", arXiv:1912.02729
  • A. Rahimi, B. Recht, "Random Features for Large-Scale Kernel Machines", NIPS '07
  • S. Mei, A. Montanari, "The generalization error of random features regression: Precise asymptotics and double descent curve", arXiv:1908.05355
  • C. Williams, "Computing with infinite networks", NIPS '98
  • K. Choromanski, M. Rowland, A. Weller, "The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings", NIPS '17
  • P. Sur, E. J. Candès, "A modern maximum-likelihood theory for high-dimensional logistic regression", PNAS '19
  • M. Geiger, S. Spigler, S. d'Ascoli, L. Sagun, M. Baity-Jesi, G. Biroli, M. Wyart, "Jamming transition as a paradigm to understand the loss landscape of deep neural networks", Physical Review E 100(1):012115