SLIDE 1

Generalisation Bounds for Neural Networks

Pascale Gourdeau

University of Oxford

15 November 2018

SLIDE 2

Overview

1. Introduction
2. General Strategies to Obtain Generalisation Bounds
3. Survey of Generalisation Bounds for Neural Networks
4. A Compression Approach [Arora et al., 2018]
5. Conclusion, Research Directions

SLIDE 3

Overview

1. Introduction
2. General Strategies to Obtain Generalisation Bounds
3. Survey of Generalisation Bounds for Neural Networks
4. A Compression Approach [Arora et al., 2018]
5. Conclusion, Research Directions

SLIDE 4

What is generalisation? The ability to perform well on unseen data.

Assumption: the data (both for training and testing) comes i.i.d. from a distribution D. We usually work in a distribution-agnostic setting.

SLIDE 5

What are generalisation bounds?

Classification setting: input space X and output space Y := {1, . . . , k}, with a distribution D on X × Y.

Goal: to learn a function f : X → Y from a sample $S := \{(x_i, y_i)\}_{i=1}^{m} \subseteq X \times Y$.

Generalisation bounds: bounding the difference between the expected and empirical losses of f with high probability over S.

SLIDE 6

What are generalisation bounds?

For neural networks, we use the expected classification loss:

$$L_0(f) := \mathbb{P}_{(x,y)\sim D}\left[ f(x)_y \le \max_{y' \neq y} f(x)_{y'} \right],$$

and the empirical margin loss:

$$\hat{L}_\gamma(f) := \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\left[ f(x_i)_{y_i} \le \gamma + \max_{y' \neq y_i} f(x_i)_{y'} \right].$$
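
Below is a minimal NumPy sketch of these two quantities, assuming the network f is given through its logit outputs on a batch; the function name and example values are purely illustrative.

```python
import numpy as np

def margin_loss(logits, labels, gamma=0.0):
    """Fraction of examples whose margin f(x)_y - max_{y' != y} f(x)_{y'} is at most gamma.

    logits: (m, k) array of network outputs; labels: (m,) integer array.
    gamma = 0 recovers the classification (0-1) loss used for L_0.
    """
    m = logits.shape[0]
    true_scores = logits[np.arange(m), labels]
    masked = logits.copy()
    masked[np.arange(m), labels] = -np.inf          # exclude the true class
    runner_up = masked.max(axis=1)                  # max_{y' != y} f(x)_{y'}
    return np.mean(true_scores <= gamma + runner_up)

# Example: three classes, four samples.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.3,  0.0],
                   [1.0, 1.2,  0.9],
                   [3.0, -2.0, 0.5]])
labels = np.array([0, 1, 0, 0])
print(margin_loss(logits, labels, gamma=0.0))   # empirical 0-1 loss
print(margin_loss(logits, labels, gamma=0.5))   # empirical margin loss at gamma = 0.5
```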

SLIDE 7

Why are generalisation bounds useful?

They allow us to quantify a given model's expected generalisation performance.

E.g.: with probability 95% over the training sample, the error is at most 1%.

They can also:

Provide insight into the ability of a model to generalise.

This is of particular interest for us: neural networks have many counter-intuitive properties.

Inspire new algorithms or regularisation techniques.

SLIDE 8

Overview

1. Introduction
2. General Strategies to Obtain Generalisation Bounds
3. Survey of Generalisation Bounds for Neural Networks
4. A Compression Approach [Arora et al., 2018]
5. Conclusion, Research Directions

SLIDE 9

General Strategies

Generalisation bounds (GB) for neural networks are usually obtained by:

1. Defining a class H of functions computed by neural networks with certain properties (e.g., weight matrices with bounded norms, number of layers, etc.),

2. Deriving a generalisation bound in terms of a complexity measure M(H) (e.g., size of H, Rademacher complexity),

3. Upper bounding M(H) in terms of model parameters (e.g., norm of weight matrices, number of layers, etc.).

SLIDE 10

General Strategies: Rademacher Complexity

Definition (Rademacher complexity)

Let G be a family of functions from a set Z to R. Let σ_1, . . . , σ_m be Rademacher variables: P(σ_i = 1) = P(σ_i = −1) = 1/2. The empirical Rademacher complexity of G w.r.t. a sample $S = \{z_i\}_{i=1}^{m}$ is

$$\hat{R}_S(G) = \mathbb{E}_\sigma\left[ \sup_{g \in G} \frac{1}{m} \sum_{i=1}^{m} \sigma_i g(z_i) \right].$$

Intuition: how much G correlates with random noise on S. Simple examples...
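
When G is finite, the definition can be turned directly into a small Monte Carlo estimate; the toy class of threshold functions below is purely illustrative.

```python
import numpy as np

def empirical_rademacher(sample, function_class, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a finite
    function class on a fixed sample: E_sigma[ sup_g (1/m) sum_i sigma_i g(z_i) ]."""
    rng = np.random.default_rng(seed)
    m = len(sample)
    # Precompute g(z_i) for every function in the class: shape (|G|, m).
    values = np.array([[g(z) for z in sample] for g in function_class])
    estimates = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)       # Rademacher signs
        estimates.append(np.max(values @ sigma) / m)  # sup over the class
    return float(np.mean(estimates))

# Toy example: threshold functions g_t(z) = 1[z >= t] on 20 points in [0, 1].
sample = np.linspace(0.0, 1.0, 20)
thresholds = np.linspace(0.0, 1.0, 11)
function_class = [lambda z, t=t: float(z >= t) for t in thresholds]
print(empirical_rademacher(sample, function_class))
```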

SLIDE 11

General Strategies: Rademacher Complexity

Theorem

Let G be a family of functions from Z to [0, 1], and let S be a sample of size m drawn from Z according to D. Let $L(g) = \mathbb{E}_{z \sim D}[g(z)]$ and $\hat{L}(g) = \frac{1}{m}\sum_{i=1}^{m} g(z_i)$. Then for any δ > 0, with probability at least 1 − δ over S, for all functions g ∈ G,

$$L(g) \le \hat{L}(g) + 2\hat{R}_S(G) + O\!\left(\sqrt{\frac{\log(1/\delta)}{m}}\right).$$

SLIDE 12

General Strategies: Rademacher Complexity

Computing the empirical Rademacher complexity (RC) of a given H is usually hard or impractical. One usually derives Rademacher complexity upper bounds, for example by using the Dudley entropy integral.
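
For reference, one common form of the Dudley entropy integral bound is sketched below (constants vary between references); here $\mathcal{N}(\varepsilon, G, L_2(S))$ denotes the ε-covering number of G in the empirical $L_2$ metric on S:

$$\hat{R}_S(G) \;\le\; \inf_{\alpha > 0}\left( 4\alpha + \frac{12}{\sqrt{m}} \int_{\alpha}^{\sup_{g \in G}\sqrt{\frac{1}{m}\sum_{i=1}^{m} g(z_i)^2}} \sqrt{\log \mathcal{N}(\varepsilon, G, L_2(S))} \, d\varepsilon \right).$$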

SLIDE 13

Overview

1. Introduction
2. General Strategies to Obtain Generalisation Bounds
3. Survey of Generalisation Bounds for Neural Networks
4. A Compression Approach [Arora et al., 2018]
5. Conclusion, Research Directions

SLIDE 14

Generalisation Bounds for Neural Networks

VC-dimension-based bounds, which usually amount to parameter counting [Goldberg and Jerrum, 1995, Bartlett et al., 1999, Bartlett et al., 2017b].

Bounds that depend on the norm of the linear transformations [Bartlett, 1997].

Spectrally-normalised margin-based bounds [Bartlett et al., 2017a].

PAC-Bayesian approach to margin-based bounds [Neyshabur et al., 2017].

SLIDE 15

Overview

1. Introduction
2. General Strategies to Obtain Generalisation Bounds
3. Survey of Generalisation Bounds for Neural Networks
4. A Compression Approach [Arora et al., 2018]
5. Conclusion, Research Directions

SLIDE 16

Compression Approach: Overview

Paper: Stronger generalization bounds for deep nets via a compression approach. Sanjeev Arora, Rong Ge, Behnam Neyshabur and Yi Zhang.

Two methods, both based on compressing a network by representing its weight matrices with fewer parameters.

1. Define compressibility of a function f via G, a (finite) set of functions, and derive a generalisation bound that relates the losses of f and G.

f: a neural network; G: a class of neural networks that have fewer parameters and that can approximate f. Results in the same bound as in [Neyshabur et al., 2017].

2. A different compression framework based on random projections, together with noise stability properties of the network, gives tighter generalisation bounds than the first method.

Can be adapted to convolutional neural networks.

SLIDE 17

Compressed networks: Method 1

Compression framework: define a notion of compressibility with respect to an approximation parameter γ > 0 and a sample S.

Definition

Let $f : \mathbb{R}^d \to \mathbb{R}^k$ and $G_{\mathcal{A}} := \{ g_A : \mathbb{R}^d \to \mathbb{R}^k \mid A \in \mathcal{A} \}$, where $\mathcal{A}$ is a set of parameters.

We say that f is (γ, S)-compressible via $G_{\mathcal{A}}$ if there exists $A \in \mathcal{A}$ such that for all x in the sample S,

$$\| f(x) - g_A(x) \|_\infty \le \gamma.$$

SLIDE 18

Compressed networks: Method 1

Definition

f is (γ, S)-compressible via $G_{\mathcal{A}}$ if there exists $A \in \mathcal{A}$ such that for all x in the sample S, $\| f(x) - g_A(x) \|_\infty \le \gamma$.

Theorem

Let $G_{\mathcal{A}} := \{ g_A \mid A \in \mathcal{A} \}$, where $\mathcal{A}$ is a set of q parameters, each of which can take at most r discrete values. Let S be a training set of m samples. For any margin γ > 0, if f is (γ, S)-compressible via $G_{\mathcal{A}}$, then there exists $A \in \mathcal{A}$ such that with high probability over S,

$$L_0(g_A) \le \hat{L}_\gamma(f) + O\!\left(\sqrt{\frac{q \log r}{m}}\right).$$
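
A hedged sketch of where this form comes from (not the paper's exact argument): since each of the q parameters takes at most r values, $|G_{\mathcal{A}}| \le r^q$, so Hoeffding's inequality plus a union bound give, for any $\epsilon > 0$,

$$\mathbb{P}\Big[\exists\, A \in \mathcal{A} : L_0(g_A) > \hat{L}_0(g_A) + \epsilon\Big] \le r^q e^{-2m\epsilon^2},$$

which is at most δ for $\epsilon = \sqrt{\frac{q \log r + \log(1/\delta)}{2m}}$; compressibility is then used to relate $\hat{L}_0(g_A)$ to the empirical margin loss $\hat{L}_\gamma(f)$ of the original network.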

SLIDE 19

Compressed networks: Method 1

How do we compress a neural network and apply this theorem?

Compression scheme: a low-rank approximation of the weight matrices ⟹ the weight matrices can be represented using fewer parameters.

The choice of the reconstruction error ensures that the compressed network approximates the original network.

Discretise the weights and define the class $G_{\mathcal{A}}$.

Theorem

Let $S \sim D^m$ and let γ > 0. Consider a neural network f of depth L with linear transformations $A_1, \dots, A_L$, and let h denote the number of hidden units per layer. Then with high probability over S,

$$L_0(f) \le \hat{L}_\gamma(f) + \tilde{O}\!\left(\sqrt{\frac{h L^2 \max_{x \in S} \|x\|^2 \; \prod_{i=1}^{L} \|A_i\|_2^2 \; \sum_{i=1}^{L} \frac{\|A_i\|_F^2}{\|A_i\|_2^2}}{\gamma^2 m}}\right).$$
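
As an illustration of this compression step, here is a minimal sketch using a truncated SVD on a small random ReLU network; the architecture, rank choice and helper names are assumptions of the sketch, not the paper's exact algorithm.

```python
import numpy as np

def low_rank_compress(weights, rank):
    """Replace each weight matrix by its best rank-`rank` approximation (truncated SVD).

    A rank-r factorisation of an n x d matrix needs r * (n + d) numbers instead of
    n * d, which is what makes the compressed class smaller.
    """
    compressed = []
    for A in weights:
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        compressed.append((U[:, :rank] * s[:rank]) @ Vt[:rank, :])
    return compressed

def relu_net(weights, x):
    """Forward pass of a fully connected ReLU network (no biases, linear last layer)."""
    h = x
    for A in weights[:-1]:
        h = np.maximum(A @ h, 0.0)
    return weights[-1] @ h

# Illustrative network: width 100 on inputs of dimension 50, 10 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(100, 50)) / np.sqrt(50),
           rng.normal(size=(100, 100)) / np.sqrt(100),
           rng.normal(size=(10, 100)) / np.sqrt(100)]
x = rng.normal(size=50)

compressed = low_rank_compress(weights, rank=20)
gap = np.max(np.abs(relu_net(weights, x) - relu_net(compressed, x)))
print(f"max output gap on this input: {gap:.3f}")   # plays the role of gamma
```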

SLIDE 20

Compressed networks: Method 1

Theorem

Let $S \sim D^m$ and let γ > 0. Consider a neural network f of depth L with linear transformations $A_1, \dots, A_L$. Then with high probability over S,

$$L_0(f) \le \hat{L}_\gamma(f) + \tilde{O}\!\left(\sqrt{\frac{h L^2 \max_{x \in S} \|x\|^2 \; \prod_{i=1}^{L} \|A_i\|_2^2 \; \sum_{i=1}^{L} \frac{\|A_i\|_F^2}{\|A_i\|_2^2}}{\gamma^2 m}}\right).$$

Some remarks:

γ is used both as the margin for the loss and as the approximation parameter for compressibility.

Although the framework bounds the expected loss of the compressed network $g_A$ by the empirical loss of the original network f, one can show that $g_A$ approximates f on the whole input space and not just on S. This thus gives a generalisation bound for f.

SLIDE 21

Compressed networks: Method 2

Two main ideas:

1. Define neural network properties, which are related to noise stability and empirical observations.

2. Randomly project the linear transformations onto a lower-dimensional subspace (Johnson-Lindenstrauss transformation).

3. Use (1) and (2) to derive a tighter generalisation bound.

SLIDE 22

Compressed networks: Method 2

Examples of neural network properties:

µ_i (layer cushion): ≈ reciprocal of noise sensitivity.

c (activation contraction): relates to the percentage of ReLU units that are activated (in practice ≈ 1/2).

These properties relate to noise sensitivity and empirical observations.
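
A sketch of how a layer-cushion-like quantity could be measured on a sample is given below. It follows the spirit of the definition in Arora et al. (the largest µ_i with µ_i ‖A_i‖_F ‖h‖ ≤ ‖A_i h‖ for every activation h entering layer i on the sample), but the exact definition and constants should be checked against the paper.

```python
import numpy as np

def layer_cushions(weights, X):
    """Empirical layer cushions: for each layer i, the largest mu_i such that
    mu_i * ||A_i||_F * ||h|| <= ||A_i h|| for every pre-layer activation h seen on X."""
    cushions = []
    H = X                                            # activations entering layer 1
    for A in weights:
        fro = np.linalg.norm(A, "fro")
        pre = H @ A.T                                # A_i h for every sample
        ratios = np.linalg.norm(pre, axis=1) / (fro * np.linalg.norm(H, axis=1) + 1e-12)
        cushions.append(float(ratios.min()))         # worst case over the sample
        H = np.maximum(pre, 0.0)                     # ReLU output feeds the next layer
    return cushions

rng = np.random.default_rng(0)
weights = [rng.normal(size=(100, 50)) / np.sqrt(50),
           rng.normal(size=(100, 100)) / np.sqrt(100)]
X = rng.normal(size=(200, 50))                       # a batch standing in for the sample S
print(layer_cushions(weights, X))
```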

SLIDE 23

Compressed networks: Method 2

General idea for the random projections:

Perturb the weight matrices by a random projection onto a lower-dimensional subspace.

Prove that the output of the network isn't changed much. This follows from the noise stability properties mentioned on the previous slide and from the Johnson-Lindenstrauss transformation.

The network can then be represented with far fewer parameters.

Use standard tools to get a generalisation bound: the Dudley entropy integral to bound the empirical Rademacher complexity of the margin loss function on the compressed network.
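
The sketch below illustrates the flavour of this random-projection step: each weight matrix is replaced by a combination of k random sign matrices, so only the k coefficients (plus a shared random seed) need to be stored. Scaling and other details are simplified assumptions, not the paper's exact algorithm.

```python
import numpy as np

def project_matrix(A, k, seed=0):
    """Compress A into k coefficients <A, M_t> against random sign matrices M_t and
    return the reconstruction (1/k) * sum_t <A, M_t> M_t.

    The M_t can be regenerated from the seed (a shared "helper string"), so only the
    k coefficients need to be stored.
    """
    rng = np.random.default_rng(seed)
    A_hat = np.zeros_like(A, dtype=float)
    for _ in range(k):
        M = rng.choice([-1.0, 1.0], size=A.shape)
        A_hat += np.sum(A * M) * M        # <A, M_t> * M_t
    return A_hat / k

rng = np.random.default_rng(1)
A = rng.normal(size=(64, 64))
x = rng.normal(size=64)
for k in (100, 1000, 10000):
    A_hat = project_matrix(A, k)
    # JL-style guarantee: the error is small relative to ||A||_F * ||x||;
    # the layer cushion is what relates this scale back to ||A x||.
    err = np.linalg.norm((A - A_hat) @ x) / (np.linalg.norm(A, "fro") * np.linalg.norm(x))
    print(k, round(err, 3))
```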

SLIDE 24

Compressed networks: Method 2

Theorem

For any fully connected network $f_A$ with $\rho_\delta \ge 3L$, and any margin γ > 0, the random projection algorithm generates weights $\tilde{A}$ such that, with high probability over the training set,

$$L_0(f_{\tilde{A}}) \le \hat{L}_\gamma(f_A) + \tilde{O}\!\left(\sqrt{\frac{c^2 L^2 \max_{x \in S} \|f_A(x)\|_2^2 \; \sum_{i=1}^{L} \frac{1}{\mu_i^2 \mu_{i\to}^2}}{\gamma^2 m}}\right).$$

Here c is the activation contraction, µ_i the layer cushion, µ_{i→} the interlayer cushion, and ρ_δ the interlayer smoothness parameter from the paper.

SLIDE 25

Overview

1. Introduction
2. General Strategies to Obtain Generalisation Bounds
3. Survey of Generalisation Bounds for Neural Networks
4. A Compression Approach [Arora et al., 2018]
5. Conclusion, Research Directions

SLIDE 26

Conclusion

Two different frameworks to compress neural networks and get better generalisation bounds.

One recovers the result from [Neyshabur et al., 2017].

The other gives a tighter bound and performs well in practice. It can be extended to convolutional neural networks.

SLIDE 27

Research Directions

Other compression approaches: weight pruning, computational unit pruning, etc.

Current and future work:

Can we get better bounds? The current ones are not useful in practice.

Can we develop notions of and guarantees for adversarial generalisation? [Yin et al., 2018, Cullina et al., 2018]

Can algorithmic stability offer better bounds and explanations? [Bousquet and Elisseeff, 2002, Hardt et al., 2016]

SLIDE 28

References I

Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296.

Bartlett, P. L. (1997). For valid generalization the size of the weights is more important than the size of the network. In Advances in Neural Information Processing Systems, pages 134-140.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017a). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240-6249.

SLIDE 29

References II

Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. (2017b). Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv preprint arXiv:1703.02930.

Bartlett, P. L., Maiorov, V., and Meir, R. (1999). Almost linear VC dimension bounds for piecewise polynomial networks. In Advances in Neural Information Processing Systems, pages 190-196.

Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2(Mar):499-526.

Cullina, D., Bhagoji, A. N., and Mittal, P. (2018). PAC-learning in the presence of evasion adversaries. Advances in Neural Information Processing Systems.

SLIDE 30

References III

Goldberg, P. W. and Jerrum, M. R. (1995). Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning, 18(2-3):131-148.

Hardt, M., Recht, B., and Singer, Y. (2016). Train faster, generalize better: stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on Machine Learning, pages 1225-1234. JMLR.org.

Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. (2017). A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564.

SLIDE 31

References IV

Yin, D., Ramchandran, K., and Bartlett, P. (2018). Rademacher complexity for adversarially robust generalization. arXiv preprint arXiv:1810.11914.