Foundations of Machine Learning
Learning with Infinite Hypothesis Sets
Motivation
With an infinite hypothesis set H, the error bounds of the previous lecture are not informative.

- Is efficient learning from a finite sample possible when H is infinite? Our example of axis-aligned rectangles shows that it is.
- Can we reduce the infinite case to a finite set? Project over finite samples?
- Are there useful measures of complexity for infinite hypothesis sets?
This lecture
- Rademacher complexity
- Growth function
- VC dimension
- Lower bound
Empirical Rademacher Complexity
Definition:

- G: family of functions mapping from a set Z to [a, b].
- S = (z_1, ..., z_m): a fixed sample.
- σ_i's (Rademacher variables): independent uniform random variables taking values in {−1, +1}.

The empirical Rademacher complexity measures the correlation of G with random noise:

$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\left[\sup_{g \in G} \frac{1}{m}\begin{pmatrix} \sigma_1 \\ \vdots \\ \sigma_m \end{pmatrix} \cdot \begin{pmatrix} g(z_1) \\ \vdots \\ g(z_m) \end{pmatrix}\right] = \mathbb{E}_{\sigma}\left[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\right].$$
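For a finite class, this expectation can be estimated directly by sampling σ. Below is a minimal Monte Carlo sketch; the threshold class used as input is an assumed toy example, not from the lecture.

```python
import numpy as np

def empirical_rademacher(G_values, n_draws=10_000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    G_values: (k, m) array; row j holds (g_j(z_1), ..., g_j(z_m)),
    i.e., the j-th function of a finite class G evaluated on the sample S.
    """
    rng = np.random.default_rng(seed)
    _, m = G_values.shape
    # Draw Rademacher vectors sigma in {-1, +1}^m; for each draw take the
    # supremum of (1/m) sum_i sigma_i g(z_i) over the finite class.
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, m))
    sups = np.max(G_values @ sigmas.T, axis=0)
    return sups.mean() / m

# Assumed toy class: 50 threshold functions x -> sign(x - t) on 20 points.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(size=20))
thresholds = np.linspace(0.0, 1.0, 50)
G_vals = np.where(x[None, :] > thresholds[:, None], 1.0, -1.0)
print(empirical_rademacher(G_vals))  # small: thresholds barely fit the noise
```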
Rademacher Complexity
Definitions: let G be a family of functions mapping from Z to [a, b].

- Empirical Rademacher complexity of G:

$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\left[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\right],$$

where the σ_i's are independent uniform random variables taking values in {−1, +1} and S = (z_1, ..., z_m).

- Rademacher complexity of G:

$$\mathfrak{R}_m(G) = \mathbb{E}_{S \sim D^m}\big[\widehat{\mathfrak{R}}_S(G)\big].$$
Rademacher Complexity Bound
Theorem (Koltchinskii and Panchenko, 2002): Let G be a family of functions mapping from Z to [0, 1]. Then, for any δ > 0, with probability at least 1 − δ, the following holds for all g ∈ G:

$$\mathbb{E}[g(z)] \le \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\mathfrak{R}_m(G) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$$

$$\mathbb{E}[g(z)] \le \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

Proof: Apply McDiarmid's inequality to

$$\Phi(S) = \sup_{g \in G} \mathbb{E}[g] - \widehat{\mathbb{E}}_S[g].$$
- Changing one point of S changes Φ(S) by at most 1/m:

$$\Phi(S') - \Phi(S) = \sup_{g \in G}\{\mathbb{E}[g] - \widehat{\mathbb{E}}_{S'}[g]\} - \sup_{g \in G}\{\mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]\} \le \sup_{g \in G}\big\{\{\mathbb{E}[g] - \widehat{\mathbb{E}}_{S'}[g]\} - \{\mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]\}\big\} = \sup_{g \in G}\{\widehat{\mathbb{E}}_S[g] - \widehat{\mathbb{E}}_{S'}[g]\} = \sup_{g \in G} \frac{1}{m}\big(g(z_m) - g(z'_m)\big) \le \frac{1}{m}.$$

- Thus, by McDiarmid's inequality, with probability at least 1 − δ/2,

$$\Phi(S) \le \mathbb{E}_S[\Phi(S)] + \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

- We are left with bounding the expectation.
- Series of observations:

$$\begin{aligned}
\mathbb{E}_S[\Phi(S)] &= \mathbb{E}_S\Big[\sup_{g \in G} \mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]\Big] = \mathbb{E}_S\Big[\sup_{g \in G} \mathbb{E}_{S'}\big[\widehat{\mathbb{E}}_{S'}[g] - \widehat{\mathbb{E}}_S[g]\big]\Big] \\
&\le \mathbb{E}_{S,S'}\Big[\sup_{g \in G} \widehat{\mathbb{E}}_{S'}[g] - \widehat{\mathbb{E}}_S[g]\Big] && \text{(sub-add. of sup)} \\
&= \mathbb{E}_{S,S'}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \big(g(z'_i) - g(z_i)\big)\Big] \\
&= \mathbb{E}_{\sigma,S,S'}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i\big(g(z'_i) - g(z_i)\big)\Big] && \text{(swap $z_i$ and $z'_i$)} \\
&\le \mathbb{E}_{\sigma,S'}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z'_i)\Big] + \mathbb{E}_{\sigma,S}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m -\sigma_i g(z_i)\Big] && \text{(sub-add. of sup)} \\
&= 2\,\mathbb{E}_{\sigma,S}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\Big] = 2\mathfrak{R}_m(G).
\end{aligned}$$
- Now, changing one point of S makes $\widehat{\mathfrak{R}}_S(G)$ vary by at most 1/m. Thus, again by McDiarmid's inequality, with probability at least 1 − δ/2,

$$\mathfrak{R}_m(G) \le \widehat{\mathfrak{R}}_S(G) + \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

- Thus, by the union bound, with probability at least 1 − δ,

$$\Phi(S) \le 2\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$
Loss Functions - Hypothesis Set
Proposition: Let H be a family of functions taking values in {−1, +1}, and let G be the family of zero-one loss functions of H:

$$G = \big\{(x, y) \mapsto 1_{h(x) \neq y} : h \in H\big\}.$$

Then,

$$\mathfrak{R}_m(G) = \frac{1}{2}\mathfrak{R}_m(H).$$

Proof: using $1_{h(x) \neq y} = \frac{1}{2}(1 - y h(x))$ and the fact that $-\sigma_i y_i$ has the same distribution as $\sigma_i$,

$$\begin{aligned}
\mathfrak{R}_m(G) &= \mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i 1_{h(x_i) \neq y_i}\Big] = \mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i \tfrac{1}{2}\big(1 - y_i h(x_i)\big)\Big] \\
&= \frac{1}{2}\,\mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m -\sigma_i y_i h(x_i)\Big] = \frac{1}{2}\,\mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\Big] = \frac{1}{2}\mathfrak{R}_m(H).
\end{aligned}$$
Generalization Bounds - Rademacher
Corollary: Let H be a family of functions taking values in {−1, +1}. Then, for any δ > 0, with probability at least 1 − δ, for any h ∈ H,

$$R(h) \le \widehat{R}(h) + \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$$

$$R(h) \le \widehat{R}(h) + \widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$
Remarks
- The first bound is distribution-dependent, the second data-dependent, which makes the latter attractive.
- But how do we compute the empirical Rademacher complexity? Computing $\mathbb{E}_{\sigma}\big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\big]$ requires solving ERM problems, which is typically computationally hard.
- Is there a relation with combinatorial measures that are easier to compute?
This lecture
- Rademacher complexity
- Growth function
- VC dimension
- Lower bound
Growth Function
Definition: the growth function $\Pi_H \colon \mathbb{N} \to \mathbb{N}$ for a hypothesis set H is defined by

$$\forall m \in \mathbb{N}, \quad \Pi_H(m) = \max_{\{x_1, \ldots, x_m\} \subseteq X} \Big|\big\{\big(h(x_1), \ldots, h(x_m)\big) : h \in H\big\}\Big|.$$

Thus, $\Pi_H(m)$ is the maximum number of ways m points can be classified using H.
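For simple classes and small m, this quantity can be evaluated by brute force: enumerate the distinct label vectors induced on a sample. A sketch for intervals on the real line (an assumed example; for the three points chosen, a small grid of endpoints realizes every achievable dichotomy):

```python
import itertools

def num_dichotomies(points, hypotheses):
    """Number of distinct label vectors induced on `points` by `hypotheses`."""
    return len({tuple(h(x) for x in points) for h in hypotheses})

# Intervals on the real line: h_{a,b}(x) = +1 iff a <= x <= b, else -1.
def interval(a, b):
    return lambda x: 1 if a <= x <= b else -1

points = [0.1, 0.4, 0.7]
cuts = [-1.0, 0.2, 0.5, 0.8, 2.0]  # endpoints between/around the points
H = [interval(a, b) for a, b in itertools.product(cuts, repeat=2)]
print(num_dichotomies(points, H))  # 7 < 2^3 = 8: three points not shattered
```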
Massart’s Lemma
Theorem (Massart, 2000): Let $A \subseteq \mathbb{R}^m$ be a finite set, with $R = \max_{x \in A} \|x\|_2$. Then, the following holds:

$$\mathbb{E}_{\sigma}\Big[\frac{1}{m}\sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\Big] \le \frac{R\sqrt{2 \log |A|}}{m}.$$

Proof: for any t > 0,

$$\begin{aligned}
\exp\Big(t\,\mathbb{E}_{\sigma}\Big[\sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\Big]\Big) &\le \mathbb{E}_{\sigma}\Big[\exp\Big(t \sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\Big)\Big] && \text{(Jensen's ineq.)} \\
&= \mathbb{E}_{\sigma}\Big[\sup_{x \in A} \exp\Big(t \sum_{i=1}^m \sigma_i x_i\Big)\Big] \le \sum_{x \in A} \mathbb{E}_{\sigma}\Big[\exp\Big(t \sum_{i=1}^m \sigma_i x_i\Big)\Big] \\
&= \sum_{x \in A} \prod_{i=1}^m \mathbb{E}_{\sigma}\big(\exp[t \sigma_i x_i]\big) \le \sum_{x \in A} \exp\Big(\sum_{i=1}^m \frac{t^2 (2|x_i|)^2}{8}\Big) && \text{(Hoeffding's ineq.)} \\
&\le |A|\, e^{\frac{t^2 R^2}{2}}.
\end{aligned}$$
- Taking the log yields:

$$\mathbb{E}_{\sigma}\Big[\sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\Big] \le \frac{\log |A|}{t} + \frac{t R^2}{2}.$$

- Minimizing the bound by choosing $t = \frac{\sqrt{2 \log |A|}}{R}$ gives

$$\mathbb{E}_{\sigma}\Big[\sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\Big] \le R\sqrt{2 \log |A|}.$$
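The lemma lends itself to a quick numerical sanity check: average the supremum over sampled σ and compare against $R\sqrt{2 \log |A|}$. A sketch with a randomly generated finite set A (an assumed example):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_vecs = 10, 32
A = rng.normal(size=(n_vecs, m))                  # finite set A in R^m
R = np.linalg.norm(A, axis=1).max()               # R = max_{x in A} ||x||_2

draws = rng.choice([-1.0, 1.0], size=(20_000, m)) # Rademacher vectors sigma
lhs = np.max(A @ draws.T, axis=0).mean()          # E_sigma[sup_x sum sigma_i x_i]
rhs = R * np.sqrt(2 * np.log(n_vecs))
print(lhs, "<=", rhs)                             # the bound holds, with slack
```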
Growth Function Bound on Rad. Complexity
Corollary: Let G be a family of functions taking values in {−1, +1}. Then, the following holds:

$$\mathfrak{R}_m(G) \le \sqrt{\frac{2 \log \Pi_G(m)}{m}}.$$

Proof: since the vectors $(g(z_1), \ldots, g(z_m))$ have entries in {−1, +1}, their $L_2$ norm is $\sqrt{m}$, and Massart's lemma gives

$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\Big] \le \frac{\sqrt{m}\sqrt{2 \log \big|\{(g(z_1), \ldots, g(z_m)) : g \in G\}\big|}}{m} \le \frac{\sqrt{m}\sqrt{2 \log \Pi_G(m)}}{m} = \sqrt{\frac{2 \log \Pi_G(m)}{m}}.$$
Generalization Bound - Growth Function
Corollary: Let H be a family of functions taking values in {−1, +1}. Then, for any δ > 0, with probability at least 1 − δ, for any h ∈ H,

$$R(h) \le \widehat{R}(h) + \sqrt{\frac{2 \log \Pi_H(m)}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

But how do we compute the growth function? What is its relationship with the VC-dimension (Vapnik-Chervonenkis dimension)?
This lecture
- Rademacher complexity
- Growth function
- VC dimension
- Lower bound
VC Dimension
Definition (Vapnik & Chervonenkis, 1968-1971; Vapnik, 1982, 1995, 1998): the VC-dimension of a hypothesis set H is defined by

$$\mathrm{VCdim}(H) = \max\{m : \Pi_H(m) = 2^m\}.$$

Thus, the VC-dimension is the size of the largest set that can be fully shattered by H. It is a purely combinatorial notion.
Examples
In the following, we determine the VC dimension of several hypothesis sets. To give a lower bound d on VCdim(H), it suffices to exhibit a set S of cardinality d that can be shattered by H. To give an upper bound, we need to prove that no set S of cardinality d + 1 can be shattered by H, which is typically more difficult.
Intervals of the Real Line
Observations:

- Any set of two points can be shattered by four intervals.
- No set of three points can be shattered, since the dichotomy (+, −, +) is not realizable (by definition of intervals).
- Thus, VCdim(intervals in ℝ) = 2.
Hyperplanes
Observations:

- Any three non-collinear points in the plane can be shattered by halfplanes.
- For four points, some dichotomy is always unrealizable: if one point lies inside the triangle formed by the other three, label it oppositely to the rest; otherwise the four points form a quadrilateral, and assigning opposite labels to the two diagonals is not realizable.
- Thus, VCdim(hyperplanes in ℝᵈ) = d + 1.
Axis-Aligned Rectangles in the Plane
Observations:

- Four points, one extreme in each direction (e.g., a diamond configuration), can be shattered; both this claim and the next are checked in the sketch after this list.
- No set of five points can be shattered: label negatively the point that is not near the sides, i.e., one lying within the minimal rectangle enclosing the other four, and label the other four positively; any rectangle containing the four positive points then also contains the negative one.
- Thus, VCdim(axis-aligned rectangles) = 4.
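A sketch verifying both claims mechanically, based on the observation that a labeling is realizable iff the minimal axis-aligned rectangle enclosing the positive points contains no negative point:

```python
import itertools

def rect_shatters(points):
    """True iff axis-aligned rectangles realize every dichotomy of `points`."""
    for labels in itertools.product([False, True], repeat=len(points)):
        pos = [p for p, lab in zip(points, labels) if lab]
        if not pos:
            continue  # an empty rectangle realizes the all-negative labeling
        # Minimal axis-aligned rectangle enclosing the positive points.
        x0, x1 = min(p[0] for p in pos), max(p[0] for p in pos)
        y0, y1 = min(p[1] for p in pos), max(p[1] for p in pos)
        # The labeling fails iff this rectangle captures a negative point.
        if any(x0 <= p[0] <= x1 and y0 <= p[1] <= y1
               for p, lab in zip(points, labels) if not lab):
            return False
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # one extreme per direction
print(rect_shatters(diamond))                 # True: VCdim >= 4
print(rect_shatters(diamond + [(0, 0)]))      # False: the middle point fails
```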
Convex Polygons in the Plane
Observations:

- 2d + 1 points on a circle can be shattered by convex d-gons: if there are more negative points than positive ones, take the polygon whose vertices are the positive points; if there are more positive points than negative ones, take a polygon with one edge cutting off each negative point.
- It can be shown that choosing the points on the circle maximizes the number of possible dichotomies. Thus, VCdim(convex d-gons) = 2d + 1.
- Also, VCdim(convex polygons) = +∞.
Sine Functions
Observations:

- Any finite set of points on the real line can be shattered by $\{t \mapsto \sin(\omega t) : \omega \in \mathbb{R}\}$ (labels given by the sign of $\sin(\omega t)$).
- Thus, VCdim(sine functions) = +∞.
Sauer’s Lemma
Theorem (Vapnik & Chervonenkis, 1968-1971; Sauer, 1972): let H be a hypothesis set with VCdim(H) = d. Then, for all m ∈ ℕ,

$$\Pi_H(m) \le \sum_{i=0}^{d} \binom{m}{i}.$$

Proof: the proof is by induction on m + d. The statement clearly holds for m = 1 and d = 0 or d = 1. Assume that it holds for (m − 1, d) and (m − 1, d − 1).

- Fix a set S = {x_1, ..., x_m} with Π_H(m) dichotomies and let G = H_{|S} be the set of concepts H induces by restriction to S.
- Consider the following families over S′ = {x_1, ..., x_{m−1}}, identifying each concept with the subset of points it labels positively:

$$G_1 = G_{|S'} \qquad G_2 = \big\{g' \subseteq S' : (g' \in G) \wedge (g' \cup \{x_m\} \in G)\big\}.$$

(G_2 collects the concepts over S′ that appear in G both with and without x_m.)

- Observe that |G_1| + |G_2| = |G|.
- Since VCdim(G_1) ≤ d, by the induction hypothesis,

$$|G_1| \le \Pi_{G_1}(m-1) \le \sum_{i=0}^{d} \binom{m-1}{i}.$$

- By definition of G_2, if a set Z ⊆ S′ is shattered by G_2, then the set Z ∪ {x_m} is shattered by G. Thus, VCdim(G_2) ≤ VCdim(G) − 1 = d − 1, and by the induction hypothesis,

$$|G_2| \le \Pi_{G_2}(m-1) \le \sum_{i=0}^{d-1} \binom{m-1}{i}.$$

- Thus,

$$|G| \le \sum_{i=0}^{d} \binom{m-1}{i} + \sum_{i=0}^{d-1} \binom{m-1}{i} = \sum_{i=0}^{d} \left[\binom{m-1}{i} + \binom{m-1}{i-1}\right] = \sum_{i=0}^{d} \binom{m}{i}.$$
Sauer’s Lemma - Consequence
Corollary: let H be a hypothesis set with VCdim(H) = d. Then, for all m ≥ d,

$$\Pi_H(m) \le \Big(\frac{em}{d}\Big)^d = O(m^d).$$

Proof:

$$\sum_{i=0}^{d} \binom{m}{i} \le \sum_{i=0}^{d} \binom{m}{i}\Big(\frac{m}{d}\Big)^{d-i} \le \sum_{i=0}^{m} \binom{m}{i}\Big(\frac{m}{d}\Big)^{d-i} = \Big(\frac{m}{d}\Big)^{d}\sum_{i=0}^{m} \binom{m}{i}\Big(\frac{d}{m}\Big)^{i} = \Big(\frac{m}{d}\Big)^{d}\Big(1 + \frac{d}{m}\Big)^{m} \le \Big(\frac{m}{d}\Big)^{d} e^{d}.$$
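The chain of inequalities can be spot-checked numerically by comparing Sauer's sum with $(em/d)^d$ over a range of m. A small sketch (d = 4 is an arbitrary choice):

```python
from math import comb, e

def sauer_sum(m, d):
    """Sum_{i=0}^{d} C(m, i), the bound from Sauer's lemma."""
    return sum(comb(m, i) for i in range(d + 1))

d = 4
for m in [4, 8, 16, 32, 64]:
    print(m, sauer_sum(m, d), (e * m / d) ** d)  # Sauer's sum <= (em/d)^d
```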
Remarks
Remarkable property of the growth function:

- either VCdim(H) = d < +∞ and Π_H(m) = O(m^d),
- or VCdim(H) = +∞ and Π_H(m) = 2^m.
Generalization Bound - VC Dimension
Corollary: Let H be a family of functions taking values in {−1, +1} with VC dimension d. Then, for any δ > 0, with probability at least 1 − δ, for any h ∈ H,

$$R(h) \le \widehat{R}(h) + \sqrt{\frac{2d \log \frac{em}{d}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

Proof: the growth function corollary combined with Sauer's lemma.

Note: the general form of the result is

$$R(h) \le \widehat{R}(h) + O\left(\sqrt{\frac{\log(m/d)}{m/d}}\right).$$
Comparison - Standard VC Bound
Theorem (Vapnik & Chervonenkis, 1971; Vapnik, 1982): Let H be a family of functions taking values in {−1, +1} with VC dimension d. Then, for any δ > 0, with probability at least 1 − δ, for any h ∈ H,

$$R(h) \le \widehat{R}(h) + \sqrt{\frac{8d \log \frac{2em}{d} + 8 \log \frac{4}{\delta}}{m}}.$$

Proof: derived from the growth function bound

$$\Pr\Big[\big|R(h) - \widehat{R}(h)\big| > \epsilon\Big] \le 4\,\Pi_H(2m)\exp\Big(-\frac{m\epsilon^2}{8}\Big).$$
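For concrete values of d, m, and δ, the two bounds can be compared directly; the Rademacher-based route gives better constants. A sketch evaluating both complexity terms (the values d = 10 and δ = 0.05 are arbitrary choices):

```python
from math import log, sqrt, e

def rademacher_route(m, d, delta):
    # sqrt(2 d log(em/d) / m) + sqrt(log(1/delta) / (2m))
    return sqrt(2 * d * log(e * m / d) / m) + sqrt(log(1 / delta) / (2 * m))

def standard_vc(m, d, delta):
    # sqrt((8 d log(2em/d) + 8 log(4/delta)) / m)
    return sqrt((8 * d * log(2 * e * m / d) + 8 * log(4 / delta)) / m)

for m in [1_000, 10_000, 100_000]:
    print(m, rademacher_route(m, 10, 0.05), standard_vc(m, 10, 0.05))
```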
This lecture
- Rademacher complexity
- Growth function
- VC dimension
- Lower bound
VCDim Lower Bound - Realizable Case
Theorem (Ehrenfeucht et al., 1988): let H be a hypothesis set with VC dimension d > 1. Then, for any learning algorithm L,

$$\exists D, \exists f \in H, \quad \Pr_{S \sim D^m}\Big[R_D(h_S, f) > \frac{d-1}{32m}\Big] \ge 1/100.$$

Proof: choose D such that L can do no better than tossing a coin for some points.

- Let $\overline{X} = \{x_0, x_1, \ldots, x_{d-1}\} \subseteq X$ be a set fully shattered. For any $\epsilon > 0$, define D with support $\overline{X}$ by

$$\Pr_D[x_0] = 1 - 8\epsilon \quad \text{and} \quad \forall i \in [1, d-1], \ \Pr_D[x_i] = \frac{8\epsilon}{d-1}.$$
- We can assume without loss of generality that L makes no error on x_0.
- For a sample S, let $\overline{S}$ denote the set of its elements falling in $X_1 = \{x_1, \ldots, x_{d-1}\}$, and let $\mathcal{S}$ be the set of samples of size m with at most (d − 1)/2 points in $X_1$.
- Fix a sample $S \in \mathcal{S}$. Using $|X_1 - \overline{S}| \ge (d-1)/2$,

$$\mathbb{E}_{f \sim U}[R_D(h_S, f)] = \sum_{f}\sum_{x \in \overline{X}} 1_{h_S(x) \neq f(x)} \Pr[x]\Pr[f] \ge \sum_{f}\sum_{x \notin \overline{S}} 1_{h_S(x) \neq f(x)} \Pr[x]\Pr[f] = \sum_{x \notin \overline{S}}\Big(\sum_{f} 1_{h_S(x) \neq f(x)} \Pr[f]\Big)\Pr[x] = \frac{1}{2}\sum_{x \notin \overline{S}} \Pr[x] \ge \frac{1}{2}\cdot\frac{d-1}{2}\cdot\frac{8\epsilon}{d-1} = 2\epsilon.$$
- Since the inequality holds for all $S \in \mathcal{S}$, it also holds in expectation: $\mathbb{E}_{S \in \mathcal{S}, f \sim U}[R_D(h_S, f)] \ge 2\epsilon$. This implies that there exists a labeling $f_0$ such that $\mathbb{E}_{S \in \mathcal{S}}[R_D(h_S, f_0)] \ge 2\epsilon$.
- Since $\Pr_D[\overline{X} - \{x_0\}] \le 8\epsilon$, we also have $R_D(h_S, f_0) \le 8\epsilon$. Thus,

$$2\epsilon \le \mathbb{E}_{S \in \mathcal{S}}[R_D(h_S, f_0)] \le 8\epsilon \Pr_{S \in \mathcal{S}}[R_D(h_S, f_0) \ge \epsilon] + \epsilon\big(1 - \Pr_{S \in \mathcal{S}}[R_D(h_S, f_0) \ge \epsilon]\big).$$

- Collecting terms in $\Pr_{S \in \mathcal{S}}[R_D(h_S, f_0) \ge \epsilon]$, we obtain:

$$\Pr_{S \in \mathcal{S}}[R_D(h_S, f_0) \ge \epsilon] \ge \frac{1}{7\epsilon}(2\epsilon - \epsilon) = \frac{1}{7}.$$

- Thus, the probability over all samples S (not necessarily in $\mathcal{S}$) can be lower bounded as

$$\Pr_{S}[R_D(h_S, f_0) \ge \epsilon] \ge \Pr_{S \in \mathcal{S}}[R_D(h_S, f_0) \ge \epsilon] \Pr[\mathcal{S}] \ge \frac{1}{7}\Pr[\mathcal{S}].$$
- This leads us to seek a lower bound on $\Pr[\mathcal{S}]$. Let $S_m$ denote the number of sample points falling in $X_1$. The probability that more than (d − 1)/2 points be drawn in a sample of size m verifies the multiplicative Chernoff bound: for any $\gamma > 0$,

$$1 - \Pr[\mathcal{S}] = \Pr[S_m \ge 8\epsilon m(1 + \gamma)] \le e^{-\frac{8\epsilon m \gamma^2}{3}}.$$

- Thus, for $\epsilon = (d-1)/(32m)$ and $\gamma = 1$,

$$\Pr\Big[S_m \ge \frac{d-1}{2}\Big] \le e^{-(d-1)/12} \le e^{-1/12} \le 1 - 7\delta,$$

for $\delta \le .01$. Thus, $\Pr[\mathcal{S}] \ge 7\delta$ and

$$\Pr_{S}[R_D(h_S, f_0) \ge \epsilon] \ge \delta.$$
Agnostic PAC Model
Definition: a concept class C is PAC-learnable if there exists a learning algorithm L such that:

- for all $c \in C$, $\epsilon > 0$, $\delta > 0$, and all distributions D,

$$\Pr_{S \sim D^m}\Big[R(h_S) - \inf_{h \in H} R(h) \le \epsilon\Big] \ge 1 - \delta,$$

- for samples S of size $m = \mathrm{poly}(1/\epsilon, 1/\delta)$ for a fixed polynomial.
VCDim Lower Bound - Non-Realizable Case

Theorem (Anthony and Bartlett, 1999): let H be a hypothesis set with VC dimension d > 1. Then, for any learning algorithm L,

$$\exists D \text{ over } X \times \{0, 1\}, \quad \Pr_{S \sim D^m}\Big[R_D(h_S) - \inf_{h \in H} R_D(h) > \sqrt{\frac{d}{320m}}\Big] \ge 1/64.$$

Equivalently, for any learning algorithm, the sample complexity verifies

$$m \ge \frac{d}{320\epsilon^2}.$$
References
- Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
- Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), Volume 36, Issue 4, 1989.
- A. Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. A general lower bound on the number of examples needed for learning. Proceedings of the 1st COLT, pp. 139-154, 1988.
- Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1), 2002.
- Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, IX:245-303, 2000.
- N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145-147, 1972.
References
- Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, 1982.
- Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
- Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
- Vladimir N. Vapnik and Alexey Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow (in Russian), 1974.
- Vladimir N. Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.