
Machine Learning Theory: Regression. Hamid Beigy, Sharif University of Technology.



  1. Machine Learning Theory: Regression. Hamid Beigy, Sharif University of Technology. June 1, 2020.

  2. Table of contents
1. Introduction
2. Generalization bounds
3. Pseudo-dimension bounds
4. Regression algorithms
5. Summary

  3. Introduction

  4. The problem of regression

◮ Let $X$ denote the input space, $Y$ a measurable subset of $\mathbb{R}$, and $D$ a distribution over $X \times Y$.
◮ The learner receives a sample $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \in (X \times Y)^m$ drawn i.i.d. according to $D$.
◮ Let $L : Y \times Y \to \mathbb{R}_+$ be the loss function used to measure the magnitude of error.
◮ The most used loss function is $L_2$, defined as $L(y, y') = |y' - y|^2$ for all $y, y' \in Y$, or more generally $L_p$, defined as $L(y, y') = |y' - y|^p$ for all $p \ge 1$ and $y, y' \in Y$.
◮ The regression problem is defined as follows.

Definition (Regression problem). Given a hypothesis set $H$ of functions mapping $X$ to $Y$, the regression problem consists of using the labeled sample $S$ to find a hypothesis $h \in H$ with small generalization error $R(h)$ with respect to the target $f$:
$$R(h) = \mathbb{E}_{(x, y) \sim D}[L(h(x), y)].$$
The empirical loss or error of $h \in H$ is denoted by
$$\hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i).$$

◮ If $L(y, y') \le M$ for all $y, y' \in Y$, the problem is called a bounded regression problem.
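The empirical risk is just an average of pointwise losses; here is a minimal NumPy sketch (my illustration, not from the slides; the hypothesis and data are hypothetical):

```python
import numpy as np

def empirical_risk(h, X, y, p=2):
    """Empirical L_p risk: (1/m) * sum_i |h(x_i) - y_i|^p."""
    return np.mean(np.abs(h(X) - y) ** p)

# Hypothetical linear hypothesis h(x) = 2x on three labeled points:
X = np.array([0.0, 1.0, 2.0])
y = np.array([0.1, 1.9, 4.2])
print(empirical_risk(lambda x: 2 * x, X, y))  # 0.02 (mean squared error)
```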

  5. Generalization bounds

  6. Finite hypothesis sets

Theorem (Generalization bounds for finite hypothesis sets). Let $L \le M$ be a bounded loss function and let the hypothesis set $H$ be finite. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds for all $h \in H$:
$$R(h) \le \hat{R}(h) + M \sqrt{\frac{\log|H| + \log\frac{1}{\delta}}{2m}}.$$

Proof (Generalization bounds for finite hypothesis sets). By Hoeffding's inequality, since $L \in [0, M]$, for any $h \in H$ the following holds:
$$\mathbb{P}\big[R(h) - \hat{R}(h) > \epsilon\big] \le \exp\left(\frac{-2m\epsilon^2}{M^2}\right).$$
Thus, by the union bound, we can write
$$\mathbb{P}\big[\exists h \in H : R(h) - \hat{R}(h) > \epsilon\big] \le \sum_{h \in H} \mathbb{P}\big[R(h) - \hat{R}(h) > \epsilon\big] \le |H| \exp\left(\frac{-2m\epsilon^2}{M^2}\right).$$
Setting the right-hand side equal to $\delta$ and solving for $\epsilon$ proves the theorem.
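To see the scale of this bound, a short sketch (my addition; the function name and numbers are illustrative) that evaluates its right-hand side:

```python
import numpy as np

def finite_class_bound(emp_risk, H_size, m, M=1.0, delta=0.05):
    """R(h) <= emp_risk + M * sqrt((log|H| + log(1/delta)) / (2m))."""
    return emp_risk + M * np.sqrt((np.log(H_size) + np.log(1 / delta)) / (2 * m))

# e.g. 1000 hypotheses, 5000 samples, empirical risk 0.1:
print(finite_class_bound(0.1, 1000, 5000))  # ~0.1 + 0.031
```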

  7. Rademacher complexity bounds

Theorem (Rademacher complexity of $\mu$-Lipschitz loss functions). Let $L \le M$ be a bounded loss function such that for any fixed $y' \in Y$, $L(\cdot, y')$ is $\mu$-Lipschitz for some $\mu > 0$. Then for any sample $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, the empirical Rademacher complexity of the family $G = \{(x, y) \mapsto L(h(x), y) \mid h \in H\}$ is upper bounded as
$$\hat{\mathfrak{R}}(G) \le \mu \, \hat{\mathfrak{R}}(H).$$

Proof (Rademacher complexity of $\mu$-Lipschitz loss functions). Since for any fixed $y_i$, $L(\cdot, y_i)$ is $\mu$-Lipschitz for some $\mu > 0$, by Talagrand's lemma we can write
$$\hat{\mathfrak{R}}(G) = \frac{1}{m} \mathbb{E}_{\sigma}\left[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i L(h(x_i), y_i)\right] \le \frac{1}{m} \mathbb{E}_{\sigma}\left[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i \, \mu \, h(x_i)\right] = \mu \, \hat{\mathfrak{R}}(H).$$
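For a finite hypothesis set, the empirical Rademacher complexity can be estimated by Monte Carlo over the Rademacher variables $\sigma$; a sketch under that finiteness assumption (the function name and array layout are my own):

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite hypothesis set; preds[j, i] = h_j(x_i), shape (|H|, m)."""
    rng = np.random.default_rng(seed)
    n_h, m = preds.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))  # Rademacher variables
    # For each sigma draw, the sup over h of (1/m) * sum_i sigma_i * h(x_i):
    sups = np.max(sigma @ preds.T, axis=1) / m
    return sups.mean()
```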

  8. Rademacher complexity bounds

Theorem (Rademacher complexity of $L_p$ loss functions). Let $p \ge 1$, $G = \{x \mapsto |h(x) - f(x)|^p \mid h \in H\}$, and $|h(x) - f(x)| \le M$ for all $x \in X$ and $h \in H$. Then for any sample $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, the following inequality holds:
$$\hat{\mathfrak{R}}(G) \le p M^{p-1} \hat{\mathfrak{R}}(H).$$

Proof (Rademacher complexity of $L_p$ loss functions). Let $\phi_p : x \mapsto |x|^p$; then $G = \{\phi_p \circ h \mid h \in H'\}$, where $H' = \{x \mapsto h(x) - f(x) \mid h \in H\}$. Since $\phi_p$ is $pM^{p-1}$-Lipschitz over $[-M, M]$, we can apply Talagrand's lemma:
$$\hat{\mathfrak{R}}(G) \le p M^{p-1} \hat{\mathfrak{R}}(H').$$
Now $\hat{\mathfrak{R}}(H')$ can be expressed as
$$\hat{\mathfrak{R}}(H') = \frac{1}{m} \mathbb{E}_{\sigma}\left[\sup_{h \in H} \sum_{i=1}^{m} \big(\sigma_i h(x_i) - \sigma_i f(x_i)\big)\right] = \frac{1}{m} \mathbb{E}_{\sigma}\left[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i)\right] - \frac{1}{m} \mathbb{E}_{\sigma}\left[\sum_{i=1}^{m} \sigma_i f(x_i)\right] = \hat{\mathfrak{R}}(H),$$
where the last equality holds because $\mathbb{E}_{\sigma}\big[\sum_{i=1}^{m} \sigma_i f(x_i)\big] = \sum_{i=1}^{m} \mathbb{E}_{\sigma}[\sigma_i] f(x_i) = 0$.
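A quick numerical sanity check (my addition) of the Lipschitz constant used in the proof, i.e. that $\phi_p(x) = |x|^p$ has maximum slope $pM^{p-1}$ on $[-M, M]$:

```python
import numpy as np

# Check numerically that phi_p(x) = |x|^p is (p * M**(p-1))-Lipschitz on [-M, M].
p, M = 3, 2.0
x = np.linspace(-M, M, 100_001)
secant_slopes = np.abs(np.diff(np.abs(x) ** p)) / np.diff(x)
print(secant_slopes.max(), p * M ** (p - 1))  # ~12.0 vs. 12.0
```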

  9. Rademacher complexity regression bounds

Theorem (Rademacher complexity regression bounds). Let $0 \le L \le M$ be a bounded loss function such that for any fixed $y' \in Y$, $L(\cdot, y')$ is $\mu$-Lipschitz for some $\mu > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, each of the following holds for all $h \in H$:
$$\mathbb{E}_{(x, y) \sim D}[L(h(x), y)] \le \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i) + 2\mu \, \mathfrak{R}_m(H) + M \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$$
$$\mathbb{E}_{(x, y) \sim D}[L(h(x), y)] \le \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i) + 2\mu \, \hat{\mathfrak{R}}(H) + 3M \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

Proof (Rademacher complexity regression bounds). Since for any fixed $y_i$, $L(\cdot, y_i)$ is $\mu$-Lipschitz for some $\mu > 0$, by Talagrand's lemma we can write
$$\hat{\mathfrak{R}}(G) = \frac{1}{m} \mathbb{E}_{\sigma}\left[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i L(h(x_i), y_i)\right] \le \frac{1}{m} \mathbb{E}_{\sigma}\left[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i \, \mu \, h(x_i)\right] = \mu \, \hat{\mathfrak{R}}(H).$$
Combining this inequality with the general Rademacher complexity learning bound completes the proof.
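Putting the pieces together, a hedged sketch (my addition; names are illustrative) of the right-hand side of both bounds, given an estimate of the Rademacher complexity of $H$:

```python
import numpy as np

def regression_bound(emp_risk, rad_H, m, mu=1.0, M=1.0, delta=0.05, empirical=True):
    """emp_risk + 2*mu*Rad(H) + slack; use the 3M*sqrt(log(2/delta)/(2m))
    slack when rad_H is the *empirical* Rademacher complexity."""
    if empirical:
        return emp_risk + 2 * mu * rad_H + 3 * M * np.sqrt(np.log(2 / delta) / (2 * m))
    return emp_risk + 2 * mu * rad_H + M * np.sqrt(np.log(1 / delta) / (2 * m))
```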

  10. Pseudo-dimension bounds

  11. Shattering

◮ The VC dimension is a measure of the complexity of a hypothesis set.
◮ We define shattering for families of real-valued functions.
◮ Let $G$ be a family of loss functions associated to some hypothesis set $H$, where $G = \{z = (x, y) \mapsto L(h(x), y) \mid h \in H\}$.

Definition (Shattering). Let $G$ be a family of functions from a set $Z$ to $\mathbb{R}$. A set $\{z_1, \ldots, z_m\} \subseteq Z$ is said to be shattered by $G$ if there exist $t_1, \ldots, t_m \in \mathbb{R}$ such that
$$\left|\left\{ \big(\mathrm{sgn}(g(z_1) - t_1), \ldots, \mathrm{sgn}(g(z_m) - t_m)\big) \;\middle|\; g \in G \right\}\right| = 2^m.$$
When they exist, the threshold values $t_1, \ldots, t_m$ are said to witness the shattering. In other words, $S$ is shattered by $G$ if there are real numbers $t_1, \ldots, t_m$ such that for every $b \in \{0, 1\}^m$ there is a function $g_b \in G$ with $\mathrm{sgn}(g_b(z_i) - t_i) = b_i$ for all $1 \le i \le m$.
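For a finite family of functions, the definition can be checked by brute force; a small sketch (my addition; sgn is taken as the 0/1 indicator of nonnegativity, matching the $b_i$ encoding above):

```python
def is_shattered(points, funcs, thresholds):
    """Check, for a finite family `funcs`, whether `thresholds` witness the
    shattering of `points`: all 2**m sign patterns must be realized."""
    m = len(points)
    patterns = {tuple(int(g(z) - t >= 0) for z, t in zip(points, thresholds))
                for g in funcs}
    return len(patterns) == 2 ** m
```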

  12. Shattering

◮ Thus, $\{z_1, \ldots, z_m\}$ is shattered if, for some witnesses $t_1, \ldots, t_m$, the family of functions $G$ is rich enough to contain a function going
1. above a subset $A$ of the set of points $J = \{(z_i, t_i) \mid 1 \le i \le m\}$ and
2. below the others $J - A$,
for any choice of the subset $A$.

[Figure: two points $z_1, z_2$ with witness thresholds $t_1, t_2$.]

◮ For any $g \in G$, let $B_g$ be the indicator function of the region below or on the graph of $g$, that is, $B_g(x, y) = \mathrm{sgn}(g(x) - y)$.
◮ Let $B_G = \{B_g \mid g \in G\}$.
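A one-line sketch (my addition) of the below-the-graph indicator $B_g$:

```python
def below_graph_indicator(g):
    """B_g(x, y) = 1 iff (x, y) lies on or below the graph of g, i.e. sgn(g(x) - y)."""
    return lambda x, y: int(g(x) - y >= 0)

# e.g. for g(x) = x**2, the point (1, 0.5) is below the graph, (1, 2) is above:
B = below_graph_indicator(lambda x: x ** 2)
print(B(1.0, 0.5), B(1.0, 2.0))  # 1 0
```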

  13. Pseudo-dimension

◮ The notion of shattering naturally leads to the definition of pseudo-dimension.

Definition (Pseudo-dimension). Let $G$ be a family of functions from $Z$ to $\mathbb{R}$. The pseudo-dimension of $G$, denoted by $\mathrm{Pdim}(G)$, is the size of the largest set shattered by $G$. If no such maximum exists, then $\mathrm{Pdim}(G) = \infty$.

◮ $\mathrm{Pdim}(G)$ coincides with the VC dimension of the corresponding thresholded functions mapping $Z \times \mathbb{R}$ to $\{0, 1\}$:
$$\mathrm{Pdim}(G) = \mathrm{VCdim}\big(\{(z, t) \mapsto \mathbb{1}[g(z) - t > 0] \mid g \in G\}\big).$$

[Figure: a loss curve $L(h(x), y)$ over $z$ with a threshold level $t$, and the indicator of the region where $L(h(x), y) > t$.]

◮ Thus $\mathrm{Pdim}(G) = d$ if there are real numbers $t_1, \ldots, t_d$ and $2^d$ functions $g_b$ that achieve all possible below/above combinations with respect to the $t_i$.
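For a finite family, a brute-force search over subsets and a threshold grid gives a lower bound on the pseudo-dimension; a sketch (my addition; exact only if the grid happens to contain witnesses):

```python
from itertools import combinations, product
import numpy as np

def pdim_lower_bound(funcs, candidates, max_d=3, grid=np.linspace(-2.0, 2.0, 9)):
    """Largest d such that some d-subset of `candidates` is shattered by the
    finite family `funcs`, with witness thresholds searched over `grid`.
    Gives a lower bound on Pdim(funcs)."""
    best = 0
    for d in range(1, max_d + 1):
        shattered_at_d = False
        for pts in combinations(candidates, d):
            for ts in product(grid, repeat=d):
                patterns = {tuple(int(g(z) - t > 0) for z, t in zip(pts, ts))
                            for g in funcs}
                if len(patterns) == 2 ** d:
                    shattered_at_d = True
                    break
            if shattered_at_d:
                break
        if not shattered_at_d:
            break  # no d-subset is shattered, so no larger one can be
        best = d
    return best
```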

  14. Properties of pseudo-dimension

Theorem (Composition with a non-decreasing function). Suppose $G$ is a class of real-valued functions and $\sigma : \mathbb{R} \to \mathbb{R}$ is a non-decreasing function. Let $\sigma(G)$ denote the class $\{\sigma \circ g \mid g \in G\}$. Then $\mathrm{Pdim}(\sigma(G)) \le \mathrm{Pdim}(G)$.

Proof (Composition with a non-decreasing function).
1. For $d \le \mathrm{Pdim}(\sigma(G))$, suppose $\{\sigma \circ g_b \mid b \in \{0, 1\}^d\} \subseteq \sigma(G)$ shatters a set $\{x_1, \ldots, x_d\} \subseteq X$, witnessed by $(t_1, \ldots, t_d)$.
2. By suitably relabeling the $g_b$, for all $b \in \{0, 1\}^d$ and $1 \le i \le d$, we have $\mathrm{sgn}(\sigma(g_b(x_i)) - t_i) = b_i$.
3. For all $1 \le i \le d$, take $y_i = \min\{g_b(x_i) \mid \sigma(g_b(x_i)) \ge t_i,\; b \in \{0, 1\}^d\}$.
4. Since $\sigma$ is non-decreasing, it is straightforward to verify that $\mathrm{sgn}(g_b(x_i) - y_i) = b_i$ for all $b \in \{0, 1\}^d$ and $1 \le i \le d$: if $b_i = 1$ then $g_b(x_i) \ge y_i$ by the definition of $y_i$, while if $b_i = 0$ then $g_b(x_i) \ge y_i$ would force $\sigma(g_b(x_i)) \ge \sigma(y_i) \ge t_i$ by monotonicity, a contradiction. Hence $G$ shatters $\{x_1, \ldots, x_d\}$, witnessed by $(y_1, \ldots, y_d)$.
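A toy numeric check (my addition) of steps 3 and 4 with $\sigma = \tanh$, one point, and two functions:

```python
import numpy as np

# sigma = tanh is non-decreasing. Two functions g_b with values at x_1:
# g_0(x_1) = -0.5, g_1(x_1) = 2.0. The threshold t1 = 0.3 witnesses shattering
# of sigma(G) since tanh(-0.5) < 0.3 <= tanh(2.0); the y1 of step 3 witnesses G.
sigma = np.tanh
g_vals = {0: -0.5, 1: 2.0}                                # g_b(x_1) for b in {0, 1}
t1 = 0.3
y1 = min(v for v in g_vals.values() if sigma(v) >= t1)    # step 3: y1 = 2.0
for b, v in g_vals.items():
    assert int(v - y1 >= 0) == b                          # step 4: sgn(g_b(x_1) - y1) = b
print("witness transferred from t1 to y1")
```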

  15. Pseudo-dimension of vector spaces

◮ A class $G$ of real-valued functions is a vector space if for all $g_1, g_2 \in G$ and any numbers $\lambda, \mu \in \mathbb{R}$, we have $\lambda g_1 + \mu g_2 \in G$.

Theorem (Pseudo-dimension of vector spaces). If $G$ is a vector space of real-valued functions, then $\mathrm{Pdim}(G) = \dim(G)$.

Proof (Pseudo-dimension of vector spaces).
1. Let $B_G$ be the class of below-the-graph indicator functions; we have $\mathrm{Pdim}(G) = \mathrm{VCdim}(B_G)$.
2. But $B_G = \{(x, y) \mapsto \mathrm{sgn}(g(x) - y) \mid g \in G\}$.
3. Hence the functions in $B_G$ are of the form $\mathrm{sgn}(g_1 + g_2)$, where $g_1 = g$ is a function from the vector space and $g_2$ is the fixed function $g_2(x, y) = -y$.
4. The theorem on the VC dimension of vector spaces of functions then shows that $\mathrm{Pdim}(G) = \dim(G)$.

◮ Functions that map into a bounded range do not form a vector space.

Corollary. If $G$ is a subset of a vector space $G'$ of real-valued functions, then $\mathrm{Pdim}(G) \le \dim(G')$.
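As an illustration of the lower-bound direction (my addition), polynomials of degree less than $k$ form a vector space of dimension $k$, and any $k$ distinct points are shattered with witnesses $t_i = 0$, since a degree-$(k-1)$ polynomial can interpolate any sign pattern:

```python
import numpy as np
from itertools import product

# G = polynomials of degree < k is a vector space with dim(G) = k, so the
# theorem gives Pdim(G) = k. Demo: k points, witnesses t_i = 0, and one
# interpolating polynomial per sign pattern b in {0,1}^k.
k = 3
xs = np.array([-1.0, 0.0, 1.0])                       # k distinct points
for b in product([0, 1], repeat=k):
    targets = np.where(np.array(b) == 1, 1.0, -1.0)   # +1 above, -1 below t_i = 0
    coeffs = np.polyfit(xs, targets, deg=k - 1)       # exact fit: Vandermonde is invertible
    assert all(int(np.polyval(coeffs, x) > 0) == bi for x, bi in zip(xs, b))
print("all 2**k sign patterns realized: the k points are shattered")
```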
