SLIDE 1

Machine learning theory

Regression

Hamid Beigy

Sharif university of technology

June 1, 2020

SLIDE 2

Table of contents

  • 1. Introduction
  • 2. Generalization bounds
  • 3. Pseudo-dimension bounds
  • 4. Regression algorithms
  • 5. Summary

1/35

SLIDE 3

Introduction

SLIDE 4

The problem of regression

◮ Let X denote the input space, Y a measurable subset of R, and D a distribution over X × Y.
◮ The learner receives a sample S = {(x_1, y_1), . . . , (x_m, y_m)} ∈ (X × Y)^m drawn i.i.d. according to D.
◮ Let L : Y × Y → R+ be the loss function used to measure the magnitude of error.
◮ The most commonly used loss functions are
  • L_2, defined as L(y, y′) = |y′ − y|^2 for all y, y′ ∈ Y,
  • or more generally L_p, defined as L(y, y′) = |y′ − y|^p for all p ≥ 1 and y, y′ ∈ Y.
◮ The regression problem is defined as follows.

Definition (Regression problem)
Given a hypothesis set H = {h : X → Y}, the regression problem consists of using the labeled sample S to find a hypothesis h ∈ H with small generalization error R(h) with respect to the target f:

R(h) = E_{(x,y)∼D} [L(h(x), y)].

The empirical loss or error of h ∈ H is denoted by

R̂(h) = (1/m) ∑_{i=1}^{m} L(h(x_i), y_i).

◮ If L(y, y′) ≤ M for all y, y′ ∈ Y, the problem is called a bounded regression problem.
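As a quick illustration of the empirical error R̂(h), the sketch below evaluates it for an arbitrary predictor under the L_p loss; the data, the predictor h, and all names are hypothetical.

```python
import numpy as np

def empirical_risk(h, X, y, p=2):
    """Empirical L_p risk  R_hat(h) = (1/m) * sum_i |h(x_i) - y_i|^p  on a sample (X, y)."""
    return np.mean(np.abs(h(X) - y) ** p)

# Hypothetical usage: a fixed linear predictor evaluated on a synthetic sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
h = lambda X: X @ np.array([1.0, -2.0, 0.5])
print(empirical_risk(h, X, y, p=2))
```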

2/35

SLIDE 5

Generalization bounds

SLIDE 6

Finite hypothesis sets

Theorem (Generalization bounds for finite hypothesis sets)
Let L ≤ M be a bounded loss function and let the hypothesis set H be finite. Then, for any δ > 0, with probability at least (1 − δ), the following inequality holds for all h ∈ H:

R(h) ≤ R̂(h) + M √( (log|H| + log(1/δ)) / (2m) ).

Proof (Generalization bounds for finite hypothesis sets).
By Hoeffding's inequality, since L ∈ [0, M], for any h ∈ H the following holds:

P[ R(h) − R̂(h) > ε ] ≤ exp( −2mε² / M² ).

Thus, by the union bound, we can write

P[ ∃h ∈ H : R(h) − R̂(h) > ε ] ≤ ∑_{h∈H} P[ R(h) − R̂(h) > ε ] ≤ |H| exp( −2mε² / M² ).

Setting the right-hand side equal to δ and solving for ε proves the theorem.
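To get a feel for how the deviation term shrinks with the sample size, the sketch below plugs hypothetical values of |H|, M, m, and δ into the bound; the numbers are illustrative only.

```python
import math

def finite_H_deviation(M, H_size, m, delta):
    """Deviation term  M * sqrt((log|H| + log(1/delta)) / (2m))  from the theorem above."""
    return M * math.sqrt((math.log(H_size) + math.log(1.0 / delta)) / (2.0 * m))

# Hypothetical values: |H| = 1000, loss bounded by M = 1, m = 10000 samples, delta = 0.05.
print(finite_H_deviation(M=1.0, H_size=1000, m=10_000, delta=0.05))  # ≈ 0.022
```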

3/35

SLIDE 7

Rademacher complexity bounds

Theorem (Rademacher complexity of µ-Lipschitz loss functions)
Let L ≤ M be a bounded loss function such that, for any fixed y′ ∈ Y, L(·, y′) is µ-Lipschitz for some µ > 0. Then, for any sample S = {(x_1, y_1), . . . , (x_m, y_m)}, the Rademacher complexity of the family G = {(x, y) → L(h(x), y) | h ∈ H} is bounded as

R̂(G) ≤ µ R̂(H).

Proof (Rademacher complexity of µ-Lipschitz loss functions).
Since for any fixed y_i, L(·, y_i) is µ-Lipschitz for some µ > 0, by Talagrand's lemma we can write

R̂(G) = (1/m) E_σ[ sup_{h∈H} ∑_{i=1}^{m} σ_i L(h(x_i), y_i) ] ≤ (1/m) E_σ[ sup_{h∈H} ∑_{i=1}^{m} σ_i µ h(x_i) ] = µ R̂(H).

4/35

SLIDE 8

Rademacher complexity bounds

Theorem (Rademacher complexity of L_p loss functions)
Let p ≥ 1, let G = {x → |h(x) − f(x)|^p | h ∈ H}, and assume |h(x) − f(x)| ≤ M for all x ∈ X and h ∈ H. Then, for any sample S = {(x_1, y_1), . . . , (x_m, y_m)}, the following inequality holds:

R̂(G) ≤ p M^{p−1} R̂(H).

Proof (Rademacher complexity of L_p loss functions).
Let φ_p : x → |x|^p. Then G = {φ_p ◦ h′ | h′ ∈ H′}, where H′ = {x → h(x) − f(x) | h ∈ H}. Since φ_p is pM^{p−1}-Lipschitz over [−M, M], we can apply Talagrand's lemma to obtain R̂(G) ≤ p M^{p−1} R̂(H′). Now, R̂(H′) can be expressed as

R̂(H′) = (1/m) E_σ[ sup_{h∈H} ∑_{i=1}^{m} σ_i (h(x_i) − f(x_i)) ]
       = (1/m) E_σ[ sup_{h∈H} ∑_{i=1}^{m} σ_i h(x_i) ] − (1/m) E_σ[ ∑_{i=1}^{m} σ_i f(x_i) ]
       = R̂(H),

since E_σ[ ∑_{i=1}^{m} σ_i f(x_i) ] = ∑_{i=1}^{m} E_σ[σ_i] f(x_i) = 0.

5/35

SLIDE 9

Rademacher complexity regression bounds

Theorem (Rademacher complexity regression bounds)
Let 0 ≤ L ≤ M be a bounded loss function such that, for any fixed y′ ∈ Y, L(·, y′) is µ-Lipschitz for some µ > 0. Then, for any δ > 0, with probability at least (1 − δ), each of the following inequalities holds for all h ∈ H:

E_{(x,y)∼D}[L(h(x), y)] ≤ (1/m) ∑_{i=1}^{m} L(h(x_i), y_i) + 2µ R_m(H) + M √( log(1/δ) / (2m) ),

E_{(x,y)∼D}[L(h(x), y)] ≤ (1/m) ∑_{i=1}^{m} L(h(x_i), y_i) + 2µ R̂(H) + 3M √( log(2/δ) / (2m) ).

Proof (Rademacher complexity regression bounds).
Since for any fixed y_i, L(·, y_i) is µ-Lipschitz for some µ > 0, by Talagrand's lemma we can write

R̂(G) = (1/m) E_σ[ sup_{h∈H} ∑_{i=1}^{m} σ_i L(h(x_i), y_i) ] ≤ (1/m) E_σ[ sup_{h∈H} ∑_{i=1}^{m} σ_i µ h(x_i) ] = µ R̂(H).

Combining this inequality with the general Rademacher complexity learning bound completes the proof.
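The quantity R̂(H) appearing in these bounds can be estimated numerically for a small finite hypothesis set by Monte Carlo sampling of the Rademacher variables σ. The sketch below is only an illustration of the definition; the hypothesis set and data are hypothetical.

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, seed=0):
    """Monte Carlo estimate of  R_hat(H) = E_sigma[ sup_h (1/m) sum_i sigma_i h(x_i) ]
    for a finite hypothesis set given as a (|H|, m) array of predictions on a fixed sample."""
    rng = np.random.default_rng(seed)
    n_h, m = preds.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))   # i.i.d. Rademacher variables
    correlations = sigma @ preds.T / m                    # shape (n_draws, |H|)
    return correlations.max(axis=1).mean()                # average over draws of sup over h

# Hypothetical example: 5 hypotheses evaluated on m = 50 sample points.
rng = np.random.default_rng(1)
preds = rng.normal(size=(5, 50))
print(empirical_rademacher(preds))
```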

6/35

SLIDE 10

Pseudo-dimension bounds

SLIDE 11

Shattering

◮ The VC dimension is a measure of the complexity of a hypothesis set.
◮ We now define shattering for families of real-valued functions.
◮ Let G be a family of loss functions associated to some hypothesis set H, where
  G = {z = (x, y) → L(h(x), y) | h ∈ H}.

Definition (Shattering)
Let G be a family of functions from a set Z to R. A set {z_1, . . . , z_m} ⊆ Z is said to be shattered by G if there exist t_1, . . . , t_m ∈ R such that

|{ (sgn(g(z_1) − t_1), sgn(g(z_2) − t_2), . . . , sgn(g(z_m) − t_m)) | g ∈ G }| = 2^m.

When they exist, the threshold values t_1, . . . , t_m are said to witness the shattering. In other words, S is shattered by G if there are real numbers t_1, . . . , t_m such that for every b ∈ {0, 1}^m there is a function g_b ∈ G with sgn(g_b(z_i) − t_i) = b_i for all 1 ≤ i ≤ m.
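For small finite families, the definition can be checked directly by enumerating the sign patterns realized at candidate witnesses. The brute-force sketch below does exactly that; the family of affine functions and the witnesses are hypothetical.

```python
import numpy as np

def is_shattered(values, thresholds):
    """Check whether a point set {z_1, ..., z_m} is shattered by a finite family G.
    `values` has shape (|G|, m) with values[g, i] = g(z_i); `thresholds` are the
    candidate witnesses t_1, ..., t_m.  All 2^m sign patterns must be realized."""
    signs = (values > np.asarray(thresholds)).astype(int)   # sgn(g(z_i) - t_i) as 0/1
    realized = {tuple(row) for row in signs}
    m = values.shape[1]
    return len(realized) == 2 ** m

# Hypothetical example: affine functions g(z) = a*z + b on two points z_1 = 0, z_2 = 1,
# which shatter {z_1, z_2} with witnesses t = (0, 0).
zs = np.array([0.0, 1.0])
G = [(a, b) for a in (-1.0, 1.0) for b in (-0.5, 0.5)]
values = np.array([[a * z + b for z in zs] for (a, b) in G])
print(is_shattered(values, thresholds=[0.0, 0.0]))  # True: all 4 sign patterns appear
```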

7/35

SLIDE 12

Shattering

◮ Thus, {z_1, . . . , z_m} is shattered if, for some witnesses t_1, . . . , t_m, the family of functions G is rich enough to contain a function going
  • 1. above a subset A of the set of points J = {(z_i, t_i) | 1 ≤ i ≤ m} and
  • 2. below the others J − A, for any choice of the subset A.

[Figure: two points z_1, z_2 with witness thresholds t_1, t_2.]

◮ For any g ∈ G, let B_g be the indicator function of the region below or on the graph of g, that is,

B_g(x, y) = sgn(g(x) − y).

◮ Let B_G = {B_g | g ∈ G}.

8/35

SLIDE 13

Pseudo-dimension

◮ The notion of shattering naturally leads to the definition of the pseudo-dimension.

Definition (Pseudo-dimension)
Let G be a family of functions from Z to R. Then the pseudo-dimension of G, denoted by Pdim(G), is the size of the largest set shattered by G. If no such maximum exists, then Pdim(G) = ∞.

◮ Pdim(G) coincides with the VC dimension of the corresponding thresholded functions mapping Z × R to {0, 1}:

Pdim(G) = VCdim({(z, t) → I[(g(z) − t) > 0] | g ∈ G}).

[Figure: a loss curve L(h(x), y) together with a threshold t and the thresholded indicator 1_{L(h(x),y)>t}.]

◮ Thus Pdim(G) = d if there are real numbers t_1, . . . , t_d and 2^d functions g_b that achieve all possible below/above combinations with respect to the t_i.

9/35

SLIDE 14

Properties of Pseudo-dimension

Theorem (Composition with a non-decreasing function)
Suppose G is a class of real-valued functions and σ : R → R is a non-decreasing function. Let σ(G) denote the class {σ ◦ g | g ∈ G}. Then Pdim(σ(G)) ≤ Pdim(G).

Proof (Composition with a non-decreasing function).
  • 1. For d ≤ Pdim(σ(G)), suppose {σ ◦ g_b | b ∈ {0, 1}^d} ⊆ σ(G) shatters a set {x_1, . . . , x_d}, witnessed by (t_1, . . . , t_d).
  • 2. By suitably relabeling the g_b, for all b ∈ {0, 1}^d and 1 ≤ i ≤ d, we have sgn(σ(g_b(x_i)) − t_i) = b_i.
  • 3. For all 1 ≤ i ≤ d, take y_i = min{ g_b(x_i) | σ(g_b(x_i)) ≥ t_i, b ∈ {0, 1}^d }.
  • 4. Since σ is non-decreasing, it is straightforward to verify that sgn(g_b(x_i) − y_i) = b_i for all b ∈ {0, 1}^d and 1 ≤ i ≤ d, so {x_1, . . . , x_d} is shattered by G with witnesses (y_1, . . . , y_d), and hence Pdim(σ(G)) ≤ Pdim(G).

10/35

SLIDE 15

Pseudo-dimension of vector spaces

◮ A class G of real-valued functions is a vector space if for all g_1, g_2 ∈ G and any numbers λ, µ ∈ R, we have λg_1 + µg_2 ∈ G.

Theorem (Pseudo-dimension of vector spaces)
If G is a vector space of real-valued functions, then Pdim(G) = dim(G).

Proof (Pseudo-dimension of vector spaces).
  • 1. Let B_G be the class of below-the-graph indicator functions; we have Pdim(G) = VCdim(B_G).
  • 2. But B_G = {(x, y) → sgn(g(x) − y) | g ∈ G}.
  • 3. Hence, the functions in B_G are of the form sgn(g_1 + g_2), where
      ◮ g_1 = g is a function from the vector space G,
      ◮ g_2 is the fixed function g_2(x, y) = −y.
  • 4. Then, the theorem on the VC dimension of thresholded vector spaces of real-valued functions shows that Pdim(G) = VCdim(B_G) = dim(G).

◮ Classes of functions that map into a bounded range are not vector spaces.

Corollary
If G is a subset of a vector space G′ of real-valued functions, then Pdim(G) ≤ dim(G′).

11/35

SLIDE 16

Pseudo-dimension of hyperplanes

Theorem (Pseudo-dimension of hyperplanes)
Let G = {x → ⟨w, x⟩ + b | w ∈ R^n, b ∈ R} be the class of hyperplanes in R^n. Then Pdim(G) = n + 1.

Proof (Pseudo-dimension of hyperplanes).
  • 1. It is easy to check that G is a vector space.
  • 2. Let g_i be the i-th coordinate projection, g_i(x) = x_i for all 1 ≤ i ≤ n, and let 1 be the constant-1 function. Then B = {g_1, . . . , g_n, 1} is a basis of G.
  • 3. Hence, Pdim(G) = dim(G) = n + 1.

12/35

SLIDE 17

Pseudo-dimension of polynomial transformation

◮ A polynomial transformation of R^n is a function g(x) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + . . . + w_k φ_k(x) for x ∈ R^n, where k is an integer and, for each 1 ≤ i ≤ k, the function φ_i is defined as

φ_i(x) = ∏_{j=1}^{n} x_j^{r_ij}

for some nonnegative integers r_ij. Let r_i = r_i1 + r_i2 + . . . + r_in and define the degree of g as r = max_i r_i.

Theorem (Pseudo-dimension of polynomial transformations)
If G is the class of all polynomial transformations on R^n of degree at most r, then Pdim(G) = C(n + r, r).
For instance, for n = 2 and r = 2, the monomials 1, x_1, x_2, x_1², x_1x_2, x_2² span G, giving Pdim(G) = C(4, 2) = 6.

Proof (Pseudo-dimension of polynomial transformations).
Homework: Prove this theorem.

Theorem (Pseudo-dimension of polynomial transformations on {0, 1}^n)
Let G be the class of all polynomial transformations on {0, 1}^n of degree at most r. Then Pdim(G) = ∑_{i=0}^{r} C(n, i).

Proof (Pseudo-dimension of polynomial transformations on {0, 1}^n).
Homework: Prove this theorem.

13/35

SLIDE 18

Generalization bound for bounded regression

Theorem (Generalization bound for bounded regression)
Let H be a family of real-valued functions and let G = {z = (x, y) → L(h(x), y) | h ∈ H} be the family of loss functions associated to H. Assume that Pdim(G) = d and that the loss function L is non-negative and bounded by M. Then, for any δ > 0, with probability at least (1 − δ) over the choice of an i.i.d. sample S of size m drawn according to D, the following inequality holds for all h ∈ H:

R(h) ≤ R̂(h) + M √( 2d log(em/d) / m ) + M √( log(1/δ) / (2m) ).

Proof (Generalization bound for bounded regression).
Homework: Prove this theorem.
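The sketch below evaluates the two deviation terms of this bound for hypothetical values of the pseudo-dimension d, the bound M, the sample size m, and δ, just to illustrate the orders of magnitude involved.

```python
import math

def pdim_bound_gap(M, d, m, delta):
    """Deviation terms  M*sqrt(2d*log(e*m/d)/m) + M*sqrt(log(1/delta)/(2m))  from the theorem above."""
    complexity = M * math.sqrt(2.0 * d * math.log(math.e * m / d) / m)
    confidence = M * math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return complexity + confidence

# Hypothetical values: Pdim(G) = 10, loss bounded by M = 1, m = 10000 samples, delta = 0.05.
print(pdim_bound_gap(M=1.0, d=10, m=10_000, delta=0.05))  # ≈ 0.14
```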

14/35

SLIDE 19

Regression algorithms

SLIDE 20

Linear regression

◮ Let Φ : X → R^n be a feature mapping and H = {h : x → ⟨w, Φ(x)⟩ + b | w ∈ R^n, b ∈ R}.
◮ Given the sample S, the problem is to find h ∈ H such that

h = argmin_{w,b} R̂(h) = argmin_{w,b} (1/m) ∑_{i=1}^{m} (⟨w, Φ(x_i)⟩ + b − y_i)².

◮ Define the data matrix (each column is a feature vector augmented with a 1 for the bias term)

X = [ Φ(x_1)  Φ(x_2)  . . .  Φ(x_m) ]
    [   1        1     . . .    1   ]

◮ Let w = (w_1, . . . , w_n, b)^T be the augmented weight vector and y = (y_1, . . . , y_m)^T the target vector.
◮ By setting ∇R̂(h) = 0, we obtain w = (XX^T)† X y.
◮ When XX^T is invertible, there is a unique solution; otherwise the problem admits a family of solutions.
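The closed-form solution can be computed directly with the pseudo-inverse. In the sketch below the samples are stored as rows rather than columns, so the augmented design matrix is the transpose of the X above and the solution reads w = X† y; the data and names are hypothetical.

```python
import numpy as np

def linear_regression_fit(Phi, y):
    """Solve  min_{w,b} (1/m) * sum_i (<w, Phi(x_i)> + b - y_i)^2  in closed form.
    Phi has shape (m, n): one feature vector per row.  Returns (w, b)."""
    m = Phi.shape[0]
    X = np.hstack([Phi, np.ones((m, 1))])   # append the constant-1 feature for the bias b
    W = np.linalg.pinv(X) @ y                # pseudo-inverse handles the non-invertible case
    return W[:-1], W[-1]

# Hypothetical usage on synthetic data.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 3))
y = Phi @ np.array([2.0, -1.0, 0.5]) + 0.3 + 0.05 * rng.normal(size=200)
w, b = linear_regression_fit(Phi, y)
print(w, b)
```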

15/35

SLIDE 21

Linear regression

Theorem
Let K : X × X → R be a PDS kernel, Φ : X → H a feature mapping associated to K, and H = {x → ⟨w, Φ(x)⟩ | ‖w‖_H ≤ Λ}. Assume that there exists r > 0 such that K(x, x) ≤ r² and M > 0 such that |h(x) − y| < M for all (x, y) ∈ X × Y. Then, for any δ > 0, with probability at least (1 − δ), each of the following inequalities holds for all h ∈ H:

R(h) ≤ R̂(h) + 4M √( r²Λ² / m ) + M² √( log(1/δ) / (2m) ),

R(h) ≤ R̂(h) + 4MΛ √(Tr[K]) / m + 3M² √( log(2/δ) / (2m) ).

Proof.
By the bound on the empirical Rademacher complexity of kernel-based hypotheses, the following holds for any sample S of size m:

R̂(H) ≤ Λ √(Tr[K]) / m ≤ √( r²Λ² / m ).

This implies that R_m(H) ≤ √( r²Λ² / m ). Since the squared loss is 2M-Lipschitz on the relevant range, combining these inequalities with the bounds of the theorem on Rademacher complexity regression bounds (with µ = 2M) proves the theorem.

16/35

SLIDE 22

Kernel ridge regression

◮ The preceding bound suggests minimizing a trade-off between the empirical squared loss and the norm of the weight vector:

R(h) ≤ R̂(h) + 4M √( r²Λ² / m ) + M² √( log(1/δ) / (2m) ).

◮ Kernel ridge regression is defined by the minimization of the following objective function (the form used for the theoretical analysis):

min_w F(w) = min_w [ λ‖w‖² + ∑_{i=1}^{m} (⟨w, Φ(x_i)⟩ − y_i)² ] = min_w [ λ‖w‖² + ‖Φ^T w − y‖² ].

◮ By setting ∇F(w) = 0, we obtain

w = (ΦΦ^T + λI)^{−1} Φ y.

◮ An alternative (constrained) formulation of kernel ridge regression is

min_w ‖Φ^T w − y‖²  subject to  ‖w‖² ≤ Λ²,

or equivalently

min_{w,ξ} ∑_{i=1}^{m} ξ_i²  subject to  (‖w‖² ≤ Λ²) ∧ (∀i ∈ {1, . . . , m}, ξ_i = y_i − ⟨w, Φ(x_i)⟩).
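In practice the solution is usually computed in its dual (kernelized) form: by a standard matrix identity, w = (ΦΦ^T + λI)^{−1} Φ y = Φ α with α = (K + λI)^{−1} y, where K = Φ^T Φ is the kernel matrix. The sketch below implements this dual form; the Gaussian kernel and all parameter values are illustrative choices, not part of the slides.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam=1.0, gamma=1.0):
    """Dual kernel ridge regression: alpha = (K + lam*I)^{-1} y."""
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_predict(alpha, X_train, X_test, gamma=1.0):
    """h(x) = sum_i alpha_i K(x_i, x)."""
    return gaussian_kernel(X_test, X_train, gamma) @ alpha

# Hypothetical usage on a one-dimensional toy problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
alpha = krr_fit(X, y, lam=0.1, gamma=0.5)
print(krr_predict(alpha, X, X[:5]))
```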

17/35

SLIDE 23

Support vector regression (SVR)

◮ The support vector regression (SVR) algorithm is inspired by the SVM algorithm.
◮ The main idea of SVR consists of fitting a tube of width ε > 0 to the data.

[Figure: the ε-tube around the regression function w·Φ(x) + b in feature space.]

◮ This defines two sets of points:
  • 1. points falling inside the tube, which are ε-close to the predicted function and thus not penalized,
  • 2. points falling outside the tube, which are penalized based on their distance to the predicted function.

◮ This is similar to the penalization used by SVMs in classification.
◮ We use a hypothesis set of linear functions H = {x → ⟨w, Φ(x)⟩ + b | w ∈ R^n, b ∈ R}, where Φ is the feature mapping corresponding to some PDS kernel K.
◮ The optimization problem for SVR is

min_{w,b} (λ/2)‖w‖² + C ∑_{i=1}^{m} |y_i − (⟨w, Φ(x_i)⟩ + b)|_ε,

where |·|_ε denotes the ε-insensitive loss:

∀y, y′ ∈ Y, |y′ − y|_ε = max(0, |y′ − y| − ε).

18/35

SLIDE 24
SLIDE 24

Support vector regression (SVR)

◮ The ε-insensitive loss is defined as

∀y, y′ ∈ Y, |y′ − y|_ε = max(0, |y′ − y| − ε).

◮ The use of the ε-insensitive loss leads to sparse solutions with a relatively small number of support vectors.
◮ Using slack variables ξ_i ≥ 0 and ξ′_i ≥ 0 for 1 ≤ i ≤ m, the problem becomes

min_{w,b,ξ,ξ′} (λ/2)‖w‖² + C ∑_{i=1}^{m} (ξ_i + ξ′_i)
subject to  (⟨w, Φ(x_i)⟩ + b) − y_i ≤ ε + ξ_i,
            y_i − (⟨w, Φ(x_i)⟩ + b) ≤ ε + ξ′_i,
            ξ_i ≥ 0, ξ′_i ≥ 0,  ∀i, 1 ≤ i ≤ m.

◮ This is a convex quadratic program (QP) with affine constraints.
◮ By introducing the Lagrangian and applying the KKT conditions, the problem can be solved.
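Instead of solving the QP through its Lagrangian, a simple way to see the objective in action is to minimize the equivalent unconstrained primal with a subgradient method for a linear kernel. The sketch below does this; the placement of λ and C, the step size, and the data are all illustrative assumptions rather than part of the slides.

```python
import numpy as np

def eps_insensitive(y_true, y_pred, eps):
    """|y' - y|_eps = max(0, |y' - y| - eps)."""
    return np.maximum(0.0, np.abs(y_pred - y_true) - eps)

def svr_subgradient(X, y, lam=1.0, C=1.0, eps=0.1, lr=1e-4, n_iter=10000):
    """Minimize  (lam/2)*||w||^2 + C * sum_i |y_i - (<w, x_i> + b)|_eps  by subgradient descent."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iter):
        err = X @ w + b - y
        # Subgradient of the eps-insensitive term: sign(err) where |err| > eps, else 0.
        g = np.where(np.abs(err) > eps, np.sign(err), 0.0)
        w -= lr * (lam * w + C * (X.T @ g))
        b -= lr * C * g.sum()
    return w, b

# Hypothetical usage on noisy linear data.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = X @ np.array([1.5, -0.5]) + 0.2 + 0.05 * rng.normal(size=300)
w, b = svr_subgradient(X, y, lam=0.1, C=1.0, eps=0.05)
print(w, b, eps_insensitive(y, X @ w + b, 0.05).mean())
```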

19/35

SLIDE 25

Support vector regression (SVR)

◮ Let D be the distribution according to which sample points are drawn.
◮ Let D̂ be the empirical distribution defined by a training sample of size m.

Theorem (Generalization bounds for SVR)
Let K : X × X → R be a PDS kernel, Φ : X → H a feature mapping associated to K, and H = {x → ⟨w, Φ(x)⟩ | ‖w‖_H ≤ Λ}. Assume that there exists r > 0 such that K(x, x) ≤ r² and M > 0 such that |h(x) − y| < M for all (x, y) ∈ X × Y. Then, for any δ > 0, with probability at least (1 − δ), each of the following inequalities holds for all h ∈ H:

E_{(x,y)∼D}[|h(x) − y|_ε] ≤ E_{(x,y)∼D̂}[|h(x) − y|_ε] + 2 √( r²Λ² / m ) + M √( log(1/δ) / (2m) ),

E_{(x,y)∼D}[|h(x) − y|_ε] ≤ E_{(x,y)∼D̂}[|h(x) − y|_ε] + 2Λ √(Tr[K]) / m + 3M √( log(2/δ) / (2m) ).

Proof (Generalization bounds for SVR).
Since for any y′ ∈ Y the function y → |y − y′|_ε is 1-Lipschitz, the result follows from the theorem on Rademacher complexity regression bounds and the bound on the empirical Rademacher complexity of H.

20/35

SLIDE 26

Support vector regression (SVR)

◮ Alternative convex loss functions can be used to define regression algorithms.
◮ SVR admits several advantages:
  • 1. The SVR algorithm is based on solid theoretical guarantees.
  • 2. The solution returned by SVR is sparse.
  • 3. SVR allows a natural use of PDS kernels.
  • 4. SVR also admits favorable stability properties.
◮ SVR also has several disadvantages:
  • 1. SVR requires the selection of two parameters, C and ε, which are usually determined by cross-validation.
  • 2. It may be computationally expensive when dealing with large training sets.

21/35

SLIDE 27

Least absolute shrinkage and selection operator (Lasso)

◮ The optimization problem for Lasso is defined as

min_{w,b} F(w, b) = min_{w,b} [ λ‖w‖₁ + C ∑_{i=1}^{m} (⟨w, x_i⟩ + b − y_i)² ].

◮ This is a convex optimization problem, because
  • 1. ‖w‖₁ is convex, as is every norm, and
  • 2. the empirical error term is convex.

◮ Hence, the optimization problem can also be written as

min_{w,b} ∑_{i=1}^{m} (⟨w, x_i⟩ + b − y_i)²  subject to  ‖w‖₁ ≤ Λ₁.

◮ The advantage of the L1-norm constraint is that it leads to a sparse solution w.

[Figure: the L1 and L2 regularization constraint regions, illustrating why the L1 ball induces sparse solutions.]
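A standard way to minimize the Lasso objective is proximal gradient descent (ISTA), which alternates a gradient step on the squared error with coordinate-wise soft-thresholding. The sketch below uses the equivalent form λ‖w‖₁ + ‖Xw − y‖² with the constant C absorbed into λ and the bias dropped (assume centered data); all names and values are illustrative.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau*||.||_1: shrink each coordinate toward 0 by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam=1.0, n_iter=500):
    """Proximal-gradient (ISTA) sketch for  min_w  lam*||w||_1 + ||X w - y||^2."""
    L = 2.0 * np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the gradient of ||Xw - y||^2
    t = 1.0 / L                               # step size
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)       # gradient of the smooth part
        w = soft_threshold(w - t * grad, t * lam)
    return w

# Hypothetical usage: a sparse ground-truth vector recovered from noisy measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[[2, 7]] = [3.0, -2.0]
y = X @ w_true + 0.1 * rng.normal(size=100)
print(np.round(lasso_ista(X, y, lam=5.0), 2))
```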

22/35

SLIDE 28

Least absolute shrinkage and selection operator (Lasso)

Theorem (Bound on R̂(H) for Lasso)
Let X ⊆ R^n and let S = {(x_1, y_1), . . . , (x_m, y_m)} ∈ (X × Y)^m be a sample of size m. Assume that ‖x_i‖_∞ ≤ r_∞ for all 1 ≤ i ≤ m for some r_∞ > 0, and let H = {x → ⟨w, x⟩ | ‖w‖₁ ≤ Λ₁}. Then the empirical Rademacher complexity of H can be bounded as follows:

R̂(H) ≤ √( 2 r_∞² Λ₁² log(2n) / m ).

Definition (Dual norms)
Let ‖·‖ be a norm on R^n. Then the dual norm ‖·‖_* associated to ‖·‖ is the norm defined by

∀y ∈ R^n, ‖y‖_* = sup_{‖x‖=1} |⟨y, x⟩|.

For any p, q ≥ 1 that are conjugate, that is, such that 1/p + 1/q = 1, the L_p and L_q norms are dual norms of each other. In particular, the dual norm of the L_2 norm is the L_2 norm, and the dual norm of the L_1 norm is the L_∞ norm.

23/35

SLIDE 29

Least absolute shrinkage and selection operator (Lasso)

Proof (Bound on R̂(H) for Lasso).
For any 1 ≤ i ≤ m, denote by x_ij the j-th component of x_i. Then

R̂(H) = (1/m) E_σ[ sup_{‖w‖₁≤Λ₁} ∑_{i=1}^{m} σ_i ⟨w, x_i⟩ ]
      = (Λ₁/m) E_σ[ ‖∑_{i=1}^{m} σ_i x_i‖_∞ ]                                  (by definition of the dual norm)
      = (Λ₁/m) E_σ[ max_{j∈{1,...,n}} |∑_{i=1}^{m} σ_i x_ij| ]                  (by definition of ‖·‖_∞)
      = (Λ₁/m) E_σ[ max_{j∈{1,...,n}} max_{s∈{−1,+1}} s ∑_{i=1}^{m} σ_i x_ij ]  (absolute value as a maximum over signs)
      = (Λ₁/m) E_σ[ sup_{z∈A} ∑_{i=1}^{m} σ_i z_i ],

where A denotes the set of 2n vectors {s(x_1j, . . . , x_mj) | j ∈ {1, . . . , n}, s ∈ {−1, +1}}. For any z ∈ A, we have ‖z‖₂ ≤ √(m r_∞²) = r_∞ √m. Thus, by Massart's lemma, since A contains at most 2n elements, the following inequality holds:

R̂(H) ≤ (Λ₁ r_∞ √m / m) √(2 log(2n)) = Λ₁ r_∞ √( 2 log(2n) / m ).

24/35

SLIDE 30

Least absolute shrinkage and selection operator (Lasso)

◮ The dependence of this bound on the dimension n is only logarithmic, which suggests that using very high-dimensional feature spaces does not significantly affect generalization.
◮ By combining the theorem bounding R̂(H) for Lasso with the Rademacher generalization bound, we obtain:

Theorem (Generalization bound for linear hypotheses with bounded L1 norm)
Let X ⊆ R^n and H = {x → ⟨w, x⟩ | ‖w‖₁ ≤ Λ₁}. Let S = {(x_1, y_1), . . . , (x_m, y_m)} ∈ (X × Y)^m be a sample of size m. Assume that there exists r_∞ > 0 such that ‖x‖_∞ ≤ r_∞ for all x ∈ X, and M > 0 such that |h(x) − y| ≤ M for all (x, y) ∈ X × Y. Then, for any δ > 0, with probability at least (1 − δ), the following inequality holds for all h ∈ H:

R(h) ≤ R̂(h) + 2 r_∞ Λ₁ M √( 2 log(2n) / m ) + M² √( log(1/δ) / (2m) ).

◮ The objectives of ridge regression and Lasso have the same form as the right-hand side of such generalization bounds: an empirical error term plus a norm-based penalty.
◮ Lasso has several advantages:
  • 1. It benefits from strong theoretical guarantees and returns a sparse solution.
  • 2. The sparsity of the solution is also computationally attractive (sparse inner products).
  • 3. The sparsity of the solution can also be used for feature selection.
◮ The main drawbacks are the difficulty of using kernels and the lack of a closed-form solution.

25/35

SLIDE 31

Online regression algorithms

◮ The regression algorithms presented so far admit natural online versions.
◮ These algorithms are useful when we have very large data sets, for which a batch solution can be computationally expensive.

Online linear regression
1: Initialize w_1.
2: for t ← 1, 2, . . . , T do
3:   Receive x_t ∈ R^n.
4:   Predict ŷ_t = ⟨w_t, x_t⟩.
5:   Observe the true label y_t = h*(x_t).
6:   Compute the loss L(ŷ_t, y_t).
7:   Update w_{t+1}.
8: end for

26/35

SLIDE 32

Widrow-Hoff algorithm

◮ The Widrow-Hoff algorithm applies the stochastic gradient descent technique to the linear regression objective function.
◮ At each round, the weight vector is updated by a quantity that depends on the prediction error (⟨w_t, x_t⟩ − y_t).

Widrow-Hoff regression
1: function WidrowHoff(w_0)
2:   Initialize w_1 ← w_0.  ⊲ typically w_0 = 0
3:   for t ← 1, 2, . . . , T do
4:     Receive x_t ∈ R^n.
5:     Predict ŷ_t = ⟨w_t, x_t⟩.
6:     Observe the true label y_t = h*(x_t).
7:     Compute the loss L(ŷ_t, y_t).
8:     Update w_{t+1} ← w_t − 2η(⟨w_t, x_t⟩ − y_t) x_t.  ⊲ learning rate η > 0
9:   end for
10:  return w_{T+1}
11: end function
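A direct translation of this pseudocode into a runnable sketch is given below; the data stream is simulated from a fixed comparator vector u, so the cumulative loss can be printed next to the right-hand side of the regret bound of the following slides. All parameter values are illustrative.

```python
import numpy as np

def widrow_hoff(X, y, eta=0.05, w0=None):
    """Widrow-Hoff (LMS):  w_{t+1} = w_t - 2*eta*(<w_t, x_t> - y_t) * x_t.
    Processes the rounds (x_t, y_t) in order; returns the final weights and
    the cumulative squared loss  L_WH = sum_t (yhat_t - y_t)^2."""
    w = np.zeros(X.shape[1]) if w0 is None else w0.copy()
    total_loss = 0.0
    for x_t, y_t in zip(X, y):
        y_hat = w @ x_t                            # predict
        total_loss += (y_hat - y_t) ** 2           # incur the squared loss
        w = w - 2.0 * eta * (y_hat - y_t) * x_t    # gradient step on (<w, x_t> - y_t)^2
    return w, total_loss

# Hypothetical online rounds generated by a fixed comparator u, with ||x_t||_2 <= 1.
rng = np.random.default_rng(0)
T, n = 2000, 5
X = rng.normal(size=(T, n))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
u = np.array([1.0, -0.5, 0.2, 0.0, 0.3])
y = X @ u + 0.05 * rng.normal(size=T)
eta = 0.05
w, L_WH = widrow_hoff(X, y, eta=eta)
L_u = np.sum((X @ u - y) ** 2)
print(L_WH, L_u / (1 - eta) + (u @ u) / eta)   # cumulative loss vs. the bound of the next slides
```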

27/35

SLIDE 33

Widrow-Hoff algorithm

◮ There are two motivations for the update rule of Widrow-Hoff.
◮ The first motivation:
  • 1. The loss function is defined as L(w, x, y) = (⟨w, x⟩ − y)².
  • 2. To minimize the loss function, move in the direction of the negative gradient: ∇_w L(w, x, y) = 2(⟨w, x⟩ − y) x.
  • 3. This gives the following update rule: w_{t+1} ← w_t − η ∇_w L(w_t, x_t, y_t).
◮ The second motivation is that we have two goals:
  • 1. We want the loss on (x_t, y_t) to be small, which means that we want to minimize (⟨w, x_t⟩ − y_t)².
  • 2. We do not want to move too far from w_t; that is, we do not want ‖w_{t+1} − w_t‖ to be too large.
◮ Combining these two goals, we compute w_{t+1} by solving a trade-off problem of the form

w_{t+1} = argmin_w [ η (⟨w, x_t⟩ − y_t)² + (1/2) ‖w − w_t‖² ].

◮ Taking the gradient with respect to w and setting it equal to zero, we obtain

w_{t+1} = w_t − 2η (⟨w_{t+1}, x_t⟩ − y_t) x_t.

◮ Approximating w_{t+1} by w_t on the right-hand side gives the update rule of the Widrow-Hoff algorithm.

28/35

SLIDE 34

Widrow-Hoff algorithm

◮ Let L_A = ∑_{t=1}^{T} (ŷ_t − y_t)² be the loss of an algorithm A and L_u = ∑_{t=1}^{T} (⟨u, x_t⟩ − y_t)² be the loss of a fixed vector u ∈ R^n.
◮ We upper bound the loss of the Widrow-Hoff algorithm in terms of the loss of the best vector.

Theorem (Upper bound on the loss of the Widrow-Hoff algorithm)
Assume that for all rounds t we have ‖x_t‖₂² ≤ 1. Then

L_WH ≤ min_{u∈R^n} [ L_u / (1 − η) + ‖u‖₂² / η ],

where L_WH denotes the loss of the Widrow-Hoff algorithm.

◮ Before proving this theorem, we first prove the following lemma.

Lemma (Bound on the potential function of the Widrow-Hoff algorithm)
Let Φ_t = ‖w_t − u‖₂² be the potential function. Then

Φ_{t+1} − Φ_t ≤ −η l_t² + (η / (1 − η)) g_t²,

where l_t = (ŷ_t − y_t) = ⟨w_t, x_t⟩ − y_t and g_t = ⟨u, x_t⟩ − y_t, so that l_t² denotes the learner's loss at round t and g_t² is u's loss at round t.

29/35

SLIDE 35

Widrow-Hoff algorithm

Proof (Bound on the potential function of the Widrow-Hoff algorithm).
Let ∆_t = η(⟨w_t, x_t⟩ − y_t) x_t = η l_t x_t denote the update to the weight vector. Then we have

Φ_{t+1} − Φ_t = ‖w_{t+1} − u‖₂² − ‖w_t − u‖₂²
            = ‖w_t − u − ∆_t‖₂² − ‖w_t − u‖₂²
            = ‖w_t − u‖₂² − 2⟨w_t − u, ∆_t⟩ + ‖∆_t‖₂² − ‖w_t − u‖₂²
            = −2η l_t ⟨x_t, w_t − u⟩ + η² l_t² ‖x_t‖₂²
            ≤ −2η l_t (⟨w_t, x_t⟩ − ⟨u, x_t⟩) + η² l_t²                     (since ‖x_t‖₂² ≤ 1)
            = −2η l_t [ (⟨w_t, x_t⟩ − y_t) − (⟨u, x_t⟩ − y_t) ] + η² l_t²
            = −2η l_t (l_t − g_t) + η² l_t²
            = −2η l_t² + 2η l_t g_t + η² l_t²
            ≤ −2η l_t² + 2η ( l_t²(1 − η) + g_t²/(1 − η) ) / 2 + η² l_t²    (by AM-GM)
            = −η l_t² + (η / (1 − η)) g_t².

  • 1. The arithmetic mean-geometric mean inequality (AM-GM) states that, for any set of non-negative real numbers, the arithmetic mean of the set is greater than or equal to the geometric mean of the set.
  • 2. For reals a ≥ 0 and b ≥ 0, AM-GM gives √(ab) ≤ (a + b)/2; here we take a = l_t²(1 − η) and b = g_t²/(1 − η), so that l_t g_t ≤ √(ab) ≤ (a + b)/2.

30/35

SLIDE 36

Widrow-Hoff algorithm

Proof (Upper bound on the loss of the Widrow-Hoff algorithm).
  • 1. The telescoping sum gives ∑_{t=1}^{T} (Φ_{t+1} − Φ_t) = Φ_{T+1} − Φ_1.
  • 2. By setting w_1 = 0 and observing that Φ_{T+1} ≥ 0, we obtain −‖u‖₂² = −Φ_1 ≤ Φ_{T+1} − Φ_1.
  • 3. Hence, using the lemma, we have

−‖u‖₂² ≤ ∑_{t=1}^{T} (Φ_{t+1} − Φ_t) ≤ ∑_{t=1}^{T} [ −η l_t² + (η / (1 − η)) g_t² ] = −η L_WH + (η / (1 − η)) L_u.

  • 4. By simplifying this inequality, we obtain

L_WH ≤ L_u / (1 − η) + ‖u‖₂² / η.

  • 5. Since u was arbitrary, the above inequality must hold for the best vector, which proves the theorem.

31/35

SLIDE 37

Widrow-Hoff algorithm

◮ We can look at the average loss per time step:

L_WH / T ≤ min_u [ (1 / (1 − η)) L_u / T + ‖u‖₂² / (ηT) ].

◮ As T gets large, we have ‖u‖₂² / (ηT) → 0.
◮ If the step size η is very small, (1 / (1 − η)) L_u / T → L_u / T (show it), so the right-hand side approaches min_u L_u / T, which is the average loss of the best regression vector.
◮ This means that the Widrow-Hoff algorithm performs almost as well as the best regression vector as the number of rounds gets large.

32/35

SLIDE 38

Summary

SLIDE 39

Summary

◮ We studied the bounded regression problem.
◮ For unbounded regression, deriving uniform convergence bounds is the main difficulty.
◮ We defined the pseudo-dimension for real-valued function classes.
◮ We studied generalization bounds based on the Rademacher complexity.
◮ We studied several regression algorithms and analyzed their bounds.
◮ We studied an online regression algorithm and analyzed its bound.

33/35

SLIDE 40

Readings

  • 1. Chapter 11 of Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Second Edition. MIT Press, 2018.
  • 2. Chapter 11 of Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

34/35

SLIDE 41

References

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Second Edition. MIT Press, 2018.

35/35

SLIDE 42

Questions?

35/35