Learning may work
Matthieu R. Bloch

ECE 6254 - Spring 2020 - Lecture 3 v1.2 - revised March 21, 2020

Now that we have introduced a complete model for supervised learning, our objective is to show that some of the questions raised earlier have a chance of being answered. We proceed by analyzing a simplified model, which still captures the essence of the problem but is more easily amenable to analysis. We will talk about the more general setting later in the semester. We consider the supervised learning model that consists of the following.

1. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$
   • $\{x_i\}_{i=1}^N$ drawn i.i.d. from an unknown probability distribution $P_x$ on $\mathcal{X}$;
   • $\{y_i\}_{i=1}^N$ with $\mathcal{Y} = \{0, 1\}$ (binary classification).
2. An unknown $f : \mathcal{X} \to \mathcal{Y}$, no noise.
3. A finite set of hypotheses $\mathcal{H}$, $|\mathcal{H}| = M < \infty$, denoted $\mathcal{H} \triangleq \{h_i\}_{i=1}^M$.
4. A binary loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+ : (y_1, y_2) \mapsto \mathbb{1}\{y_1 \neq y_2\}$.

(Note that we do not specify a specific algorithm yet, as we will be focusing on a more abstract learning operation.)

For this model and any hypothesis $h \in \mathcal{H}$, the true risk simplifies as

$$R(h) \triangleq \mathbb{E}_{xy}\left(\mathbb{1}\{h(x) \neq y\}\right) = \sum_{x} \sum_{y} p_{x,y}(x, y)\, \mathbb{1}\{h(x) \neq y\} = \mathbb{P}_{xy}\left(h(x) \neq y\right), \tag{1}$$

and the empirical risk becomes

$$\widehat{R}_N(h) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{h(x_i) \neq y_i\}. \tag{2}$$

We will discuss this in more detail later, but it is very natural for learning algorithms to attempt to minimize the empirical risk and look for a hypothesis $h^*$ that ensures a minimal risk

$$h^* = \operatorname*{argmin}_{h \in \mathcal{H}} \widehat{R}_N(h). \tag{3}$$

1 Sample complexity

The first question we raised was the possibility of generalizing a hypothesis. Mathematically, for a specific hypothesis $h_j \in \mathcal{H}$, this means assessing how $\widehat{R}_N(h_j)$ compares to $R(h_j)$. Observe that the empirical risk in (2) is a random variable since it is a function of the data set, which is a random variable. More specifically, since every $x_i$ is generated independent and identically distributed (i.i.d.), the empirical risk is actually the sample average of $N$ i.i.d. variables $\mathbb{1}\{h_j(x_i) \neq y_i\}$. In addition, observe that

$$\mathbb{E}\left(\widehat{R}_N(h_j)\right) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\left(\mathbb{1}\{h_j(x_i) \neq y_i\}\right) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{P}_{x,y}\left(h_j(x) \neq y\right) = R(h_j). \tag{4}$$
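The claim that the empirical risk is a sample average whose expectation equals the true risk can be checked numerically. The sketch below is only an illustration under assumptions that are not in the notes: a toy domain {0, ..., 9} with a uniform P_x, a hypothetical target `f`, and a hypothetical hypothesis `h` that disagrees with `f` on two of the ten points, so that the true risk is 0.2.

```python
import random

# Toy setting (all choices below are illustrative assumptions):
# domain {0,...,9}, P_x uniform, noiseless labels y = f(x).
DOMAIN = range(10)

def f(x):
    """Hypothetical unknown target function."""
    return 1 if x >= 5 else 0

def h(x):
    """Hypothetical fixed hypothesis; it disagrees with f on x = 5 and x = 6."""
    return 1 if x >= 7 else 0

def true_risk(hyp, target):
    """R(h) = P(h(x) != y) under the uniform distribution on DOMAIN."""
    return sum(hyp(x) != target(x) for x in DOMAIN) / len(DOMAIN)

def empirical_risk(hyp, dataset):
    """Fraction of the dataset misclassified by hyp, i.e. the binary-loss average."""
    return sum(hyp(x) != y for x, y in dataset) / len(dataset)

def draw_dataset(n, rng):
    """Draw n i.i.d. points from the uniform P_x and label them with f."""
    return [(x, f(x)) for x in (rng.randrange(10) for _ in range(n))]

rng = random.Random(0)
# Each dataset gives one realization of the random variable "empirical risk".
risks = [empirical_risk(h, draw_dataset(50, rng)) for _ in range(2000)]
mean_empirical_risk = sum(risks) / len(risks)

print("true risk:", true_risk(h, f))
print("average empirical risk over 2000 datasets:", mean_empirical_risk)
```

Individual values in `risks` fluctuate from dataset to dataset, while their average should sit close to the true risk, which is what the expectation computation states.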

Therefore, the quantity $\mathbb{P}\left(\left|\widehat{R}_N(h_j) - R(h_j)\right| > \epsilon\right)$ is the probability that the sample average of i.i.d. random variables differs from its mean by more than $\epsilon$. Such bounds are extremely common in applied probability and are known as concentration inequalities. We will now review some of the fundamental ideas behind these bounds.

The start of most if not all concentration inequalities is Markov's lemma.

Lemma 1.1. Let $X$ be a non-negative real-valued random variable. Then for all $t > 0$

$$\mathbb{P}(X \geq t) \leq \frac{\mathbb{E}(X)}{t}. \tag{5}$$

Proof. For $t > 0$, let $\mathbb{1}\{X \geq t\}$ be the indicator function of the event $\{X \geq t\}$. Then,

$$\mathbb{E}[X] \geq \mathbb{E}\left[X \mathbb{1}\{X \geq t\}\right] \geq t\, \mathbb{P}[X \geq t], \tag{6}$$

where the first inequality follows because the indicator function is $\{0, 1\}$-valued and $X$ is non-negative; the second because $X \geq t$ whenever $\mathbb{1}\{X \geq t\} = 1$ and $0$ else.

That was a clean and fast proof, but you may be more comfortable going back to the definition of $\mathbb{E}(X)$ to prove the result. Note that

$$\mathbb{E}(X) = \int_0^\infty x\, p_X(x)\, dx = \underbrace{\int_0^t x\, p_X(x)\, dx}_{\geq 0} + \int_t^\infty x\, p_X(x)\, dx \overset{(a)}{\geq} t \int_t^\infty p_X(x)\, dx = t\, \mathbb{P}(X \geq t), \tag{7}$$

where $(a)$ follows from the fact that $x \geq t$ in the second integral. Note that the non-negative nature of $X$ is crucial to lower bound the first integral. ■

By choosing $t = \epsilon\, \mathbb{E}(X)$ for $\epsilon > 0$ in (5), we obtain

$$\mathbb{P}\left(X \geq \epsilon\, \mathbb{E}(X)\right) \leq \frac{1}{\epsilon}, \tag{8}$$

which is consistent with the intuition that it is unlikely that a random variable takes a value very far away from its mean.

In spite of its relative simplicity, Markov's inequality is a powerful tool because it can be "boosted." For $X \in \mathcal{X} \subset \mathbb{R}$, consider $\phi : \mathcal{X} \to \mathbb{R}^+$ non-decreasing on $\mathcal{X}$ such that $\mathbb{E}(|\phi(X)|) < \infty$. Then,

$$\mathbb{P}[X \geq t] = \mathbb{E}\left[\mathbb{1}\{X \geq t\}\right] = \mathbb{E}\left[\mathbb{1}\{X \geq t\}\, \mathbb{1}\{\phi(X) \geq \phi(t)\}\right] \leq \mathbb{P}[\phi(X) \geq \phi(t)], \tag{9}$$

where we have used the definition of $\phi$ and the fact that an indicator function is upper bounded by one. Applying Markov's inequality we obtain

$$\mathbb{P}[X \geq t] \leq \frac{\mathbb{E}[\phi(X)]}{\phi(t)}, \tag{10}$$

which is potentially a better bound than (5). Of course, the difficulty is in choosing the appropriate function $\phi$ to make the result meaningful. The most well-known application of this concept leads to Chebyshev's inequality.

Lemma 1.2 (Chebyshev's inequality). Let $X \in \mathbb{R}$. Then,

$$\mathbb{P}\left[|X - \mathbb{E}(X)| \geq t\right] \leq \frac{\mathrm{Var}(X)}{t^2}. \tag{11}$$
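As a sanity check on the statements of Lemma 1.1 and Lemma 1.2, both sides of each inequality can be computed exactly for a small discrete distribution. The sketch below assumes a fair six-sided die, which is an illustrative choice and not part of the notes; the helper names `expectation` and `prob` are likewise hypothetical. Exact rational arithmetic avoids any floating-point ambiguity.

```python
from fractions import Fraction

# Illustrative assumption: X is the outcome of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a finite probability mass function."""
    return sum(p * g(x) for x, p in pmf.items())

def prob(pmf, event):
    """P(event(X)) for a finite probability mass function."""
    return sum(p for x, p in pmf.items() if event(x))

mean = expectation(pmf)                            # E(X) = 7/2
var = expectation(pmf, lambda x: (x - mean) ** 2)  # Var(X) = 35/12

# Markov's inequality: P(X >= t) <= E(X) / t, here with t = 5.
t = 5
markov_lhs = prob(pmf, lambda x: x >= t)
markov_rhs = mean / t

# Chebyshev's inequality: P(|X - E(X)| >= s) <= Var(X) / s^2, here with s = 2.
s = 2
cheb_lhs = prob(pmf, lambda x: abs(x - mean) >= s)
cheb_rhs = var / s**2

print(f"Markov:    {markov_lhs} <= {markov_rhs}")
print(f"Chebyshev: {cheb_lhs} <= {cheb_rhs}")
```

In this example both bounds hold with room to spare (1/3 against 7/10 and 1/3 against 35/48), which illustrates that these inequalities are general-purpose rather than tight.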

Proof. Define $Y \triangleq |X - \mathbb{E}(X)|$ and $\phi : \mathbb{R}^+ \to \mathbb{R}^+ : t \mapsto t^2$. Then, by the boosted Markov's inequality we obtain

$$\mathbb{P}\left[|X - \mathbb{E}(X)| \geq t\right] = \mathbb{P}[Y \geq t] \leq \frac{\mathbb{E}[Y^2]}{t^2} = \frac{\mathrm{Var}(X)}{t^2}. \tag{12}$$

■

As an application of Chebyshev's inequality, we derive the weak law of large numbers.

Lemma 1.3 (Weak law of large numbers). Let $X_i \sim p_{X_i}$ be independent with $\mathbb{E}[|X_i|] < \infty$ and $\mathrm{Var}(X_i) < \sigma^2$ for some $\sigma^2 \in \mathbb{R}^+$. Define $Z = \frac{1}{N} \sum_{i=1}^{N} X_i$ for $N \in \mathbb{N}^*$. Then $Z$ converges in probability to $\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}(X_i)$.

Proof. First observe that

$$\mathbb{E}[Z] = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}[X_i] \quad \text{and} \quad \mathrm{Var}(Z) = \frac{1}{N^2} \sum_{i=1}^{N} \mathrm{Var}(X_i). \tag{13}$$

Therefore,

$$\mathbb{P}\left(\left|\frac{1}{N} \sum_{i=1}^{N} X_i - \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}[X_i]\right| \geq \epsilon\right) = \mathbb{P}\left(\left(\frac{1}{N} \sum_{i=1}^{N} X_i - \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}[X_i]\right)^2 \geq \epsilon^2\right) \tag{14}$$

$$\leq \frac{1}{N^2 \epsilon^2} \sum_{i=1}^{N} \mathrm{Var}(X_i) < \frac{\sigma^2}{N \epsilon^2}. \tag{15}$$

■

The weak law of large numbers is essentially stating that $\frac{1}{N} \sum_{i=1}^{N} X_i$ concentrates around its average. Note, however, that the convergence we proved in (15) is rather slow, on the order of $1/N$.

Let us now go back to our learning problem. Applying (15), we know that

$$\forall \epsilon > 0 \quad \mathbb{P}_{\{(x_i, y_i)\}}\left(\left|\widehat{R}_N(h_j) - R(h_j)\right| \geq \epsilon\right) \leq \frac{\mathrm{Var}\left(\mathbb{1}\{h_j(x_1) \neq y_1\}\right)}{N \epsilon^2} \leq \frac{1}{N \epsilon^2}, \tag{16}$$

where the last inequality comes from the observation that $\mathrm{Var}\left(\mathbb{1}\{h_j(x_1) \neq y_1\}\right) \leq 1$ since the indicator function is a $\{0, 1\}$-valued function. Notice that the bound that we obtain is universal in that it does not depend on $P_x$ anymore. This is particularly pleasing because we introduced $P_x$ in a rather arbitrary way.

We can now compute the sample complexity for generalizing $h_j$, defined as the number of samples $N_{\epsilon,\delta}$ required to achieve $\left|\widehat{R}_N(h_j) - R(h_j)\right| \leq \epsilon$ with probability at least $1 - \delta$. From (16), note that we obtain

$$N_{\epsilon,\delta} \geq \frac{1}{\delta \epsilon^2}. \tag{17}$$

The sample complexity behavior with $\delta$ and $\epsilon$ is consistent with our intuition: the more precise we want the empirical risk to be, the more samples we need.
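To make the scaling of the sample complexity concrete, the following sketch computes the smallest integer N satisfying 1/(N ε²) ≤ δ and then simulates how often the empirical risk actually deviates by at least ε at that N. The function names, the true risk of 0.2, and the number of trials are illustrative assumptions, not part of the notes. Tolerances are passed as strings so that the arithmetic is exact.

```python
import math
import random
from fractions import Fraction

def sample_complexity(eps, delta):
    """Smallest integer N with 1 / (N * eps**2) <= delta (Chebyshev-based bound).

    eps and delta should be given as strings or Fractions (e.g. "0.1") so that
    the computation is exact rather than subject to float rounding.
    """
    eps, delta = Fraction(eps), Fraction(delta)
    if eps <= 0 or not (0 < delta <= 1):
        raise ValueError("need eps > 0 and 0 < delta <= 1")
    return math.ceil(1 / (delta * eps**2))

def deviation_frequency(risk, n, eps, trials, rng):
    """Fraction of simulated datasets whose empirical risk is at least eps
    away from the true risk. Each sample is misclassified independently with
    probability `risk` (an assumed value for the hypothesis under study)."""
    bad = 0
    for _ in range(trials):
        errors = sum(rng.random() < risk for _ in range(n))
        if abs(errors / n - risk) >= eps:
            bad += 1
    return bad / trials

n = sample_complexity("0.1", "0.05")
freq = deviation_frequency(0.2, n, 0.1, trials=200, rng=random.Random(0))

print("N for eps=0.1,  delta=0.05:", n)
print("N for eps=0.05, delta=0.05:", sample_complexity("0.05", "0.05"))
print("observed deviation frequency at that N:", freq)
```

Halving ε multiplies the required N by four, while halving δ only doubles it. In a simulation like this one the observed deviation frequency is typically far below δ, which suggests that the Chebyshev-based bound is conservative.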
