SLIDE 1 The Fundamental Theorem
Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht
SLIDE 2
PAC Learnability
We have seen that H is
◮ PAC learnable if H is finite
◮ not PAC learnable if VC(H) = ∞
Today we will characterize exactly what it takes to be PAC learnable:

H is PAC learnable if and only if VC(H) is finite

This is known as the fundamental theorem. Moreover, we will provide bounds
◮ on sample complexity
◮ and error
for hypothesis classes of finite VC dimension
◮ also known as classes of small effective size.
SLIDE 3
By Bad Samples
We have already seen a few such proofs
◮ proving that finite hypothesis classes are PAC learnable
They all share the same main idea
◮ prove that the probability of getting a ‘bad’ sample is small
Not surprisingly, that is what we’ll do again. But first we’ll discuss (and prove) a technical detail which we’ll need in our proof
◮ Jensen’s inequality
SLIDE 4 Convex Functions
Jensen’s inequality – in as far as we need it – is about expectations and convex functions. So we first recall what a convex function is.
A function f : Rⁿ → R is convex iff
◮ for all x1, x2 ∈ Rⁿ and λ ∈ [0, 1]
◮ we have that f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2)
When n = 1, i.e., f : R → R, this means that if we draw the graph of f and choose two points on that graph, the line that connects these two points is always above the graph of f.
SLIDE 5
Convex Examples
With the intuition given it is easy to see that, e.g.,
◮ x → |x|,
◮ x → x² and
◮ x → eˣ
are convex functions; with a little high school math, you can, of course, also prove this.
If you draw the graph of x → √x or x → log x,
◮ you’ll see that if you connect two points by a line, this line is always under the graph
Functions for which f(λx1 + (1 − λ)x2) ≥ λf(x1) + (1 − λ)f(x2) are known as concave functions.
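The defining inequality can also be checked numerically for the examples above; a minimal sketch (the grid, the λ values, and the tolerance are arbitrary choices):

```python
import math

def is_convex_on_grid(f, xs, lambdas):
    """Check f(λx1 + (1-λ)x2) <= λf(x1) + (1-λ)f(x2) on a grid of points."""
    for x1 in xs:
        for x2 in xs:
            for lam in lambdas:
                mix = lam * x1 + (1 - lam) * x2
                if f(mix) > lam * f(x1) + (1 - lam) * f(x2) + 1e-12:
                    return False
    return True

xs = [i / 10 for i in range(-30, 31)]    # grid on [-3, 3]
lambdas = [i / 10 for i in range(11)]    # λ ∈ {0, 0.1, ..., 1}

print(is_convex_on_grid(abs, xs, lambdas))                # |x|  -> True
print(is_convex_on_grid(lambda x: x * x, xs, lambdas))    # x^2  -> True
print(is_convex_on_grid(math.exp, xs, lambdas))           # e^x  -> True
print(is_convex_on_grid(lambda x: -x * x, xs, lambdas))   # concave -> False
```

Of course a grid check is no proof; it merely illustrates the definition.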
SLIDE 6 Larger Sums
If we have λ1, . . . , λm ∈ [0, 1] with Σ_{i=1}^m λi = 1, natural induction proves that for x1, . . . , xm we have

f(Σ_{i=1}^m λi xi) ≤ Σ_{i=1}^m λi f(xi)

For the step from n to n + 1: at least one of the λi > 0, say λ1, and we may assume λ1 < 1 (otherwise only one term remains). Then we have

f(Σ_{i=1}^{n+1} λi xi) = f(λ1 x1 + (1 − λ1) Σ_{i=2}^{n+1} (λi / (1 − λ1)) xi)
≤ λ1 f(x1) + (1 − λ1) f(Σ_{i=2}^{n+1} (λi / (1 − λ1)) xi)
≤ λ1 f(x1) + (1 − λ1) Σ_{i=2}^{n+1} (λi / (1 − λ1)) f(xi)
= Σ_{i=1}^{n+1} λi f(xi)
SLIDE 7 Jensen’s Inequality
A special case of the previous result is when all the λi = 1/m; then we have:

f((1/m) Σ_{i=1}^m xi) ≤ (1/m) Σ_{i=1}^m f(xi)

That is, the value of f at the average of the xi is at most the average of the f(xi). The average is an example of an expectation. Jensen’s inequality tells us that the above inequality holds for expectations in general, i.e., for a convex f we have

f(E(X)) ≤ E(f(X))

We already saw that x → |x| is a convex function.
◮ the same is true for taking the supremum
This follows from the fact that taking the supremum is monotone: A ⊆ B ⇒ sup(A) ≤ sup(B).
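A quick numeric illustration of Jensen’s inequality for a small discrete distribution (the values and probabilities below are arbitrary choices):

```python
import math

# A small discrete distribution: values with probabilities summing to 1.
values = [-2.0, -0.5, 1.0, 3.0]
probs = [0.1, 0.4, 0.3, 0.2]

def expect(f):
    """E[f(X)] for the discrete distribution above."""
    return sum(p * f(v) for p, v in zip(probs, values))

mean = expect(lambda x: x)          # E[X] = 0.5
for f in (abs, math.exp, lambda x: x * x):
    print(f(mean) <= expect(f))     # f(E[X]) <= E[f(X)]: True for convex f
```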
SLIDE 8
Proof by Uniform Convergence
To prove the fundamental theorem, we prove that classes of small effective size have the uniform convergence property
◮ which is sufficient, as we have seen that classes with the uniform convergence property are agnostically PAC learnable
Recall: a hypothesis class H has the uniform convergence property wrt domain Z and loss function l if
◮ there exists a function m_H^{UC} : (0, 1)² → N
◮ such that for all (ǫ, δ) ∈ (0, 1)²
◮ and for any probability distribution 𝒟 on Z
if D is an i.i.d. sample according to 𝒟 over Z of size m ≥ m_H^{UC}(ǫ, δ), then D is ǫ-representative with probability of at least 1 − δ.
SLIDE 9 To Prove Uniform Convergence
Now recall that a sample D is ǫ-representative wrt Z, H, l and 𝒟 if

∀h ∈ H : |L_𝒟(h) − L_D(h)| ≤ ǫ

Hence, we have to devise a bound on |L_𝒟(h) − L_D(h)| that is small for almost all D ∼ 𝒟^m. Markov’s inequality (lecture 2) tells us that

P(X ≥ a) ≤ E(X)/a

So, one way to prove uniform convergence is by considering E_{D∼𝒟^m} |L_𝒟(h) − L_D(h)|. Or, more precisely, since it should be small for all h ∈ H:

E_{D∼𝒟^m} sup_{h∈H} |L_𝒟(h) − L_D(h)|

We take the supremum because H may be infinite and a maximum doesn’t have to exist.
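Markov’s inequality itself is easy to check empirically; a minimal sketch (the choice of X = U² for a nonnegative random variable is arbitrary):

```python
import random

random.seed(0)
# A nonnegative random variable: X = U^2 with U uniform on [0, 2], E[X] = 4/3.
samples = [random.uniform(0, 2) ** 2 for _ in range(100_000)]
mean = sum(samples) / len(samples)

for a in (1.0, 2.0, 3.0):
    tail = sum(x >= a for x in samples) / len(samples)   # estimate of P(X >= a)
    print(tail <= mean / a)   # Markov's bound P(X >= a) <= E(X)/a holds
```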
SLIDE 10 The First Step
The first step to derive a bound on E_{D∼𝒟^m} sup_{h∈H} |L_𝒟(h) − L_D(h)| is to recall that L_𝒟(h) is itself defined as the expectation of the loss on a sample, i.e.,

L_𝒟(h) = E_{D′∼𝒟^m}(L_{D′}(h))

So, we want to derive a bound on

E_{D∼𝒟^m} sup_{h∈H} |E_{D′∼𝒟^m}(L_D(h) − L_{D′}(h))|

We can manipulate this expression further using Jensen’s inequality.
SLIDE 11 By Jensen
By Jensen’s inequality we firstly have:

|E_{D′∼𝒟^m}(L_D(h) − L_{D′}(h))| ≤ E_{D′∼𝒟^m} |L_D(h) − L_{D′}(h)|

And secondly we have:

sup_{h∈H} (E_{D′∼𝒟^m} |L_D(h) − L_{D′}(h)|) ≤ E_{D′∼𝒟^m} sup_{h∈H} |L_D(h) − L_{D′}(h)|

Plugging in then gives us:

sup_{h∈H} |E_{D′∼𝒟^m}(L_D(h) − L_{D′}(h))| ≤ E_{D′∼𝒟^m} sup_{h∈H} |L_D(h) − L_{D′}(h)|

Using this in the result of the first step gives us the second step.
SLIDE 12 Second Step
Combining the result of the first step with the result on the previous page, we have:

E_{D∼𝒟^m} sup_{h∈H} |L_𝒟(h) − L_D(h)| ≤ E_{D,D′∼𝒟^m} sup_{h∈H} |L_D(h) − L_{D′}(h)|

By definition, the right hand side of this inequality can be rewritten to:

E_{D,D′∼𝒟^m} sup_{h∈H} (1/m) |Σ_{i=1}^m (l(h, zi) − l(h, z′i))|

where zi ∈ D, z′i ∈ D′, and both D and D′ are i.i.d. samples of size m sampled according to the distribution 𝒟.
SLIDE 13 An Observation
Both D and D′ are i.i.d. samples of size m
◮ it could be that the D and D′ we draw today
◮ are the D′ and D we drew yesterday
that is
◮ a zi of today was a z′i yesterday
◮ a z′i of today was a zi yesterday
If we have this – admittedly highly improbable – coincidence
◮ a term (l(h, zi) − l(h, z′i)) of today
◮ was −(l(h, zi) − l(h, z′i)) yesterday because of the switch
◮ and the expectation doesn’t change!
This is true whether we switch 1, 2, or all elements of D and D′. That is, for every σ ∈ {−1, 1}^m:

E_{D,D′∼𝒟^m} sup_{h∈H} (1/m) |Σ_{i=1}^m (l(h, zi) − l(h, z′i))| = E_{D,D′∼𝒟^m} sup_{h∈H} (1/m) |Σ_{i=1}^m σi (l(h, zi) − l(h, z′i))|
SLIDE 14 Observing Further
Since this equality holds for any σ ∈ {−1, 1}^m, it also holds if we sample a vector from {−1, 1}^m. So, also if we sample each −1/+1 entry in the vector at random under the uniform distribution, denoted by U_±^m. That is,

E_{D,D′∼𝒟^m} sup_{h∈H} (1/m) |Σ_{i=1}^m (l(h, zi) − l(h, z′i))| = E_{σ∼U_±^m} E_{D,D′∼𝒟^m} sup_{h∈H} (1/m) |Σ_{i=1}^m σi (l(h, zi) − l(h, z′i))|

And since E is a linear operation, this equals

E_{D,D′∼𝒟^m} E_{σ∼U_±^m} sup_{h∈H} (1/m) |Σ_{i=1}^m σi (l(h, zi) − l(h, z′i))|
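The sign-flip invariance can be verified exactly on a toy example by enumerating all samples; a sketch where the domain, losses, and m = 2 are my own arbitrary choices:

```python
from itertools import product

# Tiny exact computation: domain Z = {0, 1}, m = 2, two hypotheses.
# loss[h][z] is the loss of hypothesis h on point z (values in [0, 1]).
p = {0: 0.3, 1: 0.7}
loss = [{0: 0.0, 1: 1.0},   # hypothesis 0
        {0: 1.0, 1: 0.2}]   # hypothesis 1
m = 2

def expected_sup(sigma):
    """E over D, D' of sup_h (1/m)|sum_i sigma_i (l(h,z_i) - l(h,z'_i))|, exactly."""
    total = 0.0
    for zs in product(p, repeat=m):          # enumerate all samples D
        for zps in product(p, repeat=m):     # enumerate all samples D'
            weight = 1.0
            for z in zs + zps:
                weight *= p[z]
            sup = max(abs(sum(s * (lh[z] - lh[zp])
                              for s, z, zp in zip(sigma, zs, zps))) / m
                      for lh in loss)
            total += weight * sup
    return total

base = expected_sup((1, 1))
for sigma in product((-1, 1), repeat=m):
    print(abs(expected_sup(sigma) - base) < 1e-12)   # True for every sigma
```

Flipping σi merely swaps the roles of zi and z′i in the enumeration, so the totals agree.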
SLIDE 15 From Infinite to Finite
In computing the inner expectation of

E_{D,D′∼𝒟^m} E_{σ∼U_±^m} sup_{h∈H} (1/m) |Σ_{i=1}^m σi (l(h, zi) − l(h, z′i))|

both D and D′ are fixed; they only vary for the outer expectation computation
◮ just like nested loops
So, if we denote C = D ∪ D′, then we do not range over the (possibly) infinite set H, but just over the finite set H_C. That is,

E_{σ∼U_±^m} sup_{h∈H} (1/m) |Σ_{i=1}^m σi (l(h, zi) − l(h, z′i))| = E_{σ∼U_±^m} max_{h∈H_C} (1/m) |Σ_{i=1}^m σi (l(h, zi) − l(h, z′i))|
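A small illustration of why H_C is finite even when H is not, using threshold classifiers as an assumed example class (not one from the slides):

```python
# Threshold classifiers h_t(x) = 1 iff x >= t form an infinite class H,
# but restricted to a finite C = D ∪ D' only |C| + 1 behaviours survive.
C = [0.5, 1.3, 2.7, 4.0, 8.1]

# One representative threshold below all points, between each pair, above all.
thresholds = [min(C) - 1] + [(a + b) / 2 for a, b in zip(C, C[1:])] + [max(C) + 1]

H_C = {tuple(int(x >= t) for x in C) for t in thresholds}
print(len(H_C))   # 6 = |C| + 1, far fewer than the 2**5 = 32 possible labelings
```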
SLIDE 16 Step 3
For h ∈ H_C define the random variable θh by

θh = (1/m) Σ_{i=1}^m σi (l(h, zi) − l(h, z′i))

Now note that
◮ E(θh) = 0
◮ θh is the average of independent variables, taking values in [−1, 1]
Hence, we can apply Hoeffding’s inequality: ∀ρ > 0

P(|θh| > ρ) ≤ 2e^{−2mρ²}

Applying the union bound we have:

P(∃h ∈ H_C : |θh| > ρ) ≤ 2|H_C| e^{−2mρ²}

Which implies that:

P(max_{h∈H_C} |θh| > ρ) ≤ 2|H_C| e^{−2mρ²}
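A small simulation of the union-bound step; all parameters are arbitrary choices of mine, and the loss differences are scaled to [−1/2, 1/2], for which the per-hypothesis tail bound 2e^{−2mρ²} certainly applies:

```python
import math
import random

random.seed(1)
m, trials, rho = 200, 5_000, 0.08

# Fixed loss-difference patterns for 5 "hypotheses", values in [-1/2, 1/2].
hyps = [[random.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(5)]

exceed = 0
for _ in range(trials):
    sigma = [random.choice((-1, 1)) for _ in range(m)]
    thetas = [abs(sum(s * c for s, c in zip(sigma, cs))) / m for cs in hyps]
    if max(thetas) > rho:            # event: some theta_h exceeds rho
        exceed += 1

bound = 2 * len(hyps) * math.exp(-2 * m * rho ** 2)
print(exceed / trials <= bound)      # the union bound holds empirically
```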
SLIDE 17 A Useful Lemma
We now have a bound on P(max_{h∈H_C} |θh| > ρ)
◮ but we need a bound on E(max_{h∈H_C} |θh|)
To make this step, there is a useful lemma. Let X be a random variable and x ∈ R. If
◮ there exist an a > 0 and b ≥ e such that
◮ ∀t ≥ 0 : P(|X − x| > t) ≤ 2b e^{−t²/a²}
then

E(|X − x|) ≤ a(4 + √(log b))

which can be proven by straightforward calculus (see Lemma A.4 in the book). Substituting ρ for t, 1/√(2m) for a, and |H_C| for b, we get a bound.
SLIDE 18 Step 4
The lemma on the previous page gives us that

P(max_{h∈H_C} |θh| > ρ) ≤ 2|H_C| e^{−2mρ²}

implies that

E(max_{h∈H_C} |θh|) ≤ (4 + √(log |H_C|)) / √(2m)

Now C has at most 2m distinct elements, and since τH(k) is the maximal size of |H_C| for a set C with k elements, we have:

E(max_{h∈H_C} |θh|) ≤ (4 + √(log τH(2m))) / √(2m)

Working our way back through this (long) computation we have:

E_{D∼𝒟^m} sup_{h∈H} |L_𝒟(h) − L_D(h)| ≤ (4 + √(log τH(2m))) / √(2m)
SLIDE 19 Step 5
Since sup_{h∈H} |L_𝒟(h) − L_D(h)| is obviously a non-negative random variable, we can now apply Markov’s inequality to get:

Let H be a hypothesis class. Then for any distribution 𝒟 and for every δ ∈ (0, 1), with a probability of at least 1 − δ over the choice of D ∼ 𝒟^m, we have for all h ∈ H:

|L_𝒟(h) − L_D(h)| ≤ (4 + √(log τH(2m))) / (δ√(2m))

To prove uniform convergence, we now have to show
◮ that there exists an m depending on ǫ and δ
◮ such that the right hand side is less than ǫ
SLIDE 20 Uniform Convergence
If m > d = VC(H) we have by Sauer: τH(2m) ≤ (2em/d)^d. Hence,

|L_𝒟(h) − L_D(h)| ≤ (4 + √(d log(2em/d))) / (δ√(2m))

For large enough m,

|L_𝒟(h) − L_D(h)| ≤ (1/δ) √(4d log(2em/d) / m)

Some tedious algebra shows that this implies that |L_𝒟(h) − L_D(h)| ≤ ǫ if

m ≥ 4 (2d/(δǫ)²) log(2d/(δǫ)²) + (4d log(2e/d)) / (δǫ)²

That is, for H with finite VC dimension we have uniform convergence.
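Rather than relying on the closed form, one can also find a sufficient m numerically from the simplified bound (1/δ)√(4d log(2em/d)/m) ≤ ǫ; a sketch (the doubling-plus-bisection search is my own, not from the slides):

```python
import math

def uc_sample_size(d, eps, delta):
    """Find (by doubling + bisection) an m with
    (1/delta) * sqrt(4 d log(2 e m / d) / m) <= eps."""
    def ok(m):
        return math.sqrt(4 * d * math.log(2 * math.e * m / d) / m) / delta <= eps
    m = 2 * d
    while not ok(m):       # the bound eventually goes to 0, so this terminates
        m *= 2
    lo, hi = m // 2, m
    while lo + 1 < hi:     # shrink to the boundary between failing and passing m
        mid = (lo + hi) // 2
        lo, hi = (lo, mid) if ok(mid) else (mid, hi)
    return hi

print(uc_sample_size(d=5, eps=0.1, delta=0.1))
```

Note how pessimistic this proof-driven bound is compared to the O((d + log(1/δ))/ǫ²) bounds quoted later.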
SLIDE 21 The Fundamental Theorem
Let H be a hypothesis class of functions from a domain X to {0, 1} with the 0/1 loss. Then the following statements are equivalent:
1. H has the uniform convergence property
2. Any ERM rule is a successful agnostic PAC learner for H
3. H is agnostic PAC learnable
4. H is PAC learnable
5. Any ERM rule is a successful PAC learner for H
6. H has a finite VC dimension
Our calculation leading up to this theorem – its proof, actually – gives us a bound on the sample complexity. This bound is not as good as possible. I’ll give you better bounds, without proof (it depends on yet another interesting concept: ǫ-nets).
SLIDE 22 The Fundamental Theorem: the Bounds
Let H be a hypothesis class of functions from a domain X to {0, 1} with the 0/1 loss, and let d = VC(H). Then
1. H has the uniform convergence property with sample complexity
m_H^{UC} = O((d + log(1/δ)) / ǫ²)
2. H is agnostic PAC learnable with sample complexity
m_H = O((d + log(1/δ)) / ǫ²)
3. H is PAC learnable with sample complexity
m_H = O((d log(1/ǫ) + log(1/δ)) / ǫ)
SLIDE 23 Polynomial Sample Complexity
When Valiant introduced PAC learning he required that
◮ the sample complexity should be polynomial in 1/δ and 1/ǫ
The bounds on the sample complexity we just discussed show that this requirement is not necessary
◮ PAC learnability implies a polynomial sample complexity (under the conditions of the theorem)
Hence there is no reason to stipulate this requirement. Valiant’s other requirement
◮ the existence of a polynomial learning algorithm
of course still makes perfect sense. Non-polynomial algorithms on polynomially sized samples are still not practical.
SLIDE 24 Bounds in Terms of Growth
Analogously to the proof of the Fundamental Theorem, one can prove: for any hypothesis space H (finite or infinite), for any sample D of size m and for any ǫ > 0

P(∃h ∈ H : L_𝒟(h) > L_D(h) + ǫ) ≤ 8τH(m) e^{−mǫ²/32}

So, with probability at least 1 − δ,

∀h ∈ H : L_𝒟(h) ≤ L_D(h) + √(32 (log(τH(m)) + log(8/δ)) / m)
SLIDE 25 For Consistent Hypotheses Only
If we restrict ourselves to hypotheses that are consistent with D
◮ they make 0 errors on D
◮ that is, L_D(h) = 0
we get slightly tighter bounds. In terms of growth, with probability at least 1 − δ

L_𝒟(h) ≤ (2 log(τH(2m)) + 2 log(2/δ)) / m

In terms of the VC dimension d, with m ≥ d ≥ 1, with probability at least 1 − δ

L_𝒟(h) ≤ (2d log(2em/d) + 2 log(2/δ)) / m
SLIDE 26
Starting From Big Data
Our journey towards this Fundamental Theorem started with the analysis of Big Data. Next to serious problems such as
◮ the curse of dimensionality
◮ and the fact that Big Data makes every difference statistically significant
◮ however small and pragmatically insignificant it may be
we identified the, perhaps largest, problem as: Big Data is too big to process. Superlinear algorithms
◮ quickly become infeasible on very large data sets
Hence, the quest we set out on:
◮ can we sample D to make (superlinear) learning feasible?
SLIDE 27
Frequent Itemsets
To make the Big Data problem more concrete we introduced a typical data mining problem: Frequent Itemset Mining. We noted that the Apriori algorithm
◮ which can be used to mine all frequent itemsets efficiently
actually applies to a far larger class of problems: Frequent Pattern Mining. Given that frequent itemset mining requires multiple scans over the database
◮ which can be very expensive for very large databases
the natural question was
◮ can we sample for frequent itemset mining?
SLIDE 28 Sampling for Frequent Itemset Mining
We discussed a paper by Toivonen, in which he showed that with a sample of size
◮ n ≥ (1/(2ǫ²)) ln(2/δ)
our estimate of the frequency of an itemset is, with probability of at least 1 − δ, off by at most ǫ.
The problem with this approach is that we
◮ may have false negatives: itemsets that are frequent on the database but not on the sample
We can mitigate that problem by
◮ lowering the threshold by √((1/(2n)) ln(1/µ))
◮ checking whether or not the border of our (estimated) set of frequent itemsets contains such false negatives
This gives us indirect control over the probability of false negatives
◮ can we get direct control?
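Both quantities are one-liners to compute from the Hoeffding-style bounds above; a minimal sketch (the function names are mine, not Toivonen’s):

```python
import math

def toivonen_sample_size(eps, delta):
    """n >= (1 / (2 eps^2)) * ln(2 / delta): the frequency estimate of an
    itemset is off by at most eps with probability at least 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def lowered_threshold(theta, n, mu):
    """Lower the support threshold theta by sqrt((1/(2n)) ln(1/mu)) to keep
    the false-negative probability per itemset below mu."""
    return theta - math.sqrt(math.log(1 / mu) / (2 * n))

n = toivonen_sample_size(eps=0.01, delta=0.01)
print(n)                                     # 26492
print(lowered_threshold(0.05, n, mu=0.001))  # the threshold to mine the sample at
```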
SLIDE 29 From Itemsets to Classification
We saw that an itemset Z, or better its associated indicator function, acts as a classifier on D:

1_Z(t) = 1 if Z ⊆ t, and 0 otherwise

This observation allows us to go from
◮ unsupervised learning – which is what itemset mining is
◮ to supervised learning – which is what classification is
The advantage that supervised learning problems have over unsupervised ones
◮ is that they have objective quality measures,
◮ e.g., higher accuracy = better model
Exploiting such measures might give us a better grip on sampling.
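The indicator-function view in code, on a toy transaction database of my own invention:

```python
def indicator(Z):
    """1_Z(t) = 1 iff itemset Z is contained in transaction t."""
    Z = frozenset(Z)
    return lambda t: 1 if Z <= frozenset(t) else 0

database = [{"beer", "chips"}, {"beer", "wine"}, {"beer", "chips", "salsa"}]
classify = indicator({"beer", "chips"})

labels = [classify(t) for t in database]
print(labels)                        # [1, 0, 1]
print(sum(labels) / len(database))   # the frequency (support) of the itemset
```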
SLIDE 30 From Classification
We started this quest with the analysis of a simple classification problem (finite hypothesis class and the realizability assumption). From this analysis, we proved:

Let H be a finite hypothesis space. Let δ ∈ (0, 1), let ǫ > 0 and let m ∈ N such that

m ≥ log(|H|/δ) / ǫ

Then, for any labelling function f and distribution 𝒟 for which the realizability assumption holds, with probability of at least 1 − δ over the choice of an i.i.d. sample D of size m, we have that for every ERM hypothesis h_D:

L_{𝒟,f}(h_D) ≤ ǫ
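Plugging numbers into this bound shows the mild, logarithmic dependence on |H|; a sketch:

```python
import math

def realizable_sample_size(h_size, eps, delta):
    """m >= log(|H| / delta) / eps, as in the theorem above."""
    return math.ceil(math.log(h_size / delta) / eps)

print(realizable_sample_size(h_size=10**6, eps=0.01, delta=0.05))   # 1682
# a million times more hypotheses costs less than twice the sample:
print(realizable_sample_size(h_size=10**12, eps=0.01, delta=0.05))  # 3063
```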
SLIDE 31
To PAC learning
Then we turned this result upside down and made it into the definition of
◮ Probably Approximately Correct learning
Learning problems that almost always give reasonably good results
◮ with (polynomially) sized data sets
And that last point is very important in the Big Data context
◮ as was discussed in the first two lectures
At first we limited ourselves to the realizable case
◮ colloquially: the hypothesis class contains the true hypothesis
and an immediate consequence of our previous theorem was
◮ finite hypothesis classes are PAC learnable
SLIDE 32
In Full Generality
Then we loosened the requirements
◮ firstly dropping the realizability assumption
◮ secondly allowing for arbitrary loss functions
to arrive at the general definition of PAC learning: a hypothesis class H is agnostic PAC learnable with respect to a set Z and a loss function l : Z × H → R⁺ if there exist a function m_H : (0, 1)² → N and a learning algorithm A with the following property:
◮ for every ǫ, δ ∈ (0, 1)
◮ for every distribution 𝒟 over Z
◮ when running A on m ≥ m_H(ǫ, δ) i.i.d. samples generated by 𝒟
◮ A returns a hypothesis h ∈ H such that, with probability at least 1 − δ,

L_𝒟(h) ≤ min_{h′∈H} L_𝒟(h′) + ǫ
SLIDE 33
Desirable, but Attainable?
Clearly, PAC learnability is a desirable property
◮ you have the guarantee that you almost always get results that are almost as good as it gets
But then the question is
◮ are there hypothesis sets that have this property?
We first showed that hypothesis sets that have the uniform convergence property
◮ on almost all (large enough) data sets your estimate of the loss of a hypothesis is close to the true loss
are PAC learnable (in the general sense). And with that result we proved that
◮ finite hypothesis sets are PAC learnable
Finite can be very large
◮ and you can always approximate your favourite infinite class with a finite one
But then your choice of a finite class has a direct influence on the result you achieve.
SLIDE 34
Infinite Classes
So, it would be nice if we could PAC learn infinitely large hypothesis classes. But then came our first negative result
◮ the No Free Lunch theorem says: there are infinitely large hypothesis classes you cannot PAC learn
◮ you would need infinitely large data samples
◮ even larger than Big Data!
We then showed that
◮ the infinite class of threshold functions can be PAC learned in the general sense
◮ we had already seen that this class could not be handled by our earlier (finite-class) results for the more restricted realizable case
◮ so, that in itself is already a relief
We then compared the proof of the No Free Lunch theorem
◮ with the threshold classifiers
And, from that comparison, we came up
◮ with the VC dimension
SLIDE 35 VC Dimension
The VC dimension of a hypothesis class H is the size of the largest (finite) set of data points that H shatters, that is, it is the size of the largest C ⊂ X such that |H_C| = 2^{|C|}.
The proof of the No Free Lunch theorem showed that if the size of the sample m ≤ VC(H)/2, then it may be hard to find a good h ∈ H.
In other words, a finite VC dimension tells us
◮ that we can distinguish between the different hypotheses relatively quickly
◮ from a modestly sized sample
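Shattering can be checked directly for small sets; a sketch using threshold functions (VC dimension 1), with a finite list of representative thresholds standing in for the infinite class:

```python
def shatters(hypotheses, C):
    """True iff the hypotheses realize all 2^|C| labelings of C."""
    realized = {tuple(h(x) for x in C) for h in hypotheses}
    return len(realized) == 2 ** len(C)

# Threshold functions h_t(x) = 1 iff x >= t; on a finite set of points only
# finitely many thresholds matter, so a representative list suffices.
H = [lambda x, t=t: int(x >= t) for t in (0.0, 1.5, 3.0)]

print(shatters(H, [1.0]))        # True: {1.0} is shattered, so VC >= 1
print(shatters(H, [1.0, 2.0]))   # False: no threshold labels 1.0 -> 1, 2.0 -> 0
```

The labeling (1, 0) on {1.0, 2.0} is impossible for any threshold, which is exactly why the VC dimension of this class is 1.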
SLIDE 36 Growth
This ability of the VC dimension is further illustrated by the growth function, defined by

τH(m) = max_{C⊂X : |C|=m} |{(f(c1), . . . , f(cm)) : f ∈ H}|

For m ≤ d = VC(H), we have τH(m) = 2^m. More generally, we have by Sauer’s Lemma that if d = VC(H) < ∞:
◮ ∀m : τH(m) ≤ Σ_{i=0}^d (m choose i)
◮ if m ≥ d : τH(m) ≤ (em/d)^d
The growth function starts off as an exponential function, but from d onwards it is bounded by a polynomial. Hence the expectation
◮ perhaps I should say hope
that infinite hypothesis classes with a finite VC dimension will be PAC learnable.
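Sauer’s bound is easy to tabulate; a sketch showing the exponential-to-polynomial transition at d (the values of d and m are arbitrary):

```python
from math import comb, e

def sauer_bound(m, d):
    """Sauer's lemma: tau_H(m) <= sum_{i=0}^{d} C(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))

d = 3
for m in (1, 2, 3, 4, 6, 10, 20):
    b = sauer_bound(m, d)
    poly = (e * m / d) ** d if m >= d else None   # (em/d)^d, valid for m >= d
    # up to m = d the bound equals 2^m; beyond d it grows only polynomially
    print(m, b, 2 ** m, poly)
```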
SLIDE 37
The Fundamental Theorem
The Fundamental Theorem tells us that our expectation was correct
◮ hypothesis classes are PAC learnable iff they have a finite VC dimension
◮ moreover, the sample size you need is polynomial in the parameters that matter
◮ in d, 1/δ (in fact log(1/δ)) and 1/ǫ
In other words, we appear to have ended our quest: as long as we use hypothesis classes with a finite VC dimension, we can conquer the problem of Big Data by sampling.
So the question now is:
◮ can we use PAC learning to derive sample bounds for frequent itemset mining?
We’ll study that next, but it is not the end of the story.
SLIDE 38
There is More
The concept of PAC learning requires
◮ a sample size that holds for all h ∈ H at the same time
◮ and that we can get arbitrarily close to the truth
What if we relax those requirements
◮ would that allow us to battle Big Data with a larger class of hypothesis sets?
The answer,
◮ somewhat surprisingly,
is: not really. This does not have direct ramifications for our frequent itemset mining problem
◮ but it tells us that PAC learning is a reasonable way to battle the problem of induction