

  1. Foundations of Machine Learning: Learning with Finite Hypothesis Sets

  2. Motivation
  Some computational learning questions:
  • What can be learned efficiently?
  • What is inherently hard to learn?
  • A general model of learning?
  Complexity:
  • Computational complexity: time and space.
  • Sample complexity: amount of training data needed to learn successfully.
  • Mistake bounds: number of mistakes before learning successfully.

  3. This lecture
  • PAC model.
  • Sample complexity, finite H, consistent case.
  • Sample complexity, finite H, inconsistent case.

  4. Definitions and Notation
  • X: set of all possible instances or examples, e.g., the set of all men and women characterized by their height and weight.
  • c : X → {0, 1}: the target concept to learn; can be identified with its support {x ∈ X : c(x) = 1}.
  • C: concept class, a set of target concepts c.
  • D: target distribution, a fixed probability distribution over X. Training and test examples are drawn according to D.

  5. Definitions and Notation
  • S: training sample.
  • H: set of concept hypotheses, e.g., the set of all linear classifiers.
  • The learning algorithm receives the sample S and selects a hypothesis h_S from H approximating c.

  6. Errors
  • True error or generalization error of h with respect to the target concept c and distribution D:
      R(h) = Pr_{x∼D}[h(x) ≠ c(x)] = E_{x∼D}[1_{h(x)≠c(x)}].
  • Empirical error: average error of h on the training sample S drawn according to distribution D,
      R̂_S(h) = Pr_{x∼D̂}[h(x) ≠ c(x)] = E_{x∼D̂}[1_{h(x)≠c(x)}] = (1/m) Σ_{i=1}^m 1_{h(x_i)≠c(x_i)}.
  • Note: R(h) = E_{S∼D^m}[R̂_S(h)].
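  A minimal sketch of the empirical error R̂_S(h) as a Python function; the hypothesis, labels, and sample points below are illustrative placeholders, not part of the original slides.

  ```python
  # Minimal sketch: empirical error of a hypothesis on a labeled sample.
  # The hypothesis and the sample below are illustrative only.

  def empirical_error(h, sample):
      """Average 0-1 loss of hypothesis h on sample = [(x, c(x)), ...]."""
      return sum(1 for x, y in sample if h(x) != y) / len(sample)

  # Example: a threshold hypothesis on 1-D points (hypothetical data).
  h = lambda x: 1 if x >= 0.5 else 0
  sample = [(0.2, 0), (0.7, 1), (0.4, 1), (0.9, 1)]
  print(empirical_error(h, sample))  # 0.25: h disagrees with the label on one of four points
  ```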

  7. PAC Model (Valiant, 1984)
  PAC learning: Probably Approximately Correct learning.
  Definition: a concept class C is PAC-learnable if there exists a learning algorithm L such that:
  • for all c ∈ C, ε > 0, δ > 0, and all distributions D,
      Pr_{S∼D^m}[R(h_S) ≤ ε] ≥ 1 − δ,
  • for samples S of size m = poly(1/ε, 1/δ) for a fixed polynomial.

  8. Remarks
  • Concept class C is known to the algorithm.
  • Distribution-free model: no assumption on D.
  • Both training and test examples are drawn ∼ D.
  • Probably: confidence 1 − δ.
  • Approximately correct: accuracy 1 − ε.
  • Efficient PAC-learning: L runs in time poly(1/ε, 1/δ).
  • What about the cost of the representation of c ∈ C?

  9. PAC Model - New Definition
  Computational representation:
  • cost for x ∈ X in O(n),
  • cost for c ∈ C in O(size(c)).
  Extension: running time O(poly(1/ε, 1/δ)) → O(poly(1/ε, 1/δ, n, size(c))).

  10. Example - Rectangle Learning
  Problem: learn an unknown axis-aligned rectangle R using as small a labeled sample as possible.
  Hypothesis: a rectangle R'. In general, there may be false positive and false negative points.

  11. Example - Rectangle Learning
  Simple method: choose the tightest rectangle R' consistent with a large enough sample.
  • How large a sample?
  • Is this class PAC-learnable?
  • What is the probability that R(R') > ε?
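  A minimal sketch of the tightest-consistent-rectangle method, assuming points are 2-D tuples labeled 1 inside the target rectangle and 0 outside; the function names and toy data are illustrative, not from the slides.

  ```python
  # Minimal sketch: return the tightest axis-aligned rectangle R' containing
  # all positive training points (the sample below is hypothetical).

  def tightest_rectangle(sample):
      """sample: iterable of ((x, y), label) pairs with label in {0, 1}.
      Returns (l, r, b, t) bounding all positives, or None if there are none."""
      positives = [p for p, label in sample if label == 1]
      if not positives:
          return None  # empty rectangle: predict 0 everywhere
      xs, ys = zip(*positives)
      return (min(xs), max(xs), min(ys), max(ys))

  def predict(rect, point):
      """Label 1 iff the point falls inside the learned rectangle."""
      if rect is None:
          return 0
      l, r, b, t = rect
      x, y = point
      return int(l <= x <= r and b <= y <= t)

  # Usage with a toy sample:
  S = [((0.3, 0.4), 1), ((0.6, 0.7), 1), ((0.9, 0.1), 0), ((0.1, 0.9), 0)]
  R_prime = tightest_rectangle(S)
  print(R_prime, predict(R_prime, (0.5, 0.5)))  # (0.3, 0.6, 0.4, 0.7) 1
  ```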

  12. Example - Rectangle Learning
  Fix ε > 0 and assume Pr_D[R] > ε (otherwise the result is trivial).
  Let r_1, r_2, r_3, r_4 be the four smallest rectangles along the sides of R = [l, r] × [b, t] such that Pr_D[r_i] ≥ ε/4.
  For example, r_4 = [l, s_4] × [b, t] with
      s_4 = inf{s : Pr_D[[l, s] × [b, t]] ≥ ε/4},   so   Pr_D[[l, s_4[ × [b, t]] < ε/4.

  13. Example - Rectangle Learning
  Errors can only occur in R − R'. Thus (geometry), R(R') > ε ⇒ R' misses at least one region r_i.
  Therefore,
      Pr[R(R') > ε] ≤ Pr[∪_{i=1}^4 {R' misses r_i}]
                   ≤ Σ_{i=1}^4 Pr[{R' misses r_i}]
                   ≤ 4(1 − ε/4)^m
                   ≤ 4 e^{−mε/4}.

  14. Example - Rectangle Learning
  Set δ > 0 to match the upper bound:
      4 e^{−mε/4} ≤ δ  ⇔  m ≥ (4/ε) log(4/δ).
  Then, for m ≥ (4/ε) log(4/δ), with probability at least 1 − δ, R(R') ≤ ε.
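  A quick numeric check of the bound m ≥ (4/ε) log(4/δ); the accuracy and confidence values below are illustrative choices, not from the slides.

  ```python
  # Sample size sufficient for the tightest-rectangle learner:
  # m >= (4/eps) * log(4/delta).  The eps/delta values are illustrative.
  import math

  def rectangle_sample_size(eps, delta):
      return math.ceil(4 / eps * math.log(4 / delta))

  print(rectangle_sample_size(0.1, 0.02))  # 212 points suffice for eps = 0.1, delta = 0.02
  ```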

  15. Notes
  • Infinite hypothesis set, but simple proof.
  • Does this proof readily apply to other similar concept classes?
  • Geometric properties: key in this proof; in general non-trivial to extend to other classes, e.g., non-concentric circles (see HW2, 2006).
  • Need for a more general proof and results.

  16. This lecture
  • PAC model.
  • Sample complexity, finite H, consistent case.
  • Sample complexity, finite H, inconsistent case.

  17. Learning Bound for Finite H - Consistent Case
  Theorem: let H be a finite set of functions from X to {0, 1} and L an algorithm that, for any target concept c ∈ H and sample S, returns a consistent hypothesis h_S: R̂_S(h_S) = 0. Then, for any δ > 0, with probability at least 1 − δ,
      R(h_S) ≤ (1/m)(log|H| + log(1/δ)).
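  A small numeric illustration of the consistent-case bound; the sample size, |H|, and δ below are illustrative values, not from the slides.

  ```python
  # Consistent case: with probability at least 1 - delta,
  # R(h_S) <= (1/m) * (log|H| + log(1/delta)).  Example values are illustrative.
  import math

  def consistent_case_bound(m, H_size, delta):
      return (math.log(H_size) + math.log(1 / delta)) / m

  print(round(consistent_case_bound(m=1000, H_size=2**20, delta=0.05), 4))  # ~0.0169
  ```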

  18. Learning Bound for Finite H - Consistent Case
  Proof: for any ε > 0, define H_ε = {h ∈ H : R(h) > ε}. Then,
      Pr[∃ h ∈ H_ε : R̂_S(h) = 0]
        = Pr[R̂_S(h_1) = 0 ∨ ··· ∨ R̂_S(h_{|H_ε|}) = 0]
        ≤ Σ_{h ∈ H_ε} Pr[R̂_S(h) = 0]        (union bound)
        ≤ Σ_{h ∈ H_ε} (1 − ε)^m ≤ |H|(1 − ε)^m ≤ |H| e^{−mε}.
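  The bound stated in the theorem follows by setting this final right-hand side equal to δ and solving for ε; the slide leaves this inversion step implicit:

  ```latex
  |H|\, e^{-m\epsilon} \le \delta
  \;\Longleftrightarrow\;
  \epsilon \ge \frac{1}{m}\Big(\log|H| + \log\frac{1}{\delta}\Big)
  ```

  so with probability at least 1 − δ no consistent hypothesis has error above this value, and in particular R(h_S) ≤ (1/m)(log|H| + log(1/δ)).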

  19. Remarks
  • The algorithm can be ERM if the problem is realizable.
  • Error bound linear in 1/m and only logarithmic in 1/δ.
  • log_2|H| is the number of bits used for the representation of H.
  • Bound is loose for large |H|.
  • Uninformative for infinite |H|.

  20. Conjunctions of Boolean Literals
  Example for n = 6.
  Algorithm: start with x_1 ∧ x̄_1 ∧ ··· ∧ x_n ∧ x̄_n and rule out literals incompatible with positive examples.

      0 1 1 0 1 1   +
      0 1 1 1 1 1   +
      0 0 1 1 0 1   -
      0 1 1 1 1 1   +
      1 0 0 1 1 0   -
      0 1 0 0 1 1   +
      -----------
      0 1 ? ? 1 1   →   x̄_1 ∧ x_2 ∧ x_5 ∧ x_6.
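  A minimal sketch of the literal-elimination algorithm run on the slide's example; the encoding of a literal as an (index, value) pair is an implementation choice of this sketch, not part of the slides.

  ```python
  # Learn a conjunction of boolean literals: start with all 2n literals and
  # drop every literal contradicted by a positive example.

  def learn_conjunction(examples, n):
      """examples: list of (bits, label) with bits a 0/1 tuple of length n.
      A literal is (i, v): 'x_i must equal v' (v=1 for x_i, v=0 for its negation)."""
      literals = {(i, v) for i in range(n) for v in (0, 1)}
      for bits, label in examples:
          if label == 1:  # positive examples rule out incompatible literals
              literals -= {(i, 1 - bits[i]) for i in range(n)}
      return sorted(literals)

  # The example from the slide (n = 6); '+' -> 1, '-' -> 0.
  S = [((0,1,1,0,1,1), 1), ((0,1,1,1,1,1), 1), ((0,0,1,1,0,1), 0),
       ((0,1,1,1,1,1), 1), ((1,0,0,1,1,0), 0), ((0,1,0,0,1,1), 1)]
  print(learn_conjunction(S, 6))
  # [(0, 0), (1, 1), (4, 1), (5, 1)]  i.e. not-x1 AND x2 AND x5 AND x6
  ```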

  21. Conjunctions of Boolean Literals
  Problem: learning the class C_n of conjunctions of boolean literals over at most n variables (e.g., for n = 3, x_1 ∧ x_2 ∧ x_3).
  Algorithm: choose h consistent with S.
  • Since |H| = |C_n| = 3^n, sample complexity:
      m ≥ (1/ε)((log 3) n + log(1/δ)).
    For δ = .02, ε = .1, n = 10: m ≥ 149.
  • Computational complexity: polynomial, since the algorithmic cost per training example is in O(n).
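  A quick check of the sample-complexity figure on the slide, computed directly from the consistent-case bound with |H| = 3^n:

  ```python
  # Sample complexity for conjunctions of boolean literals: |H| = 3^n, so
  # m >= (1/eps) * (n * log(3) + log(1/delta)).  Values match the slide's example.
  import math

  def conjunction_sample_size(n, eps, delta):
      return math.ceil((n * math.log(3) + math.log(1 / delta)) / eps)

  print(conjunction_sample_size(n=10, eps=0.1, delta=0.02))  # 149
  ```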

  22. This lecture
  • PAC model.
  • Sample complexity, finite H, consistent case.
  • Sample complexity, finite H, inconsistent case.

  23. Inconsistent Case
  • No h ∈ H is a consistent hypothesis.
  • The typical case in practice: difficult problems, complex concept class.
  • But inconsistent hypotheses with a small number of errors on the training set can be useful.
  • Need a more powerful tool: Hoeffding's inequality.

  24. Hoeffding's Inequality
  Corollary: for any ε > 0 and any hypothesis h : X → {0, 1}, the following inequalities hold:
      Pr[R(h) − R̂_S(h) ≥ ε] ≤ e^{−2mε²}
      Pr[R̂_S(h) − R(h) ≥ ε] ≤ e^{−2mε²}.
  Combining these one-sided inequalities yields
      Pr[|R(h) − R̂_S(h)| ≥ ε] ≤ 2 e^{−2mε²}.
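  A small Monte Carlo sketch of the two-sided Hoeffding bound for a single fixed hypothesis, here modeled as a biased coin with true error 0.3; all parameter values are illustrative.

  ```python
  # Empirically compare Pr[|R(h) - R_hat(h)| >= eps] with the Hoeffding bound
  # 2*exp(-2*m*eps^2) for a fixed hypothesis whose true error is p (illustrative values).
  import math
  import random

  random.seed(0)
  p, m, eps, trials = 0.3, 100, 0.1, 10_000

  deviations = 0
  for _ in range(trials):
      errors = sum(random.random() < p for _ in range(m))  # empirical error count
      if abs(errors / m - p) >= eps:
          deviations += 1

  print(deviations / trials)                # observed deviation frequency, e.g. ~0.03
  print(2 * math.exp(-2 * m * eps ** 2))    # Hoeffding bound: ~0.27
  ```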

  25. Application to Learning Algorithms?
  • Can we apply that bound to the hypothesis h_S returned by our learning algorithm when training on sample S?
  • No, because h_S is not a fixed hypothesis; it depends on the training sample.
  • Note also that E[R̂_S(h_S)] is not a simple quantity such as R(h_S).
  • Instead, we need a bound that holds simultaneously for all hypotheses h ∈ H: a uniform convergence bound.

  26. Generalization Bound - Finite H
  Theorem: let H be a finite hypothesis set. Then, for any δ > 0, with probability at least 1 − δ,
      ∀ h ∈ H,  R(h) ≤ R̂_S(h) + √((log|H| + log(2/δ)) / (2m)).
  Proof: by the union bound,
      Pr[max_{h∈H} |R(h) − R̂_S(h)| > ε]
        = Pr[|R(h_1) − R̂_S(h_1)| > ε ∨ ··· ∨ |R(h_{|H|}) − R̂_S(h_{|H|})| > ε]
        ≤ Σ_{h∈H} Pr[|R(h) − R̂_S(h)| > ε]
        ≤ 2|H| exp(−2mε²).
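  A small helper that evaluates the finite-H generalization bound numerically; the empirical error, |H|, m, and δ below are illustrative values, not from the slides.

  ```python
  # Finite-H generalization bound: with probability >= 1 - delta,
  # R(h) <= R_hat(h) + sqrt((log|H| + log(2/delta)) / (2m)).  Example values are illustrative.
  import math

  def generalization_bound(emp_error, H_size, m, delta):
      return emp_error + math.sqrt((math.log(H_size) + math.log(2 / delta)) / (2 * m))

  print(round(generalization_bound(emp_error=0.05, H_size=2**20, m=1000, delta=0.05), 4))  # ~0.1437
  ```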

  27. Remarks
  • Thus, for a finite hypothesis set, with high probability,
      ∀ h ∈ H,  R(h) ≤ R̂_S(h) + O(√(log|H| / m)).
  • Error bound in O(1/√m) (quadratically worse than the consistent case).
  • log_2|H| can be interpreted as the number of bits needed to encode H.
  • Occam's Razor principle (theologian William of Occam): "plurality should not be posited without necessity."
