CS485/685 Lecture 15: Feb 28, 2012

Probably Approximately Correct Learning [BDSS] Chapter 1


Quick Recap

  • Tom Mitchell (1998): A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

– Experience:
– Task:
– Performance measure:


Performance Measure

  • So far, we measured the performance of algorithms empirically

– Train with a training set and measure performance with a separate test set
– K‐fold cross validation (a minimal sketch follows below):
  • Can reuse the data for training and testing
  • Average performance over multiple splits of the data to improve statistical reliability
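
A minimal sketch of k‐fold cross validation, assuming generic `train` and `accuracy` callables (both hypothetical placeholders for any learner and performance metric):

```python
import random

def k_fold_cv(data, k, train, accuracy):
    """Estimate performance by averaging over k train/test splits."""
    data = data[:]                  # copy so shuffling doesn't mutate the caller's list
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]       # k roughly equal folds
    scores = []
    for i in range(k):
        test = folds[i]                          # hold out fold i for testing
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        h = train(training)                      # fit on the remaining k-1 folds
        scores.append(accuracy(h, test))         # evaluate on the held-out fold
    return sum(scores) / k                       # average for statistical reliability
```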

  • Open questions:

– How much data do we need to learn a task?
– When is a task learnable?


Computational Complexity

  • Computational Complexity: branch of the theory of computation that focuses on classifying computational problems based on their inherent difficulty

– Time complexity
– Space complexity

  • In machine learning, we also consider

– Data complexity (a.k.a. sample complexity)


Computational Complexity

  • Time/space complexity

– How do time/space requirements vary with the size of the input?

  • Data complexity

– How do data requirements (size of the input) vary with the performance level?

  • Problem: we can’t guarantee a performance level because the training data is usually different from the data that the algorithm will encounter in the future

  • Idea: study data requirements as a function of a probabilistic performance level


Formal Model (Supervised Classification)

  • 1. The learner’s input

a. Domain set $\mathcal{X}$ (e.g., possible emails in spam filtering)
b. Label set $\mathcal{Y}$ (e.g., spam, ~spam)

For convenience assume that $\mathcal{Y} = \{0,1\}$ or $\{-1,+1\}$

c. Training data $S = ((x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m))$, a sequence of pairs in $\mathcal{X} \times \mathcal{Y}$

  • 2. The learner’s output:

hypothesis or prediction rule $h : \mathcal{X} \to \mathcal{Y}$, e.g., decision tree, k‐NN rule, linear separator
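
As a reading aid, the objects above can be written as Python type aliases (a sketch; the concrete feature types are illustrative assumptions, not from the slides):

```python
from typing import Callable, List, Tuple

X = Tuple[float, float]          # domain set: e.g., a feature vector per email/papaya
Y = int                          # label set: {0, 1}
Sample = List[Tuple[X, Y]]       # training data S: a sequence of (x, y) pairs
Hypothesis = Callable[[X], Y]    # learner's output h : X -> Y
```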


Formal Model (Supervised Classification)

  • 3. Data generation model: training and testing data is sampled independently and identically (i.i.d.) from an unknown distribution $D$, i.e., $(x_i, y_i) \sim D \;\; \forall i$

  • 4. Performance measure: probability of error

Empirical loss: $L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[h(x_i) \neq y_i]$

$L_D(h) = \Pr_{(x,y) \sim D}[h(x) \neq y]$: the true loss, but $D$ is unknown
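
The empirical loss translates directly into code; a minimal sketch (the hypothesis `h` and labeled sample `S` are placeholders). No analogous one-liner exists for the true loss, since $D$ is unknown:

```python
def empirical_loss(h, S):
    """L_S(h): fraction of labeled pairs (x, y) in S that h misclassifies."""
    return sum(1 for x, y in S if h(x) != y) / len(S)
```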


Empirical Risk Minimization

  • $L_D(h)$ is unknown, but $L_S(h)$ is known

  • Empirical risk minimization (ERM): find $h_S$ that minimizes $L_S(h)$ (see the sketch below)

  • How good is ERM?

– It can be pretty bad (due to overfitting)
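
A minimal ERM sketch for a finite hypothesis class, reusing the empirical loss from the previous sketch (the threshold class `H` and toy sample `S` are illustrative assumptions, not from the slides):

```python
def empirical_loss(h, S):
    """L_S(h): fraction of pairs in S that h misclassifies."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H, S):
    """Return a hypothesis in H minimizing the empirical loss L_S(h)."""
    return min(H, key=lambda h: empirical_loss(h, S))

# Illustrative finite class: integer-threshold classifiers on the reals.
H = [lambda x, t=t: 1 if x >= t else 0 for t in range(11)]
S = [(2, 0), (4, 0), (7, 1), (9, 1)]
h_S = erm(H, S)          # picks a threshold consistent with the sample
```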


Papaya example

  • Consider a papaya prediction problem: given a papaya’s features (e.g., its color and softness), predict whether it will taste good


Papaya example

  • Hypothesis $h_S$: if a papaya is identical to a previously tasted papaya, predict the same taste. Otherwise, assume that it tastes bad.

  • Let $h_S(x) = y_i$ if $\exists i$ such that $x = x_i$, and $h_S(x) = 0$ otherwise

  • Then $L_S(h_S) = 0$ but $L_D(h_S)$ can be large

  • This is an example of poor generalization (overfitting)
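
The memorizing hypothesis $h_S$ is easy to write down explicitly; a sketch, with papayas encoded as hypothetical feature tuples:

```python
def memorize(S):
    """h_S: repeat the label of any previously tasted papaya, else predict 'bad' (0)."""
    table = {x: y for x, y in S}
    return lambda x: table.get(x, 0)

S = [((0.6, 0.4), 1), ((0.2, 0.9), 0)]   # illustrative (features, taste) pairs
h_S = memorize(S)
# L_S(h_S) = 0 by construction: every training papaya is looked up exactly.
# On any papaya not in S, h_S blindly predicts 0, so L_D(h_S) can be large.
```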


Generalization

  • How does the accuracy of $h_S$ vary with the amount of data?

– As $|S| \uparrow$, then $|L_D(h_S) - L_S(h_S)| \downarrow$ (simulated in the sketch below)

  • How much data do we need to make sure that the hypothesis $h_S$ found by ERM is not much worse than the best hypothesis $h^*$ most of the time?

$L_D(h_S) \leq L_D(h^*) + \epsilon$ where $h^* = \arg\min_{h \in \mathcal{H}} L_D(h)$
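
A toy simulation of the first point, under assumed conditions (inputs uniform on [0,1], a true threshold at 0.5, and a finite threshold class): the average gap $|L_D(h_S) - L_S(h_S)|$ shrinks as $|S|$ grows:

```python
import random

def gap(m, trials=200):
    """Average |L_D(h_S) - L_S(h_S)| for ERM over a threshold class, sample size m."""
    H = [t / 20 for t in range(21)]           # finite class of 21 thresholds
    label = lambda x: 1 if x >= 0.5 else 0    # true (realizable) labeling rule
    total = 0.0
    for _ in range(trials):
        S = [(x, label(x)) for x in (random.random() for _ in range(m))]
        t_hat = min(H, key=lambda t: sum((x >= t) != y for x, y in S))  # ERM
        L_S = sum((x >= t_hat) != y for x, y in S) / m                  # training loss
        L_D = abs(t_hat - 0.5)                # exact true loss under uniform D
        total += abs(L_D - L_S)
    return total / trials

# The gap decreases with |S|:
print([round(gap(m), 3) for m in (5, 20, 80, 320)])
```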


Assumptions

  • 1. Finite hypothesis class

– Assume $\mathcal{H}$ is finite (and chosen before receiving $S$)

  • 2. Realizable assumption: there exists a perfect hypothesis $h^* \in \mathcal{H}$

– i.e., $\exists h^* \in \mathcal{H}$ such that $L_D(h^*) = 0$
– This implies that for any training set $S$, $L_S(h^*) = 0$
– Since $h^*$ is deterministic, this implies that $y \mid x$ is deterministic

  • 3. i.i.d. assumption:

– Data is sampled independently and identically distributed from $D$


Analysis

  • Find sample size $|S|$ such that $L_D(h_S) \leq \epsilon$

– Here $\epsilon$ is a bound on the true loss

  • Problem: since $S$ is obtained by a random process, $h_S$ and $L_D(h_S)$ are random.

  • Instead: find sample size $|S|$ such that

$\Pr[L_D(h_S) > \epsilon] \leq \delta$

– Here $\delta$ is a bound on the probability that we obtain a sample for which $h_S$ is bad (i.e., $L_D(h_S) > \epsilon$)
– Hence $1 - \delta$ is our confidence in the bound


Bound

Corollary: Let $\mathcal{H}$ be finite, $\delta \in (0,1)$, $\epsilon > 0$ and $m \geq \frac{\log(|\mathcal{H}|/\delta)}{\epsilon}$;

  • then for any $D$ (for which the realizable assumption holds), with probability at least $1 - \delta$ we have that $L_D(h_S) \leq \epsilon$
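
Plugging concrete (illustrative) numbers into the corollary’s sample-size requirement:

```python
from math import ceil, log

def sample_size(H_size, eps, delta):
    """Smallest integer m with m >= log(|H| / delta) / eps."""
    return ceil(log(H_size / delta) / eps)

# e.g. |H| = 2**20 hypotheses, 5% error, 99% confidence:
print(sample_size(2**20, eps=0.05, delta=0.01))   # 370: a few hundred examples suffice
```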


Proof

Proof: we need to show that $\Pr_{S \sim D^m}[L_D(h_S) > \epsilon] \leq \delta$

  • Let $\mathcal{H}_B = \{h \in \mathcal{H} \mid L_D(h) > \epsilon\}$ be the set of bad hypotheses

  • By the realizable assumption, $L_S(h_S) = 0$.

  • This implies that $L_D(h_S) > \epsilon$ can only happen if for some $h \in \mathcal{H}_B$ we have $L_S(h) = 0$.

  • Hence $\{S \mid L_D(h_S) > \epsilon\} \subseteq \{S \mid \exists h \in \mathcal{H}_B,\ L_S(h) = 0\}$

$\Longrightarrow \{S \mid L_D(h_S) > \epsilon\} \subseteq \bigcup_{h \in \mathcal{H}_B} \{S \mid L_S(h) = 0\}$


Proof (continued)

  • Bound the learning failure:

$\Pr[L_D(h_S) > \epsilon] \leq \Pr[\bigcup_{h \in \mathcal{H}_B} \{S \mid L_S(h) = 0\}] \leq \sum_{h \in \mathcal{H}_B} \Pr[L_S(h) = 0]$

by the union bound

  • Union bound: $\Pr[A \cup B] \leq \Pr[A] + \Pr[B]$

Proof (continued)

$\sum_{h \in \mathcal{H}_B} \Pr_{S \sim D^m}[L_S(h) = 0] = \sum_{h \in \mathcal{H}_B} \Pr_{S \sim D^m}[\forall i,\ h(x_i) = y_i]$

$= \sum_{h \in \mathcal{H}_B} \prod_{i=1}^{m} \Pr_{x_i \sim D}[h(x_i) = y_i]$ (i.i.d. assumption)

$= \sum_{h \in \mathcal{H}_B} \prod_{i=1}^{m} (1 - L_D(h)) \leq \sum_{h \in \mathcal{H}_B} (1 - \epsilon)^m \leq |\mathcal{H}| (1 - \epsilon)^m \leq |\mathcal{H}| e^{-\epsilon m} \leq \delta$

since $1 - \epsilon \leq e^{-\epsilon}$ and since $m \geq \log(|\mathcal{H}|/\delta)/\epsilon$
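
A quick numerical check of the final chain of inequalities, with illustrative values of $|\mathcal{H}|$, $\epsilon$, and $\delta$:

```python
from math import ceil, exp, log

H_size, eps, delta = 1000, 0.1, 0.05          # illustrative values
m = ceil(log(H_size / delta) / eps)           # sample size from the corollary (m = 100)
print(H_size * (1 - eps) ** m)                # |H| (1 - eps)^m   ~ 0.027
print(H_size * exp(-eps * m))                 # <= |H| e^{-eps m} ~ 0.045
print(delta)                                  # <= delta = 0.05, as the proof guarantees
```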


Probably Approximately Correct (PAC) Learning

  • Definition: A hypothesis class $\mathcal{H}$ is PAC learnable if for any $\epsilon > 0$, $\delta \in (0,1)$ there exists a function $m_{\mathcal{H}}(\epsilon, \delta)$ and a learning algorithm such that for any distribution $D$ over $\mathcal{X}$ which satisfies the realizability assumption, when running the algorithm on $m \geq m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples it returns $h \in \mathcal{H}$ such that with probability at least $1 - \delta$, $L_D(h) \leq \epsilon$.

  • By Corollary 1, finite hypothesis classes are PAC learnable
