Computational Learning Theory: The Theory of Generalization


SLIDE 1

Machine Learning

Computational Learning Theory: The Theory of Generalization

1

Slides based on material from Dan Roth, Avrim Blum, Tom Mitchell and others

SLIDE 2

Checkpoint: The bigger picture

  • Supervised learning: instances, concepts, and hypotheses
  • Specific learners
    – Decision trees
    – Perceptron
    – Winnow
  • General ML ideas
    – Features as high-dimensional vectors
    – Overfitting
    – Mistake-bound: one way of asking "Can my problem be learned?"

[Diagram: labeled data → learning algorithm → hypothesis/model h; a new example is given to h to produce a prediction]

2


SLIDE 6

Computational Learning Theory

  • The Theory of Generalization
  • Probably Approximately Correct (PAC) learning
  • Positive and negative learnability results
  • Agnostic Learning
  • Shattering and the VC dimension

6

SLIDE 7

This lecture: Computational Learning Theory

  • The Theory of Generalization
    – When can we trust the learning algorithm?
    – Errors of hypotheses
    – Batch learning

  • Probably Approximately Correct (PAC) learning
  • Positive and negative learnability results
  • Agnostic Learning
  • Shattering and the VC dimension

7

SLIDE 8

Computational Learning Theory

Are there general "laws of nature" related to learnability? We want a theory that can relate:

  – Probability of successful learning
  – Number of training examples
  – Complexity of the hypothesis space
  – Accuracy to which the target concept is approximated
  – Manner in which training examples are presented

8

SLIDE 9

Learning Conjunctions

Teacher (Nature) provides the labels (f(x)):

  – <(1,1,1,1,1,1,…,1,1), 1>
  – <(1,1,1,0,0,0,…,0,0), 0>
  – <(1,1,1,1,1,0,...0,1,1), 1>
  – <(1,0,1,1,1,0,...0,1,1), 0>
  – <(1,1,1,1,1,0,...0,0,1), 1>
  – <(1,0,1,0,0,0,...0,1,1), 0>
  – <(1,1,1,1,1,1,…,0,1), 1>
  – <(0,1,0,1,0,0,...0,1,1), 0>

9

Some random source (nature) provides training examples.

Notation: <example, label>

How good is our learning algorithm?


SLIDE 12

Learning Conjunctions

Teacher (Nature) provides the labels (f(x)):

  – <(1,1,1,1,1,1,…,1,1), 1>
  – <(1,1,1,0,0,0,…,0,0), 0>
  – <(1,1,1,1,1,0,...0,1,1), 1>
  – <(1,0,1,1,1,0,...0,1,1), 0>
  – <(1,1,1,1,1,0,...0,0,1), 1>
  – <(1,0,1,0,0,0,...0,1,1), 0>
  – <(1,1,1,1,1,1,…,0,1), 1>
  – <(0,1,0,1,0,0,...0,1,1), 0>

12

Some random source (nature) provides training examples.

Whenever the output is 1, x1 is present. For a reasonable learning algorithm (by elimination), the final hypothesis will be the conjunction of the variables that are active in every positive example.

With the given data, we only learned an approximation to the true concept. Is it good enough?

How good is our learning algorithm?
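The elimination step just described can be written down directly. Below is a minimal sketch (illustrative code, not taken from the slides): start from the conjunction of all variables and drop every variable that is 0 in some positive example; negative examples are never used.

```python
def learn_conjunction(examples, n):
    """Learn a monotone conjunction over n Boolean variables by elimination.

    examples: iterable of (x, label) pairs, where x is a length-n 0/1 sequence
    and label is 1 for positive, 0 for negative examples.
    Returns the set of variable indices kept in the final hypothesis.
    """
    kept = set(range(n))                     # start with the conjunction of all variables
    for x, label in examples:
        if label == 1:                       # elimination only looks at positive examples
            kept -= {i for i in kept if x[i] == 0}
    return kept

def predict(kept, x):
    """The learned hypothesis: predict 1 iff every kept variable is active."""
    return int(all(x[i] == 1 for i in kept))

# Toy run on examples shaped like the ones above (shortened to 6 variables).
examples = [((1, 1, 1, 1, 1, 1), 1), ((1, 1, 1, 0, 0, 0), 0), ((1, 1, 1, 1, 0, 1), 1)]
h = learn_conjunction(examples, n=6)
print(h, predict(h, (1, 1, 1, 1, 0, 1)))    # {0, 1, 2, 3, 5} 1
```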

SLIDE 13

Two Directions

  • Can analyze the probabilistic intuition
    – Never saw x1 = 0 in positive examples; maybe we'll never see it
    – And if we do, it will be with small probability, so the concepts we learn may be pretty good
  • Pretty good: in terms of performance on future data
    – PAC framework
  • Mistake-driven learning algorithms
    – Update your hypothesis only when you make mistakes
    – Define "good" in terms of how many mistakes you make before you stop

13

How good is our learning algorithm?


SLIDE 15

The mistake bound approach

  • The mistake bound model is a theoretical approach
    – We may be able to determine the number of mistakes the learning algorithm can make before converging
  • But no answer to "How many examples do you need before converging to a good hypothesis?"
  • Because the mistake-bound model makes no assumptions about the order or distribution of training examples
    – Both a strength and a weakness of the mistake bound model

15
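As a concrete illustration of the kind of guarantee this model gives (a standard result about the elimination algorithm, added here as an aside rather than taken from these slides): when learning a monotone conjunction over n Boolean variables by elimination,

  – the hypothesis always contains every literal of the target, so it never predicts 1 on a negative example;
  – each mistake, which can therefore only happen on a positive example, removes at least one of the n literals;

hence the total number of mistakes is at most n, regardless of the order in which examples arrive — but this says nothing about how many examples are needed.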

SLIDE 16

PAC learning

  • A model for batch learning
    – Train on a fixed training set
    – Then deploy it in the wild
  • How well will your learning algorithm do on future instances?

16

SLIDE 17

The setup

  • Instance Space: X, the set of examples
  • Concept Space: C, the set of possible target functions: f ∈ C is the hidden target function
    – E.g.: all n-conjunctions; all n-dimensional linear functions, …
  • Hypothesis Space: H, the set of possible hypotheses
    – This is the set that the learning algorithm explores
  • Training instances: S ⊆ X × {−1, 1}: positive and negative examples of the target concept (S is a finite subset of X)
    <x1, f(x1)>, <x2, f(x2)>, …, <xm, f(xm)>
  • What we want: A hypothesis h ∈ H such that h(x) = f(x)
    – A hypothesis h ∈ H such that h(x) = f(x) for all x ∈ S?
    – A hypothesis h ∈ H such that h(x) = f(x) for all x ∈ X?

17


SLIDE 22

The setup

  • Instance Space: X, the set of examples
  • Concept Space: C, the set of possible target functions: f ∈ C is the hidden target function
    – E.g.: all n-conjunctions; all n-dimensional linear functions, …
  • Hypothesis Space: H, the set of possible hypotheses
    – This is the set that the learning algorithm explores
  • Training instances: S ⊆ X × {−1, 1}: positive and negative examples of the target concept (S is a finite subset of X)
    – Training instances are generated by a fixed unknown probability distribution D over X
  • What we want: A hypothesis h ∈ H such that h(x) = f(x)
    – Evaluate h on subsequent examples x ∈ X drawn according to D

22
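To make the setup concrete, here is a small sketch of how such a training set arises (illustrative code with made-up choices for D and f; none of these names come from the slides): a fixed distribution D generates instances, and the hidden target f labels them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                        # number of Boolean features
target_vars = (0, 2)          # hypothetical hidden target f: x1 AND x3

def f(x):
    """Hidden target concept: a monotone conjunction over target_vars."""
    return 1 if all(x[i] == 1 for i in target_vars) else -1

def draw_instance():
    """One instance from a fixed (but, to the learner, unknown) distribution D over X."""
    return rng.binomial(1, 0.8, size=n)      # the distribution need not be uniform

def draw_training_set(m):
    """S: m labeled examples <x, f(x)> drawn i.i.d. from D."""
    return [(x, f(x)) for x in (draw_instance() for _ in range(m))]

S = draw_training_set(5)
for x, y in S:
    print(x, y)
```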


SLIDE 29

Distribution over the instance space

29

Consider a two-dimensional instance space. Not all points in the space are equally likely to exist as instances: for example, not every sequence of words is an email, and not every sequence of letters is a name. That is, there is some probability that a point in the space of instances is an instance (this can also be drawn as a contour plot). We assume that any finite set of examples is drawn i.i.d. from this distribution. We may not know what the distribution is, but we assume one exists and is fixed.

SLIDE 30

PAC Learning – Intuition

The assumption of a fixed distribution is important for two reasons:

  1. It gives us hope that what we learn on the training data will be meaningful on future examples
  2. It also gives a well-defined notion of the error of a hypothesis with respect to the target function

  • "The future will be like the past": We have seen many examples (drawn according to the distribution D)
    – Since in all the positive examples x1 was active, it is very likely that it will be active in future positive examples
    – If not, in any case, x1 is active only in a small percentage of the examples, so our error will be small

30
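One way to make this intuition precise (a sketch of the standard argument, not a statement from the slides): the conjunction learned by elimination only predicts 1 when the target f also does, so every error is a positive example that the hypothesis rejects. The error attributable to keeping x1 is therefore at most

    Pr_{x ~ D}[ f(x) = 1 and x1 = 0 ],

and if no such example appeared in a reasonably large training sample, this probability is likely to be small. The PAC framework quantifies exactly this kind of "likely" and "small".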



SLIDE 36

Error of a hypothesis

Definition: Given a distribution D over examples, the error of a hypothesis h with respect to a target concept f is

    err_D(h) = Pr_{x ~ D}[ h(x) ≠ f(x) ]

36

[Figure: the instance space X; the target concept f labels one region as +ve, a hypothesis h labels another region as +ve, and the error is the region where f and h disagree.]

SLIDE 37

Empirical error

Contrast the true error with the empirical error. For a target concept f, the empirical error of a hypothesis h on a training set S is the fraction of examples x in S for which f and h disagree, that is, h(x) ≠ f(x). It is denoted by err_S(h).

Overfitting: when the empirical error on the training set, err_S(h), is substantially lower than the true error err_D(h).

37
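The two quantities can be computed side by side. A small sketch (illustrative code; D, f, and h below are stand-ins, not definitions from the slides): err_S(h) is measured on the training set, while err_D(h) can only be approximated, for example by evaluating h on a large fresh sample from D. A hypothesis chosen because it fits S well can make err_S(h) much smaller than err_D(h) — exactly the overfitting gap.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10

def f(x):                          # hypothetical target concept
    return int(x[0] == 1 and x[2] == 1)

def h(x):                          # some hypothesis to be evaluated
    return int(x[0] == 1)

def draw(m):                       # i.i.d. instances from a fixed distribution D
    return [rng.binomial(1, 0.8, size=n) for _ in range(m)]

def error(hyp, xs):
    """Fraction of the given examples on which hyp disagrees with the target f."""
    return sum(hyp(x) != f(x) for x in xs) / len(xs)

S = draw(20)                       # training set: empirical error err_S(h)
fresh = draw(100_000)              # large fresh sample: Monte Carlo estimate of err_D(h)
print("err_S(h) ≈", error(h, S))
print("err_D(h) ≈", error(h, fresh))
```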

SLIDE 38

The goal of batch learning

To devise good learning algorithms that avoid overfitting
  – Not fooled by functions that only appear to be good because they explain the training set very well

38

SLIDE 39

Online learning vs. Batch learning

Online learning

  • No assumptions about the distribution of examples
  • Learning is a sequence of trials
    – Learner sees a single example, makes a prediction
    – If mistake, update hypothesis
  • Goal: To bound the total number of mistakes over time

Batch learning

  • Examples are drawn from a fixed (perhaps unknown) probability distribution D over the instance space
  • Learning uses a training set S, drawn i.i.d. from the distribution D
  • Goal: To find a hypothesis that has a low chance of making a mistake on a new example drawn from D

39
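The two protocols can be contrasted schematically (a sketch under the assumption of a generic learner interface; none of these function names come from the slides):

```python
def online_learning(stream, h, update):
    """Online protocol: predict on each example as it arrives; update only on mistakes."""
    mistakes = 0
    for x, y in stream:             # no assumption about where the examples come from
        if h(x) != y:
            mistakes += 1
            h = update(h, x, y)     # mistake-driven update
    return h, mistakes              # goal: bound the total number of mistakes

def batch_learning(S, fit):
    """Batch protocol: fit once on a training set S drawn i.i.d. from D, then deploy."""
    h = fit(S)                      # goal: low err_D(h) on future examples drawn from D
    return h
```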
