Computational Learning Theory: Probably Approximately Correct (PAC) Learning

  1. Computational Learning Theory: Probably Approximately Correct (PAC) Learning
     Machine Learning
     Slides based on material from Dan Roth, Avrim Blum, Tom Mitchell and others

  2. Computational Learning Theory
     • The Theory of Generalization
     • Probably Approximately Correct (PAC) learning
     • Positive and negative learnability results
     • Agnostic Learning
     • Shattering and the VC dimension

  3. Where are we?
     • The Theory of Generalization
     • Probably Approximately Correct (PAC) learning
     • Positive and negative learnability results
     • Agnostic Learning
     • Shattering and the VC dimension

  4. This section
     1. Define the PAC model of learning
     2. Make formal connections to the principle of Occam’s razor

  6. Recall: The setup
     • Instance Space: 𝑋, the set of examples
     • Concept Space: 𝐶, the set of possible target functions: 𝑓 ∈ 𝐶 is the hidden target function
       – Eg: all 𝑛-conjunctions; all 𝑛-dimensional linear functions, …
     • Hypothesis Space: 𝐻, the set of possible hypotheses
       – This is the set that the learning algorithm explores
     • Training instances: 𝑆 × {−1,1}: positive and negative examples of the target concept (𝑆 is a finite subset of 𝑋)
       – Training instances are generated by a fixed unknown probability distribution 𝐷 over 𝑋
     • What we want: A hypothesis ℎ ∈ 𝐻 such that ℎ(𝑥) = 𝑓(𝑥)
       – Evaluate ℎ on subsequent examples 𝑥 ∈ 𝑋 drawn according to 𝐷
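A minimal Python sketch of this setup follows. The variable names, the choice of monotone conjunctions as the concept class, and the uniform distribution are illustrative assumptions for this sketch, not part of the slides:

```python
import random

random.seed(0)
n = 10                       # instance length: X = {0,1}^n

# Hidden target concept f, taken here from the class of monotone
# conjunctions (this one checks x[1] AND x[3] AND x[7]); the learner
# never gets to see this definition.
target_vars = {1, 3, 7}
def f(x):
    return +1 if all(x[i] == 1 for i in target_vars) else -1

# Fixed but unknown distribution D over X; uniform for this sketch.
def draw_from_D():
    return tuple(random.randint(0, 1) for _ in range(n))

# Training sample S: m examples drawn i.i.d. from D and labeled by f.
m = 50
S = [(x, f(x)) for x in (draw_from_D() for _ in range(m))]

# A learner explores some hypothesis space H and must output an h in H
# that (ideally) agrees with f on future examples drawn from the same D.
```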

  7. Formulating the theory of prediction
     All the notation we have so far on one slide. In the general case, we have
     – 𝑋: instance space, 𝑌: output space = {+1, −1}
     – 𝐷: an unknown distribution over 𝑋
     – 𝑓: an unknown target function 𝑋 → 𝑌, taken from a concept class 𝐶
     – ℎ: a hypothesis function 𝑋 → 𝑌 that the learning algorithm selects from a hypothesis class 𝐻
     – 𝑆: a set of 𝑚 training examples drawn from 𝐷, labeled with 𝑓
     – err_𝐷(ℎ): the true error of any hypothesis ℎ
     – err_𝑆(ℎ): the empirical error or training error or observed error of ℎ
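Continuing the sketch above (again an illustration, with my own function names, not the slides' definitions), the two error measures can be written directly: err_S(h) is the fraction of training mistakes, while err_D(h) = Pr_D[f(x) ≠ h(x)] can only be estimated, because D and f are unknown to the learner:

```python
def empirical_error(h, S):
    """err_S(h): the observed error of h on the training sample S."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def true_error(h, f, draw_from_D, trials=100_000):
    """Monte Carlo estimate of err_D(h) = Pr_D[f(x) != h(x)].
    Only an estimate is possible in practice, since D is unknown."""
    mistakes = 0
    for _ in range(trials):
        x = draw_from_D()
        if h(x) != f(x):
            mistakes += 1
    return mistakes / trials

# Example with a trivial hypothesis that always predicts -1,
# using f, S, and draw_from_D from the sketch above:
h_const = lambda x: -1
print(empirical_error(h_const, S))          # err_S(h)
print(true_error(h_const, f, draw_from_D))  # estimate of err_D(h)
```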

  8. Theoretical questions
     • Can we describe or bound the true error (err_𝐷) given the empirical error (err_𝑆)?
     • Is a concept class 𝐶 learnable?
     • Is it possible to learn 𝐶 using only the functions in 𝐻 using the supervised protocol?
     • How many examples does an algorithm need to guarantee good performance?

  9. Expectations of learning
     • We cannot expect a learner to learn a concept exactly
       – There will generally be multiple concepts consistent with the available data (which represent a small fraction of the available instance space)
       – Unseen examples could potentially have any label
       – Let’s “agree” to misclassify uncommon examples that do not show up in the training set
     • We cannot always expect to learn a close approximation to the target concept
       – Sometimes (hopefully only rarely) the training set will not be representative (will contain uncommon examples)
     The only realistic expectation of a good learner is that with high probability it will learn a close approximation to the target concept

  12. Probably approximately correct
     The only realistic expectation of a good learner is that with high probability it will learn a close approximation to the target concept
     • In Probably Approximately Correct (PAC) learning, one requires that
       – given small parameters 𝜖 and 𝛿,
       – with probability at least 1 − 𝛿, a learner produces a hypothesis with error at most 𝜖
     • The only reason we can hope for this is the consistent distribution assumption
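For instance (with purely illustrative values), setting 𝜖 = 0.1 and 𝛿 = 0.05 requires that, for at least 95% of the training samples that could be drawn from 𝐷, the learner outputs a hypothesis that misclassifies at most 10% of future examples drawn from the same 𝐷; on the remaining, unrepresentative samples it is allowed to do worse.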

  15. PAC Learnability
     Consider a concept class 𝐶 defined over an instance space 𝑋 (containing instances of length 𝑛), and a learner 𝐿 using a hypothesis space 𝐻.
     The concept class 𝐶 is PAC learnable by 𝐿 using 𝐻 if for all 𝑓 ∈ 𝐶, for all distributions 𝐷 over 𝑋, and fixed 0 < 𝜖, 𝛿 < 1: given 𝑚 examples sampled independently according to 𝐷, with probability at least (1 − 𝛿), the algorithm 𝐿 produces a hypothesis ℎ ∈ 𝐻 that has error at most 𝜖, where 𝑚 is polynomial in 1/𝜖, 1/𝛿, 𝑛 and size(𝐻).
     (Recall that err_𝐷(ℎ) = Pr_𝐷[𝑓(𝑥) ≠ ℎ(𝑥)].)
     The concept class 𝐶 is efficiently learnable if 𝐿 can produce the hypothesis in time that is polynomial in 1/𝜖, 1/𝛿, 𝑛 and size(𝐻).
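To make the quantifiers concrete, here is a small simulation sketch that reuses n, f, draw_from_D, and true_error from the sketches above. The hypothesis class, the simple consistent-search learner, and all numeric parameters below are illustrative assumptions of mine, not an algorithm from these slides; the point is only to show what "with probability at least 1 − 𝛿, error at most 𝜖" means operationally:

```python
from itertools import combinations

# An illustrative finite hypothesis space H: all monotone conjunctions
# over at most 3 of the n variables (the empty conjunction predicts +1).
def make_conjunction(vars_):
    return lambda x: +1 if all(x[i] == 1 for i in vars_) else -1

H = [make_conjunction(vs)
     for k in range(0, 4)
     for vs in combinations(range(n), k)]

def learner_L(sample):
    """A toy learner: return some h in H consistent with the sample
    (falling back to the first hypothesis if none is consistent)."""
    for h in H:
        if all(h(x) == y for x, y in sample):
            return h
    return H[0]

# Illustrative parameters; in the definition, m must be allowed to grow
# polynomially in 1/epsilon, 1/delta, n and size(H).
epsilon, delta, m = 0.1, 0.05, 100

# Empirical stand-in for the guarantee: over many independent draws of
# the training sample, the fraction of runs with err_D(h) <= epsilon
# should be at least 1 - delta.
runs, good = 200, 0
for _ in range(runs):
    sample = [(x, f(x)) for x in (draw_from_D() for _ in range(m))]
    h = learner_L(sample)
    if true_error(h, f, draw_from_D, trials=2000) <= epsilon:
        good += 1

print(good / runs, "should be at least", 1 - delta)
```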
