
Learning Theory - CE-717: Machine Learning, Sharif University of Technology (presentation transcript)



  1. Learning Theory CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016

  2. Topics
     - Feasibility of learning
     - PAC learning
     - VC dimension
     - Structural Risk Minimization (SRM)

  3. Feasibility of learning
     - Does the training set 𝒟 tell us anything outside of 𝒟?
     - 𝒟 does not tell us anything certain about g outside of 𝒟.
     - However, it can tell us something likely about g outside of 𝒟.
     - Probability is what makes a theory of learning possible.

  4. Feasibility of learning
     Learning reduces to two questions:
     - Can we make sure that E_true(g) is close to E_train(g)?
     - Can we make E_train(g) small enough?

  5. Generalizability of learning
     - Generalization error is what we ultimately care about.
     - Why should doing well on the training set tell us anything about generalization error?
     - Can we relate the error on the training set to the generalization error?
     - Under what conditions can we actually prove that learning algorithms will work well?

  6. A related example: a bin of marbles
     - Pr[picking a red marble] = μ and Pr[picking a green marble] = 1 − μ.
     - The value of μ is unknown to us.
     - We pick N marbles independently.
     - The fraction of red marbles in the sample is ν.

  7. Does ν say anything about μ?
     - No: the sample can be mostly green while the bin is mostly red.
     - Yes: the sample frequency ν is likely to be close to the bin frequency μ.

  8. What does ν say about μ?
     - In a big sample (large N), ν is probably close to μ (within ε):
       Pr[|ν − μ| > ε] ≤ 2e^(−2ε²N)   (Hoeffding's inequality)
     - Valid for all N and ε.
     - The bound does not depend on μ.
     - Tradeoff between N, ε, and the bound.
     - In other words, "ν = μ" is Probably Approximately Correct (PAC).
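
A minimal Monte Carlo sketch (not from the slides) of the bin experiment: it assumes a true red-marble fraction μ = 0.6, sample size N = 100, and tolerance ε = 0.1, and compares the observed deviation frequency with the Hoeffding bound 2e^(−2ε²N).

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 0.6          # true (unknown) fraction of red marbles; assumed for the simulation
N = 100           # number of marbles drawn per sample
eps = 0.1         # tolerance epsilon
trials = 100_000  # number of repeated experiments

# Draw `trials` samples of N marbles each and measure how often |nu - mu| > eps.
samples = rng.random((trials, N)) < mu          # True = red marble
nu = samples.mean(axis=1)                       # sample frequency nu in each trial
empirical = np.mean(np.abs(nu - mu) > eps)      # observed deviation probability
hoeffding = 2 * np.exp(-2 * eps**2 * N)         # bound 2 e^(-2 eps^2 N)

print(f"empirical Pr[|nu - mu| > eps] = {empirical:.4f}")
print(f"Hoeffding bound               = {hoeffding:.4f}")
```

The empirical probability comes out well below the bound, as expected: Hoeffding is distribution-free and therefore not tight for any particular μ.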

  9. Recall: Learning diagram
     - An unknown target c is applied to instances x^(1), …, x^(N), producing the training examples (x^(1), y^(1)), …, (x^(N), y^(N)); the goal is a final hypothesis g ≈ c.
     - We assume that some random process proposes instances and a teacher labels them (i.e., instances are drawn i.i.d. from a distribution P(x)).
     [Y. S. Abu-Mostafa et al., "Learning From Data", 2012]

  10. Learning: Problem settings
     - Set of all instances 𝒳.
     - Set of hypotheses ℋ.
     - Set of possible target functions C = {c : 𝒳 → 𝒴}.
     - Sequence of N training instances 𝒟 = {(x^(n), c(x^(n)))}, n = 1, …, N:
       - each x is drawn at random from an unknown distribution P(x);
       - the teacher provides the noise-free label c(x) for it.
     - The learner observes the training examples 𝒟 for the target function c and outputs a hypothesis h ∈ ℋ estimating c.

  11. Connection of the Hoeffding inequality to learning
     - In the bin example, the unknown is a number μ.
     - In the learning problem, the unknown is a function c : 𝒳 → 𝒴.

  12. Two notions of error
     - Training error of h: how often h(x) ≠ c(x) on the training instances:
       E_train(h) ≡ E_{x~𝒟}[ I(h(x) ≠ c(x)) ] = (1/|𝒟|) Σ_{x∈𝒟} I(h(x) ≠ c(x))
     - True (test) error of h: how often h(x) ≠ c(x) on future instances drawn at random from P(x):
       E_true(h) ≡ E_{x~P(x)}[ I(h(x) ≠ c(x)) ]
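
As a small illustration of the two error measures (the toy target, hypothesis, and distribution below are assumptions made for this example, not part of the course), the sketch draws x ~ Uniform(0, 1), uses the noise-free target c(x) = 1[x > 0.5] and the hypothesis h(x) = 1[x > 0.4], computes E_train on a 20-point training set, and estimates E_true from a large fresh sample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup: instances x ~ Uniform(0, 1), noise-free target c, hypothesis h.
c = lambda x: (x > 0.5).astype(int)   # target labels
h = lambda x: (x > 0.4).astype(int)   # a hypothesis that disagrees with c on (0.4, 0.5]

# Training error: average indicator of h(x) != c(x) over the training set D.
D = rng.random(20)
E_train = np.mean(h(D) != c(D))

# True error: expectation of the same indicator under P(x), estimated here by
# Monte Carlo with a large fresh sample (analytically it equals 0.1).
x_fresh = rng.random(1_000_000)
E_true = np.mean(h(x_fresh) != c(x_fresh))

print(f"E_train(h) = {E_train:.3f}   E_true(h) ≈ {E_true:.3f}")
```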

  13. Notation for learning
     - Both μ and ν depend on which hypothesis h is considered:
       - ν is the "in-sample" error, denoted E_train(h);
       - μ is the "out-of-sample" error, denoted E_true(h).
     - The Hoeffding inequality becomes:
       Pr[|E_train(h) − E_true(h)| > ε] ≤ 2e^(−2ε²N)

  14. Are we done?
     - We cannot use this bound for the hypothesis g learned from the data.
     - Indeed, h is assumed fixed in this inequality, and for such a fixed h, E_train(h) generalizes to E_true(h).
     - This is "verification" of h, not learning.
     - In learning we must choose among multiple h's: g is not fixed in advance but is selected according to the samples.

  15. Hypothesis space as multiple bins
     - Generalizing the bin model to more than one hypothesis: each hypothesis h_i gets its own bin, with its own E_train(h_i) and E_true(h_i).

  16. Hypothesis space: Coin example
     - Question: if you toss a fair coin 10 times, what is the probability that it comes up heads all 10 times?
     - Answer: (1/2)^10 ≈ 0.1%
     - Question: if you toss 1000 fair coins 10 times each, what is the probability that some coin comes up heads all 10 times?
     - Answer: 1 − (1 − (1/2)^10)^1000 ≈ 62%
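
The two answers can be checked directly; this short computation reproduces them (about 0.1% and 62%).

```python
# Probabilities from the coin example, computed exactly.
p_single = 0.5 ** 10                        # one coin: 10 heads in 10 tosses
p_some = 1 - (1 - p_single) ** 1000         # at least one of 1000 coins gets 10 heads

print(f"one fair coin, 10 heads:      {p_single:.4%}")   # ~0.0977%
print(f"some coin out of 1000 coins:  {p_some:.1%}")     # ~62.4%
```

The second probability is large even though each individual coin is very unlikely to show 10 heads; this is exactly the effect that forces the union bound over ℋ on the next slide.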

  17. A bound for the learning problem: using the Hoeffding inequality
     Pr[|E_true(g) − E_train(g)| > ε]
       ≤ Pr[ |E_true(h_1) − E_train(h_1)| > ε  or  |E_true(h_2) − E_train(h_2)| > ε  or  …  or  |E_true(h_M) − E_train(h_M)| > ε ]
       ≤ Σ_{i=1}^{M} Pr[|E_true(h_i) − E_train(h_i)| > ε]
       ≤ Σ_{i=1}^{M} 2e^(−2ε²N)
       = 2|ℋ|e^(−2ε²N),   where |ℋ| = M
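
A quick numeric sketch of the final bound 2Me^(−2ε²N); the values of M, N, and ε below are illustrative choices, not numbers from the lecture.

```python
import numpy as np

# Union-of-Hoeffding bound for a finite hypothesis space with |H| = M.
def union_bound(M, N, eps):
    return 2 * M * np.exp(-2 * eps**2 * N)

for N in (100, 1_000, 10_000):
    print(f"N={N:6d}  M=1000  eps=0.05  bound={union_bound(1000, N, 0.05):.3g}")
```

For small N the bound exceeds 1 and is vacuous; it becomes informative only once N is large relative to ln M and 1/ε².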

  18. PAC bound: using the Hoeffding inequality
     Pr[|E_true(h) − E_train(h)| > ε] ≤ 2|ℋ|e^(−2ε²N) = δ
     ⇒ Pr[|E_true(h) − E_train(h)| ≤ ε] ≥ 1 − δ
     - With probability at least 1 − δ, every h ∈ ℋ satisfies
       E_true(h) < E_train(h) + sqrt( (ln 2|ℋ| + ln(1/δ)) / (2N) )
     - Thus we can bound E_true(h) − E_train(h), which quantifies the amount of overfitting.
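
The generalization-gap term sqrt((ln 2|ℋ| + ln(1/δ)) / (2N)) can be tabulated for a few sample sizes; |ℋ|, δ, and the N values below are illustrative assumptions.

```python
import numpy as np

# Gap bound: with probability >= 1 - delta,
# E_true(h) < E_train(h) + sqrt((ln(2|H|) + ln(1/delta)) / (2N)) for every h in H.
def gap_bound(H_size, N, delta):
    return np.sqrt((np.log(2 * H_size) + np.log(1 / delta)) / (2 * N))

for N in (100, 1_000, 10_000):
    print(f"N={N:6d}  |H|=1000  delta=0.05  gap <= {gap_bound(1000, N, 0.05):.3f}")
```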

  19. Sample complexity
     - How many training examples suffice?
     - Given ε and δ, the bound yields the sample complexity:
       N ≥ (1 / (2ε²)) ( ln 2|ℋ| + ln(1/δ) )
     - Thus we have a theory that relates:
       - the number of training examples,
       - the complexity of the hypothesis space,
       - the accuracy to which the target function is approximated,
       - the probability that the learner outputs a successful hypothesis.
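
A one-line helper that evaluates this sample-complexity formula; the ε, δ, and |ℋ| used in the example call are assumed values for illustration.

```python
import numpy as np

# N >= (1 / (2 eps^2)) * (ln(2|H|) + ln(1/delta))
def sample_complexity(H_size, eps, delta):
    return int(np.ceil((np.log(2 * H_size) + np.log(1 / delta)) / (2 * eps**2)))

print(sample_complexity(H_size=1000, eps=0.1, delta=0.05))   # about 530 examples
```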

  20. Another problem setting
     - A finite set of possible hypotheses (e.g., decision trees of depth d₀).
     - The learner finds a hypothesis h that is consistent with the training data, i.e., E_train(h) = 0.
     - What is the probability that the true error of h is more than ε, i.e., that E_true(h) ≥ ε?

  21. True error of a hypothesis
     - True error of h: the probability that it misclassifies an example drawn at random from P(x):
       E_true(h) ≡ E_{x~P(x)}[ I(h(x) ≠ c(x)) ]
     [Figure: instance space with the target c(x) and the region where h disagrees with it]

  22. How likely is a consistent learner to pick a bad hypothesis?
     - We bound the probability that any consistent learner outputs an h with E_true(h) > ε.
     - Theorem [Haussler, 1988]: for a target concept c and any 0 ≤ ε ≤ 1, if ℋ is finite and 𝒟 contains N ≥ 1 independent random samples, then
       Pr[ ∃h ∈ ℋ : E_train(h) = 0 ∧ E_true(h) > ε ] ≤ |ℋ| e^(−εN)

  23. Haussler bound: Proof
     - What does the theorem mean?
       Pr[ ∃h ∈ ℋ : E_train(h) = 0 ∧ E_true(h) > ε ] ≤ |ℋ| e^(−εN)
     - For a fixed h, how likely is a bad hypothesis (i.e., one with E_true(h) > ε) to label N training data points correctly?
       - Pr[h labels one data point correctly | E_true(h) > ε] ≤ 1 − ε
       - Pr[h labels N i.i.d. data points correctly | E_true(h) > ε] ≤ (1 − ε)^N

  24. Haussler bound: Proof (cont'd)
     - There may be many bad hypotheses h_1, …, h_k (i.e., E_true(h_1) > ε, …, E_true(h_k) > ε) that are consistent with the N training data points:
       E_train(h_1) = 0, E_train(h_2) = 0, …, E_train(h_k) = 0
     - How likely is the learner to pick a bad hypothesis (E_true(h) > ε) among the consistent ones {h_1, …, h_k}?
       Pr[ ∃h ∈ ℋ : E_true(h) > ε ∧ E_train(h) = 0 ]
         = Pr[ (E_true(h_1) > ε ∧ E_train(h_1) = 0) or … or (E_true(h_k) > ε ∧ E_train(h_k) = 0) ]
         ≤ Σ_{i=1}^{k} Pr[ E_train(h_i) = 0 ∧ E_true(h_i) > ε ]        [union bound: P(A ∪ B) ≤ P(A) + P(B)]
         ≤ Σ_{i=1}^{k} Pr[ E_train(h_i) = 0 | E_true(h_i) > ε ] ≤ Σ_{i=1}^{k} (1 − ε)^N
         ≤ |ℋ| (1 − ε)^N                                               [k ≤ |ℋ|]
         ≤ |ℋ| e^(−εN)                                                 [1 − ε ≤ e^(−ε) for 0 ≤ ε ≤ 1]

  25. Haussler PAC bound
     - Theorem [Haussler, 1988]: consider a finite hypothesis space ℋ and a training set 𝒟 of N i.i.d. samples; for any 0 < ε < 1:
       Pr[ ∃h ∈ ℋ : E_train(h) = 0 ∧ E_true(h) > ε ] ≤ |ℋ| e^(−εN) ≤ δ
     - Suppose we want this probability to be at most δ.
     - Then, for any learned hypothesis h ∈ ℋ that is consistent with the training set 𝒟 (i.e., E_train(h) = 0), with probability at least 1 − δ:
       E_true(h) ≤ ε
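
Solving |ℋ|e^(−εN) ≤ δ for ε gives the explicit form of this guarantee: with probability at least 1 − δ, any consistent h satisfies E_true(h) ≤ (ln|ℋ| + ln(1/δ)) / N. The sketch below evaluates that bound; |ℋ|, δ, and the N values are illustrative assumptions.

```python
import numpy as np

# Haussler bound for a consistent learner over a finite hypothesis space:
# with probability >= 1 - delta, E_true(h) <= (ln|H| + ln(1/delta)) / N
# for every h in H with E_train(h) = 0.
def haussler_eps(H_size, N, delta):
    return (np.log(H_size) + np.log(1 / delta)) / N

for N in (100, 1_000, 10_000):
    print(f"N={N:6d}  |H|=1000  delta=0.05  E_true <= {haussler_eps(1000, N, 0.05):.4f}")
```

Note the 1/N rate in this consistent (realizable) case, compared with the 1/√N rate of the general Hoeffding-based bound on slide 18.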
