

SLIDE 1

Advanced Introduction to Machine Learning, CMU-10715

Vapnik–Chervonenkis Theory

Barnabás Póczos

SLIDE 2

Learning Theory

We have explored many ways of learning from data. But…

– How good is our classifier, really?
– How much data do we need to make it “good enough”?

SLIDE 3

Review of what we have learned so far

SLIDE 4

Notation

We will need these definitions, so please copy them!

This is what the learning algorithm produces.
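For later reference, this is the standard notation such results are usually stated in; the symbols below are conventional choices rather than ones read off the slide images:

  D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}  (i.i.d. training sample, labels Y_i \in \{0,1\})
  R(f) = \Pr(f(X) \neq Y)  (true risk of a classifier f)
  \hat R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{f(X_i) \neq Y_i\}  (empirical risk)
  \mathcal{F}  (the function / hypothesis class)
  \hat f_n = \arg\min_{f \in \mathcal{F}} \hat R_n(f)  (the empirical risk minimizer, i.e. what the learning algorithm produces)
  R^* = \inf_{f} R(f)  (the Bayes risk, infimum over all classifiers)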

SLIDE 5

Big Picture

Ultimate goal: drive the risk of the learned classifier down to the Bayes risk.

The gap to the Bayes risk splits into two parts: the estimation error and the approximation error.
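In the notation of Slide 4, the decomposition behind this picture is the standard identity

  R(\hat f_n) - R^* = \underbrace{R(\hat f_n) - \inf_{f \in \mathcal{F}} R(f)}_{\text{estimation error}} + \underbrace{\inf_{f \in \mathcal{F}} R(f) - R^*}_{\text{approximation error}}.

Learning theory asks when the estimation error goes to 0; the approximation error depends on how rich \mathcal{F} is.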

SLIDE 6

Big Picture: Illustration of Risks

(Figure: illustration of the true risk, its empirical upper bound, and the goal of learning.)

SLIDE 7

Learning Theory

SLIDE 8

Outline

Theorem: from Hoeffding’s inequality and a union bound, we have seen a deviation bound for a finite class of N functions.

These results are useless if N is big, or infinite (e.g. all possible hyper-planes).

Today we will see how to fix this with the shattering coefficient and VC dimension.
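A sketch of the finite-class result referred to above, assuming a 0-1 loss and |\mathcal{F}| = N:

  \Pr\Big( \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| > \varepsilon \Big) \le 2 N e^{-2 n \varepsilon^2},

equivalently, with probability at least 1 - \delta,

  \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| \le \sqrt{ \frac{\log(2N/\delta)}{2n} }.

The \log N term is what becomes useless for huge or infinite classes.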

SLIDE 9

Outline

Theorem: from Hoeffding’s inequality, we have seen a deviation bound for a fixed (or finite) set of functions.

After this fix, we can say something meaningful about this too: the function the learning algorithm produces and its true risk.

SLIDE 10

Hoeffding inequality

Theorem:
Observation!
Definition:
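One standard form of Hoeffding’s inequality, for i.i.d. random variables Z_1, …, Z_n taking values in [0, 1] (the slide’s exact formulation may differ):

  \Pr\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - \mathbb{E}[Z_1] \Big| > \varepsilon \Big) \le 2 e^{-2 n \varepsilon^2}.

Applied with Z_i = \mathbf{1}\{f(X_i) \neq Y_i\} for a single fixed f, this bounds |\hat R_n(f) - R(f)|.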

SLIDE 11

McDiarmid’s Bounded Difference Inequality
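A standard statement of the bounded difference inequality, with the usual constants (a representative form rather than a transcript of the slide): if changing the i-th argument of g changes its value by at most c_i, then for independent Z_1, …, Z_n

  \Pr\big( |g(Z_1, \dots, Z_n) - \mathbb{E}\, g(Z_1, \dots, Z_n)| > \varepsilon \big) \le 2 \exp\Big( - \frac{2 \varepsilon^2}{\sum_{i=1}^{n} c_i^2} \Big).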

It follows, for example, that Hoeffding’s inequality is the special case where g is the sample mean of variables in [0, 1], so that each c_i = 1/n.

SLIDE 12

Bounded Difference Condition

Our main goal is to bound the supremum, over the function class, of the deviation between the true risk and the empirical risk.

Lemma:

Let g denote the following function: the supremum of that deviation, viewed as a function of the sample.

Observation:  Proof:

⇒ McDiarmid can be applied for g!
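A sketch of the argument, assuming the 0-1 loss:

  g(z_1, \dots, z_n) = \sup_{f \in \mathcal{F}} \big| R(f) - \hat R_n(f) \big|, \qquad z_i = (x_i, y_i).

Changing a single sample point changes \hat R_n(f) by at most 1/n for every f, so |g(z) - g(z')| \le 1/n whenever z and z' differ in one coordinate. McDiarmid with c_i = 1/n then gives

  \Pr\big( |g - \mathbb{E} g| > \varepsilon \big) \le 2 e^{-2 n \varepsilon^2}.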

SLIDE 13

Bounded Difference Condition

Corollary: the supremum deviation concentrates around its expected value, so what remains is to bound that expectation. The Vapnik-Chervonenkis inequality does that with the shatter coefficient (and VC dimension)!

SLIDE 14

Concentration and Expected Value
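Putting the last two slides together (same assumptions as above): with probability at least 1 - \delta,

  \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| \le \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| \Big] + \sqrt{ \frac{\log(2/\delta)}{2n} },

so everything reduces to bounding the expected supremum deviation.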

SLIDE 15

Vapnik-Chervonenkis inequality

Our main goal is to bound the expected supremum deviation; we already know from McDiarmid that the supremum deviation concentrates around it.

Vapnik-Chervonenkis inequality:
Corollary:
Vapnik-Chervonenkis theorem:
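Representative statements of these results, using the shatter coefficient S_\mathcal{F}(n) defined on the next slides (constants vary across textbooks, so treat these as one common version rather than the slide’s exact constants):

  \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| \Big] \le 2 \sqrt{ \frac{2 \log\big( 2 S_{\mathcal{F}}(n) \big)}{n} }   (VC inequality),

  \Pr\Big( \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| > \varepsilon \Big) \le 8\, S_{\mathcal{F}}(n)\, e^{-n \varepsilon^2 / 32}   (VC theorem).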

SLIDE 16

Shattering

SLIDE 17

How many points can a linear boundary classify exactly in 1D?

(Figure: 2 pts vs. 3 pts on a line with +/- labels; a single threshold cannot realize the labeling + - + of three collinear points.)

There exists a placement s.t. all labelings can be classified.

The answer is 2.

SLIDE 18

How many points can a linear boundary classify exactly in 2D?

(Figure: 3 pts vs. 4 pts in the plane with +/- labels; 3 points in general position can be labeled arbitrarily by a line, but some labeling of 4 points cannot.)

There exists a placement s.t. all labelings can be classified.

The answer is 3.

SLIDE 19

How many points can a linear boundary classify exactly in 3D?

The answer is 4 (e.g. the vertices of a tetrahedron).

How many points can a linear boundary classify exactly in d dimensions?

The answer is d+1.
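Summarizing the last three slides: affine linear classifiers in R^d can shatter a suitable set of d+1 points but no set of d+2 points, so

  \mathrm{VC}\big( \text{halfspaces in } \mathbb{R}^d \big) = d + 1.

For instance, in 2D no placement of 4 points works: either one point lies inside the triangle of the other three, or the four points form a quadrilateral and the labeling that gives opposite corners the same sign is not linearly separable.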

SLIDE 20

Growth function, Shatter coefficient

Definition: the growth function (shatter coefficient) is the maximum number of behaviors on n points.

Example: behaviors (label patterns) produced on 3 points, one row per classifier:

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1
0 1 0
1 1 1

Number of different behaviors = 5 in this example.
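In symbols (notation as on Slide 4):

  S_{\mathcal{F}}(n) = \max_{x_1, \dots, x_n} \big| \{ (f(x_1), \dots, f(x_n)) : f \in \mathcal{F} \} \big| \le 2^n.

In the example above the class produces 5 distinct label patterns on that particular sample of 3 points, so its shatter coefficient at n = 3 is at least 5.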

SLIDE 21

Growth function, Shatter coefficient

Definition: the growth function (shatter coefficient) is the maximum number of behaviors on n points.

Example: half spaces in 2D.

(Figure: a few points in the plane and the +/- labelings that half planes can produce on them.)
SLIDE 22

VC-dimension

Definition: the growth function (shatter coefficient) is the maximum number of behaviors on n points.

Definition: Shattering. A set of n points is shattered if all 2^n behaviors (labelings) can be produced on it.

Definition: VC-dimension. The size of the largest point set that can be shattered, i.e. on which the number of behaviors equals 2^n.
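In symbols:

  \{x_1, \dots, x_n\} \text{ is shattered by } \mathcal{F} \iff \big| \{ (f(x_1), \dots, f(x_n)) : f \in \mathcal{F} \} \big| = 2^n,

  \mathrm{VC}(\mathcal{F}) = \max\{\, n : S_{\mathcal{F}}(n) = 2^n \,\}.

If some set of size d is shattered then every smaller subset of it is shattered too, so S_\mathcal{F}(n) = 2^n for all n \le \mathrm{VC}(\mathcal{F}).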

SLIDE 23

VC-dimension

Definition # behaviors

SLIDE 24

VC-dimension

SLIDE 25

Examples

SLIDE 26

VC dim of decision stumps (axis aligned linear separator) in 2d

What’s the VC dim. of decision stumps in 2d?

(Figure: 3 points in the plane and the +/- labelings that a single axis-aligned threshold can produce on them.)

There is a placement of 3 pts that can be shattered ⇒ VC dim ≥ 3 (see the enumeration sketch below).
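As a sanity check of that claim, here is a small Python sketch (the function names and the particular point placement are illustrative choices, not taken from the slides). It enumerates every labeling that an axis-aligned decision stump h(x) = s * sign(x_j - t) can produce on a finite point set and tests whether the set is shattered:

import numpy as np

def stump_labelings(points):
    """All distinct labelings of `points` realizable by axis-aligned decision
    stumps h(x) = s * sign(x[j] - t), with axis j, threshold t, orientation s."""
    points = np.asarray(points, dtype=float)
    labelings = set()
    for j in range(points.shape[1]):                  # axis the stump looks at
        vals = np.unique(points[:, j])
        # candidate thresholds: below all values, between consecutive values, above all
        thresholds = np.concatenate(([vals[0] - 1.0],
                                     (vals[:-1] + vals[1:]) / 2.0,
                                     [vals[-1] + 1.0]))
        for t in thresholds:
            base = tuple(np.where(points[:, j] > t, 1, -1))
            labelings.add(base)                        # orientation s = +1
            labelings.add(tuple(-v for v in base))     # orientation s = -1
    return labelings

def is_shattered(points):
    """True iff stumps realize all 2^n labelings of the given points."""
    return len(stump_labelings(points)) == 2 ** len(points)

# A placement of 3 points that decision stumps shatter (so VC dim >= 3):
print(is_shattered([(0.0, 0.0), (1.0, 2.0), (2.0, 1.0)]))   # True

On the placement shown it prints True, so VC dim ≥ 3. Along each axis the positively labeled points of a stump form a prefix or suffix of the points sorted by that coordinate, so only a few of the 16 labelings of 4 points can ever arise; no placement of 4 points is shattered, matching the case analysis on the next slide.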

SLIDE 27

VC dim of decision stumps (axis aligned linear separator) in 2d

What’s the VC dim. of decision stumps in 2d?

If VC dim = 3, then for all placements of 4 pts there exists a labeling that can’t be realized. Case analysis on the 4 points: 3 collinear, 1 in the convex hull of the other 3, or a quadrilateral.

(Figure: the impossible labeling in each case.)

SLIDE 28

VC dim. of axis parallel rectangles in 2d

What’s the VC dim. of axis parallel rectangles in 2d?

(Figure: 3 points and rectangles realizing every labeling.)

There is a placement of 3 pts that can be shattered ⇒ VC dim ≥ 3

SLIDE 29

VC dim. of axis parallel rectangles in 2d

There is a placement of 4 pts that can be shattered ⇒ VC dim ≥ 4

SLIDE 30

VC dim. of axis parallel rectangles in 2d

What’s the VC dim. of axis parallel rectangles in 2d?

If VC dim = 4, then for all placements of 5 pts there exists a labeling that can’t be realized. Case analysis on the 5 points: 4 collinear, 2 in the convex hull, 1 in the convex hull, or a pentagon.

(Figure: the impossible labeling in each case.)
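The standard argument behind all of these cases: among any 5 points, label the left-most, right-most, top-most and bottom-most points + and the remaining point -. Every axis-parallel rectangle containing the four extreme points also contains the fifth, so this labeling cannot be realized. Combined with the shatterable 4-point placement on Slide 29,

  \mathrm{VC}\big( \text{axis-parallel rectangles in } \mathbb{R}^2 \big) = 4.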

SLIDE 31

Sauer’s Lemma

The VC dimension can be used to upper bound the shattering coefficient.

Sauer’s lemma:  Corollary:

We already know the trivial bound 2^n [exponential in n]; Sauer’s lemma replaces it with a bound that is [polynomial in n].
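The statement as it is usually given (d denotes the VC dimension):

  S_{\mathcal{F}}(n) \le \sum_{i=0}^{d} \binom{n}{i}   (Sauer’s lemma),

  S_{\mathcal{F}}(n) \le \Big( \frac{e n}{d} \Big)^{d} \ \text{ for } n \ge d   (corollary: polynomial in n, versus the exponential 2^n).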

SLIDE 32

Proof of Sauer’s Lemma

Write all different behaviors on a sample (x1, x2, …, xn) in a matrix: one column per sample point, one row per behavior. In the running example (n = 3), the observed label vectors

0 0 0
0 1 0
1 1 1
1 0 0
0 1 0
1 1 1
0 1 1

reduce to the 5 different behaviors

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1

SLIDE 33

Proof of Sauer’s Lemma

We will prove that the number of rows (different behaviors) is at most the number of shattered subsets of columns, which in turn is at most the Sauer bound; therefore the lemma follows.

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1

Shattered subsets of columns: ∅, {1}, {2}, {3}, {1,2}, {1,3} (6 in this example).

SLIDE 34

Proof of Sauer’s Lemma

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1

Shattered subsets of columns: ∅, {1}, {2}, {3}, {1,2}, {1,3}.

Lemma 1: the number of shattered subsets of columns is at most C(n,0) + C(n,1) + … + C(n,d), where d is the VC dimension. In this example: 6 ≤ 1+3+3 = 7.

Lemma 2: the number of rows is at most the number of shattered subsets of columns, for any binary matrix with no repeated rows. In this example: 5 ≤ 6.

SLIDE 35

Proof of Lemma 1

Lemma 1: the number of shattered subsets of columns is at most C(n,0) + C(n,1) + … + C(n,d).

Proof: no subset of more than d columns can be shattered (that would contradict the definition of the VC dimension d), and there are only C(n,0) + … + C(n,d) subsets of size at most d.

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1

Shattered subsets of columns: ∅, {1}, {2}, {3}, {1,2}, {1,3}. In this example: 6 ≤ 1+3+3 = 7.

SLIDE 36

Proof of Lemma 2

Lemma 2: the number of rows is at most the number of shattered subsets of columns, for any binary matrix with no repeated rows.

Proof: induction on the number of columns.

Base case: A has one column. There are three cases: the single row (0), the single row (1), or both rows.
⇒ 1 ≤ 1   ⇒ 1 ≤ 1   ⇒ 2 ≤ 2

SLIDE 37

Proof of Lemma 2

Inductive case: A has at least two columns. We split the rows according to the last column and apply the induction hypothesis to the resulting matrices, which have fewer columns (one standard way to do this is sketched below).

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1
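A sketch of the standard induction step (the slides’ exact bookkeeping may differ): let B be the set of distinct rows of A with the last column deleted, and let C ⊆ B be those truncated rows that occur in A with both a 0 and a 1 in the last column. Then

  \#\mathrm{rows}(A) = \#\mathrm{rows}(B) + \#\mathrm{rows}(C).

Every column subset shattered in B is shattered in A, and if S is shattered in C then S \cup \{ \text{last column} \} is shattered in A. The first family never contains the last column and the second always does, so they are disjoint, and by the induction hypothesis

  \#\mathrm{rows}(A) \le \#\mathrm{sh}(B) + \#\mathrm{sh}(C) \le \#\mathrm{sh}(A).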

SLIDE 38

Proof of Lemma 2

…because every subset of columns shattered in the smaller matrices is also shattered (possibly together with the last column) in the original matrix.

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1

SLIDE 39

Vapnik-Chervonenkis inequality

Vapnik-Chervonenkis inequality:  [We don’t prove this]

From Sauer’s lemma the shatter coefficient is only polynomial in n. Since the bound depends on the shatter coefficient only through its logarithm, the estimation error therefore goes to 0 as n grows.
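Combining the VC inequality (as stated on Slide 15) with Sauer’s lemma:

  \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| \Big] \le 2 \sqrt{ \frac{2 \log\big( 2 S_{\mathcal{F}}(n) \big)}{n} } \le 2 \sqrt{ \frac{2 \big( d \log(e n / d) + \log 2 \big)}{n} } \to 0,

and since R(\hat f_n) - \inf_{f \in \mathcal{F}} R(f) \le 2 \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)|, the estimation error of empirical risk minimization vanishes whenever the VC dimension d is finite.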

SLIDE 40

Linear (hyperplane) classifiers

We already know the VC dimension of hyperplane classifiers, so we can plug it into the bound on the estimation error.
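A sketch of that plug-in step, with the same representative constants as before: hyperplane classifiers in R^d have VC dimension d+1, so

  \mathbb{E}\Big[ \sup_{f} |\hat R_n(f) - R(f)| \Big] \le 2 \sqrt{ \frac{2 \big( (d+1) \log\frac{e n}{d+1} + \log 2 \big)}{n} } = O\Big( \sqrt{ \tfrac{d \log n}{n} } \Big),

so the estimation error of ERM over hyperplanes is of order \sqrt{d \log n / n}.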

SLIDE 41

Vapnik-Chervonenkis Theorem

We already know from McDiarmid that the supremum deviation concentrates around its expectation.

Summary of the bounds [we don’t prove them]:
– Hoeffding + union bound for a finite function class
– Vapnik-Chervonenkis inequality (bound on the expected supremum deviation)
– Vapnik-Chervonenkis theorem and its corollary (tail bound on the supremum deviation)

SLIDE 42

PAC Bound for the Estimation Error

VC theorem:

Inversion: set the probability bound equal to δ and solve for ε; this yields a bound on the estimation error that holds with probability at least 1 - δ.
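A worked example of the inversion, starting from the tail form quoted on Slide 15 (so the constants are again representative): setting 8 S_\mathcal{F}(n) e^{-n ε^2 / 32} = δ and solving for ε gives, with probability at least 1 - δ,

  \sup_{f \in \mathcal{F}} |\hat R_n(f) - R(f)| \le \sqrt{ \frac{32 \big( \log S_{\mathcal{F}}(n) + \log(8/\delta) \big)}{n} },

and hence

  R(\hat f_n) \le \hat R_n(\hat f_n) + \sqrt{ \frac{32 \big( \log S_{\mathcal{F}}(n) + \log(8/\delta) \big)}{n} },

with the estimation error bounded by twice the same square-root term.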

SLIDE 43

Structural Risk Minimization

Ultimate goal: make the gap to the Bayes risk small, i.e. both the estimation error and the approximation error.

So far we studied when the estimation error → 0, but we also want the approximation error → 0.

Many different variants exist… the common idea is to penalize too complex models to avoid overfitting.
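A generic sketch of the idea (the penalty shown is an illustrative VC-type choice, not the slide’s specific formula): fix a nested sequence of model classes \mathcal{F}_1 \subset \mathcal{F}_2 \subset … with VC dimensions d_1 \le d_2 \le …, and select

  \hat f = \arg\min_{k} \; \min_{f \in \mathcal{F}_k} \Big[ \hat R_n(f) + \mathrm{pen}(d_k, n) \Big], \qquad \text{e.g. } \mathrm{pen}(d, n) \approx \sqrt{ \frac{d \log n}{n} }.

Larger classes shrink the approximation error, while the penalty keeps the estimation error under control.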

SLIDE 44

What you need to know

Complexity of the classifier depends on the number of points that can be classified exactly.

Finite case – number of hypotheses
Infinite case – shattering coefficient, VC dimension

PAC bounds on the true error in terms of the empirical/training error and the complexity of the hypothesis space.

Empirical and Structural Risk Minimization.

SLIDE 45

Thanks for your attention