PAC-learning, VC Dimension and Margin-based Bounds


SLIDE 1

  • PAC-learning, VC Dimension and Margin-based Bounds

Machine Learning – 10701/15781
Carlos Guestrin, Carnegie Mellon University
March 6th, 2006

More details:
  • General: http://www.learning-with-kernels.org/
  • Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz

SLIDE 2

  • Announcements 1

Midterm on Wednesday:
  • Open book: texts, notes, …
  • No laptops
  • Bring a calculator

SLIDE 3

  • Announcements 2
  • Final project details are out!!!

http://www.cs.cmu.edu/~guestrin/Class/10701/projects.html
Great opportunity to apply ideas from class and learn more. Example project:

  • Take a dataset
  • Define learning task
  • Apply learning algorithms
  • Design your own extension
  • Evaluate your ideas

Many suggestions are on the webpage, but you can also do your own.

  • Boring stuff:

  • Individually or in groups of two students
  • It’s worth 20% of your final grade
  • You need to submit a one-page proposal on Wed. 3/22 (just after the break)
  • A 5-page initial write-up (milestone) is due on 4/12 (20% of project grade)
  • An 8-page final write-up is due 5/8 (60% of the grade)
  • A poster session for all students will be held on Friday 5/5, 2–5pm, in the NSH atrium (20% of the grade)

You can use late days on write-ups; each student in a team will be charged a late day per day.

  • MOST IMPORTANT:
SLIDE 4

  • What now…

We have explored many ways of learning from data.

But…
  • How good is our classifier, really?
  • How much data do I need to make it “good enough”?

SLIDE 5

  • How likely is the learner to pick a bad hypothesis?

Prob. that an h with error_true(h) ≥ ε gets m data points right:

P(h consistent with m i.i.d. points) ≤ (1 − ε)^m

There are k hypotheses consistent with the data. How likely is the learner to pick a bad one?

SLIDE 6

  • Union bound

P(A or B or C or D or …) ≤ P(A) + P(B) + P(C) + P(D) + …

SLIDE 7

  • How likely is the learner to pick a bad hypothesis?

By the union bound over the k hypotheses consistent with the data, the probability that some h with error_true(h) ≥ ε gets all m data points right is at most

k (1 − ε)^m ≤ |H| (1 − ε)^m ≤ |H| e^(−mε)

SLIDE 8

  • Review: Generalization error in finite hypothesis spaces [Haussler ’88]

Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h that is consistent with the training data,

P(error_true(h) > ε) ≤ |H| e^(−mε)

SLIDE 9

  • Using a PAC bound

Typically, 2 use cases:
  • Case 1: pick ε and δ, solve for m: m ≥ (ln |H| + ln (1/δ)) / ε
  • Case 2: pick m and δ, solve for ε: ε ≥ (ln |H| + ln (1/δ)) / m

(Both cases are worked in the code sketch below.)
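A minimal sketch of both use cases (helper names are our own), assuming we set the Haussler bound |H| e^(−mε) ≤ δ and solve:

```python
import math

def sample_complexity(h_size: float, eps: float, delta: float) -> int:
    """Case 1: smallest m with |H| * exp(-m * eps) <= delta."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

def error_level(h_size: float, m: int, delta: float) -> float:
    """Case 2: eps guaranteed with probability >= 1 - delta, given m."""
    return (math.log(h_size) + math.log(1.0 / delta)) / m

print(sample_complexity(2**10, eps=0.1, delta=0.05))  # -> 100 samples
print(error_level(2**10, m=1000, delta=0.05))         # -> eps ≈ 0.0099
```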

SLIDE 10

  • Review: Generalization error in finite hypothesis spaces [Haussler ’88]

Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h that is consistent with the training data,

P(error_true(h) > ε) ≤ |H| e^(−mε)

Even if h makes zero errors on the training data, it may make errors at test time.

SLIDE 11

  • Limitations of the Haussler ’88 bound

  • Assumes a consistent classifier (zero training error)
  • Depends on the size of the hypothesis space

SLIDE 12

  • Simpler question: What’s the expected error of a hypothesis?

The error of a hypothesis is like estimating the parameter of a coin!

Chernoff bound: for m i.i.d. coin flips, x1,…,xm, where xi ∈ {0,1} and E[xi] = θ, for 0 < ε < 1:

P(θ − (1/m) Σi xi > ε) ≤ e^(−2mε²)
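A quick empirical check of this bound, as a sketch with arbitrary parameters of our own choosing (θ = 0.3, m = 100, ε = 0.1):

```python
import math, random

theta, m, eps, trials = 0.3, 100, 0.1, 100_000
hits = 0
for _ in range(trials):
    mean = sum(random.random() < theta for _ in range(m)) / m
    if theta - mean > eps:               # the one-sided deviation in the bound
        hits += 1

print("observed:", hits / trials)                # ≈ 0.01 in practice
print("bound:   ", math.exp(-2 * m * eps**2))    # e^(-2) ≈ 0.135 -- loose!
```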

SLIDE 13

  • But we are comparing many hypotheses: Union bound

For each hypothesis hi:

P(error_true(hi) − error_train(hi) > ε) ≤ e^(−2mε²)

What if I am comparing two hypotheses, h1 and h2?

SLIDE 14

  • Generalization bound for |H| hypotheses

Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h,

P(error_true(h) − error_train(h) > ε) ≤ |H| e^(−2mε²)

SLIDE 15

  • PAC bound and Bias-Variance tradeoff

Important: the PAC bound holds for all h, but doesn’t guarantee that the algorithm finds the best h!!!

Or, after moving some terms around, with probability at least 1−δ:

error_true(h) ≤ error_train(h) + √((ln |H| + ln (1/δ)) / (2m))
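The two terms trade off: a richer H can drive error_train down, but inflates the second term. A small sketch of that complexity term (function name is ours):

```python
import math

def pac_gap(log2_H: float, m: int, delta: float = 0.05) -> float:
    """sqrt((ln|H| + ln(1/delta)) / (2m)), with |H| given as log2|H|."""
    return math.sqrt((log2_H * math.log(2) + math.log(1 / delta)) / (2 * m))

for log2_H in (10, 100, 1000):   # richer hypothesis space -> larger gap term
    print(log2_H, round(pac_gap(log2_H, m=10_000), 4))
# 10 -> 0.0223, 100 -> 0.0601, 1000 -> 0.1866
```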

SLIDE 16

  • What about the size of the hypothesis space?

How large is the hypothesis space?

SLIDE 17

  • Boolean formulas with n binary features

There are 2^n possible inputs, and each can be labeled + or −, so there are 2^(2^n) distinct Boolean functions: |H| = 2^(2^n).
SLIDE 18

  • Number of decision trees of depth k

Recursive solution. Given n attributes, let Hk = number of decision trees of depth k:
  • H0 = 2
  • Hk+1 = (#choices of root attribute) × (#possible left subtrees) × (#possible right subtrees) = n · Hk · Hk

Write Lk = log2 Hk:
  • L0 = 1
  • Lk+1 = log2 n + 2 Lk
  • So Lk = (2^k − 1)(1 + log2 n) + 1 – checked numerically in the sketch below
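A small check, as a sketch with an arbitrary choice of n, that the closed form matches the recursion:

```python
import math

n = 5          # number of attributes (arbitrary)
H = 2          # H0: a depth-0 tree is a single leaf with 2 possible labels
for k in range(1, 6):
    H = n * H * H                                   # Hk+1 = n * Hk * Hk
    closed = (2**k - 1) * (1 + math.log2(n)) + 1    # Lk from the slide
    assert math.isclose(math.log2(H), closed)
print("closed form matches the recursion")
```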

SLIDE 19

  • PAC bound for decision trees of depth k

Plugging ln |H| = Lk ln 2 = (2^k − 1)(ln 2 + ln n) + ln 2 into the PAC bound:

m ≥ ((2^k − 1)(ln 2 + ln n) + ln 2 + ln (1/δ)) / ε

Bad!!! The number of points needed is exponential in the depth!

But, for m data points, the decision tree can’t get too big…
The number of leaves is never more than the number of data points.

SLIDE 20

  • Number of decision trees with k leaves

Hk = number of decision trees with k leaves; H0 = 2.

Loose bound: a tree with k leaves has k − 1 internal nodes, each testing one of n attributes, and each leaf takes one of 2 labels, so Hk ≤ 4^(k−1) · n^(k−1) · 2^k, i.e. log2 Hk = O(k log n) – linear in k, not exponential.

Reminder: m ≥ (ln |H| + ln (1/δ)) / ε

SLIDE 21

  • PAC bound for decision trees with k leaves – Bias-Variance revisited

With log2 Hk linear in k, with probability at least 1−δ:

error_true(h) ≤ error_train(h) + √((O(k log n) + ln (1/δ)) / (2m))

Larger k: lower training error (bias), larger complexity term (variance); smaller k: the reverse.

SLIDE 22

  • What did we learn from decision trees?

Bias-Variance tradeoff formalized. Moral of the story:

Complexity of learning is measured not by the size of the hypothesis space, but by the maximum number of points that allows consistent classification.

  • Complexity m – no bias, lots of variance
  • Lower than m – some bias, less variance

SLIDE 23

  • What about continuous hypothesis spaces?

Continuous hypothesis space:
  • |H| = ∞
  • Infinite variance???

As with decision trees, we only care about the maximum number of points that can be classified exactly!

SLIDE 24

  • How many points can a linear boundary classify exactly? (1-D)

2 points: a threshold shatters any 2 distinct points, but not 3 – the +, −, + labeling of three collinear points cannot be realized.

SLIDE 25

  • How many points can a linear boundary classify exactly? (2-D)

3 points: a line shatters any 3 points in general position, but no set of 4 – e.g., the XOR labeling cannot be realized.

SLIDE 26

  • How many points can a linear boundary classify exactly? (d-D)

d+1 points.
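These claims are easy to check by brute force: a sketch (helper names are our own) that tests every labeling of a point set with a linear-program feasibility problem, y_i(w·x_i + b) ≥ 1, using scipy:

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Feasible iff some (w, b) satisfies y_i * (w . x_i + b) >= 1 for all i."""
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])   # rows: -(y_i)[x_i, 1]
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0          # 0: a feasible (w, b) was found

def shattered(X):
    """True iff linear classifiers realize all 2^n labelings of X."""
    return all(linearly_separable(X, np.array(y, dtype=float))
               for y in product((-1, 1), repeat=len(X)))

three = np.array([[0., 0.], [1., 0.], [0., 1.]])
four  = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
print(shattered(three))   # True: 3 points in general position (2-D)
print(shattered(four))    # False: the XOR labeling is not separable
```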

SLIDE 27

  • Shattering a set of points

A set of points is shattered by hypothesis space H if, for every possible labeling of the points, there exists an h ∈ H consistent with that labeling.
SLIDE 28

  • VC dimension

The VC dimension of hypothesis space H is the size of the largest set of points that H can shatter.
SLIDE 29

  • PAC bound using VC dimension

The number of training points that can be classified exactly is the VC dimension!!!
It measures the relevant size of the hypothesis space, as with decision trees with k leaves.

Bound for infinite hypothesis spaces: with probability at least 1−δ,

error_true(h) ≤ error_train(h) + √((VC(H)·(ln (2m / VC(H)) + 1) + ln (4/δ)) / m)
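The same bound as a one-line helper (a sketch; the name is ours):

```python
import math

def vc_gap(vc: int, m: int, delta: float = 0.05) -> float:
    """sqrt((VC(H) * (ln(2m/VC(H)) + 1) + ln(4/delta)) / m)."""
    return math.sqrt((vc * (math.log(2 * m / vc) + 1) + math.log(4 / delta)) / m)

for m in (10_000, 100_000, 1_000_000):
    print(m, round(vc_gap(vc=21, m=m), 3))   # e.g. linear classifier, d = 20
# 10000 -> 0.13, 100000 -> 0.047, 1000000 -> 0.016
```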

SLIDE 30

  • Examples of VC dimension

Linear classifiers:
  • VC(H) = d+1, for d features plus constant term b

Neural networks:
  • VC(H) = #parameters
  • Local minima mean NNs will probably not find the best parameters

1-Nearest neighbor? (It can shatter any set of distinct points, so VC(H) = ∞.)

SLIDE 31

  • Another VC dim. example

What’s the VC dim. of decision stumps in 2-D? (It is 3: three suitably placed points can be shattered, and stumps are a subset of 2-D linear classifiers, which cannot shatter 4.)

SLIDE 32

  • PAC bound for SVMs

SVMs use a linear classifier. For d features, VC(H) = d+1:

error_true(h) ≤ error_train(h) + √(((d+1)·(ln (2m / (d+1)) + 1) + ln (4/δ)) / m)

SLIDE 33

  • VC dimension and SVMs: Problems!!!

What about kernels?
  • Polynomials: the number of features grows really fast = bad bound (n input features, degree-p polynomial ⇒ on the order of n^p features)
  • Gaussian kernels can classify any set of points exactly ⇒ VC(H) = ∞

Doesn’t take the margin into account.

SLIDE 34

  • Margin-based VC dimension

H: class of linear classifiers w·Φ(x) (b = 0)
Canonical form: minj |w·Φ(xj)| = 1

VC(H) ≤ R² w·w

  • Doesn’t depend on the number of features!!!
  • R² = maxj Φ(xj)·Φ(xj) – magnitude of the data
  • R² is bounded even for Gaussian kernels ⇒ bounded VC dimension

Large margin ⇒ low w·w ⇒ low VC dimension – Very cool!
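To make the quantities concrete, a sketch that fits a (nearly) hard-margin linear SVM with scikit-learn – our tooling choice, not the lecture’s – and evaluates R² w·w:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)    # linearly separable labels

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ≈ hard margin
w = clf.coef_.ravel()                         # learned weight vector

R2 = np.max(np.sum(X**2, axis=1))             # R^2 = max_j x_j . x_j
print("R^2 * w.w =", R2 * (w @ w))            # margin-based capacity estimate
```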

SLIDE 35

  • Applying margin VC to SVMs?

VC(H) ≤ R² w·w
R² = maxj Φ(xj)·Φ(xj) – magnitude of the data; doesn’t depend on the choice of w

SVMs minimize w·w – so do SVMs minimize the VC dimension to get the best bound? Not quite right:
  • The bound assumes the VC dimension is chosen before looking at the data
  • Would require a union bound over an infinite number of possible VC dimensions…

But, it can be fixed!

SLIDE 36

  • Structural risk minimization theorem

For a family of hyperplanes with margin γ > 0 (w·w ≤ 1): with probability at least 1−δ, the true error is bounded by the training error plus a complexity term that grows with R²/γ² and shrinks with m.

SVMs maximize margin γ + hinge loss:
  • Optimize the tradeoff between training error (bias) and margin γ (variance)

SLIDE 37

  • Reality check – Bounds are loose

The bound can be very loose. Why should you care?
  • There are tighter, albeit more complicated, bounds
  • Bounds give us formal guarantees that empirical studies can’t provide
  • Bounds give us intuition about the complexity of problems and the convergence rate of algorithms

[Plot: bound ε versus m (in units of 10^5), one curve per dimension d = 2, 20, 200, 2000]

SLIDE 38

  • What you need to know

Finite hypothesis spaces:
  • Derive the results
  • Counting the number of hypotheses
  • Mistakes on training data

Complexity of the classifier depends on the number of points that can be classified exactly:
  • Finite case – decision trees
  • Infinite case – VC dimension

Bias-Variance tradeoff in learning theory
Margin-based bound for SVMs
Remember: will your algorithm find the best classifier?

SLIDE 39

  • Big Picture

Machine Learning – 10701/15781
Carlos Guestrin, Carnegie Mellon University
March 6th, 2006

SLIDE 40

  • What you have learned thus far

  • Learning is function approximation
  • Point estimation
  • Regression
  • Naïve Bayes
  • Logistic regression
  • Bias-Variance tradeoff
  • Neural nets
  • Decision trees
  • Cross validation
  • Boosting
  • Instance-based learning
  • SVMs
  • Kernel trick
  • PAC learning
  • VC dimension
  • Margin bounds
  • Mistake bounds

SLIDE 41

  • Review material in terms of…

  • Types of learning problems
  • Hypothesis spaces
  • Loss functions
  • Optimization algorithms

SLIDE 42

  • Text Classification

Company home page vs. personal home page vs. university home page vs. …

SLIDE 43

  • Function fitting
  • Temperature data
SLIDE 44

  • Monitoring a complex system

Reverse water gas shift system (RWGS):
  • Learn a model of the system from data
  • Use the model to predict behavior and detect faults

SLIDE 45

  • Types of learning problems

  • Classification
  • Regression
  • Density estimation


Input – features. Output?

SLIDE 46

  • The learning problem

Data: <x1,…,xn,y>

Learning task:
  • Features / function approximator
  • Loss function
  • Optimization algorithm
  • Learned function

SLIDE 47

  • Comparing learning algorithms

  • Hypothesis space
  • Loss function
  • Optimization algorithm

SLIDE 48

  • Naïve Bayes versus Logistic regression


SLIDE 49

  • Naïve Bayes versus Logistic regression – Classification as density estimation

  • Choose the class with the highest probability
  • In addition to the class, we get a certainty measure (see the sketch below)
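A tiny sketch of that idea with scikit-learn’s LogisticRegression (our tooling choice): predict_proba estimates P(y | x); the argmax is the class, and the max is the certainty measure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)

probs = clf.predict_proba([[1.4]])[0]          # estimated P(y | x = 1.4)
print("class:", int(np.argmax(probs)), "certainty:", probs.max())
```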

SLIDE 50

  • Logistic regression versus Boosting

Both learn a classifier; they differ in the loss:
  • Logistic regression: log-loss
  • Boosting: exponential-loss

SLIDE 51

  • Linear classifiers – Logistic regression versus SVMs

Both use a linear decision boundary: w·x + b = 0

SLIDE 52

  • What’s the difference between SVMs and Logistic Regression? (Revisited again)

                                            Logistic Regression    SVMs
  Loss function                             Log-loss               Hinge loss
  High dimensional features with kernels    Yes!                   Yes!
  Solution sparse                           Almost always no!      Often yes!
  Type of learning                          Density estimation     Margin-based classification

SLIDE 53

  • SVMs and instance-based learning

Data: <x1,…,xn,y>

  • SVMs: classify as sign(Σj αj yj K(xj, x) + b) – a weighted vote of the support vectors
  • Instance-based learning: classify as the label of the nearest training point(s)

SLIDE 54

  • Instance-based learning versus Decision trees

1-Nearest neighbor vs. decision trees

SLIDE 55

  • Logistic regression versus Neural nets


SLIDE 56

  • Linear regression versus Kernel regression

  • Linear regression
  • Kernel regression
  • Kernel-weighted linear regression

SLIDE 57

  • Kernel-weighted linear regression
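A minimal sketch of kernel-weighted (locally weighted) linear regression, assuming a Gaussian kernel; all names are our own:

```python
import numpy as np

def kernel_weighted_linreg(X, y, x_query, tau=0.5):
    """Fit weighted least squares around x_query; predict at x_query."""
    Xb = np.column_stack([np.ones(len(X)), X])            # intercept column
    w = np.exp(-np.sum((X - x_query)**2, axis=1) / (2 * tau**2))  # kernel weights
    A = Xb.T * w                                          # X^T W (W diagonal)
    beta = np.linalg.solve(A @ Xb, A @ y)                 # (X^T W X) beta = X^T W y
    return np.concatenate([[1.0], x_query]) @ beta

X = np.linspace(0, 6, 50)[:, None]
y = np.sin(X).ravel()
print(kernel_weighted_linreg(X, y, x_query=np.array([3.0])))  # ≈ sin(3) ≈ 0.141
```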
SLIDE 58

  • SVM regression

[Figure: ε-insensitive tube around the regression function w·x + b, bounded by w·x + b + ε and w·x + b − ε]
SLIDE 59

  • BIG PICTURE (a few points of comparison)

Legend – learning task: DE = density estimation, Cl = classification, Reg = regression; loss function: LL = log-loss/MLE, Mrg = margin-based, RMS = squared error.

  Algorithm                  Learning task, loss function
  Naïve Bayes                DE, LL
  Logistic regression        DE, LL
  Neural Nets                DE, Cl, Reg, RMS
  Boosting                   Cl, exp-loss
  SVMs                       Cl, Mrg
  Instance-based learning    DE, Cl, Reg
  SVM regression             Reg, Mrg
  Kernel regression          Reg, RMS
  Linear regression          Reg, RMS
  Decision trees             DE, Cl, Reg

This is a very incomplete view!!!