Statistical Learning Theory, Real-World Process and PAC-Learning - PowerPoint PPT Presentation



SLIDE 1

Statistical Learning Theory and PAC-Learning

CS678 Advanced Topics in Machine Learning, Thorsten Joachims, Spring 2003

Outline:

  • What is the true (prediction) error of classification rule h?
  • How to bound the true error given the training error?
  • Finite hypothesis space and zero training error
  • Finite hypothesis space and non-zero training error
  • Infinite hypothesis spaces: VC-Dimension and Growth Function

Learning Classifiers

Goal:

  • Learner uses the training set to find a classifier with low prediction error.

[Diagram: Real-World Process → Training Set → Learner → Classifier → New Examples]

Learning Classifiers from Examples (Scenario)

Scenario:

  • Generator: Generates descriptions $x$ according to distribution $P(x)$.
  • Teacher: Assigns a value $y$ to each description $x$ based on distribution $P(y|x)$.

Given:

  • Training examples $(x_1, y_1), \ldots, (x_n, y_n) \sim P(x, y)$, with $x_i \in \Re^N$ and $y_i \in \{+1, -1\}$.
  • Set H of classification rules h (hypotheses) that map descriptions to values ($h: x \to y$).

Goal of Learner:

  • Classification rule h from H that classifies new examples (again from $P(x, y)$) with low error rate!

True (prediction) error:

$\mathrm{Err}_P(h) = P(h(x) \neq y) = \int \Delta(h(x) \neq y) \, dP(x, y)$

Principle: Empirical Risk Minimization (ERM)

Learning Principle: Find the decision rule $h^\circ \in H$ for which the training error is minimal:

$h^\circ = \arg\min_{h \in H} \{\mathrm{Err}_S(h)\}$

Training Error:

$\mathrm{Err}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i \neq h(x_i))$

==> Number of misclassifications on the training examples.

Central Problem of Statistical Learning Theory: When does a low training error lead to a low generalization error?
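The ERM principle above can be sketched in a few lines of Python. The hypothesis class (1-D threshold rules) and the toy sample are illustrative assumptions, not from the slides:

```python
def training_error(h, sample):
    """Err_S(h): fraction of training examples misclassified by h."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

# Hypothetical finite hypothesis space: 1-D threshold rules x -> sign(x - t).
H = [lambda x, t=t: 1 if x >= t else -1 for t in range(6)]

# Hypothetical training sample (x_i, y_i) with y_i in {+1, -1}.
S = [(0, -1), (1, -1), (2, -1), (3, 1), (4, 1), (5, 1)]

# ERM: h_erm = argmin over h in H of Err_S(h).
h_erm = min(H, key=lambda h: training_error(h, S))
print(training_error(h_erm, S))  # -> 0.0 (the threshold t=3 separates S)
```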

SLIDE 2

Sources of Variation

Learning Task:

  • Generator: Generates descriptions $x$ according to distribution $P(x)$.
  • Teacher: Assigns a value $y$ to each description $x$ based on $P(y|x)$.

=> Learning Task: $P(x, y) = P(y|x) P(x)$

Process:

  • Select task $P(x, y)$
  • Training sample S (depends on $P(x, y)$)
  • Train learning algorithm A (e.g. randomized search)
  • Test sample V (depends on $P(x, y)$)
  • Apply classification rule h (e.g. randomized prediction)

What is the true error of classification rule h?

Includes variation from different test sets.

Problem Setting:

  • given rule h
  • given (independent) test sample $V = (x_1, y_1), \ldots, (x_k, y_k)$ of size k
  • estimate $\mathrm{Err}_P(h) = P(h(x) \neq y) = \int \Delta(h(x) \neq y) \, dP(x, y)$

Approach: measure the error of h on the test set:

$\mathrm{Err}_V(h) = \frac{1}{k} \sum_{i=1}^{k} \Delta(y_i \neq h(x_i))$

Binomial Distribution

The probability of observing x heads in a sample of n independent coin tosses, when the probability of heads is p in each toss, is

$P(X = x \mid p, n) = \frac{n!}{x! (n - x)!} p^x (1 - p)^{n - x}$

Confidence interval: Given x observed heads, with at least 95% confidence the true value of p fulfills

$P(X \geq x \mid p, n) \geq 0.025 \quad \text{and} \quad P(X \leq x \mid p, n) \geq 0.025$
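The two tail conditions above can be checked numerically by scanning candidate values of p. A minimal sketch with exact binomial tails (the grid resolution is an arbitrary choice):

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def ci_95(x, n, grid=2000):
    """All p with P(X >= x | p, n) >= 0.025 and P(X <= x | p, n) >= 0.025."""
    candidates = [i / grid for i in range(grid + 1)]
    ok = [p for p in candidates
          if 1 - binom_cdf(x - 1, n, p) >= 0.025 and binom_cdf(x, n, p) >= 0.025]
    return min(ok), max(ok)

lo, hi = ci_95(7, 10)  # 7 heads observed in 10 tosses
print(lo, hi)          # roughly (0.35, 0.93)
```

This is the exact (Clopper-Pearson style) interval; for 7 heads out of 10 tosses it is wide, which is exactly why small test sets give weak error estimates.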

Cross-Validation Estimation

Given:

  • training set S of size n

Method:

  • partition S into m subsets of equal size
  • for i from 1 to m:
    • train the learner on all subsets except the i-th
    • test the learner on the i-th subset
    • record the error rate on the test set

=> Result: average over the recorded error rates
Bias of the estimate: see leave-one-out
Warning: The test sets are independent, but the training sets are not!
=> no strictly valid hypothesis test is known for general learning algorithms (see [Dietterich/97])
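The m-fold procedure above can be sketched in Python. The interleaved fold split and the majority-class learner are illustrative assumptions:

```python
def cross_validation_error(sample, train, m=5):
    """Average test-set error over m folds; `train` maps a sample to a classifier."""
    folds = [sample[i::m] for i in range(m)]  # partition S into m subsets
    errors = []
    for i in range(m):
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h = train(train_set)                  # train on all subsets except the i-th
        test_set = folds[i]                   # test on the i-th subset
        errors.append(sum(1 for x, y in test_set if h(x) != y) / len(test_set))
    return sum(errors) / m                    # average over recorded error rates

def majority_learner(sample):
    """Toy learner: always predicts the majority class of its training set."""
    pos = sum(1 for _, y in sample if y == 1)
    label = 1 if pos >= len(sample) - pos else -1
    return lambda x: label

data = [(i, 1 if i % 3 else -1) for i in range(30)]  # 20 positive, 10 negative
print(cross_validation_error(data, majority_learner, m=5))  # -> 1/3
```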

SLIDE 3

Psychic Game

  • I guess a 4-bit code
  • You all guess a 4-bit code

=> The student who guesses my code clearly has telepathic abilities - right!?

How can You Convince Me of Your Psychic Abilities?

Setting:

  • n bits
  • |H| players

Question: For which n and |H| is the prediction of the zero-error player significantly different from random ($p = 0.5$) with probability $1 - \delta$?

=> Hypothesis test for

$P(h_1 \text{ correct} \vee \ldots \vee h_{|H|} \text{ correct} \mid \text{all non-psychic}) < \delta$

$P(\exists h \in H: \mathrm{Err}_S(h) = 0 \mid \forall h \in H: \mathrm{Err}_P(h) = 0.5) < \delta$
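The probability in the hypothesis test above is easy to compute: each non-psychic player matches an n-bit code with probability $2^{-n}$, so at least one of |H| players matches with probability $1 - (1 - 2^{-n})^{|H|}$. A quick check (the class size of 30 is an assumed example):

```python
def p_some_player_correct(n_bits, num_players):
    """P(h_1 correct or ... or h_|H| correct | all non-psychic)."""
    p_single = 0.5 ** n_bits          # one non-psychic player guesses the code
    return 1 - (1 - p_single) ** num_players

# With a 4-bit code and 30 players, a zero-error player proves nothing:
print(p_some_player_correct(4, 30))   # far above any usual delta
# With a 20-bit code, a correct guess would be significant:
print(p_some_player_correct(20, 30))
```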

PAC Learning

Definition:

  • C = class of concepts $c: X \to \{+1, -1\}$ (functions to be learned)
  • H = class of hypotheses $h: X \to \{+1, -1\}$ (functions used by learner A)
  • S = training set (of size n)
  • $\varepsilon$ = desired error rate of the learned hypothesis
  • $\delta$ = probability with which the learner A is allowed to fail

C is PAC-learnable by algorithm A using H and n examples, if

$P(\mathrm{Err}_P(h_{A(S)}) \leq \varepsilon) \geq 1 - \delta$

for all $c \in C$, $\varepsilon$, $\delta$, and $P(X)$, so that A runs in time polynomial in $1/\varepsilon$, $1/\delta$, the size of the training examples, and the size of the concepts.

=> only polynomially many training examples allowed.

Case: Finite H, Zero Error

  • The hypothesis space H is finite.
  • There is always some h with zero training error (A returns one such h).
  • Probability that a (single) h with $\mathrm{Err}_P(h) \geq \varepsilon$ has a training error of zero: $\leq (1 - \varepsilon)^n$
  • Probability that there exists an h in H with $\mathrm{Err}_P(h) \geq \varepsilon$ that has a training error of zero:

$P(\exists h \in H: \mathrm{Err}_S(h) = 0, \mathrm{Err}_P(h) > \varepsilon) \leq |H| (1 - \varepsilon)^n \leq |H| e^{-\varepsilon n}$
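Setting the final bound $|H| e^{-\varepsilon n}$ equal to $\delta$ and solving for n gives the familiar sample-complexity rule $n \geq (\ln|H| + \ln(1/\delta))/\varepsilon$. A small sketch (the particular |H|, ε, δ values are assumed examples):

```python
from math import ceil, log

def sample_size(h_size, epsilon, delta):
    """Smallest n with |H| * exp(-epsilon * n) <= delta."""
    return ceil((log(h_size) + log(1 / delta)) / epsilon)

# e.g. |H| = 2^16 hypotheses, epsilon = 0.05, delta = 0.01:
print(sample_size(2 ** 16, 0.05, 0.01))  # -> 314
```

Note the logarithmic dependence on |H|: doubling the hypothesis space adds only ln 2 / ε extra examples.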

SLIDE 4

Case: Finite H, Non-Zero Error

Goal:

$P(|\mathrm{Err}_S(h_{A(S)}) - \mathrm{Err}_P(h_{A(S)})| \leq \varepsilon) \geq 1 - \delta$
<= $P(\sup_{h \in H} |\mathrm{Err}_S(h) - \mathrm{Err}_P(h)| \leq \varepsilon) \geq 1 - \delta$

  • Probability that for a fixed h, training error and test error differ by more than $\varepsilon$ (Hoeffding/Chernoff bound):

$P\left(\left|\frac{1}{n} \sum_{i=1}^{n} x_i - p\right| > \varepsilon\right) \leq 2 e^{-2 n \varepsilon^2}$

  • Probability over all h in H: union bound => multiply by $|H|$
Case: Infinite H

  • The union bound no longer works.
  • Maybe not all hypotheses are really different?!

How Many Dichotomies for a Fixed Sample?

  • Sample S of size n
  • Hypothesis class H

Definition: $\Pi_H(S) = \{(h(x_1), h(x_2), \ldots, h(x_n)) : h \in H\}$. H shatters S if $|\Pi_H(S)| = 2^n$ (i.e. hypotheses from H can split S in all possible ways).

Vapnik/Chervonenkis Dimension

Definition: The VC-dimension of H is equal to the maximal number d of examples that can be split into two sets in all $2^d$ ways using functions from H (shattering).

[Table: hypotheses $h_1, \ldots, h_N$ realize all $2^d$ sign patterns (+/-) on the examples $x_1, x_2, x_3, \ldots, x_d$.]

Growth function: For all S with $|S| = n$:

$|\Pi_H(S)| \leq \Phi_{VCdim(H)}(n) \leq \left(\frac{en}{VCdim(H)}\right)^{VCdim(H)}$
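The growth function bound can be checked numerically: by Sauer's lemma, $\Phi_d(n) = \sum_{i=0}^{d} \binom{n}{i}$, which falls well below $2^n$ once n exceeds d and is itself bounded by $(en/d)^d$. A small sketch:

```python
from math import comb, e

def phi(d, n):
    """Growth function bound: Phi_d(n) = sum_{i=0}^{d} C(n, i)."""
    return sum(comb(n, i) for i in range(d + 1))

d, n = 3, 10
print(phi(d, n))         # -> 176 dichotomies at most, versus 2^10 = 1024
print((e * n / d) ** d)  # the (en/d)^d upper bound, roughly 744
```

The polynomial (rather than exponential) growth in n is what makes the VC bounds on the next slides possible for infinite H.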

SLIDE 5

Linear Classifiers

Rules of the form: weight vector $w$, threshold $b$:

$h(x) = \mathrm{sign}\left(\sum_{i=1}^{N} w_i x_i + b\right) = \begin{cases} +1 & \text{if } \sum_{i=1}^{N} w_i x_i + b > 0 \\ -1 & \text{else} \end{cases}$

Geometric Interpretation (Hyperplane): $w$ is the normal vector of the separating hyperplane, $b$ its offset from the origin.
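The rule above in Python, mapping the boundary case (sum equal to 0) to -1 as in the "else" branch of the formula; the weights and threshold are hypothetical examples:

```python
def linear_classifier(w, b):
    """h(x) = sign(sum_i w_i * x_i + b) with values in {+1, -1}."""
    def h(x):
        s = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if s > 0 else -1
    return h

h = linear_classifier(w=[1.0, -2.0], b=0.5)   # hypothetical weight vector/threshold
print(h([3.0, 1.0]), h([0.0, 1.0]))           # -> 1 -1
```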

VC-Dimension of Hyperplanes in $\Re^2$

  • Three points in $\Re^2$ can be shattered with hyperplanes.
  • Four points cannot be shattered.

=> Hyperplanes in $\Re^2$: VCdim = 3

General: Hyperplanes in $\Re^N$: VCdim = N + 1

Error Bound

Question: After n training examples, how close is the training error to the true error?

With probability $1 - \eta$ it holds for all $h \in H$:

$|\mathrm{Err}_P(h) - \mathrm{Err}_S(h)| \leq \Phi(d, n, \eta)$ with $\Phi(d, n, \eta) = \sqrt{\frac{d \left(\ln \frac{2n}{d} + 1\right) - \ln \frac{\eta}{4}}{n}}$

  • n: number of training examples
  • d: VC-dimension of the hypothesis space H

==> $\mathrm{Err}_P(h) \leq \mathrm{Err}_S(h) + \Phi(d, n, \eta)$
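The capacity term can be evaluated directly. A sketch showing how it shrinks as n grows relative to d (the particular d, n, η values are arbitrary examples):

```python
from math import log, sqrt

def phi_bound(d, n, eta):
    """Phi(d, n, eta) = sqrt((d * (ln(2n/d) + 1) - ln(eta/4)) / n)."""
    return sqrt((d * (log(2 * n / d) + 1) - log(eta / 4)) / n)

print(phi_bound(d=10, n=100, eta=0.05))      # roughly 0.67: a weak guarantee
print(phi_bound(d=10, n=100000, eta=0.05))   # roughly 0.03: a tight one
```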

SVM Motivation: Structural Risk Minimization

Idea: Structure on the hypothesis space.
Goal: Minimize the upper bound on the true error rate:

$\mathrm{Err}_P(h_i) \leq \mathrm{Err}_S(h_i) + \Phi(VCdim(H), n, \eta)$

[Figure: As VCdim(H) grows, the training error $\mathrm{Err}_S(h_i)$ decreases while the capacity term $\Phi(VCdim(H), n, \eta)$ increases; the bound on the true error $\mathrm{Err}_P(h_i)$ is minimized at $h^*$.]