SLIDE 1 Computational Learning Theory [read Chapter 7] [Suggested exercises: 7.1, 7.2, 7.5, 7.8]
  • Computational learning theory
  • Setting 1: learner poses queries to teacher
  • Setting 2: teacher chooses examples
  • Setting 3: randomly generated instances, labeled by teacher
  • Probably approximately correct (PAC) learning
  • Vapnik-Chervonenkis dimension
  • Mistake bounds
SLIDE 2 Computational Learning Theory
What general laws constrain inductive learning? We seek theory to relate:
  • Probability of successful learning
  • Number of training examples
  • Complexity of hypothesis space
  • Accuracy to which target concept is approximated
  • Manner in which training examples are presented
SLIDE 3 Prototypical Concept Learning Task
  • Given:
    – Instances X: possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
    – Target function c: $EnjoySport: X \to \{0, 1\}$
    – Hypotheses H: conjunctions of literals, e.g. $\langle ?, Cold, High, ?, ?, ? \rangle$
    – Training examples D: positive and negative examples of the target function, $\langle x_1, c(x_1) \rangle, \ldots, \langle x_m, c(x_m) \rangle$
  • Determine:
    – A hypothesis h in H such that h(x) = c(x) for all x in D?
    – A hypothesis h in H such that h(x) = c(x) for all x in X?
SLIDE 4 Sample Complexity
How many training examples are sufficient to learn the target concept?
1. If learner proposes instances, as queries to teacher
   • Learner proposes instance x, teacher provides c(x)
2. If teacher (who knows c) provides training examples
   • teacher provides a sequence of examples of the form ⟨x, c(x)⟩
3. If some random process (e.g., nature) proposes instances
   • instance x generated randomly, teacher provides c(x)
SLIDE 5 Sample Complexity: 1
Learner proposes instance x, teacher provides c(x) (assume c is in learner's hypothesis space H).
Optimal query strategy: play 20 questions
  • pick instance x such that half of the hypotheses in VS classify x positive, half classify x negative
  • When this is possible, need $\lceil \log_2 |H| \rceil$ queries to learn c
  • when not possible, need even more
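The slides give no code for this query strategy; below is a minimal sketch, assuming a hypothesis space small enough to enumerate explicitly. The function `query_learn` and the threshold-hypothesis example are illustrative names, not anything from the text.

```python
def query_learn(hypotheses, instances, target):
    """hypotheses: callables x -> bool; target: the true concept c, assumed in H."""
    version_space = list(hypotheses)
    queries = 0
    while len(version_space) > 1:
        # pick the instance that splits the version space most evenly
        x = min(instances,
                key=lambda x: abs(2 * sum(h(x) for h in version_space)
                                  - len(version_space)))
        label = target(x)  # teacher answers the query with c(x)
        survivors = [h for h in version_space if h(x) == label]
        if len(survivors) == len(version_space):
            break  # no remaining instance distinguishes the survivors
        version_space, queries = survivors, queries + 1
    return version_space[0], queries

# Eight threshold hypotheses on points 0..7: perfect halving is possible,
# so learning takes ceil(log2(8)) = 3 queries.
H = [lambda x, t=t: x >= t for t in range(8)]
h, q = query_learn(H, list(range(8)), target=H[5])
print(q)  # 3
```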
SLIDE 6 Sample Complexity: 2
Teacher (who knows c) provides training examples (assume c is in learner's hypothesis space H).
Optimal teaching strategy: depends on the H used by the learner.
Consider the case H = conjunctions of up to n boolean literals and their negations, e.g., $(AirTemp = Warm) \wedge (Wind = Strong)$, where AirTemp, Wind, ... each have 2 possible values.
  • if n possible boolean attributes in H, n + 1 examples suffice
  • why? (see the sketch below)
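One way to see the n + 1 claim (a sketch, not from the slides): the teacher shows one positive example, then n variants that each flip a single attribute; the flipped example comes out negative exactly when that attribute's literal occurs in the target conjunction. All names here (`teaching_sequence`, the attribute dicts) are hypothetical.

```python
def teaching_sequence(c, positive_example):
    """c: dict attribute -> required bool value (the target conjunction);
    positive_example: dict attribute -> bool satisfying every literal of c."""
    examples = [(dict(positive_example), True)]  # 1 positive example
    for attr in positive_example:                # plus n single-flip variants
        flipped = dict(positive_example)
        flipped[attr] = not flipped[attr]
        label = all(flipped[a] == v for a, v in c.items())
        examples.append((flipped, label))
    return examples

# The learner reads c off the sequence: attribute a belongs to the target
# iff flipping a turned the label negative; the polarity comes from the
# positive example. A target over 3 attributes needs 3 + 1 = 4 examples.
c = {"AirTemp": True, "Wind": True}
x_pos = {"AirTemp": True, "Wind": True, "Humidity": False}
print(teaching_sequence(c, x_pos))
```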
SLIDE 7 Sample Complexity: 3
Given:
  • set of instances X
  • set of hypotheses H
  • set of possible target concepts C
  • training instances generated by a fixed, unknown probability distribution $\mathcal{D}$ over X
Learner observes a sequence D of training examples of the form ⟨x, c(x)⟩, for some target concept $c \in C$
  • instances x are drawn from distribution $\mathcal{D}$
  • teacher provides target value c(x) for each
Learner must output a hypothesis h estimating c
  • h is evaluated by its performance on subsequent instances drawn according to $\mathcal{D}$
Note: randomly drawn instances, noise-free classifications
SLIDE 8 True Error of a Hypothesis
[Figure: instance space X, with the regions where c and h disagree shaded]
Definition: The true error (denoted $error_{\mathcal{D}}(h)$) of hypothesis h with respect to target concept c and distribution $\mathcal{D}$ is the probability that h will misclassify an instance drawn at random according to $\mathcal{D}$.
$$error_{\mathcal{D}}(h) \equiv \Pr_{x \in \mathcal{D}}[c(x) \neq h(x)]$$
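The definition suggests a direct simulation: estimate $error_{\mathcal{D}}(h)$ by sampling from $\mathcal{D}$ and counting disagreements. A minimal sketch with illustrative names; the threshold concepts are made up for the example.

```python
import random

def estimate_true_error(h, c, sample_from_D, n=100_000):
    """Monte Carlo estimate of Pr_{x ~ D}[c(x) != h(x)]."""
    disagreements = 0
    for _ in range(n):
        x = sample_from_D()  # draw an instance from D
        disagreements += (h(x) != c(x))
    return disagreements / n

# D uniform on [0, 1); c and h are thresholds that disagree on [0.5, 0.6),
# so the true error is exactly 0.1.
c = lambda x: x >= 0.5
h = lambda x: x >= 0.6
print(estimate_true_error(h, c, random.random))  # ~0.1
```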
SLIDE 9 Two Notions of Error
Training error of hypothesis h with respect to target concept c
  • How often $h(x) \neq c(x)$ over training instances
True error of hypothesis h with respect to c
  • How often $h(x) \neq c(x)$ over future random instances
Our concern:
  • Can we bound the true error of h given the training error of h?
  • First consider when the training error of h is zero (i.e., $h \in VS_{H,D}$)
SLIDE 10 Exhausting the Version Space
[Figure: hypothesis space H with the version space $VS_{H,D}$ inside it; each hypothesis is labeled with its training error r and true error, e.g. r = 0 with error = .1, r = .1 with error = .3]
(r = training error, error = true error)
Definition: The version space $VS_{H,D}$ is said to be $\epsilon$-exhausted with respect to c and $\mathcal{D}$, if every hypothesis h in $VS_{H,D}$ has error less than $\epsilon$ with respect to c and $\mathcal{D}$.
$$(\forall h \in VS_{H,D})\; error_{\mathcal{D}}(h) < \epsilon$$
SLIDE 11 How many examples will $\epsilon$-exhaust the VS?
Theorem: [Haussler, 1988]. If the hypothesis space H is finite, and D is a sequence of $m \ge 1$ independent random examples of some target concept c, then for any $0 \le \epsilon \le 1$, the probability that the version space with respect to H and D is not $\epsilon$-exhausted (with respect to c) is less than
$$|H| e^{-\epsilon m}$$
Interesting! This bounds the probability that any consistent learner will output a hypothesis h with $error(h) \ge \epsilon$.
If we want this probability to be below $\delta$
$$|H| e^{-\epsilon m} \le \delta$$
then
$$m \ge \frac{1}{\epsilon}\left(\ln |H| + \ln(1/\delta)\right)$$
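Solving $|H| e^{-\epsilon m} \le \delta$ for m gives the bound above; a small helper (a sketch, with the illustrative name `sample_complexity`) that the numbers on later slides can be checked against:

```python
from math import ceil, log

def sample_complexity(H_size, eps, delta):
    """Examples sufficient so that, with probability >= 1 - delta, every
    hypothesis consistent with the data has true error < eps."""
    return ceil((1 / eps) * (log(H_size) + log(1 / delta)))
```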
SLIDE 12 Learning Conjunctions of Boolean Literals
How many examples are sufficient to assure with probability at least $(1 - \delta)$ that every h in $VS_{H,D}$ satisfies $error_{\mathcal{D}}(h) \le \epsilon$?
Use our theorem:
$$m \ge \frac{1}{\epsilon}\left(\ln |H| + \ln(1/\delta)\right)$$
Suppose H contains conjunctions of constraints on up to n boolean attributes (i.e., n boolean literals). Then $|H| = 3^n$ (each attribute appears as a positive literal, as a negative literal, or not at all), and
$$m \ge \frac{1}{\epsilon}\left(\ln 3^n + \ln(1/\delta)\right)$$
or
$$m \ge \frac{1}{\epsilon}\left(n \ln 3 + \ln(1/\delta)\right)$$
SLIDE 13 How About EnjoySport?
$$m \ge \frac{1}{\epsilon}\left(\ln |H| + \ln(1/\delta)\right)$$
If H is as given in EnjoySport then $|H| = 973$, and
$$m \ge \frac{1}{\epsilon}\left(\ln 973 + \ln(1/\delta)\right)$$
... if we want to assure that with probability 95%, VS contains only hypotheses with $error_{\mathcal{D}}(h) \le .1$, then it is sufficient to have m examples, where
$$m \ge \frac{1}{.1}\left(\ln 973 + \ln(1/.05)\right)$$
$$m \ge 10(\ln 973 + \ln 20)$$
$$m \ge 10(6.88 + 3.00)$$
$$m \ge 98.8$$
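Checking the arithmetic with the `sample_complexity` helper sketched after SLIDE 11 (the $3^n$ line covers the boolean-literal case from SLIDE 12):

```python
print(sample_complexity(973, eps=0.1, delta=0.05))      # 99, matching m >= 98.8
print(sample_complexity(3 ** 10, eps=0.1, delta=0.05))  # n = 10 boolean literals: 140
```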
SLIDE 14 PAC Learning
Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.
Definition: C is PAC-learnable by L using H if for all $c \in C$, distributions $\mathcal{D}$ over X, $\epsilon$ such that $0 < \epsilon < 1/2$, and $\delta$ such that $0 < \delta < 1/2$, learner L will with probability at least $(1 - \delta)$ output a hypothesis $h \in H$ such that $error_{\mathcal{D}}(h) \le \epsilon$, in time that is polynomial in $1/\epsilon$, $1/\delta$, n and size(c).
SLIDE 15 Agnostic Learning
So far, we assumed $c \in H$. Agnostic learning setting: don't assume $c \in H$
  • What do we want then?
    – The hypothesis h that makes fewest errors on training data
  • What is the sample complexity in this case?
$$m \ge \frac{1}{2\epsilon^2}\left(\ln |H| + \ln(1/\delta)\right)$$
derived from Hoeffding bounds:
$$\Pr\left[error_{\mathcal{D}}(h) > \widehat{error}_D(h) + \epsilon\right] \le e^{-2m\epsilon^2}$$
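The same rearrangement applied to the Hoeffding bound (with a union bound over H) gives the agnostic sample complexity; a sketch with an illustrative name:

```python
from math import ceil, log

def agnostic_sample_complexity(H_size, eps, delta):
    """Examples sufficient so that, with probability >= 1 - delta, every h in H
    has true error within eps of its training error (Hoeffding + union bound)."""
    return ceil((1 / (2 * eps ** 2)) * (log(H_size) + log(1 / delta)))

print(agnostic_sample_complexity(973, eps=0.1, delta=0.05))  # 494 vs. 99 when c is in H
```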
SLIDE 16 Shattering a Set of Instances
Definition: a dichotomy of a set S is a partition of S into two disjoint subsets.
Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.
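A brute-force check of the definition for finite H and S: collect the labelings H induces on S and test whether all $2^{|S|}$ dichotomies appear. Names are illustrative.

```python
from itertools import product

def shatters(H, S):
    """True iff hypothesis space H (callables x -> bool) realizes every
    dichotomy of the instance list S."""
    realized = {tuple(h(x) for x in S) for h in H}
    return all(tuple(d) in realized
               for d in product([False, True], repeat=len(S)))
```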
SLIDE 17 Three Instances Shattered
[Figure: instance space X containing three instances, with hypotheses realizing every dichotomy of them]
SLIDE 18 The Vapnik-Chervonenkis Dimension
Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then $VC(H) \equiv \infty$.
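For finite H over a finite instance pool, VC(H) can be computed by brute force from the shattering test (reusing `shatters` from the sketch after SLIDE 16); the threshold example is illustrative.

```python
from itertools import combinations

def vc_dimension(H, X):
    """Size of the largest subset of the finite pool X shattered by H."""
    vc = 0
    for k in range(1, len(X) + 1):
        if any(shatters(H, list(S)) for S in combinations(X, k)):
            vc = k
        else:
            break  # shattering is monotone: no larger subset can be shattered
    return vc

# Thresholds x >= t can give a lone point either label, but can never label
# a smaller point positive and a larger one negative, so VC = 1.
H = [lambda x, t=t: x >= t for t in range(6)]
print(vc_dimension(H, list(range(5))))  # 1
```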
SLIDE 19 VC Dim. of Linear Decision Surfaces
[Figure: two panels, (a) and (b), showing sets of points in the plane and linear decision surfaces]
SLIDE 20 Sample Complexity from VC Dimension
How many randomly drawn examples suffice to $\epsilon$-exhaust $VS_{H,D}$ with probability at least $(1 - \delta)$?
$$m \ge \frac{1}{\epsilon}\left(4 \log_2(2/\delta) + 8\, VC(H) \log_2(13/\epsilon)\right)$$
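The bound as a helper, mirroring the earlier `sample_complexity` sketch (illustrative name):

```python
from math import ceil, log2

def vc_sample_complexity(vc_dim, eps, delta):
    """Examples sufficient to eps-exhaust VS_{H,D} with prob. >= 1 - delta."""
    return ceil((1 / eps) * (4 * log2(2 / delta)
                             + 8 * vc_dim * log2(13 / eps)))

print(vc_sample_complexity(vc_dim=3, eps=0.1, delta=0.05))  # about 1900 examples
```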
SLIDE 21 Mistake Bounds
So far: how many examples are needed to learn? What about: how many mistakes before convergence?
Let's consider a setting similar to PAC learning:
  • Instances drawn at random from X according to distribution $\mathcal{D}$
  • Learner must classify each instance before receiving the correct classification from the teacher
  • Can we bound the number of mistakes the learner makes before converging?
SLIDE 22 Mistake Bounds: Find-S
Consider Find-S when H = conjunctions of boolean literals
Find-S:
  • Initialize h to the most specific hypothesis $l_1 \wedge \neg l_1 \wedge l_2 \wedge \neg l_2 \ldots l_n \wedge \neg l_n$
  • For each positive training instance x
    – Remove from h any literal that is not satisfied by x
  • Output hypothesis h.
How many mistakes before converging to correct h?
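A runnable rendering of the slide's Find-S pseudocode, under the assumption that a conjunction is stored as a dict from attribute to required value (that representation is my choice, not the text's):

```python
def find_s(positive_examples):
    """positive_examples: dicts mapping attribute -> bool, all labeled positive."""
    h = None  # None stands for the maximally specific l1 ∧ ¬l1 ∧ ... hypothesis
    for x in positive_examples:
        if h is None:
            h = dict(x)  # first positive example: keep all of its literals
        else:
            # remove any literal the new positive example does not satisfy
            h = {a: v for a, v in h.items() if x.get(a) == v}
    return h

print(find_s([{"AirTemp": True, "Wind": True, "Humidity": False},
              {"AirTemp": True, "Wind": True, "Humidity": True}]))
# {'AirTemp': True, 'Wind': True}
```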
SLIDE 23 Mistake Bounds: Halving Algorithm
Consider the Halving Algorithm:
  • Learn the concept using the version space Candidate-Elimination algorithm
  • Classify new instances by majority vote of version space members
How many mistakes before converging to correct h?
  • ... in worst case?
  • ... in best case?
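A sketch of the Halving Algorithm over an explicitly enumerated hypothesis space (the version-space bookkeeping stands in for Candidate-Elimination; names are illustrative):

```python
def halving(hypotheses, labeled_stream):
    """hypotheses: callables x -> bool; labeled_stream: iterable of (x, c(x))."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in labeled_stream:
        votes = sum(h(x) for h in version_space)
        prediction = 2 * votes >= len(version_space)  # majority vote
        if prediction != label:
            mistakes += 1  # a mistake kills the majority, so the VS at least halves
        version_space = [h for h in version_space if h(x) == label]
    return version_space, mistakes
```

Since each mistake eliminates at least half of the version space, the worst case is at most $\log_2 |H|$ mistakes; in the best case the majority is always right and no mistakes are made.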
SLIDE 24 Optimal Mistake Bounds
Let $M_A(C)$ be the max number of mistakes made by algorithm A to learn concepts in C (maximum over all possible $c \in C$, and all possible training sequences).
$$M_A(C) \equiv \max_{c \in C} M_A(c)$$
Definition: Let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of $M_A(C)$.
$$Opt(C) \equiv \min_{A \in \text{learning algorithms}} M_A(C)$$
$$VC(C) \le Opt(C) \le M_{Halving}(C) \le \log_2(|C|)$$
lecture slides for textbook Machine Learning, © Tom M. Mitchell, McGraw Hill, 1997