A trichotomy of rates in supervised learning, Amir Yehudayoff (PowerPoint PPT presentation)



SLIDE 1

A trichotomy of rates in supervised learning

Amir Yehudayoff (Technion) Olivier Bousquet (Google) Steve Hanneke (TTIC) Shay Moran (Technion & Google) Ramon van Handel (Princeton)

SLIDE 2

background

learning theory: PAC learning is standard

the definition sometimes fails to provide valuable information for
– specific algorithms (nearest neighbor, neural nets, ...)
– specific problems

⇒ learning rates

SLIDE 3

framework

input: a sample of size n

S = ((x1, y1), . . . , (xn, yn)) ∈ (X × {0, 1})^n

output: a hypothesis h ∈ {0, 1}^X

a learning algorithm A maps S → h
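The interface above can be sketched in Python; the memorizing rule below is a hypothetical minimal instance of A (any map from samples to hypotheses fits the framework):

```python
# A learning algorithm A: sample S -> hypothesis h in {0,1}^X.
# Hypothetical minimal instance: memorize the sample; default to label 0
# on points not in the sample.

def A(S):
    """S is a sequence of pairs (x_i, y_i); returns h: X -> {0, 1}."""
    memory = dict(S)                # later pairs win on repeated x
    def h(x):
        return memory.get(x, 0)     # default label off the sample
    return h

h = A([("a", 1), ("b", 0)])
print(h("a"), h("b"), h("zzz"))     # -> 1 0 0
```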

SLIDE 4

generalization

goal: PAC learning

if S = ((x1, y1), . . . , (xn, yn)) is i.i.d. from an unknown µ, then h = A(S) is typically close to µ

closeness is measured by err(h) = Pr_{(x,y)∼µ}[h(x) ≠ y]
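When µ has finite support, err(h) can be computed exactly rather than estimated; a sketch with an illustrative µ and hypotheses (the distribution and names here are ours, not from the slides):

```python
# err(h) = Pr_{(x,y)~mu}[h(x) != y] for a finitely supported mu,
# given as a dict {(x, y): probability}.

def err(h, mu):
    return sum(p for (x, y), p in mu.items() if h(x) != y)

# illustrative mu on X = {0, 1, 2} with deterministic labels
mu = {(0, 0): 0.5, (1, 1): 0.3, (2, 1): 0.2}

print(err(lambda x: int(x >= 1), mu))   # -> 0 (matches every label)
print(err(lambda x: 0, mu))             # misses both y = 1 points: 0.5
```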

SLIDE 5

context

without “context” learning is “impossible”: what is the next element of 1, 2, 3, 4, 5, . . .?

there are a few possible definitions; for a class H, the distribution µ is realizable if inf{err(h) : h ∈ H} = 0, where err(h) = Pr_{(x,y)∼µ}[h(x) ≠ y]

SLIDE 6

PAC learning

error of algorithm A for sample size n:

ERRn(A, H) = sup { E_{S∼µ^n} err(A(S)) : µ is H-realizable }

the class H is PAC learnable if there is an A so that lim_{n→∞} ERRn(A, H) = 0
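A toy simulation of this error going to 0 (our illustrative setup, not from the talk): H = thresholds h_t(x) = 1[x ≥ t] on [0, 1], µ draws x uniformly with target threshold 0.5, and A is ERM returning the smallest consistent threshold.

```python
import random

def erm(S):
    """ERM over thresholds h_t(x) = 1[x >= t]: smallest consistent t."""
    positives = [x for x, y in S if y == 1]
    return min(positives) if positives else 1.0

def avg_error(n, trials=2000, seed=0):
    """Monte Carlo estimate of E_S err(A(S)) for |S| = n under this mu."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        S = [(x, int(x >= 0.5)) for x in (rng.random() for _ in range(n))]
        t = erm(S)
        total += t - 0.5            # exact err(h_t): Pr[x in [0.5, t)]
    return total / trials

print(avg_error(10), avg_error(100))   # error shrinks roughly like 1/n
```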

SLIDE 7

VC theory

theorem [Vapnik-Chervonenkis, Blumer-Ehrenfeucht-Haussler-Warmuth, ...] H is PAC learnable ⇔ VC dimension of H is finite

SLIDE 8

learning curve [Schuurmans]

the error “should” decrease as more examples are seen

this improvement is important (predict, estimate, ...)

SLIDE 9

rates

usually: µ is unknown but fixed, and we want a definition that captures this

the rate of algorithm A with respect to µ is

rate(n) = rate_{A,µ}(n) = E_S err(A(S))

where err(h) = Pr_{(x,y)∼µ}[h(x) ≠ y] and |S| = n
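Once µ is fixed, rate_{A,µ}(n) can be estimated by simulation. In this illustrative setup (x uniform on {0,…,4}, deterministic label y = x mod 2, A = memorization with default label 0) the rate decays exponentially in n, in the spirit of the Cohn-Tesauro observation cited on the next slide:

```python
import random

# Monte Carlo estimate of rate_{A,mu}(n) = E_S err(A(S)) for a fixed mu.
# Illustrative mu: x uniform on {0,...,4}, label y = x mod 2.
# A memorizes the sample and predicts 0 on unseen points.

POINTS = range(5)

def A(S):
    memory = dict(S)
    return lambda x: memory.get(x, 0)

def rate(n, trials=4000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.randrange(5) for _ in range(n)]
        h = A([(x, x % 2) for x in xs])
        # exact err(h) under mu: fraction of the 5 points h mislabels
        total += sum(h(x) != x % 2 for x in POINTS) / 5
    return total / trials

print(rate(2), rate(10), rate(30))   # decays like (4/5)^n, i.e. exp(-cn)
```

Here the only mistakes come from odd points never seen in the sample, so the exact rate is (2/5)(4/5)^n.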

SLIDE 10

VC classes

thm: the upper envelope is ≈ VC/n [Vapnik-Chervonenkis, Blumer-Ehrenfeucht-Haussler-Warmuth, ...]

experiments: rate(n) ≈ exp(−n) for fixed µ [Cohn-Tesauro]

SLIDE 11

rate of class

R : N → [0, 1] is a rate function

the class H has rate ≤ R if ∃A ∀µ ∃C ∀n: E err(A(S)) < C·R(n/C)

the class H has rate ≥ R if ∃C ∀A ∃µ: for ∞ many n, E err(A(S)) > R(Cn)/C

the class H has rate R if both

SLIDE 12

rates: comments

rate ≤ R if ∃A ∀µ ∃C ∀n: E err(A(S)) < C·R(n/C)

the algorithm A does not know the distribution µ; the “complexity” of µ is captured by the delay factor C = C(µ)

SLIDE 13

trichotomy

theorem∗ the rate of H can be
– exponential (e^{−n})
– linear (1/n)
– arbitrarily slow (for every R → 0, at least R)

∗ realizable, |H| > 2, standard measurability assumptions

SLIDE 14

trichotomy: comments

a rate such as 2^{−√n} is not an option

Schuurmans proved a special case (a dichotomy for chains)

the higher the complexity of H, the slower the rate; the complexity is characterized by “shattering capabilities”

SLIDE 15

exponential rate

proposition: the rate of H is exponential iff H does not shatter an infinite Littlestone tree

SLIDE 16

exponential rate

lower bound: if |H| > 2 then the rate is ≥ e^{−n}

upper bound: if H does not shatter an infinite Littlestone tree then the rate is ≤ e^{−n}: ∃A ∀µ ∃C ∀n: E err(A(S)) < C e^{−n/C}

SLIDE 17

exponential rate

lower bound: if |H| > 2 then the rate is ≥ e^{−n}

upper bound: if H does not shatter an infinite Littlestone tree then the rate is ≤ e^{−n}: ∃A ∀µ ∃C ∀n: E err(A(S)) < C e^{−n/C}

need: no tree ⇒ algorithm

SLIDE 18

duality (LP, games,...)

no tree ⇒ algorithm

simplest example: no point in the intersection of two convex bodies ⇒ a separating hyperplane

SLIDE 19

duality (LP, games,...)

no tree ⇒ algorithm

simplest example: no point in the intersection of two convex bodies ⇒ a separating hyperplane

duality for Gale-Stewart games: one of the players has a winning strategy
SLIDE 20

duality (LP, games,...)

no tree ⇒ algorithm

simplest example: no point in the intersection of two convex bodies ⇒ a separating hyperplane

duality for Gale-Stewart games: one of the players has a winning strategy

problem: how complex is this strategy?

SLIDE 21

measurability

the value of a position is an ordinal: it measures “how many steps to victory” (n steps to mate) [Evans, Hamkins]

SLIDE 22

measurability

the value of a position is an ordinal: it measures “how many steps to victory” (n steps to mate) [Evans, Hamkins]

the Littlestone dimension of H is the ordinal

LD(H) = 0                                           if |H| = 1
        ∞                                           if H has an ∞ tree
        sup_{x∈X} min_{y∈{0,1}} LD(H_{x→y}) + 1     otherwise
SLIDE 23

measurability

the value of a position is an ordinal: it measures “how many steps to victory” (n steps to mate) [Evans, Hamkins]

the Littlestone dimension of H is the ordinal

LD(H) = 0                                           if |H| = 1
        ∞                                           if H has an ∞ tree
        sup_{x∈X} min_{y∈{0,1}} LD(H_{x→y}) + 1     otherwise

theorem (relies on [Kunen-Martin]): if H is measurable∗ then LD(H) is countable
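For a finite class over a finite domain the recursion terminates and yields an integer; a sketch of that finite case, representing each h ∈ H as a tuple of labels, one per point of X (the infinite and ordinal-valued cases are beyond this toy code, and we restrict the sup to points x where both restrictions are nonempty):

```python
# LD(H) by the recursion on the slide, restricted to finite H over finite X:
# LD(H) = 0 if |H| = 1, else sup_x min_y LD(H_{x->y}) + 1, taken over the x
# where both restrictions H_{x->y} = {h in H : h(x) = y} are nonempty.

def LD(H):
    H = frozenset(H)
    if len(H) <= 1:
        return 0
    X = range(len(next(iter(H))))       # each h is a tuple indexed by x
    best = 0
    for x in X:
        H0 = frozenset(h for h in H if h[x] == 0)
        H1 = H - H0
        if H0 and H1:                   # both labels realizable at x
            best = max(best, min(LD(H0), LD(H1)) + 1)
    return best

cube = {(a, b) for a in (0, 1) for b in (0, 1)}
print(LD({(0, 1)}), LD({(0, 0), (1, 1)}), LD(cube))   # -> 0 1 2
```

The last example is the full cube on two points, which shatters a Littlestone tree of depth 2 and no deeper.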

SLIDE 24

summary

learning rates capture distribution-specific performance

there are 3 possible learning rates in the realizable case

the rate is characterized by shattering capabilities:
– shattering ⇒ a hard distribution, via a construction
– no shattering ⇒ an algorithm, via duality

the complexity of the algorithm via ordinals etc.

SLIDE 25

to do

agnostic case

accurate bounds on rates

applications of the shattering framework