A trichotomy of rates in supervised learning


  1. A trichotomy of rates in supervised learning Amir Yehudayoff (Technion) Olivier Bousquet (Google) Steve Hanneke (TTIC) Shay Moran (Technion & Google) Ramon van Handel (Princeton)

  2. background: learning theory. PAC learning is the standard definition, but it sometimes fails to provide valuable information about – specific algorithms (nearest neighbor, neural nets, ...) – specific problems. this motivates learning rates.

  3. framework. input: a sample of size n, S = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × {0,1})^n. output: a hypothesis h = A(S) ∈ {0,1}^X, where A is the learning algorithm.
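
The framework can be made concrete in a few lines of code. The sketch below is ours, not from the talk; it fixes X = int for simplicity, and the memorize learner is a placeholder whose only purpose is to make the types of sample, hypothesis, and algorithm explicit.

    # a minimal sketch of the framework, assuming X = int and labels in {0, 1}
    from typing import Callable, List, Tuple

    Sample = List[Tuple[int, int]]      # S = ((x_1, y_1), ..., (x_n, y_n))
    Hypothesis = Callable[[int], int]   # h in {0, 1}^X
    Algorithm = Callable[[Sample], Hypothesis]

    def memorize(S: Sample) -> Hypothesis:
        # a trivial learner: repeat remembered labels, answer 0 elsewhere
        table = {x: y for x, y in S}
        return lambda x: table.get(x, 0)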

  4. generalization. goal of PAC learning: if S = ((x_1, y_1), ..., (x_n, y_n)) is i.i.d. from an unknown μ, then h = A(S) is typically close to μ. closeness is measured by err(h) = Pr_{(x,y)∼μ}[h(x) ≠ y].
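
Since μ is unknown, err(h) cannot be computed exactly, but for a given distribution it can be estimated by sampling. A hedged sketch, where sample_mu is a hypothetical stand-in for one draw (x, y) ∼ μ:

    def estimate_err(h, sample_mu, trials=100_000):
        # Monte Carlo estimate of err(h) = Pr_{(x,y)~mu}[h(x) != y]
        mistakes = 0
        for _ in range(trials):
            x, y = sample_mu()          # one draw (x, y) ~ mu
            mistakes += (h(x) != y)
        return mistakes / trials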

  5. context. without "context", learning is "impossible": what is the next element of 1, 2, 3, 4, 5, ...? there are a few possible definitions of context. for a class H, the distribution μ is realizable if inf{err(h) : h ∈ H} = 0, where err(h) = Pr_{(x,y)∼μ}[h(x) ≠ y].
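
When H is finite and μ has finite support, the infimum is a minimum and realizability can be checked directly. A sketch under those (our) simplifying assumptions:

    def is_realizable(H, mu_support):
        # H: finite iterable of hypotheses; mu_support: list of ((x, y), p)
        # pairs giving the atoms of mu and their probabilities p
        def err(h):
            return sum(p for (x, y), p in mu_support if h(x) != y)
        return min(err(h) for h in H) == 0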

  6. PAC learning. the error of algorithm A at sample size n is ERR_n(A, H) = sup{ E_{S∼μ^n} err(A(S)) : μ is H-realizable }. the class H is PAC learnable if there is an A such that lim_{n→∞} ERR_n(A, H) = 0.
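
ERR_n takes a supremum over all H-realizable distributions, which is not computable in general; in code one can only lower-bound it by a maximum over a finite list of candidate distributions. A sketch of that idea, reusing estimate_err from the sketch above (candidate_mus is a hypothetical list of samplers):

    def lower_bound_ERR_n(A, candidate_mus, n, trials=200):
        # max over finitely many candidate mus of E_{S ~ mu^n} err(A(S));
        # this only lower-bounds the sup in the definition of ERR_n(A, H)
        worst = 0.0
        for sample_mu in candidate_mus:
            total = 0.0
            for _ in range(trials):
                S = [sample_mu() for _ in range(n)]   # S ~ mu^n
                total += estimate_err(A(S), sample_mu, trials=2_000)
            worst = max(worst, total / trials)
        return worst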

  7. VC theory. theorem [Vapnik-Chervonenkis, Blumer-Ehrenfeucht-Haussler-Warmuth, ...]: H is PAC learnable ⇔ the VC dimension of H is finite.

  8. learning curve [Schuurmans]. the error "should" decrease as more examples are seen; this improvement is important (to predict, to estimate, ...).

  9. rates. usually μ is unknown but fixed, and we want a definition that captures this. the rate of algorithm A with respect to μ is rate(n) = rate_{A,μ}(n) = E_S err(A(S)), where err(h) = Pr_{(x,y)∼μ}[h(x) ≠ y] and |S| = n.
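
For a fixed μ, rate_{A,μ}(n) is just an expectation and can be estimated by Monte Carlo. A sketch, again reusing estimate_err from above:

    def estimate_rate(A, sample_mu, n, trials=500):
        # rate_{A,mu}(n) = E_S err(A(S)) with S ~ mu^n and |S| = n
        total = 0.0
        for _ in range(trials):
            S = [sample_mu() for _ in range(n)]
            total += estimate_err(A(S), sample_mu, trials=2_000)
        return total / trials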

  10. VC classes. thm: the upper envelope ≈ VC/n [Vapnik-Chervonenkis, Blumer-Ehrenfeucht-Haussler-Warmuth, ...]. experiments: rate(n) ≈ exp(−n) for fixed μ [Cohn-Tesauro].
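
A toy reconstruction of such an experiment (ours, not Cohn-Tesauro's actual setup): if μ is uniform over finitely many points and the learner memorizes the sample, the error equals the mass of points not yet seen, which decays exponentially in n, here like (4/5)^n.

    import random

    def toy_learning_curve(ns=(1, 5, 10, 20, 40), trials=2_000):
        # mu: uniform over X = {0,...,4}, labels y = parity of x (realizable)
        pts = [(x, x % 2) for x in range(5)]
        for n in ns:
            total = 0.0
            for _ in range(trials):
                seen = {x for x, _ in (random.choice(pts) for _ in range(n))}
                # the memorizer errs exactly on the mass of unseen points
                total += sum(1 for x, _ in pts if x not in seen) / len(pts)
            print(n, total / trials)   # expected value: (4/5)^n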

  11. rate of a class. R : ℕ → [0,1] is a rate function. the class H has rate ≤ R if ∃A ∀μ ∃C ∀n: E err(A(S)) < C·R(n/C). the class H has rate ≥ R if ∃C ∀A ∃μ: E err(A(S)) > R(Cn)/C for ∞ many n. the class H has rate R if both.

  12. rates: comments. rate ≤ R if ∃A ∀μ ∃C ∀n: E err(A(S)) < C·R(n/C). the algorithm A does not know the distribution μ; the "complexity" of μ is captured by the delay factor C = C(μ).

  13. trichotomy theorem*. the rate of H can be – exponential (e^{−n}) – linear (1/n) – arbitrarily slow (for every R → 0, at least R). *realizable case, |H| > 2, standard measurability assumptions.

  14. trichotomy: comments. e.g., rate 2^{−√n} is not an option. Schuurmans proved a special case (a dichotomy for chains). the higher the complexity of H, the slower the rate; the complexity is characterized by "shattering capabilities".

  15. exponential rate. proposition: the rate of H is exponential iff H does not shatter an infinite Littlestone tree.

  16. exponential rate. lower bound: if |H| > 2 then the rate is ≥ e^{−n}. upper bound: if H does not shatter an infinite Littlestone tree then the rate is ≤ e^{−n}: ∃A ∀μ ∃C ∀n: E err(A(S)) < C·e^{−n/C}.

  17. exponential rate. lower bound: if |H| > 2 then the rate is ≥ e^{−n}. upper bound: if H does not shatter an infinite Littlestone tree then the rate is ≤ e^{−n}: ∃A ∀μ ∃C ∀n: E err(A(S)) < C·e^{−n/C}. need: no tree ⇒ algorithm.

  18. duality (LP, games, ...). no tree ⇒ algorithm. simplest example: no point in the intersection of two convex bodies ⇒ there is a separating hyperplane.

  19. duality (LP, games, ...). no tree ⇒ algorithm. simplest example: no point in the intersection of two convex bodies ⇒ there is a separating hyperplane. duality for Gale-Stewart games: one of the players has a winning strategy.

  20. duality (LP, games, ...). no tree ⇒ algorithm. simplest example: no point in the intersection of two convex bodies ⇒ there is a separating hyperplane. duality for Gale-Stewart games: one of the players has a winning strategy. problem: how complex is this strategy?

  21. measurability. the value of a position is an ordinal; it measures "how many steps to victory" (n steps to mate [Evans, Hamkins]).

  22. measurability. the value of a position is an ordinal; it measures "how many steps to victory" (n steps to mate [Evans, Hamkins]). the Littlestone dimension of H is the ordinal LD(H) = 0 if |H| = 1; ∞ if H has an ∞ tree; sup_{x∈X} min_{y∈{0,1}} (LD(H_{x↦y}) + 1) otherwise.

  23. measurability. the value of a position is an ordinal; it measures "how many steps to victory" (n steps to mate [Evans, Hamkins]). the Littlestone dimension of H is the ordinal LD(H) = 0 if |H| = 1; ∞ if H has an ∞ tree; sup_{x∈X} min_{y∈{0,1}} (LD(H_{x↦y}) + 1) otherwise. theorem (relies on [Kunen-Martin]): if H is measurable* then LD(H) is countable.
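
For a finite class over a finite domain, the recursion defining LD can be run directly (ordinals and the ∞ case do not arise there). A sketch under that assumption, with hypotheses encoded as tuples of labels indexed by the points of X:

    def littlestone_dim(H, X):
        # LD(H) = 0 if |H| <= 1; otherwise 1 + the min over the two label
        # restrictions H_{x -> y}, maximized over points x that split H
        if len(H) <= 1:
            return 0
        best = 0
        for i in range(len(X)):
            H0 = frozenset(h for h in H if h[i] == 0)
            H1 = frozenset(h for h in H if h[i] == 1)
            if H0 and H1:   # x = X[i] splits H, so the adversary can use it
                best = max(best, 1 + min(littlestone_dim(H0, X),
                                         littlestone_dim(H1, X)))
        return best

On a chain of 8 threshold functions over 7 points, for instance, this should return ⌊log₂ 8⌋ = 3, matching the binary-search mistake bound for chains.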

  24. summary. learning rates capture distribution-specific performance. there are 3 possible learning rates in the realizable case. the rate is characterized by shattering capabilities: – shattering ⇒ a hard distribution, via a construction – no shattering ⇒ an algorithm, via duality. the complexity of the algorithm is controlled via ordinals, etc.

  25. to do. the agnostic case. accurate bounds on rates. applications of the shattering framework.
