  1. Mixability in Statistical Learning. Tim van Erven. Joint work with Peter Grünwald, Mark Reid, Bob Williamson. SMILE Seminar, 24 September 2012

  2. Summary • Stochastic mixability ⇒ fast rates of convergence in different settings: • statistical learning (margin condition) • sequential prediction (mixability)

  3. Outline • Part 1: Statistical learning • Stochastic mixability (definition) • Equivalence to margin condition • Part 2: Sequential prediction • Part 3: Convexity interpretation for stochastic mixability • Part 4: Grünwald’s idea for adaptation to the margin

  4. Notation

  5. Notation • Data: (X_1, Y_1), ..., (X_n, Y_n) • Predict Y from X: F = {f : X → A} • Loss: ℓ : Y × A → [0, ∞]

  6. Notation • Data: (X_1, Y_1), ..., (X_n, Y_n) • Predict Y from X: F = {f : X → A} • Loss: ℓ : Y × A → [0, ∞] • Classification: Y = {0, 1}, A = {0, 1}, ℓ(y, a) = 0 if y = a, 1 if y ≠ a

  7. Notation • Data: (X_1, Y_1), ..., (X_n, Y_n) • Predict Y from X: F = {f : X → A} • Loss: ℓ : Y × A → [0, ∞] • Classification: Y = {0, 1}, A = {0, 1}, ℓ(y, a) = 0 if y = a, 1 if y ≠ a • Density estimation: A = density functions on Y, ℓ(y, p) = −log p(y)

  8. Notation • Data: (X_1, Y_1), ..., (X_n, Y_n) • Predict Y from X: F = {f : X → A} • Loss: ℓ : Y × A → [0, ∞] • Classification: Y = {0, 1}, A = {0, 1}, ℓ(y, a) = 0 if y = a, 1 if y ≠ a • Density estimation: A = density functions on Y, ℓ(y, p) = −log p(y) • Without X: F ⊂ A

  9. Statistical Learning

  10. Statistical Learning • (X_1, Y_1), ..., (X_n, Y_n) iid ∼ P* • f* = arg min_{f ∈ F} E[ℓ(Y, f(X))] • d(f̂, f*) = E[ℓ(Y, f̂(X)) − ℓ(Y, f*(X))]

  11. Statistical Learning • (X_1, Y_1), ..., (X_n, Y_n) iid ∼ P* • f* = arg min_{f ∈ F} E[ℓ(Y, f(X))] • d(f̂, f*) = E[ℓ(Y, f̂(X)) − ℓ(Y, f*(X))]

  12. Statistical Learning • (X_1, Y_1), ..., (X_n, Y_n) iid ∼ P* • f* = arg min_{f ∈ F} E[ℓ(Y, f(X))] • d(f̂, f*) = E[ℓ(Y, f̂(X)) − ℓ(Y, f*(X))] = O(n^{−?})

  13. Statistical Learning • (X_1, Y_1), ..., (X_n, Y_n) iid ∼ P* • f* = arg min_{f ∈ F} E[ℓ(Y, f(X))] • d(f̂, f*) = E[ℓ(Y, f̂(X)) − ℓ(Y, f*(X))] = O(n^{−?}) • Two factors determine the rate of convergence: 1. the complexity of F, 2. the margin condition

  14. Definition of Stochastic Mixability • Let η ≥ 0. Then (ℓ, F, P*) is η-stochastically mixable if there exists an f* ∈ F such that E[e^{−η ℓ(Y, f(X))} / e^{−η ℓ(Y, f*(X))}] ≤ 1 for all f ∈ F • Stochastically mixable: this holds for some η > 0

  15. Immediate Consequences • E[e^{−η ℓ(Y, f(X))} / e^{−η ℓ(Y, f*(X))}] ≤ 1 for all f ∈ F • f* minimizes the risk over F: f* = arg min_{f ∈ F} E[ℓ(Y, f(X))] • The larger η, the stronger the property of being η-stochastically mixable

  16. Density estimation example 1 • Log-loss: ℓ(y, p) = −log p(y), F = {p_θ | θ ∈ Θ} • Suppose the true density is p_{θ*} ∈ F • Then for η = 1 and any p_θ ∈ F: E[e^{−η ℓ(Y, p_θ)} / e^{−η ℓ(Y, p_{θ*})}] = ∫ (p_θ(y) / p_{θ*}(y)) P*(dy) = ∫ p_θ(y) dy = 1, since P* has density p_{θ*}
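
A minimal numerical illustration of this identity (not on the slides; the Bernoulli family and parameter values are made up): for a discrete Y the expectation is a finite sum, so it can be evaluated exactly.

```python
# Exact check of stochastic mixability for log-loss with eta = 1,
# using a hypothetical Bernoulli family (illustration only).
theta_star = 0.3                      # true parameter: Y ~ Bernoulli(0.3)
p_star = {0: 1 - theta_star, 1: theta_star}

for theta in [0.1, 0.5, 0.9]:         # arbitrary alternative densities p_theta
    p = {0: 1 - theta, 1: theta}
    # E[ exp(-l(Y,p_theta)) / exp(-l(Y,p_theta_star)) ] = E[ p_theta(Y) / p_theta_star(Y) ]
    ratio = sum(p_star[y] * p[y] / p_star[y] for y in (0, 1))
    print(f"theta = {theta}: E[p_theta(Y)/p_theta_star(Y)] = {ratio}")  # always 1.0
```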

  17. Density estimation example 2

  18. Density estimation example 2 • Normal location family with fixed variance σ²: P* = N(μ*, τ²), F = {N(μ, σ²) | μ ∈ ℝ} • η-stochastically mixable for η = σ²/τ²: E[e^{−η ℓ(Y, p_μ)} / e^{−η ℓ(Y, p_{μ*})}] = ∫ (1/√(2πτ²)) e^{−(η/(2σ²))(y − μ)² + (η/(2σ²))(y − μ*)² − (1/(2τ²))(y − μ*)²} dy = ∫ (1/√(2πτ²)) e^{−(1/(2τ²))(y − μ)²} dy = 1

  19. Density estimation example 2 • Normal location family with fixed variance σ²: P* = N(μ*, τ²), F = {N(μ, σ²) | μ ∈ ℝ} • η-stochastically mixable for η = σ²/τ²: E[e^{−η ℓ(Y, p_μ)} / e^{−η ℓ(Y, p_{μ*})}] = ∫ (1/√(2πτ²)) e^{−(η/(2σ²))(y − μ)² + (η/(2σ²))(y − μ*)² − (1/(2τ²))(y − μ*)²} dy = ∫ (1/√(2πτ²)) e^{−(1/(2τ²))(y − μ)²} dy = 1 • If f̂ is the empirical mean: E[d(f̂, f*)] = τ²/(2σ²n) = η^{−1}/(2n)
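
A Monte Carlo sanity check of both claims might look as follows (a sketch, not from the slides; the particular values of μ*, τ and σ are made up). It estimates the expectation at η = σ²/τ² and compares the empirical-mean estimator's excess risk to 1/(2ηn); the closed form d(p_μ̂, p_{μ*}) = (μ̂ − μ*)²/(2σ²) follows from the same Gaussian calculation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values for illustration: P* = N(mu_star, tau^2), model variance sigma^2.
mu_star, tau, sigma = 1.0, 0.5, 1.0
eta = sigma**2 / tau**2                       # eta = 4

# Check E[ exp(-eta * (l(Y, p_mu) - l(Y, p_mu_star))) ] = 1 for an arbitrary mu.
y = rng.normal(mu_star, tau, size=1_000_000)
mu = 1.5
loss_diff = ((y - mu)**2 - (y - mu_star)**2) / (2 * sigma**2)
print(np.exp(-eta * loss_diff).mean())        # ~ 1.0

# Check E[d(f_hat, f*)] = tau^2/(2 sigma^2 n) = 1/(2 eta n) for the empirical mean.
n = 50
samples = rng.normal(mu_star, tau, size=(100_000, n))
mu_hat = samples.mean(axis=1)
excess_risk = (mu_hat - mu_star)**2 / (2 * sigma**2)   # d(p_mu_hat, p_mu_star) in closed form
print(excess_risk.mean(), 1 / (2 * eta * n))           # both ~ 0.0025
```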

  20. Outline • Part 1: Statistical learning • Stochastic mixability (definition) • Equivalence to margin condition • Part 2: Sequential prediction • Part 3: Convexity interpretation for stochastic mixability • Part 4: Grünwald’s idea for adaptation to the margin

  21. Margin condition • c_0 V(f, f*)^κ ≤ d(f, f*) for all f ∈ F, where d(f, f*) = E[ℓ(Y, f(X)) − ℓ(Y, f*(X))], V(f, f*) = E[(ℓ(Y, f(X)) − ℓ(Y, f*(X)))²], κ ≥ 1, c_0 > 0 • For 0/1-loss, implies rate of convergence O(n^{−κ/(2κ−1)}) [Tsybakov, 2004] • So smaller κ is better
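
A concrete reading for 0/1-loss (a standard specialization, added here for clarity rather than taken from the slides): the excess loss takes values in {−1, 0, 1}, so its square is the indicator that f and f* disagree, and the condition relates disagreement probability to excess risk.

```latex
% 0/1-loss specialization of the margin condition (standard form, not from the slides).
% Since \ell(Y,f(X)) - \ell(Y,f^*(X)) \in \{-1,0,1\}, its square is the disagreement indicator:
\[
  V(f, f^*) = \Pr\bigl(f(X) \neq f^*(X)\bigr), \qquad
  d(f, f^*) = R(f) - R(f^*), \quad \text{where } R(f) = \mathbb{E}[\ell(Y, f(X))],
\]
\[
  \text{so the condition reads} \quad
  c_0 \Pr\bigl(f(X) \neq f^*(X)\bigr)^{\kappa} \le R(f) - R(f^*) \quad \text{for all } f \in \mathcal{F}.
\]
```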

  22. Stochastic mixability ⇔ margin condition • c_0 V(f, f*)^κ ≤ d(f, f*) for all f ∈ F • Thm [κ = 1]: Suppose ℓ takes values in [0, V]. Then (ℓ, F, P*) is stochastically mixable if and only if there exists c_0 > 0 such that the margin condition is satisfied with κ = 1.

  23. Margin condition with κ > 1 • F_ε = {f*} ∪ {f ∈ F | d(f, f*) ≥ ε} • Thm [all κ ≥ 1]: Suppose ℓ takes values in [0, V]. Then the margin condition is satisfied if and only if there exists a constant C > 0 such that, for all ε > 0, (ℓ, F_ε, P*) is η-stochastically mixable for η = C ε^{(κ−1)/κ}.

  24. Outline • Part 1: Statistical learning • Part 2: Sequential prediction • Part 3: Convexity interpretation for stochastic mixability • Part 4: Grünwald’s idea for adaptation to the margin

  25. Sequential Prediction with Expert Advice • For rounds t = 1, ..., n: • K experts predict f̂_t^1, ..., f̂_t^K • Predict by choosing f̂_t • Observe (x_t, y_t) • Regret = (1/n) Σ_{t=1}^n ℓ(y_t, f̂_t(x_t)) − min_k (1/n) Σ_{t=1}^n ℓ(y_t, f̂_t^k(x_t)) • Game-theoretic (minimax) analysis: want to guarantee small regret against adversarial data

  26. Sequential Prediction with Expert Advice • For rounds t = 1, ..., n: • K experts predict f̂_t^1, ..., f̂_t^K • Predict by choosing f̂_t • Observe (x_t, y_t) • Regret = (1/n) Σ_{t=1}^n ℓ(y_t, f̂_t(x_t)) − min_k (1/n) Σ_{t=1}^n ℓ(y_t, f̂_t^k(x_t)) • Worst-case regret = O(1/n) iff the loss is mixable! [Vovk, 1995]

  27. Mixability • A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that E_{A∼π}[e^{−η ℓ(y, A)} / e^{−η ℓ(y, a_π)}] ≤ 1 for all y • Vovk: fast O(1/n) rates if and only if the loss is mixable
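
As an illustration of how mixability is used algorithmically, here is a minimal sketch (not from the slides) of the exponential-weights aggregating strategy for log-loss, which is 1-mixable: predicting with the weighted mixture of the experts' densities keeps the cumulative regret below ln K, i.e. average regret O(1/n). The Bernoulli experts and the data sequence are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# K "experts", each a fixed Bernoulli density p_k(y) (hypothetical setup for illustration).
thetas = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
K = len(thetas)
ys = rng.binomial(1, 0.65, size=200)           # arbitrary data sequence

eta = 1.0                                       # log-loss is 1-mixable
log_weights = np.zeros(K)                       # log of exponential weights
learner_loss = 0.0
expert_loss = np.zeros(K)

for y in ys:
    w = np.exp(log_weights - log_weights.max())
    w /= w.sum()
    p_experts = np.where(y == 1, thetas, 1 - thetas)   # p_k(y) for each expert
    p_mix = w @ p_experts                       # mixture density = the "mixable" action a_pi
    learner_loss += -np.log(p_mix)              # learner's log-loss this round
    expert_loss += -np.log(p_experts)           # each expert's cumulative log-loss
    log_weights -= eta * (-np.log(p_experts))   # weight update: w_k proportional to exp(-eta * loss_k)

# Cumulative regret stays below ln K, so average regret is O(1/n).
print(learner_loss - expert_loss.min(), np.log(K))
```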

  28. (Stochastic) Mixability • A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that E_{A∼π}[e^{−η ℓ(y, A)} / e^{−η ℓ(y, a_π)}] ≤ 1 for all y • (ℓ, F, P*) is η-stochastically mixable if E_{X,Y∼P*}[e^{−η ℓ(Y, f(X))} / e^{−η ℓ(Y, f*(X))}] ≤ 1 for all f ∈ F

  29. (Stochastic) Mixability • A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that ℓ(y, a_π) ≤ −(1/η) ln ∫ e^{−η ℓ(y, a)} π(da) for all y
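
For log-loss with η = 1 the required action is explicit (a standard worked instance, not from the slides): the mixture density attains the bound with equality.

```latex
% Log-loss, eta = 1: take a_pi to be the mixture density p_pi = \int p \, \pi(\mathrm{d}p).
% Then the mixability bound holds with equality:
\[
  -\frac{1}{\eta} \ln \int e^{-\eta \ell(y, p)} \, \pi(\mathrm{d}p)
  = -\ln \int e^{\ln p(y)} \, \pi(\mathrm{d}p)
  = -\ln p_\pi(y)
  = \ell(y, p_\pi).
\]
```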

  30. (Stochastic) Mixability • A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that ℓ(y, a_π) ≤ −(1/η) ln ∫ e^{−η ℓ(y, a)} π(da) for all y • Thm: (ℓ, F, P*) is η-stochastically mixable iff for any distribution π on F there exists f* ∈ F such that E[ℓ(Y, f*(X))] ≤ E[−(1/η) ln ∫ e^{−η ℓ(Y, f(X))} π(df)]

  31. Equivalence of Stochastic Mixability and Ordinary Mixability

  32. Equivalence of Stochastic Mixability and Ordinary Mixability • F_full = {all functions from X to A} • Thm: Suppose ℓ is a proper loss and X is discrete. Then ℓ is η-mixable if and only if (ℓ, F_full, P*) is η-stochastically mixable for all P*.
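
For reference, since the theorem assumes it (standard definition, not stated on the slides): a loss over probabilistic predictions is proper when predicting the true distribution minimizes expected loss.

```latex
% Proper loss (standard definition, added for reference): for predictions that are
% distributions q over Y, the loss \ell is proper if, for every distribution p on Y,
\[
  \mathbb{E}_{Y \sim p}[\ell(Y, p)] \;\le\; \mathbb{E}_{Y \sim p}[\ell(Y, q)]
  \quad \text{for all distributions } q \text{ on } \mathcal{Y}.
\]
```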
