

  1. Some bias and a pinch of variance. Sara van de Geer, November 2, 2016. Joint work with: Andreas Elsener, Alan Muro, Jana Janková, Benjamin Stucky.

  2. ... this talk is about theory for machine learning algorithms ...

  3. ... this talk is about theory for machine learning algorithms ... ... for high-dimensional data ...

  4. ... it is about prediction performance of algorithms trained on random data ... it is not about the scripts used

  5. Outline: Problem statement. Concepts: sparsity, effective sparsity, margin curvature, triangle property. Detour: exact recovery. Norm-penalized empirical risk minimization. Adaptation.

  6. Outline: Problem statement. Concepts: sparsity, effective sparsity, margin curvature, triangle property. Detour: exact recovery. Norm-penalized empirical risk minimization. Adaptation.

  7. Problem: Let f : X → R, X ⊂ R^m. Find min_{x ∈ X} f(x).

  8. Problem: Let f : X → R, X ⊂ R^m. Find min_{x ∈ X} f(x). Severe problem: the function f is unknown!

  9. What we do know: f(x) = ∫ ℓ(x, y) dP(y) =: f_P(x), where ℓ(x, y) is a given "loss" function ℓ : X × Y → R, and P is an unknown probability measure on the space Y.

  10. Example: X := the persons you consider marrying; Y := possible states of the world; ℓ(x, y) := the loss when marrying x in world y; P := the distribution of possible states of the world; f(x) = ∫ ℓ(x, y) dP(y) is the "risk" of marrying x.

  11. Let Q be a given probability measure on Y. We replace P by Q: f_Q(x) := ∫ ℓ(x, y) dQ(y), and estimate x_P := arg min_{x ∈ X} f_P(x) by x_Q := arg min_{x ∈ X} f_Q(x). Question: How "good" is this estimate?

  12. [Figure: the empirical risk f_Q(x) and the theoretical risk f_P(x) plotted against x, with their minimizers x_Q and x_P; the vertical gap f_P(x_Q) − f_P(x_P) is the excess risk.]

  13. Question: Is x_Q close to x_P? Is f(x_Q) close to f(x_P)?

  14. ... in our setup ... we have to regularize: accept some bias to reduce variance

  15. Our setup: Q corresponds to a sample Y_1, ..., Y_n from P; n := sample size. Thus f_Q(x) := f̂_n(x) = (1/n) Σ_{i=1}^n ℓ(x, Y_i), x ∈ X ⊂ R^m (a random function).
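To make this concrete, here is a minimal numerical sketch of empirical risk minimization in this setup; the squared loss, the Gaussian sample, and the grid of candidate values are illustrative assumptions, not part of the talk.

    import numpy as np

    rng = np.random.default_rng(0)

    def loss(x, y):
        # squared loss l(x, y) = (y - x)^2 (an illustrative choice of loss)
        return (y - x) ** 2

    # sample Y_1, ..., Y_n from P; here P = N(0.3, 1) is assumed for the demo
    n = 200
    Y = rng.normal(loc=0.3, scale=1.0, size=n)

    # empirical risk f_n(x) = (1/n) * sum_i loss(x, Y_i), evaluated on a grid of x values
    X_grid = np.linspace(-1.0, 1.5, 501)
    f_n = np.array([loss(x, Y).mean() for x in X_grid])

    # empirical risk minimizer x_Q; the population minimizer x_P is 0.3 by construction
    x_Q = X_grid[np.argmin(f_n)]
    print(f"x_Q = {x_Q:.3f}, x_P = 0.300")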

  16. Number of parameters: m. Number of observations: n. High-dimensional statistics: m ≫ n.

  17. DATA Y_1, ..., Y_n → x̂ ∈ R^m (the estimator).

  18. In our setup with m ≫ n we need to regularize. That is: accept some bias to be able to reduce the variance.

  19. Regularized empirical risk minimization. Target: x_P := x^0 = arg min_{x ∈ X ⊂ R^m} f_P(x) (the unobservable risk). Estimator based on the sample: x_Q := x̂ := arg min_{x ∈ X ⊂ R^m} { f_Q(x) + pen(x) } (empirical risk plus a regularization penalty).

  20. Example: Let Z ∈ R^{n×m} be a given design matrix and b^0 ∈ R^n an unobserved vector. Let ||v||²_2 := Σ_{i=1}^n v_i² and x^0 ∈ arg min_{x ∈ R^m} ||b^0 − Zx||²_2 (the minimized expression is f_P(x)). Sample: Y = b^0 + ε, with ε ∈ R^n noise. "Lasso" with "tuning parameter" λ ≥ 0: x̂ := arg min_{x ∈ R^m} { ||Y − Zx||²_2 + 2λ ||x||_1 }, where ||Y − Zx||²_2 is f_Q(x) and ||x||_1 := Σ_{j=1}^m |x_j|. n := number of observations, m := number of parameters. High-dimensional: m ≫ n.
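A minimal sketch of computing this Lasso estimator; the use of proximal gradient descent (ISTA) with soft-thresholding, the step size, and the toy data are my assumptions, not the solver used in the talk.

    import numpy as np

    def soft_threshold(v, t):
        # proximal operator of t * ||.||_1: componentwise soft-thresholding
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_ista(Z, Y, lam, n_iter=2000):
        # minimize ||Y - Z x||_2^2 + 2 * lam * ||x||_1 by proximal gradient (ISTA)
        x = np.zeros(Z.shape[1])
        step = 1.0 / (2.0 * np.linalg.norm(Z, ord=2) ** 2)  # 1 / Lipschitz constant of the gradient
        for _ in range(n_iter):
            grad = 2.0 * Z.T @ (Z @ x - Y)                   # gradient of ||Y - Z x||_2^2
            x = soft_threshold(x - step * grad, 2.0 * lam * step)
        return x

    # tiny high-dimensional demo: n = 20 observations, m = 100 parameters, s_0 = 3 active
    rng = np.random.default_rng(1)
    n, m, s0 = 20, 100, 3
    Z = rng.normal(size=(n, m))
    x0 = np.zeros(m)
    x0[:s0] = 1.0
    Y = Z @ x0 + 0.1 * rng.normal(size=n)
    x_hat = lasso_ista(Z, Y, lam=1.0)
    print("estimated active set:", np.flatnonzero(np.abs(x_hat) > 1e-6))

The printed set typically contains the true active coordinates {0, 1, 2}, possibly together with a few small spurious ones, which previews the sparsity notions of the next slide.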

  21. Definition: We call j an active parameter if (roughly speaking) x^0_j ≠ 0. We say x^0 is sparse if the number of active parameters is small. We write the active set of x^0 as S_0 := { j : x^0_j ≠ 0 }. We call s_0 := |S_0| the sparsity of x^0.

  22. Goal: derive oracle inequalities for norm-penalized empirical risk minimizers. Oracle: an estimator that knows the "true" sparsity. Oracle inequalities: adaptation to unknown sparsity.

  23. Benchmark. Low-dimensional: x̂ = arg min_{x ∈ X ⊂ R^m} f̂_n(x). Then typically f_P(x̂) − f_P(x^0) ∼ m/n = (number of parameters) / (number of observations). High-dimensional: x̂ = arg min_{x ∈ X ⊂ R^m} { f̂_n(x) + pen(x) }. The aim is adaptation: f_P(x̂) − f_P(x^0) ∼ s_0/n = (number of active parameters) / (number of observations).
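A quick simulation of the low-dimensional m/n benchmark; it assumes squared error risk normalized by n, a Gaussian design, unit noise variance, and least squares as the empirical risk minimizer, in which case the expected excess risk is exactly m/n.

    import numpy as np

    rng = np.random.default_rng(2)
    n, m, reps = 200, 10, 500
    excess = np.empty(reps)
    for r in range(reps):
        Z = rng.normal(size=(n, m))
        x0 = rng.normal(size=m)
        Y = Z @ x0 + rng.normal(size=n)                # unit noise variance
        x_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)  # least squares = empirical risk minimizer
        # excess risk f_P(x_hat) - f_P(x0) = ||Z (x_hat - x0)||_2^2 / n for this loss
        excess[r] = np.sum((Z @ (x_hat - x0)) ** 2) / n
    print(f"average excess risk = {excess.mean():.4f}, m/n = {m / n:.4f}")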

  24. Outline: Problem statement. Concepts: sparsity, effective sparsity, margin curvature, triangle property. Detour: exact recovery. Norm-penalized empirical risk minimization. Adaptation.

  25. Exact recovery. Let Z ∈ R^{n×m} and b^0 ∈ R^n be given with m ≫ n. Consider the system Zx^0 = b^0 of n equations with m unknowns. Basis pursuit: x* := arg min_{x ∈ R^m} { ||x||_1 : Zx = b^0 }.
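Basis pursuit is a linear program after splitting x = u − v with u, v ≥ 0; a minimal sketch, where the choice of scipy's LP solver and the toy data are assumptions on my part.

    import numpy as np
    from scipy.optimize import linprog

    def basis_pursuit(Z, b0):
        # min ||x||_1 subject to Z x = b0, written as an LP in (u, v) with x = u - v, u, v >= 0
        n, m = Z.shape
        c = np.ones(2 * m)                # objective: sum(u) + sum(v) = ||x||_1
        A_eq = np.hstack([Z, -Z])         # equality constraint: Z u - Z v = b0
        res = linprog(c, A_eq=A_eq, b_eq=b0, bounds=(0, None), method="highs")
        u, v = res.x[:m], res.x[m:]
        return u - v

    # toy example: a 3-sparse x0 that should be recovered from n < m noiseless equations
    rng = np.random.default_rng(3)
    n, m = 30, 80
    Z = rng.normal(size=(n, m))
    x0 = np.zeros(m)
    x0[[2, 17, 40]] = [1.5, -2.0, 0.7]
    x_star = basis_pursuit(Z, Z @ x0)
    print("max recovery error:", np.max(np.abs(x_star - x0)))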

  26. Notation. Active set: S_0 := { j : x^0_j ≠ 0 }. Sparsity: s_0 := |S_0|. Effective sparsity: Γ²_0 := s_0 / φ̂²(S_0) = max{ ||x_{S_0}||²_1 / ( ||Zx||²_2 / n ) : ||x_{−S_0}||_1 ≤ ||x_{S_0}||_1 }, where the constraint is the "cone condition" and φ̂²(S_0) is the compatibility constant.
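As a rough sanity check, Γ²_0 can be bounded from below by random search over the cone { x : ||x_{−S_0}||_1 ≤ ||x_{S_0}||_1 }; this only gives a crude lower bound, and the design matrix and sampling scheme below are illustrative.

    import numpy as np

    def effective_sparsity_lower_bound(Z, S0, n_draws=5000, seed=0):
        # crude Monte Carlo lower bound on
        #   Gamma_0^2 = max{ ||x_S0||_1^2 / (||Z x||_2^2 / n) : ||x_{-S0}||_1 <= ||x_S0||_1 }
        rng = np.random.default_rng(seed)
        n, m = Z.shape
        S0 = np.asarray(S0)
        Sc = np.setdiff1d(np.arange(m), S0)
        best = 0.0
        for _ in range(n_draws):
            x = np.zeros(m)
            x[S0] = rng.normal(size=S0.size)
            l1_S0 = np.abs(x[S0]).sum()
            # fill the off-S0 part and rescale it so that the cone condition holds
            w = rng.normal(size=Sc.size)
            x[Sc] = w * rng.uniform() * l1_S0 / np.abs(w).sum()
            best = max(best, l1_S0 ** 2 / (np.sum((Z @ x) ** 2) / n))
        return best

    rng = np.random.default_rng(4)
    Z = rng.normal(size=(50, 200))
    print("lower bound on Gamma_0^2:", effective_sparsity_lower_bound(Z, S0=[0, 1, 2]))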

  27. The compatibility constant is canonical correlation ... in the ℓ_1-world. The effective sparsity Γ²_0 is ≈ the sparsity s_0, but taking into account the correlation between the variables.

  28. Compatibility constant (in R²): [Figure: the columns Z_1, Z_2, ..., Z_m of Z, illustrating φ̂(1, {1}); here φ̂(S) = φ̂(1, S) for the case S = {1}.]

  29. Basis Pursuit. Z a given n × m matrix with m ≫ n. Let x^0 be the sparsest solution of Zx = b^0. Basis pursuit [Chen, Donoho and Saunders (1998)]: x* := arg min { ||x||_1 : Zx = b^0 }. Exact recovery: Γ(S_0) < ∞ ⇒ x* = x^0.

  30. Outline: Problem statement. Concepts: sparsity, effective sparsity, margin curvature, triangle property. Detour: exact recovery. Norm-penalized empirical risk minimization. Adaptation.

  31. General norms. Let Ω be a norm on R^m. The Ω-world.

  32. Norm-regularized empirical risk minimization: x_Q := x̂ := arg min_{x ∈ X ⊂ R^m} { f_Q(x) + λ Ω(x) } (empirical risk plus regularization penalty), where Ω is a given norm on R^m and λ > 0 is a tuning parameter.

  33. Examples of norms. ℓ_1-norm: Ω(x) = ||x||_1 := Σ_{j=1}^m |x_j|.

  34. Examples of norms. ℓ_1-norm: Ω(x) = ||x||_1 := Σ_{j=1}^m |x_j|. OSCAR: given λ̃ > 0, Ω(x) := Σ_{j=1}^p (λ̃ (j − 1) + 1) |x|_(j), where |x|_(1) ≥ ··· ≥ |x|_(p) [Bondell and Reich 2008].

  35. Examples of norms. ℓ_1-norm: Ω(x) = ||x||_1 := Σ_{j=1}^m |x_j|. OSCAR: given λ̃ > 0, Ω(x) := Σ_{j=1}^p (λ̃ (j − 1) + 1) |x|_(j), where |x|_(1) ≥ ··· ≥ |x|_(p) [Bondell and Reich 2008]. Sorted ℓ_1-norm: given λ_1 ≥ ··· ≥ λ_p > 0, Ω(x) := Σ_{j=1}^p λ_j |x|_(j), where |x|_(1) ≥ ··· ≥ |x|_(p) [Bogdan et al. 2013].
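All three norms are cheap to evaluate; a short sketch, in which the example vector and the weight sequences are illustrative choices.

    import numpy as np

    def l1_norm(x):
        # Omega(x) = sum_j |x_j|
        return np.abs(x).sum()

    def sorted_l1_norm(x, lam):
        # Omega(x) = sum_j lam_j * |x|_(j), with lam_1 >= ... >= lam_p > 0
        # and |x|_(1) >= ... >= |x|_(p) the absolute values in decreasing order
        return np.sort(np.abs(x))[::-1] @ np.asarray(lam)

    def oscar_norm(x, lam_tilde):
        # OSCAR is the sorted l1-norm with weights lam_j = lam_tilde * (j - 1) + 1
        return sorted_l1_norm(x, lam_tilde * np.arange(len(x)) + 1.0)

    x = np.array([0.5, -2.0, 0.0, 1.0])
    print(l1_norm(x), oscar_norm(x, 0.3), sorted_l1_norm(x, [2.0, 1.5, 1.0, 0.5]))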

  36. norms generated from cones: Ω(x) := min_{a ∈ A} (1/2) Σ_{j=1}^m ( x_j²/a_j + a_j ), A ⊂ R^m_+ [Micchelli et al. 2010], [Jenatton et al. 2011], [Bach et al. 2012]. [Figures: unit ball for the wedge norm, A = { a : a_1 ≥ a_2 ≥ ··· }, and unit ball for the group Lasso norm.]

  37. nuclear norm for matrices: X ∈ R^{m_1 × m_2}, Ω(X) := ||X||_nuclear := trace( √(X^T X) ).

  38. nuclear norm for matrices: X ∈ R^{m_1 × m_2}, Ω(X) := ||X||_nuclear := trace( √(X^T X) ). nuclear norm for tensors: X ∈ R^{m_1 × m_2 × m_3}, Ω(X) := dual norm of Ω_*, where Ω_*(W) := max_{||u_1||_2 = ||u_2||_2 = ||u_3||_2 = 1} trace( W^T (u_1 ⊗ u_2 ⊗ u_3) ), W ∈ R^{m_1 × m_2 × m_3} [Yuan and Zhang 2014].
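For matrices the nuclear norm is the sum of the singular values, which matches the trace(√(X^T X)) definition above; a minimal numerical check on an arbitrary test matrix (the tensor case requires more work and is not attempted here).

    import numpy as np
    from scipy.linalg import sqrtm

    def nuclear_norm(X):
        # ||X||_nuclear = sum of the singular values of X
        return np.linalg.svd(X, compute_uv=False).sum()

    X = np.random.default_rng(5).normal(size=(4, 3))
    # both numbers agree: the SVD-based value and the trace(sqrt(X^T X)) definition
    print(nuclear_norm(X), np.trace(sqrtm(X.T @ X)).real)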

  39. Some concepts. Let ḟ_P(x) := ∂ f_P(x) / ∂x. The Bregman divergence is D(x ‖ x̂) = f_P(x) − f_P(x̂) − ḟ_P(x̂)^T (x − x̂). [Figure: f_P and its tangent at x̂; the gap at x is D(x ‖ x̂).] Definition (property of f_P): We have margin curvature G if D(x* ‖ x̂) ≥ G( τ(x* − x̂) ).
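A small sketch of the Bregman divergence for the quadratic risk of the Lasso example, f_P(x) = ||b^0 − Zx||²_2, where it reduces to ||Z(x − x̂)||²_2; the data below are illustrative.

    import numpy as np

    def f_P(x, Z, b0):
        # quadratic risk from the Lasso example: f_P(x) = ||b0 - Z x||_2^2
        return np.sum((b0 - Z @ x) ** 2)

    def grad_f_P(x, Z, b0):
        # gradient of f_P at x
        return -2.0 * Z.T @ (b0 - Z @ x)

    def bregman(x, x_hat, Z, b0):
        # D(x || x_hat) = f_P(x) - f_P(x_hat) - grad f_P(x_hat)^T (x - x_hat)
        return f_P(x, Z, b0) - f_P(x_hat, Z, b0) - grad_f_P(x_hat, Z, b0) @ (x - x_hat)

    rng = np.random.default_rng(6)
    Z = rng.normal(size=(10, 4))
    b0 = rng.normal(size=10)
    x, x_hat = rng.normal(size=4), rng.normal(size=4)
    # for this quadratic f_P the two printed numbers coincide: D(x || x_hat) = ||Z (x - x_hat)||_2^2
    print(bregman(x, x_hat, Z, b0), np.sum((Z @ (x - x_hat)) ** 2))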

  40. Definition (property of Ω): The triangle property holds at x* if there exist semi-norms Ω^+ and Ω^- such that Ω(x*) − Ω(x) ≤ Ω^+(x − x*) − Ω^-(x). Definition: The effective sparsity at x* is Γ²_*(L) := max{ ( Ω^+(x) / τ(x) )² : Ω^-(x) ≤ L Ω^+(x) }, where the constraint is the "cone condition" and L ≥ 1 is a stretching factor.

  41. Outline: Problem statement. Concepts: sparsity, effective sparsity, margin curvature, triangle property. Detour: exact recovery. Norm-penalized empirical risk minimization. Adaptation.

  42. Norm-regularized empirical risk minimization: x_Q := x̂ := arg min_{x ∈ X ⊂ R^m} { f_Q(x) + λ Ω(x) } (empirical risk plus regularization penalty), where Ω is a given norm on R^m and λ > 0 is a tuning parameter.

  43. A sharp oracle inequality. Theorem [vdG, 2016]: Let λ > λ_ε ≥ Ω_*( (ḟ_Q − ḟ_P)(x̂) ), where Ω_* is the dual norm; λ_ε measures how close Q is to P (i.e. it removes most of the variance). Define λ̲ := λ − λ_ε, λ̄ := λ + λ_ε and L := λ̄/λ̲, and let H be the convex conjugate of G. Then (recall x̂ = x_Q, x^0 = x_P): f_P(x̂) − f_P(x^0) ≤ min_{x* ∈ X} { f_P(x*) − f_P(x^0) + H( λ̄ Γ_*(L) ) }, where the first term is the "bias" and the second a pinch of "variance". That is: adaptation.
