Some bias and a pinch of variance. Sara van de Geer, November 2, 2016.



SLIDE 1

Some bias and a pinch of variance

Sara van de Geer, November 2, 2016. Joint work with: Andreas Elsener, Alan Muro, Jana Janková, Benjamin Stucky

SLIDE 2

... this talk is about theory for machine learning algorithms ...

SLIDE 3

... this talk is about theory for machine learning algorithms ... ... for high-dimensional data ...

SLIDE 4

... it is about prediction performance of algorithms trained on random data ... it is not about the scripts used

SLIDE 5

  • Problem statement
  • Detour: exact recovery
  • Norm penalized empirical risk minimization
  • Adaptation

Concepts:
  • Sparsity
  • Effective sparsity
  • Margin curvature
  • Triangle property

SLIDE 6

  • Problem statement
  • Detour: exact recovery
  • Norm penalized empirical risk minimization
  • Adaptation

Concepts:
  • Sparsity
  • Effective sparsity
  • Margin curvature
  • Triangle property

SLIDE 7

Problem: Let f : X → R, X ⊂ R^m. Find

  min_{x ∈ X} f(x)

SLIDE 8

Problem: Let f : X → R, X ⊂ R^m. Find

  min_{x ∈ X} f(x)

Severe Problem: The function f is unknown!

SLIDE 9

What we do know:

  f(x) = ∫ ℓ(x, y) dP(y) =: f_P(x)

where

  • ℓ(x, y) is a given "loss" function: ℓ : X × Y → R
  • P is an unknown probability measure on the space Y

SLIDE 10

Example

  • X := the persons you consider marrying
  • Y := possible states of the world
  • ℓ(x, y) := the loss when marrying x in world y
  • P := the distribution of possible states of the world
  • f(x) = ∫ ℓ(x, y) dP(y), the "risk" of marrying x
SLIDE 11

Let Q be a given probability measure on Y. We replace P by Q:

  f_Q(x) := ∫ ℓ(x, y) dQ(y)

and estimate

  x_P := arg min_{x ∈ X} f_P(x)

by

  x_Q := arg min_{x ∈ X} f_Q(x)

Question: How "good" is this estimate?

SLIDE 12

[Figure: empirical risk f_Q and theoretical risk f_P as functions of the parameter, with their minimizers x_Q and x_P and the excess risk indicated.]

SLIDE 13

Question: Is x_Q close to x_P? Is f(x_Q) close to f(x_P)?

SLIDE 14

... in our setup ... we have to regularize: accept some bias to reduce variance

SLIDE 15

Our setup: Q corresponds to a sample Y_1, ..., Y_n from P; n := sample size. Thus

  f_Q(x) := f̂_n(x) = (1/n) ∑_{i=1}^n ℓ(x, Y_i),   x ∈ X ⊂ R^m

(a random function)
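
As a concrete illustration (not from the slides): a minimal numerical sketch of this plug-in idea, assuming a squared-error loss ℓ(x, y) = (y − x)^2 with a scalar parameter and a hypothetical P = N(2, 1). The empirical risk f̂_n averages the loss over the sample, and its minimizer estimates the minimizer of the unknown risk f_P.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: P = N(2, 1) and squared-error loss l(x, y) = (y - x)^2,
# so f_P(x) = (x - 2)^2 + 1 with minimizer x_P = 2.
y_sample = rng.normal(loc=2.0, scale=1.0, size=50)   # Y_1, ..., Y_n drawn from P

def empirical_risk(x, y):
    """f_Q(x) = f_n(x) = (1/n) * sum_i l(x, Y_i) for the squared-error loss."""
    return float(np.mean((y - x) ** 2))

# Minimize the empirical risk over a grid X and compare with x_P = 2.
grid = np.linspace(0.0, 4.0, 401)
x_Q = grid[np.argmin([empirical_risk(x, y_sample) for x in grid])]
print("x_Q =", x_Q)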

SLIDE 16

m := number of parameters, n := number of observations.

High-dimensional statistics: m ≫ n

SLIDE 17

DATA Y_1, ..., Y_n   →   x̂ ∈ R^m

SLIDE 18

In our setup with m ≫ n we need to regularize. That is: accept some bias to be able to reduce the variance.

SLIDE 19

Regularized empirical risk minimization

Target:

  x_P := x^0 = arg min_{x ∈ X ⊂ R^m} f_P(x)      (the unobservable risk)

Estimator based on the sample:

  x_Q := x̂ := arg min_{x ∈ X ⊂ R^m} { f_Q(x) + pen(x) }

where f_Q(x) is the empirical risk and pen(x) the regularization penalty.

SLIDE 20

Example:

Let Z ∈ R^{n×m} be a given design matrix and b^0 ∈ R^n an unobserved vector. Let ‖v‖_2^2 := ∑_{i=1}^n v_i^2 and

  x^0 ∈ arg min_{x ∈ R^m} f_P(x) := ‖b^0 − Zx‖_2^2

Sample: Y = b^0 + ε, with ε ∈ R^n noise.

"Lasso" with "tuning parameter" λ ≥ 0:

  x̂ := arg min_{x ∈ R^m} { ‖Y − Zx‖_2^2 + 2λ ‖x‖_1 },   where ‖x‖_1 := ∑_{j=1}^m |x_j|

Here ‖Y − Zx‖_2^2 plays the role of f_Q(x); n := number of observations, m := number of parameters.

High-dimensional: m ≫ n
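
A minimal sketch of this estimator on simulated data (dimensions, sparsity and noise level below are illustrative choices, not from the slides), using scikit-learn's Lasso. Note that scikit-learn minimizes ‖Y − Zx‖_2^2/(2n) + α‖x‖_1, so its α corresponds to the slide's tuning parameter only up to normalization.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, m, s0 = 50, 200, 5                      # m >> n, sparsity s0

Z = rng.normal(size=(n, m))                # design matrix
x0 = np.zeros(m)
x0[:s0] = 1.0                              # s0 active parameters
Y = Z @ x0 + 0.5 * rng.normal(size=n)      # Y = b0 + noise, with b0 = Z x0

lam = np.sqrt(np.log(m) / n)               # tuning parameter of order sqrt(log m / n)
fit = Lasso(alpha=lam, fit_intercept=False).fit(Z, Y)
x_hat = fit.coef_

print("estimated active set:", np.flatnonzero(np.abs(x_hat) > 1e-8))
print("l1 estimation error :", np.abs(x_hat - x0).sum())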

SLIDE 21

Definition. We call j an active parameter if (roughly speaking) x^0_j ≠ 0. We say x^0 is sparse if the number of active parameters is small. We write the active set of x^0 as

  S_0 := {j : x^0_j ≠ 0}

We call s_0 := |S_0| the sparsity of x^0.

SLIDE 22

Goal: derive oracle inequalities for norm-penalized empirical risk minimizers.

  • Oracle: an estimator that knows the "true" sparsity
  • Oracle inequalities: Adaptation to unknown sparsity

SLIDE 23

Benchmark

Low-dimensional: x̂ = arg min_{x ∈ X ⊂ R^m} f̂_n(x). Then typically

  f_P(x̂) − f_P(x^0) ∼ m/n = (number of parameters) / (number of observations)

High-dimensional: x̂ = arg min_{x ∈ X ⊂ R^m} { f̂_n(x) + pen(x) }. The aim is Adaptation:

  f_P(x̂) − f_P(x^0) ∼ s_0/n = (number of active parameters) / (number of observations)

SLIDE 24

  • Problem statement
  • Detour: exact recovery
  • Norm penalized empirical risk minimization
  • Adaptation

Concepts:
  • Sparsity
  • Effective sparsity
  • Margin curvature
  • Triangle property

SLIDE 25

Exact recovery

Let Z ∈ R^{n×m} and b^0 ∈ R^n be given, with m ≫ n. Consider the system

  Zx^0 = b^0

of n equations with m unknowns.

Basis pursuit:

  x* := arg min_{x ∈ R^m} { ‖x‖_1 : Zx = b^0 }

SLIDE 26

Notation

Active set: S_0 := {j : x^0_j ≠ 0}

Sparsity: s_0 := |S_0|

Effective sparsity:

  Γ_0^2 := s_0 / φ̂^2(S_0) = max { ‖x_{S_0}‖_1^2 / (‖Zx‖_2^2 / n) : ‖x_{−S_0}‖_1 ≤ ‖x_{S_0}‖_1 }

where ‖x_{−S_0}‖_1 ≤ ‖x_{S_0}‖_1 is the "cone condition" and φ̂^2(S_0) is the compatibility constant.
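
A small sketch (an illustration, not part of the slides) that evaluates the ratio appearing in this definition for one candidate direction x: it checks the cone condition ‖x_{−S_0}‖_1 ≤ ‖x_{S_0}‖_1 and returns ‖x_{S_0}‖_1^2 / (‖Zx‖_2^2/n). The effective sparsity itself is the maximum of this ratio over the whole cone, which the snippet does not attempt to compute.

import numpy as np

def cone_ratio(Z, x, S0):
    """Ratio ||x_{S0}||_1^2 / (||Z x||_2^2 / n) for one direction x in the cone."""
    n = Z.shape[0]
    mask = np.zeros(Z.shape[1], dtype=bool)
    mask[np.asarray(S0)] = True
    l1_on = np.abs(x[mask]).sum()           # ||x_{S0}||_1
    l1_off = np.abs(x[~mask]).sum()         # ||x_{-S0}||_1
    if l1_off > l1_on:
        raise ValueError("x violates the cone condition ||x_{-S0}||_1 <= ||x_{S0}||_1")
    return l1_on ** 2 / (np.sum((Z @ x) ** 2) / n)

# Example usage with a random design and a direction supported on S0
# (so the cone condition holds trivially).
rng = np.random.default_rng(2)
Z = rng.normal(size=(50, 200))
S0 = [0, 1, 2]
x = np.zeros(200)
x[S0] = [1.0, -1.0, 0.5]
print(cone_ratio(Z, x, S0))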

SLIDE 27

The compatibility constant is a canonical correlation ... in the ℓ1-world.

The effective sparsity Γ_0^2 is ≈ the sparsity s_0, but taking into account the correlation between the variables.

SLIDE 28

[Figure: the compatibility constant illustrated in R^2, using the columns Z_1, Z_2, ..., Z_m of Z; shown is φ̂(1, {1}).]

φ̂(S) = φ̂(1, S) for the case S = {1}

SLIDE 29

Basis Pursuit

Z a given n × m matrix with m ≫ n. Let x^0 be the sparsest solution of Zx = b^0.

Basis Pursuit [Chen, Donoho and Saunders (1998)]:

  x* := arg min { ‖x‖_1 : Zx = b^0 }

Exact recovery:

  Γ(S_0) < ∞  ⇒  x* = x^0
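
A minimal sketch of basis pursuit on a toy instance (sizes and sparsity below are illustrative, not from the slides): min { ‖x‖_1 : Zx = b^0 } is solved as a linear program by splitting x = x⁺ − x⁻ with x⁺, x⁻ ≥ 0 and calling scipy.optimize.linprog.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, m, s0 = 30, 100, 4

Z = rng.normal(size=(n, m))
x0 = np.zeros(m)
x0[rng.choice(m, s0, replace=False)] = rng.normal(size=s0)
b0 = Z @ x0                                 # n equations, m unknowns, m >> n

# min sum(x_plus + x_minus)  s.t.  Z (x_plus - x_minus) = b0,  x_plus, x_minus >= 0
c = np.ones(2 * m)
A_eq = np.hstack([Z, -Z])
res = linprog(c, A_eq=A_eq, b_eq=b0, bounds=[(0, None)] * (2 * m), method="highs")
x_star = res.x[:m] - res.x[m:]

print("max |x* - x0| =", np.max(np.abs(x_star - x0)))   # ~0 when exact recovery holds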

SLIDE 30

  • Problem statement
  • Detour: exact recovery
  • Norm penalized empirical risk minimization
  • Adaptation

Concepts:
  • Sparsity
  • Effective sparsity
  • Margin curvature
  • Triangle property

SLIDE 31

General norms

Let Ω be a norm on R^m

The Ω-world

SLIDE 32

Norm-regularized empirical risk minimization

  x_Q := x̂ := arg min_{x ∈ X ⊂ R^m} { f_Q(x) + λ Ω(x) }

where f_Q(x) is the empirical risk, λ Ω(x) the regularization penalty, Ω a given norm on R^m, and λ > 0 a tuning parameter.

SLIDE 33

Examples of norms

ℓ1-norm: Ω(x) = ‖x‖_1 := ∑_{j=1}^m |x_j|

SLIDE 34

Examples of norms

ℓ1-norm: Ω(x) = ‖x‖_1 := ∑_{j=1}^m |x_j|

OSCAR: given λ̃ > 0,

  Ω(x) := ∑_{j=1}^m (λ̃(j − 1) + 1) |x|_(j)   where |x|_(1) ≥ ··· ≥ |x|_(m)

[Bondell and Reich 2008]

SLIDE 35

Examples of norms

ℓ1-norm: Ω(x) = ‖x‖_1 := ∑_{j=1}^m |x_j|

OSCAR: given λ̃ > 0,

  Ω(x) := ∑_{j=1}^m (λ̃(j − 1) + 1) |x|_(j)   where |x|_(1) ≥ ··· ≥ |x|_(m)

[Bondell and Reich 2008]

Sorted ℓ1-norm: given λ_1 ≥ ··· ≥ λ_m > 0,

  Ω(x) := ∑_{j=1}^m λ_j |x|_(j)   where |x|_(1) ≥ ··· ≥ |x|_(m)

[Bogdan et al. 2013]
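
A small sketch (an illustration, not from the slides) of the sorted ℓ1-norm: it pairs the j-th largest absolute entry of x with the j-th weight λ_j, so it is a weighted ℓ1-norm computed on the decreasingly sorted absolute entries.

import numpy as np

def sorted_l1_norm(x, lam):
    """Sorted l1-norm: sum_j lam_j * |x|_(j), with lam_1 >= ... >= lam_m > 0
    and |x|_(1) >= ... >= |x|_(m) the absolute entries in decreasing order."""
    lam = np.asarray(lam, dtype=float)
    abs_sorted = np.sort(np.abs(np.asarray(x, dtype=float)))[::-1]
    return float(np.sum(lam * abs_sorted))

x = np.array([0.5, -2.0, 0.0, 1.0])
print(sorted_l1_norm(x, lam=[4.0, 3.0, 2.0, 1.0]))   # weighted l1
print(sorted_l1_norm(x, lam=[1.0, 1.0, 1.0, 1.0]))   # reduces to ||x||_1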

SLIDE 36

Norms generated from cones: for A ⊂ R^m_+,

  Ω(x) := min_{a ∈ A} (1/2) ∑_{j=1}^m ( x_j^2 / a_j + a_j )

[Micchelli et al. 2010] [Jenatton et al. 2011] [Bach et al. 2012]

[Figure: unit ball for the group Lasso norm and unit ball for the wedge norm, A = {a : a_1 ≥ a_2 ≥ ···}.]

SLIDE 37

Nuclear norm for matrices: X ∈ R^{m_1×m_2},

  Ω(X) := ‖X‖_nuclear := trace( √(XᵀX) )

SLIDE 38

Nuclear norm for matrices: X ∈ R^{m_1×m_2},

  Ω(X) := ‖X‖_nuclear := trace( √(XᵀX) )

Nuclear norm for tensors: X ∈ R^{m_1×m_2×m_3}, Ω(X) := dual norm of Ω_*, where

  Ω_*(W) := max_{‖u_1‖_2 = ‖u_2‖_2 = ‖u_3‖_2 = 1} trace( Wᵀ (u_1 ⊗ u_2 ⊗ u_3) ),   W ∈ R^{m_1×m_2×m_3}

[Yuan and Zhang 2014]
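
A minimal sketch (illustrative, not from the slides) for the matrix case: the nuclear norm is the sum of the singular values of X, and its dual is the operator norm (the largest singular value).

import numpy as np

def nuclear_norm(X):
    """Nuclear norm ||X||_nuclear = trace(sqrt(X^T X)) = sum of singular values."""
    return float(np.linalg.svd(X, compute_uv=False).sum())

def operator_norm(X):
    """Dual of the nuclear norm: the largest singular value of X."""
    return float(np.linalg.svd(X, compute_uv=False).max())

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 3)) @ rng.normal(size=(3, 7))   # a rank-3 matrix
print(nuclear_norm(X), operator_norm(X))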

SLIDE 39

Some concepts

Let ḟ_P(x) := ∂f_P(x)/∂x. The Bregman divergence is

  D(x ‖ x̂) = f_P(x) − f_P(x̂) − ḟ_P(x̂)ᵀ(x − x̂)

[Figure: f_P and its tangent at x̂; the Bregman divergence D(x ‖ x̂) is the gap between f_P(x) and the tangent at x.]

Definition (Property of f_P). We have margin curvature G if

  D(x* ‖ x̂) ≥ G( τ(x* − x̂) )
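
A small numeric sketch (illustrative, not from the slides) of the Bregman divergence for a smooth risk, using a hypothetical quadratic f_P(x) = ‖x − x^0‖_2^2; in that case D(x ‖ x̂) = ‖x − x̂‖_2^2, which the code checks.

import numpy as np

def bregman(f, grad_f, x, x_hat):
    """Bregman divergence D(x || x_hat) = f(x) - f(x_hat) - grad_f(x_hat)^T (x - x_hat)."""
    return f(x) - f(x_hat) - grad_f(x_hat) @ (x - x_hat)

# Hypothetical quadratic risk f_P(x) = ||x - x0||_2^2 with gradient 2 (x - x0).
x0 = np.array([1.0, -2.0, 0.5])
f_P = lambda x: float(np.sum((x - x0) ** 2))
grad_f_P = lambda x: 2.0 * (x - x0)

x = np.array([0.0, 0.0, 0.0])
x_hat = np.array([0.5, -1.0, 0.25])
print(bregman(f_P, grad_f_P, x, x_hat))     # equals ||x - x_hat||_2^2 for this f_P
print(float(np.sum((x - x_hat) ** 2)))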

SLIDE 40

Definition (Property of Ω). The triangle property holds at x* if there exist semi-norms Ω+ and Ω− such that

  Ω(x*) − Ω(x) ≤ Ω+(x − x*) − Ω−(x)

Definition. The effective sparsity at x* is

  Γ_*^2(L) := max { ( Ω+(x) / τ(x) )^2 : Ω−(x) ≤ L Ω+(x) }

where Ω−(x) ≤ L Ω+(x) is the "cone condition" and L ≥ 1 is a stretching factor.

SLIDE 41

  • Problem statement
  • Detour: exact recovery
  • Norm penalized empirical risk minimization
  • Adaptation

Concepts:
  • Sparsity
  • Effective sparsity
  • Margin curvature
  • Triangle property

SLIDE 42

Norm-regularized empirical risk minimization

  x_Q := x̂ := arg min_{x ∈ X ⊂ R^m} { f_Q(x) + λ Ω(x) }

where f_Q(x) is the empirical risk, λ Ω(x) the regularization penalty, Ω a given norm on R^m, and λ > 0 a tuning parameter.

SLIDE 43

A sharp oracle inequality

Theorem [vdG, 2016]. Let

  λ > λ_ε ≥ Ω_*( (ḟ_Q − ḟ_P)(x̂) )

where Ω_* is the dual norm; Ω_*((ḟ_Q − ḟ_P)(x̂)) measures how close Q is to P, and taking λ > λ_ε removes most of the variance.

Define λ̲ := λ − λ_ε, λ̄ := λ + λ_ε, L := λ̄/λ̲. Then (recall x̂ = x_Q, x^0 = x_P)

  f_P(x̂) − f_P(x^0) ≤ min_{x* ∈ X} { [ f_P(x*) − f_P(x^0) ] + H( λ̄ Γ_*(L) ) }

where the first term is the "bias", H(λ̄ Γ_*(L)) is the pinch of "variance", and H := the convex conjugate of G.

That is: Adaptation
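
The theorem trades the bias term against H(λ̄ Γ_*(L)), with H the convex conjugate of the margin curvature G. A tiny numerical sketch of this conjugacy (an illustration, not from the slides), using the quadratic G that appears in the Lasso example below:

import numpy as np

def convex_conjugate(G, v, u_grid):
    """Numerical convex conjugate H(v) = sup_u { u*v - G(u) } over a grid of u values."""
    return float(np.max(u_grid * v - G(u_grid)))

u_grid = np.linspace(0.0, 10.0, 100001)
G = lambda u: u ** 2 / 2.0                     # margin curvature in the Lasso example
for v in [0.5, 1.0, 2.0]:
    print(v, convex_conjugate(G, v, u_grid), v ** 2 / 2.0)   # H(v) = v^2 / 2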

SLIDE 44

Example: Lasso

Y ∈ R^n, Z ∈ R^{n×m}. Model: Y = b^0 + ε, with f_P(x) := ‖b^0 − Zx‖_2^2 / n.

  x̂ := arg min_{x ∈ R^m} { ‖Y − Zx‖_2^2 / (2n) + λ ‖x‖_1 }

with ‖Y − Zx‖_2^2/(2n) in the role of f_Q(x) and ‖x‖_1 in the role of Ω(x).

Margin curvature: G(u) = u^2/2  ⇒  H(v) = v^2/2

Effective sparsity at x^0: Γ_0^2(L) = s_0 / φ̂^2(L, S_0)

SLIDE 45

From the theorem: with high probability

  f_P(x̂) − f_P(x^0) ≤ C × ( s_0 / φ̂^2(L, S_0) ) × (log m) / n

where s_0 / φ̂^2(L, S_0) is the effective sparsity.

Adaptation

SLIDE 46

Simulation: Lasso and sorted ℓ1-norm

                         theoretical λ                        cross-validated λ
            ‖x^0−x̂‖_1   Ω(x^0−x̂)   ‖Z(x^0−x̂)‖_2     ‖x^0−x̂‖_1   Ω(x^0−x̂)   ‖Z(x^0−x̂)‖_2
  srSLOPE      4.50        0.49         7.74              7.87        1.09         7.68
  srLASSO      8.48        0.89        29.47              7.81        0.85         9.19


SLIDE 48

Example: Matrix completion in logistic regression [Lafond, 2015]

Let Z_i be a mask matrix with a "1" at a single random entry and zeros elsewhere. Let Y_i be a binary response. Model:

  log-odds(Y_i) = x^0_i = trace(Z_i X^0)

  f_Q(X) := −(1/n) ∑_{i=1}^n Y_i trace(Z_i X) + ∑_{j,k} d(X_{j,k}) / (m_1 m_2)

where d is given.

SLIDE 49

Let Ω := ‖·‖_nuclear. Dual norm: the operator norm.

Margin semi-norm: τ^2(X) = ‖X‖_2^2 / (m_1 m_2)

Margin curvature: G(u) = u^2 / (2 c m_1 m_2)  ⇒  H(v) = c m_1 m_2 v^2 / 2

Effective sparsity: Γ_0^2(L) = 3 s_0
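
The key computational step for nuclear-norm-penalized empirical risk minimization (in proximal-gradient-type algorithms, which are my illustration here and are not discussed on the slides) is the proximal operator of λ‖·‖_nuclear, i.e. soft-thresholding of the singular values. A minimal sketch:

import numpy as np

def prox_nuclear(X, lam):
    """Proximal operator of lam * ||.||_nuclear: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_thresholded = np.maximum(s - lam, 0.0)
    return U @ np.diag(s_thresholded) @ Vt

rng = np.random.default_rng(5)
X = rng.normal(size=(6, 4)) @ rng.normal(size=(4, 8))   # a low-rank matrix
X_shrunk = prox_nuclear(X, lam=1.0)
print(np.linalg.svd(X, compute_uv=False).round(2))       # singular values before
print(np.linalg.svd(X_shrunk, compute_uv=False).round(2))  # and after thresholding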


SLIDE 50

From the theorem: for m_1 ≥ m_2 and

  λ = C_0 ( √(log m_1) + log(1/α)/m_1 ) / √(n m_2),

with probability at least 1 − α,

  f_P(X̂) − f_P(X^0) ≤ C × s_0 m_1 log(m_1) / n

Adaptation

SLIDE 51

Example: Sparse PCA

  • Y_1, ..., Y_n a sample from a distribution P on R^m with covariance matrix Σ_P
  • Σ_Q := YᵀY/n
  • f_P(x) := ‖Σ_P − xxᵀ‖_2^2,  f_Q(x) := ‖Σ_Q − xxᵀ‖_2^2
  • Ω := ‖·‖_1

From the theorem: Assume ... Then with λ = C_0 √(log m / n), w.h.p.¹

  f_P(x̂) − f_P(x^0) ≤ C_1 s_0 log m / n

Adaptation

¹ this means: with high probability
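
A rough sketch of one way to compute such an estimator (this particular algorithm, proximal gradient on the nonconvex rank-one objective, is my illustration and is not taken from the slides): minimize ‖Σ_Q − xxᵀ‖_F^2 + λ‖x‖_1 by gradient steps followed by soft-thresholding, starting from the leading eigenvector of Σ_Q.

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_pca_rank1(Sigma_Q, lam, step=0.01, n_iter=2000):
    """Proximal-gradient sketch for min_x ||Sigma_Q - x x^T||_F^2 + lam * ||x||_1."""
    evals, evecs = np.linalg.eigh(Sigma_Q)
    x = np.sqrt(max(evals[-1], 0.0)) * evecs[:, -1]       # scaled leading eigenvector
    for _ in range(n_iter):
        grad = 4.0 * (np.dot(x, x) * x - Sigma_Q @ x)     # gradient of ||Sigma - x x^T||_F^2
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Toy data: sparse leading component x0, observations Y_i ~ N(0, Sigma_P).
rng = np.random.default_rng(6)
m, n, s0 = 50, 200, 3
x0 = np.zeros(m)
x0[:s0] = 1.0
Sigma_P = np.outer(x0, x0) + np.eye(m)
Y = rng.multivariate_normal(np.zeros(m), Sigma_P, size=n)
Sigma_Q = Y.T @ Y / n

lam = 2.0 * np.sqrt(np.log(m) / n)                        # tuning of order sqrt(log m / n)
x_hat = sparse_pca_rank1(Sigma_Q, lam)
print("estimated support:", np.flatnonzero(np.abs(x_hat) > 0.1))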

SLIDE 52

  • Problem statement
  • Detour: exact recovery
  • Norm penalized empirical risk minimization
  • Adaptation

Concepts:
  • Sparsity
  • Effective sparsity
  • Margin curvature
  • Triangle property

SLIDE 53

Conclusion

Norms with the triangle property lead to Adaptation, for general loss and assuming margin curvature.

SLIDE 54

SLIDE 55

See: and its references