SLIDE 1

Exponentially weighted aggregation Laplace prior for linear regression

Arnak Dalalyan, Edwin Grappin & Quentin Paris
edwin.grappin@ensae.fr

JPS - Les Houches - 2016

Outline: introduction (prediction in high dimension); penalization and the Lasso; exponentially weighted average.

SLIDES 2-3

Goals & settings

We observe $n$ labels $(Y_i)_{i \in \{1,\dots,n\}}$ and assume a linear relation between the labels and the $p$ features $(X_{i,j})_{j \in \{1,\dots,p\}}$:
$$Y = X\beta^\star + \xi,$$
where $Y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, $\beta^\star \in \mathbb{R}^p$, and $\xi \in \mathbb{R}^n$ is a random vector with $\xi_i \sim \mathcal{N}(0, \sigma^2)$.

Our interests are:

  • a low prediction loss $\|X(\beta^\star - \hat\beta)\|_2^2$ (fitting $\beta^\star$ itself is less important),
  • good quality when $p$ is large ($p \gg n$),
  • efficient use of the sparsity of $\beta^\star$ ($\beta^\star$ is $s$-sparse if at most $s$ of its entries are nonzero).
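A minimal numpy simulation of this setting may help fix ideas. The sizes $n$, $p$, $s$, the noise level and the seed below are illustrative choices, not values from the talk:

```python
import numpy as np

# Illustrative problem sizes: n observations, p features, s-sparse beta*.
rng = np.random.default_rng(0)
n, p, s, sigma = 50, 200, 5, 1.0

X = rng.standard_normal((n, p))          # design matrix, n x p
beta_star = np.zeros(p)
beta_star[:s] = rng.standard_normal(s)   # beta* is s-sparse: only s nonzero entries
xi = sigma * rng.standard_normal(n)      # noise, xi_i ~ N(0, sigma^2)
Y = X @ beta_star + xi                   # Y = X beta* + xi
```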

SLIDES 4-8

Least squares method

The ordinary least squares (OLS) estimator is defined by:
$$\hat\beta^{\mathrm{OLS}} = \arg\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2.$$

OLS minimizes the sum of the squares of the residuals.

  • Overfitting. If $p$ is very large, OLS has poor prediction results:
  • there is no unique solution when $p > n$,
  • it does not detect the meaningful features among all features,
  • its performance is focused on fitting the data, not on predicting labels.
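Continuing the simulation above, a short sketch of this failure mode. With $p > n$ the least-squares problem has infinitely many minimizers; np.linalg.lstsq returns the minimum-norm one, which fits the training data essentially exactly yet predicts poorly:

```python
# Minimum-norm least-squares solution (one of the many minimizers when p > n).
beta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)

residual_loss = np.sum((Y - X @ beta_ols) ** 2) / n              # ~ 0 here: the data are interpolated
prediction_loss = np.sum((X @ (beta_star - beta_ols)) ** 2) / n  # typically large: poor prediction
```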

SLIDES 9-10

Penalized regression

In our case, a good estimator has the following properties:

  • guarantees on prediction results,
  • use of the sparsity assumption to manage $p > n$,
  • computational speed (of paramount importance when $p$ is large).

Penalized regression combines the usual fitting term with a penalty term:
$$\hat\beta^{\mathrm{pen}} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2^2 + \lambda P(\beta) \right\},$$
where $P$ is the penalty function and $\lambda \ge 0$ controls the trade-off between the two terms.
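As a sketch, the generic objective is easy to write down. The names penalized_objective, l0_penalty and l1_penalty below are illustrative helpers, not from the talk:

```python
# Generic penalized least-squares objective; P is any penalty function.
def penalized_objective(beta, X, Y, lam, P):
    return np.sum((Y - X @ beta) ** 2) + lam * P(beta)

# Two penalties discussed next: the l0 pseudo-norm and the l1 norm.
def l0_penalty(b):
    return np.count_nonzero(b)   # sparsity level (nonconvex)

def l1_penalty(b):
    return np.sum(np.abs(b))     # Lasso penalty (convex)
```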

SLIDES 11-13

Subset selection with an $\ell_0$ penalization

An intuitive candidate is a penalization based on the $\ell_0$ pseudo-norm (the sparsity level):
$$\|\beta\|_0 = \sum_{i=1}^{p} \mathbb{1}_{\{\beta_i \neq 0\}}, \qquad \hat\beta^{\ell_0} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_0 \right\}.$$

The penalty forces many entries of $\hat\beta$ to be null: it selects the most important features. However, due to the $\ell_0$ pseudo-norm, the objective function is nonconvex; hence, computational time grows exponentially with $p$.
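A brute-force sketch makes the combinatorial cost concrete. The helper l0_fit, the cap max_support, the value lam = 1.0 and the restriction to 15 features are all illustrative assumptions:

```python
from itertools import combinations

# Brute-force l0-penalized least squares: enumerate every support of size at
# most max_support and refit OLS on each. The number of supports grows
# combinatorially in p, which is why the call below only searches a handful
# of features.
def l0_fit(X, Y, lam, max_support=3):
    p = X.shape[1]
    best_obj, best_beta = float(np.sum(Y ** 2)), np.zeros(p)  # start from the empty support
    for k in range(1, max_support + 1):
        for S in combinations(range(p), k):
            idx = list(S)
            coef, *_ = np.linalg.lstsq(X[:, idx], Y, rcond=None)
            obj = float(np.sum((Y - X[:, idx] @ coef) ** 2)) + lam * k
            if obj < best_obj:
                best_obj = obj
                best_beta = np.zeros(p)
                best_beta[idx] = coef
    return best_beta

beta_l0 = l0_fit(X[:, :15], Y, lam=1.0)  # only 15 candidate features, to stay tractable
```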

SLIDES 14-16

Choice of the penalization term

For $q > 0$, consider the estimators
$$\hat\beta_q = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_q^q \right\}.$$

  • If $q < 1$, the solution is sparse but the problem is nonconvex.
  • If $q > 1$, the problem is convex but the solution is not sparse.
  • If $q = 1$, the solution is sparse and the problem is convex.

SLIDES 17-18

Lasso, the $\ell_1$ norm

The Lasso estimator is defined by:
$$\hat\beta^{L} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{\|Y - X\beta\|_2^2}{2n} + \lambda \|\beta\|_1 \right\}.$$

Theorem (Dalalyan et al. (2014), On the Prediction Performance of the Lasso)

Let $\lambda = 2\sigma\sqrt{\frac{2\log(p/\delta)}{n}}$. Then, with probability at least $1 - \delta$,
$$\frac{\|X(\beta^\star - \hat\beta^{L})\|_2^2}{n} \;\le\; \inf_{\substack{\beta \in \mathbb{R}^p \\ s\text{-sparse}}} \left\{ \frac{\|X(\beta^\star - \beta)\|_2^2}{n} + \frac{10\, s\, \sigma^2 \log(p/\delta)}{n\, \kappa} \right\},$$
where $\kappa$ is a constant depending on the design matrix $X$.
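A hedged sketch of this estimator in code: the slide's objective coincides with scikit-learn's Lasso objective, $\frac{1}{2n}\|Y - X\beta\|_2^2 + \alpha\|\beta\|_1$, with $\alpha = \lambda$. The confidence level $\delta = 0.05$ below is an illustrative choice, and $\sigma$ is assumed known:

```python
from sklearn.linear_model import Lasso

# Theorem's tuning: lam = 2 * sigma * sqrt(2 * log(p / delta) / n).
delta = 0.05
lam = 2 * sigma * np.sqrt(2 * np.log(p / delta) / n)

# scikit-learn's Lasso minimizes ||Y - X b||^2 / (2n) + alpha * ||b||_1,
# which matches the slide's objective with alpha = lam.
lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, Y)
beta_lasso = lasso.coef_

prediction_loss_lasso = np.sum((X @ (beta_star - beta_lasso)) ** 2) / n
support_size = np.count_nonzero(beta_lasso)  # typically close to s
```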

SLIDES 19-21

EWA: definition

The Lasso estimator is a maximum a posteriori estimator with a Laplace prior:
$$\hat\beta^{L} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{\|Y - X\beta\|_2^2}{2n} + \lambda \|\beta\|_1 \right\} = \arg\max_{\beta \in \mathbb{R}^p} \underbrace{\exp\!\left( -\frac{\|Y - X\beta\|_2^2}{2\sigma^2} \right)}_{\propto\, \mathcal{N}(X\beta,\, \sigma^2 I_n)} \; \underbrace{\exp\!\left( -\frac{\lambda n}{\sigma^2} \|\beta\|_1 \right)}_{\propto\, \pi_0(\beta):\ \text{Laplace prior}}.$$

Let
$$V(\beta) = \frac{1}{2\sigma^2}\|Y - X\beta\|_2^2 + \frac{\lambda n}{\sigma^2}\|\beta\|_1, \qquad \hat\pi_T(\beta) \propto \exp\!\left( -\frac{V(\beta)}{T} \right).$$

We define the exponentially weighted average (EWA) estimator with Laplace prior by
$$\hat\beta^{\mathrm{EWA}} = \int_{\mathbb{R}^p} \beta\, \hat\pi_T(\beta)\, d\beta.$$
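The integral defining $\hat\beta^{\mathrm{EWA}}$ has no closed form; one standard way to approximate it is Langevin Monte Carlo on $\hat\pi_T$. The sketch below is not the authors' algorithm: it uses $\mathrm{sign}(\beta)$ as a subgradient of the non-smooth $\ell_1$ term, and the step size h, temperature T, iteration count and burn-in are illustrative guesses:

```python
# Unadjusted Langevin sampler for pi_T(beta) ∝ exp(-V(beta) / T), averaging
# the post-burn-in iterates as a Monte Carlo estimate of beta_EWA.
def ewa_langevin(X, Y, lam, sigma, T=1e-3, h=1e-6, n_iter=20000, burn_in=5000, seed=1):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    total, kept = np.zeros(p), 0
    for k in range(n_iter):
        # (sub)gradient of V(beta) = ||Y - X beta||^2 / (2 sigma^2) + (lam n / sigma^2) ||beta||_1
        grad_V = X.T @ (X @ beta - Y) / sigma**2 + (lam * n / sigma**2) * np.sign(beta)
        # Langevin step for the tempered density exp(-V / T).
        beta = beta - (h / T) * grad_V + np.sqrt(2 * h) * rng.standard_normal(p)
        if k >= burn_in:
            total += beta
            kept += 1
    return total / kept  # estimate of beta_EWA = E_{pi_T}[beta]

beta_ewa = ewa_langevin(X, Y, lam, sigma)
```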

SLIDE 22

Results

Theorem. Let $\lambda = 2\sigma\sqrt{\frac{2\log(p/\delta)}{n}}$. Then, with probability at least $1 - \delta$,
$$\frac{\|X(\beta^\star - \hat\beta^{\mathrm{EWA}})\|_2^2}{n} \;\le\; \inf_{\substack{\beta \in \mathbb{R}^p \\ s\text{-sparse}}} \left\{ \frac{\|X(\beta^\star - \beta)\|_2^2}{n} + \frac{10\, s\, \sigma^2 \log(p/\delta)}{n\, \kappa} \right\} + 2H(T),$$
where
$$H(T) = pT - \int G(\beta)\, \hat\pi_T(\beta)\, d\beta + G(\hat\beta^{\mathrm{EWA}}), \qquad G(\beta) = \frac{1}{n}\|X\beta\|_2^2 + \lambda\|\beta\|_1.$$
Since $G$ is convex, Jensen's inequality gives $G(\hat\beta^{\mathrm{EWA}}) = G\big(\int \beta\, \hat\pi_T(\beta)\, d\beta\big) \le \int G(\beta)\, \hat\pi_T(\beta)\, d\beta$, hence $H(T) \le pT$.

SLIDES 23-27

The choice of T

  • If $T = 0$, then $\hat\beta^{L} = \hat\beta^{\mathrm{EWA}}$.
  • We are interested in $T < 1/p$; recall that $H(T) \le pT$.
  • The larger $T$ is, the larger the variance of the posterior $\hat\pi_T$.
  • We believe that this variance brings robustness to the choice of $\lambda$ (a small numerical scan is sketched after this list).
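A hypothetical way to probe this robustness claim numerically, reusing the Langevin sketch above. The deliberately misspecified $\lambda$ and the grid of temperatures are illustrative; the step size is scaled with T so the drift term (h/T) * grad_V keeps the same magnitude:

```python
# Fix the data, halve lam on purpose, and scan a few temperatures T.
for T in (1e-4, 1e-3, 1e-2):
    b = ewa_langevin(X, Y, lam=0.5 * lam, sigma=sigma, T=T, h=1e-3 * T)
    loss = np.sum((X @ (beta_star - b)) ** 2) / n
    print(f"T={T:g}: prediction loss = {loss:.3f}")
```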

SLIDES 28-30

Conclusion & questions

Results:

  • EWA with a Laplace prior defines a family of estimators that includes the Lasso.
  • A sharp oracle inequality holds for this family of estimators.

Questions:

  • What is a good value of $T$?
  • Can we prove a result on the robustness to the choice of $\lambda$?
  • Can we compute this estimator efficiently?

Thank you!