SLIDE 1

Loss minimization and parameter estimation with heavy tails

Daniel Hsu⋆   Sivan Sabato#†

⋆Department of Computer Science, Columbia University
#Microsoft Research New England
†On the job market: don't miss this amazing hiring opportunity!

SLIDE 2

Outline

  • 1. Introduction
  • 2. Warm-up: estimating a scalar mean
  • 3. Linear regression with heavy-tail distributions
  • 4. Concluding remarks


SLIDE 3
  • 1. Introduction


SLIDE 4

Heavy-tail distributions

A distribution whose "tail" is "heavier" than that of the Exponential distribution.

For random vectors $X$, consider the distribution of $\|X\|$.
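One standard formalization (my addition, not on the slide): a random variable $X$ is heavy-tailed if its tail decays more slowly than every exponential, i.e.,

$$\limsup_{x \to \infty} e^{\lambda x} \, \Pr(|X| > x) = \infty \quad \text{for every } \lambda > 0.$$

For example, Pareto and log-normal distributions are heavy-tailed; the Gaussian is not.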


SLIDES 5–6

Multivariate heavy-tail distributions

Heavy-tail distributions for random vectors $X \in \mathbb{R}^d$:

  • Marginal distributions of the $X_i$ have heavy tails, or
  • Strong dependencies between the $X_i$.

Can we use the same procedures originally designed for distributions without heavy tails? Or do we need new procedures?

SLIDES 7–8

Minimax optimal but not deviation optimal

The empirical mean achieves the minimax rate for estimating $\mathbb{E}(X)$, but it is suboptimal where deviations are concerned: the squared error of the empirical mean is $\Omega\!\left(\frac{\sigma^2}{\delta n}\right)$ with probability $2\delta$ for some distribution.

($n$ = sample size, $\sigma^2 = \mathrm{var}(X) < \infty$.)

Note: if the data were Gaussian, the squared error would be $O\!\left(\frac{\sigma^2 \log(1/\delta)}{n}\right)$.
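As a quick illustration (my own simulation, not from the talk), one can compare tail quantiles of the empirical mean's squared error for a heavy-tailed distribution against a Gaussian with matching mean and variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 200_000
a = 2.5                                 # Lomax shape: heavy tail, finite variance
mu = 1.0 / (a - 1.0)                    # mean of the Lomax(a) distribution
var = a / ((a - 1.0) ** 2 * (a - 2.0))  # its variance

# Empirical means over many independent samples of size n.
heavy = rng.pareto(a, size=(trials, n)).mean(axis=1)
gauss = rng.normal(mu, np.sqrt(var), size=(trials, n)).mean(axis=1)

for q in (0.99, 0.999, 0.9999):
    print(q,
          np.quantile((heavy - mu) ** 2, q),
          np.quantile((gauss - mu) ** 2, q))
```

The extreme quantiles of the squared error come out markedly larger in the heavy-tailed case, even though the two distributions have the same variance: exactly the deviation suboptimality the slide describes.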

SLIDES 9–11

Main result

New computationally efficient estimator for least squares linear regression when the distributions of $X \in \mathbb{R}^d$ and $Y \in \mathbb{R}$ may have heavy tails.

Assuming bounded $(4+\epsilon)$-order moments and regularity conditions, the convergence rate is $O\!\left(\frac{\sigma^2 d \log(1/\delta)}{n}\right)$ with probability $1 - \delta$, as soon as $n \geq \tilde{O}(d \log(1/\delta) + \log^2(1/\delta))$.

($n$ = sample size, $\sigma^2$ = optimal squared error.)

Previous state-of-the-art: [Audibert and Catoni, AoS 2011]; essentially the same conditions and rate, but computationally inefficient.

General technique with many other applications: ridge, Lasso, matrix approximation, etc.

SLIDE 12
  • 2. Warm-up: estimating a scalar mean


SLIDE 13

Warm-up: estimating a scalar mean

Forget $X$; how do we estimate $\mathbb{E}(Y)$? (Set $\mu := \mathbb{E}(Y)$ and $\sigma^2 := \mathrm{var}(Y)$; assume $\sigma^2 < \infty$.)


SLIDES 14–15

Empirical mean

Let $Y_1, Y_2, \ldots, Y_n$ be iid copies of $Y$, and set $\hat{\mu} := \frac{1}{n} \sum_{i=1}^{n} Y_i$ (empirical mean).

There exist distributions for $Y$ with $\sigma^2 < \infty$ such that

$$\Pr\!\left( (\hat{\mu} - \mu)^2 \;\geq\; \frac{\sigma^2}{2\delta n} \left(1 - \frac{2e\delta}{n}\right)^{n-1} \right) \;\geq\; 2\delta.$$

(Catoni, 2012)

SLIDES 16–18

Median-of-means

[Nemirovsky and Yudin, 1983; Alon, Matias, and Szegedy, JCSS 1999]

  • 1. Split the sample $\{Y_1, \ldots, Y_n\}$ into $k$ parts $S_1, S_2, \ldots, S_k$ of equal size (say, randomly).
  • 2. For each $i = 1, 2, \ldots, k$: set $\hat{\mu}_i := \mathrm{mean}(S_i)$.
  • 3. Return $\hat{\mu} := \mathrm{median}(\{\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_k\})$.

Theorem (Folklore)

Set $k := 4.5 \ln(1/\delta)$. With probability at least $1 - \delta$, $(\hat{\mu} - \mu)^2 \leq O\!\left(\frac{\sigma^2 \log(1/\delta)}{n}\right)$.
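The procedure is short enough to state in code. A minimal numpy sketch (the function name and interface are mine; the constant $k = \lceil 4.5 \ln(1/\delta) \rceil$ follows the theorem above):

```python
import numpy as np

def median_of_means(y: np.ndarray, delta: float, rng=None) -> float:
    """Median-of-means estimate of E[Y] with target failure probability delta."""
    rng = np.random.default_rng() if rng is None else rng
    k = max(1, int(np.ceil(4.5 * np.log(1.0 / delta))))  # number of blocks
    perm = rng.permutation(len(y))             # random split, as in step 1
    blocks = np.array_split(y[perm], k)        # k parts of (nearly) equal size
    return float(np.median([b.mean() for b in blocks]))
```

Note that the estimator needs no knowledge of $\sigma^2$; only the number of blocks depends on the desired confidence.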

SLIDES 19–21

Analysis of median-of-means

  • 1. Assume $|S_i| = n/k$ for simplicity. By Chebyshev's inequality, for each $i = 1, 2, \ldots, k$:
$$\Pr\!\left( |\hat{\mu}_i - \mu| \leq \sqrt{\frac{6\sigma^2 k}{n}} \right) \;\geq\; 5/6.$$
  • 2. Let $b_i := \mathbf{1}\{|\hat{\mu}_i - \mu| \leq \sqrt{6\sigma^2 k/n}\}$. By Hoeffding's inequality,
$$\Pr\!\left( \sum_{i=1}^{k} b_i > k/2 \right) \;\geq\; 1 - \exp(-k/4.5).$$
  • 3. In the event that more than half of the $\hat{\mu}_i$ are within $\sqrt{6\sigma^2 k/n}$ of $\mu$, the median $\hat{\mu}$ is as well.
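For completeness, a short check of the constants (my arithmetic, not spelled out on the slides):

$$\mathrm{var}(\hat{\mu}_i) = \frac{\sigma^2}{|S_i|} = \frac{\sigma^2 k}{n}, \qquad \Pr\!\left(|\hat{\mu}_i - \mu| > \sqrt{6\sigma^2 k/n}\right) \;\leq\; \frac{\sigma^2 k/n}{6\sigma^2 k/n} = \frac{1}{6},$$

and Hoeffding's inequality applied to the $k$ independent indicators $b_i$, each with mean at least $5/6$, gives

$$\Pr\!\left(\sum_{i=1}^{k} b_i \leq k/2\right) \;\leq\; \exp\!\left(-2k\left(\tfrac{5}{6} - \tfrac{1}{2}\right)^2\right) = \exp(-2k/9) = \exp(-k/4.5),$$

which is at most $\delta$ once $k \geq 4.5 \ln(1/\delta)$.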

SLIDE 22

Alternative: minimize a robust loss function

An alternative is to minimize a "robust" loss function [Catoni, 2012]:

$$\hat{\mu} := \arg\min_{\mu \in \mathbb{R}} \; \sum_{i=1}^{n} \ell\!\left(\frac{\mu - Y_i}{\sigma}\right).$$

Example: $\ell(z) := \log\cosh(z)$. Optimal rate and constants. Catch: need to know $\sigma^2$.
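A minimal sketch of this style of M-estimator with the log-cosh loss, assuming $\sigma$ is known (per the catch above). Since $\frac{d}{d\mu} \log\cosh\big((\mu - y)/\sigma\big) = \tanh\big((\mu - y)/\sigma\big)/\sigma$ is increasing in $\mu$, the minimizer is the unique root of the summed derivative, which bisection finds easily (function name mine; this follows the displayed objective rather than Catoni's exact influence function):

```python
import numpy as np

def robust_mean(y: np.ndarray, sigma: float, tol: float = 1e-10) -> float:
    """Minimize sum_i log cosh((mu - y_i) / sigma) over mu by bisection."""
    lo, hi = float(y.min()), float(y.max())  # the minimizer lies in [min, max]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The sign of the derivative tells us which side of mid the root is on.
        if np.tanh((mid - y) / sigma).sum() > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```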

SLIDE 23
  • 3. Linear regression with heavy-tail distributions


SLIDES 24–25

Linear regression (for out-of-sample prediction)

  • 1. Response variable: random variable $Y \in \mathbb{R}$.
  • 2. Covariates: random vector $X \in \mathbb{R}^d$. (Assume $\Sigma := \mathbb{E}[XX^\top] \succ 0$.)
  • 3. Given: sample $S$ of $n$ iid copies of $(X, Y)$.
  • 4. Goal: find $\hat{\beta} = \hat{\beta}(S) \in \mathbb{R}^d$ to minimize the population loss $L(\beta) := \mathbb{E}(Y - \beta^\top X)^2$.

Recall: let $\beta_\star := \arg\min_{\beta' \in \mathbb{R}^d} L(\beta')$. For any $\beta \in \mathbb{R}^d$,

$$L(\beta) - L(\beta_\star) = \left\| \Sigma^{1/2}(\beta - \beta_\star) \right\|^2 =: \|\beta - \beta_\star\|_\Sigma^2.$$
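This identity is a one-line consequence of the first-order optimality of $\beta_\star$ (my expansion, not on the slides): writing $Y - \beta^\top X = (Y - \beta_\star^\top X) + (\beta_\star - \beta)^\top X$,

$$L(\beta) = L(\beta_\star) + 2\,\mathbb{E}\big[(Y - \beta_\star^\top X)\, X^\top\big](\beta_\star - \beta) + (\beta - \beta_\star)^\top \Sigma\, (\beta - \beta_\star),$$

and the cross term vanishes because $\mathbb{E}[(Y - \beta_\star^\top X)\, X] = 0$ at the minimizer.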

SLIDES 26–27

Generalization of median-of-means

  • 1. Split the sample $S$ into $k$ parts $S_1, S_2, \ldots, S_k$ of equal size (say, randomly).
  • 2. For each $i = 1, 2, \ldots, k$: set $\hat{\beta}_i := \text{ordinary least squares}(S_i)$.
  • 3. Return $\hat{\beta} := \text{select good one}\big(\{\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k\}\big)$.

Questions:

  • 1. Guarantees for $\hat{\beta}_i = \mathrm{OLS}(S_i)$?
  • 2. How to select a good $\hat{\beta}_i$? (A code sketch of the full pipeline appears after SLIDES 30–31.)

SLIDES 28–29

Ordinary least squares

Under moment conditions*, $\hat{\beta}_i := \mathrm{OLS}(S_i)$ satisfies

$$\big\|\hat{\beta}_i - \beta_\star\big\|_\Sigma = O\!\left(\sqrt{\frac{\sigma^2 d}{|S_i|}}\right)$$

with probability at least 5/6 as soon as $|S_i| \geq O(d \log d)$.**

Upshot: if $k := O(\log(1/\delta))$, then with probability $1 - \delta$, more than half of the $\hat{\beta}_i$ will be within $\varepsilon := \sqrt{\sigma^2 d \log(1/\delta)/n}$ of $\beta_\star$.

* Requires a kurtosis condition for this simplified bound.
** Can replace $d \log d$ with $d$ under some regularity conditions [Srivastava and Vershynin, AoP 2013].
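The upshot follows by plugging in the block size (a short check, my arithmetic): with $|S_i| = n/k$ and $k = O(\log(1/\delta))$,

$$\sqrt{\frac{\sigma^2 d}{|S_i|}} = \sqrt{\frac{\sigma^2 d k}{n}} = O\!\left(\sqrt{\frac{\sigma^2 d \log(1/\delta)}{n}}\right) = O(\varepsilon),$$

and the median-of-means argument from the warm-up (Chebyshev replaced by the 5/6 OLS guarantee, then Hoeffding over blocks) gives the majority claim.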

SLIDES 30–31

Selecting a good $\hat{\beta}_i$ assuming $\Sigma$ is known

Consider the metric $\rho(a, b) := \|a - b\|_\Sigma$.

  • 1. For each $i = 1, 2, \ldots, k$: let $r_i := \mathrm{median}\big\{\rho(\hat{\beta}_i, \hat{\beta}_j) : j = 1, 2, \ldots, k\big\}$.
  • 2. Let $i_\star := \arg\min_i r_i$.
  • 3. Return $\hat{\beta} := \hat{\beta}_{i_\star}$.

Claim: if more than half of the $\hat{\beta}_i$ are within distance $\varepsilon$ of $\beta_\star$, then $\hat{\beta}$ is within distance $3\varepsilon$ of $\beta_\star$. (See the sketch below.)
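Why the claim holds (a short argument, not spelled out on the slides): any good $\hat{\beta}_i$ (within $\varepsilon$ of $\beta_\star$) is within $2\varepsilon$ of every other good one, and the good ones are a majority, so $r_{i_\star} \leq 2\varepsilon$; the ball of radius $r_{i_\star}$ around $\hat{\beta}_{i_\star}$ covers at least half the estimates, hence contains some good $\hat{\beta}_j$, and the triangle inequality gives $\rho(\hat{\beta}_{i_\star}, \beta_\star) \leq 3\varepsilon$.

A minimal numpy sketch of the whole pipeline under the known-$\Sigma$ assumption (function name and interface are mine, not from the talk):

```python
import numpy as np

def mom_regression(X, y, k, Sigma, rng=None):
    """k-way split, OLS per block, then return the estimate whose
    median Sigma-distance to the other estimates is smallest."""
    rng = np.random.default_rng() if rng is None else rng
    blocks = np.array_split(rng.permutation(len(y)), k)
    betas = [np.linalg.lstsq(X[idx], y[idx], rcond=None)[0] for idx in blocks]

    def rho(a, b):                        # rho(a, b) = ||a - b||_Sigma
        d = a - b
        return np.sqrt(d @ Sigma @ d)

    r = [np.median([rho(bi, bj) for bj in betas]) for bi in betas]
    return betas[int(np.argmin(r))]
```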

SLIDES 32–34

Selecting a good $\hat{\beta}_i$ when $\Sigma$ is unknown

General case: $\Sigma$ is unknown; can't compute the distances $\|a - b\|_\Sigma$.

Solution: estimate the $\binom{k}{2}$ pairwise distances using fresh (unlabeled) samples.

  • Only require a constant fraction of these estimates to be accurate within constant multiplicative factors.
  • Extra $O(k^2) = O(\log^2(1/\delta))$ (unlabeled) samples suffice.
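The estimate rests on the identity $\|a - b\|_\Sigma^2 = (a - b)^\top \mathbb{E}[XX^\top](a - b) = \mathbb{E}\big[\big((a - b)^\top X\big)^2\big]$, so fresh unlabeled draws of $X$ give an empirical version. A sketch (helper name mine):

```python
import numpy as np

def est_sigma_dist(a, b, X_fresh):
    """Estimate ||a - b||_Sigma from unlabeled covariates via
    ||a - b||_Sigma^2 = E[((a - b)^T X)^2]."""
    proj = X_fresh @ (a - b)      # one scalar (a - b)^T X per fresh sample
    return float(np.sqrt(np.mean(proj ** 2)))
```

Because only constant-factor accuracy on a constant fraction of the pairs is needed, a constant number of fresh samples per pair, $O(k^2)$ in total, suffices.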

SLIDE 35

Another interpretation: multiplicative approximation

With probability $1 - \delta$,

$$L(\hat{\beta}) \;\leq\; \left(1 + O\!\left(\frac{d \log(1/\delta)}{n}\right)\right) L(\beta_\star)$$

(as soon as $n \geq \tilde{O}(d \log(1/\delta) + \log^2(1/\delta))$).

For instance, get a 2-approximation with $n = \tilde{O}\big(d \log(1/\delta) + \log^2(1/\delta)\big)$, with no dependence on $L(\beta_\star)$.

(cf. [Mahdavi and Jin, COLT 2013].)

SLIDE 36
  • 4. Concluding remarks


SLIDES 37–40

Concluding remarks

  • 1. This talk: linear regression with heavy-tail distributions in finite dimensions.
      Paper: other applications (e.g., ridge, Lasso, matrix approximation). http://arxiv.org/abs/1307.1827
  • 2. Simple algorithms + simple statistics: avoid unnecessary assumptions made in statistical learning theory for classical problems.
  • 3. Open questions:
      • Remove extraneous log factors?
      • Validation sets: not just for parameter tuning?

Thanks!