Loss minimization and parameter estimation with heavy tails

  1. Loss minimization and parameter estimation with heavy tails. Sivan Sabato #,† and Daniel Hsu *. * Department of Computer Science, Columbia University. # Microsoft Research New England. † On the job market—don’t miss this amazing hiring opportunity!

  2. Outline 1. Introduction 2. Warm-up: estimating a scalar mean 3. Linear regression with heavy-tail distributions 4. Concluding remarks

  3. 1. Introduction

  4. Heavy-tail distributions. A distribution with a “tail” that is “heavier” than that of the exponential distribution. For random vectors, consider the distribution of ‖X‖.

  5. Multivariate heavy-tail distributions. Heavy-tail distributions for random vectors X ∈ ℝ^d: (i) the marginal distributions of the X_i have heavy tails, or (ii) strong dependencies between the X_i.

  6. Multivariate heavy-tail distributions. Heavy-tail distributions for random vectors X ∈ ℝ^d: (i) the marginal distributions of the X_i have heavy tails, or (ii) strong dependencies between the X_i. Can we use the same procedures originally designed for distributions without heavy tails? Or do we need new procedures?

  7. Minimax optimal but not deviation optimal. The empirical mean achieves the minimax rate for estimating E(X), but is suboptimal when deviations are concerned: the squared error of the empirical mean is Ω(σ² / (δn)) with probability ≥ 2δ for some distribution. (n = sample size, σ² = var(X) < ∞.)

  8. Minimax optimal but not deviation optimal. The empirical mean achieves the minimax rate for estimating E(X), but is suboptimal when deviations are concerned: the squared error of the empirical mean is Ω(σ² / (δn)) with probability ≥ 2δ for some distribution. (n = sample size, σ² = var(X) < ∞.) Note: if the data were Gaussian, the squared error would be O(σ² log(1/δ) / n).
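
To make the two scalings concrete, here is a minimal simulation sketch. The Pareto example, sample size, and δ below are illustrative choices of mine (they are not from the slides), and the Ω(σ²/(δn)) behaviour is a worst-case statement that requires a specially constructed distribution; the snippet only prints the empirical (1 − δ)-quantile of the squared error next to the two scales.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, delta = 200, 50_000, 0.01

# Heavy-tailed example: classical Pareto with tail index a = 2.5
# (finite variance, polynomial tail). numpy's pareto() samples the Lomax
# distribution, so add 1 to get the classical Pareto supported on [1, inf).
a = 2.5
mu = a / (a - 1.0)                           # true mean
sigma2 = a / ((a - 1.0) ** 2 * (a - 2.0))    # true variance

sq_err = ((rng.pareto(a, size=(trials, n)) + 1.0).mean(axis=1) - mu) ** 2

print("(1 - delta)-quantile of squared error      :", np.quantile(sq_err, 1.0 - delta))
print("Gaussian-style scale sigma^2 log(1/delta)/n:", sigma2 * np.log(1.0 / delta) / n)
print("Chebyshev-style scale sigma^2/(delta n)    :", sigma2 / (delta * n))
```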

  9. Main result. A new computationally efficient estimator for least squares linear regression when the distributions of X ∈ ℝ^d and Y ∈ ℝ may have heavy tails.

  10. Main result. A new computationally efficient estimator for least squares linear regression when the distributions of X ∈ ℝ^d and Y ∈ ℝ may have heavy tails. Assuming bounded (4 + ε)-order moments and regularity conditions, the convergence rate is O(σ² d log(1/δ) / n) with probability ≥ 1 − δ as soon as n ≥ Õ(d log(1/δ) + log²(1/δ)). (n = sample size, σ² = optimal squared error.)

  11. Main result. A new computationally efficient estimator for least squares linear regression when the distributions of X ∈ ℝ^d and Y ∈ ℝ may have heavy tails. Assuming bounded (4 + ε)-order moments and regularity conditions, the convergence rate is O(σ² d log(1/δ) / n) with probability ≥ 1 − δ as soon as n ≥ Õ(d log(1/δ) + log²(1/δ)). (n = sample size, σ² = optimal squared error.) Previous state-of-the-art: [Audibert and Catoni, AoS 2011], essentially the same conditions and rate, but computationally inefficient. General technique with many other applications: ridge, Lasso, matrix approximation, etc.

  12. 2. Warm-up: estimating a scalar mean

  13. Warm-up: estimating a scalar mean. Forget X; how do we estimate E(Y)? (Set μ := E(Y) and σ² := var(Y); assume σ² < ∞.)

  14. Empirical mean. Let Y_1, Y_2, ..., Y_n be i.i.d. copies of Y, and set μ̂ := (1/n) Σ_{i=1}^n Y_i (the empirical mean).

  15. Empirical mean. Let Y_1, Y_2, ..., Y_n be i.i.d. copies of Y, and set μ̂ := (1/n) Σ_{i=1}^n Y_i (the empirical mean). There exist distributions for Y with σ² < ∞ such that P( (μ̂ − μ)² ≥ (σ² / (2nδ)) (1 − 2eδ/n)^{n−1} ) ≥ 2δ (Catoni, 2012).

  16. Median-of-means [Nemirovsky and Yudin, 1983; Alon, Matias, and Szegedy, JCSS 1999]

  17. Median-of-means [Nemirovsky and Yudin, 1983; Alon, Matias, and Szegedy, JCSS 1999]. 1. Split the sample {Y_1, ..., Y_n} into k parts S_1, S_2, ..., S_k of equal size (say, randomly). 2. For each i = 1, 2, ..., k: set μ̂_i := mean(S_i). 3. Return μ̂ := median({μ̂_1, μ̂_2, ..., μ̂_k}).

  18. Median-of-means [Nemirovsky and Yudin, 1983; Alon, Matias, and Szegedy, JCSS 1999]. 1. Split the sample {Y_1, ..., Y_n} into k parts S_1, S_2, ..., S_k of equal size (say, randomly). 2. For each i = 1, 2, ..., k: set μ̂_i := mean(S_i). 3. Return μ̂ := median({μ̂_1, μ̂_2, ..., μ̂_k}). Theorem (Folklore). Set k := 4.5 ln(1/δ). With probability at least 1 − δ, (μ̂ − μ)² ≤ O(σ² log(1/δ) / n).
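
A minimal Python sketch of the procedure above (the function name and the use of numpy are my own choices; k is set as in the folklore theorem):

```python
import numpy as np

def median_of_means(y, delta, rng=None):
    """Median-of-means estimate of E[Y] from a 1-d sample `y`.

    Splits the sample into k ~ 4.5 ln(1/delta) random parts of near-equal
    size, averages each part, and returns the median of those averages.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    k = int(np.ceil(4.5 * np.log(1.0 / delta)))
    k = max(1, min(k, len(y)))                    # keep every part non-empty
    parts = np.array_split(rng.permutation(len(y)), k)
    block_means = np.array([y[idx].mean() for idx in parts])
    return float(np.median(block_means))
```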

  19. Analysis of median-of-means. 1. Assume |S_i| = n/k for simplicity. By Chebyshev’s inequality, for each i = 1, 2, ..., k: Pr( |μ̂_i − μ| ≤ √(6σ²k/n) ) ≥ 5/6.

  20. Analysis of median-of-means. 1. Assume |S_i| = n/k for simplicity. By Chebyshev’s inequality, for each i = 1, 2, ..., k: Pr( |μ̂_i − μ| ≤ √(6σ²k/n) ) ≥ 5/6. 2. Let b_i := 1{ |μ̂_i − μ| ≤ √(6σ²k/n) }. By Hoeffding’s inequality, Pr( Σ_{i=1}^k b_i > k/2 ) ≥ 1 − exp(−k/4.5).

  21. Analysis of median-of-means. 1. Assume |S_i| = n/k for simplicity. By Chebyshev’s inequality, for each i = 1, 2, ..., k: Pr( |μ̂_i − μ| ≤ √(6σ²k/n) ) ≥ 5/6. 2. Let b_i := 1{ |μ̂_i − μ| ≤ √(6σ²k/n) }. By Hoeffding’s inequality, Pr( Σ_{i=1}^k b_i > k/2 ) ≥ 1 − exp(−k/4.5). 3. In the event that more than half of the μ̂_i are within √(6σ²k/n) of μ, the median μ̂ is as well.
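
The three steps can be seen at work in a small simulation, reusing the median_of_means sketch above. The two-point distribution and all parameters are illustrative choices of mine, in the spirit of the rare-outlier examples that make the empirical mean fail:

```python
import numpy as np
# assumes median_of_means(...) from the sketch above is in scope

rng = np.random.default_rng(1)
n, trials, delta = 200, 5_000, 0.01

# Rare, huge outliers: Y = 100 with probability 0.005, else 0.
p, M = 0.005, 100.0
mu = p * M

data = M * (rng.random((trials, n)) < p)
err_emp = (data.mean(axis=1) - mu) ** 2
err_mom = np.array([(median_of_means(row, delta, rng) - mu) ** 2 for row in data])

q = 1.0 - delta
print("(1 - delta)-quantile, empirical mean  :", np.quantile(err_emp, q))
print("(1 - delta)-quantile, median-of-means :", np.quantile(err_mom, q))
```

Blocks containing one of the rare outliers have inflated means, but as long as they are fewer than half of the k blocks, the median ignores them; this is exactly step 3 of the analysis.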

  22. Alternative: minimize a robust loss function. An alternative is to minimize a “robust” loss function [Catoni, 2012]: μ̂ := argmin_{μ ∈ ℝ} Σ_{i=1}^n ℓ((μ − Y_i)/s) for a scale parameter s. Example: ℓ(z) := log cosh(z). Optimal rate and constants. Catch: need to know σ².
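
A minimal sketch of such a robust M-estimator with ℓ(z) = log cosh(z), using scipy for the one-dimensional minimization. The scale argument stands in for the σ-dependent tuning the slide's "catch" refers to, and both the function name and the way the scale is supplied are my own illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def robust_mean_logcosh(y, scale):
    """Estimate E[Y] by minimizing sum_i log cosh((mu - y_i) / scale) over mu."""
    y = np.asarray(y, dtype=float)

    def objective(mu):
        z = (mu - y) / scale
        # log cosh(z) = logaddexp(z, -z) - log 2; the additive constant does not
        # change the minimizer, and logaddexp avoids overflow for large |z|.
        return float(np.sum(np.logaddexp(z, -z)))

    return float(minimize_scalar(objective).x)
```

For instance, robust_mean_logcosh(y, scale=np.std(y)) plugs in an empirical scale; the slide's point is that a principled choice of this scale requires knowing σ².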

  23. 3. Linear regression with heavy-tail distributions

  24. Linear regression (for out-of-sample prediction). 1. Response variable: random variable Y ∈ ℝ. 2. Covariates: random vector X ∈ ℝ^d. (Assume Σ := E[XXᵀ] ≻ 0.) 3. Given: a sample S of n i.i.d. copies of (X, Y). 4. Goal: find β̂ = β̂(S) ∈ ℝ^d to minimize the population loss L(β) := E(Y − βᵀX)².

  25. Linear regression (for out-of-sample prediction). 1. Response variable: random variable Y ∈ ℝ. 2. Covariates: random vector X ∈ ℝ^d. (Assume Σ := E[XXᵀ] ≻ 0.) 3. Given: a sample S of n i.i.d. copies of (X, Y). 4. Goal: find β̂ = β̂(S) ∈ ℝ^d to minimize the population loss L(β) := E(Y − βᵀX)². Recall: let β* := argmin_{β′ ∈ ℝ^d} L(β′). For any β ∈ ℝ^d, L(β) − L(β*) = ‖Σ^{1/2}(β − β*)‖² =: ‖β − β*‖²_Σ.
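
The recalled identity follows by expanding the square around β* and using the normal equations E[X(Y − β*ᵀX)] = 0, which kill the cross term; a short derivation (standard, not spelled out on the slide):

```latex
L(\beta) = \mathbb{E}\big(Y - \beta_\star^\top X + (\beta_\star - \beta)^\top X\big)^2
         = L(\beta_\star)
           + 2(\beta_\star - \beta)^\top \mathbb{E}\big[X(Y - \beta_\star^\top X)\big]
           + \mathbb{E}\big((\beta - \beta_\star)^\top X\big)^2
         = L(\beta_\star) + \big\|\Sigma^{1/2}(\beta - \beta_\star)\big\|^2 .
```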

  26. Generalization of median-of-means. 1. Split the sample S into k parts S_1, S_2, ..., S_k of equal size (say, randomly). 2. For each i = 1, 2, ..., k: set β̂_i := ordinary least squares(S_i). 3. Return β̂ := select good one({β̂_1, β̂_2, ..., β̂_k}).

  27. Generalization of median-of-means. 1. Split the sample S into k parts S_1, S_2, ..., S_k of equal size (say, randomly). 2. For each i = 1, 2, ..., k: set β̂_i := ordinary least squares(S_i). 3. Return β̂ := select good one({β̂_1, β̂_2, ..., β̂_k}). Questions: 1. Guarantees for β̂_i = OLS(S_i)? 2. How to select a good β̂_i?
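
A minimal sketch of steps 1–2 (the split-and-fit part); the function name is mine, and it assumes each part has at least d rows so the least-squares problems are well posed:

```python
import numpy as np

def split_ols_estimates(X, y, k, rng=None):
    """Fit ordinary least squares on k random, near-equal-size parts of (X, y).

    Returns a (k, d) array whose rows are the per-part estimates beta_hat_i;
    selecting a good one is a separate step.
    """
    rng = np.random.default_rng() if rng is None else rng
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    parts = np.array_split(rng.permutation(len(y)), k)
    betas = []
    for idx in parts:
        beta_i, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)  # OLS on part i
        betas.append(beta_i)
    return np.vstack(betas)
```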

  28. Ordinary least squares. Under moment conditions*, β̂_i := OLS(S_i) satisfies ‖β̂_i − β*‖_Σ = O(√(σ²d / |S_i|)) with probability at least 5/6 as soon as |S_i| ≥ O(d log d).** (* Requires a kurtosis condition for this simplified bound. ** Can replace d log d with d under some regularity conditions [Srivastava and Vershynin, AoP 2013].)

  29. Ordinary least squares. Under moment conditions*, β̂_i := OLS(S_i) satisfies ‖β̂_i − β*‖_Σ = O(√(σ²d / |S_i|)) with probability at least 5/6 as soon as |S_i| ≥ O(d log d).** Upshot: if k := O(log(1/δ)), then with probability ≥ 1 − δ, more than half of the β̂_i will be within ε := √(σ²d log(1/δ)/n) of β*. (* Requires a kurtosis condition for this simplified bound. ** Can replace d log d with d under some regularity conditions [Srivastava and Vershynin, AoP 2013].)

  30. Selecting a good β̂_i assuming Σ is known. Consider the metric ρ(a, b) := ‖a − b‖_Σ. 1. For each i = 1, 2, ..., k: let r_i := median{ ρ(β̂_i, β̂_j) : j = 1, 2, ..., k }. 2. Let i* := argmin_i r_i. 3. Return β̂ := β̂_{i*}.

  31. Selecting a good β̂_i assuming Σ is known. Consider the metric ρ(a, b) := ‖a − b‖_Σ. 1. For each i = 1, 2, ..., k: let r_i := median{ ρ(β̂_i, β̂_j) : j = 1, 2, ..., k }. 2. Let i* := argmin_i r_i. 3. Return β̂ := β̂_{i*}. Claim: if more than half of the β̂_i are within distance ε of β*, then β̂ is within distance 3ε of β*.
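
A direct transcription of this selection rule as a sketch (the function name is mine; it assumes the candidates from the previous step and the known Σ are passed in):

```python
import numpy as np

def select_by_median_distance(betas, Sigma):
    """Return the candidate beta_hat_{i*} whose median Sigma-distance to the
    other candidates is smallest, where rho(a, b) = sqrt((a-b)^T Sigma (a-b))."""
    betas = np.asarray(betas, dtype=float)
    diffs = betas[:, None, :] - betas[None, :, :]            # (k, k, d) pairwise differences
    quad = np.einsum("ijd,de,ije->ij", diffs, Sigma, diffs)  # (a-b)^T Sigma (a-b)
    dist = np.sqrt(np.maximum(quad, 0.0))
    r = np.median(dist, axis=1)                              # r_i = median_j rho(beta_i, beta_j)
    return betas[int(np.argmin(r))]
```

The claim follows from the triangle inequality: any candidate within ε of β* is within 2ε of every other good candidate, and since the good candidates form a majority its median distance is at most 2ε; hence r_{i*} ≤ 2ε, so β̂_{i*} is within 2ε of at least one good candidate and therefore within 3ε of β*.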

  32. Selecting a good β̂_i when Σ is unknown. General case: Σ is unknown; we can’t compute the distances ‖a − b‖_Σ.

  33. Selecting a good β̂_i when Σ is unknown. General case: Σ is unknown; we can’t compute the distances ‖a − b‖_Σ. Solution: estimate the (k choose 2) pairwise distances using fresh (unlabeled) samples.
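
The slide does not spell out the distance estimator, but one natural sketch uses the identity ‖a − b‖²_Σ = E((a − b)ᵀX)² and replaces the expectation by an average over fresh unlabeled covariates (the function name and this particular plug-in are my own choices):

```python
import numpy as np

def estimate_sigma_distance(a, b, X_fresh):
    """Estimate ||a - b||_Sigma from fresh unlabeled covariate rows `X_fresh`,
    using ||a - b||_Sigma^2 = E[((a - b)^T X)^2]."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    proj = np.asarray(X_fresh, dtype=float) @ d
    return float(np.sqrt(np.mean(proj ** 2)))
```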
