SLIDE 1

Large Scale Machine Learning with Stochastic Gradient Descent

Léon Bottou, leon@bottou.org

Microsoft (since June)

SLIDE 2

Summary

  • i. Learning with Stochastic Gradient Descent.
  • ii. The Tradeoffs of Large Scale Learning.
  • iii. Asymptotic Analysis.
  • iv. Learning with a Single Pass.

SLIDE 3
  • I. Learning with Stochastic Gradient Descent

SLIDE 4

Example

Binary classification
– Patterns x.
– Classes y = ±1.

Linear model
– Choose features: Φ(x) ∈ R^d
– Linear discriminant function: f_w(x) = sign(w⊤Φ(x))

SLIDE 5

SVM training

– Choose a loss function, e.g. the log loss:

  Q(x, y, w) = ℓ(y, f_w(x)) = log(1 + e^{−y w⊤Φ(x)})

– Cannot minimize the expected risk  E(w) = ∫ Q(x, y, w) dP(x, y).

– Can compute the empirical risk  E_n(w) = (1/n) Σ_{i=1}^{n} Q(x_i, y_i, w).

Minimize the L2-regularized empirical risk:

  min_w  (λ/2) ‖w‖² + (1/n) Σ_{i=1}^{n} Q(x_i, y_i, w)

Choosing λ is the same as setting a constraint ‖w‖² < B.
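As a concrete reference, here is a minimal NumPy sketch of this regularized objective; the array names `X` (rows hold the feature vectors Φ(x_i)) and `y` (labels ±1) are illustrative, not from the slides.

```python
import numpy as np

def log_loss(y, score):
    # Q(x, y, w) = log(1 + exp(-y * w.phi(x))), computed stably.
    return np.logaddexp(0.0, -y * score)

def regularized_risk(w, X, y, lam):
    # (lambda/2) ||w||^2 + (1/n) sum_i Q(x_i, y_i, w)
    scores = X @ w                 # rows of X hold the features phi(x_i)
    return 0.5 * lam * (w @ w) + np.mean(log_loss(y, scores))
```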

SLIDE 6

Batch versus Online

Batch: process all examples together (GD)
– Example: minimization by gradient descent. Repeat:

  w ← w − γ [ λw + (1/n) Σ_{i=1}^{n} ∂Q/∂w (x_i, y_i, w) ]

Online: process examples one by one (SGD)
– Example: minimization by stochastic gradient descent. Repeat:
  (a) Pick a random example (x_t, y_t).
  (b) w ← w − γ_t [ λw + ∂Q/∂w (x_t, y_t, w) ]
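A minimal sketch of the SGD branch on the regularized log loss from the previous slide; the gain schedule γ_t = γ_0 (1 + γ_0 λ t)^{−1} is borrowed from the ALPHA experiment later in the deck, not prescribed here.

```python
import numpy as np

def sgd(X, y, lam, gamma0=0.1, epochs=1, seed=0):
    """Plain SGD on the L2-regularized log loss (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            gamma = gamma0 / (1.0 + gamma0 * lam * t)   # decreasing gain
            margin = y[i] * (X[i] @ w)
            # dQ/dw = -y * sigmoid(-margin) * phi(x), written stably
            grad = lam * w - y[i] * X[i] * 0.5 * (1.0 - np.tanh(0.5 * margin))
            w -= gamma * grad
            t += 1
    return w
```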

SLIDE 7

Second order optimization

Batch: (2GD)
– Example: Newton's algorithm. Repeat:

  w ← w − H⁻¹ [ λw + (1/n) Σ_{i=1}^{n} ∂Q/∂w (x_i, y_i, w) ]

Online: (2SGD)
– Example: second-order stochastic gradient descent. Repeat:
  (a) Pick a random example (x_t, y_t).
  (b) w ← w − γ_t H⁻¹ [ λw + ∂Q/∂w (x_t, y_t, w) ]
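A sketch of the 2SGD update under the (strong) assumption that an inverse Hessian estimate `H_inv` is already available; maintaining such an estimate is exactly the practical obstacle discussed on a later slide.

```python
import numpy as np

def second_order_sgd(X, y, lam, H_inv, seed=0):
    """One pass of 2SGD: precondition each stochastic gradient by H^{-1}.

    H_inv is assumed given (hypothetical input); gamma_t = 1/t here.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t, i in enumerate(rng.permutation(n), start=1):
        margin = y[i] * (X[i] @ w)
        grad = lam * w - y[i] * X[i] * 0.5 * (1.0 - np.tanh(0.5 * margin))
        w -= (1.0 / t) * (H_inv @ grad)   # preconditioned stochastic step
    return w
```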

SLIDE 8

More SGD Algorithms

Adaline (Widrow and Hoff, 1960)

  Q_adaline = (1/2) (y − w⊤Φ(x))²,  Φ(x) ∈ R^d, y = ±1

  w ← w + γ_t (y_t − w⊤Φ(x_t)) Φ(x_t)

Perceptron (Rosenblatt, 1957)

  Q_perceptron = max{0, −y w⊤Φ(x)},  Φ(x) ∈ R^d, y = ±1

  w ← w + γ_t y_t Φ(x_t)  if y_t w⊤Φ(x_t) ≤ 0;  w unchanged otherwise

Multilayer perceptrons (Rumelhart et al., 1986) ...
SVM (Cortes and Vapnik, 1995) ...

Lasso (Tibshirani, 1996)

  Q_lasso = λ |w|_1 + (1/2) (y − w⊤Φ(x))²,
  with w = (u_1 − v_1, ..., u_d − v_d),  Φ(x) ∈ R^d, y ∈ R, λ > 0

  u_i ← [ u_i − γ_t (λ − (y_t − w⊤Φ(x_t)) Φ_i(x_t)) ]_+
  v_i ← [ v_i − γ_t (λ + (y_t − w⊤Φ(x_t)) Φ_i(x_t)) ]_+

  with notation [x]_+ = max{0, x}.

K-Means (MacQueen, 1967)

  Q_kmeans = min_k (1/2) (z − w_k)²,
  with z ∈ R^d, centres w_1 ... w_k ∈ R^d, counts n_1 ... n_k ∈ N, initially 0.

  k* = arg min_k (z_t − w_k)²
  n_{k*} ← n_{k*} + 1
  w_{k*} ← w_{k*} + (1/n_{k*}) (z_t − w_{k*})
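Of these, the K-Means update is easy to read as SGD with gain 1/n_{k*}: each point pulls its nearest centre toward it. A small sketch (seeding the centres with the first k points is an arbitrary choice, not from the slide):

```python
import numpy as np

def online_kmeans(Z, k, seed=0):
    """MacQueen-style online k-means: each point moves its nearest
    centre by a 1/n_k step, the update shown above (sketch)."""
    rng = np.random.default_rng(seed)
    Z = Z[rng.permutation(len(Z))]
    centers = Z[:k].astype(float).copy()   # arbitrary seeding choice
    counts = np.ones(k, dtype=int)         # each seed point counts once
    for z in Z[k:]:
        j = int(np.argmin(((centers - z) ** 2).sum(axis=1)))  # nearest centre
        counts[j] += 1
        centers[j] += (z - centers[j]) / counts[j]            # gain 1/n_k*
    return centers
```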

SLIDE 9
  • II. The Tradeoffs of Large Scale Learning

SLIDE 10

The Computational Problem

  • Baseline large-scale learning algorithm
    Randomly discarding data is the simplest way to handle large datasets.
    – What is the statistical benefit of processing more data?
    – What is the computational cost of processing more data?

  • We need a theory that links Statistics and Computation!
    – 1967: Vapnik's theory does not discuss computation.
    – 1981: Valiant's learnability excludes exponential-time algorithms,
      but (i) polynomial time is already too slow, (ii) few actual results.

SLIDE 11

Decomposition of the Error

E(f̃_n) − E(f*) =   E(f*_F) − E(f*)    Approximation error (E_app)
                  + E(f_n) − E(f*_F)    Estimation error (E_est)
                  + E(f̃_n) − E(f_n)    Optimization error (E_opt)

Here f* minimizes the expected risk over all functions, f*_F minimizes it over the family F, f_n minimizes the empirical risk over F, and f̃_n is the approximate empirical minimizer actually returned by the optimizer with tolerance ρ, i.e. E_n(f̃_n) ≤ E_n(f_n) + ρ.

Problem: Choose F, n, and ρ to make this as small as possible, subject to budget constraints:
– max number of examples n
– max computing time T

Note: choosing λ is the same as choosing F.

SLIDE 12

Small-scale Learning

“The active budget constraint is the number of examples.”

  • To reduce the estimation error, take n as large as the budget allows.
  • To reduce the optimization error to zero, take ρ = 0.
  • We need to adjust the size of F.

[Figure: as the size of F increases, the estimation error grows while the approximation error shrinks.]

See Structural Risk Minimization (Vapnik, 1974) and later works.

SLIDE 13

Large-scale Learning

“The active budget constraint is the computing time.”

  • More complicated tradeoffs.

The computing time depends on the three variables: F, n, and ρ.

  • Example.

If we choose a small ρ, we decrease the optimization error. But we must then also reduce the size of F and/or n, with adverse effects on the estimation and approximation errors.

  • The exact tradeoff depends on the optimization algorithm.
  • We can compare optimization algorithms rigorously.

SLIDE 14

Test Error versus Learning Time

[Figure: test error versus computing time; the curve decreases toward the Bayes limit.]

SLIDE 15

Test Error versus Learning Time

[Figure: test error versus computing time for 10,000, 100,000, and 1,000,000 examples, each curve approaching the Bayes limit.]

Vary the number of examples...

SLIDE 16

Test Error versus Learning Time

[Figure: the same plot with separate curves for optimizers a, b, c and models I–IV at each dataset size.]

Vary the number of examples, the statistical models, the algorithms, ...

SLIDE 17

Test Error versus Learning Time

[Figure: the same plot, now highlighting the envelope of good learning algorithms across optimizers a, b, c and models I–IV.]

Not all combinations are equal. Let’s compare the red curve for different optimization algorithms.

SLIDE 18
  • III. Asymptotic Analysis

SLIDE 19

Asymptotic Analysis

  E(f̃_n) − E(f*) = E = E_app + E_est + E_opt

All three errors must decrease at comparable rates. Forcing one of the errors to decrease much faster
  • would require additional computing effort,
  • but would not significantly improve the test error.

SLIDE 20

Statistics

Asymptotics of the statistical components of the error
– Thanks to refined uniform convergence arguments:

  E = E_app + E_est + E_opt ∼ E_app + ((log n)/n)^α + ρ

  with exponent 1/2 ≤ α ≤ 1.

Asymptotically effective large-scale learning
– Must choose F, n, and ρ such that

  E ∼ E_app ∼ E_est ∼ E_opt ∼ ((log n)/n)^α ∼ ρ.

What about optimization times?

SLIDE 21

Statistics and Computation

                         GD                     2GD                                   SGD    2SGD
  Time per iteration:    n                      n                                     1      1
  Iters to accuracy ρ:   log(1/ρ)               log log(1/ρ)                          1/ρ    1/ρ
  Time to accuracy ρ:    n log(1/ρ)             n log log(1/ρ)                        1/ρ    1/ρ
  Time to error E:       (1/E^{1/α}) log²(1/E)  (1/E^{1/α}) log(1/E) log log(1/E)     1/E    1/E

– 2GD optimizes much faster than GD.
– SGD optimization speed is catastrophic.
– SGD learns faster than both GD and 2GD.
– 2SGD only changes the constants.

SLIDE 22

Experiment: Text Categorization

Dataset
– Reuters RCV1 document corpus.
– 781,265 training examples, 23,149 testing examples.

Task
– Recognizing documents of category CCAT.
– 47,152 TF-IDF features.
– Linear SVM.

Same setup as (Joachims, 2006) and (Shalev-Shwartz et al., 2007), using plain SGD.

SLIDE 23

Experiment: Text Categorization

  • Results: Hinge-loss SVM
    Q(x, y, w) = max{0, 1 − y w⊤Φ(x)},  λ = 0.0001

                Training Time   Primal cost   Test Error
    SVMLight    23,642 secs     0.2275        6.02%
    SVMPerf     66 secs         0.2278        6.03%
    SGD         1.4 secs        0.2275        6.02%

  • Results: Log-loss SVM
    Q(x, y, w) = log(1 + exp(−y w⊤Φ(x))),  λ = 0.00001

                                  Training Time   Primal cost   Test Error
    TRON (LibLinear, ε = 0.01)    30 secs         0.18907       5.68%
    TRON (LibLinear, ε = 0.001)   44 secs         0.18890       5.70%
    SGD                           2.3 secs        0.18893       5.66%

SLIDE 24

The Wall

[Figure: testing cost versus training time (secs) for SGD and TRON (LibLinear), annotated with the optimization accuracy (trainingCost − optimalTrainingCost) reached, from 0.1 down to 1e−09.]

SLIDE 25
  • IV. Learning with a Single Pass

SLIDE 26

Batch and online paths

[Figure: paths in parameter space. BATCH runs many iterations on the examples {z1...zt} toward w*_t, the best training-set error; ONLINE makes one pass over {z1...zt}. Both approach w*, the true solution with best generalization.]

SLIDE 27

Effect of one Additional Example (i)

Compare

  w*_n     = arg min_w E_n(f_w)

  w*_{n+1} = arg min_w E_{n+1}(f_w)
           = arg min_w [ E_n(f_w) + (1/n) ℓ(f_w(x_{n+1}), y_{n+1}) ]

[Figure: the minimizer moves from w*_n to w*_{n+1} as the objective changes from E_n(f_w) to E_{n+1}(f_w).]

SLIDE 28

Effect of one Additional Example (ii)

  • First-Order Calculation

    w*_{n+1} = w*_n − (1/n) H_{n+1}⁻¹ (∂ℓ/∂w)(f_{w*_n}(x_{n+1}), y_{n+1}) + O(1/n²)

    where H_{n+1} is the empirical Hessian on n + 1 examples.

  • Compare with Second-Order Stochastic Gradient Descent:

    w_{t+1} = w_t − (1/t) H⁻¹ (∂ℓ/∂w)(f_{w_t}(x_t), y_t)

  • Could they converge at the same speed?
  • C² assumptions ⇒ accurate speed estimates.

SLIDE 29

Speed of Scaled Stochastic Gradient

  • Study  w_{t+1} = w_t − (1/t) B_t (∂ℓ/∂w)(f_{w_t}(x_t), y_t) + O(1/t²)
    with B_t → B ≻ 0 and BH ≻ I/2.
  • Establish almost-sure convergence via quasi-martingales (see Bottou, 1991, 1998).
  • Let U_t = H (w_t − w*)(w_t − w*)⊤. Observe E(f_{w_t}) − E(f_{w*}) = tr(U_t) + o(tr(U_t)).
  • Derive  E_t(U_{t+1}) = (I − 2BH/t + o(1/t)) U_t + HBGB/t² + o(1/t²),
    where G is the Fisher matrix.
  • Lemma: study the real sequence u_{t+1} = (1 − α/t + o(1/t)) u_t + β/t² + o(1/t²).
    – When α > 1, show u_t = (β/(α−1)) (1/t) + o(1/t)  (nasty proof!).
    – When α < 1, show u_t ∼ t^{−α} (up to log factors).
  • Bracket E(tr(U_{t+1})) between two such sequences and conclude:

    tr(HBGB)/(2λ_max(BH) − 1) · (1/t) + o(1/t)
      ≤ E[ E(f_{w_t}) − E(f_{w*}) ]
      ≤ tr(HBGB)/(2λ_min(BH) − 1) · (1/t) + o(1/t)

  • Interesting special cases: B = I/λ_min(H) and B = H⁻¹.
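The α > 1 case of the lemma above is easy to probe numerically. A quick sketch with the o(·) terms dropped (the values of α and β are chosen arbitrarily):

```python
# Iterate u_{t+1} = (1 - alpha/t) u_t + beta/t^2 and check that
# t * u_t approaches beta / (alpha - 1) when alpha > 1.
alpha, beta = 2.0, 1.0
u = 0.5                      # arbitrary starting value
for t in range(1, 1_000_000):
    u = (1.0 - alpha / t) * u + beta / t ** 2
print(t * u)                 # ~= beta / (alpha - 1) = 1.0
```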

SLIDE 30

Asymptotic Efficiency of Second Order SGD.

“Empirical optima” versus “Second-order SGD”:

  lim_{n→∞} n E[ E(f_{w*_n}) − E(f_F) ] = lim_{t→∞} t E[ E(f_{w_t}) − E(f_F) ]

  lim_{n→∞} n E[ ‖w*_∞ − w*_n‖² ] = lim_{t→∞} t E[ ‖w_∞ − w_t‖² ]

[Figure: the empirical optima w*_n (best training-set error) and one pass of second-order stochastic gradient starting from w_0 stay within K/n of each other; both converge to w* = w_∞, the best solution in F.]

(Fabian, 1973; Murata & Amari, 1998; Bottou & LeCun, 2003).

SLIDE 31

Optimal Learning in One Pass

A single pass of second-order stochastic gradient generalizes as well as the empirical optimum.

Experiments on synthetic data:

[Figure: two panels plotting mean squared error (from Mse* + 1e−4 up to Mse* + 1e−1) against the number of examples and against milliseconds.]

SLIDE 32

Unfortunate Issues

Unfortunate theoretical issue
– How long to “reach” the asymptotic regime?
– The one-pass learning speed regime may not be reached in one pass...

Unfortunate practical issue
– Second-order SGD is rarely feasible, since one must:
  – estimate and store the d × d matrix H⁻¹;
  – multiply the gradient for each example by this matrix H⁻¹.

SLIDE 33

Solutions

Limited-storage approximations of H⁻¹
– Diagonal Gauss-Newton (Becker and LeCun, 1989)
– Low-rank approximation [oLBFGS] (Schraudolph et al., 2007)
– Diagonal approximation [SGDQN] (Bordes et al., 2009)

Averaged stochastic gradient
– Perform SGD with slowly decreasing gains, e.g. γ_t ∼ t^{−0.75}.
– Compute the averages  w̄_{t+1} = (t/(t+1)) w̄_t + (1/(t+1)) w_{t+1}
– Same asymptotic speed as 2SGD (Polyak and Juditsky, 1992).
– Can take a while to “reach” the asymptotic regime.
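A minimal sketch of this averaging recipe, reusing the regularized log loss from earlier slides; the γ_t exponent −0.75 follows the ASGD schedule quoted on the next slide, and all other choices are illustrative.

```python
import numpy as np

def averaged_sgd(X, y, lam, gamma0=1.0, seed=0):
    """Averaged SGD sketch: run SGD with slowly decreasing gains and
    return the running average of the iterates, not the last iterate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for t, i in enumerate(rng.permutation(n), start=1):
        gamma = gamma0 * (1.0 + gamma0 * lam * t) ** -0.75
        margin = y[i] * (X[i] @ w)
        grad = lam * w - y[i] * X[i] * 0.5 * (1.0 - np.tanh(0.5 * margin))
        w -= gamma * grad
        w_bar += (w - w_bar) / t   # w_bar_t = ((t-1) w_bar_{t-1} + w_t) / t
    return w_bar
```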

SLIDE 34

Experiment: ALPHA dataset

– From the 2008 Pascal Large Scale Learning Challenge.
– Loss: Q(x, y, w) = (max{0, 1 − y w⊤x})²  (squared hinge loss).
– SGD, SGDQN: γ_t = γ_0 (1 + γ_0 λ t)^{−1}.  ASGD: γ_t = γ_0 (1 + γ_0 λ t)^{−0.75}.

[Figure: expected risk and test error (%) over epochs 1–5 for SGD, SGDQN, and ASGD.]

ASGD nearly reaches the optimal expected risk after a single pass.
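For concreteness, a sketch of the two pieces this experiment specifies: the squared-hinge gradient and the two gain schedules (γ_0 is not given on the slide and would need tuning).

```python
import numpy as np

def squared_hinge_grad(w, x, y):
    # dQ/dw for Q = max(0, 1 - y * w.x)^2: zero outside the margin,
    # -2 * (1 - y w.x) * y * x inside it.
    m = 1.0 - y * (x @ w)
    return -2.0 * m * y * x if m > 0.0 else np.zeros_like(w)

def gain(t, gamma0, lam, averaged=False):
    # SGD/SGDQN schedule uses exponent -1; the ASGD schedule uses -0.75.
    return gamma0 * (1.0 + gamma0 * lam * t) ** (-0.75 if averaged else -1.0)
```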

SLIDE 35

Experiment: Conditional Random Field

– CRF for the CoNLL 2000 chunking task.
– 1.7M parameters, 107,000 training segments.

[Figure: test loss and test FB1 score over 15 epochs for SGD, SGDQN, and ASGD.]

On this task SGDQN is more attractive than ASGD. Training times: 500 s (SGD), 150 s (ASGD), 75 s (SGDQN). A standard LBFGS optimizer needs 72 minutes.

SLIDE 36
  • V. Conclusions

SLIDE 37

Conclusions

– Good optimization algorithm ≠ good learning algorithm.
– SGD is a poor optimization algorithm.
– SGD is a good learning algorithm for large-scale problems.
– SGD variants can learn in a single pass (given enough data).
