slide-1
SLIDE 1

compsci 514: algorithms for data science

Cameron Musco, University of Massachusetts Amherst. Fall 2019. Lecture 21.

slide-2
SLIDE 2

summary

Last Class:

  • Stochastic gradient descent (SGD).
  • Online optimization and online gradient descent (OGD).
  • Analysis of SGD as a special case of online gradient descent.

This Class:

  • Finish discussion of SGD.
  • Understanding gradient descent and SGD as applied to least squares regression.
  • Connections to more advanced techniques: accelerated methods and adaptive gradient methods.

slide-4
SLIDE 4

logistics

This class wraps up the optimization unit. Three remaining classes after break. Give your feedback on Piazza about what you’d like to see.

  • High dimensional geometry and connections to random projection.
  • Randomized methods for fast approximate SVD, eigendecomposition, and regression.
  • Fourier methods, compressed sensing, sparse recovery.
  • More advanced optimization methods (alternating minimization, k-means clustering, ...).
  • Fairness and differential privacy.


slide-5
SLIDE 5

quick review

Gradient Descent:

  • Applies to: any differentiable f : R^d → R.
  • Goal: find θ̂ ∈ R^d with f(θ̂) ≤ min_{θ ∈ R^d} f(θ) + ϵ.
  • Update step: θ^(i+1) = θ^(i) − η∇f(θ^(i)).

Online Gradient Descent:

  • Applies to: f_1, f_2, . . . , f_t : R^d → R presented online.
  • Goal: pick θ^(1), . . . , θ^(t) ∈ R^d in an online fashion with ∑_{i=1}^t f_i(θ^(i)) ≤ min_{θ ∈ R^d} ∑_{i=1}^t f_i(θ) + ϵ (i.e., achieve regret ≤ ϵ).
  • Update step: θ^(i+1) = θ^(i) − η∇f_i(θ^(i)).

Stochastic Gradient Descent:

  • Applies to: f : R^d → R that can be written as f(θ) = ∑_{i=1}^n f_i(θ).
  • Goal: find θ̂ ∈ R^d with f(θ̂) ≤ min_{θ ∈ R^d} f(θ) + ϵ.
  • Update step: θ^(i+1) = θ^(i) − η∇f_{j_i}(θ^(i)), where j_i is chosen uniformly at random from {1, . . . , n}.
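
A minimal numpy sketch of the gradient descent and SGD update steps (the function names, the gradient oracles grad_f and grad_fi, and the choice to return the average iterate are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, theta0, eta, t):
    # Full gradient step: theta^(i+1) = theta^(i) - eta * grad f(theta^(i)).
    theta = theta0.copy()
    for _ in range(t):
        theta = theta - eta * grad_f(theta)
    return theta

def sgd(grad_fi, n, theta0, eta, t, rng=np.random.default_rng(0)):
    # Stochastic step: sample j uniformly from {0, ..., n-1} and move along grad f_j.
    theta = theta0.copy()
    iterates = [theta]
    for _ in range(t):
        j = rng.integers(n)
        theta = theta - eta * grad_fi(theta, j)
        iterates.append(theta)
    # Returning the average iterate matches the quantity bounded by the SGD/OGD analysis.
    return np.mean(iterates, axis=0)
```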


slide-9
SLIDE 9

stochastic gradient analysis recap

Minimizing a finite sum function: f(θ) = ∑_{i=1}^n f_i(θ).

  • Stochastic gradient descent is identical to online gradient descent run on the sequence of t functions f_{j_1}, f_{j_2}, . . . , f_{j_t}.
  • These functions are picked uniformly at random, so in expectation, E[∑_{i=1}^t f_{j_i}(θ^(i))] = E[∑_{i=1}^t f(θ^(i))].
  • By convexity, θ̂ = (1/t) ∑_{i=1}^t θ^(i) gives only a better solution, i.e., E[∑_{i=1}^t f(θ̂)] ≤ E[∑_{i=1}^t f(θ^(i))].
  • Quality directly bounded by the regret analysis for online gradient descent!


slide-12
SLIDE 12

sgd vs. gd

Stochastic gradient descent generally requires more iterations than gradient descent, but each iteration is much cheaper (by a factor of n): it uses a single ∇f_j(θ) instead of the full gradient ∇f(θ) = ∇∑_{j=1}^n f_j(θ).

slide-13
SLIDE 13

sgd vs. gd

Consider f(θ) = ∑_{j=1}^n f_j(θ) with each f_j convex.

Theorem – SGD: If ∥∇f_j(θ)∥₂ ≤ G/n for all θ, then after t ≥ R²G²/ϵ² iterations SGD outputs θ̂ satisfying E[f(θ̂)] ≤ f(θ*) + ϵ.

Theorem – GD: If ∥∇f(θ)∥₂ ≤ Ḡ for all θ, then after t ≥ R²Ḡ²/ϵ² iterations GD outputs θ̂ satisfying f(θ̂) ≤ f(θ*) + ϵ.

∥∇f(θ)∥₂ = ∥∇f_1(θ) + . . . + ∇f_n(θ)∥₂ ≤ ∑_{j=1}^n ∥∇f_j(θ)∥₂ ≤ n · (G/n) = G.

When would this bound be tight, i.e., SGD takes the same number of iterations as GD? When is it loose, i.e., SGD performs very poorly compared to GD?
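
A tiny numpy check of the bound ∥∇f(θ)∥₂ ≤ ∑_j ∥∇f_j(θ)∥₂ at a single point (the synthetic "gradients" are illustrative): the bound is essentially tight when the per-function gradients all point the same way, and very loose when they nearly cancel, which is the regime where SGD's worst-case iteration bound is far worse than GD's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten per-function gradients grad f_j in R^5, in two regimes.
aligned = np.tile(rng.normal(size=5), (10, 1))      # all gradients identical: bound tight
cancelling = rng.normal(size=(10, 5))
cancelling -= cancelling.mean(axis=0)               # gradients sum to ~0: bound very loose

for name, grads in [("aligned", aligned), ("cancelling", cancelling)]:
    full_norm = np.linalg.norm(grads.sum(axis=0))   # ||grad f|| = ||sum_j grad f_j||
    sum_norms = np.linalg.norm(grads, axis=1).sum() # sum_j ||grad f_j||
    print(f"{name}: ||grad f|| = {full_norm:.2f}  vs  sum of norms = {sum_norms:.2f}")
```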


slide-17
SLIDE 17

sgd vs. gd

Roughly: SGD performs well compared to GD when ∑_{j=1}^n ∥∇f_j(θ)∥₂ is small compared to ∥∇f(θ)∥₂.

∑_{j=1}^n ∥∇f_j(θ)∥₂² − ∥∇f(θ)∥₂² = ∑_{j=1}^n ∥∇f_j(θ) − ∇f(θ)∥₂² (good exercise)

Reducing this variance is a key technique used to increase the performance of SGD:

  • mini-batching
  • stochastic variance reduced gradient descent (SVRG)
  • stochastic average gradient (SAG)
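
A minimal sketch of the simplest of these ideas, mini-batching (the function signature mirrors the sgd sketch above and is illustrative): averaging the stochastic gradient over a small random batch reduces its variance at the cost of a slightly more expensive step.

```python
import numpy as np

def minibatch_sgd(grad_fi, n, theta0, eta, t, batch_size=32, rng=np.random.default_rng(0)):
    # Each step averages grad f_j over a random batch instead of using a single j;
    # the averaged gradient is still unbiased but has lower variance.
    theta = theta0.copy()
    for _ in range(t):
        batch = rng.integers(n, size=batch_size)
        grad = np.mean([grad_fi(theta, j) for j in batch], axis=0)
        theta = theta - eta * grad
    return theta
```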


slide-20
SLIDE 20

test of intuition

What does f_1(θ) + f_2(θ) + f_3(θ) look like?

[Plot: three convex functions f_1, f_2, f_3 plotted against θ.]

A sum of convex functions is always convex (good exercise).
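
For the exercise, the check is one line from the definition of convexity (a standard argument, not spelled out on the slide):

```latex
% Each f_j is convex, so for any \theta_1, \theta_2 and \lambda \in [0,1]:
%   f_j(\lambda\theta_1 + (1-\lambda)\theta_2) \le \lambda f_j(\theta_1) + (1-\lambda) f_j(\theta_2).
% Summing over j shows f = \sum_j f_j is convex:
f(\lambda\theta_1 + (1-\lambda)\theta_2)
  = \sum_{j} f_j(\lambda\theta_1 + (1-\lambda)\theta_2)
  \le \sum_{j} \big[\lambda f_j(\theta_1) + (1-\lambda) f_j(\theta_2)\big]
  = \lambda f(\theta_1) + (1-\lambda) f(\theta_2).
```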


slide-23
SLIDE 23

rest of today

Linear Algebra + Convex Optimization

slide-24
SLIDE 24

iterative optimization for least squares regression

Least Squares Regression: Given data matrix X ∈ R^{n×d} and label vector y ∈ R^n: f(θ) = ∥Xθ − y∥₂².

Writing the SVD X = UΣVᵀ, the optimum is given by θ* = VΣ⁻¹Uᵀy, and Xθ* = UUᵀy, the projection of y onto the column span of X.

Why solve with an iterative method (e.g., gradient descent)?
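
A small numpy sanity check of the closed form on synthetic data (the sizes, data, and cost remark are illustrative): the SVD solution matches the library least squares solver, and the cost comparison is the usual motivation for iterative methods.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Closed-form optimum via the SVD: theta* = V Sigma^{-1} U^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
theta_star = Vt.T @ ((U.T @ y) / s)

theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_star, theta_lstsq))   # True: both give the optimum

# The SVD costs roughly O(n d^2) time, while a single gradient descent step costs
# only O(n d), which is why iterative methods can be attractive for large problems.
```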


slide-29
SLIDE 29

least squares regression reformulation

Least Squares Regression: Given data matrix X ∈ R^{n×d} and label vector y ∈ R^n: f(θ) = ∥Xθ − y∥₂².

Claim 1: f(θ) = ∥Xθ − Xθ*∥₂² + c = ∥X(θ − θ*)∥₂² + c.

Claim 2: ∇f(θ) = 2XᵀXθ − 2Xᵀy = 2Xᵀ(Xθ − y), where Xθ − y is the residual.

Gradient Descent Update: θ^(i+1) = θ^(i) − 2ηXᵀ(Xθ^(i) − y) = θ^(i) − 2η ∑_{j=1}^n x_j · r_{i,j},

where r_{i,j} = x_jᵀθ^(i) − y_j is the residual for data point j at step i.
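
A minimal numpy sketch of this gradient descent update in the residual form (function name and step size choice are illustrative; a step size around 1/(2λ_max(XᵀX)) is a standard safe choice):

```python
import numpy as np

def gd_least_squares(X, y, t):
    # Gradient descent on f(theta) = ||X theta - y||_2^2:
    # theta <- theta - 2 * eta * X^T (X theta - y).
    n, d = X.shape
    eta = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / (2 * lambda_max(X^T X))
    theta = np.zeros(d)
    for _ in range(t):
        residual = X @ theta - y                  # r_{i,j}, one entry per data point
        theta = theta - 2 * eta * X.T @ residual
    return theta
```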


slide-34
SLIDE 34

sgd for regression

Least Squares Regression: Given data matrix X ∈ R^{n×d} and label vector y ∈ R^n:
f(θ) = ∥Xθ − y∥₂² = ∑_{j=1}^n (x_jᵀθ − y_j)² = ∑_{j=1}^n f_j(θ).

Claim 3: ∇f_j(θ) = 2(x_jᵀθ − y_j) · x_j, where x_jᵀθ − y_j is the residual.

SGD Update: Pick random j ∈ {1, . . . , n} and set:
θ^(i+1) = θ^(i) − η∇f_j(θ^(i)) = θ^(i) − 2η x_j · r_{i,j},   versus   θ^(i+1) = θ^(i) − 2η ∑_{j=1}^n x_j · r_{i,j} for GD,

where r_{i,j} = x_jᵀθ^(i) − y_j is the residual for data point j at step i. Make a small correction for a single data point in each step, in the direction of that data point.
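
A minimal numpy sketch of this single-point update (function name and step size are illustrative):

```python
import numpy as np

def sgd_least_squares(X, y, eta, t, rng=np.random.default_rng(0)):
    # SGD on f(theta) = sum_j (x_j^T theta - y_j)^2: each step corrects theta using one
    # random data point, moving along x_j scaled by its residual.
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(t):
        j = rng.integers(n)
        r = X[j] @ theta - y[j]            # residual r_{i,j} for the sampled point
        theta = theta - 2 * eta * r * X[j]
    return theta
```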


slide-41
SLIDE 41

gradient descent as polynomial approximation

Gradient Descent for Regression: θ^(i+1) = θ^(i) − η∇f(θ^(i)) = θ^(i) − 2ηXᵀ(Xθ^(i) − y). Initialize θ^(1) = 0.

θ^(2) = 0 − 2ηXᵀ(X·0 − y) = 2ηXᵀy.

θ^(3) = 2ηXᵀy − 2ηXᵀ(2ηXXᵀy − y) = 4ηXᵀy − 4η²(XᵀX)Xᵀy = 4η(I − ηXᵀX)Xᵀy.

θ^(4) = θ^(3) − 2ηXᵀ(Xθ^(3) − y) = 6ηXᵀy − 12η²(XᵀX)Xᵀy + 8η³(XᵀX)²Xᵀy.

θ^(t) = p_t(XᵀX) · Xᵀy ≈ θ* = (XᵀX)⁻¹Xᵀy, where p_t is a degree t − 2 polynomial.
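
A small numpy check of this claim on synthetic data (sizes and step size are arbitrary): running the recursion θ^(i+1) = (I − 2ηXᵀX)θ^(i) + 2ηXᵀy alongside gradient descent shows that the iterate really is a polynomial in XᵀX applied to Xᵀy, and that it approaches θ*.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

M = X.T @ X
eta = 1.0 / (2 * np.linalg.eigvalsh(M).max())    # step size 1 / (2 * lambda_max)

# Build p_t(M) X^T y term by term: theta^(t) = 2*eta * sum_{k=0}^{t-2} (I - 2*eta*M)^k X^T y.
theta = np.zeros(d)
poly = np.zeros(d)
term = 2 * eta * (X.T @ y)
for _ in range(30):
    poly += term
    term = (np.eye(d) - 2 * eta * M) @ term
    theta = theta - 2 * eta * X.T @ (X @ theta - y)

theta_star = np.linalg.solve(M, X.T @ y)
print(np.allclose(theta, poly))                  # GD iterate equals p_t(X^T X) X^T y
print(np.linalg.norm(theta - theta_star))        # and is close to theta*
```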


slide-48
SLIDE 48

gradient descent as polynomial approximation

Upshot: Gradient descent computes θ^(t) = p_t(XᵀX) · Xᵀy ≈ (XᵀX)⁻¹Xᵀy = θ*.

View in the eigendecomposition: θ^(t) is close to θ* when p_t(x) is a good approximation of 1/x on the eigenvalues of XᵀX.

[Plot: p₁₀(x) and p₃₀(x) compared to 1/x.]

This is one of the most basic Krylov subspace methods. Related methods: Chebyshev iteration, Jacobi iteration, conjugate gradient, accelerated gradient descent, heavy ball methods, ...
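
One way to make the eigendecomposition view explicit (a standard calculation, assumed here rather than taken from the slides):

```latex
% Write X^T X = V \Lambda V^T with eigenvalues \lambda_1, \dots, \lambda_d. Then
% p_t(X^T X) = V\, p_t(\Lambda)\, V^T, so the error of the gradient descent iterate is
\theta^{(t)} - \theta^*
  = \big(p_t(X^T X) - (X^T X)^{-1}\big) X^T y
  = V \,\mathrm{diag}\big(p_t(\lambda_k) - 1/\lambda_k\big)\, V^T X^T y,
% which is small exactly when p_t(x) \approx 1/x on every eigenvalue \lambda_k of X^T X.
```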


slide-55
SLIDE 55

conditioning

Gradient descent for least squares regression requires a lot of iterations when the eigenvalues of XᵀX are spread out. Formally:

  • Is f(θ) = ∥Xθ − y∥₂² = ∥X(θ − θ*)∥₂² Lipschitz?
  • A convex function f : R^d → R is β-smooth and α-strongly convex if for all θ_1, θ_2:
    (α/2)∥θ_1 − θ_2∥₂² ≤ ∇f(θ_1)ᵀ(θ_1 − θ_2) − [f(θ_1) − f(θ_2)] ≤ (β/2)∥θ_1 − θ_2∥₂².
  • f(θ) is β = λ_max(XᵀX) smooth and α = λ_min(XᵀX) strongly convex.


slide-59
SLIDE 59

conditioning

Theorem: For any α-strongly convex and β-smooth function f(θ), GD initialized with θ^(1) within a radius R of θ* and run for t = O((β/α) · log(1/ϵ)) iterations returns θ̂ with ∥θ̂ − θ*∥₂ ≤ ϵR.

For least squares regression, α = λ_min(XᵀX), β = λ_max(XᵀX), and β/α is called the condition number κ.
slide-60
SLIDE 60

conditioning

Recall: f(θ) = ∥X(θ − θ*)∥₂².

How can we mitigate this issue? Scale the directions to make the surface more 'round'. This is the idea behind adaptive gradient methods (AdaGrad, RMSprop, Adam) and quasi-Newton methods (BFGS, L-BFGS, ...).
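
A rough sketch of this rescaling idea in the style of AdaGrad, applied to the SGD update for least squares (illustrative only; not the exact method from any of the papers above): each coordinate's step is divided by the square root of its accumulated squared gradients, so steep directions take smaller steps.

```python
import numpy as np

def adagrad_least_squares(X, y, eta, t, eps=1e-8, rng=np.random.default_rng(0)):
    n, d = X.shape
    theta = np.zeros(d)
    sq_grad_sum = np.zeros(d)
    for _ in range(t):
        j = rng.integers(n)
        grad = 2 * (X[j] @ theta - y[j]) * X[j]       # single-example gradient
        sq_grad_sum += grad ** 2                      # per-coordinate history of squared gradients
        theta -= eta * grad / (np.sqrt(sq_grad_sum) + eps)
    return theta
```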


slide-66
SLIDE 66

mathematical view of preconditioning – if time
