COMPSCI 514: Algorithms for Data Science



  1. compsci 514: algorithms for data science
     Cameron Musco, University of Massachusetts Amherst. Spring 2020. Lecture 24 (Final Lecture!)

  2. logistics
     • Problem Set 4 is due Sunday 5/3 at 8pm.
     • Exam is at 2pm on May 6th. Open note, similar to the midterm.
     • Exam review guide and practice problems have been posted under the schedule tab on the course page.
     • I will hold usual office hours today and exam review office hours this Thursday and next Tuesday during the regular class time, 11:30am-12:45pm.
     • Regular SRTIs are suspended this semester. But I am holding an optional SRTI for this class and would really appreciate your feedback: http://owl.umass.edu/partners/courseEvalSurvey/uma/

  3. summary
     Last Class:
     • Analysis of gradient descent for optimizing convex functions.
     • (The same) analysis of projected gradient descent for optimizing under (convex) constraints.
     • Convex sets and projection functions.
     This Class:
     • Online learning, regret, and online gradient descent.
     • Application to analysis of stochastic gradient descent (if time).
     • Course summary/wrap-up.

  4. online gradient descent
     In reality many learning problems are online:
     • Websites optimize ads or recommendations to show users, given continuous feedback from these users.
     • Spam filters are incrementally updated and adapt as they see more examples of spam over time.
     • Face recognition systems and other classification systems learn from mistakes over time.
     Want to minimize some global loss $L(\theta, X) = \sum_{i=1}^n \ell(\theta, x_i)$ when data points are presented in an online fashion $x_1, x_2, \dots, x_n$ (like in streaming algorithms). Stochastic gradient descent is a special case: data points are considered in a random order, for computational reasons.

  5. online optimization formal setup
     Online Optimization: In place of a single function $f$, we see a different objective function at each step: $f_1, \dots, f_t : \mathbb{R}^d \to \mathbb{R}$.
     • At each step, first pick (play) a parameter vector $\theta^{(i)}$.
     • Then are told $f_i$ and incur cost $f_i(\theta^{(i)})$.
     • Goal: minimize total cost $\sum_{i=1}^t f_i(\theta^{(i)})$.
     No assumptions on how $f_1, \dots, f_t$ are related to each other!

  6. online optimization example
     UI design via online optimization.
     • Parameter vector $\theta^{(i)}$: some encoding of the layout at step $i$.
     • Functions $f_1, \dots, f_t$: $f_i(\theta^{(i)}) = 1$ if the user does not click 'add to cart' and $f_i(\theta^{(i)}) = 0$ if they do click.
     • Want to maximize the number of purchases, i.e., minimize $\sum_{i=1}^t f_i(\theta^{(i)})$.

  7. online optimization example
     Home pricing tools.
     • Parameter vector $\theta^{(i)}$: coefficients of the linear model at step $i$.
     • Functions $f_1, \dots, f_t$: $f_i(\theta^{(i)}) = (\langle \theta^{(i)}, x_i \rangle - \mathrm{price}_i)^2$, revealed when home $i$ is listed or sold.
     • Want to minimize total squared error $\sum_{i=1}^t f_i(\theta^{(i)})$ (same as classic least squares regression).
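For concreteness, the gradient that the online updates later in the lecture would use for this loss is (my derivation from the squared loss above, not stated on the slide):
$$\nabla f_i(\theta) = 2\left(\langle \theta, x_i \rangle - \mathrm{price}_i\right) x_i.$$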

  8. regret
     In normal optimization, we seek $\hat{\theta}$ satisfying $f(\hat{\theta}) \le \min_{\theta} f(\theta) + \epsilon$. In online optimization we will ask for the same:
     $$\sum_{i=1}^t f_i(\theta^{(i)}) \le \min_{\theta} \sum_{i=1}^t f_i(\theta) + \epsilon = \sum_{i=1}^t f_i(\theta_{\mathrm{off}}) + \epsilon.$$
     $\epsilon$ is called the regret.
     • Comparing the online solution to the best fixed solution in hindsight.
     • This error metric is a bit 'unfair'. Why? $\epsilon$ can be negative!

  9. intuition check
     How small can the regret $\epsilon$ be, where $\sum_{i=1}^t f_i(\theta^{(i)}) \le \sum_{i=1}^t f_i(\theta_{\mathrm{off}}) + \epsilon$?
     • What if, for $i = 1, \dots, t$, $f_i(\theta) = |\theta - 1000|$ or $f_i(\theta) = |\theta + 1000|$ in an alternating pattern? In no particular pattern?
     How can any online learning algorithm hope to achieve small regret?

  10. online gradient descent
     Online Gradient Descent. Assume that:
     • $f_1, \dots, f_t$ are all convex.
     • Each $f_i$ is $G$-Lipschitz (i.e., $\|\nabla f_i(\theta)\|_2 \le G$ for all $\theta$).
     • $\|\theta^{(1)} - \theta_{\mathrm{off}}\|_2 \le R$, where $\theta^{(1)}$ is the first vector chosen.
     The algorithm (see the sketch after this slide):
     • Set step size $\eta = \frac{R}{G\sqrt{t}}$.
     • For $i = 1, \dots, t$:
       • Play $\theta^{(i)}$ and incur cost $f_i(\theta^{(i)})$.
       • $\theta^{(i+1)} = \theta^{(i)} - \eta \cdot \nabla f_i(\theta^{(i)})$.
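The update rule translates directly into code. Below is a minimal Python sketch of the OGD loop, assuming the per-step gradients arrive as callables; the names `grad_fns` and `theta_1` are illustrative, not from the slides.

```python
import numpy as np

def online_gradient_descent(grad_fns, theta_1, R, G):
    """Online gradient descent: commit to theta^(i) first, then observe
    f_i and step against its gradient.

    grad_fns -- list of t callables; grad_fns[i](theta) returns the
                gradient of f_i at theta (revealed after theta^(i) is played).
    theta_1  -- starting point, assumed within distance R of theta_off.
    """
    t = len(grad_fns)
    eta = R / (G * np.sqrt(t))  # fixed step size eta = R / (G * sqrt(t))
    theta = np.asarray(theta_1, dtype=float).copy()
    plays = []
    for grad_f in grad_fns:
        plays.append(theta.copy())           # play theta^(i); cost f_i(theta^(i)) is incurred here
        theta = theta - eta * grad_f(theta)  # theta^(i+1) = theta^(i) - eta * grad f_i(theta^(i))
    return plays
```

For the home-pricing losses of slide 7, `grad_fns[i]` would be `lambda theta: 2 * (theta @ x_i - price_i) * x_i`.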

  11. online gradient descent analysis
     Theorem (OGD on Convex Lipschitz Functions): For convex $G$-Lipschitz $f_1, \dots, f_t$, OGD initialized with starting point $\theta^{(1)}$ within radius $R$ of $\theta_{\mathrm{off}}$, using step size $\eta = \frac{R}{G\sqrt{t}}$, has regret bounded by:
     $$\sum_{i=1}^t f_i(\theta^{(i)}) - \sum_{i=1}^t f_i(\theta_{\mathrm{off}}) \le RG\sqrt{t}.$$
     Average regret goes to 0 as $t \to \infty$. No assumptions on $f_1, \dots, f_t$!
     Step 1.1: For all $i$,
     $$\nabla f_i(\theta^{(i)})^\top (\theta^{(i)} - \theta_{\mathrm{off}}) \le \frac{\|\theta^{(i)} - \theta_{\mathrm{off}}\|_2^2 - \|\theta^{(i+1)} - \theta_{\mathrm{off}}\|_2^2}{2\eta} + \frac{\eta G^2}{2}.$$
     Convexity then gives Step 1: For all $i$,
     $$f_i(\theta^{(i)}) - f_i(\theta_{\mathrm{off}}) \le \frac{\|\theta^{(i)} - \theta_{\mathrm{off}}\|_2^2 - \|\theta^{(i+1)} - \theta_{\mathrm{off}}\|_2^2}{2\eta} + \frac{\eta G^2}{2}.$$
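The slide states Step 1.1 without the intermediate algebra; it follows by expanding the squared distance after one update (a standard derivation, reconstructed here rather than taken from the deck):
$$\|\theta^{(i+1)} - \theta_{\mathrm{off}}\|_2^2 = \|\theta^{(i)} - \eta \nabla f_i(\theta^{(i)}) - \theta_{\mathrm{off}}\|_2^2 = \|\theta^{(i)} - \theta_{\mathrm{off}}\|_2^2 - 2\eta\, \nabla f_i(\theta^{(i)})^\top (\theta^{(i)} - \theta_{\mathrm{off}}) + \eta^2 \|\nabla f_i(\theta^{(i)})\|_2^2.$$
Rearranging and bounding $\|\nabla f_i(\theta^{(i)})\|_2 \le G$ yields Step 1.1; the convexity inequality $f_i(\theta_{\mathrm{off}}) \ge f_i(\theta^{(i)}) + \nabla f_i(\theta^{(i)})^\top (\theta_{\mathrm{off}} - \theta^{(i)})$ then yields Step 1.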

  12. online gradient descent analysis
     Theorem (OGD on Convex Lipschitz Functions): For convex $G$-Lipschitz $f_1, \dots, f_t$, OGD initialized with starting point $\theta^{(1)}$ within radius $R$ of $\theta_{\mathrm{off}}$, using step size $\eta = \frac{R}{G\sqrt{t}}$, has regret bounded by:
     $$\sum_{i=1}^t f_i(\theta^{(i)}) - \sum_{i=1}^t f_i(\theta_{\mathrm{off}}) \le RG\sqrt{t}.$$
     Step 1 (for all $i$), summed over $i = 1, \dots, t$, gives:
     $$\sum_{i=1}^t f_i(\theta^{(i)}) - \sum_{i=1}^t f_i(\theta_{\mathrm{off}}) \le \sum_{i=1}^t \frac{\|\theta^{(i)} - \theta_{\mathrm{off}}\|_2^2 - \|\theta^{(i+1)} - \theta_{\mathrm{off}}\|_2^2}{2\eta} + \frac{t \eta G^2}{2}.$$
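The deck stops at the summed bound; finishing the proof takes two standard steps (my reconstruction, consistent with the theorem statement). The sum telescopes:
$$\sum_{i=1}^t \frac{\|\theta^{(i)} - \theta_{\mathrm{off}}\|_2^2 - \|\theta^{(i+1)} - \theta_{\mathrm{off}}\|_2^2}{2\eta} = \frac{\|\theta^{(1)} - \theta_{\mathrm{off}}\|_2^2 - \|\theta^{(t+1)} - \theta_{\mathrm{off}}\|_2^2}{2\eta} \le \frac{R^2}{2\eta},$$
so the regret is at most $\frac{R^2}{2\eta} + \frac{t \eta G^2}{2}$. Setting $\eta = \frac{R}{G\sqrt{t}}$ makes both terms equal to $\frac{RG\sqrt{t}}{2}$, giving the claimed $RG\sqrt{t}$ bound.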

  13. stochastic gradient descent
     Stochastic gradient descent is an efficient offline optimization method, seeking $\hat{\theta}$ with
     $$f(\hat{\theta}) \le \min_{\theta} f(\theta) + \epsilon = f(\theta^*) + \epsilon.$$
     • The most popular optimization method in modern machine learning.
     • Easily analyzed as a special case of online gradient descent!

  14. stochastic gradient descent
     Assume that:
     • $f$ is convex and decomposable as $f(\theta) = \sum_{j=1}^n f_j(\theta)$. E.g., $L(\theta, X) = \sum_{j=1}^n \ell(\theta, x_j)$.
     • Each $f_j$ is $\frac{G}{n}$-Lipschitz (i.e., $\|\nabla f_j(\theta)\|_2 \le \frac{G}{n}$ for all $\theta$). What does this imply about how Lipschitz $f$ is?
     • Initialize with $\theta^{(1)}$ satisfying $\|\theta^{(1)} - \theta^*\|_2 \le R$.
     Stochastic Gradient Descent (see the sketch after this slide):
     • Set step size $\eta = \frac{Rn}{G\sqrt{t}}$ (the OGD step size for $\frac{G}{n}$-Lipschitz functions).
     • For $i = 1, \dots, t$:
       • Pick random $j_i \in \{1, \dots, n\}$.
       • $\theta^{(i+1)} = \theta^{(i)} - \eta \cdot \nabla f_{j_i}(\theta^{(i)})$.
     • Return $\hat{\theta} = \frac{1}{t} \sum_{i=1}^t \theta^{(i)}$.
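Following the same pattern as the OGD sketch above, here is a minimal Python sketch of this SGD variant with iterate averaging; `component_grads` (one gradient callable per $f_j$) is an illustrative name, not from the slides.

```python
import numpy as np

def sgd_average_iterate(component_grads, theta_1, R, G, t, seed=0):
    """SGD as a special case of OGD: at each step, sample a random
    component f_{j_i}, step against its gradient, and return the
    average of the iterates.

    component_grads -- list of n callables; component_grads[j](theta)
                       returns grad f_j(theta), each f_j (G/n)-Lipschitz.
    """
    rng = np.random.default_rng(seed)
    n = len(component_grads)
    eta = R * n / (G * np.sqrt(t))  # OGD step size with Lipschitz bound G/n
    theta = np.asarray(theta_1, dtype=float).copy()
    total = np.zeros_like(theta)
    for _ in range(t):
        total += theta                 # accumulate theta^(i) for the average
        j = rng.integers(n)            # pick j_i uniformly from {1, ..., n}
        theta = theta - eta * component_grads[j](theta)
    return total / t                   # theta_hat = (1/t) * sum_i theta^(i)
```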

  15. stochastic gradient descent in expectation
     $$\theta^{(i+1)} = \theta^{(i)} - \eta \cdot \nabla f_{j_i}(\theta^{(i)}) \quad \text{vs.} \quad \theta^{(i+1)} = \theta^{(i)} - \eta \cdot \nabla f(\theta^{(i)})$$
     Note that: $\mathbb{E}[\nabla f_{j_i}(\theta^{(i)})] = \frac{1}{n} \nabla f(\theta^{(i)})$.
     The analysis extends to any algorithm that takes the gradient step in expectation (batch GD, randomly quantized, measurement noise, differentially private, etc.).
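The unbiasedness claim is one line once the expectation over the uniform choice of $j_i$ is written out (an expansion the slide leaves implicit):
$$\mathbb{E}[\nabla f_{j_i}(\theta)] = \sum_{j=1}^n \Pr[j_i = j] \cdot \nabla f_j(\theta) = \frac{1}{n} \sum_{j=1}^n \nabla f_j(\theta) = \frac{1}{n} \nabla f(\theta).$$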

  16. test of intuition
     A sum of convex functions is always convex (good exercise). What does $f_1(\theta) + f_2(\theta) + f_3(\theta)$ look like?
     [Figure: plots of three convex functions $f_1$, $f_2$, $f_3$ over $\theta \in [-10, 10]$, alongside their sum.]
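For the exercise, the proof is immediate from the definition (my sketch, not from the deck): for convex $f$ and $g$ and any $\lambda \in [0, 1]$,
$$(f+g)(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y) + \lambda g(x) + (1-\lambda) g(y) = \lambda (f+g)(x) + (1-\lambda)(f+g)(y),$$
and induction extends this to any finite sum.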

  17. stochastic gradient descent analysis
     Theorem (SGD on Convex Lipschitz Functions): SGD run with $t \ge \frac{R^2 G^2}{\epsilon^2}$ iterations, step size $\eta = \frac{Rn}{G\sqrt{t}}$, and starting point within radius $R$ of $\theta^*$, outputs $\hat{\theta}$ satisfying:
     $$\mathbb{E}[f(\hat{\theta})] \le f(\theta^*) + \epsilon.$$
     Step 1: $f(\hat{\theta}) - f(\theta^*) \le \frac{1}{t} \sum_{i=1}^t [f(\theta^{(i)}) - f(\theta^*)]$ (convexity, since $\hat{\theta}$ is the average iterate).
     Step 2: $\mathbb{E}[f(\hat{\theta}) - f(\theta^*)] \le \frac{n}{t} \cdot \mathbb{E}\left[\sum_{i=1}^t [f_{j_i}(\theta^{(i)}) - f_{j_i}(\theta^*)]\right]$.
     Step 3: $\mathbb{E}[f(\hat{\theta}) - f(\theta^*)] \le \frac{n}{t} \cdot \mathbb{E}\left[\sum_{i=1}^t [f_{j_i}(\theta^{(i)}) - f_{j_i}(\theta_{\mathrm{off}})]\right]$ (since $\theta_{\mathrm{off}}$ minimizes $\sum_i f_{j_i}$, the bound only grows).
     Step 4 (OGD bound): $\mathbb{E}[f(\hat{\theta}) - f(\theta^*)] \le \frac{n}{t} \cdot R \cdot \frac{G}{n} \cdot \sqrt{t} = \frac{RG}{\sqrt{t}}$.
     With $t \ge \frac{R^2 G^2}{\epsilon^2}$, this is at most $\epsilon$.
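The jump from Step 1 to Step 2 uses the same unbiasedness fact as on slide 15 (spelled out here as a reconstruction): $j_i$ is chosen independently of $\theta^{(i)}$, so conditioning on $\theta^{(i)}$ gives $\mathbb{E}[f_{j_i}(\theta^{(i)})] = \frac{1}{n} \mathbb{E}[f(\theta^{(i)})]$ and $\mathbb{E}[f_{j_i}(\theta^*)] = \frac{1}{n} f(\theta^*)$, hence for each $i$:
$$\mathbb{E}\big[f(\theta^{(i)}) - f(\theta^*)\big] = n \cdot \mathbb{E}\big[f_{j_i}(\theta^{(i)}) - f_{j_i}(\theta^*)\big].$$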

  18. sgd vs. gd
     Stochastic gradient descent generally requires more iterations than gradient descent, but each iteration is much cheaper (by a factor of $n$): GD computes the full gradient, SGD a single component gradient:
     $$\nabla f(\theta) = \sum_{j=1}^n \nabla f_j(\theta) \quad \text{vs.} \quad \nabla f_j(\theta).$$

  19. sgd vs. gd
     When would this bound be tight? When $f(\theta) = \sum_{j=1}^n f_j(\theta)$ and $\|\nabla f_j(\theta)\|_2 \le \frac{G}{n}$:
     Theorem (SGD): After $t \ge \frac{R^2 G^2}{\epsilon^2}$ iterations, outputs $\hat{\theta}$ satisfying $\mathbb{E}[f(\hat{\theta})] \le f(\theta^*) + \epsilon$.
     Theorem (GD): When $\|\nabla f(\theta)\|_2 \le \bar{G}$, after $t \ge \frac{R^2 \bar{G}^2}{\epsilon^2}$ iterations, outputs $\hat{\theta}$ satisfying $f(\hat{\theta}) \le f(\theta^*) + \epsilon$.
     By the triangle inequality,
     $$\|\nabla f(\theta)\|_2 = \|\nabla f_1(\theta) + \dots + \nabla f_n(\theta)\|_2 \le \sum_{j=1}^n \|\nabla f_j(\theta)\|_2 \le n \cdot \frac{G}{n} = G,$$
     so $\bar{G} \le G$: the two iteration bounds match only when the component gradients all point in (nearly) the same direction, and GD's bound is much better when they partially cancel.
