compsci 514: algorithms for data science


  1. compsci 514: algorithms for data science. Cameron Musco, University of Massachusetts Amherst. Fall 2019. Lecture 19.

  2. logistics
     • Problem Set 3 on Spectral Methods due this Friday at 8pm.
     • Can turn in without penalty until Sunday at 11:59pm.

  3-4. summary
     Last Class:
     • Intro to continuous optimization.
     • Multivariable calculus review.
     • Intro to Gradient Descent.
     This Class:
     • Analysis of gradient descent for optimizing convex functions.
     • Analysis of projected gradient descent for optimizing under constraints.

  5-6. gradient descent motivation
     Gradient descent greedy motivation: At each step, make a small change to θ⃗^(i−1) to give θ⃗^(i), with minimum value of f(θ⃗^(i)).
     When the step size is small, this is approximately optimized by stepping in the opposite direction of the gradient.
     Gradient descent step: θ⃗^(i) = θ⃗^(i−1) − η · ∇⃗f(θ⃗^(i−1)).
     Pseudocode:
     • Choose some initialization θ⃗^(0).
     • For i = 1, …, t: θ⃗^(i) = θ⃗^(i−1) − η · ∇⃗f(θ⃗^(i−1))
     • Return θ⃗^(t), as an approximate minimizer of f(θ⃗).
     Step size η is chosen ahead of time or adapted during the algorithm.
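A minimal NumPy sketch of this loop, assuming a simple convex objective (least squares on synthetic data) and a hand-picked constant step size; the function, data, and η below are illustrative choices, not part of the slides:

```python
import numpy as np

# Illustrative convex objective: f(theta) = ||A theta - b||_2^2,
# with gradient 2 A^T (A theta - b). A, b, and eta are made-up example values.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

def f(theta):
    return np.sum((A @ theta - b) ** 2)

def grad_f(theta):
    return 2 * A.T @ (A @ theta - b)

def gradient_descent(theta0, eta, t):
    """Run t gradient descent steps: theta_i = theta_{i-1} - eta * grad_f(theta_{i-1})."""
    theta = theta0.copy()
    for _ in range(t):
        theta = theta - eta * grad_f(theta)
    return theta  # approximate minimizer of f

theta_hat = gradient_descent(np.zeros(5), eta=1e-3, t=500)
print(f(theta_hat))
```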

  7. Gradient Descent Update: θ⃗^(i) = θ⃗^(i−1) − η · ∇⃗f(θ⃗^(i−1))

  8-9. convexity
     Definition – Convex Function: A function f : R^d → R is convex if and only if, for any θ⃗₁, θ⃗₂ ∈ R^d and λ ∈ [0, 1]:
     (1 − λ) · f(θ⃗₁) + λ · f(θ⃗₂) ≥ f((1 − λ) · θ⃗₁ + λ · θ⃗₂)
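As a concrete sanity check of this definition, the short sketch below evaluates the chord inequality for the convex function f(θ⃗) = ∥θ⃗∥₂² at random points; the specific function and points are illustrative, not from the slides:

```python
import numpy as np

def f(theta):
    # A convex function: f(theta) = ||theta||_2^2
    return np.dot(theta, theta)

rng = np.random.default_rng(1)
theta1, theta2 = rng.standard_normal(4), rng.standard_normal(4)
for lam in np.linspace(0, 1, 11):
    lhs = (1 - lam) * f(theta1) + lam * f(theta2)   # value on the chord
    rhs = f((1 - lam) * theta1 + lam * theta2)      # value of f on the segment
    assert lhs >= rhs - 1e-12                       # convexity: the chord lies above the function
```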

  10. convexity
     Corollary – Convex Function: A function f : R^d → R is convex if and only if, for any θ⃗₁, θ⃗₂ ∈ R^d:
     f(θ⃗₂) − f(θ⃗₁) ≥ ∇⃗f(θ⃗₁)^T (θ⃗₂ − θ⃗₁)
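The same toy function can be used to check this first-order characterization numerically, i.e. that f lies above its tangent planes; again an illustrative check, not part of the slides:

```python
import numpy as np

def f(theta):
    return np.dot(theta, theta)   # f(theta) = ||theta||_2^2

def grad_f(theta):
    return 2 * theta              # its gradient

rng = np.random.default_rng(2)
for _ in range(1000):
    theta1, theta2 = rng.standard_normal(4), rng.standard_normal(4)
    # Convexity, first-order form: f(theta2) - f(theta1) >= grad_f(theta1)^T (theta2 - theta1)
    assert f(theta2) - f(theta1) >= grad_f(theta1) @ (theta2 - theta1) - 1e-12
```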

  11-12. other assumptions
     We will also assume that f(·) is 'well-behaved' in some way.
     • Lipschitz (size of gradient is bounded): For all θ⃗ and some G, ∥∇⃗f(θ⃗)∥₂ ≤ G.
     • Smooth (direction/size of gradient is not changing too quickly): For all θ⃗₁, θ⃗₂ and some β, ∥∇⃗f(θ⃗₁) − ∇⃗f(θ⃗₂)∥₂ ≤ β · ∥θ⃗₁ − θ⃗₂∥₂.
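For intuition, one can check these constants for a concrete function. The sketch below (illustrative; the function, region, and sampling are not from the slides) does so for f(θ⃗) = ∥Aθ⃗ − b⃗∥₂², whose gradient is 2Aᵀ(Aθ⃗ − b⃗): its gradient norm is bounded only over a bounded region, while β = 2∥AᵀA∥₂ is an exact smoothness constant.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)

def grad_f(theta):
    # Gradient of f(theta) = ||A theta - b||_2^2
    return 2 * A.T @ (A @ theta - b)

beta = 2 * np.linalg.norm(A.T @ A, ord=2)   # exact smoothness constant for this f

G_est = 0.0
for _ in range(1000):
    # Sample points in the box [-10, 10]^5 and check both assumptions empirically.
    t1, t2 = rng.uniform(-10, 10, 5), rng.uniform(-10, 10, 5)
    G_est = max(G_est, np.linalg.norm(grad_f(t1)))   # Lipschitz: ||grad f|| <= G on this region
    assert np.linalg.norm(grad_f(t1) - grad_f(t2)) <= beta * np.linalg.norm(t1 - t2) + 1e-9  # smoothness
print("empirical G on region:", G_est, " beta:", beta)
```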

  13. lipschitz assumption (illustration on slide)

  14. gd analysis – convex functions
     Gradient Descent:
     • Choose some initialization θ⃗₀ and set η = R/(G√t).
     • For i = 1, …, t: θ⃗ᵢ = θ⃗ᵢ₋₁ − η · ∇⃗f(θ⃗ᵢ₋₁)
     • Return θ̂ = arg min over θ⃗₀, …, θ⃗ₜ of f(θ⃗ᵢ).
     Assume that:
     • f is convex.
     • f is G-Lipschitz (i.e., ∥∇⃗f(θ⃗)∥₂ ≤ G for all θ⃗).
     • ∥θ⃗₀ − θ⃗*∥₂ ≤ R, where θ⃗₀ is the initialization point.
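A minimal sketch of this particular variant, assuming f and grad_f are defined as in the earlier sketch: the step size is set to η = R/(G√t), and the iterate with the smallest f value (rather than the last iterate) is returned. R, G, and the values in the example call are illustrative assumptions.

```python
import numpy as np

def gd_best_iterate(f, grad_f, theta0, R, G, t):
    """GD with step size eta = R / (G * sqrt(t)); returns the iterate with smallest f value."""
    eta = R / (G * np.sqrt(t))
    theta = np.array(theta0, dtype=float)
    best_theta, best_val = theta.copy(), f(theta)
    for _ in range(t):
        theta = theta - eta * grad_f(theta)
        val = f(theta)
        if val < best_val:                 # theta_hat = argmin over theta_0, ..., theta_t of f
            best_theta, best_val = theta.copy(), val
    return best_theta

# Example call with made-up constants:
# theta_hat = gd_best_iterate(f, grad_f, np.zeros(5), R=10.0, G=200.0, t=10_000)
```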

  15-17. gd analysis proof
     Theorem – GD on Convex Lipschitz Functions: For a convex G-Lipschitz function f, GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ*, outputs θ̂ satisfying: f(θ̂) ≤ f(θ*) + ϵ.
     Step 1: For all i, f(θᵢ) − f(θ*) ≤ (∥θᵢ − θ*∥₂² − ∥θᵢ₊₁ − θ*∥₂²)/(2η) + ηG²/2.
     Step 1 is first illustrated visually, then proved formally.

  18-19. gd analysis proof
     Theorem – GD on Convex Lipschitz Functions: For a convex G-Lipschitz function f, GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ*, outputs θ̂ satisfying: f(θ̂) ≤ f(θ*) + ϵ.
     Step 1: For all i, f(θᵢ) − f(θ*) ≤ (∥θᵢ − θ*∥₂² − ∥θᵢ₊₁ − θ*∥₂²)/(2η) + ηG²/2.
     Step 1.1: ∇⃗f(θᵢ)^T (θᵢ − θ*) ≤ (∥θᵢ − θ*∥₂² − ∥θᵢ₊₁ − θ*∥₂²)/(2η) + ηG²/2  ⇒  Step 1.
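The slides carry this argument out in figures; the following is a hedged reconstruction of the standard derivation. Expanding the GD update θᵢ₊₁ = θᵢ − η∇⃗f(θᵢ) gives Step 1.1, and the convexity corollary converts it into Step 1.

```latex
% Expand the squared distance after one GD step:
\begin{align*}
\|\theta_{i+1}-\theta^*\|_2^2
 &= \|\theta_i - \eta \nabla f(\theta_i) - \theta^*\|_2^2\\
 &= \|\theta_i-\theta^*\|_2^2 - 2\eta\,\nabla f(\theta_i)^\top(\theta_i-\theta^*) + \eta^2\|\nabla f(\theta_i)\|_2^2\\
 &\le \|\theta_i-\theta^*\|_2^2 - 2\eta\,\nabla f(\theta_i)^\top(\theta_i-\theta^*) + \eta^2 G^2
 \qquad \text{($G$-Lipschitz assumption).}
\end{align*}
% Rearranging yields Step 1.1:
\[
\nabla f(\theta_i)^\top(\theta_i-\theta^*)
 \le \frac{\|\theta_i-\theta^*\|_2^2-\|\theta_{i+1}-\theta^*\|_2^2}{2\eta} + \frac{\eta G^2}{2}.
\]
% By the convexity corollary (with theta_1 = theta_i, theta_2 = theta^*),
% f(theta_i) - f(theta^*) <= grad f(theta_i)^T (theta_i - theta^*),
% so the same bound holds for f(theta_i) - f(theta^*), which is exactly Step 1.
```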

  20-22. gd analysis proof
     Theorem – GD on Convex Lipschitz Functions: For a convex G-Lipschitz function f, GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ*, outputs θ̂ satisfying: f(θ̂) ≤ f(θ*) + ϵ.
     Step 1: For all i, f(θᵢ) − f(θ*) ≤ (∥θᵢ − θ*∥₂² − ∥θᵢ₊₁ − θ*∥₂²)/(2η) + ηG²/2.
     Step 2: (1/t) ∑_{i=1}^{t} [f(θᵢ) − f(θ*)] ≤ R²/(2ηt) + ηG²/2.
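Again a hedged reconstruction of the standard argument behind Step 2 and the final bound: summing Step 1 over the t iterations telescopes the distance terms, leaving only the squared distance of the starting iterate to θ* (at most R², up to how the initialization is indexed), and the choices of η and t finish the proof.

```latex
% Sum Step 1 over i = 1, ..., t; the distance terms telescope:
\[
\frac{1}{t}\sum_{i=1}^{t}\bigl[f(\theta_i)-f(\theta^*)\bigr]
 \;\le\; \frac{\|\theta_1-\theta^*\|_2^2-\|\theta_{t+1}-\theta^*\|_2^2}{2\eta t} + \frac{\eta G^2}{2}
 \;\le\; \frac{R^2}{2\eta t} + \frac{\eta G^2}{2}.
\]
% Plug in eta = R / (G sqrt(t)):
\[
\frac{R^2}{2\eta t} + \frac{\eta G^2}{2}
 = \frac{RG}{2\sqrt{t}} + \frac{RG}{2\sqrt{t}}
 = \frac{RG}{\sqrt{t}} \;\le\; \epsilon
 \quad\text{whenever } t \ge \frac{R^2 G^2}{\epsilon^2}.
\]
% Since hat(theta) is the iterate with the smallest f value, f(hat(theta)) - f(theta^*) is at most
% the average above, giving f(hat(theta)) <= f(theta^*) + epsilon.
```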

  23-25. constrained convex optimization
     Often want to perform convex optimization with convex constraints:
     θ* = arg min_{θ ∈ S} f(θ), where S is a convex set.
     Definition – Convex Set: A set S ⊆ R^d is convex if and only if, for any θ⃗₁, θ⃗₂ ∈ S and λ ∈ [0, 1]: (1 − λ) · θ⃗₁ + λ · θ⃗₂ ∈ S.
     E.g. S = {θ⃗ ∈ R^d : ∥θ⃗∥₂ ≤ 1}.

  26. projected gradient descent
     For any convex set S, let P_S(·) denote the projection function onto S:
     • P_S(y⃗) = arg min_{θ⃗ ∈ S} ∥θ⃗ − y⃗∥₂.
     • For S = {θ⃗ ∈ R^d : ∥θ⃗∥₂ ≤ 1}, what is P_S(y⃗)?
     • For S being a k-dimensional subspace of R^d, what is P_S(y⃗)?
     Projected Gradient Descent:
     • Choose some initialization θ⃗₀ and set η = R/(G√t).
     • For i = 1, …, t:
       • θ⃗ᵢ^out = θ⃗ᵢ₋₁ − η · ∇⃗f(θ⃗ᵢ₋₁)
       • θ⃗ᵢ = P_S(θ⃗ᵢ^out)
     • Return θ̂ = arg minᵢ f(θ⃗ᵢ).
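A minimal sketch of projected gradient descent, assuming f and grad_f as in the earlier sketches. The two projection formulas are the standard answers to the questions above (rescale onto the unit ball; V Vᵀ y⃗ for a subspace with orthonormal basis V), while the parameter values in the example call are illustrative.

```python
import numpy as np

def project_unit_ball(y):
    # P_S(y) for S = {theta : ||theta||_2 <= 1}: scale y back onto the ball if it lies outside.
    norm = np.linalg.norm(y)
    return y if norm <= 1 else y / norm

def project_subspace(y, V):
    # P_S(y) for S = span of the orthonormal columns of V: P_S(y) = V V^T y.
    return V @ (V.T @ y)

def projected_gd(f, grad_f, project, theta0, R, G, t):
    """Projected GD: take a gradient step, then project back onto the convex set S."""
    eta = R / (G * np.sqrt(t))
    theta = np.array(theta0, dtype=float)
    best_theta, best_val = theta.copy(), f(theta)
    for _ in range(t):
        theta_out = theta - eta * grad_f(theta)   # unconstrained gradient step
        theta = project(theta_out)                # pull the iterate back into S
        if f(theta) < best_val:
            best_theta, best_val = theta.copy(), f(theta)
    return best_theta

# Example call with made-up constants:
# theta_hat = projected_gd(f, grad_f, project_unit_ball, np.zeros(5), R=2.0, G=100.0, t=5_000)
```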
