COMPSCI 514: Algorithms for Data Science


  1. COMPSCI 514: Algorithms for Data Science. Cameron Musco, University of Massachusetts Amherst. Spring 2020. Lecture 23.

  2. Summary.
     Last Class:
     • Multivariable calculus review and gradient computation.
     • Introduction to gradient descent. Motivation as a greedy algorithm.
     • Conditions under which we will analyze gradient descent: convexity and Lipschitzness.
     This Class:
     • Analysis of gradient descent for Lipschitz, convex functions.
     • Simple extension to projected gradient descent for constrained optimization.

  3. Convexity.
     Definition – Convex Function: A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if and only if, for any $\vec{\theta}_1, \vec{\theta}_2 \in \mathbb{R}^d$ and $\lambda \in [0, 1]$:
       $(1 - \lambda) \cdot f(\vec{\theta}_1) + \lambda \cdot f(\vec{\theta}_2) \ge f\big((1 - \lambda) \cdot \vec{\theta}_1 + \lambda \cdot \vec{\theta}_2\big)$
     Corollary – Convex Function: A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if and only if, for any $\vec{\theta}_1, \vec{\theta}_2 \in \mathbb{R}^d$:
       $f(\vec{\theta}_2) - f(\vec{\theta}_1) \ge \vec{\nabla} f(\vec{\theta}_1)^T (\vec{\theta}_2 - \vec{\theta}_1)$
     Definition – Lipschitz Function: A function $f : \mathbb{R}^d \to \mathbb{R}$ is $G$-Lipschitz if $\|\vec{\nabla} f(\vec{\theta})\|_2 \le G$ for all $\vec{\theta}$.
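
As an illustration (not from the slides), here is a quick numerical sanity check of these three properties, using the assumed example $f(\vec{\theta}) = \|\vec{\theta}\|_2$, which is convex and $1$-Lipschitz:

```python
import numpy as np

# Illustrative example: f(theta) = ||theta||_2 is convex and 1-Lipschitz
# (its gradient theta / ||theta||_2 always has norm 1).
f = lambda th: np.linalg.norm(th)
grad_f = lambda th: th / np.linalg.norm(th)

rng = np.random.default_rng(1)
for _ in range(1000):
    t1, t2 = rng.normal(size=4), rng.normal(size=4)
    lam = rng.uniform()
    # Definition: chords lie above the function.
    assert (1 - lam) * f(t1) + lam * f(t2) >= f((1 - lam) * t1 + lam * t2) - 1e-12
    # Corollary (first-order condition): the function lies above its tangent planes.
    assert f(t2) - f(t1) >= grad_f(t1) @ (t2 - t1) - 1e-12
    # G-Lipschitzness with G = 1.
    assert np.linalg.norm(grad_f(t1)) <= 1 + 1e-12
print("all checks passed")
```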

  4. GD Analysis – Convex Functions.
     Assume that:
     • $f$ is convex.
     • $f$ is $G$-Lipschitz.
     • $\|\vec{\theta}_1 - \vec{\theta}^*\|_2 \le R$, where $\vec{\theta}_1$ is the initialization point.
     Gradient Descent:
     • Choose some initialization $\vec{\theta}_1$ and set $\eta = \frac{R}{G\sqrt{t}}$.
     • For $i = 1, \ldots, t - 1$: $\vec{\theta}_{i+1} = \vec{\theta}_i - \eta \vec{\nabla} f(\vec{\theta}_i)$.
     • Return $\hat{\theta} = \arg\min_{\vec{\theta}_1, \ldots, \vec{\theta}_t} f(\vec{\theta}_i)$.
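
A minimal Python sketch of this procedure (illustrative; the test function, its gradient, and the parameter values below are assumptions for the demo, not taken from the lecture):

```python
import numpy as np

def gradient_descent(grad_f, f, theta_1, G, R, t):
    """Run t iterations of gradient descent with fixed step size
    eta = R / (G * sqrt(t)) and return the best iterate seen."""
    eta = R / (G * np.sqrt(t))
    theta = theta_1.copy()
    iterates = [theta.copy()]
    for _ in range(t - 1):
        theta = theta - eta * grad_f(theta)   # theta_{i+1} = theta_i - eta * grad f(theta_i)
        iterates.append(theta.copy())
    return min(iterates, key=f)               # arg min over theta_1, ..., theta_t of f

# Assumed demo: f(theta) = ||theta||_2 is convex and 1-Lipschitz, minimized at 0.
f = lambda th: np.linalg.norm(th)
grad_f = lambda th: th / np.linalg.norm(th) if np.linalg.norm(th) > 0 else np.zeros_like(th)
theta_hat = gradient_descent(grad_f, f, theta_1=np.array([3.0, -4.0]), G=1.0, R=5.0, t=2500)
print(f(theta_hat))  # should be close to the minimum value 0
```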

  5. GD Analysis Proof.
     Theorem – GD on Convex Lipschitz Functions: For convex $G$-Lipschitz function $f$, GD run with $t \ge \frac{R^2 G^2}{\epsilon^2}$ iterations, $\eta = \frac{R}{G\sqrt{t}}$, and starting point within radius $R$ of $\vec{\theta}^*$, outputs $\hat{\theta}$ satisfying:
       $f(\hat{\theta}) \le f(\vec{\theta}^*) + \epsilon$.
     Step 1: For all $i$, $f(\vec{\theta}_i) - f(\vec{\theta}^*) \le \frac{\|\vec{\theta}_i - \vec{\theta}^*\|_2^2 - \|\vec{\theta}_{i+1} - \vec{\theta}^*\|_2^2}{2\eta} + \frac{\eta G^2}{2}$.
     Visually: [figure on slide, omitted here]

  6. GD Analysis Proof.
     Theorem – GD on Convex Lipschitz Functions: For convex $G$-Lipschitz function $f$, GD run with $t \ge \frac{R^2 G^2}{\epsilon^2}$ iterations, $\eta = \frac{R}{G\sqrt{t}}$, and starting point within radius $R$ of $\vec{\theta}^*$, outputs $\hat{\theta}$ satisfying:
       $f(\hat{\theta}) \le f(\vec{\theta}^*) + \epsilon$.
     Step 1: For all $i$, $f(\vec{\theta}_i) - f(\vec{\theta}^*) \le \frac{\|\vec{\theta}_i - \vec{\theta}^*\|_2^2 - \|\vec{\theta}_{i+1} - \vec{\theta}^*\|_2^2}{2\eta} + \frac{\eta G^2}{2}$.
     Formally:

  7. GD Analysis Proof.
     Theorem – GD on Convex Lipschitz Functions: For convex $G$-Lipschitz function $f$, GD run with $t \ge \frac{R^2 G^2}{\epsilon^2}$ iterations, $\eta = \frac{R}{G\sqrt{t}}$, and starting point within radius $R$ of $\vec{\theta}^*$, outputs $\hat{\theta}$ satisfying:
       $f(\hat{\theta}) \le f(\vec{\theta}^*) + \epsilon$.
     Step 1: For all $i$, $f(\vec{\theta}_i) - f(\vec{\theta}^*) \le \frac{\|\vec{\theta}_i - \vec{\theta}^*\|_2^2 - \|\vec{\theta}_{i+1} - \vec{\theta}^*\|_2^2}{2\eta} + \frac{\eta G^2}{2}$.
     Step 1.1: $\vec{\nabla} f(\vec{\theta}_i)^T (\vec{\theta}_i - \vec{\theta}^*) \le \frac{\|\vec{\theta}_i - \vec{\theta}^*\|_2^2 - \|\vec{\theta}_{i+1} - \vec{\theta}^*\|_2^2}{2\eta} + \frac{\eta G^2}{2} \implies$ Step 1.
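
Filling in why Step 1.1 holds and why it implies Step 1 (a short derivation consistent with the definitions on slide 3, not verbatim from the slides):

```latex
% Expand the update \vec{\theta}_{i+1} = \vec{\theta}_i - \eta \vec{\nabla} f(\vec{\theta}_i):
\begin{align*}
\|\vec{\theta}_{i+1} - \vec{\theta}^*\|_2^2
  &= \|\vec{\theta}_i - \vec{\theta}^*\|_2^2
     - 2\eta\, \vec{\nabla} f(\vec{\theta}_i)^T(\vec{\theta}_i - \vec{\theta}^*)
     + \eta^2 \|\vec{\nabla} f(\vec{\theta}_i)\|_2^2.
\end{align*}
% Rearranging, dividing by 2*eta, and bounding ||grad f(theta_i)||_2 <= G (G-Lipschitzness)
% gives Step 1.1. Step 1.1 then implies Step 1 via the convexity corollary of slide 3,
% applied with theta_1 = theta_i and theta_2 = theta^*:
\begin{align*}
f(\vec{\theta}_i) - f(\vec{\theta}^*)
  \le \vec{\nabla} f(\vec{\theta}_i)^T(\vec{\theta}_i - \vec{\theta}^*).
\end{align*}
```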

  8. GD Analysis Proof.
     Theorem – GD on Convex Lipschitz Functions: For convex $G$-Lipschitz function $f$, GD run with $t \ge \frac{R^2 G^2}{\epsilon^2}$ iterations, $\eta = \frac{R}{G\sqrt{t}}$, and starting point within radius $R$ of $\vec{\theta}^*$, outputs $\hat{\theta}$ satisfying:
       $f(\hat{\theta}) \le f(\vec{\theta}^*) + \epsilon$.
     Step 1: For all $i$, $f(\vec{\theta}_i) - f(\vec{\theta}^*) \le \frac{\|\vec{\theta}_i - \vec{\theta}^*\|_2^2 - \|\vec{\theta}_{i+1} - \vec{\theta}^*\|_2^2}{2\eta} + \frac{\eta G^2}{2}$ $\implies$
     Step 2: $\frac{1}{t}\sum_{i=1}^{t}\big[f(\vec{\theta}_i) - f(\vec{\theta}^*)\big] \le \frac{R^2}{2\eta t} + \frac{\eta G^2}{2}$.
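
How Step 1 gives Step 2 (a short derivation filling in the averaging/telescoping argument, not verbatim from the slides):

```latex
% Average Step 1 over i = 1, ..., t; the squared-distance terms telescope.
\begin{align*}
\frac{1}{t}\sum_{i=1}^{t}\big[f(\vec{\theta}_i) - f(\vec{\theta}^*)\big]
  &\le \frac{1}{t}\sum_{i=1}^{t}\frac{\|\vec{\theta}_i - \vec{\theta}^*\|_2^2 - \|\vec{\theta}_{i+1} - \vec{\theta}^*\|_2^2}{2\eta} + \frac{\eta G^2}{2} \\
  &= \frac{\|\vec{\theta}_1 - \vec{\theta}^*\|_2^2 - \|\vec{\theta}_{t+1} - \vec{\theta}^*\|_2^2}{2\eta t} + \frac{\eta G^2}{2}
   \le \frac{R^2}{2\eta t} + \frac{\eta G^2}{2},
\end{align*}
% using ||theta_1 - theta^*||_2 <= R and dropping the subtracted (nonnegative) term.
```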

  9. GD Analysis Proof.
     Theorem – GD on Convex Lipschitz Functions: For convex $G$-Lipschitz function $f$, GD run with $t \ge \frac{R^2 G^2}{\epsilon^2}$ iterations, $\eta = \frac{R}{G\sqrt{t}}$, and starting point within radius $R$ of $\vec{\theta}^*$, outputs $\hat{\theta}$ satisfying:
       $f(\hat{\theta}) \le f(\vec{\theta}^*) + \epsilon$.
     Step 2: $\frac{1}{t}\sum_{i=1}^{t}\big[f(\vec{\theta}_i) - f(\vec{\theta}^*)\big] \le \frac{R^2}{2\eta t} + \frac{\eta G^2}{2}$.
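
Filling in how Step 2 yields the theorem (a short derivation consistent with the parameter choices above, not verbatim from the slides):

```latex
% Plug the step size \eta = R/(G\sqrt{t}) from the algorithm into the Step 2 bound:
\begin{align*}
\frac{R^2}{2\eta t} + \frac{\eta G^2}{2}
  = \frac{RG}{2\sqrt{t}} + \frac{RG}{2\sqrt{t}}
  = \frac{RG}{\sqrt{t}}
  \le \epsilon
  \qquad \text{whenever } t \ge \frac{R^2 G^2}{\epsilon^2}.
\end{align*}
% Since \hat{\theta} minimizes f over theta_1, ..., theta_t, f(\hat{\theta}) is at most the
% average (1/t) * sum_i f(theta_i), so f(\hat{\theta}) - f(\theta^*) <= epsilon.
```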

  10. Constrained Convex Optimization.
      Often we want to perform convex optimization with convex constraints:
        $\vec{\theta}^* = \arg\min_{\vec{\theta} \in S} f(\vec{\theta})$, where $S$ is a convex set.
      Definition – Convex Set: A set $S \subseteq \mathbb{R}^d$ is convex if and only if, for any $\vec{\theta}_1, \vec{\theta}_2 \in S$ and $\lambda \in [0, 1]$:
        $(1 - \lambda) \cdot \vec{\theta}_1 + \lambda \cdot \vec{\theta}_2 \in S$.
      E.g. $S = \{\vec{\theta} \in \mathbb{R}^d : \|\vec{\theta}\|_2 \le 1\}$.

  11. Projected Gradient Descent.
      For any convex set $S$, let $P_S(\cdot)$ denote the projection function onto $S$:
      • $P_S(\vec{y}) = \arg\min_{\vec{\theta} \in S} \|\vec{\theta} - \vec{y}\|_2$.
      • For $S = \{\vec{\theta} \in \mathbb{R}^d : \|\vec{\theta}\|_2 \le 1\}$, what is $P_S(\vec{y})$?
      • For $S$ being a $k$-dimensional subspace of $\mathbb{R}^d$, what is $P_S(\vec{y})$?
      Projected Gradient Descent:
      • Choose some initialization $\vec{\theta}_1$ and set $\eta = \frac{R}{G\sqrt{t}}$.
      • For $i = 1, \ldots, t - 1$:
        • $\vec{\theta}^{(out)}_{i+1} = \vec{\theta}_i - \eta \cdot \vec{\nabla} f(\vec{\theta}_i)$
        • $\vec{\theta}_{i+1} = P_S(\vec{\theta}^{(out)}_{i+1})$
      • Return $\hat{\theta} = \arg\min_{\vec{\theta}_i} f(\vec{\theta}_i)$.
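
A minimal Python sketch of projected gradient descent (illustrative; the constraint set, objective, and parameter values are assumptions for the demo, not from the lecture). For the unit ball, the projection simply rescales points that lie outside it:

```python
import numpy as np

def project_unit_ball(y):
    """Projection onto S = {theta : ||theta||_2 <= 1}: rescale y if it lies outside the ball."""
    return y / max(1.0, np.linalg.norm(y))

def projected_gradient_descent(grad_f, f, project, theta_1, G, R, t):
    """Projected GD: take a gradient step, then project back onto the constraint set S."""
    eta = R / (G * np.sqrt(t))
    theta = theta_1.copy()
    iterates = [theta.copy()]
    for _ in range(t - 1):
        theta_out = theta - eta * grad_f(theta)   # unconstrained gradient step
        theta = project(theta_out)                # theta_{i+1} = P_S(theta_out)
        iterates.append(theta.copy())
    return min(iterates, key=f)

# Assumed demo: minimize the convex, 2-Lipschitz linear function f(theta) = <c, theta>
# with c = (2, 0) over the unit ball; the constrained minimum is at (-1, 0).
c = np.array([2.0, 0.0])
f = lambda th: c @ th
grad_f = lambda th: c
theta_hat = projected_gradient_descent(grad_f, f, project_unit_ball,
                                       theta_1=np.array([1.0, 0.0]), G=2.0, R=2.0, t=1600)
print(theta_hat)  # should be close to (-1, 0)
```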

  12. Convex Projections.
      Projected gradient descent can be analyzed identically to gradient descent!
      Theorem – Projection to a Convex Set: For any convex set $S \subseteq \mathbb{R}^d$, $\vec{y} \in \mathbb{R}^d$, and $\vec{\theta} \in S$:
        $\|P_S(\vec{y}) - \vec{\theta}\|_2 \le \|\vec{y} - \vec{\theta}\|_2$.
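
As an illustration (not from the slides), a quick numerical check of this contraction property for the unit-ball projection used in the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    y = rng.normal(size=3) * 5                       # arbitrary point, possibly outside the ball
    theta = rng.normal(size=3)
    theta = theta / max(1.0, np.linalg.norm(theta))  # a point inside S = unit ball
    p_y = y / max(1.0, np.linalg.norm(y))            # P_S(y): projection onto the unit ball
    assert np.linalg.norm(p_y - theta) <= np.linalg.norm(y - theta) + 1e-12
print("projection never increases distance to any point in S")
```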

  13. Projected Gradient Descent Analysis.
      Theorem – Projected GD: For convex $G$-Lipschitz function $f$ and convex set $S$, Projected GD run with $t \ge \frac{R^2 G^2}{\epsilon^2}$ iterations, $\eta = \frac{R}{G\sqrt{t}}$, and starting point within radius $R$ of $\vec{\theta}^*$, outputs $\hat{\theta}$ satisfying:
        $f(\hat{\theta}) \le f(\vec{\theta}^*) + \epsilon = \min_{\vec{\theta} \in S} f(\vec{\theta}) + \epsilon$.
      Recall: $\vec{\theta}^{(out)}_{i+1} = \vec{\theta}_i - \eta \cdot \vec{\nabla} f(\vec{\theta}_i)$ and $\vec{\theta}_{i+1} = P_S(\vec{\theta}^{(out)}_{i+1})$.
      Step 1: For all $i$, $f(\vec{\theta}_i) - f(\vec{\theta}^*) \le \frac{\|\vec{\theta}_i - \vec{\theta}^*\|_2^2 - \|\vec{\theta}^{(out)}_{i+1} - \vec{\theta}^*\|_2^2}{2\eta} + \frac{\eta G^2}{2}$.
      Step 1.a: For all $i$, $f(\vec{\theta}_i) - f(\vec{\theta}^*) \le \frac{\|\vec{\theta}_i - \vec{\theta}^*\|_2^2 - \|\vec{\theta}_{i+1} - \vec{\theta}^*\|_2^2}{2\eta} + \frac{\eta G^2}{2}$.
      (Step 1 implies Step 1.a since, by the projection theorem on slide 12, $\|\vec{\theta}_{i+1} - \vec{\theta}^*\|_2 = \|P_S(\vec{\theta}^{(out)}_{i+1}) - \vec{\theta}^*\|_2 \le \|\vec{\theta}^{(out)}_{i+1} - \vec{\theta}^*\|_2$.)
      Step 2: $\frac{1}{t}\sum_{i=1}^{t}\big[f(\vec{\theta}_i) - f(\vec{\theta}^*)\big] \le \frac{R^2}{2\eta t} + \frac{\eta G^2}{2} \implies$ Theorem.
