
Zeroth-order Optimization in High Dimensions, Yining Wang, Carnegie Mellon University - PowerPoint PPT Presentation

  1. AISTATS 2018, Playa Blanca, Lanzarote, Spain. Zeroth-order Optimization in High Dimensions. Yining Wang, Carnegie Mellon University. Joint work with Simon Du, Sivaraman Balakrishnan and Aarti Singh.

  2. BACKGROUND
  ❖ Optimization: min_{x ∈ X} f(x)
  ❖ Classical setting (first-order):
    ✴ f is known (e.g., a likelihood function or an NN objective)
    ✴ ∇f(x) can be evaluated, or unbiasedly approximated
  ❖ Zeroth-order setting:
    ✴ f is unknown, or very complicated
    ✴ ∇f(x) is unknown, or very difficult to evaluate

  3. APPLICATIONS
  ❖ Hyper-parameter tuning
    ✴ f maps hyper-parameter x to system performance f(x).
  ❖ Experimental design
    ✴ f maps experimental setting to experimental results.
  ❖ Communication-efficient optimization
    ✴ Data defining the objective are scattered across machines
    ✴ Communicating ∇f(x) is expensive, but communicating f(x) is OK.

  4. FORMULATION
  ❖ Convexity: the objective f is convex.
  ❖ Noisy observation model: y_t = f(x_t) + ξ_t, with ξ_t i.i.d. ∼ N(0, σ²).
  ❖ Evaluation measures:
    ✴ Simple regret: f(x̂_{T+1}) − f*
    ✴ Cumulative regret: Σ_{t=1}^{T} [f(x_t) − f*]
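For concreteness, a minimal sketch of this observation model in Python (the quadratic objective, dimension, and noise level below are illustrative choices, not taken from the talk):

    import numpy as np

    rng = np.random.default_rng(0)
    d, sigma = 20, 0.1
    x_star = rng.normal(size=d)          # unknown minimizer of the toy objective
    f_star = 0.0                         # its optimal value

    def f(x):
        # a simple convex objective, used only for illustration
        return 0.5 * np.sum((x - x_star) ** 2)

    def noisy_oracle(x):
        # zeroth-order oracle: y_t = f(x_t) + xi_t with xi_t ~ N(0, sigma^2)
        return f(x) + sigma * rng.normal()

    def simple_regret(x_hat):
        return f(x_hat) - f_star

    def cumulative_regret(xs):
        return sum(f(x) - f_star for x in xs)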

  5. METHODS
  ❖ Classical method: Estimating Gradient Descent (EGD)
  ❖ Gradient descent / mirror descent:
    x_{t+1} ← x_t − η_t ĝ_t(x_t)
    x_{t+1} ∈ argmin_{z ∈ R^d} { η_t ⟨ĝ_t(x_t), z⟩ + Δ_ψ(z, x_t) }
  ❖ Estimating the gradient:
    ✴ ĝ_t(x_t) = (d/δ) · E[f(x_t + δ v_t) v_t]
    ✴ Gained popularity from (Nemirovski & Yudin '83, Flaxman et al. '05)

  6. METHODS
  ❖ Classical method: Estimating Gradient Descent (EGD)
  ❖ Gradient descent / mirror descent, with ĝ_t(x_t) ≈ ∇f(x_t):
    x_{t+1} ← x_t − η_t ĝ_t(x_t)
    x_{t+1} ∈ argmin_{z ∈ R^d} { η_t ⟨ĝ_t(x_t), z⟩ + Δ_ψ(z, x_t) }
  ❖ Estimating the gradient:
    ✴ ĝ_t(x_t) = (d/δ) · E[f(x_t + δ v_t) v_t]
    ✴ Gained popularity from (Nemirovski & Yudin '83, Flaxman et al. '05)

  7. METHODS
  ❖ Classical method: Estimating Gradient Descent (EGD)
  ❖ Gradient descent / mirror descent, with ĝ_t(x_t) ≈ ∇f(x_t):
    x_{t+1} ← x_t − η_t ĝ_t(x_t)
    x_{t+1} ∈ argmin_{z ∈ R^d} { η_t ⟨ĝ_t(x_t), z⟩ + Δ_ψ(z, x_t) }
  ❖ Estimating the gradient:
    ✴ ĝ_t(x_t) = (d/δ) · E[f(x_t + δ v_t) v_t]
    ✴ Gained popularity from (Nemirovski & Yudin '83, Flaxman et al. '05)
  [Figure: the query point x_t is perturbed in a random direction v_t with radius δ]
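A minimal sketch of one EGD iteration with a single-query sphere-sampling estimator of this form, reusing noisy_oracle from the earlier formulation sketch (the step size eta and smoothing radius delta are illustrative and untuned):

    import numpy as np

    rng = np.random.default_rng(1)

    def sphere_sample(d):
        # v uniformly distributed on the unit sphere in R^d
        v = rng.normal(size=d)
        return v / np.linalg.norm(v)

    def zo_gradient_estimate(noisy_oracle, x, delta):
        # single-query estimator g_hat = (d / delta) * f(x + delta * v) * v;
        # its expectation approximates the gradient of a smoothed version of f
        d = x.shape[0]
        v = sphere_sample(d)
        return (d / delta) * noisy_oracle(x + delta * v) * v

    def egd_step(noisy_oracle, x, eta, delta):
        # plain gradient-descent update x_{t+1} = x_t - eta_t * g_hat_t(x_t)
        return x - eta * zo_gradient_estimate(noisy_oracle, x, delta)

In practice a two-point variant that queries both x_t + δ v_t and x_t − δ v_t is often used, since it typically reduces the variance of the estimate.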

  8. METHODS
  ❖ Classical method: Estimating Gradient Descent (EGD)
  ❖ Classical analysis
    ✴ Supposing ‖x*‖ ≤ B and ‖∇f‖_* ≤ H
    ✴ Stochastic GD/MD: f(x̂) − f* ≲ BH/√T
    ✴ Estimating GD/MD: f(x̂) − f* ≲ √d · BH/T^{1/4}
  ❖ Problem: cannot exploit (sparse) structure in x*

  9. METHODS
  ❖ Classical method: Estimating Gradient Descent (EGD)
  ❖ Classical analysis
    ✴ Supposing ‖x*‖ ≤ B and ‖∇f‖_* ≤ H
    ✴ Stochastic GD/MD (first-order: E ĝ_t(x_t) = ∇f(x_t)): f(x̂) − f* ≲ BH/√T
    ✴ Estimating GD/MD: f(x̂) − f* ≲ √d · BH/T^{1/4}
  ❖ Problem: cannot exploit (sparse) structure in x*

  10. METHODS
  ❖ Classical method: Estimating Gradient Descent (EGD)
  ❖ Classical analysis
    ✴ Supposing ‖x*‖ ≤ B and ‖∇f‖_* ≤ H
    ✴ Stochastic GD/MD (first-order: E ĝ_t(x_t) = ∇f(x_t)): f(x̂) − f* ≲ BH/√T
    ✴ Estimating GD/MD (zeroth-order: E‖ĝ_t(x_t) − ∇f(x_t)‖²₂ is small, but E ĝ_t(x_t) ≠ ∇f(x_t)): f(x̂) − f* ≲ √d · BH/T^{1/4}
  ❖ Problem: cannot exploit (sparse) structure in x*
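The bias mentioned here comes from smoothing. A standard identity behind sphere-sampling estimators (in the spirit of Flaxman et al. '05, stated for the noiseless case) is:

    % \hat f_\delta is the ball-smoothed objective; u is uniform on the unit ball, v on the unit sphere
    \hat f_\delta(x) = \mathbb{E}_{u \sim \mathrm{Unif}(B_2^d)}\big[f(x + \delta u)\big],
    \qquad
    \nabla \hat f_\delta(x) = \frac{d}{\delta}\,\mathbb{E}_{v \sim \mathrm{Unif}(S^{d-1})}\big[f(x + \delta v)\, v\big].

So E ĝ_t(x_t) equals the gradient of the smoothed objective f̂_δ rather than ∇f(x_t); shrinking δ reduces this bias but increases the variance of the estimate.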

  12. ASSUMPTIONS
  ❖ The "function sparsity" assumption: f(x) ≡ f(x_S) for some S ⊆ [d], |S| = s ≪ d
  ❖ Strong theoretically, but arguably acceptable in practice
    ✴ Hyper-parameter tuning: performance is not sensitive to many of the input parameters
    ✴ Visual stimuli optimization: most brain activities are not related to visual reactions
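A toy instance of this assumption (purely illustrative; the active set S and the quadratic form are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    d, s = 1000, 5                                   # ambient dimension vs. active coordinates
    S = rng.choice(d, size=s, replace=False)         # the unknown active set
    theta = rng.normal(size=s)

    def f_sparse(x):
        # f(x) == f(x_S): the value depends only on the s coordinates in S,
        # so the gradient is s-sparse as well
        return 0.5 * np.sum((x[S] - theta) ** 2)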

  13. LASSO GRADIENT ESTIMATE
  ❖ Local linear approximation: f(x_t + δ v_t) ≈ f(x_t) + δ ⟨∇f(x_t), v_t⟩
  ❖ Lasso gradient estimate:
    ✴ Sample v_1, ..., v_n and observe y_i ≈ f(x_t + δ v_i) − f(x_t)
    ✴ Construct a sparse linear system: Ỹ = Y/δ = V ∇f(x_t) + ε

  14. LASSO GRADIENT ESTIMATE
  ❖ Local linear approximation: f(x_t + δ v_t) ≈ f(x_t) + δ ⟨∇f(x_t), v_t⟩
  ❖ Lasso gradient estimate:
    ✴ Construct a sparse linear system: Ỹ = Y/δ = V ∇f(x_t) + ε
    ✴ Because ∇f(x_t) is sparse, one can use the Lasso:
      ĝ_t(x_t) ∈ argmin_{g ∈ R^d} { ‖Ỹ − Vg‖²₂ + λ‖g‖₁ }

  15. LASSO GRADIENT ESTIMATE
  ❖ Local linear approximation: f(x_t + δ v_t) ≈ f(x_t) + δ ⟨∇f(x_t), v_t⟩
  ❖ Lasso gradient estimate:
    ✴ Construct a sparse linear system: Ỹ = Y/δ = V ∇f(x_t) + ε
      (certain "de-biasing" is required ... see the paper)
    ✴ Because ∇f(x_t) is sparse, one can use the Lasso:
      ĝ_t(x_t) ∈ argmin_{g ∈ R^d} { ‖Ỹ − Vg‖²₂ + λ‖g‖₁ }
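A minimal sketch of the Lasso step (using scikit-learn's Lasso, whose regularization scaling differs by a constant from the display above; the Rademacher design and the choices of n, delta, and lam are illustrative assumptions, and the paper's de-biasing step is omitted):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(3)

    def lasso_gradient_estimate(noisy_oracle, x, delta, n, lam):
        d = x.shape[0]
        # n perturbation directions; rows of V form the design matrix
        V = rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(d)
        f0 = noisy_oracle(x)
        # finite differences: y_i ~ (f(x + delta * v_i) - f(x)) / delta
        y = np.array([noisy_oracle(x + delta * v) - f0 for v in V]) / delta
        # sparse regression y ~ V g recovers an (approximately) sparse gradient
        model = Lasso(alpha=lam, fit_intercept=False)
        model.fit(V, y)
        return model.coef_

The resulting ĝ_t(x_t) would then be plugged into the gradient-descent or mirror-descent update from slide 5.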

  16. MAIN RESULTS
  Theorem. Suppose f(x) ≡ f(x_S) for some |S| = s ≪ d, and other smoothness conditions on f hold. Then
    (1/T) Σ_{t=1}^{T} [f(x_t) − f*] ≲ poly(s, log d) · T^{−1/4}.
  Furthermore, for smoother f the T^{−1/4} can be improved to T^{−1/3}.

  17. MAIN RESULTS
  Theorem. Suppose f(x) ≡ f(x_S) for some |S| = s ≪ d, and other smoothness conditions on f hold. Then
    (1/T) Σ_{t=1}^{T} [f(x_t) − f*] ≲ poly(s, log d) · T^{−1/4}.
  Furthermore, for smoother f the T^{−1/4} can be improved to T^{−1/3}.
  Can handle the "high-dimensional" setting d ≫ T.

  18. SIMULATION RESULTS

  19. SIMULATION RESULTS

  20. OPEN QUESTIONS
  ❖ Is function/gradient sparsity absolutely necessary?
    ✴ Recall that in the first-order case, only sparsity of the solution x* is required
    ✴ More specifically, one only needs ‖x*‖₁ ≤ B, ‖∇f‖_∞ ≤ H
    ✴ Conjecture: if f only satisfies the above condition, then
      inf_{x̂_T} sup_f E[f(x̂_T) − f*] ≳ poly(d, 1/T)

  21. OPEN QUESTIONS
  ❖ Is T^{−1/2} convergence achievable in high dimensions?
    ✴ Challenge 1: MD is awkward in exploiting strong convexity:
      f(x') ≥ f(x) + ⟨∇f(x), x' − x⟩ + (ν/2) Δ_ψ(x', x)
    ✴ Challenge 2: the Lasso gradient estimate is less efficient; can we design a convex body K such that
      ĝ_t(x_t) = (ρ(K)/δ) ∫_{∂K} f(x_t + δ v) n(v) dμ(v)
      is a good gradient estimator in high dimensions?

  22. OPEN QUESTIONS
  ❖ Is T^{−1/2} convergence achievable in high dimensions?
    ✴ Challenge 1: MD is awkward in exploiting strong convexity:
      f(x') ≥ f(x) + ⟨∇f(x), x' − x⟩ + (ν/2) Δ_ψ(x', x)
      (wish to replace Δ_ψ(x', x) with ‖x' − x‖₁²)
    ✴ Challenge 2: the Lasso gradient estimate is less efficient; can we design a convex body K such that
      ĝ_t(x_t) = (ρ(K)/δ) ∫_{∂K} f(x_t + δ v) n(v) dμ(v)
      is a good gradient estimator in high dimensions?

  23. Thank you! Questions
