SLIDE 1

AISTATS 2018, Playa Blanca, Lanzarote, Spain

Zeroth-order Optimization in High Dimensions

Yining Wang

Carnegie Mellon University

Joint work with Simon Du, Sivaraman Balakrishnan and Aarti Singh

SLIDE 2

BACKGROUND

❖ Optimization: $\min_{x \in \mathcal{X}} f(x)$

❖ Classical setting (first-order):

✴ f is known (e.g., a likelihood function or an NN objective)

✴ $\nabla f(x)$ can be evaluated, or unbiasedly approximated

❖ Zeroth-order setting:

✴ f is unknown, or very complicated

✴ $\nabla f(x)$ is unknown, or very difficult to evaluate.

SLIDE 3

APPLICATIONS

❖ Hyper-parameter tuning

✴ f maps hyper-parameter x to system performance f(x).

❖ Experimental design

✴ f maps experimental setting to experimental results.

❖ Communication-efficient optimization

✴ Data defining the objective is scattered across machines

✴ Communicating $\nabla f(x)$ is expensive, but communicating f(x) is ok.

SLIDE 4

FORMULATION

❖ Convexity: the objective f is convex.

❖ Noisy observation model: $y_t = f(x_t) + \xi_t$, with $\xi_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$.

❖ Evaluation measures (sketched in code below):

✴ Simple regret: $f(\hat{x}_{T+1}) - f^*$

✴ Cumulative regret: $\sum_{t=1}^{T} f(x_t) - f^*$
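Below is a minimal Python sketch of this observation model and of the two regret measures. The quadratic objective, the noise level sigma, and the query points are illustrative placeholders, not anything taken from the paper.

```python
import numpy as np

# Minimal sketch (not from the paper) of the observation model y_t = f(x_t) + xi_t,
# xi_t ~ N(0, sigma^2), and of the simple / cumulative regret measures.
rng = np.random.default_rng(0)
sigma = 0.1
d = 10
x_star = np.zeros(d)

def f(x):
    return 0.5 * float(np.sum((x - x_star) ** 2))   # convex toy objective (illustrative)

def noisy_oracle(x):
    return f(x) + sigma * rng.standard_normal()     # zeroth-order access: noisy values only

f_star = f(x_star)
xs = [0.1 * rng.standard_normal(d) for _ in range(100)]   # stand-in iterates x_1, ..., x_T
cumulative_regret = sum(f(x) - f_star for x in xs)         # sum_t [f(x_t) - f*]
simple_regret = f(xs[-1]) - f_star                          # f(x_hat_{T+1}) - f*, with x_hat = last iterate
print(cumulative_regret, simple_regret)
```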

SLIDE 5

METHODS

❖ Classical method: Estimating Gradient Descent (EGD)

❖ Gradient descent / Mirror descent:

✴ $x_{t+1} \leftarrow x_t - \eta_t \hat{g}_t(x_t)$

✴ $x_{t+1} \in \arg\min_{z \in \mathbb{R}^d} \{ \eta_t \langle \hat{g}_t(x_t), z \rangle + \Delta_\psi(z, x_t) \}$

❖ Estimating gradient (a single-sample code sketch follows below):

✴ $\hat{g}_t(x_t) = \frac{d}{\delta} \cdot \mathbb{E}[f(x_t + \delta v_t)\, v_t]$

✴ Gained popularity from (Nemirovski & Yudin '83, Flaxman et al. '05)
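As a concrete, hedged illustration of these two ingredients, the sketch below implements a single-sample version of the estimator followed by the plain gradient-descent update. Drawing $v_t$ uniformly from the unit sphere, and the values of eta and delta, are assumptions for illustration rather than the paper's exact choices; `noisy_oracle` is the placeholder defined in the earlier sketch.

```python
import numpy as np

# Single-sample sketch of one EGD step: a hedged illustration, not the paper's exact
# algorithm.  The slide's estimator g_hat(x_t) = (d/delta) * E[f(x_t + delta v_t) v_t]
# is approximated with one draw of v_t, assumed uniform on the unit sphere.
def sample_unit_sphere(d, rng):
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

def egd_step(x_t, noisy_oracle, eta=0.05, delta=0.01, rng=None):
    rng = rng or np.random.default_rng()
    d = x_t.shape[0]
    v_t = sample_unit_sphere(d, rng)
    # one-point gradient estimate: (d / delta) * f(x_t + delta v_t) * v_t
    g_hat = (d / delta) * noisy_oracle(x_t + delta * v_t) * v_t
    return x_t - eta * g_hat            # x_{t+1} <- x_t - eta_t * g_hat_t(x_t)
```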

SLIDE 6

METHODS

❖ Classical method: Estimating Gradient Descent (EGD)

❖ Gradient descent / Mirror descent:

✴ $x_{t+1} \leftarrow x_t - \eta_t \hat{g}_t(x_t)$

✴ $x_{t+1} \in \arg\min_{z \in \mathbb{R}^d} \{ \eta_t \langle \hat{g}_t(x_t), z \rangle + \Delta_\psi(z, x_t) \}$

❖ Estimating gradient:

✴ $\hat{g}_t(x_t) = \frac{d}{\delta} \cdot \mathbb{E}[f(x_t + \delta v_t)\, v_t]$, so that $\hat{g}_t(x_t) \approx \nabla f(x_t)$

✴ Gained popularity from (Nemirovski & Yudin '83, Flaxman et al. '05)

SLIDE 7

METHODS

❖ Classical method: Estimating Gradient Descent (EGD)

❖ Gradient descent / Mirror descent:

✴ $x_{t+1} \leftarrow x_t - \eta_t \hat{g}_t(x_t)$

✴ $x_{t+1} \in \arg\min_{z \in \mathbb{R}^d} \{ \eta_t \langle \hat{g}_t(x_t), z \rangle + \Delta_\psi(z, x_t) \}$

❖ Estimating gradient:

✴ $\hat{g}_t(x_t) = \frac{d}{\delta} \cdot \mathbb{E}[f(x_t + \delta v_t)\, v_t]$, so that $\hat{g}_t(x_t) \approx \nabla f(x_t)$

✴ Gained popularity from (Nemirovski & Yudin '83, Flaxman et al. '05)

[Figure: the query point $x_t + \delta v_t$, a perturbation of $x_t$ by $\delta$ along direction $v_t$]

SLIDE 8

METHODS

❖ Classical method: Estimating Gradient Descent (EGD)

❖ Classical analysis

✴ Supposing $\|\nabla f\| \le H$ and $\|x^*\| \le B$ (in appropriate dual/primal norms)

✴ Stochastic GD/MD: $f(\hat{x}) - f^* \lesssim BH / \sqrt{T}$

✴ Estimating GD/MD: $f(\hat{x}) - f^* \lesssim \sqrt{d} \cdot BH / T^{1/4}$ (the two bounds are compared numerically below)

❖ Problem: cannot exploit (sparse) structure in x*
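To make the dimension dependence concrete, here is a quick numeric comparison of the two bounds above with B = H = 1; the values of d and T are illustrative only.

```python
import numpy as np

# Plugging B = H = 1 into the two bounds above (d and T are illustrative values):
# first-order stochastic GD/MD scales as 1/sqrt(T), while zeroth-order EGD pays an
# extra sqrt(d) factor and the slower T^(-1/4) rate.
for d, T in [(100, 10**6), (10**4, 10**6)]:
    first_order = 1.0 / np.sqrt(T)        # BH / sqrt(T)
    egd = np.sqrt(d) / T ** 0.25          # sqrt(d) * BH / T^(1/4)
    print(f"d={d:>6}, T={T}: first-order ~ {first_order:.1e}, EGD ~ {egd:.1e}")
```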

SLIDE 9

METHODS

❖ Classical method: Estimating Gradient Descent (EGD)

❖ Classical analysis

✴ Supposing $\|\nabla f\| \le H$ and $\|x^*\| \le B$ (in appropriate dual/primal norms)

✴ Stochastic GD/MD (first-order: $\mathbb{E}\,\hat{g}_t(x_t) = \nabla f(x_t)$): $f(\hat{x}) - f^* \lesssim BH / \sqrt{T}$

✴ Estimating GD/MD: $f(\hat{x}) - f^* \lesssim \sqrt{d} \cdot BH / T^{1/4}$

❖ Problem: cannot exploit (sparse) structure in x*

SLIDE 10

METHODS

❖ Classical method: Estimating Gradient Descent (EGD)

❖ Classical analysis

✴ Supposing $\|\nabla f\| \le H$ and $\|x^*\| \le B$ (in appropriate dual/primal norms)

✴ Stochastic GD/MD (first-order: $\mathbb{E}\,\hat{g}_t(x_t) = \nabla f(x_t)$): $f(\hat{x}) - f^* \lesssim BH / \sqrt{T}$

✴ Estimating GD/MD (zeroth-order: $\mathbb{E}\|\hat{g}_t(x_t) - \nabla f(x_t)\|_2^2$ is small, but $\mathbb{E}\,\hat{g}_t(x_t) \neq \nabla f(x_t)$): $f(\hat{x}) - f^* \lesssim \sqrt{d} \cdot BH / T^{1/4}$

❖ Problem: cannot exploit (sparse) structure in x*

SLIDE 12

ASSUMPTIONS

❖ The “function sparsity” assumption: $f(x) \equiv f(x_S)$ for some $S \subseteq [d]$, $|S| = s \ll d$ (a toy instance is sketched below)

❖ Strong theoretically, but arguably acceptable in practice:

✴ Hyper-parameter tuning: performance is not sensitive to many of the input parameters

✴ Visual stimuli optimization: most brain activities are not related to visual reactions.
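A toy instance of this assumption, sketched below: f depends only on the coordinates indexed by S. The dimension, the support S, and the quadratic form are made up for illustration.

```python
import numpy as np

# Toy instance of the function-sparsity assumption f(x) = f(x_S): although x lives in
# R^d with d = 1000, f only depends on the s = 5 coordinates in S (d, S, and the
# quadratic form are illustrative).
d, S = 1000, [3, 17, 42, 256, 999]

def f_sparse(x):
    x_S = x[S]                               # only the coordinates in S matter
    return float(np.sum(x_S ** 2) + np.sum(x_S))

x = np.random.randn(d)
x_perturbed = x.copy()
x_perturbed[0] += 10.0                       # change a coordinate outside S
assert np.isclose(f_sparse(x), f_sparse(x_perturbed))   # the value is unchanged
```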

SLIDE 13

LASSO GRADIENT ESTIMATE

❖ Local linear approximation: $f(x_t + \delta v_t) \approx f(x_t) + \delta \langle \nabla f(x_t), v_t \rangle$

❖ Lasso gradient estimate:

✴ Sample $v_1, \dots, v_n$ and observe $y_i \approx f(x_t + \delta v_i) - f(x_t)$

✴ Construct a sparse linear system: $\tilde{Y} = Y / \delta = V \nabla f(x_t) + \varepsilon$

SLIDE 14

LASSO GRADIENT ESTIMATE

❖ Local linear approximation: $f(x_t + \delta v_t) \approx f(x_t) + \delta \langle \nabla f(x_t), v_t \rangle$

❖ Lasso gradient estimate:

✴ Construct a sparse linear system: $\tilde{Y} = Y / \delta = V \nabla f(x_t) + \varepsilon$

✴ Because $\nabla f(x_t)$ is sparse, one can use the Lasso:
  $\hat{g}_t(x_t) \in \arg\min_{g \in \mathbb{R}^d} \frac{1}{n} \|\tilde{Y} - V g\|_2^2 + \lambda \|g\|_1$

SLIDE 15

LASSO GRADIENT ESTIMATE

❖ Local linear approximation: $f(x_t + \delta v_t) \approx f(x_t) + \delta \langle \nabla f(x_t), v_t \rangle$

❖ Lasso gradient estimate:

✴ Construct a sparse linear system: $\tilde{Y} = Y / \delta = V \nabla f(x_t) + \varepsilon$

✴ Because $\nabla f(x_t)$ is sparse, one can use the Lasso:
  $\hat{g}_t(x_t) \in \arg\min_{g \in \mathbb{R}^d} \frac{1}{n} \|\tilde{Y} - V g\|_2^2 + \lambda \|g\|_1$

• Certain “de-biasing” is required … see the paper (a hedged code sketch of the basic Lasso step follows below).
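The sketch below shows only the basic Lasso step, using scikit-learn's off-the-shelf solver, and omits the de-biasing mentioned above. The number of samples n, the smoothing radius delta, the regularization lambda, and the ±1/√d perturbation design are illustrative assumptions rather than the paper's exact choices; `noisy_oracle` is the placeholder from the earlier sketches.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sketch of the Lasso gradient estimate (no de-biasing; illustrative parameter choices).
def lasso_gradient_estimate(noisy_oracle, x_t, n=200, delta=0.05, lam=0.01,
                            rng=np.random.default_rng(0)):
    d = x_t.shape[0]
    V = rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(d)     # perturbation directions v_i
    y0 = noisy_oracle(x_t)
    Y = np.array([noisy_oracle(x_t + delta * V[i]) - y0 for i in range(n)])
    Y_tilde = Y / delta                                        # Y_tilde = V grad f(x_t) + eps
    # Lasso: argmin_g (1/n) ||Y_tilde - V g||_2^2 + lam ||g||_1 (up to sklearn's 1/(2n) scaling)
    lasso = Lasso(alpha=lam, fit_intercept=False)
    lasso.fit(V, Y_tilde)
    return lasso.coef_                                         # sparse estimate of grad f(x_t)
```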

SLIDE 16

MAIN RESULTS

• Theorem. Suppose $f(x) \equiv f(x_S)$ for some $|S| = s \ll d$, and other smoothness conditions on f hold. Then

  $\frac{1}{T} \sum_{t=1}^{T} f(x_t) - f^* \;\lesssim\; \mathrm{poly}(s, \log d) \cdot T^{-1/4}.$

• Furthermore, for smoother f the $T^{-1/4}$ rate can be improved to $T^{-1/3}$.

SLIDE 17

MAIN RESULTS

• Theorem. Suppose $f(x) \equiv f(x_S)$ for some $|S| = s \ll d$, and other smoothness conditions on f hold. Then

  $\frac{1}{T} \sum_{t=1}^{T} f(x_t) - f^* \;\lesssim\; \mathrm{poly}(s, \log d) \cdot T^{-1/4}.$

• Furthermore, for smoother f the $T^{-1/4}$ rate can be improved to $T^{-1/3}$.

• Can handle the “high-dimensional” setting $d \gg T$.

SLIDE 18

SIMULATION RESULTS

SLIDE 19

SIMULATION RESULTS

SLIDE 20

OPEN QUESTIONS

❖ Is function/gradient sparsity absolutely necessary?

✴ Recall that in the first-order case, only sparsity of the solution x* is required

✴ More specifically, one only needs $\|x^*\|_1 \le B$ and $\|\nabla f\|_\infty \le H$

✴ Conjecture: if f only satisfies the above condition, then
  $\inf_{\hat{x}_T} \sup_f \mathbb{E}[f(\hat{x}_T) - f^*] \;\gtrsim\; \mathrm{poly}(d, 1/T)$

SLIDE 21

OPEN QUESTIONS

❖ Is $T^{-1/2}$ convergence achievable in high dimensions?

✴ Challenge 1: MD is awkward in exploiting strong convexity:
  $f(x') \ge f(x) + \langle \nabla f(x), x' - x \rangle + \frac{\nu^2}{2} \Delta_\psi(x', x)$

✴ Challenge 2: the Lasso gradient estimate is less efficient. Can we design a convex body K such that
  $\hat{g}_t(x_t) = \frac{\rho(K)}{\delta} \int_{\partial K} f(x_t + \delta v)\, n(v)\, d\mu(v)$
  is a good gradient estimator in high dimensions?

SLIDE 22

OPEN QUESTIONS

❖ Is $T^{-1/2}$ convergence achievable in high dimensions?

✴ Challenge 1: MD is awkward in exploiting strong convexity:
  $f(x') \ge f(x) + \langle \nabla f(x), x' - x \rangle + \frac{\nu^2}{2} \Delta_\psi(x', x)$
  (Wish to replace $\Delta_\psi(x', x)$ with $\|x' - x\|_1^2$.)

✴ Challenge 2: the Lasso gradient estimate is less efficient. Can we design a convex body K such that
  $\hat{g}_t(x_t) = \frac{\rho(K)}{\delta} \int_{\partial K} f(x_t + \delta v)\, n(v)\, d\mu(v)$
  is a good gradient estimator in high dimensions?

SLIDE 23

Thank you! Questions?