SLIDE 1
Optimal Algorithms for Online Convex Optimization with Multi-Point Bandit Feedback
Alekh Agarwal (UC Berkeley), Ofer Dekel (Microsoft Research), Lin Xiao (Microsoft Research)
SLIDE 2
Online Convex Optimization (Full-Info)
SLIDE 3
Online Convex Optimization (Full-Info)
[Figure: the player picks a point x1 in the convex set K]
SLIDE 4
Online Convex Optimization (Full-Info)
[Figure: the adversary picks a convex loss ℓ1 on K and reveals it to the player]
SLIDE 5
Online Convex Optimization (Full-Info)
Player updates xt+1 = ΠK(xt − η∇ℓt(xt)).
[Figure: the player observes ∇ℓ1(x1) and takes a projected gradient step from x1 to x2]
SLIDE 6
Online Convex Optimization (Full-Info)
[Figure: after T rounds the player has played x1, x2, . . . , xT against the losses ℓ1, . . . , ℓT]
Minimize regret: RT = Σ_{t=1}^{T} ℓt(xt) − min_{x∈K} Σ_{t=1}^{T} ℓt(x).
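To make the protocol concrete, here is a minimal sketch of projected online gradient descent and the regret computation, assuming K is a Euclidean ball of radius R (so ΠK has a closed form) and an oblivious adversary playing quadratic losses; all names and the toy instance are illustrative, not from the talk.

```python
import numpy as np

def project_ball(x, R):
    """Euclidean projection onto the radius-R ball (our assumed K)."""
    n = np.linalg.norm(x)
    return x if n <= R else (R / n) * x

def ogd(grads, d, T, R, eta):
    """Projected online gradient descent: x_{t+1} = Pi_K(x_t - eta * grad_t(x_t))."""
    x = np.zeros(d)
    played = []
    for t in range(T):
        played.append(x)
        x = project_ball(x - eta * grads[t](x), R)
    return played

# Toy adversary: quadratic losses l_t(x) = 0.5 * ||x - theta_t||^2.
rng = np.random.default_rng(0)
d, T, R = 5, 200, 1.0
thetas = [project_ball(rng.normal(size=d), R) for _ in range(T)]
losses = [lambda x, th=th: 0.5 * np.sum((x - th) ** 2) for th in thetas]
grads = [lambda x, th=th: x - th for th in thetas]

played = ogd(grads, d, T, R, eta=1.0 / np.sqrt(T))
x_star = project_ball(np.mean(thetas, axis=0), R)  # minimizer of the summed losses over K
regret = sum(l(x) for l, x in zip(losses, played)) - sum(l(x_star) for l in losses)
print(f"R_T = {regret:.3f}")
```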
SLIDE 7
Bandit Convex Optimization
SLIDE 8
Bandit Convex Optimization
[Figure: the player picks a point x1 in the convex set K]
SLIDE 9
Bandit Convex Optimization
[Figure: the adversary picks ℓ1, but now reveals only the single value ℓ1(x1)]
SLIDE 10
Bandit Gradient Descent [FKM’05]
[Figure: the algorithm maintains an internal point x1 ∈ K, shown alongside the full-info trajectory]
SLIDE 11
Bandit Gradient Descent [FKM’05]
[Figure: the player actually plays a randomly perturbed point y1 near x1]
SLIDE 12
Bandit Gradient Descent [FKM’05]
[Figure: the player observes only the value ℓ1(y1)]
SLIDE 13
Bandit Gradient Descent [FKM’05]
Updates xt+1 = Π(1−ξ)K(xt − ηtgt).
[Figure: the observed value yields the one-point gradient estimate g1, which drives the projected update]
Minimize regret: RT = Σ_{t=1}^{T} ℓt(yt) − min_{x∈K} Σ_{t=1}^{T} ℓt(x).
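Below is a sketch of one round of this bandit loop, under the same assumptions as the earlier snippet (K a Euclidean ball of radius r, so projecting onto the shrunken set (1 − ξ)K is just a rescaling); the shrinkage keeps the perturbed query yt = xt + δut inside K. Names are illustrative.

```python
import numpy as np

def fkm_round(x, loss_t, d, r, eta_t, delta, xi, rng):
    """One round of bandit gradient descent [FKM'05]:
    play y_t = x_t + delta*u_t, observe only l_t(y_t),
    form g_t = (d/delta) * l_t(y_t) * u_t, then project onto (1 - xi)K."""
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                 # u_t uniform on the unit sphere
    y = x + delta * u                      # the point actually played
    g = (d / delta) * loss_t(y) * u        # one-point gradient estimate
    x_new = x - eta_t * g
    shrink = (1 - xi) * r                  # radius of the shrunken set (1 - xi)K
    n = np.linalg.norm(x_new)
    if n > shrink:
        x_new = (shrink / n) * x_new       # projection Pi_{(1-xi)K}
    return x_new, y
```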
SLIDE 14
A survey of known regret bounds
             Linear             Convex             Strongly Convex
             Upper    Lower     Upper    Lower     Upper      Lower
Full-Info    O(√T)    O(√T)     O(√T)    O(√T)     O(log T)   O(log T)
Deterministic results against completely adaptive adversaries in Full-Info.
SLIDE 15
A survey of known regret bounds
             Linear             Convex               Strongly Convex
             Upper    Lower     Upper      Lower     Upper      Lower
Full-Info    O(√T)    O(√T)     O(√T)      O(√T)     O(log T)   O(log T)
Bandit       O(√T)    O(√T)     O(T^(3/4)) O(√T)     O(T^(2/3)) O(√T)?
Deterministic results against completely adaptive adversaries in Full-Info. High probability results against adaptive adversaries for Bandit.
SLIDE 16
The Multi-Point (MP) feedback setup
Want to interpolate between bandit and full information.
The player is allowed k queries per round.
The adversary reveals the value of ℓt at all points picked.
Average regret on the points played:
RT = Σ_{t=1}^{T} (1/k) Σ_{i=1}^{k} ℓt(yt,i) − min_{x∈K} Σ_{t=1}^{T} ℓt(x).
SLIDE 17
A survey of known regret bounds
             Linear             Convex               Strongly Convex
             Upper    Lower     Upper      Lower     Upper      Lower
Full-Info    O(√T)    O(√T)     O(√T)      O(√T)     O(log T)   O(log T)
Bandit       O(√T)    O(√T)     O(T^(3/4)) O(√T)     O(T^(2/3)) O(√T)?
MP Bandit    O(√T)    O(√T)     O(√T)      O(√T)     O(log T)   O(log T)
Deterministic results against completely adaptive adversaries in Full-Info. High probability results against adaptive adversaries for Bandit.
SLIDE 18
Properties of gradient estimator gt [FKM’05]
gt = (d/δ) ℓt(xt + δut) ut.
Unbiased for linear functions. Nearly unbiased for general convex functions.
[Figure: ℓt in one dimension, with query points xt − δ and xt + δ around xt]
SLIDE 19
Properties of gradient estimator gt [FKM’05]
gt = (d/δ) ℓt(xt + δut) ut.
Unbiased for linear functions. Nearly unbiased for general convex functions.
[Figure: ℓt over an interval of width 2δ around xt, with values ℓt(xt − δ) and ℓt(xt + δ)]
Regret bounds scale with ‖gt‖, and ‖gt‖ grows as 1/δ.
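A quick numerical check of both claims on a toy quadratic in d = 3 (an illustrative instance, not from the talk): the sample mean of gt approximately recovers the true gradient, while each individual gt has norm on the order of d/δ.

```python
import numpy as np

rng = np.random.default_rng(1)
d, delta, n = 3, 0.01, 200_000
x = np.array([0.3, -0.2, 0.5])
loss = lambda z: 0.5 * np.sum(z ** 2) + z[0]   # toy convex loss
true_grad = x + np.array([1.0, 0.0, 0.0])

us = rng.normal(size=(n, d))
us /= np.linalg.norm(us, axis=1, keepdims=True)          # u_t uniform on the sphere
vals = np.array([loss(x + delta * u) for u in us])
gs = (d / delta) * vals[:, None] * us                    # one-point estimates g_t

print("mean of g_t:  ", np.round(gs.mean(axis=0), 2))    # close to the true gradient
print("true gradient:", true_grad)
print("mean ||g_t||: ", np.linalg.norm(gs, axis=1).mean())  # blows up like 1/delta
```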
SLIDE 20
Gradient Descent Algorithm with two queries per round (GD2P)
Estimates gradient g̃t = (d/(2δ)) (ℓt(xt + δut) − ℓt(xt − δut)) ut.
Updates xt+1 = Π(1−ξ)K(xt − η g̃t).
[Figure: the algorithm maintains an internal point x1 ∈ K]
SLIDE 21
Gradient Descent Algorithm with two queries per round (GD2P)
Estimates gradient g̃t = (d/(2δ)) (ℓt(xt + δut) − ℓt(xt − δut)) ut.
Updates xt+1 = Π(1−ξ)K(xt − η g̃t).
[Figure: the player plays the pair of points y1,1 and y1,2 on either side of x1]
SLIDE 22
Gradient Descent Algorithm with two queries per round (GD2P)
Estimates gradient g̃t = (d/(2δ)) (ℓt(xt + δut) − ℓt(xt − δut)) ut.
Updates xt+1 = Π(1−ξ)K(xt − η g̃t).
[Figure: the adversary reveals the two values ℓ1(y1,1) and ℓ1(y1,2)]
SLIDE 23
Gradient Descent Algorithm with two queries per round (GD2P)
Estimates gradient g̃t = (d/(2δ)) (ℓt(xt + δut) − ℓt(xt − δut)) ut.
Updates xt+1 = Π(1−ξ)K(xt − η g̃t).
[Figure: the two revealed values determine the gradient estimate g̃1, which drives the projected update]
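A minimal sketch of one GD2P round, under the same ball-shaped-K assumption as the earlier snippets; the only change from the one-point loop is that both perturbed points are played and the symmetric difference replaces the single function value.

```python
import numpy as np

def gd2p_round(x, loss_t, d, r, eta, delta, xi, rng):
    """One round of GD2P: play y1 = x + delta*u and y2 = x - delta*u,
    set g = (d/(2*delta)) * (l(y1) - l(y2)) * u, project onto (1 - xi)K."""
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    y1, y2 = x + delta * u, x - delta * u            # the two queried points
    g = (d / (2 * delta)) * (loss_t(y1) - loss_t(y2)) * u
    x_new = x - eta * g
    shrink = (1 - xi) * r
    n = np.linalg.norm(x_new)
    if n > shrink:
        x_new = (shrink / n) * x_new                 # projection Pi_{(1-xi)K}
    return x_new, 0.5 * (loss_t(y1) + loss_t(y2))    # average loss this round
```

For a G-Lipschitz ℓt the difference |ℓt(y1) − ℓt(y2)| is at most 2δG, so the 1/(2δ) factor cancels and ‖g̃t‖ stays bounded by dG no matter how small δ is; the next slides make this precise.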
SLIDE 24
Properties of the gradient estimator g̃t
gt = (d/δ) ℓt(xt + δut) ut,  g̃t = (d/(2δ)) (ℓt(xt + δut) − ℓt(xt − δut)) ut.
Identical to gt in expectation: E g̃t = E gt.
Bounded norm: ‖g̃t‖ ≤ dG.
‖g̃t‖ = ‖(d/(2δ)) (ℓt(xt + δut) − ℓt(xt − δut)) ut‖
SLIDE 25
Properties of the gradient estimator g̃t
gt = (d/δ) ℓt(xt + δut) ut,  g̃t = (d/(2δ)) (ℓt(xt + δut) − ℓt(xt − δut)) ut.
Identical to gt in expectation: E g̃t = E gt.
Bounded norm: ‖g̃t‖ ≤ dG.
‖g̃t‖ = ‖(d/(2δ)) (ℓt(xt + δut) − ℓt(xt − δut)) ut‖ = (d/(2δ)) |ℓt(xt + δut) − ℓt(xt − δut)|
SLIDE 26
Properties of the gradient estimator g̃t
gt = (d/δ) ℓt(xt + δut) ut,  g̃t = (d/(2δ)) (ℓt(xt + δut) − ℓt(xt − δut)) ut.
Identical to gt in expectation: E g̃t = E gt.
Bounded norm: ‖g̃t‖ ≤ dG.
‖g̃t‖ = ‖(d/(2δ)) (ℓt(xt + δut) − ℓt(xt − δut)) ut‖ = (d/(2δ)) |ℓt(xt + δut) − ℓt(xt − δut)| ≤ (dG/(2δ)) ‖2δut‖ = dG.
SLIDE 27
Regret analysis for gradient descent with two queries
Bounded non-empty set: rB ⊆ K ⊆ DB.
Lipschitz loss functions: |ℓt(x) − ℓt(y)| ≤ G‖x − y‖ for all x, y ∈ K and all t.
σt-strong convexity: ℓt(y) ≥ ℓt(x) + ⟨∇ℓt(x), y − x⟩ + (σt/2)‖x − y‖².
Theorem. Under the above assumptions, let σ1 > 0. If the GD2P algorithm is run with ηt = 1/σ1:t (where σ1:t = σ1 + · · · + σt), δ = log T / T and ξ = δ/r, then for any x ∈ K,
E Σ_{t=1}^{T} (1/2)(ℓt(yt,1) + ℓt(yt,2)) − E Σ_{t=1}^{T} ℓt(x) ≤ (d²G²/2) Σ_{t=1}^{T} 1/σ1:t + G log(T) (3 + D/r).
SLIDE 28
Regret bound for convex, Lipschitz functions
Corollary. Suppose the set K is bounded and non-empty, and ℓt is convex and G-Lipschitz for all t. If the GD2P algorithm is run with ηt = 1/√T, δ = log T / T and ξ = δ/r, then
E Σ_{t=1}^{T} (1/2)(ℓt(yt,1) + ℓt(yt,2)) − min_{x∈K} E Σ_{t=1}^{T} ℓt(x) ≤ (d²G² + D²)√T + G log(T) (3 + D/r).
Optimal due to matching lower bound in full-information setup. Bound also holds with high probability for adaptive adversaries.
SLIDE 29
Regret bound for strongly convex, Lipschitz functions
Corollary. Suppose the set K is bounded and non-empty, and ℓt is σ-strongly convex and G-Lipschitz for all t. If the GD2P algorithm is run with ηt = 1/(σt), δ = log T / T and ξ = δ/r, then
E Σ_{t=1}^{T} (1/2)(ℓt(yt,1) + ℓt(yt,2)) − min_{x∈K} E Σ_{t=1}^{T} ℓt(x) ≤ G log(T) (d²G/σ + 3 + D/r).
Optimal due to matching lower bound in full-information setup.
SLIDE 30
Extension to other gradient estimators
Bounded exploration (BE): ‖xt − yt,i‖ ≤ δ.
Bounded gradient estimator (BG): ‖g̃t‖ ≤ G1.
Approximately unbiased (AU): ‖Et g̃t − ∇ℓt(xt)‖ ≤ cδ.
Theorem. Let K be bounded and non-empty, and let ℓt be σt-strongly convex with σ1 > 0. For any gradient estimator satisfying the above conditions, the regret of the GD2P algorithm is bounded as:
E Σ_{t=1}^{T} (1/2)(ℓt(yt,1) + ℓt(yt,2)) − E Σ_{t=1}^{T} ℓt(x) ≤ (G1²/2) Σ_{t=1}^{T} 1/σ1:t + G log(T) (1 + 2c + D/r).
SLIDE 31
Analysis of other estimators for smooth functions
Need to establish conditions (BE), (BG) and (AU).
Smoothness assumption: ℓt(y) ≤ ℓt(x) + ⟨∇ℓt(x), y − x⟩ + (L/2)‖x − y‖².
Examples (written out in code below):
Squared ℓp norm ‖x − θ‖p² for p ≥ 2.
Quadratic loss (y − wᵀx)² for bounded x.
Logistic loss log(1 + exp(−wᵀx)).
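For concreteness, the three example losses as plain functions (a sketch; the parameters θ, w, y are illustrative placeholders, not from the talk):

```python
import numpy as np

theta = np.array([1.0, -1.0])          # illustrative parameters
w, y = np.array([0.5, 2.0]), 1.0

sq_pnorm = lambda x, p=4.0: np.sum(np.abs(x - theta) ** p) ** (2.0 / p)  # ||x - theta||_p^2, p >= 2
quadratic = lambda x: (y - w @ x) ** 2                                   # (y - w^T x)^2
logistic = lambda x: np.log1p(np.exp(-(w @ x)))                          # log(1 + exp(-w^T x))
```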
SLIDE 32
A Randomized Co-ordinate Descent algorithm
Pick a coordinate it ∈ {1, . . . , d} uniformly at random.
Play yt,1 = xt + δ e_{it}, yt,2 = xt − δ e_{it}.
Set g̃t = (d/(2δ)) (ℓt(yt,1) − ℓt(yt,2)) e_{it}.
SLIDE 33
A Randomized Co-ordinate Descent algorithm
Pick a coordinate it ∈ {1, . . . , d} uniformly at random.
Play yt,1 = xt + δ e_{it}, yt,2 = xt − δ e_{it}.
Set g̃t = (d/(2δ)) (ℓt(yt,1) − ℓt(yt,2)) e_{it}.
(AU) holds: ‖Et g̃t − ∇ℓt(xt)‖ ≤ (√d L δ)/4.
Same regret bound as before, with one-dimensional gradient updates.
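A sketch of the resulting estimator (illustrative names as before); it perturbs along a single standard basis vector rather than a random sphere direction, so each round touches only one coordinate of the gradient.

```python
import numpy as np

def coord_estimate(x, loss_t, d, delta, rng):
    """Pick i_t uniformly from {1, ..., d}, query x +/- delta * e_{i_t},
    return g_t = (d/(2*delta)) * (l(y1) - l(y2)) * e_{i_t}."""
    i = rng.integers(d)                      # random coordinate i_t
    e = np.zeros(d)
    e[i] = 1.0
    y1, y2 = x + delta * e, x - delta * e    # the two queried points
    return (d / (2 * delta)) * (loss_t(y1) - loss_t(y2)) * e
```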
SLIDE 34
Extension to completely adaptive adversaries
Previously we needed ℓt to be independent of xt; randomization is futile if ℓt depends on xt.
Can satisfy (AU) deterministically with d + 1 queries.
Deterministic first- and second-order algorithms for smooth loss functions.
Play the points xt and xt + δei for i = 1, . . . , d.
Set g̃t = (1/δ) Σ_{i=1}^{d} (ℓt(xt + δei) − ℓt(xt)) ei.
Satisfies (BE), (BG) and (AU): ‖g̃t‖ ≤ dG and ‖g̃t − ∇ℓt(xt)‖ ≤ (√d L δ)/2.
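A sketch of the deterministic estimator (illustrative names): forward differences along every coordinate, using d + 1 queries and no randomness, which is what makes it safe against an adversary whose ℓt depends on xt.

```python
import numpy as np

def forward_diff_estimate(x, loss_t, d, delta):
    """Query x and x + delta*e_i for i = 1..d; return
    g_t = (1/delta) * sum_i (l(x + delta*e_i) - l(x)) * e_i."""
    base = loss_t(x)                              # the (d+1)-th query, at x itself
    g = np.zeros(d)
    for i in range(d):
        step = np.zeros(d)
        step[i] = delta
        g[i] = (loss_t(x + step) - base) / delta  # forward difference along e_i
    return g
```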
SLIDE 35
Regret bounds for d + 1 queries
O(√T) regret for smooth, convex functions.
O(log T) regret for smooth, strongly convex functions.
O(log T) regret for smooth, exp-concave functions using a quasi-Newton variant.
Matches the lower bounds from the full-information setup.
Regret bounds hold for completely adaptive adversaries.
SLIDE 36
Conclusion
Introduced the multi-point feedback model for partial information.
Optimal regret with high probability against adaptive adversaries, using just 2 queries per round.
Handled completely adaptive adversaries using d + 1 queries.
Open questions:
One-point bandit feedback.
A √T lower bound for bandit strongly convex losses.
Distributions over the number of queries.
High-probability log(T) regret for strongly convex losses.
SLIDE 37