

SLIDE 1

Generalization Bounds in the Predict-then-Optimize Framework

Othman El Balghiti (Rayens Capital), Adam N. Elmachtoub (Columbia University), Paul Grigas (University of California, Berkeley), and Ambuj Tewari (University of Michigan). NeurIPS 2019.

SLIDE 2

Outline of Topics

- Predict-then-optimize framework and preliminaries
- Combinatorial dimension-based generalization bounds
- Margin-based generalization bounds under strong convexity
- Conclusions and future directions

SLIDE 3

Motivation

- Large-scale optimization problems arising in practice almost always involve unknown parameters.
- Often there is a relationship between the unknown parameters and some contextual/auxiliary data.
- Given historical data, one approach is to build a predictive statistical/machine learning model from the data (e.g., using linear regression).
- First predict the unknown parameters, then optimize given the predictions.
- The predict phase and the optimize phase are naively decoupled.
- There is an opportunity for the prediction model to be informed by the downstream optimization task.

SLIDE 4

Contextual Stochastic Linear Optimization

We consider stochastic optimization problems of the form:

$$\min_{w \in S} \; \mathbb{E}_{c \sim D_x}\!\left[c^T w\right]$$

Notation:
- $S$ is a given convex and compact set
- $c$ is an unknown cost vector of the linear objective function
- $D_x$ is the conditional distribution of $c$ given an auxiliary feature/context vector $x \in \mathbb{R}^p$

There are various approaches for dealing with the above problem in the literature, often without constraints, with very simple constraints, or without directly accounting for the optimization structure.

SLIDE 5

Contextual Stochastic Linear Optimization, cont.

$$\min_{w \in S} \; \mathbb{E}_{c \sim D_x}\!\left[c^T w\right]$$

Notice that the linearity of the objective implies that

$$\min_{w \in S} \; \mathbb{E}_{c \sim D_x}\!\left[c^T w\right] \;=\; \min_{w \in S} \; \mathbb{E}_{c \sim D_x}[c \mid x]^T w.$$

Hence, it is sufficient to focus on estimating/predicting the vector $\mathbb{E}_{c \sim D_x}[c \mid x]$.

SLIDE 6

Predict-then-optimize (PO) Paradigm

We define $P(\hat{c})$ to be the optimization task with predicted cost vector $\hat{c}$:

$$P(\hat{c}) := \min_{w \in S} \; \hat{c}^T w,$$

and $w^*(\hat{c})$ denotes an arbitrary optimal solution of $P(\hat{c})$.

Predict-then-Optimize (PO) Paradigm:
- Given a new feature vector $x$, predict $\hat{c}$ based on $x$
- Make decision $w^*(\hat{c})$
- Incur cost $c^T w^*(\hat{c})$ with respect to the actual ("true") realized $c$
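As a concrete illustration, below is a minimal sketch of the PO paradigm in Python for a polytope represented by its extreme points; the vertex set, the linear predictor $B$, and all helper names are illustrative assumptions rather than part of the original slides.

```python
import numpy as np

# Hypothetical feasible region S: a polytope given by its extreme points (rows),
# here the unit simplex in R^3. A linear objective attains its minimum at a vertex.
VERTICES = np.eye(3)

def linear_opt_oracle(c_hat, vertices=VERTICES):
    """w*(c_hat): an optimal vertex of S for the (predicted) cost vector c_hat."""
    return vertices[np.argmin(vertices @ c_hat)]

def predict_then_optimize(B, x, c_true):
    """One pass of the PO paradigm with a linear predictor f(x) = B x."""
    c_hat = B @ x                    # predict: estimate the cost vector from the context x
    w = linear_opt_oracle(c_hat)     # optimize: solve P(c_hat)
    return c_true @ w                # cost actually incurred under the realized c
```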

SLIDE 7

Predict-then-Optimize (PO) Loss Function

Within the predict-then-optimize paradigm, we can naturally define a loss function referred to as the "Smart predict-then-optimize" (SPO) loss function [Elmachtoub and Grigas 2017]:

$$\ell_{SPO}(\hat{c}, c) := c^T\!\left(w^*(\hat{c}) - w^*(c)\right)$$

Given historical training data $(x_1, c_1), \ldots, (x_n, c_n)$ and a hypothesis class $H$ of cost vector prediction models (i.e., $f : \mathbb{R}^p \to \mathbb{R}^d$ for $f \in H$), the ERM principle yields:

Empirical Risk Minimization with the SPO Loss:

$$\min_{f \in H} \; \frac{1}{n} \sum_{i=1}^{n} \ell_{SPO}(f(x_i), c_i)$$
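Continuing the sketch above (same assumed oracle and linear predictors), the SPO loss and the ERM objective translate directly into code:

```python
def spo_loss(c_hat, c_true):
    """Excess cost of deciding with the prediction c_hat instead of the true cost vector c_true."""
    return c_true @ (linear_opt_oracle(c_hat) - linear_opt_oracle(c_true))

def empirical_spo_risk(B, xs, cs):
    """ERM objective for a linear predictor f(x) = B x on a sample (xs[i], cs[i])."""
    return float(np.mean([spo_loss(B @ x, c) for x, c in zip(xs, cs)]))
```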

SLIDE 8

Binary and Multiclass Classification as a Special Case

- It turns out that the SPO loss is a special case of the classical 0-1 loss in binary classification.
- This equivalence happens with $S = [-1/2, +1/2]$ and $c \in \{-1, +1\}$.
- This example can also be generalized to multiclass classification, where $S$ is now the unit simplex.
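A short worked calculation of the binary case (with ties at $\hat{c} = 0$ broken arbitrarily) fills in the claimed equivalence:

$$w^*(c) = \arg\min_{w \in [-1/2,\,1/2]} c\,w = -\tfrac{1}{2}\,\mathrm{sign}(c) \quad \text{for } c \in \{-1, +1\},$$

so

$$\ell_{SPO}(\hat{c}, c) = c\left(w^*(\hat{c}) - w^*(c)\right) = \begin{cases} 0 & \text{if } \mathrm{sign}(\hat{c}) = \mathrm{sign}(c), \\ 1 & \text{otherwise,} \end{cases}$$

which is exactly the 0-1 loss of the classifier that predicts the label $\mathrm{sign}(\hat{c})$.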

SLIDE 9

Empirical Risk Minimization with the SPO Loss

$$\min_{f \in H} \; \frac{1}{n} \sum_{i=1}^{n} \ell_{SPO}(f(x_i), c_i)$$

- It turns out that the SPO loss is nonconvex, and in fact may be discontinuous depending on the structure of $S$.
- Thus, the above optimization problem is challenging even for simple hypothesis classes such as linear functions $H = \{x \mapsto Bx : B \in \mathbb{R}^{d \times p}\}$.
- There are several approaches for addressing this problem computationally.
- An appealing idea is based on a surrogate loss function approach (see, e.g., [Elmachtoub and Grigas 2017], [Ho-Nguyen and Kilinc-Karzan 2019]).
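To make the discontinuity concrete, here is a tiny numerical illustration built on the simplex sketch from the earlier slides (the specific numbers are assumptions for illustration): as one predicted coordinate crosses another, the argmin vertex switches and the SPO loss jumps.

```python
c_true = np.array([1.0, 0.0, 0.0])     # true optimal decision is either of the last two vertices (cost 0)
for t in (0.4, 0.5, 0.6):
    c_hat = np.array([t, 0.5, 1.0])    # sweep the first predicted coordinate through 0.5
    print(t, spo_loss(c_hat, c_true))  # the loss drops from 1 to 0 discontinuously
```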

SLIDE 10

Generalization Bounds for the SPO Loss

$$\min_{f \in H} \; \frac{1}{n} \sum_{i=1}^{n} \ell_{SPO}(f(x_i), c_i)$$

- The focus of this work is not on optimization for the above problem, but on generalization.
- Generalization bounds verify that trying to solve the above problem (based on training data) is at all reasonable.
- Let us define the empirical and expected SPO loss as:

$$\hat{R}_{SPO}(f) := \frac{1}{n} \sum_{i=1}^{n} \ell_{SPO}(f(x_i), c_i), \quad \text{and} \quad R_{SPO}(f) := \mathbb{E}_{(x,c) \sim D}\!\left[\ell_{SPO}(f(x), c)\right]$$

SLIDE 11

Generalization Bounds for the SPO Loss

$$\hat{R}_{SPO}(f) := \frac{1}{n} \sum_{i=1}^{n} \ell_{SPO}(f(x_i), c_i), \qquad R_{SPO}(f) := \mathbb{E}_{(x,c) \sim D}\!\left[\ell_{SPO}(f(x), c)\right]$$

- A generalization bound relates the above two quantities and verifies that minimizing the empirical loss also (approximately) minimizes the expected loss.
- Importantly, the bound should hold uniformly over $f \in H$ and with high probability over $(x_i, c_i) \sim D^n$.
- A generalization bound implies an "on average" (over $x$) guarantee for the problem of interest:

$$\min_{w \in S} \; \mathbb{E}_{c \sim D_x}\!\left[c^T w \mid x\right]$$

SLIDE 12

Rademacher Complexity and Generalization

We follow a standard approach to establishing generalization bounds based on Rademacher complexity. Given the observed data $(x_1, c_1), \ldots, (x_n, c_n)$, define the empirical Rademacher complexity of $H$ w.r.t. the SPO loss as:

$$\hat{\mathfrak{R}}^n_{SPO}(H) := \mathbb{E}_\sigma\!\left[\sup_{f \in H} \; \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, \ell_{SPO}(f(x_i), c_i)\right],$$

where the $\sigma_i$ are i.i.d. Rademacher random variables uniformly distributed on $\{-1, +1\}$.

Let us also assume that $\ell_{SPO} \in [0, \omega]$ for some $\omega > 0$, which follows from the boundedness of $S$ and of the distribution of $c$.
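For intuition, the empirical SPO Rademacher complexity can be approximated by Monte Carlo when $H$ is a small finite set of predictors, so that the supremum can be taken by enumeration; this sketch reuses the assumed spo_loss helper from the earlier slides.

```python
def empirical_rademacher_spo(hypotheses, xs, cs, n_draws=1000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_{f in H} (1/n) sum_i sigma_i * l_SPO(f(x_i), c_i) ]
    for a small finite list of linear predictors (matrices B with f(x) = B x)."""
    rng = np.random.default_rng(seed)
    n = len(xs)
    losses = np.array([[spo_loss(B @ x, c) for x, c in zip(xs, cs)] for B in hypotheses])  # |H| x n
    draws = [np.max(losses @ rng.choice([-1.0, 1.0], size=n)) / n for _ in range(n_draws)]
    return float(np.mean(draws))
```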

SLIDE 13

Rademacher Complexity and Generalization, cont.

The following is a celebrated result yielding a generalization bound based on Rademacher complexity.

Theorem [Bartlett and Mendelson 2002]. Let $H$ be a family of functions mapping from $\mathbb{R}^p$ to $\mathbb{R}^d$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. sample drawn from the distribution $D$, the following holds for all $f \in H$:

$$R_{SPO}(f) \le \hat{R}_{SPO}(f) + 2\,\hat{\mathfrak{R}}^n_{SPO}(H) + 3\omega\sqrt{\frac{\log(2/\delta)}{2n}}.$$

The remaining challenge is to bound $\hat{\mathfrak{R}}^n_{SPO}(H)$, which is difficult due to the nonconvex and discontinuous nature of the SPO loss.

SLIDE 14

Bounds Based on Combinatorial Dimension

Let us first consider the case where:
- $S$ is a polytope with set of extreme points $\mathcal{S}$
- $H = H_{lin} := \{x \mapsto Bx : B \in \mathbb{R}^{d \times p}\}$ is the set of linear predictors

Theorem. Under the above two conditions, for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. sample drawn from the distribution $D$, the following holds for all $f \in H_{lin}$:

$$R_{SPO}(f) \le \hat{R}_{SPO}(f) + 2\omega\sqrt{\frac{2dp\,\log(n|\mathcal{S}|^2)}{n}} + \omega\sqrt{\frac{\log(1/\delta)}{2n}}.$$
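To get a feel for the scaling of this bound, one can plug in numbers; all values below are assumed purely for illustration.

```python
d, p, n = 10, 5, 10_000                    # cost dimension, feature dimension, sample size (assumed)
n_extreme, omega, delta = 100, 1.0, 0.05   # |extreme points of S|, loss bound, confidence level (assumed)

complexity_term = 2 * omega * np.sqrt(2 * d * p * np.log(n * n_extreme**2) / n)
confidence_term = omega * np.sqrt(np.log(1 / delta) / (2 * n))
print(complexity_term + confidence_term)   # generalization gap guaranteed with probability 1 - delta
```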

SLIDE 15

Bounds Based on Combinatorial Dimension, cont.

- The proof of the previous theorem is based on "reducing" the problem to a multiclass classification problem where the classes correspond to the extreme points of $S$.
- This is not a complete reduction, since the SPO loss function is more complicated.
- We can then leverage the notion of Natarajan dimension [Natarajan 1989], which is an extension of the VC dimension to the multiclass case.
- The key result is relating the SPO Rademacher complexity to the Natarajan dimension.
- Related techniques appeared recently in [Gupta and Kallus 2019].

SLIDE 16

Extension to Convex Sets

Using a discretization argument, we can extend the previous result to any bounded convex set $S$. We presume that $\|w\|_2 \le \rho_w$ for all $w \in S$.

Theorem. In the case of linear predictors and general compact and convex $S$, for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. sample drawn from the distribution $D$, the following holds for all $f \in H_{lin}$:

$$R_{SPO}(f) \le \hat{R}_{SPO}(f) + 4d\omega\sqrt{\frac{2p\,\log(2n\rho_w d)}{n}} + 3\omega\sqrt{\frac{\log(2/\delta)}{2n}} + O\!\left(\frac{1}{n}\right).$$

Question: Can we improve the dependence on the dimensions $d$ and $p$ and replace them with more "natural" quantities?


SLIDE 17

Strongly Convex Sets

We now make the additional assumption that $S$ is $\mu$-strongly convex with respect to the $\ell_2$-norm. Namely, for any $w_1, w_2 \in S$ and $\lambda \in [0, 1]$, the ball centered at $\lambda w_1 + (1 - \lambda) w_2$ of radius $\frac{\mu}{2}\lambda(1 - \lambda)\|w_1 - w_2\|_2^2$ is contained in $S$.

- Examples include certain norm balls and Schatten norm balls, as well as level sets of smooth and strongly convex functions.
- Intuitively, strong convexity of $S$ implies that linear optimization over $S$ is "poorly behaved" only when $c$ is near zero.
- Formally, we can prove that the linear optimization oracle $w^*(\cdot)$ must satisfy the following "Lipschitz-like" property:

$$\|w^*(\hat{c}_1) - w^*(\hat{c}_2)\|_2 \le \frac{1}{\mu \cdot \min\{\|\hat{c}_1\|_2, \|\hat{c}_2\|_2\}}\,\|\hat{c}_1 - \hat{c}_2\|_2 \quad \text{for any } \hat{c}_1, \hat{c}_2 \in \mathbb{R}^d.$$
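As a quick sanity check of the Lipschitz-like property, one can take $S$ to be the unit Euclidean ball, a standard example of a strongly convex set (with $\mu = 1$ under this definition), for which the oracle has the closed form $w^*(c) = -c/\|c\|_2$; the inequality can then be verified numerically on random cost vectors. This is only an illustrative check, not part of the proof.

```python
rng = np.random.default_rng(0)

def w_star_ball(c):
    """Linear optimization oracle over the unit Euclidean ball: w*(c) = -c / ||c||_2."""
    return -c / np.linalg.norm(c)

for _ in range(10_000):
    c1, c2 = rng.normal(size=5), rng.normal(size=5)
    lhs = np.linalg.norm(w_star_ball(c1) - w_star_ball(c2))
    rhs = np.linalg.norm(c1 - c2) / min(np.linalg.norm(c1), np.linalg.norm(c2))  # mu = 1 for the unit ball
    assert lhs <= rhs + 1e-9
```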

SLIDE 18

Correcting the SPO Loss Near Zero

Recall the definition of the SPO loss, $\ell_{SPO}(\hat{c}, c) := c^T(w^*(\hat{c}) - w^*(c))$, and that $\ell_{SPO} \in [0, \omega]$.

We are motivated to "correct" the poor behavior of this loss function near zero by considering the "$\gamma$-margin SPO loss" defined by:

$$\ell^\gamma_{SPO}(\hat{c}, c) := \begin{cases} \ell_{SPO}(\hat{c}, c) & \text{if } \|\hat{c}\|_2 > \gamma, \\[4pt] \dfrac{\|\hat{c}\|_2}{\gamma}\,\ell_{SPO}(\hat{c}, c) + \left(1 - \dfrac{\|\hat{c}\|_2}{\gamma}\right)\omega & \text{if } \|\hat{c}\|_2 \le \gamma. \end{cases}$$

The analogue in binary classification is the ramp loss.
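A direct transcription of this definition into the running Python sketch (again assuming the spo_loss helper defined earlier):

```python
def margin_spo_loss(c_hat, c_true, gamma, omega):
    """gamma-margin SPO loss: equals the SPO loss when ||c_hat||_2 > gamma; otherwise it is a convex
    combination of the SPO loss and the worst-case value omega, weighted by ||c_hat||_2 / gamma."""
    norm = np.linalg.norm(c_hat)
    base = spo_loss(c_hat, c_true)
    if norm > gamma:
        return base
    return (norm / gamma) * base + (1.0 - norm / gamma) * omega
```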

SLIDE 19

Correcting the SPO Loss Near Zero

The $\gamma$-margin SPO loss is defined by:

$$\ell^\gamma_{SPO}(\hat{c}, c) := \begin{cases} \ell_{SPO}(\hat{c}, c) & \text{if } \|\hat{c}\|_2 > \gamma, \\[4pt] \dfrac{\|\hat{c}\|_2}{\gamma}\,\ell_{SPO}(\hat{c}, c) + \left(1 - \dfrac{\|\hat{c}\|_2}{\gamma}\right)\omega & \text{if } \|\hat{c}\|_2 \le \gamma. \end{cases}$$

Based on the "Lipschitz-like" property of the optimization oracle, we can prove that $\ell^\gamma_{SPO}(\cdot, c)$ is a Lipschitz function:

Theorem. For any fixed $c \in \mathbb{R}^d$ and $\gamma > 0$, it holds that:

$$\left|\ell^\gamma_{SPO}(\hat{c}_1, c) - \ell^\gamma_{SPO}(\hat{c}_2, c)\right| \le \frac{5\rho_c}{\gamma\mu}\,\|\hat{c}_1 - \hat{c}_2\|_2 \quad \text{for all } \hat{c}_1, \hat{c}_2 \in \mathbb{R}^d,$$

where $\rho_c$ is such that $\|c\|_2 \le \rho_c$ with probability 1.

SLIDE 20

Improved Generalization Bound

Following [Bertsimas and Kallus 2014] and [Maurer 2016], we define the multivariate Rademacher complexity of $H$ as:

$$\hat{\mathfrak{R}}^n(H) := \mathbb{E}_\sigma\!\left[\sup_{f \in H} \; \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{d} \sigma_{ij}\, f_j(x_i)\right],$$

where the $\sigma_{ij}$ are i.i.d. Rademacher random variables.

Theorem. Suppose that $S$ is $\mu$-strongly convex and let $\gamma > 0$ be fixed. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. sample drawn from the distribution $D$, the following holds for all $f \in H$:

$$R_{SPO}(f) \le \hat{R}^\gamma_{SPO}(f) + \frac{10\sqrt{2}\,\rho_c\,\hat{\mathfrak{R}}^n(H)}{\gamma\mu} + 3\omega\sqrt{\frac{\log(2/\delta)}{2n}}.$$

Notice that $\hat{R}^\gamma_{SPO}(f)$ is the empirical $\gamma$-margin SPO loss.
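Analogously to the earlier SPO Rademacher sketch, the multivariate Rademacher complexity can be approximated by Monte Carlo for a small finite set of linear predictors; this is an illustrative sketch under the same assumptions as before.

```python
def multivariate_rademacher(hypotheses, xs, n_draws=1000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_{f in H} (1/n) sum_i sum_j sigma_ij * f_j(x_i) ]
    for a finite list of matrices B with f(x) = B x."""
    rng = np.random.default_rng(seed)
    preds = np.array([[B @ x for x in xs] for B in hypotheses])   # shape |H| x n x d
    n, d = preds.shape[1], preds.shape[2]
    draws = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=(n, d))              # i.i.d. sigma_ij
        draws.append(np.max(np.einsum('hij,ij->h', preds, sigma)) / n)
    return float(np.mean(draws))
```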


SLIDE 21

Improved Generalization Bound, cont.

Margin-based Generalization Bound:

$$R_{SPO}(f) \le \hat{R}^\gamma_{SPO}(f) + \frac{10\sqrt{2}\,\rho_c\,\hat{\mathfrak{R}}^n(H)}{\gamma\mu} + 3\omega\sqrt{\frac{\log(2/\delta)}{2n}}$$

Some comments:
- The proof of the above bound is based on a vector contraction inequality for Lipschitz functions [Maurer 2016].
- Notice that there is no direct dependence on the dimensions; instead the bound involves the more natural/analytic quantity $\hat{\mathfrak{R}}^n(H)$.
- In many situations, we can further bound $\hat{\mathfrak{R}}^n(H)$ using well-established techniques (see, e.g., [Kakade, Sridharan, Tewari 2009]).
- It is straightforward to also extend/modify this result for situations where the strong convexity is embedded in the cost function.

SLIDE 22

“Margin-based” Generalization Bound?

Margin-based Generalization Bound:

$$R_{SPO}(f) \le \hat{R}^\gamma_{SPO}(f) + \frac{10\sqrt{2}\,\rho_c\,\hat{\mathfrak{R}}^n(H)}{\gamma\mu} + 3\omega\sqrt{\frac{\log(2/\delta)}{2n}}$$

Some comments:
- The parameter $\gamma$ is analogous to the margin in binary classification, whereby we require a classifier to not only be correct but to be correct by a certain margin.
- The above bound is not meaningful for every distribution; instead, it is effective when the distribution $D$ has a "favorable margin".
- In fact, the above is an exact extension of a well-known result for binary classification [Koltchinskii, Panchenko, et al. 2002].

SLIDE 23

Extension to a Uniform Result Over γ

Margin-based Generalization Bound:

$$R_{SPO}(f) \le \hat{R}^\gamma_{SPO}(f) + \frac{10\sqrt{2}\,\rho_c\,\hat{\mathfrak{R}}^n(H)}{\gamma\mu} + 3\omega\sqrt{\frac{\log(2/\delta)}{2n}}$$

Notice that the above bound involves a tradeoff with respect to $\gamma$:
- A smaller value of $\gamma$ yields $\hat{R}^\gamma_{SPO}(f) \approx \hat{R}_{SPO}(f)$.
- A larger value of $\gamma$ yields a better Lipschitz constant, hence the $O(1/\gamma)$ term is smaller.

We can actually extend this bound to a result that holds uniformly over $\gamma \in (0, \bar{\gamma}]$ with only an additional logarithmic factor:
- Given a dataset and a predictor $f$ computed on that dataset, one can do a line search over $\gamma$ to obtain the best bound (a sketch follows below).
- A uniform result over $\gamma \in (0, \bar{\gamma}]$ makes this line search procedure statistically valid.
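A possible line-search sketch over $\gamma$, reusing the margin_spo_loss helper assumed earlier; it plugs in the fixed-$\gamma$ constants from the theorem above and omits the extra logarithmic term that the uniform-over-$\gamma$ result would add, so it is only illustrative.

```python
def best_margin_bound(gammas, xs, cs, B, omega, rho_c, mu, rad_H, delta=0.05):
    """Evaluate the margin-based bound over a grid of gamma values and return the smallest value.
    rad_H is a (given or estimated) value of the multivariate Rademacher complexity of H."""
    n = len(xs)
    conf = 3 * omega * np.sqrt(np.log(2 / delta) / (2 * n))
    best = np.inf
    for gamma in gammas:
        emp = np.mean([margin_spo_loss(B @ x, c, gamma, omega) for x, c in zip(xs, cs)])
        best = min(best, emp + 10 * np.sqrt(2) * rho_c * rad_H / (gamma * mu) + conf)
    return best
```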

SLIDE 24

Conclusions and Future Work

Conclusions:
- Reviewed the predict-then-optimize framework, in which cost vectors are predicted for the purpose of solving a downstream optimization task.
- Provided combinatorial generalization bounds in the polyhedral and general convex settings.
- Provided improved margin-based generalization bounds under strong convexity.

Many exciting future directions:
- Extending the margin theory to polyhedral and general convex sets
- Developing improved bounds in other situations, e.g., perhaps based on local Rademacher complexity
- Minimax lower bounds
- Furthering the theory of surrogate losses, nonlinear cost functions, etc.