
Stochastic Programming. A. Shapiro, School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332-0205, USA. Modeling and Basic Properties. Consider the optimization problem

$$\min_{x \in X} \; F(x, \omega). \tag{1}$$


  1. The expectation in (16) is conditional on the realization $d_{[T-1]}$ of the demand process prior to the considered time $T$. The optimal value (and the set of optimal solutions) of problem (16) depends on $y_T$ and $d_{[T-1]}$, and is denoted $Q_T(y_T, d_{[T-1]})$. At stage $t = T-1$ we solve the problem

$$\min_{x_{T-1} \ge y_{T-1}} \; c_{T-1}(x_{T-1} - y_{T-1}) + \mathbb{E}\Big[ b_{T-1}[D_{T-1} - x_{T-1}]_+ + h_{T-1}[x_{T-1} - D_{T-1}]_+ + Q_T\big(x_{T-1} - D_{T-1},\, D_{[T-1]}\big) \,\Big|\, D_{[T-2]} = d_{[T-2]} \Big].$$

Its optimal value is denoted $Q_{T-1}(y_{T-1}, d_{[T-2]})$.

  2. Proceeding in this way backwards in time we write the following dynamic programming equations:

$$Q_t(y_t, d_{[t-1]}) = \min_{x_t \ge y_t} \; c_t(x_t - y_t) + \mathbb{E}\Big[ b_t[D_t - x_t]_+ + h_t[x_t - D_t]_+ + Q_{t+1}\big(x_t - D_t,\, D_{[t]}\big) \,\Big|\, D_{[t-1]} = d_{[t-1]} \Big], \quad t = T-1, \ldots, 2.$$

Finally, at the first stage we need to solve the problem

$$\min_{x_1 \ge y_1} \; c_1(x_1 - y_1) + \mathbb{E}\Big[ b_1[D_1 - x_1]_+ + h_1[x_1 - D_1]_+ + Q_2(x_1 - D_1, D_1) \Big].$$

Let $\bar{x}_t$, $t = T-1, \ldots, 1$, be an optimal solution of the corresponding dynamic programming equation. We see that $\bar{x}_t$ is a function of $y_t$ and $d_{[t-1]}$, for $t = 2, \ldots, T$, while the first stage (optimal) decision $\bar{x}_1$ is independent of the data.

  3. Under the assumption of stagewise independence, $\bar{x}_t = \bar{x}_t(y_t)$ becomes a function of $y_t$ alone. Note that $y_t$, in itself, is a function of $d_{[t-1]} = (d_1, \ldots, d_{t-1})$ and decisions $(x_1, \ldots, x_{t-1})$. Therefore we may think about a sequence of possible decisions $x_t = x_t(d_{[t-1]})$, $t = 1, \ldots, T$, as functions of realizations of the demand process available at the time of the decision (with the convention that $x_1$ is independent of the data). Such a sequence of decisions $x_t(d_{[t-1]})$ is called a policy. That is, a policy is a rule which specifies our decisions, based on information available at the current stage, for any possible realization of the demand process. By definition, a policy $x_t = x_t(d_{[t-1]})$ satisfies the nonanticipativity constraint. A policy is said to be feasible if it satisfies the other constraints with probability one (w.p.1). In the present case a policy is feasible if $x_t \ge y_t$, $t = 1, \ldots, T$, for almost every realization of the demand process.

  4. We can formulate optimization problem (15) as the problem of minimization of the expectation in (15) with respect to all feasible policies. An optimal solution of such a problem will give us an optimal policy. We have that a policy $\bar{x}_t$ is optimal if it is given by optimal solutions of the respective dynamic programming equations. In the present case, under the assumption of stagewise independence, an optimal policy $\bar{x}_t = \bar{x}_t(y_t)$ is a function of $y_t$ alone. Moreover, in that case it is possible to give the following characterization of the optimal policy. Let $x^*_t$ be an (unconstrained) minimizer of

$$c_t x_t + \mathbb{E}\Big[ b_t[D_t - x_t]_+ + h_t[x_t - D_t]_+ + Q_{t+1}(x_t - D_t) \Big], \quad t = T, \ldots, 1. \tag{17}$$

By using convexity of the value functions $Q_t(\cdot)$ it is not difficult to show that $\bar{x}_t = \max\{y_t, x^*_t\}$ is an optimal policy. Such a policy is called the basestock policy.
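To make the basestock policy concrete, here is a minimal Monte Carlo sketch. It assumes stagewise independent Gamma distributed demand and made-up cost parameters, and it reuses the terminal-stage basestock level, the critical-fractile quantile obtained from (17) with $Q_{T+1} \equiv 0$, at every stage; that myopic choice is an illustration only, not the exact multistage optimum.

```python
import numpy as np

rng = np.random.default_rng(0)
c, b, h = 1.0, 4.0, 1.0                    # ordering, backorder, holding costs (made up)
demand = lambda size: rng.gamma(shape=2.0, scale=5.0, size=size)

# For the terminal stage (Q_{T+1} = 0) the minimizer of (17) is the
# critical-fractile quantile: F_D(x*) = (b - c) / (b + h).
x_star = np.quantile(demand(100_000), (b - c) / (b + h))

# Simulate the policy x_bar_t = max(y_t, x*) over T stages, reusing x* each stage.
T, y, total_cost = 5, 0.0, 0.0
for t in range(T):
    x = max(y, x_star)                     # order up to the basestock level
    d = demand(1)[0]
    total_cost += c * (x - y) + b * max(d - x, 0) + h * max(x - d, 0)
    y = x - d                              # inventory carried into the next stage
print(f"basestock level {x_star:.2f}, simulated cost {total_cost:.2f}")
```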

  5. Multistage portfolio selection. Suppose that we can rebalance our portfolio at several, say $T$, periods of time. That is, at the beginning we choose values $x_{i0}$ of our assets subject to the budget constraint

$$\textstyle\sum_{i=1}^n x_{i0} = W_0. \tag{18}$$

At period $t = 1, \ldots, T$, our wealth is

$$W_t = \textstyle\sum_{i=1}^n \xi_{it} x_{i,t-1}, \tag{19}$$

where $\xi_{it} = (1 + R_{it})$ and $R_{it}$ is the return of the $i$-th asset at period $t$. Our objective is to maximize the expected utility

$$\max \; \mathbb{E}[U(W_T)] \tag{20}$$

at the end of the considered period, subject to the balance constraints $\sum_{i=1}^n x_{it} = W_t$ and $x_t \ge 0$, $t = 0, \ldots, T-1$.

  6. We use notation $x_t = (x_{1t}, \ldots, x_{nt})$ and $\xi_t = (\xi_{1t}, \ldots, \xi_{nt})$, and $\xi_{[t]} = (\xi_1, \ldots, \xi_t)$ for the history of the process $\xi_t$ up to time $t$. The values of the decision vector $x_t$, chosen at stage $t$, may depend on the information $\xi_{[t]}$ available up to time $t$, but not on future observations. Dynamic programming equations: the cost-to-go function $Q_t(W_t, \xi_{[t]})$ is given by the optimal value of

$$\max_{x_t \ge 0,\, W_{t+1}} \; \mathbb{E}\big[ Q_{t+1}(W_{t+1}, \xi_{[t+1]}) \,\big|\, \xi_{[t]} \big] \quad \text{s.t.} \;\; W_{t+1} = \textstyle\sum_{i=1}^n \xi_{i,t+1} x_{i,t}, \;\; \textstyle\sum_{i=1}^n x_{i,t} = W_t. \tag{21}$$

If the process $\xi_t$ is stagewise independent, i.e., $\xi_t$ is (stochastically) independent of $\xi_1, \ldots, \xi_{t-1}$ for $t = 2, \ldots, T$, then the cost-to-go (value) function $Q_t(W_t)$, $t = 1, \ldots, T-1$, does not depend on $\xi_{[t]}$.

  7. Consider the logarithmic utility function $U(W) := \ln W$. At stage $t = T-1$ we solve the problem

$$\max_{x_{T-1} \ge 0} \; \mathbb{E}\Big[ \ln\Big( \textstyle\sum_{i=1}^n \xi_{i,T}\, x_{i,T-1} \Big) \,\Big|\, \xi_{[T-1]} \Big] \quad \text{s.t.} \;\; \textstyle\sum_{i=1}^n x_{i,T-1} = W_{T-1}. \tag{22}$$

Its optimal value is

$$Q_{T-1}\big(W_{T-1}, \xi_{[T-1]}\big) = \nu_{T-1}\big(\xi_{[T-1]}\big) + \ln W_{T-1},$$

where $\nu_{T-1}(\xi_{[T-1]})$ denotes the optimal value of (22) for $W_{T-1} = 1$. At stage $t = T-2$ we solve the problem

$$\max_{x_{T-2} \ge 0} \; \mathbb{E}\Big[ \nu_{T-1}\big(\xi_{[T-1]}\big) + \ln\Big( \textstyle\sum_{i=1}^n \xi_{i,T-1}\, x_{i,T-2} \Big) \,\Big|\, \xi_{[T-2]} \Big] \quad \text{s.t.} \;\; \textstyle\sum_{i=1}^n x_{i,T-2} = W_{T-2}. \tag{23}$$

  8. Of course, we have that

$$\mathbb{E}\Big[ \nu_{T-1}\big(\xi_{[T-1]}\big) + \ln\Big( \textstyle\sum_{i=1}^n \xi_{i,T-1} x_{i,T-2} \Big) \,\Big|\, \xi_{[T-2]} \Big] = \mathbb{E}\Big[ \nu_{T-1}\big(\xi_{[T-1]}\big) \,\Big|\, \xi_{[T-2]} \Big] + \mathbb{E}\Big[ \ln\Big( \textstyle\sum_{i=1}^n \xi_{i,T-1} x_{i,T-2} \Big) \,\Big|\, \xi_{[T-2]} \Big],$$

and hence the optimal value of (23) can be written as

$$Q_{T-2}\big(W_{T-2}, \xi_{[T-2]}\big) = \mathbb{E}\Big[ \nu_{T-1}\big(\xi_{[T-1]}\big) \,\Big|\, \xi_{[T-2]} \Big] + \nu_{T-2}\big(\xi_{[T-2]}\big) + \ln W_{T-2},$$

where $\nu_{T-2}(\xi_{[T-2]})$ is the optimal value of the problem

$$\max_{x_{T-2} \ge 0} \; \mathbb{E}\Big[ \ln\Big( \textstyle\sum_{i=1}^n \xi_{i,T-1} x_{i,T-2} \Big) \,\Big|\, \xi_{[T-2]} \Big] \quad \text{s.t.} \;\; \textstyle\sum_{i=1}^n x_{i,T-2} = 1.$$

  9. An identical argument applies at earlier stages. Therefore it suffices to solve, at each stage $t = T-1, \ldots, 1, 0$, the corresponding optimization problem

$$\max_{x_t \ge 0} \; \mathbb{E}\Big[ \ln\Big( \textstyle\sum_{i=1}^n \xi_{i,t+1} x_{i,t} \Big) \,\Big|\, \xi_{[t]} \Big] \quad \text{s.t.} \;\; \textstyle\sum_{i=1}^n x_{i,t} = W_t, \tag{24}$$

in a completely myopic fashion. By definition, we set $\xi_0$ to be constant, so that for the first stage problem, at $t = 0$, the corresponding expectation is unconditional. An optimal solution $\bar{x}_t = \bar{x}_t(W_t, \xi_{[t]})$ of problem (24) gives an optimal policy.

  10. If the random process $\xi_t$ is stagewise independent, then the conditional expectations in (24) are the same as the corresponding unconditional expectations, and hence the optimal values $\nu_t(\xi_{[t]}) = \nu_t$ do not depend on $\xi_{[t]}$ and are given by the optimal value of the problem

$$\max_{x_t \ge 0} \; \mathbb{E}\Big[ \ln\Big( \textstyle\sum_{i=1}^n \xi_{i,t+1} x_{i,t} \Big) \Big] \quad \text{s.t.} \;\; \textstyle\sum_{i=1}^n x_{i,t} = 1. \tag{25}$$

Also in the stagewise independent case the optimal policy can be described as follows. Let $x^*_t = (x^*_{1t}, \ldots, x^*_{nt})$ be the optimal solution of (25), $t = 0, \ldots, T-1$. Such an optimal solution is unique by strict concavity of the logarithm function. Then

$$\bar{x}_t(W_t) := W_t\, x^*_t, \quad t = 0, \ldots, T-1,$$

defines the optimal policy.
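A small sample-based sketch of problem (25): maximize the sample average of $\ln(\xi^T x)$ over the simplex. The return distribution and sample size are made up; `scipy.optimize.minimize` (SLSQP) handles the simplex constraint.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, N = 4, 20_000
xi = 1.0 + rng.normal(0.05, 0.15, size=(N, n))   # hypothetical gross returns
xi = np.clip(xi, 1e-3, None)                     # keep the logarithm well defined

neg_obj = lambda x: -np.mean(np.log(xi @ x))     # sample average of -ln(xi^T x)
res = minimize(
    neg_obj,
    x0=np.full(n, 1.0 / n),
    bounds=[(0.0, 1.0)] * n,
    constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}],
)
x_star = res.x          # sample estimate of the log-optimal weights x_t^*
nu_hat = -res.fun       # sample estimate of the optimal value nu_t
print(x_star.round(3), round(nu_hat, 4))
```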

  11. Two and multistage stochastic programming. The concept of two-stage (linear) stochastic programming problem with recourse:

$$\min_{x \in X} \; c^T x + \mathbb{E}[Q(x, \xi)], \tag{26}$$

where $X = \{x : Ax = b, \; x \ge 0\}$ and $Q(x, \xi)$ is the optimal value of the second stage problem

$$\min_y \; q^T y \quad \text{s.t.} \; Tx + Wy = h, \; y \ge 0, \tag{27}$$

with $\xi = (q, T, W, h)$. In general, the feasible set $X$ can be finite, i.e., an integer first stage problem. Both stages can be integer (mixed integer) problems.

  12. Suppose that the probability distribution $P$ of $\xi$ has a finite support, i.e., $\xi$ can take values $\xi^1, \ldots, \xi^K$ (called scenarios) with respective probabilities $p_k > 0$, $k = 1, \ldots, K$. In that case

$$\mathbb{E}_P[Q(x, \xi)] = \textstyle\sum_{k=1}^K p_k\, Q(x, \xi^k),$$

where

$$Q(x, \xi^k) = \inf\big\{ q_k^T y_k : T_k x + W_k y_k = h_k, \; y_k \ge 0 \big\}.$$

It follows that we can write problem (26)-(27) as one large linear program:

$$\begin{aligned}
\min_{x, y_1, \ldots, y_K} \;& c^T x + \textstyle\sum_{k=1}^K p_k\, q_k^T y_k \\
\text{subject to} \;& Ax = b, \\
& T_k x + W_k y_k = h_k, \quad k = 1, \ldots, K, \\
& x \ge 0, \; y_k \ge 0, \quad k = 1, \ldots, K.
\end{aligned} \tag{28}$$
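A sketch of assembling the deterministic equivalent (28) for a toy instance and solving it with `scipy.optimize.linprog`. All numerical data are made up; complete recourse $W = [I, -I]$ with unit penalties is chosen so that every scenario subproblem is feasible and bounded, and the first-stage set is taken to be the simplex.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n1, m2, K = 3, 2, 5                       # first-stage dim, 2nd-stage rows, scenarios
n2 = 2 * m2                               # y_k = (u_k, v_k) with W = [I, -I]
p = np.full(K, 1.0 / K)

c = rng.random(n1)
W = np.hstack([np.eye(m2), -np.eye(m2)])  # complete recourse: Wy = h - Tx always solvable
scenarios = [(np.ones(n2),                # q_k: unit penalties
              rng.random((m2, n1)),       # T_k
              rng.random(m2))             # h_k
             for _ in range(K)]

# Variables ordered (x, y_1, ..., y_K); objective c^T x + sum_k p_k q_k^T y_k.
obj = np.concatenate([c] + [p[k] * q for k, (q, _, _) in enumerate(scenarios)])

# First stage X = {x >= 0 : sum(x) = 1}, then T_k x + W y_k = h_k per scenario.
rows = [np.hstack([np.ones((1, n1)), np.zeros((1, K * n2))])]
rhs = [np.ones(1)]
for k, (_, T, h) in enumerate(scenarios):
    blk = np.zeros((m2, n1 + K * n2))
    blk[:, :n1] = T
    blk[:, n1 + k * n2 : n1 + (k + 1) * n2] = W
    rows.append(blk)
    rhs.append(h)

res = linprog(obj, A_eq=np.vstack(rows), b_eq=np.concatenate(rhs), bounds=(0, None))
print("optimal value:", res.fun, " first-stage x:", res.x[:n1].round(3))
```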

  13. Even a crude discretization of the distribution of the data vector $\xi$ leads to exponential growth of the number of scenarios with increase of its dimension $d$. Can stochastic programming problems be solved numerically? What does it mean to solve a stochastic program? How do we know the probability distribution of the random data vector? Why do we optimize the expected value of the objective (cost) function?

  14. Basic properties. For any realization $\xi$, the function $Q(\cdot, \xi)$ is convex piecewise linear. By the duality theory of linear programming we can write it in the following equivalent form:

$$Q(x, \xi) = \sup\big\{ \pi^T(h - Tx) : W^T \pi \le q \big\}. \tag{29}$$

It follows that the expectation function $\mathcal{Q}(x) = \mathbb{E}[Q(x, \xi)]$ is convex, and if $P$ has a finite support (i.e., the number of scenarios is finite), then $\mathcal{Q}(x)$ is piecewise linear. Note that it can happen that, for some $(x, \xi)$, the feasible set of problem (27) is empty. In that case, by definition, $Q(x, \xi) = +\infty$. It also can happen that problem (27) is unbounded from below, and hence $Q(x, \xi) = -\infty$. That is, we can view $Q(x, \xi)$ as an extended real valued function.

  15. Since $Q(\cdot, \xi)$ is a piecewise linear function, it can be differentiable everywhere only in the trivial case when it is linear. Nevertheless, if $Q(\cdot, \xi)$ is finite at a point $\bar{x}$, then it has a nonempty set of subgradients. The set of all subgradients is called the subdifferential and denoted $\partial Q(\bar{x}, \xi)$. Recall that $z \in \partial Q(\bar{x}, \xi)$ if

$$Q(x, \xi) \ge Q(\bar{x}, \xi) + z^T(x - \bar{x}), \quad \text{for all } x.$$

The function $Q(\cdot, \xi)$ is differentiable at a point $x$ iff $\partial Q(x, \xi) = \{z\}$ is a singleton, in which case $\nabla_x Q(x, \xi) = z$. The set $\partial Q(x, \xi)$ is convex and, since $Q(\cdot, \xi)$ is piecewise linear, is polyhedral. By duality theory we have that

$$\partial Q(x, \xi) = -T^T D(x, \xi), \tag{30}$$

where $D(x, \xi) := \arg\max\big\{ \pi^T(h - Tx) : W^T \pi \le q \big\}$.
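A sketch of computing $Q(x, \xi)$ and one subgradient via (29)-(30): solve the second-stage LP with SciPy's HiGHS backend and read the equality-constraint duals $\pi$ from `res.eqlin.marginals` (available in SciPy >= 1.7, where the marginals are the sensitivities of the optimal value to `b_eq`, i.e., an optimal dual $\pi$ here); then $-T^T \pi \in \partial Q(x, \xi)$. The data, again with complete recourse, are illustrative only.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
m, n1 = 2, 3
n2 = 2 * m
q = np.ones(n2)
T = rng.random((m, n1))
h = rng.random(m)
W = np.hstack([np.eye(m), -np.eye(m)])    # complete recourse: always feasible

x = np.array([0.2, 0.3, 0.5])
res = linprog(q, A_eq=W, b_eq=h - T @ x, bounds=(0, None), method="highs")
pi = res.eqlin.marginals                  # an optimal dual solution pi of (29)
subgrad = -T.T @ pi                       # an element of dQ(x, xi), per (30)
print("Q(x, xi) =", res.fun, " subgradient:", subgrad.round(3))
```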

  16. If $P$ has a finite support, then the subdifferential of the expectation function $\mathcal{Q}(\cdot)$ is given* by

$$\partial \mathcal{Q}(x) = \textstyle\sum_{k=1}^K p_k\, \partial Q(x, \xi^k). \tag{31}$$

Therefore, $\mathcal{Q}(\cdot)$ is differentiable at $x$ iff all functions $Q(\cdot, \xi^k)$, $k = 1, \ldots, K$, are differentiable at $x$. If the probability distribution $P$ is continuous, then the situation is more subtle. It is possible to show that if $\mathcal{Q}(\cdot)$ is finite valued in a neighborhood of $x$, then

$$\partial \mathcal{Q}(x) = \int_\Omega \partial Q(x, \omega) \, dP(\omega). \tag{32}$$

For a given $x$, the above integral is understood as the set of all vectors of the form $\int_\Omega G(\omega) \, dP(\omega)$ such that $G(\omega) \in \partial Q(x, \omega)$ is an integrable selection of $\partial Q(x, \omega)$.

*The summation of the sets is understood here pointwise, i.e., the sum of two sets $A$ and $B$ is the set $\{a + b : a \in A, \, b \in B\}$.

  17. It follows from (32) that $\partial \mathcal{Q}(x)$ is a singleton, and hence $\mathcal{Q}(\cdot)$ is differentiable at $x$, iff $\partial Q(x, \omega)$ is a singleton with probability one, i.e., for $P$-almost every $\omega \in \Omega$. Loosely speaking we may say that, typically, for continuous distributions the expectation function $\mathbb{E}[Q(x, \xi)]$ is differentiable, while in the case of discrete distributions it is not. We can formulate optimality conditions for the stochastic problem (26) as follows: a feasible point $\bar{x} \in X$ is an optimal solution of (26) iff

$$0 \in \partial \mathcal{Q}(\bar{x}) + N_X(\bar{x}), \tag{33}$$

where $N_X(\bar{x})$ is the normal cone to $X$ at $\bar{x} \in X$,

$$N_X(\bar{x}) = \big\{ z : z^T(x - \bar{x}) \le 0, \; \text{for all } x \in X \big\}.$$

  18. General formulation of two-stage stochastic programming problems:

$$\min_{x \in X} \; \big\{ f(x) := \mathbb{E}[F(x, \omega)] \big\}, \tag{34}$$

where $F(x, \omega)$ is the optimal value of the second stage problem

$$\min_{y \in \mathcal{X}(x, \omega)} \; g(x, y, \omega). \tag{35}$$

Here $(\Omega, \mathcal{F}, P)$ is a probability space, $X \subset \mathbb{R}^n$, $g : \mathbb{R}^n \times \mathbb{R}^m \times \Omega \to \mathbb{R}$, and $\mathcal{X} : \mathbb{R}^n \times \Omega \rightrightarrows \mathbb{R}^m$ is a multifunction. In particular, the linear two-stage problem can be formulated in the above form with $g(x, y, \omega) := c^T x + q(\omega)^T y$ and

$$\mathcal{X}(x, \omega) := \{ y : T(\omega) x + W(\omega) y = h(\omega), \; y \ge 0 \}. \tag{36}$$

  19. The second stage problem (35) can also be written in the following equivalent form:

$$\min_{y \in \mathbb{R}^m} \; \bar{g}(x, y, \omega), \tag{37}$$

where

$$\bar{g}(x, y, \omega) := \begin{cases} g(x, y, \omega), & \text{if } y \in \mathcal{X}(x, \omega), \\ +\infty, & \text{otherwise}. \end{cases}$$

By the interchangeability principle we have

$$\mathbb{E}\Big[ \underbrace{\inf_{y \in \mathbb{R}^m} \bar{g}(x, y, \omega)}_{F(x, \omega)} \Big] = \inf_{y(\cdot) \in \mathcal{Y}} \mathbb{E}\big[ \bar{g}(x, y(\omega), \omega) \big], \tag{38}$$

where $\mathcal{Y}$ is a functional space, e.g., $\mathcal{Y} := L_p(\Omega, \mathcal{F}, P; \mathbb{R}^m)$ with $p \in [1, +\infty]$.

  20. Consequently, we can write the two-stage problem (34)–(35) as one large problem:

$$\min_{x \in \mathbb{R}^n, \, y(\cdot) \in \mathcal{Y}} \; \mathbb{E}[g(x, y(\omega), \omega)] \quad \text{s.t.} \;\; x \in X, \; y(\omega) \in \mathcal{X}(x, \omega) \;\; \text{a.e. } \omega \in \Omega. \tag{39}$$

In particular, if $\Omega = \{\omega_1, \ldots, \omega_K\}$ is finite, with respective probabilities $p_k$, then problem (39) becomes

$$\min_{x, y_1, \ldots, y_K} \; \textstyle\sum_{k=1}^K p_k\, g(x, y_k, \omega_k) \quad \text{s.t.} \;\; x \in X, \; y_k \in \mathcal{X}(x, \omega_k), \; k = 1, \ldots, K. \tag{40}$$

  21. Decision rules. Recall that the two stage problem (26) can be formulated as an optimization problem over $x \in X$ and second stage decision $y(\cdot)$ considered as a function of the data. Suppose now that the recourse is fixed, i.e., only $T = T(\omega)$ and $h = h(\omega)$ of the second stage problem

$$\min_y \; q^T y \quad \text{s.t.} \; Tx + Wy = h, \; y \ge 0,$$

are random. For a given $x \in X$ the second stage problem attains its minimal value at an extreme point of the set $\{y : Wy = h - Tx, \; y \ge 0\}$, assuming that this set is nonempty and bounded. By a well known result of linear programming we have the following characterization of the extreme points.

  22. A feasible point $\bar{y}$ is an extreme point (basic solution) of the feasible set if there exists an index set $I \subset \{1, \ldots, m\}$ such that the corresponding submatrix $W_I$ is nonsingular and $\bar{y}_i = 0$ for $i \in \{1, \ldots, m\} \setminus I$. That is, $W_I \bar{y}_I = h - Tx$, where $\bar{y}_I$ is the corresponding subvector of $\bar{y}$. This can be written as $\bar{y} = R_I(h - Tx)$, where the $I$-th submatrix of $R_I$ is $W_I^{-1}$ and the other entries of $R_I$ are zeros. It follows (under appropriate nondegeneracy conditions) that an optimal policy (optimal decision rule) $\bar{y} = \bar{y}(h, T)$ is a piecewise linear function of the data.

  23. Nonanticipativity. Consider the first stage problem (34). Assume that the number of scenarios is finite, i.e., $\Omega = \{\omega_1, \ldots, \omega_K\}$ with respective (positive) probabilities $p_1, \ldots, p_K$. Let us relax the first stage problem by replacing vector $x$ with $K$ vectors $x_1, x_2, \ldots, x_K$, one for each scenario. We obtain the following relaxation of problem (34):

$$\min_{x_1, \ldots, x_K} \; \textstyle\sum_{k=1}^K p_k\, F(x_k, \omega_k) \quad \text{subject to} \; x_k \in X, \; k = 1, \ldots, K. \tag{41}$$

Problem (41) is separable in the sense that it can be split into $K$ smaller problems, one for each scenario:

$$\min_{x_k \in X} \; F(x_k, \omega_k), \quad k = 1, \ldots, K, \tag{42}$$

and the optimal value of problem (41) is equal to the weighted sum, with weights $p_k$, of the optimal values of problems (42), $k = 1, \ldots, K$.

  24. The nonanticipativity constraint: $(x_1, \ldots, x_K) \in \mathcal{L}$, where

$$\mathcal{L} := \{ (x_1, \ldots, x_K) : x_1 = \cdots = x_K \} \subset \mathbb{R}^n \times \cdots \times \mathbb{R}^n.$$

Dualization of the nonanticipativity constraints. Consider the following formulation of stochastic programming problem (34) (with a finite number of scenarios):

$$\min_{x_1, \ldots, x_K, z} \; \textstyle\sum_{k=1}^K p_k \bar{F}(x_k, \omega_k) \quad \text{s.t.} \; x_k = z, \; k = 1, \ldots, K, \tag{43}$$

where $\bar{F}(x_k, \omega_k) = F(x_k, \omega_k)$ if $x_k \in X$ and $\bar{F}(x_k, \omega_k) = +\infty$ otherwise. The nonanticipativity constraints $x_k = z$, $k = 1, \ldots, K$, are written here with an additional variable $z$. The (Lagrangian) dual of problem (43) is:

$$\max_{\lambda_1, \ldots, \lambda_K} \; \inf_{x_1, \ldots, x_K} \textstyle\sum_{k=1}^K p_k \big[ \bar{F}(x_k, \omega_k) + \lambda_k^T x_k \big], \quad \text{subject to} \;\; \textstyle\sum_{k=1}^K p_k \lambda_k = 0. \tag{44}$$

  25. Note the separable structure:

$$\inf_{x_1, \ldots, x_K} \textstyle\sum_{k=1}^K p_k \big[ \bar{F}(x_k, \omega_k) + \lambda_k^T x_k \big] = \textstyle\sum_{k=1}^K p_k \inf_{x_k} \big[ \bar{F}(x_k, \omega_k) + \lambda_k^T x_k \big].$$

If the functions $F(\cdot, \omega_k)$ are piecewise linear (e.g., in the case of linear two-stage stochastic programming), then there is no duality gap between (43) and (44), and both problems have optimal solutions provided that their optimal value is finite. Moreover, if $(\bar{x}_1, \ldots, \bar{x}_K, \bar{z})$ and $(\bar{\lambda}_1, \ldots, \bar{\lambda}_K)$ are optimal solutions of (43) and (44), respectively, then $\bar{x}_1 = \cdots = \bar{x}_K = \bar{z}$ and

$$\bar{x}_k \in \arg\min_{x_k} \big\{ \bar{F}(x_k, \omega_k) + \bar{\lambda}_k^T x_k \big\}. \tag{45}$$

  26. Multistage stochastic programming. Consider a multistage decision process of the form

$$\text{decision}(x_1) \rightsquigarrow \text{observation}(\xi_2) \rightsquigarrow \text{decision}(x_2) \rightsquigarrow \cdots \rightsquigarrow \text{observation}(\xi_T) \rightsquigarrow \text{decision}(x_T). \tag{46}$$

Here $\xi_t \in \mathbb{R}^{d_t}$, $t = 1, \ldots$, is a sequence of vectors with $\xi_{[t]} := (\xi_1, \ldots, \xi_t)$ representing the history of the data process up to time $t$. At time period $t \in \{1, \ldots, T\}$ we observe the past, $\xi_{[t]}$, but future observations $\xi_{t+1}, \ldots$ are uncertain. So our decision at time $t$ should depend only on information available at that time, i.e., $x_t = x_t(\xi_{[t]})$ should be a function of $\xi_{[t]}$ and should not depend on future observations. This is the basic requirement of nonanticipativity of the decision process. A sequence $x_1, x_2(\xi_{[2]}), \ldots$ of such decisions is called a policy or a decision rule.

  27. Risk neutral multistage stochastic programming:

$$\min_{\pi \in \Pi} \; \mathbb{E}\Big[ f_1(x_1) + f_2\big(x_2(\xi_{[2]}), \xi_2\big) + \cdots + f_T\big(x_T(\xi_{[T]}), \xi_T\big) \Big], \tag{47}$$

where $\Pi$ is the set of policies $\pi = (x_1, x_2(\xi_{[2]}), \ldots, x_T(\xi_{[T]}))$ satisfying the feasibility constraints

$$x_1 \in \mathcal{X}_1, \quad x_t(\xi_{[t]}) \in \mathcal{X}_t\big(x_{t-1}(\xi_{[t-1]}), \xi_t\big), \quad t = 2, \ldots, T,$$

for almost every (a.e.) realization of the random process; $f_t : \mathbb{R}^{n_t} \times \mathbb{R}^{d_t} \to \mathbb{R}$ are real valued functions and $\mathcal{X}_t : \mathbb{R}^{n_{t-1}} \times \mathbb{R}^{d_t} \rightrightarrows \mathbb{R}^{n_t}$, $t = 2, \ldots, T$, are multifunctions. For example, $f_t(x_t, \xi_t) := c_t^T x_t$, $\mathcal{X}_t(x_{t-1}, \xi_t) := \{x_t : B_t x_{t-1} + A_t x_t \le b_t\}$, $t = 2, \ldots, T$, and $\mathcal{X}_1 := \{x_1 : A_1 x_1 \le b_1\}$, with $\xi_t = (c_t, A_t, B_t, b_t)$, corresponds to linear multistage stochastic programming.

  28. Note that it is assumed here that the probability distribution of the random process $\xi_t$ does not depend on our decisions. Note also that

$$\mathbb{E}[Z] = \mathbb{E}_{|\xi_1}\Big[ \mathbb{E}_{|\xi_{[2]}}\big[ \cdots \mathbb{E}_{|\xi_{[T-1]}}[Z] \big] \Big],$$

where $\mathbb{E}_{|\xi_{[t]}}$ denotes conditional expectation. This decomposition property of the expectation allows us to write the multistage stochastic programming problem in the following nested form:

$$\min_{x_1 \in \mathcal{X}_1} f_1(x_1) + \mathbb{E}_{|\xi_1}\bigg[ \inf_{x_2 \in \mathcal{X}_2(x_1, \xi_2)} f_2(x_2, \xi_2) + \cdots + \mathbb{E}_{|\xi_{[T-2]}}\Big[ \inf_{x_{T-1} \in \mathcal{X}_{T-1}(x_{T-2}, \xi_{T-1})} f_{T-1}(x_{T-1}, \xi_{T-1}) + \mathbb{E}_{|\xi_{[T-1]}}\big[ \inf_{x_T \in \mathcal{X}_T(x_{T-1}, \xi_T)} f_T(x_T, \xi_T) \big] \Big] \bigg].$$

  29. This formulation assumes that: (i) the probability distribution of the data process is known (specified), (ii) the optimization is performed on average (both with respect to different realizations of the random process and with respect to time). In the linear case the nested formulation can be written as (recall that $\xi_1$ is deterministic and hence $\mathbb{E}_{|\xi_1} = \mathbb{E}$):

$$\min_{\substack{A_1 x_1 = b_1 \\ x_1 \ge 0}} c_1^T x_1 + \mathbb{E}\bigg[ \min_{\substack{B_2 x_1 + A_2 x_2 = b_2 \\ x_2 \ge 0}} c_2^T x_2 + \cdots + \mathbb{E}_{|\xi_{[T-1]}}\Big[ \min_{\substack{B_T x_{T-1} + A_T x_T = b_T \\ x_T \ge 0}} c_T^T x_T \Big] \bigg].$$

If the number of realizations (scenarios) of the process $\xi_t$ is finite, then the above problem can be written as one large linear programming problem.

  30. Numerical difficulties in solving multistage problems. From a modeling point of view it is typically natural to assume that the random data process has a continuous distribution. This raises the question of how to compute the involved expectations (multivariate integrals). A standard approach is to discretize the random process by generating a finite number of possible realizations (called scenarios). These scenarios can be represented by the corresponding scenario tree. How many scenarios are needed in order to approximate the "true" distribution of the random data process?

  31. Note that solving the deterministic equivalent for the constructed scenario tree does not by itself produce an implementable policy for the "true" problem (with continuous distributions). This is because an actual realization of the data process could, and with probability one (w.p.1) will, be different from the scenarios used in the constructed tree. In that case the policy constructed for scenarios of the tree does not tell us what decision to make. Of course, one can use only the first stage solution, which is deterministic (does not depend on future observations), and update it as new observations become available; this is a rolling horizon approach. Such a rolling horizon approach requires resolving the corresponding multistage problem at every stage as a new realization of the data becomes available.

  32. Dynamic programming equations. Consider the last stage problem

$$\min_{x_T \in \mathcal{X}_T(x_{T-1}, \xi_T)} \; f_T(x_T, \xi_T). \tag{48}$$

The optimal value of this problem, denoted $Q_T(x_{T-1}, \xi_T)$, depends on the decision vector $x_{T-1}$ and data $\xi_T$. At stage $t = 2, \ldots, T-1$, we write the problem:

$$\min_{x_t} \; f_t(x_t, \xi_t) + \mathbb{E}_{|\xi_{[t]}}\Big[ Q_{t+1}\big(x_t, \xi_{[t+1]}\big) \Big] \quad \text{s.t.} \; x_t \in \mathcal{X}_t(x_{t-1}, \xi_t). \tag{49}$$

Its optimal value depends on the decision $x_{t-1}$ at the previous stage and the realization of the data process $\xi_{[t]} = (\xi_1, \ldots, \xi_t)$, and is denoted $Q_t(x_{t-1}, \xi_{[t]})$. The idea is to calculate the (so-called cost-to-go or value) functions $Q_t(x_{t-1}, \xi_{[t]})$ recursively, going backward in time.

  33. At the first stage we finally need to solve the problem:

$$\min_{x_1 \in \mathcal{X}_1} \; f_1(x_1) + \mathbb{E}[Q_2(x_1, \xi_2)]. \tag{50}$$

The dynamic programming equations:

$$Q_t\big(x_{t-1}, \xi_{[t]}\big) = \inf_{x_t \in \mathcal{X}_t(x_{t-1}, \xi_t)} \Big\{ f_t(x_t, \xi_t) + \mathcal{Q}_{t+1}\big(x_t, \xi_{[t]}\big) \Big\}, \tag{51}$$

where

$$\mathcal{Q}_{t+1}\big(x_t, \xi_{[t]}\big) := \mathbb{E}_{|\xi_{[t]}}\Big[ Q_{t+1}\big(x_t, \xi_{[t+1]}\big) \Big].$$

If the random process is Markovian (i.e., the conditional distribution of $\xi_{t+1}$ given $\xi_{[t]} = (\xi_1, \ldots, \xi_t)$ is the same as the conditional distribution of $\xi_{t+1}$ given $\xi_t$), then $Q_t(x_{t-1}, \xi_t)$ is a function of $x_{t-1}$ and $\xi_t$, and if the process is stagewise independent (i.e., $\xi_{t+1}$ is independent of $\xi_{[t]}$), then $\mathbb{E}\big[ Q_{t+1}(x_t, \xi_{t+1}) \,\big|\, \xi_t \big] = \mathcal{Q}_{t+1}(x_t)$ does not depend on $\xi_t$.

  34. A sequence of (measurable) mappings $x_t(\xi_{[t]})$, $t = 1, \ldots, T$, is called a policy (recall that $\xi_1$ is deterministic). A policy is feasible if it satisfies the feasibility constraints, i.e.,

$$x_t(\xi_{[t]}) \in \mathcal{X}_t\big(x_{t-1}(\xi_{[t-1]}), \xi_t\big), \quad t = 2, \ldots, T, \;\; \text{w.p.1}. \tag{52}$$

A policy $\bar{x}_t(\xi_{[t]})$ is optimal if and only if it satisfies the dynamic programming equations, i.e.,

$$\bar{x}_t(\xi_{[t]}) \in \arg\min_{x_t \in \mathcal{X}_t(\bar{x}_{t-1}(\xi_{[t-1]}), \xi_t)} \Big\{ f_t(x_t, \xi_t) + \mathcal{Q}_{t+1}\big(x_t, \xi_{[t]}\big) \Big\}, \quad \text{w.p.1}.$$

In the stagewise independent case $\mathcal{Q}_{t+1}(x_t)$ does not depend on $\xi_{[t]}$, and an optimal solution $\bar{x}_t$ (satisfying the dynamic programming equations) is a function of $\bar{x}_{t-1}$ and $\xi_t$.

  35. Nonanticipativity constraints. Consider the multistage problem (47). Suppose that there is a finite number of scenarios $\xi_1^k, \ldots, \xi_T^k$, $k = 1, \ldots, K$, with respective probabilities $p_k$. To each scenario $k \in \{1, \ldots, K\}$ corresponds a sequence $x_1^k, \ldots, x_T^k$ of decision vectors. The nonanticipativity principle requires that

$$x_t^k = x_t^\ell \;\; \text{for all } k, \ell \text{ for which } \xi_{[t]}^k = \xi_{[t]}^\ell, \quad t = 1, \ldots, T. \tag{53}$$

Let $\mathfrak{X}$ be the space of all sequences $(x_1^k, \ldots, x_T^k)$, $k = 1, \ldots, K$ (this is a linear space of dimension $(n_1 + \cdots + n_T)K$), and $\mathcal{L}$ be the linear subspace of $\mathfrak{X}$ defined by the nonanticipativity constraints (53). We can write the multistage problem in the form

$$\min_{x \in \mathfrak{X}} \; \Big\{ f(x) := \textstyle\sum_{k=1}^K \sum_{t=1}^T p_k\, f_t(x_t^k, \xi_t^k) \Big\} \quad \text{s.t.} \; x \in \mathcal{L}. \tag{54}$$

  36. In particular, the relaxed version of the linear multistage program:

$$\begin{aligned}
\min \;& \textstyle\sum_{k=1}^K p_k \Big[ c_1^T x_1^k + (c_2^k)^T x_2^k + (c_3^k)^T x_3^k + \cdots + (c_T^k)^T x_T^k \Big] \\
\text{s.t.} \;& A_1 x_1^k = b_1, \\
& B_2^k x_1^k + A_2^k x_2^k = b_2^k, \\
& B_3^k x_2^k + A_3^k x_3^k = b_3^k, \\
& \qquad \vdots \\
& B_T^k x_{T-1}^k + A_T^k x_T^k = b_T^k, \\
& x_1^k \ge 0, \; x_2^k \ge 0, \; x_3^k \ge 0, \; \ldots, \; x_T^k \ge 0, \quad k = 1, \ldots, K.
\end{aligned}$$

In this problem all parts of the decision vector are allowed to depend on all parts of the random data, while each part $x_t$ should be allowed to depend only on the data known up to stage $t$. Together with the nonanticipativity constraints (53) the considered problem becomes equivalent to the original formulation.

  37. Monte Carlo sampling methods. Consider the (two-stage) stochastic programming problem:

$$\min_{x \in X} \; \big\{ f(x) := \mathbb{E}[F(x, \xi)] \big\}. \tag{55}$$

Even a crude discretization of the distribution of the random data vector $\xi \in \mathbb{R}^d$ typically results in exponential growth of the number of scenarios with increase of the number $d$ of random variables. For example, if the components of the random vector $\xi$ are independently distributed and the distribution of each component is discretized by $r$ points, then the total number of scenarios is $r^d$. That is, although the input data grows linearly with increase of the dimension $d$, the number of scenarios grows exponentially. The standard approach to dealing with this issue is to generate a manageable number of scenarios in some "representative" way.

  38. For example, we can generate a random sample $\xi^1, \ldots, \xi^N$ of $N$ realizations of the random vector $\xi$ by using Monte Carlo sampling techniques. Then the expected value function $f(x) := \mathbb{E}[F(x, \xi)]$ can be approximated by the sample average function

$$\hat{f}_N(x) := N^{-1} \textstyle\sum_{j=1}^N F(x, \xi^j).$$

Consequently the true (expected value) problem is approximated by the so-called sample average approximation (SAA) problem:

$$\min_{x \in X} \; \hat{f}_N(x). \tag{56}$$

Note that once the sample is generated, the above SAA problem can be viewed as a (two-stage) problem with the corresponding set of scenarios $\{\xi^1, \ldots, \xi^N\}$, each scenario with equal probability $1/N$.
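A minimal SAA sketch on a one-dimensional newsvendor-type objective $F(x, \xi) = cx + b[\xi - x]_+ + h[x - \xi]_+$, chosen because its true minimizer is the $(b-c)/(b+h)$-quantile of the demand distribution, so the SAA solutions can be checked against it. All parameters and the exponential demand model are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
c, b, h = 1.0, 3.0, 0.5

def F(x, xi):                     # newsvendor-type integrand F(x, xi)
    return c * x + b * np.maximum(xi - x, 0) + h * np.maximum(x - xi, 0)

xs = np.linspace(0, 30, 601)      # grid over X = [0, 30]
for N in (50, 500, 5000):
    sample = rng.exponential(scale=8.0, size=N)
    f_hat = F(xs[:, None], sample[None, :]).mean(axis=1)   # hat f_N on the grid
    print(N, xs[f_hat.argmin()].round(2))                  # SAA minimizer

true_x = 8.0 * np.log((b + h) / (h + c))   # exact (b-c)/(b+h)-quantile of Exp(8)
print("true minimizer:", round(true_x, 2))
```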

  39. A (naive) justification of the SAA method is that, for a given $x \in X$, by the Law of Large Numbers (LLN), $\hat{f}_N(x)$ converges to $f(x)$ w.p.1 as $N$ tends to infinity. It is possible to show that, under mild regularity conditions, this convergence is uniform on any compact subset of $X$ (uniform LLN). It follows that the optimal value $\hat{v}_N$ and an optimal solution $\hat{x}_N$ of the SAA problem (56) converge w.p.1 to their counterparts of the true problem. Central Limit Theorem type results: notoriously slow convergence of order $O_p(N^{-1/2})$. By the CLT, for a given $x \in X$,

$$N^{1/2}\big[ \hat{f}_N(x) - f(x) \big] \Rightarrow \mathcal{N}(0, \sigma^2(x)),$$

where $\sigma^2(x) := \mathrm{Var}[F(x, \xi)]$ and "$\Rightarrow$" denotes convergence in distribution.

  40. Delta method. Let $Y_N \in \mathbb{R}^d$ be a sequence of random vectors converging in probability to a vector $\mu \in \mathbb{R}^d$. Suppose that there exists a sequence $\tau_N \to +\infty$ such that $\tau_N(Y_N - \mu) \Rightarrow Y$. Let $G : \mathbb{R}^d \to \mathbb{R}^m$ be a vector valued function, differentiable at $\mu$. That is,

$$G(y) - G(\mu) = M(y - \mu) + r(y),$$

where $M := \nabla G(\mu)$ is the $m \times d$ Jacobian matrix of $G$ at $\mu$, and the remainder $r(y)$ is of order $o(\|y - \mu\|)$. It follows that

$$\tau_N[G(Y_N) - G(\mu)] \Rightarrow MY.$$

In particular, suppose that $N^{1/2}(Y_N - \mu)$ converges in distribution to a (multivariate) normal distribution with zero mean vector and covariance matrix $\Sigma$. Then it follows that

$$N^{1/2}[G(Y_N) - G(\mu)] \Rightarrow \mathcal{N}(0, M \Sigma M^T).$$

  41. Infinite dimensional Delta Theorem. Let $B_1$ and $B_2$ be two Banach spaces, and $G : B_1 \to B_2$ be a mapping. It is said that $G$ is directionally differentiable at a point $\mu \in B_1$ if the limit

$$G'_\mu(d) := \lim_{t \downarrow 0} \frac{G(\mu + t d) - G(\mu)}{t} \tag{57}$$

exists for all $d \in B_1$. If, in addition, the directional derivative $G'_\mu : B_1 \to B_2$ is linear and continuous, then it is said that $G$ is Gâteaux differentiable at $\mu$. It is said that $G$ is directionally differentiable at $\mu$ in the sense of Hadamard if the directional derivative $G'_\mu(d)$ exists for all $d \in B_1$ and, moreover,

$$G'_\mu(d) = \lim_{\substack{t \downarrow 0 \\ d' \to d}} \frac{G(\mu + t d') - G(\mu)}{t}. \tag{58}$$

  42. Theorem 1 (Delta Theorem). Let $B_1$ and $B_2$ be Banach spaces, equipped with their Borel $\sigma$-algebras, $Y_N$ be a sequence of random elements of $B_1$, $G : B_1 \to B_2$ be a mapping, and $\tau_N$ be a sequence of positive numbers tending to infinity as $N \to \infty$. Suppose that the space $B_1$ is separable, the mapping $G$ is Hadamard directionally differentiable at a point $\mu \in B_1$, and the sequence $X_N := \tau_N(Y_N - \mu)$ converges in distribution to a random element $Y$ of $B_1$. Then

$$\tau_N[G(Y_N) - G(\mu)] \Rightarrow G'_\mu(Y), \tag{59}$$

and

$$\tau_N[G(Y_N) - G(\mu)] = G'_\mu(X_N) + o_p(1). \tag{60}$$

  43. Let $X$ be a compact subset of $\mathbb{R}^n$ and consider the space $B = C(X)$ of continuous functions $\phi : X \to \mathbb{R}$. Assume that:

(A1) For some point $x \in X$ the expectation $\mathbb{E}[F(x, \xi)^2]$ is finite.

(A2) There exists a measurable function $C : \Xi \to \mathbb{R}_+$ such that $\mathbb{E}[C(\xi)^2]$ is finite and

$$\big| F(x, \xi) - F(x', \xi) \big| \le C(\xi)\, \|x - x'\|, \tag{61}$$

for all $x, x' \in X$ and a.e. $\xi \in \Xi$.

We can view $Y_N := \hat{f}_N$ as a random element of $C(X)$. Consider the min-function $V : B \to \mathbb{R}$ defined as $V(Y) := \inf_{x \in X} Y(x)$. Clearly $\hat{v}_N = V(Y_N)$. It is not difficult to show that for any $\mu \in C(X)$ and $X^*(\mu) := \arg\min_{x \in X} \mu(x)$,

$$V'_\mu(\delta) = \inf_{x \in X^*(\mu)} \delta(x), \quad \forall \delta \in C(X),$$

and the above directional derivative holds in the Hadamard sense.

  44. By a functional CLT, under assumptions (A1) and (A2), $N^{1/2}(\hat{f}_N - f)$ converges in distribution to a random element $Y$ of $C(X)$. In particular, for any finite set $\{x_1, \ldots, x_m\} \subset X$, the random vector $(Y(x_1), \ldots, Y(x_m))$ has a multivariate normal distribution with zero mean and the same covariance matrix as the covariance matrix of $(F(x_1, \xi), \ldots, F(x_m, \xi))$. In particular, for fixed $x \in X$, $Y(x) \sim \mathcal{N}(0, \sigma^2(x))$ with $\sigma^2(x) := \mathrm{Var}[F(x, \xi)]$. Denote by $v_0$ the optimal value and by $S_0$ the set of optimal solutions of the true problem.

  45. Theorem 2. Suppose that the set $X$ is compact, and assumptions (A1) and (A2) hold. Then

$$\hat{v}_N = \min_{x \in S_0} \hat{f}_N(x) + o_p(N^{-1/2}), \qquad N^{1/2}[\hat{v}_N - v_0] \Rightarrow \inf_{x \in S_0} Y(x).$$

In particular, if the optimal set (of the true problem) $S_0 = \{x_0\}$ is a singleton, then

$$N^{1/2}[\hat{v}_N - v_0] \Rightarrow \mathcal{N}(0, \sigma^2(x_0)).$$

This result suggests that the optimal value of the SAA problem converges at a rate of $O_p(N^{-1/2})$. In particular, if $S_0 = \{x_0\}$, then $\hat{v}_N$ converges to $v_0$ at the same rate as $\hat{f}_N(x_0)$ converges to $f(x_0)$.

  46. Validation analysis. How can one evaluate the quality of a given (feasible) solution $\hat{x} \in X$? The SAA approach: a statistical test based on estimation of $f(\hat{x}) - v_0$, where $v_0$ is the optimal value of the true problem.

(i) Estimate $f(\hat{x})$ by the sample average $\hat{f}_{N'}(\hat{x})$, using a sample of a large size $N'$.

(ii) Solve the SAA problem $M$ times using $M$ independent samples, each of size $N$. Let $\hat{v}_N^{(1)}, \ldots, \hat{v}_N^{(M)}$ be the optimal values of the corresponding SAA problems. Estimate $\mathbb{E}[\hat{v}_N]$ by the average $M^{-1} \sum_{j=1}^M \hat{v}_N^{(j)}$.

Note that

$$\mathbb{E}\Big[ \hat{f}_{N'}(\hat{x}) - M^{-1} \textstyle\sum_{j=1}^M \hat{v}_N^{(j)} \Big] = \big[ f(\hat{x}) - v_0 \big] + \big[ v_0 - \mathbb{E}[\hat{v}_N] \big],$$

and that $v_0 - \mathbb{E}[\hat{v}_N] > 0$.
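A sketch of this validation procedure on the newsvendor example used in the SAA sketch above: step (i) gives a nearly unbiased estimate of $f(\hat{x})$, step (ii) a downward-biased estimate of $v_0$, and their difference over-estimates the true gap on average. The sizes $N$, $N'$, $M$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
c, b, h = 1.0, 3.0, 0.5
xs = np.linspace(0, 30, 601)

def F(x, xi):
    return c * x + b * np.maximum(xi - x, 0) + h * np.maximum(x - xi, 0)

def solve_saa(sample):                     # grid solve of one SAA instance
    f_hat = F(xs[:, None], sample[None, :]).mean(axis=1)
    return xs[f_hat.argmin()], f_hat.min()

x_hat, _ = solve_saa(rng.exponential(8.0, size=200))          # candidate solution
upper = F(x_hat, rng.exponential(8.0, size=100_000)).mean()   # (i): N' large
vals = [solve_saa(rng.exponential(8.0, size=200))[1] for _ in range(20)]  # (ii): M = 20
gap = upper - np.mean(vals)
print(f"estimated upper bound on f(x_hat) - v_0: {gap:.4f}")
```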

  47. The bias $v_0 - \mathbb{E}[\hat{v}_N]$ is positive and (under mild regularity conditions)

$$\lim_{N \to \infty} N^{1/2}\big( v_0 - \mathbb{E}[\hat{v}_N] \big) = \mathbb{E}\Big[ \max_{x \in S_0} Y(x) \Big],$$

where $S_0$ is the set of optimal solutions of the true problem and $(Y(x_1), \ldots, Y(x_k))$ has a multivariate normal distribution with zero mean vector and covariance matrix given by the covariance matrix of the random vector $(F(x_1, \xi), \ldots, F(x_k, \xi))$. For ill-conditioned problems this bias is of order $O(N^{-1/2})$ and can be large if the $\varepsilon$-optimal solution set $S_\varepsilon$ is large for some small $\varepsilon \ge 0$.

  48. Sample size estimates (by Large Deviations type bounds). Consider an iid sequence $Y_1, \ldots, Y_N$ of replications of a real valued random variable $Y$, and let $Z_N := N^{-1} \sum_{i=1}^N Y_i$ be the corresponding sample average. Then for any real number $a$ and $t > 0$ we have that $\mathrm{Prob}(Z_N \ge a) = \mathrm{Prob}(e^{t Z_N} \ge e^{t a})$, and hence, by the Markov inequality,

$$\mathrm{Prob}(Z_N \ge a) \le e^{-t a}\, \mathbb{E}[e^{t Z_N}] = e^{-t a} [M(t/N)]^N,$$

where $M(t) := \mathbb{E}[e^{t Y}]$ is the moment generating function of $Y$. Suppose that $Y$ has finite mean $\mu := \mathbb{E}[Y]$ and let $a \ge \mu$. By taking the logarithm of both sides of the above inequality, changing variables $t' = t/N$ and minimizing over $t' > 0$, we obtain

$$\frac{1}{N} \log \mathrm{Prob}(Z_N \ge a) \le -I(a), \tag{62}$$

where $I(z) := \sup_{t \in \mathbb{R}} \{ t z - \Lambda(t) \}$ is the conjugate of the logarithmic moment generating function $\Lambda(t) := \log M(t)$.
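A quick numeric check of (62) for $Y \sim \mathcal{N}(\mu, \sigma^2)$, where $\Lambda(t) = \mu t + \sigma^2 t^2/2$ and hence $I(a) = (a - \mu)^2 / (2\sigma^2)$; the parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, a, N, reps = 0.0, 1.0, 0.5, 50, 200_000
Z = rng.normal(mu, sigma, size=(reps, N)).mean(axis=1)    # sample averages Z_N
emp = (Z >= a).mean()                                     # empirical tail probability
bound = np.exp(-N * (a - mu) ** 2 / (2 * sigma**2))       # exp(-N I(a)) from (62)
print(f"empirical {emp:.2e} <= bound {bound:.2e}")
```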

  49. Suppose that $|X| < \infty$, i.e., the set $X$ is finite, and: (i) for every $x \in X$ the expected value $f(x) = \mathbb{E}[F(x, \xi)]$ is finite, (ii) there are constants $\sigma > 0$ and $a \in (0, +\infty]$ such that

$$M_x(t) \le \exp\{ \sigma^2 t^2 / 2 \}, \quad \forall t \in [-a, a], \; \forall x \in X,$$

where $M_x(t)$ is the moment generating function of the random variable $F(u(x), \xi) - F(x, \xi) - \mathbb{E}[F(u(x), \xi) - F(x, \xi)]$ and $u(x)$ is a point of the optimal set $S_0$. Choose $\varepsilon > 0$, $\delta \ge 0$ and $\alpha \in (0, 1)$ such that $0 < \varepsilon - \delta \le a \sigma^2$. Then for sample size

$$N \ge \frac{2 \sigma^2}{(\varepsilon - \delta)^2} \log\left( \frac{|X|}{\alpha} \right)$$

we are guaranteed, with probability at least $1 - \alpha$, that any $\delta$-optimal solution of the SAA problem is an $\varepsilon$-optimal solution of the true problem, i.e., $\mathrm{Prob}\big(\hat{S}_N^\delta \subset S^\varepsilon\big) \ge 1 - \alpha$.
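The bound is easy to evaluate; the tiny sketch below (with arbitrary constants) shows the mild, logarithmic, dependence of $N$ on $|X|$ and $\alpha$.

```python
import numpy as np

sigma, eps, delta, alpha = 1.0, 0.1, 0.0, 0.01            # arbitrary illustrative constants
for card_X in (10, 10**3, 10**6):
    N = 2 * sigma**2 / (eps - delta) ** 2 * np.log(card_X / alpha)
    print(f"|X| = {card_X:>7}: N >= {int(np.ceil(N))}")   # grows only like log|X|
```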

  50. Let $X = \{x_1, x_2\}$ with $f(x_2) - f(x_1) > \varepsilon > 0$ and suppose that the random variable $F(x_2, \xi) - F(x_1, \xi)$ has normal distribution with mean $\mu = f(x_2) - f(x_1)$ and variance $\sigma^2$. By solving the corresponding SAA problem we make the correct decision (that $x_1$ is the minimizer) if $\hat{f}_N(x_2) - \hat{f}_N(x_1) > 0$. The probability of this event is $\Phi(\mu \sqrt{N} / \sigma)$. Therefore we need the sample size $N > z_\alpha^2 \sigma^2 / \varepsilon^2$ in order for our decision to be correct with probability at least $1 - \alpha$. In order to solve the corresponding optimization problem we need to test $H_0 : \mu \le 0$ versus $H_a : \mu > 0$. Assuming that $\sigma^2$ is known, by the Neyman-Pearson Lemma the uniformly most powerful test is: "reject $H_0$ if $\hat{f}_N(x_2) - \hat{f}_N(x_1)$ is bigger than a specified critical value".

  51. Now let $X \subset \mathbb{R}^n$ be a set of finite diameter $D := \sup_{x', x \in X} \|x' - x\|$. Suppose that: (i) for every $x \in X$ the expected value $f(x) = \mathbb{E}[F(x, \xi)]$ is finite, (ii) there is a constant $\sigma > 0$ such that

$$M_{x', x}(t) \le \exp\{\sigma^2 t^2 / 2\}, \quad \forall t \in \mathbb{R}, \; \forall x', x \in X,$$

where $M_{x', x}(t)$ is the moment generating function of the random variable $F(x', \xi) - F(x, \xi) - \mathbb{E}[F(x', \xi) - F(x, \xi)]$, (iii) there exists $\kappa : \Xi \to \mathbb{R}_+$ such that its moment generating function is finite valued in a neighborhood of zero and

$$\big| F(x', \xi) - F(x, \xi) \big| \le \kappa(\xi)\, \|x' - x\|, \quad \forall \xi \in \Xi, \; \forall x', x \in X.$$

Choose $\varepsilon > 0$, $\delta \in [0, \varepsilon)$ and $\alpha \in (0, 1)$. Then for sample* size

$$N \ge \frac{O(1) \sigma^2}{(\varepsilon - \delta)^2} \left[ n \log\left( \frac{O(1) D L}{\varepsilon - \delta} \right) + \log\left( \frac{2}{\alpha} \right) \right],$$

we are guaranteed that $\mathrm{Prob}\big( \hat{S}_N^\delta \subset S^\varepsilon \big) \ge 1 - \alpha$.

*$O(1)$ denotes a generic constant independent of the data.

  52. In particular, if $\kappa(\xi) \equiv L$, then the estimate takes the form

$$N \ge O(1) \left( \frac{LD}{\varepsilon - \delta} \right)^2 \left[ n \log\left( \frac{O(1) DL}{\varepsilon - \delta} \right) + \log\left( \frac{1}{\alpha} \right) \right].$$

Suppose further that for some $c > 0$, $\gamma \ge 1$ and $\bar{\varepsilon} > \varepsilon$ the following growth condition holds:

$$f(x) \ge v_0 + c\,[\mathrm{dist}(x, S_0)]^\gamma, \quad \forall x \in S^{\bar{\varepsilon}},$$

and that the problem is convex. Then, for $\delta \in [0, \varepsilon/2]$, we have the following estimate of the required sample size:

$$N \ge O(1) \left( \frac{\bar{D} L}{c^{1/\gamma}\, \varepsilon^{(\gamma - 1)/\gamma}} \right)^2 \left[ n \log\left( \frac{O(1) \bar{D} L}{\varepsilon} \right) + \log\left( \frac{1}{\alpha} \right) \right],$$

where $\bar{D}$ is the diameter of $S^{\bar{\varepsilon}}$. In particular, if $S_0 = \{x_0\}$ is a singleton and $\gamma = 1$, we have the estimate (independent of $\varepsilon$):

$$N \ge O(1)\, c^{-2} L^2 \big[ n \log(O(1) c^{-1} L) + \log(\alpha^{-1}) \big].$$

  53. Example. Let $F(x, \xi) := \|x\|^{2k} - 2k\,\xi^T x$, where $k \in \mathbb{N}$ and $X := \{x \in \mathbb{R}^n : \|x\| \le 1\}$. Suppose that $\xi \sim \mathcal{N}(0, \sigma^2 I_n)$. Then $f(x) = \|x\|^{2k}$, and for $\varepsilon \in [0, 1]$, the set of $\varepsilon$-optimal solutions of the true problem is $\{x : \|x\|^{2k} \le \varepsilon\}$. Let $\bar{\xi}_N := (\xi^1 + \cdots + \xi^N)/N$. The corresponding sample average function is

$$\hat{f}_N(x) = \|x\|^{2k} - 2k\,\bar{\xi}_N^T x,$$

and $\hat{x}_N = \|\bar{\xi}_N\|^{-\gamma}\, \bar{\xi}_N$, where $\gamma := \frac{2k-2}{2k-1}$ if $\|\bar{\xi}_N\| \le 1$, and $\gamma = 1$ if $\|\bar{\xi}_N\| > 1$. Therefore, for $\varepsilon \in (0, 1)$, the optimal solution of the SAA problem is an $\varepsilon$-optimal solution of the true problem iff $\|\bar{\xi}_N\|^\nu \le \varepsilon$, where $\nu := \frac{2k}{2k-1}$.

  54. We have that $\bar{\xi}_N \sim \mathcal{N}(0, \sigma^2 N^{-1} I_n)$, and hence $N \|\bar{\xi}_N\|^2 / \sigma^2$ has the chi-square distribution with $n$ degrees of freedom. Consequently, the probability that $\|\bar{\xi}_N\|^\nu > \varepsilon$ is equal to the probability

$$\mathrm{Prob}\big( \chi_n^2 > N \varepsilon^{2/\nu} / \sigma^2 \big).$$

Moreover, $\mathbb{E}[\chi_n^2] = n$ and the probability $\mathrm{Prob}(\chi_n^2 > n)$ increases and tends to $1/2$ as $n$ increases. Consequently, for $\alpha \in (0, 0.3)$ and $\varepsilon \in (0, 1)$, for example, the sample size $N$ should satisfy

$$N > \frac{n \sigma^2}{\varepsilon^{2/\nu}} \tag{63}$$

in order to have the property: "with probability $1 - \alpha$ an (exact) optimal solution of the SAA problem is an $\varepsilon$-optimal solution of the true problem". Note that $\nu \to 1$ as $k \to \infty$.

  55. Stochastic Approximation (SA) approach. An alternative approach goes back to Robbins and Monro (1951) and is known as the Stochastic Approximation (SA) method. Assume that the true problem is convex, i.e., the set $X \subset \mathbb{R}^n$ is convex (and closed and bounded) and the function $F(\cdot, \xi) : X \to \mathbb{R}$ is convex for all $\xi \in \Xi$. Also assume the existence of the following stochastic oracle: given $x \in X$ and a random realization $\xi \in \Xi$, the oracle returns the quantity $F(x, \xi)$ and a stochastic subgradient, a vector $G(x, \xi)$ such that $g(x) := \mathbb{E}[G(x, \xi)]$ is well defined and is a subgradient of $f(\cdot)$ at $x$, i.e., $g(x) \in \partial f(x)$. For example, use $G(x, \xi) \in \partial_x F(x, \xi)$.

  56. Classical SA algorithm:

$$x_{j+1} = \Pi_X\big(x_j - \gamma_j G(x_j, \xi^j)\big),$$

where $\gamma_j = \theta/j$, $\theta > 0$, and $\Pi_X(x) = \arg\min_{z \in X} \|x - z\|_2$ is the orthogonal (Euclidean) projection onto $X$. Theoretical bound (assuming $f(\cdot)$ is strongly convex and differentiable):

$$\mathbb{E}\big[ f(x_j) - v_0 \big] = O(j^{-1}),$$

for an optimal choice of the constant $\theta$ (here $v_0$ is the optimal value of the true problem). This algorithm is very sensitive to the choice of $\theta$ and does not work well in practice.

  57. As a simple example consider $f(x) = \frac{1}{2} c x^2$, with $c = 0.2$ and $X = [-1, 1] \subset \mathbb{R}$, and assume that there is no noise, i.e., $G(x, \xi) \equiv \nabla f(x)$. Suppose, further, that we take $\theta = 1$ (i.e., $\gamma_j = 1/j$), which would be the optimal choice for $c = 1$. Then the iteration process becomes

$$x_{j+1} = x_j - f'(x_j)/j = \left( 1 - \frac{1}{5j} \right) x_j,$$

and hence, starting with $x_1 = 1$,

$$x_j = \exp\Big\{ \textstyle\sum_{s=1}^{j-1} \ln\big( 1 - \tfrac{1}{5s} \big) \Big\} > \exp\Big\{ -\textstyle\sum_{s=1}^{j-1} \tfrac{1}{5s-1} \Big\} > \exp\Big\{ -0.25 - \textstyle\int_1^{j-1} \tfrac{dt}{5t-1} \Big\} > \exp\big\{ -0.25 + 0.2 \ln 1.25 - \tfrac{1}{5} \ln j \big\} > 0.8\, j^{-1/5}.$$

That is, the convergence is extremely slow. For example, for $j = 10^9$ the error of the iterated solution is greater than $0.015$. On the other hand, for the optimal stepsize factor $\theta = 1/c = 5$, the optimal solution $\bar{x} = 0$ is found in one iteration.
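The recursion is easy to reproduce numerically; the sketch below runs it for $10^6$ iterations and compares the iterate against the lower bound $0.8\, j^{-1/5}$.

```python
# Reproducing the recursion x_{j+1} = (1 - 1/(5j)) x_j for f(x) = 0.1 x^2
# (c = 0.2, gamma_j = 1/j, no noise), starting from x_1 = 1.
x, J = 1.0, 10**6
for j in range(1, J):
    x *= 1.0 - 1.0 / (5.0 * j)
print(f"x_J = {x:.4f}, lower bound 0.8 J^(-1/5) = {0.8 * J ** -0.2:.4f}")
```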

  58. Robust SA approach (with averaging) (B. Polyak, 1990): consider

$$\tilde{x}_j = \sum_{t=1}^j \nu_t x_t, \quad \text{where } \nu_t = \frac{\gamma_t}{\sum_{\tau=1}^j \gamma_\tau}.$$

Let $D_X = \max_{x \in X} \|x - x_1\|_2$, and assume that

$$\mathbb{E}\big[ \|G(x, \xi)\|_2^2 \big] \le M^2, \quad \forall x \in X,$$

for some constant $M > 0$. Then

$$\mathbb{E}\big[ f(\tilde{x}_j) - v_0 \big] \le \frac{D_X^2 + M^2 \sum_{t=1}^j \gamma_t^2}{2 \sum_{t=1}^j \gamma_t}.$$

For $\gamma_t = \frac{\theta D_X}{M \sqrt{t}}$, after $N$ iterations we have

$$\mathbb{E}\big[ f(\tilde{x}_N) - v_0 \big] \le \max\{\theta, \theta^{-1}\}\, \frac{M (D_X^2 + \log N)}{2 D_X \sqrt{N}}.$$
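A minimal sketch of robust SA with averaging applied to the newsvendor objective used earlier, with the stepsize $\gamma_t = \theta D_X / (M \sqrt{t})$. The costs, demand model, horizon and constants are illustrative choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(7)
c, b, h = 1.0, 3.0, 0.5                              # newsvendor costs as before
G = lambda x, xi: c - b * (xi > x) + h * (x >= xi)   # a subgradient of F(., xi)

lo, hi = 0.0, 30.0                             # X = [0, 30]
D_X, M, theta = hi - lo, max(c + h, b - c), 1.0
x, num, den = 15.0, 0.0, 0.0
for t in range(1, 5001):
    gamma = theta * D_X / (M * np.sqrt(t))     # gamma_t = theta D_X / (M sqrt(t))
    x = np.clip(x - gamma * G(x, rng.exponential(8.0)), lo, hi)  # projected SA step
    num, den = num + gamma * x, den + gamma    # running weighted average (nu_t weights)
print("averaged iterate:", round(num / den, 2), "(true minimizer is about 6.78)")
```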

  59. Constant step size variant: fix in advance the sample size (number of iterations) $N$ and step size $\gamma_j \equiv \gamma$, $j = 1, \ldots, N$; $\tilde{x}_N = \frac{1}{N} \sum_{j=1}^N x_j$. Theoretical bound:

$$\mathbb{E}[f(\tilde{x}_N) - v_0] \le \frac{D_X^2}{2 \gamma N} + \frac{\gamma M^2}{2}.$$

For the optimal (up to factor $\theta$) $\gamma = \frac{\theta D_X}{M \sqrt{N}}$ we have

$$\mathbb{E}\big[ f(\tilde{x}_N) - v_0 \big] \le \frac{D_X M}{2 \theta \sqrt{N}} + \frac{\theta D_X M}{2 \sqrt{N}} \le \frac{\kappa D_X M}{\sqrt{N}},$$

where $\kappa = \max\{\theta, \theta^{-1}\}$. By the Markov inequality it follows that

$$\mathrm{Prob}\big( f(\tilde{x}_N) - v_0 > \varepsilon \big) \le \frac{\kappa D_X M}{\varepsilon \sqrt{N}}.$$

Hence the sample size

$$N \ge \frac{\kappa^2 D_X^2 M^2}{\varepsilon^2 \alpha^2}$$

guarantees that $\mathrm{Prob}\big( f(\tilde{x}_N) - v_0 > \varepsilon \big) \le \alpha$.

  60. Mirror Descent SA method (A. Nemirovski). Let $\|\cdot\|$ be a norm on $\mathbb{R}^n$ and $\omega : X \to \mathbb{R}$ be a continuously differentiable function which is strongly convex on $X$ with respect to $\|\cdot\|$, i.e., for $x, x' \in X$:

$$\omega(x') \ge \omega(x) + (x' - x)^T \nabla \omega(x) + \tfrac{1}{2} c \|x' - x\|^2.$$

Prox mapping $P_x : \mathbb{R}^n \to X$:

$$P_x(y) = \arg\min_{z \in X} \big\{ \omega(z) + (y - \nabla \omega(x))^T z \big\}.$$

For $\omega(x) = \tfrac{1}{2} x^T x$ we have that $P_x(y) = \Pi_X(x - y)$. Set

$$x_{j+1} = P_{x_j}\big(\gamma_j G(x_j, \xi^j)\big), \qquad \tilde{x}_j = \sum_{t=1}^j \nu_t x_t, \quad \text{where } \nu_t = \frac{\gamma_t}{\sum_{\tau=1}^j \gamma_\tau}.$$

  61. Then

$$\mathbb{E}\big[ f(\tilde{x}_j) - v_0 \big] \le \frac{D_{\omega, X}^2 + (2c)^{-1} M_*^2 \sum_{t=1}^j \gamma_t^2}{2 \sum_{t=1}^j \gamma_t},$$

where $M_*$ is a positive constant such that

$$\mathbb{E}\big[ \|G(x, \xi)\|_*^2 \big] \le M_*^2, \quad \forall x \in X,$$

$\|x\|_* = \sup_{\|z\| \le 1} x^T z$ is the dual norm of the norm $\|\cdot\|$, and

$$D_{\omega, X} = \Big[ \max_{z \in X} \omega(z) - \min_{x \in X} \omega(x) \Big]^{1/2}.$$

For constant step size $\gamma_j = \gamma$, $j = 1, \ldots, N$, with the optimal (up to factor $\theta > 0$) stepsize $\gamma = \theta D_{\omega, X} \sqrt{\frac{2c}{N}} / M_*$, we have

$$\mathbb{E}\big[ f(\tilde{x}_N) - v_0 \big] \le \max\{\theta, \theta^{-1}\}\, \frac{2 D_{\omega, X} M_*}{\sqrt{c N}}.$$

  62. Large Deviations type bounds. Suppose that

$$\mathbb{E}\Big[ \exp\big\{ \|G(x, \xi)\|_*^2 / M_*^2 \big\} \Big] \le \exp\{1\}, \quad \forall x \in X.$$

Then for the constant stepsize policy and any $\Omega \ge 1$:

$$\mathrm{Prob}\left( f(\tilde{x}_N) - v_0 > \frac{2 \max\{\theta, \theta^{-1}\} M_* D_{\omega, X} (12 + 2\Omega)}{\sqrt{c N}} \right) \le 2 \exp\{-\Omega\}.$$

It follows that for $\varepsilon > 0$, $\alpha \in (0, 1)$ and

$$N_{\mathrm{SA}} = O(1)\, \varepsilon^{-2} D_{\omega, X}^2 M_*^2 \log^2(1/\alpha),$$

we are guaranteed that $\mathrm{Prob}\big( f(\tilde{x}_N) - v_0 > \varepsilon \big) \le \alpha$. This can be compared with a corresponding estimate of the sample size for the SAA method:

$$N_{\mathrm{SAA}} = O(1)\, \varepsilon^{-2} R^2 M_*^2 \big[ \log(1/\alpha) + n \log(R M_*/\varepsilon) \big].$$

  63. Example. Let $X = \{x \in \mathbb{R}^n : \sum_{i=1}^n x_i = 1, \; x \ge 0\}$ be the standard simplex. Consider two setups for the Mirror Descent SA: the Euclidean setup, where $\|\cdot\| = \|\cdot\|_2$ and $\omega(x) = \tfrac{1}{2} x^T x$, and the $\ell_1$-setup, where $\|\cdot\| = \|\cdot\|_1$ with $\|\cdot\|_* = \|\cdot\|_\infty$, and $\omega$ is the entropy function

$$\omega(x) = \sum_{i=1}^n x_i \ln x_i.$$

For the constant stepsize policy, the Euclidean setup leads to

$$\mathbb{E}\big[ f(\tilde{x}_N) - v_0 \big] \le O(1) \max\{\theta, \theta^{-1}\}\, M N^{-1/2},$$

with $M^2 = \sup_{x \in X} \mathbb{E}\big[ \|G(x, \xi)\|_2^2 \big]$ (note that the Euclidean diameter of $X$ is $\sqrt{2}$).

  64. The $\ell_1$-setup corresponds to $D_{\omega, X} = \sqrt{\ln n}$, $c = 1$ and $x_1 = \arg\min_X \omega = n^{-1}(1, \ldots, 1)^T$. The associated Mirror Descent SA is easily implementable: the prox mapping can be computed in $O(n)$ operations according to the explicit formula:

$$[P_x(y)]_i = \frac{x_i e^{-y_i}}{\sum_{k=1}^n x_k e^{-y_k}}, \quad i = 1, \ldots, n.$$

The efficiency estimate guaranteed with the $\ell_1$-setup is

$$\mathbb{E}\big[ f(\tilde{x}_N) - v_0 \big] \le O(1) \max\{\theta, \theta^{-1}\}\, \sqrt{\log n}\, M_* N^{-1/2},$$

with $M_*^2 = \sup_{x \in X} \mathbb{E}\big[ \|G(x, \xi)\|_\infty^2 \big]$. To compare the Euclidean and $\ell_1$-setups, observe that $M_* \le M$, and the ratio $M_*/M$ can be as small as $n^{-1/2}$. When $X$ is a standard simplex of large dimension, we have strong reasons to prefer the $\ell_1$-setup to the usual Euclidean one.
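A sketch of the $\ell_1$-setup Mirror Descent SA on the simplex, using the explicit prox formula above. The objective, minimizing $f(x) = \mathbb{E}[-\ln(\xi^T x)]$ (the log-optimal portfolio rewritten as a minimization), the return model, and the constants are all illustrative; the stepsize follows $\gamma = \theta D_{\omega, X} \sqrt{2c/N} / M_*$ with $\theta = 1$, $c = 1$, $D_{\omega, X} = \sqrt{\ln n}$ and a rough a priori guess for $M_*$.

```python
import numpy as np

rng = np.random.default_rng(8)
n, N = 4, 5000
mu = 1.0 + rng.uniform(0.0, 0.1, size=n)                  # hypothetical mean returns
draw = lambda: np.clip(mu + rng.normal(0, 0.15, size=n), 1e-2, None)

c, D, M_star, theta = 1.0, np.sqrt(np.log(n)), 2.0, 1.0   # M_star: rough guess
gamma = theta * D * np.sqrt(2 * c / N) / M_star           # constant stepsize

x = np.full(n, 1.0 / n)                                   # x_1 = argmin_X omega
avg = np.zeros(n)
for _ in range(N):
    xi = draw()
    g = -xi / (xi @ x)                # stochastic subgradient of -ln(xi^T x)
    w = x * np.exp(-gamma * g)        # entropic prox step P_x(gamma * g)
    x = w / w.sum()
    avg += x / N                      # nu_t are equal for constant stepsize
print("averaged weights:", avg.round(3))
```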

  65. Bounds by the Mirror Descent SA method. For the iterates $x_{j+1} = P_{x_j}(\gamma_j G(x_j, \xi^j))$, consider

$$f^N(x) := \sum_{j=1}^N \nu_j \big[ f(x_j) + g(x_j)^T(x - x_j) \big],$$

where $f(x) = \mathbb{E}[F(x, \xi)]$, $g(x) = \mathbb{E}[G(x, \xi)]$ and $\nu_j := \gamma_j / \big( \sum_{j=1}^N \gamma_j \big)$. Since $g(x) \in \partial f(x)$, it follows that

$$f^N_* := \min_{x \in X} f^N(x) \le v_0.$$

Also $v_0 \le f(\tilde{x}_N)$ and, by convexity of $f$,

$$f(\tilde{x}_N) \le f^*_N := \sum_{j=1}^N \nu_j f(x_j).$$

  66. That is, for any realization of the random sample $\xi^1, \ldots, \xi^N$,

$$f^N_* \le v_0 \le f^*_N.$$

Computational (observable) counterparts of these bounds:

$$\underline{f}_N := \min_{x \in X} \sum_{j=1}^N \nu_j \big[ F(x_j, \xi^j) + G(x_j, \xi^j)^T(x - x_j) \big], \qquad \bar{f}_N := \sum_{j=1}^N \nu_j F(x_j, \xi^j).$$

We have that $\mathbb{E}[\bar{f}_N] = \mathbb{E}[f^*_N]$, and

$$\mathbb{E}\big[ \underline{f}_N \big] \le v_0 \le \mathbb{E}\big[ \bar{f}_N \big].$$

  67. Complexity of multistage stochastic programming. Conditional sampling. Let $\xi_2^i$, $i = 1, \ldots, N_1$, be an iid random sample of $\xi_2$. Conditional on $\xi_2 = \xi_2^i$, a random sample $\xi_3^{ij}$, $j = 1, \ldots, N_2$, is generated, and so on. The obtained scenario tree is considered as a sample approximation of the true problem. Note that the total number of scenarios is $N = \prod_{t=1}^{T-1} N_t$ and each scenario in the generated tree is considered with the same probability $1/N$. Note also that in the case of stagewise independence of the corresponding random process, we have two possible strategies. We can generate a different (independent) sample $\xi_3^{ij}$, $j = 1, \ldots, N_2$, for every generated node $\xi_2^i$, or we can use the same sample $\xi_3^j$, $j = 1, \ldots, N_2$, for every $\xi_2^i$. In the second case we preserve the stagewise independence condition for the generated scenario tree.

  68. For $T = 3$, under certain regularity conditions, for $\varepsilon > 0$ and $\alpha \in (0, 1)$, and sample sizes $N_1$ and $N_2$ satisfying

$$\left( \frac{O(1) D_1 L_1}{\varepsilon} \right)^{n_1} \exp\left\{ -\frac{O(1) N_1 \varepsilon^2}{\sigma_1^2} \right\} + \left( \frac{O(1) D_2 L_2}{\varepsilon} \right)^{n_2} \exp\left\{ -\frac{O(1) N_2 \varepsilon^2}{\sigma_2^2} \right\} \le \alpha,$$

we have that any first-stage $\varepsilon/2$-optimal solution of the SAA problem is an $\varepsilon$-optimal first-stage solution of the true problem with probability at least $1 - \alpha$. (Here $D_1, D_2, L_1, L_2, \sigma_1, \sigma_2$ are certain analogues of the similar constants in the sample size estimate for two stage problems.)

  69. In particular, suppose that $N_1 = N_2$ and take $L := \max\{L_1, L_2\}$, $D := \max\{D_1, D_2\}$, $\sigma^2 := \max\{\sigma_1^2, \sigma_2^2\}$ and $n := \max\{n_1, n_2\}$. Then the required sample size $N_1 = N_2$:

$$N_1 \ge \frac{O(1) \sigma^2}{\varepsilon^2} \left[ n \log\left( \frac{O(1) D L}{\varepsilon} \right) + \log\left( \frac{1}{\alpha} \right) \right],$$

with total number of scenarios $N = N_1^2$. That is, the total number of scenarios needed to solve a $T$-stage stochastic program with a reasonable accuracy by the SAA method grows exponentially with increase of the number of stages $T$. Another way of putting this is that the number of scenarios needed to solve a $T$-stage problem would grow as $O\big( \varepsilon^{-2(T-1)} \big)$ with decrease of the error level $\varepsilon > 0$.

  70. This indicates that, from the point of view of the number of scenarios, the complexity of multistage programming problems grows exponentially with increase of the number of stages. Furthermore, even if the SAA problem can be solved, its solution does not define a policy for the true problem, and only the computed first stage solution may be of use.

  71. Example (Multistage portfolio selection). Consider the problem of multistage portfolio selection (18)–(20) with logarithmic utility function $U(W) = \ln W$ and stagewise independent data process $\xi_t = 1 + R_t$, $t = 1, \ldots, T$. Then the optimal value $v_0$ of the true problem is

$$v_0 = \ln W_0 + \sum_{t=0}^{T-1} \nu_t,$$

where $\nu_t$ is the optimal value of the problem*

$$\max_{x_t \ge 0} \; \mathbb{E}\big[ \ln(\xi_{t+1}^T x_t) \big] \quad \text{s.t.} \; e^T x_t = 1.$$

Let the SAA method be applied with the respective samples $\xi_t^j$, $j = 1, \ldots, N_{t-1}$, of $\xi_t$, $t = 1, \ldots, T-1$.

*By $e$ we denote the vector of ones.

  72. In that case the corresponding SAA problem is also stagewise independent and the optimal value of the SAA problem is

$$\hat{v}_N = \ln W_0 + \sum_{t=0}^{T-1} \hat{\nu}_{t, N_t}, \tag{64}$$

where $\hat{\nu}_{t, N_t}$ is the optimal value of the problem

$$\max_{x_t \ge 0} \; \frac{1}{N_t} \sum_{j=1}^{N_t} \ln\big( (\xi_{t+1}^j)^T x_t \big) \quad \text{s.t.} \; e^T x_t = 1. \tag{65}$$

We can view $\hat{\nu}_{t, N_t}$ as an SAA estimator of $\nu_t$. Since here we solve a maximization rather than a minimization problem, $\hat{\nu}_{t, N_t}$ is an upward biased estimator of $\nu_t$, i.e., $\mathbb{E}[\hat{\nu}_{t, N_t}] \ge \nu_t$. We also have that $\mathbb{E}[\hat{v}_N] = \ln W_0 + \sum_{t=0}^{T-1} \mathbb{E}[\hat{\nu}_{t, N_t}]$, and hence

$$\mathbb{E}[\hat{v}_N] - v_0 = \sum_{t=0}^{T-1} \big( \mathbb{E}[\hat{\nu}_{t, N_t}] - \nu_t \big). \tag{66}$$
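A Monte Carlo sketch of the upward bias $\mathbb{E}[\hat{\nu}_{t, N_t}] \ge \nu_t$ of (65): solve the SAA problem for many independent samples and compare the average optimal value with a large-sample proxy for $\nu_t$. The return distribution and all sizes are made up.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
n = 3
draw = lambda N: np.clip(1.0 + rng.normal(0.05, 0.2, size=(N, n)), 1e-3, None)

def saa_value(xi):                      # optimal value of (65) for a sample xi
    res = minimize(lambda x: -np.mean(np.log(xi @ x)), np.full(n, 1.0 / n),
                   bounds=[(0, 1)] * n,
                   constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}])
    return -res.fun

nu_proxy = saa_value(draw(200_000))     # large-sample stand-in for nu_t
for N in (10, 50, 250):
    vals = [saa_value(draw(N)) for _ in range(200)]
    print(N, round(np.mean(vals) - nu_proxy, 4))   # positive, shrinking bias
```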

  73. That is, for the logarithmic utility function, the bias of the SAA estimator of the optimal value grows additively with increase of the number of stages. Also, because the samples at different stages are independent of each other, we have that

$$\mathrm{Var}[\hat{v}_N] = \sum_{t=0}^{T-1} \mathrm{Var}[\hat{\nu}_{t, N_t}]. \tag{67}$$

On the other hand, for the power utility function $U(W) := W^\gamma$, with $\gamma \in (0, 1]$, the bias of the corresponding SAA estimator $\hat{v}_N$ grows with increase of the number of stages in a multiplicative way. In particular, if the respective relative biases are constant, then the bias of $\hat{v}_N$ grows exponentially fast with increase of the number of stages.

  74. Distributionally robust and risk averse approaches to stochastic programming:

$$\min_{x \in X} \; \Big\{ g(x) := \sup_{Q \in \mathcal{M}} \mathbb{E}_Q[F_x] \Big\}, \tag{68}$$

where $X \subset \mathbb{R}^n$, $F_x(\omega) = F(x, \xi(\omega))$, $F : \mathbb{R}^n \times \Xi \to \mathbb{R}$, $\xi : \Omega \to \Xi$ is a measurable mapping from $\Omega$ into $\Xi \subset \mathbb{R}^d$, and $\mathcal{M}$ is a (nonempty) set of probability measures (distributions) on the sample space $(\Omega, \mathcal{F})$. Let $\mathcal{Z}$ be a linear space of measurable functions $Z : \Omega \to \mathbb{R}$. We assume that $F_x \in \mathcal{Z}$ for all $x \in X$. Consider

$$\rho(Z) := \sup_{Q \in \mathcal{M}} \mathbb{E}_Q[Z] = \sup_{Q \in \mathcal{M}} \int_\Omega Z(\omega) \, dQ(\omega), \quad Z \in \mathcal{Z}.$$

The functional $\rho : \mathcal{Z} \to \mathbb{R}$ has the following properties:

  75. (A1) Convexity: $\rho(\alpha Z_1 + (1 - \alpha) Z_2) \le \alpha \rho(Z_1) + (1 - \alpha) \rho(Z_2)$ for all $Z_1, Z_2 \in \mathcal{Z}$ and $\alpha \in [0, 1]$.

(A2) Monotonicity: If $Z_1, Z_2 \in \mathcal{Z}$ and $Z_2 \ge Z_1$, then $\rho(Z_2) \ge \rho(Z_1)$.

(A3) Translation Equivariance: If $a \in \mathbb{R}$ and $Z \in \mathcal{Z}$, then $\rho(Z + a) = \rho(Z) + a$.

(A4) Positive Homogeneity: $\rho(\alpha Z) = \alpha \rho(Z)$, $Z \in \mathcal{Z}$, $\alpha > 0$.

Functionals $\rho : \mathcal{Z} \to \mathbb{R}$ satisfying axioms (A1)–(A4) are called coherent risk measures (Artzner, Delbaen, Eber, Heath (1999)).

  76. Duality relation between coherent risk measures and distributional robustness. Examples of dual spaces. Space $\mathcal{Z} := L_p(\Omega, \mathcal{F}, P)$, where $P$ is a (reference) probability measure on $(\Omega, \mathcal{F})$ and $p \in [1, \infty)$. That is, $\mathcal{Z}$ is the space of random variables $Z(\omega)$ having finite $p$-th order moment. For $\zeta = dQ/dP$, the space $\mathcal{Z}$ is paired with its dual space $\mathcal{Z}^* = L_q(\Omega, \mathcal{F}, P)$, where $1/p + 1/q = 1$, and the scalar product

$$\langle \zeta, Z \rangle := \int_\Omega \zeta(\omega) Z(\omega) \, dP(\omega), \quad \zeta \in \mathcal{Z}^*, \; Z \in \mathcal{Z}.$$

We also consider the space $\mathcal{Z} := L_\infty(\Omega, \mathcal{F}, P)$ of essentially bounded (measurable) functions $Z : \Omega \to \mathbb{R}$, paired with the space $L_1(\Omega, \mathcal{F}, P)$.

  77. Another example. Let $\Omega$ be a metric space equipped with its Borel sigma algebra $\mathcal{F}$, and $\mathcal{Z} := C(\Omega)$ be the space of continuous functions $Z : \Omega \to \mathbb{R}$ with the max-norm $\|Z\| = \sup_{\omega \in \Omega} |Z(\omega)|$. Its dual space $\mathcal{Z}^*$ is the space of finite signed measures on $(\Omega, \mathcal{F})$ with the scalar product

$$\langle \mu, Z \rangle := \int_\Omega Z(\omega) \, d\mu(\omega), \quad \mu \in \mathcal{Z}^*, \; Z \in \mathcal{Z}.$$

This framework is suitable when the uncertainty set $\mathcal{M}$ is defined by moment constraints. We mainly consider the first example with the reference probability space $(\Omega, \mathcal{F}, P)$ and paired spaces $\mathcal{Z} = L_p(\Omega, \mathcal{F}, P)$ and $\mathcal{Z}^* = L_q(\Omega, \mathcal{F}, P)$.

  78. Dual representation of risk functions. By the Fenchel-Moreau theorem, if $\rho : \mathcal{Z} \to \mathbb{R}$ is convex (assumption (A1)) and lower semicontinuous, then

$$\rho(Z) = \sup_{\zeta \in \mathcal{Z}^*} \{ \langle \zeta, Z \rangle - \rho^*(\zeta) \}, \quad Z \in \mathcal{Z}, \tag{69}$$

where $\rho^*(\zeta) := \sup_{Z \in \mathcal{Z}} \{ \langle \zeta, Z \rangle - \rho(Z) \}$. Note that maximization with respect to the dual space $\mathcal{Z}^*$ can be replaced by its subset $\mathfrak{A} := \mathrm{dom}(\rho^*) = \{ \zeta \in \mathcal{Z}^* : \rho^*(\zeta) < +\infty \}$. It is possible to show that condition (A2) (monotonicity) holds iff $\zeta \succeq 0$ for every $\zeta \in \mathfrak{A}$. Condition (A3) (translation equivariance) holds iff $\int_\Omega \zeta \, dP = 1$ for every $\zeta \in \mathfrak{A}$. If $\rho$ is positively homogeneous, then $\rho^*(\zeta) = 0$ for every $\zeta \in \mathfrak{A}$.

  79. If conditions (A1)–(A4) hold, then $\mathfrak{A}$ is a set of probability density functions and

$$\rho(Z) = \sup_{\zeta \in \mathfrak{A}} \int_\Omega \zeta(\omega) Z(\omega) \, dP(\omega). \tag{70}$$

That is,

$$\rho(Z) = \sup_{Q \in \mathcal{M}} \mathbb{E}_Q[Z],$$

where $\mathcal{M}$ is the set of probability measures $Q$ absolutely continuous with respect to the reference measure $P$ and such that $dQ/dP \in \mathfrak{A}$.

  80. Average Value-at-Risk (also called Conditional Value-at-Risk, Expected Shortfall, or Expected Tail Loss). Chance (probabilistic) constraint:

$$\mathrm{Prob}\big\{ C(x, \xi) > \tau \big\} \le \alpha. \tag{71}$$

Constraint (71) can be written as $\mathbb{E}\big[ \mathbb{1}_{(0, \infty)}(Z_x) \big] \le \alpha$, where $Z_x := C(x, \xi) - \tau$. Let $\psi : \mathbb{R} \to \mathbb{R}_+$ be a nondecreasing convex function such that $\psi(\cdot) \ge \mathbb{1}_{(0, \infty)}(\cdot)$. We have that

$$\inf_{t > 0} \mathbb{E}[\psi(t Z_x)] \ge \mathbb{E}\big[ \mathbb{1}_{(0, \infty)}(Z_x) \big],$$

and hence

$$\inf_{t > 0} \mathbb{E}[\psi(t Z_x)] \le \alpha \tag{72}$$

is a conservative approximation of the chance constraint (71).

  81. The choice $\psi^*(z) := [1 + z]_+$ gives the best conservative approximation. For this choice of $\psi$, (72) is equivalent to

$$\underbrace{\inf_{t \in \mathbb{R}} \Big\{ t + \alpha^{-1} \mathbb{E}[Z_x - t]_+ \Big\}}_{\mathrm{AV@R}_\alpha(Z_x)} \le 0. \tag{73}$$

Note that the minimum in the left hand side of (73) is attained at $t^* = \mathrm{V@R}_\alpha(Z_x)$, where

$$\mathrm{V@R}_\alpha(Z) = H_Z^{-1}(1 - \alpha) := \inf\{ t : H_Z(t) \ge 1 - \alpha \},$$

with $H_Z(t) := \mathrm{Prob}(Z \le t)$ being the cdf of $Z$.

  82. The constraint $\mathrm{Prob}(Z \le 0) \ge 1 - \alpha$ is equivalent to $\mathrm{V@R}_\alpha(Z) \le 0$. Therefore $\mathrm{AV@R}_\alpha(C(x, \xi)) \le \tau$ is a conservative approximation of the chance constraint (71). Note that $\rho(Z) = \mathrm{AV@R}_\alpha(Z)$ is a coherent risk measure. It is natural to take here $\mathcal{Z} = L_1(\Omega, \mathcal{F}, P)$ and $\mathcal{Z}^* = L_\infty(\Omega, \mathcal{F}, P)$. Dual representation:

$$\mathrm{AV@R}_\alpha(Z) = \sup_{\zeta \in \mathfrak{A}} \int_\Omega Z(\omega) \zeta(\omega) \, dP(\omega),$$

where

$$\mathfrak{A} = \Big\{ \zeta \in \mathcal{Z}^* : 0 \le \zeta(\omega) \le \alpha^{-1} \; \text{a.e. } \omega \in \Omega, \; \int_\Omega \zeta(\omega) \, dP(\omega) = 1 \Big\}.$$

  83. Suppose that $Z^1, \ldots, Z^N$ is an iid sample of the random variable $Z$. Then by replacing the true probability distribution $P$ of $Z$ by its empirical* estimate $P_N = N^{-1} \sum_{j=1}^N \delta(Z^j)$ we obtain the sample estimate of $\theta := \mathrm{AV@R}_\alpha(Z)$:

$$\hat{\theta}_N = \inf_{t \in \mathbb{R}} \Big\{ t + \alpha^{-1} N^{-1} \textstyle\sum_{j=1}^N [Z^j - t]_+ \Big\}.$$

By the Delta theorem (Theorem 2) we have

$$\hat{\theta}_N = \inf_{t \in [t^*, t^{**}]} \Big\{ t + \alpha^{-1} N^{-1} \textstyle\sum_{j=1}^N [Z^j - t]_+ \Big\} + o_p(N^{-1/2}),$$

where $[t^*, t^{**}]$ is the set of $(1 - \alpha)$-quantiles of the distribution of the random variable $Z$. If the $(1 - \alpha)$-quantile $t^* = \mathrm{V@R}_\alpha(Z)$ is unique, then

$$\hat{\theta}_N = \mathrm{V@R}_\alpha(Z) + \alpha^{-1} Y_N + o_p(N^{-1/2}),$$

where $Y_N := N^{-1} \sum_{j=1}^N [Z^j - t^*]_+$. This approximation can be reasonable when $N$ is significantly bigger than $1/\alpha$.

*By $\delta(z)$ we denote the measure of mass one at the point $z$.
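A sketch of the sample estimator $\hat{\theta}_N$: its minimizing $t$ is the empirical $(1-\alpha)$-quantile, and for $Z \sim \mathcal{N}(0, 1)$ the result can be checked against the closed form $\mathrm{AV@R}_\alpha(Z) = \varphi(\Phi^{-1}(1-\alpha))/\alpha$. The level $\alpha$ and sample size are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(10)
alpha, N = 0.05, 100_000
Z = rng.normal(size=N)                                  # losses Z^j

t_star = np.quantile(Z, 1 - alpha)                      # empirical V@R_alpha
theta_hat = t_star + np.maximum(Z - t_star, 0).mean() / alpha
exact = norm.pdf(norm.ppf(1 - alpha)) / alpha           # closed form for N(0,1)
print(round(theta_hat, 4), "vs exact", round(exact, 4))
```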

  84. Example. Mean-variance risk function ($c > 0$):

$$\rho(Z) := \mathbb{E}[Z] + c\, \mathrm{Var}[Z], \quad Z \in \mathcal{Z} := L_2(\Omega, \mathcal{F}, P).$$

Dual representation:

$$\rho(Z) = \sup_{\zeta \in \mathcal{Z},\, \mathbb{E}[\zeta] = 1} \big\{ \langle \zeta, Z \rangle - (4c)^{-1} \mathrm{Var}[\zeta] \big\}.$$

It satisfies (A1) and (A3), but does not satisfy (A2) and (A4).

Example. Mean-upper-semideviation risk function of order $p \in [1, +\infty)$, $Z \in \mathcal{Z} := L_p(\Omega, \mathcal{F}, P)$, $c \ge 0$:

$$\rho(Z) := \mathbb{E}[Z] + c \Big( \mathbb{E}_P\big[ [Z - \mathbb{E}_P[Z]]_+^p \big] \Big)^{1/p}.$$

Here $\rho$ satisfies (A1), (A3), (A4), and also (A2) (monotonicity) if $c \le 1$. The max-representation

$$\rho(Z) = \sup_{\zeta \in \mathfrak{A}} \int_\Omega Z(\omega) \zeta(\omega) \, dP(\omega)$$

holds with $\mathfrak{A} = \{ \zeta : \zeta = 1 + h - \int_\Omega h \, dP, \; \|h\|_q \le c, \; h \ge 0 \}$.

  85. Example. $\phi$-divergence. Consider a convex continuous function $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ such that $\phi(1) = 0$ and $\phi(x) > 0$ for $x > 1$. For some $c > 0$, let

$$\mathfrak{A} := \Big\{ \zeta \ge 0 : \int_\Omega \phi(\zeta(\omega)) \, dP(\omega) \le c, \; \int_\Omega \zeta(\omega) \, dP(\omega) = 1 \Big\}.$$

For example, we can take $\phi(x) := |x - 1|^p$, $p \in [1, \infty)$. In that case it is natural to use the space $\mathcal{Z} = L_p(\Omega, \mathcal{F}, P)$ and

$$\mathfrak{A} = \Big\{ \zeta \ge 0 : \|\zeta - 1\|_p \le c^{1/p}, \; \int_\Omega \zeta \, dP = 1 \Big\}.$$

For $\phi(x) := x \ln x - x + 1$, the quantity $\int_\Omega \phi(\zeta(\omega)) \, dP(\omega)$ defines the Kullback-Leibler divergence, denoted $D_{\mathrm{KL}}(\zeta \| P)$.
