SLIDE 1 Inverse Optimization and Equilibrium with Applications in Finance and Statistics

Jong-Shi Pang
Department of Industrial and Enterprise Systems Engineering
University of Illinois at Urbana-Champaign

Presented at the SPECIAL SEMESTER on Stochastics with Emphasis on Finance
Linz, Austria, Monday, October 27, 2008, 10:50–11:40 AM
SLIDE 2 Contents of Presentation
- A preface
- What is inverse optimization?
- Three applications:
  — cross-validated support-vector regression
  — optimal mixing in statistics
  — implied volatility of American options
- Focusing on concepts and ideas; omitting technical details.
SLIDE 3 Preface
Up to now, most inverse problems in mathematics involve the inversion of partial differential equations (PDEs), the forward models, in the presence of observed and/or experimental data. They lead to optimization problems with PDE constraints.

In contrast, the kind of inverse problems we are interested in involves optimization or equilibrium problems as the forward models, and requires the solution of finite-dimensional optimization problems with algebraic inequality constraints together with certain complementarity constraints.

The latter inverse problems require theories and methods of contemporary optimization and variational inequalities, where inequalities provide the key challenges. Inequalities lead to non-smoothness, multi-valuedness, and disjunctions, which are atypical characteristics in modern computational mathematics.
SLIDE 4 Forward versus Inverse Optimization

Optimization pertains to the computation of the maximum or minimum value of an objective function in the presence of constraints, which are, for all practical purposes, expressed in terms of a finite number of algebraic equations and/or inequalities defined by a finite number of decision variables.

Traditional optimization is a forward process; namely, input data are fed into the optimization model, yielding a resolution of the problem: the model solution.

Inverse optimization attempts to build improved optimization models for the goal of better generalization, by choosing the model parameters so that the model solution optimizes a secondary objective, e.g., reproducing an observed solution, either exactly or as closely as possible.
SLIDE 5 An illustration: inverse convex quadratic programming

Given a set Ω and an outer objective function θ, find (x, Q, A, b, c):

$$\begin{array}{ll}
\displaystyle\operatorname*{minimize}_{(x,\,Q,\,A,\,b,\,c)} & \theta(x, Q, A, b, c) \\[4pt]
\text{subject to} & (x, Q, A, b, c) \in \Omega \\[4pt]
\text{and} & x \in \displaystyle\operatorname*{argmin}_{x'} \left\{ \tfrac{1}{2}\, (x')^T Q\, x' + c^T x' \;:\; A x' \le b \right\}.
\end{array}$$
Three salient features:
- for each (Q, A, b, c) there is a lower-level quadratic program (in box);
- an optimal solution x of that program is sought such that the upper-level constraint (x, Q, A, b, c) ∈ Ω is satisfied;
- a tuple (x, Q, A, b, c) with the above properties is sought to minimize the upper-level objective function θ(x, Q, A, b, c).
SLIDE 6 Bilevel support-vector regression

Given a finite set of in-sample data points $\{(x_i, y_i)\}_{i=1}^{n}$, fit a hyperplane $y = x^T w + b$ by solving the convex quadratic program for (w, b):

$$\operatorname*{minimize}_{(w,\,b)} \;\; C \sum_{i=1}^{n} \max\big( \, | w^T x_i + b - y_i | - \varepsilon, \; 0 \, \big) \; + \; \tfrac{1}{2}\, w^T w$$

for given (C, ε) > 0. Let (w(C, ε), b(C, ε)) be optimal. The inverse problem is to choose (C, ε) to minimize an error over a set of out-of-sample data, such as

$$\operatorname*{minimize}_{(C,\,\varepsilon)} \;\; \sum_{j=n+1}^{n+k} | \, w(C, \varepsilon)^T x_j + b(C, \varepsilon) - y_j \, |.$$

Extension to the statistical methodology of cross validation, including leave-one-out validation.
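To make the forward and inverse roles concrete, here is a minimal Python sketch; it replaces the bilevel (LPCC) treatment of the following slides with a naive grid search over (C, ε), and the synthetic data, the grid values, and the use of scikit-learn's LinearSVR (which solves this ε-insensitive quadratic program in primal form) are illustrative assumptions only.

```python
# A minimal sketch of the forward/inverse loop for support-vector regression.
# Assumptions: numpy and scikit-learn are available; the naive grid search
# below only illustrates the outer objective and is not the talk's bilevel
# (LPCC) solution method.
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))                    # synthetic features
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 1.0 + 0.1 * rng.normal(size=120)

X_in, y_in = X[:100], y[:100]                    # in-sample training data
X_out, y_out = X[100:], y[100:]                  # out-of-sample data

def out_of_sample_error(C, eps):
    """Forward model: solve the eps-insensitive SVR QP for (w, b), then
    evaluate the outer objective sum_j |w^T x_j + b - y_j| out of sample."""
    model = LinearSVR(C=C, epsilon=eps, loss="epsilon_insensitive",
                      max_iter=10000).fit(X_in, y_in)
    return np.abs(model.predict(X_out) - y_out).sum()

# Naive outer search standing in for the bilevel formulation.
grid = [(C, eps) for C in (0.1, 1.0, 10.0) for eps in (0.01, 0.1, 0.5)]
C_best, eps_best = min(grid, key=lambda p: out_of_sample_error(*p))
print(C_best, eps_best)
```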
SLIDE 7 The (inner-level) SVM quadratic program

$$\begin{array}{ll}
\displaystyle\operatorname*{minimize}_{(w,\,b,\,e)} & C \displaystyle\sum_{i=1}^{n} e_i \; + \; \tfrac{1}{2}\, w^T w \\[4pt]
\text{subject to, for all } i = 1, \dots, n: & e_i \ge w^T x_i + b - y_i - \varepsilon, \\
& e_i \ge -w^T x_i - b + y_i - \varepsilon, \ \text{and} \ e_i \ge 0,
\end{array}$$

and its Karush-Kuhn-Tucker optimality conditions:

$$\left.\begin{array}{l}
0 \le \lambda_i^{+} \perp e_i - w^T x_i - b + y_i + \varepsilon \ge 0 \\
0 \le \lambda_i^{-} \perp e_i + w^T x_i + b - y_i + \varepsilon \ge 0 \\
0 \le e_i \perp C - \lambda_i^{+} - \lambda_i^{-} \ge 0
\end{array}\right\} \quad i = 1, \dots, n,$$

$$0 = w + \sum_{i=1}^{n} ( \lambda_i^{+} - \lambda_i^{-} )\, x_i \qquad \text{and} \qquad 0 = \sum_{i=1}^{n} ( \lambda_i^{+} - \lambda_i^{-} ),$$

where ⊥ denotes the complementary slackness condition; thus 0 ≤ a ⊥ b ≥ 0 if and only if [a = 0 ≤ b] or [a ≥ 0 = b].
SLIDE 8 The bilevel SVM problem

$$\begin{array}{ll}
\displaystyle\operatorname*{minimize}_{(C,\,\varepsilon,\,e,\,w,\,b)} & \displaystyle\sum_{j=n+1}^{n+k} e_j \\[4pt]
\text{subject to} & e_j \ge | \, w^T x_j + b - y_j \, |, \quad j = n+1, \dots, n+k,
\end{array}$$

and the inner SVM KKT conditions.

- An instance of a linear program with linear complementarity constraints, abbreviated as an LPCC; i.e., a linear program except for the disjunctive complementarity slackness constraints.
- As such, it is a nonconvex optimization problem, albeit of a very special kind.
- In this application, the inverse process is to optimize an out-of-sample error based on an in-sample training set of data.
SLIDE 9 Extension: Cross-validated support-vector regression

T : a positive integer (the number of folds)
$\Omega = \bigcup_{t=1}^{T} \Omega_t$ : a partitioning of the data into disjoint subgroups
$N_t$ : index set of the data in $\Omega_t$, with complement $\bar N_t$.

The fold-t training subproblem:

$$\begin{array}{ll}
\displaystyle\operatorname*{minimize}_{(w^t,\,b^t)\,\in\,\Re^{n+1}} & \tfrac{1}{2}\, \| w^t \|_2^2 \; + \; \dfrac{C}{|\bar N_t|} \displaystyle\sum_{i \in \bar N_t} \max\big( \, | (w^t)^T x_i + b^t - y_i | - \varepsilon, \; 0 \, \big) \\[4pt]
\text{subject to} & -\bar w \le w^t \le \bar w, \ \text{for feature selection,}
\end{array}$$

yielding the fold-t loss, which depends on the choice of $(C, \varepsilon, \bar w)$:

$$\sum_{i \in N_t} | \, (w^t)^T x_i + b^t - y_i \, |.$$
SLIDE 10 Cross-validated support-vector regression (cont.)

Given $\bar C > \underline C \ge 0$, $\bar\varepsilon > \underline\varepsilon \ge 0$, and $w^{\mathrm{ub}} > w^{\mathrm{lb}}$,

$$\begin{array}{ll}
\displaystyle\operatorname*{minimize}_{C,\;\varepsilon,\;\bar w,\;\{(w^t,\,b^t)\}_{t=1}^{T}} & \displaystyle\sum_{t=1}^{T} \frac{1}{|N_t|} \sum_{i \in N_t} | \, (w^t)^T x_i + b^t - y_i \, | \\[4pt]
\text{subject to} & \underline C \le C \le \bar C, \quad \underline\varepsilon \le \varepsilon \le \bar\varepsilon, \quad w^{\mathrm{lb}} \le \bar w \le w^{\mathrm{ub}}, \\[4pt]
\text{and for } t = 1, \dots, T, & (w^t, b^t) \in \displaystyle\operatorname*{argmin}_{-\bar w \,\le\, w^t \,\le\, \bar w} \; \tfrac{1}{2}\, \| w^t \|_2^2 + \frac{C}{|\bar N_t|} \sum_{i \in \bar N_t} \max\big( | (w^t)^T x_i + b^t - y_i | - \varepsilon, \; 0 \big).
\end{array}$$

- same $(C, \varepsilon, \bar w)$ across all folds (one fold's training subproblem is sketched in code below)
- can easily accommodate other convex loss functions and constraints
- extension to parameterized kernel selection
- other tasks, such as classification and semi-supervised learning, can be similarly handled.
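Below is a minimal cvxpy sketch of one fold's training subproblem with the feature-selection box constraint; the data, the fold split, and the fixed values of (C, ε, w̄) are hypothetical stand-ins, since in the bilevel model they are upper-level decision variables.

```python
# A minimal cvxpy sketch of the fold-t training subproblem with the box
# constraint -wbar <= w <= wbar. Here (C, eps, wbar) are fixed hypothetical
# values; the bilevel model chooses them to minimize the cross-validated loss.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
X_tr = rng.normal(size=(80, 5))                  # data outside fold t
y_tr = X_tr @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=80)
C, eps = 1.0, 0.1
wbar = np.full(5, 2.0)                           # feature-selection bound

w = cp.Variable(5)
b = cp.Variable()
loss = cp.sum(cp.pos(cp.abs(X_tr @ w + b - y_tr) - eps))   # eps-insensitive
objective = cp.Minimize(0.5 * cp.sum_squares(w) + (C / X_tr.shape[0]) * loss)
problem = cp.Problem(objective, [w <= wbar, w >= -wbar])
problem.solve()

print(w.value, b.value)                          # fold-t regressor (w^t, b^t)
```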
SLIDE 11 A bilevel maximum-likelihood approach to target classification

- Problem. Given data points for two target classes, identified by the rows of the two matrices
  $X^{\mathrm{I}} \in \mathbb{R}^{n_1 \times d}$, for target class I, and
  $X^{\mathrm{II}} \in \mathbb{R}^{n_2 \times d}$, for target class II,
  determine a statistical model to classify future data as type I or II.

Our approach is as follows:

- Aggregate the data via a common set of weights $w \in W \subseteq \mathbb{R}^d$, obtaining the aggregated data $X^{\mathrm{I,II}} w$.
- Apply an m-term Gaussian mixture model

$$\Psi(x, \mu, \sigma, p) \;\equiv\; \sum_{i=1}^{m} p_i \, \frac{1}{\sigma_i \sqrt{2\pi}} \exp\!\left( -\frac{1}{2} \left( \frac{x - \mu_i}{\sigma_i} \right)^{2} \right)$$

to the aggregated data $X^{\mathrm{I,II}} w$.
SLIDE 12

- Determine the mixing coefficients $p_i$ via a log-likelihood maximization (one such problem per class):

$$p^{\mathrm{I,II}} \in \operatorname*{argmax}_{p} \; \sum_{j=1}^{n_{1},\,n_{2}} \log \Psi\big( X^{\mathrm{I,II}}_{j\bullet} w, \; \mu, \sigma, p \big) \quad \text{subject to} \; \sum_{i=1}^{m} p_i = 1, \; p \ge 0.$$

- The overall process chooses the parameters $(p^{\mathrm{I,II}}, w, \mu, \sigma)$ by maximizing a measure of separation between the two classes based on the given data $X^{\mathrm{I,II}}$:

$$\begin{array}{ll}
\displaystyle\operatorname*{argmax}_{p^{\mathrm{I,II}},\,w,\,\mu,\,\sigma} & \theta(p^{\mathrm{I,II}}, w, \mu, \sigma) \\[4pt]
\text{subject to} & w \in W \ \text{and} \\[4pt]
& p^{\mathrm{I,II}} \in \displaystyle\operatorname*{argmax}_{p} \left\{ \sum_{j=1}^{n_{1},\,n_{2}} \log \Psi\big( X^{\mathrm{I,II}}_{j\bullet} w, \; \mu, \sigma, p \big) \;:\; \sum_{i=1}^{m} p_i = 1, \; p \ge 0 \right\}.
\end{array}$$
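A minimal numpy/scipy sketch of the inner log-likelihood maximization over the mixing coefficients p, with (w, µ, σ) held fixed; the data, the number of terms m, and all parameter values are hypothetical stand-ins.

```python
# A minimal sketch of the inner problem: maximize the mixture log-likelihood
# over p on the simplex, with the aggregation weights w and the component
# parameters (mu, sigma) fixed at hypothetical values.
import numpy as np
from scipy.optimize import minimize

def log_likelihood(p, s, mu, sigma):
    """sum_j log Psi(s_j, mu, sigma, p) for the m-term Gaussian mixture."""
    # dens[j, i] is the i-th component density evaluated at s_j
    dens = np.exp(-0.5 * ((s[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log(dens @ p).sum()

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))                 # stacked class data (stand-in)
w = np.array([0.5, -1.0, 0.2, 1.5])          # fixed aggregation weights
s = X @ w                                    # aggregated data X w
mu = np.array([-1.0, 1.0])                   # m = 2 component means
sigma = np.array([0.8, 1.2])                 # component standard deviations

res = minimize(lambda p: -log_likelihood(p, s, mu, sigma),
               x0=np.full(2, 0.5), method="SLSQP",
               bounds=[(0.0, 1.0)] * 2,
               constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}])
print(res.x)                                 # maximizing mixing coefficients
```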
SLIDE 13
Pricing American options: the vanilla Black-Scholes model

Consider the forward pricing of an American put/call option on an underlying asset whose (random) price S(t) satisfies the stochastic differential equation

$$dS = \big( \mu S - D(S,t) \big)\, dt + \sigma(S,t)\, S \, dW,$$

where
- µ : drift of the price process,
- r : prevailing interest rate, assumed constant,
- D(S, t) : dividend rate of the asset,
- σ(S, t) : non-constant volatility of the asset,
- dW : standard Wiener process with mean zero and variance dt.

Let the Black-Scholes operator be denoted

$$\mathcal{L}_{\mathrm{BS}} \;\equiv\; \frac{\partial}{\partial t} + \frac{1}{2}\, \sigma^{2}(S,t)\, S^{2} \frac{\partial^{2}}{\partial S^{2}} + \big( r S - D(S,t) \big) \frac{\partial}{\partial S} - r.$$
SLIDE 14
The forward pricing model

The American option price V(S, t) satisfies the partial-differential linear complementarity system: for (S, t) ∈ (0, ∞) × [0, T],

$$0 \le V(S,t) - \Lambda(S,t) \;\perp\; \mathcal{L}_{\mathrm{BS}}(V) \le 0,$$

plus boundary conditions at the terminal time t = T and at the extreme asset values S = 0, ∞, where
- T : time of expiry of the option,
- Λ(S, t) : payoff function of the option at expiry.

The complementarity expresses the early-exercise feature of an American option.
SLIDE 15
The discretized complementarity problem

Discretizing time and asset values, we obtain a finite-dimensional linear complementarity problem, parameterized by the asset volatilities:

$$0 \le V - \Lambda \;\perp\; q(\sigma) + M(\sigma)\, V \ge 0,$$

where $V \equiv \{ V(m\,\delta S, n\,\delta t) \}$ is the vector of approximated option prices at times t = nδt and asset values S = mδS. With a suitable discretization, M(σ) is a strictly row diagonally dominant, albeit not always symmetric, matrix for fixed σ.

Extensions to multiple-state problems, such as options on several assets, models with stochastic volatilities and interest rates, as well as some exotic options.
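A classical solver for such lower-obstacle LCPs is projected SOR; below is a minimal sketch on a tiny symmetric stand-in matrix, assuming M has a positive diagonal (as in the diagonally dominant discretizations just mentioned). The data are hypothetical, not an actual Black-Scholes discretization.

```python
# A minimal projected-SOR sketch for the lower-obstacle LCP
#   0 <= V - Lam  (perp)  q + M V >= 0,
# assuming M has a positive diagonal. The tiny instance below is a
# hypothetical stand-in, not a Black-Scholes discretization.
import numpy as np

def psor(M, q, Lam, omega=1.2, tol=1e-10, max_iter=10_000):
    V = np.maximum(Lam, 0.0).astype(float)        # feasible start: V >= Lam
    for _ in range(max_iter):
        V_old = V.copy()
        for i in range(len(q)):
            r = q[i] + M[i] @ V                   # residual of row i
            V[i] = max(Lam[i], V[i] - omega * r / M[i, i])  # project onto V_i >= Lam_i
        if np.max(np.abs(V - V_old)) < tol:
            break
    return V

M = np.array([[3.0, -1.0], [-1.0, 3.0]])
q = np.array([-1.0, -2.0])
Lam = np.array([0.0, 0.5])
V = psor(M, q, Lam)
print(V, q + M @ V, V - Lam)                      # complementarity check
```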
SLIDE 16 The inverse pricing problem

Suppose $V^{\mathrm{obs}}$ is a set of observed American option prices. We want to determine an implied volatility surface that minimizes the deviations between the model and observed option prices:

$$\begin{array}{ll}
\displaystyle\operatorname*{minimize}_{\sigma,\,V} & \| V - V^{\mathrm{obs}} \| \\[4pt]
\text{subject to} & 0 \le V - \Lambda \;\perp\; q(\sigma) + M(\sigma)\, V \ge 0,
\end{array}$$

possibly with additional constraints, such as positivity of σ.

- An instance of a mathematical program with complementarity constraints, abbreviated as MPCC.
- Computationally more challenging than an LPCC due to the nonlinear product M(σ)V, in addition to the complementarity constraint.
- Research is underway to develop an efficient algorithm.
SLIDE 17
A Unified Mathematical Framework
A Mathematical Program with Complementarity Constraints:

$$\begin{array}{ll}
\displaystyle\operatorname*{minimize}_{(x,\,y,\,w)} & \theta(x, y, w) \\[4pt]
\text{subject to} & g(x, y, w) \le 0 \\
& h(x, y, w) = 0 \\
\text{and} & 0 \le y \;\perp\; w \ge 0,
\end{array}$$

where the functions θ (real-valued) and g and h (both vector-valued) are all (possibly twice) continuously differentiable.

A feasible triple $(\bar x, \bar y, \bar w)$ is stationary if $(dx, dy, dw) = (0, 0, 0)$ is a global minimizer of the linearization of the MPCC at $(\bar x, \bar y, \bar w)$:

$$\begin{array}{ll}
\displaystyle\operatorname*{minimize}_{(dx,\,dy,\,dw)} & \nabla_x \theta(\bar x, \bar y, \bar w)^T dx + \nabla_y \theta(\bar x, \bar y, \bar w)^T dy + \nabla_w \theta(\bar x, \bar y, \bar w)^T dw \\[4pt]
\text{subject to} & g(\bar x, \bar y, \bar w) + J_x g(\bar x, \bar y, \bar w)\, dx + J_y g(\bar x, \bar y, \bar w)\, dy + J_w g(\bar x, \bar y, \bar w)\, dw \le 0 \\
& J_x h(\bar x, \bar y, \bar w)\, dx + J_y h(\bar x, \bar y, \bar w)\, dy + J_w h(\bar x, \bar y, \bar w)\, dw = 0 \\
\text{and} & 0 \le dy \;\perp\; dw \ge 0,
\end{array}$$

which is an LPCC.
SLIDE 18 Some theoretical issues

- Is stationarity necessary and/or sufficient for (local) optimality?
  — Necessity depends on non-traditional constraint qualifications.
  — Sufficiency holds for local optimality if the objective is (pseudo-)convex and the constraint functions g and h are affine.
- How do we calculate (and verify) a stationary tuple?
  — This requires the global resolution of an LPCC, a recent research topic.
  — It is possible when the objective is convex and the constraint functions g and h are affine, but quite impossible in general.

The principal challenge of the MPCC, besides the possible non-convexity of the constraint functions, is the complementarity constraint, which arises as a result of the lower-level optimality/equilibrium conditions.
SLIDE 19
Breaking up the complementarities

An MPCC is equivalent to $2^m$ nonlinear programs, each called a piece and derived from a subset $\alpha \subseteq \{1, \dots, m\}$ with complement $\bar\alpha$, where m is the dimension of the variables y and w:

$$\mathrm{NLP}(\alpha): \quad \begin{array}{ll}
\displaystyle\operatorname*{minimize}_{(x,\,y,\,w)} & \theta(x, y, w) \\[4pt]
\text{subject to} & g(x, y, w) \le 0 \\
& h(x, y, w) = 0 \\
& w_\alpha \ge 0 = y_\alpha \ \ \text{and} \ \ w_{\bar\alpha} = 0 \le y_{\bar\alpha}.
\end{array}$$

The goal is to develop practically efficient algorithms without complete enumeration of all the pieces.
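To illustrate the pieces, here is a minimal sketch that enumerates them with scipy's linprog on a hypothetical toy LPCC with a single complementarity pair (m = 1); the instance is invented for illustration and reappears in the later sketches.

```python
# A minimal sketch enumerating the 2^m pieces of a toy LPCC (m = 1):
#   minimize  -x - y
#   subject to x + y <= 2,  x + w = 1,  x >= 0,  0 <= y  (perp)  w >= 0.
# Each piece fixes y = 0 (with w >= 0 free) or w = 0 (with y >= 0 free).
import numpy as np
from scipy.optimize import linprog

c = np.array([-1.0, -1.0, 0.0])                   # variables (x, y, w)
A_ub, b_ub = np.array([[1.0, 1.0, 0.0]]), [2.0]   # x + y <= 2
A_eq, b_eq = np.array([[1.0, 0.0, 1.0]]), [1.0]   # x + w = 1

best = None
for y_is_zero in (True, False):                   # the two pieces NLP(alpha)
    y_bnd = (0, 0) if y_is_zero else (0, None)
    w_bnd = (0, None) if y_is_zero else (0, 0)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None), y_bnd, w_bnd])
    if res.success and (best is None or res.fun < best.fun):
        best = res
print(best.x, best.fun)    # global minimum: x = 1, y = 1, w = 0, value -2
```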
SLIDE 20 Nonlinear programming approaches

- Smoothing: for ε > 0 sufficiently small,

$$\begin{array}{ll}
\displaystyle\operatorname*{minimize}_{(x,\,y,\,w)} & \theta(x, y, w) \\[4pt]
\text{subject to} & g(x, y, w) \le 0, \quad h(x, y, w) = 0, \\
& (y, w) \ge 0, \ \text{and} \ y^T w \le \varepsilon.
\end{array}$$

- Penalization: for ρ > 0 sufficiently large (sketched in code below),

$$\begin{array}{ll}
\displaystyle\operatorname*{minimize}_{(x,\,y,\,w)} & \theta(x, y, w) + \rho\, y^T w \\[4pt]
\text{subject to} & g(x, y, w) \le 0, \quad h(x, y, w) = 0, \ \text{and} \ (y, w) \ge 0.
\end{array}$$

- NEOS solvers are practically quite efficient and robust.
- Yet they are incapable of ascertaining the quality of the computed solution, motivating research on the global resolution of MPCCs.
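Here is a minimal scipy sketch of the penalization approach on the same hypothetical toy LPCC as in the enumeration sketch above; rho = 100 and the starting point are arbitrary choices, and SLSQP is a local solver, so only a stationary point is guaranteed.

```python
# A minimal sketch of the penalization approach on the toy LPCC:
#   minimize  -x - y + rho * y * w
#   subject to x + y <= 2,  x + w = 1,  (x, y, w) >= 0.
# The complementarity constraint has been moved into the objective.
import numpy as np
from scipy.optimize import minimize

rho = 100.0                                        # penalty parameter

def objective(v):
    x, y, w = v
    return -x - y + rho * y * w

constraints = [
    {"type": "ineq", "fun": lambda v: 2.0 - v[0] - v[1]},  # x + y <= 2
    {"type": "eq",   "fun": lambda v: v[0] + v[2] - 1.0},  # x + w = 1
]
res = minimize(objective, x0=np.array([0.5, 0.5, 0.5]), method="SLSQP",
               bounds=[(0, None)] * 3, constraints=constraints)
print(res.x)    # a local solver; the global minimizer is (1, 1, 0)
```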
SLIDE 21 Formulation as a mixed nonlinear integer program

For ρ > 0 sufficiently large (a toy instance of this formulation is sketched after the bullets below),

$$\begin{array}{ll}
\displaystyle\operatorname*{minimize}_{(x,\,y,\,w,\,z)} & \theta(x, y, w) \\[4pt]
\text{subject to} & g(x, y, w) \le 0, \quad h(x, y, w) = 0, \\
& 0 \le y \le \rho\, z, \quad 0 \le w \le \rho\, (\mathbf{1} - z), \\
\text{and} & z \in \{0, 1\}^m.
\end{array}$$
- Drawback: Applicable only to feasible MPCCs with computable bounds.
- Goal: Follow this conceptual formulation but treat ρ implicitly.
- Current work focuses on the LPCC.
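Below is a minimal PuLP sketch of the big-ρ integer formulation on the same toy LPCC; the bound ρ = 10 is an assumed valid a-priori bound on y and w, and PuLP with its bundled CBC solver is an illustrative choice.

```python
# A minimal sketch of the mixed-integer formulation of the toy LPCC,
# with an assumed valid bound rho = 10 on y and w:
#   z = 0 forces y = 0;  z = 1 forces w = 0.
import pulp

rho = 10.0
prob = pulp.LpProblem("toy_lpcc", pulp.LpMinimize)
x = pulp.LpVariable("x", lowBound=0)
y = pulp.LpVariable("y", lowBound=0)
w = pulp.LpVariable("w", lowBound=0)
z = pulp.LpVariable("z", cat="Binary")

prob += -x - y                       # objective
prob += x + y <= 2                   # g(x, y, w) <= 0
prob += x + w == 1                   # h(x, y, w) = 0
prob += y <= rho * z                 # complementarity via the binary z
prob += w <= rho * (1 - z)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.value(x), pulp.value(y), pulp.value(w))   # expect 1, 1, 0
```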
SLIDE 22 Concluding remarks We have
- introduced the idea of inverse optimization
- described three applications in statistics and finance, and
- briefly mentioned some solution approaches.
We welcome questions and discussion about the omitted details. Thank you!