Generalized Derivatives
Automatic Evaluation & Implications for Algorithms Paul I. Barton, Kamil A. Khan & Harry A. J. Watson
Process Systems Engineering Laboratory Massachusetts Institute of Technology
◆ Semismooth Newton method: solve H(x^k) d^k = −f(x^k), x^(k+1) = x^k + d^k, where H(x^k) is some element of a generalized derivative of f at x^k
◆ Linear programming (LP) Newton method: at each iterate, solve
  min γ over (x, γ) ∈ Ω × ℝ subject to
  ‖f(x^k) + H(x^k)(x − x^k)‖_∞ ≤ γ ‖f(x^k)‖_∞²,
  ‖x − x^k‖_∞ ≤ γ ‖f(x^k)‖_∞,
  where Ω is a polyhedral set
Kojima & Shindo (1986), Qi & Sun (1993), Facchinei, Fischer & Herrich (2014).
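The semismooth Newton update above can be sketched on a toy scalar example. The function f below is my own illustration (not from the talk): a PC1 function whose generalized derivative at the kink is selected arbitrarily.

```python
# Semismooth Newton on a toy piecewise-C1 function (illustrative sketch).
# f(x) = max(x, 2*x) - 1 has its unique root at x = 0.5.
def f(x):
    return max(x, 2.0 * x) - 1.0

def H(x):
    # One element of the B-subdifferential of f at x:
    # the slope of the active piece (pick 2 at the kink x = 0).
    return 2.0 if x >= 0.0 else 1.0

def semismooth_newton(x, tol=1e-12, max_iter=50):
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) <= tol:
            break
        x = x - fx / H(x)   # Newton step with a generalized-derivative element
    return x

print(semismooth_newton(-3.0))  # converges to 0.5
```

For this piecewise linear f the iteration is finitely convergent; for general semismooth f the convergence is locally Q-superlinear, as stated above.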
◆ Suppose f is locally Lipschitz ⇒ f is differentiable on a set S of full measure (Rademacher)
◆ B-subdifferential: ∂_B f(x) := {H : H = lim_{i→∞} Jf(x^(i)), x = lim_{i→∞} x^(i), x^(i) ∈ S}
◆ Clarke Jacobian: ∂f(x) := conv ∂_B f(x)
◆ Useful properties of ∂f(x):
Ø Nonempty, convex, and compact
Ø Satisfies mean-value theorem, implicit/inverse function theorems
Ø Reduces to subdifferential/derivative when f is convex/strictly differentiable
◆ Example: f(x) = |x|: ∂f(x) = {1} for x > 0, ∂f(x) = {−1} for x < 0, ∂_B f(0) = {−1, 1}, ∂f(0) = [−1, 1]
Clarke (1973).
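The B-subdifferential definition can be probed numerically: take gradients at differentiable points converging to the point of interest. A minimal sketch for |x| at 0 (my own example, matching the one above):

```python
# Numerically probing the B-subdifferential of f(x) = |x| at 0 (sketch):
# collect limits of gradients Jf(x_i) along sequences x_i -> 0 in S.
def grad_abs(x):
    # derivative of |x| at a differentiable point (any x != 0)
    return 1.0 if x > 0 else -1.0

# sequences approaching 0 from the right and from the left
from_right = [grad_abs(10.0 ** -k) for k in range(1, 8)]
from_left = [grad_abs(-(10.0 ** -k)) for k in range(1, 8)]

# each gradient sequence is constant, so its limit is an element of ∂_B f(0)
b_subdifferential = {from_right[-1], from_left[-1]}
print(sorted(b_subdifferential))  # [-1.0, 1.0]
```

The Clarke Jacobian is then the convex hull of these limits, here [−1, 1].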
◆ Suppose the generalized derivative contains no singular elements near the solution
◆ Semismooth Newton method:
Ø local Q-superlinear convergence if f is semismooth
Ø local Q-quadratic convergence if f is strongly semismooth
◆ Semismooth Newton & LP-Newton methods for nonsmooth equation systems:
Ø local Q-quadratic convergence under appropriate regularity conditions
◆ Automatic/Algorithmic Differentiation (AD)
Ø Automatic methods for computing derivatives in complex settings
Ø Automatic method for computing elements of generalized derivatives?
Ø Computationally relevant generalized derivatives
◆ Automatically evaluating elements of the Clarke Jacobian is difficult
◆ Lack of sharp calculus rules: let g(x) = max{0, x} and h(x) = min{0, x}, so that f(x) = g(x) + h(x) = x
Ø 0 ∈ ∂g(0) = [0, 1] and 0 ∈ ∂h(0) = [0, 1], yet (0 + 0) ∉ ∂f(0) = {1}
Ø the sum rule holds only as an inclusion: ∂f(0) ⊂ ∂g(0) + ∂h(0)
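The failure of the naive sum rule can be checked directly. A short sketch using the same g, h, f as above:

```python
# Demonstration (sketch): the Clarke sum rule is only an inclusion.
# g(x) = max(0, x), h(x) = min(0, x), f = g + h is the identity, so ∂f(0) = {1}.
# Yet 0 is a valid subgradient of both g and h at 0, and 0 + 0 = 0 != 1.

def dir_deriv(fun, x, d, t=1e-8):
    # one-sided finite-difference estimate of the directional derivative fun'(x; d)
    return (fun(x + t * d) - fun(x)) / t

f = lambda x: max(0.0, x) + min(0.0, x)  # equals x everywhere

# f is differentiable at 0 with derivative 1 ...
assert abs(dir_deriv(f, 0.0, 1.0) - 1.0) < 1e-6
assert abs(dir_deriv(f, 0.0, -1.0) - (-1.0)) < 1e-6

# ... so picking 0 from ∂g(0) and 0 from ∂h(0) and summing gives 0,
# which is not an element of ∂f(0) = {1}.
print("sum-rule element 0 is not in the Clarke gradient {1}")
```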
◆ Directional derivative: f′(x; d) := lim_{t→0+} [f(x + td) − f(x)] / t
◆ Sharp chain rule for locally Lipschitz, directionally differentiable functions: (f ∘ g)′(x; d) = f′(g(x); g′(x; d))
◆ Forward-mode AD gives the directional derivative
◆ PC1 functions: continuous functions that locally select among a finite collection of C1 functions
◆ The 2-norm is not PC1
Griewank (1994), Scholtes (2012).
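The sharpness of this chain rule, even at kinks, can be verified with one-sided finite differences. The composition below (abs ∘ max) is my own toy example:

```python
# Sketch: the sharp chain rule (f∘g)'(x; d) = f'(g(x); g'(x; d)) holds at
# kinks of locally Lipschitz, directionally differentiable functions.
# Example: f = abs, g(x) = max(0, x), evaluated at the kink x = 0.
def dd(fun, x, d, t=1e-8):
    # one-sided finite-difference estimate of fun'(x; d)
    return (fun(x + t * d) - fun(x)) / t

g = lambda x: max(0.0, x)
f = abs
comp = lambda x: f(g(x))

for d in (1.0, -1.0):
    lhs = dd(comp, 0.0, d)               # (f∘g)'(0; d)
    rhs = dd(f, g(0.0), dd(g, 0.0, d))   # f'(g(0); g'(0; d))
    assert abs(lhs - rhs) < 1e-6
print("sharp chain rule verified at the kink x = 0")
```

Note that the directional derivative is propagated through the inner function first, exactly as forward-mode AD does.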
◆ PC1 functions have piecewise linear directional derivatives: direction space is subdivided into cones, and f′(x; d) = B^(1) d, B^(2) d, or B^(3) d according to the cone containing d
[Figure: conic subdivision of the (d1, d2)-plane with the three linear pieces B^(1), B^(2), B^(3)]
◆ Directional derivatives in the coordinate directions alone do not identify a generalized derivative element
◆ This also defeats finite differences: sampling f only along coordinate directions can produce a matrix outside every generalized derivative
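Why coordinate directions fail can be shown on a two-variable example of my own, f(x, y) = max(x, y) at its kink:

```python
# Sketch: for the PC1 function f(x, y) = max(x, y) at the kink (0, 0),
# one-sided finite differences along the coordinate axes yield the row [1, 1],
# which is NOT an element of the Clarke generalized gradient conv{[1,0],[0,1]}.
def f(x, y):
    return max(x, y)

t = 1e-8
fd_row = [(f(t, 0.0) - f(0.0, 0.0)) / t,   # slope along e1 -> 1
          (f(0.0, t) - f(0.0, 0.0)) / t]   # slope along e2 -> 1

# every element of conv{[1,0],[0,1]} has entries summing to exactly 1,
# but the coordinate finite-difference row sums to 2
print(fd_row, "sum =", sum(fd_row))
```

Probing a single general direction d instead gives a valid directional derivative, which motivates the lexicographic approach below.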
◆ The lexicographic subdifferential ∂_L f(x) may be a strict subset of the Clarke Jacobian ∂f(x)
Nesterov (1987), Khan and Barton (2014), Khan and Barton (2015).
◆ The following functions are L-smooth:
Ø Continuously differentiable functions
Ø Convex functions (e.g. abs, 2-norm)
Ø PC1 functions
Ø Compositions of L-smooth functions
Ø Integrals of L-smooth functions: c ↦ ∫_a^b g(t, c) dt
Ø Solutions of ODEs with L-smooth right-hand sides: c ↦ x(b, c), where dx/dt (t, c) = g(t, x(t, c)), x(0, c) = c
◆ L-subdifferential:
Ø Contains the L-derivatives for all nonsingular direction matrices: J_L f(x; M), det M ≠ 0
◆ Useful properties:
Ø L-derivatives equal the classical derivative wherever f is strictly differentiable
Ø For scalar-valued f, L-derivatives are elements of the Clarke generalized gradient
Ø Contains only subgradients when f is convex
Ø Contained in the plenary hull of the Clarke Jacobian, and can be used in place of the Clarke Jacobian in numerical methods
Ø For PC1 functions, L-derivatives are elements of the B-subdifferential
Ø Satisfies a sharp chain rule, expressed naturally using LD-derivatives
Nesterov (1987), Khan and Barton (2014), Khan and Barton (2015).
◆ Extension of the classical directional derivative
◆ LD-derivative: for any M := [m^(1) ⋯ m^(p)] ∈ ℝ^(n×p),
  f′(x; M) := [f_{x,M}^(0)(m^(1))  f_{x,M}^(1)(m^(2))  ⋯  f_{x,M}^(p−1)(m^(p))]
◆ If M is square and nonsingular: f′(x; M) = J_L f(x; M) M
◆ If f is differentiable at x: f′(x; M) = Jf(x) M
◆ Sharp LD-derivative chain rule: (f ∘ g)′(x; M) = f′(g(x); g′(x; M))
Khan and Barton (2015).
◆ The sharp chain rule immediately implies that, given the "seed directions" M, forward-mode AD can compute f′(x; M)
◆ Need calculus rules for "elementary functions":
Ø abs, min, max, mid, ‖·‖₂, etc.
Ø an algorithm for "elemental PC1 functions"
Ø linear programs and lexicographic linear programs parameterized by their RHSs
Ø implicit functions: w′(ẑ; M) is the unique solution N of h′((ŷ, ẑ); (N, M)) = 0
Khan and Barton (2015), Khan and Barton (2013), Hoeffner et al. (2015).
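The forward propagation can be sketched with a small hand-rolled class (my own construction, not the authors' implementation): each quantity carries its value plus p directional derivatives, and nonsmooth elementals break ties lexicographically.

```python
# Minimal lexicographic forward-mode sketch: each LDNum carries a value and a
# tuple of p directional derivatives. Nonsmooth elementals (abs, max) resolve
# ties lexicographically, which is what makes the result an LD-derivative.
class LDNum:
    def __init__(self, val, ders):
        self.val, self.ders = val, tuple(ders)
    def __add__(self, o):
        return LDNum(self.val + o.val, [a + b for a, b in zip(self.ders, o.ders)])
    def __sub__(self, o):
        return LDNum(self.val - o.val, [a - b for a, b in zip(self.ders, o.ders)])

def ld_max(u, v):
    # lexicographic comparison of (value, d1, ..., dp) selects the active branch
    if (u.val,) + u.ders >= (v.val,) + v.ders:
        return LDNum(u.val, u.ders)
    return LDNum(v.val, v.ders)

def ld_abs(u):
    neg = LDNum(-u.val, [-d for d in u.ders])
    return ld_max(u, neg)

# Example: f(x1, x2) = |x1 - x2| at the kink x = (0, 0), seed matrix M = I.
x1 = LDNum(0.0, (1.0, 0.0))
x2 = LDNum(0.0, (0.0, 1.0))
y = ld_abs(x1 - x2)
print(y.ders)  # (1.0, -1.0): this row lies in the B-subdifferential
```

At the kink, the value alone (0 vs. 0) cannot decide which branch of abs is active; the first directional derivative breaks the tie, and the resulting row [1, −1] is indeed an element of ∂_B f(0, 0) = {[1, −1], [−1, 1]}.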
◆ Inexact Newton method: solve the Newton system iteratively
◆ But the directional derivative is not a linear function of the directions…
◆ Let H := J_L f(x; M), M nonsingular; then H can serve as the Newton matrix
◆ But M is not known in advance
◆ Compute the columns of f′(x; M) one at a time:
Ø computation of a column affects subsequent columns
Ø automatic code can be "locked" to record the influence of earlier columns
◆ Local Q-superlinear & Q-quadratic convergence rates can be achieved
◆ LD-derivative: f′(x; M) = [f_{x,M}^(0)(m^(1)) ⋯ f_{x,M}^(p−1)(m^(p))]
◆ FD approximation of f′(x; M) using p + 1 function evaluations, with M := [m^(1) ⋯ m^(p)] ∈ ℝ^(n×p):
  f_{x,M}^(0)(m^(1)) ≈ α^(−1)[f(x + αm^(1)) − f(x)] =: D_{αm^(1)}[f](x)
  f_{x,M}^(1)(m^(2)) ≈ D_{αm^(2)}[f_{x,M}^(0)](m^(1)) = D_{αm^(2)} D_{αm^(1)}[f](x)
  ⋮
  f_{x,M}^(p−1)(m^(p)) ≈ D_{αm^(p)}[f_{x,M}^(p−2)](m^(p−1)) = D_{αm^(p)} ⋯ D_{αm^(2)} D_{αm^(1)}[f](x)
◆ Expanding the nested differences, the evaluation points are x, x + αm^(1), x + αm^(1) + α²m^(2), …; for example,
  f_{x,M}^(1)(m^(2)) ≈ α^(−2)[f(x + αm^(1) + α²m^(2)) − f(x + αm^(1))]
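These nested differences can be sketched on the kink of max (my own toy instance):

```python
# Sketch of the nested finite-difference approximation of an LD-derivative,
# for f(x) = max(x1, x2) at the kink x = (0, 0) with seed matrix M = I.
# Evaluation points: x, x + a*m1, x + a*m1 + a^2*m2 (p + 1 = 3 evaluations).
def f(x):
    return max(x[0], x[1])

a = 1e-4
x = [0.0, 0.0]
m1, m2 = [1.0, 0.0], [0.0, 1.0]

p0 = f(x)
p1 = f([x[i] + a * m1[i] for i in range(2)])
p2 = f([x[i] + a * m1[i] + a * a * m2[i] for i in range(2)])

col1 = (p1 - p0) / a        # ≈ first lexicographic derivative = 1
col2 = (p2 - p1) / (a * a)  # ≈ second lexicographic derivative = 0
print([col1, col2])  # ≈ [1.0, 0.0], an element of the B-subdifferential
```

The staggered step sizes α, α² ensure each later direction perturbs within the cone already selected by the earlier ones.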
◆ The cost of AD can be reduced when the Jacobian is sparse
Ø Find structurally orthogonal columns
Ø Perform a vector forward pass with a compressed n × p seed matrix rather than the n × n identity
◆ AD for LD-derivatives → the order of the directions matters
Ø Corresponding to M is an uncompressed (permutation) matrix Q:
» M = QD for some matrix D
Ø Procedure:
» Identify matrices Q, D, and M
» Perform a vector forward pass to calculate f′(x; M)
» Copy entries of f′(x; M) into the entries of the sparse data structure (i.e. by sparse permutation)
◆ Classical compression is based on the assumption that structurally orthogonal columns can be treated independently
Ø this is not true in general for LD-derivatives
◆ Multistream heat exchanger (MHEX) model: an energy balance over the hot streams i ∈ H and cold streams j ∈ C, a pinch condition min_{p∈P} (EBP_p^C − EBP_p^H) = 0, and an area summation over the intervals k ∈ K, k ≠ |K|
Watson et al. (2015).
◆ The LD-derivative mapping t ↦ x′(t, c; M) uniquely solves the sensitivity ODE (for an autonomous right-hand side g):
  (d/dt) x′(t, c; M) = g′(x(t, c); x′(t, c; M)),  x′(0, c; M) = M
◆ Boundary value problem: solve with a semismooth (inexact) Newton method using the chain rule for LD-derivatives
Ø if in addition f is semismooth, then Q-superlinear convergence rate
Ø if it happens to be PC1, then Q-quadratic convergence rate
Khan and Barton (2014), Khan and Barton (2015), Pang and Stewart (2009).
◆ L-derivatives are "computationally relevant" generalized derivatives
◆ They can be computed automatically for broad classes of functions
◆ Strong theory gives practically computable generalized derivatives for:
Ø Implicit functions
Ø Algorithms
Ø ODE solutions
Ø Linear programs
Ø etc.
◆ Peter Stechlinski ◆ Novartis ◆ Statoil
[1]: Y. Nesterov, Math. Program. B, 104 (2005) 669–700.
◆ f : X ⊆ ℝ^n → ℝ^m is L-smooth at x ∈ X if it is locally Lipschitz continuous and directionally differentiable, and if, for any M := [m^(1) ⋯ m^(p)] ∈ ℝ^(n×p), the following functions exist:
  f_{x,M}^(0) : d ↦ f′(x; d)
  f_{x,M}^(1) : d ↦ [f_{x,M}^(0)]′(m^(1); d)
  ⋮
  f_{x,M}^(p) : d ↦ [f_{x,M}^(p−1)]′(m^(p); d)
◆ If the columns of M span ℝ^n, then f_{x,M}^(p) is linear
◆ Lexicographic subdifferential: ∂_L f(x) = {J_L f(x; M) : M ∈ ℝ^(n×n), det M ≠ 0}, where J_L f(x; M) := J f_{x,M}^(p)(0)
◆ The class of L-smooth functions is closed under composition, and includes all smooth functions and all convex functions
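The homogenization sequence in this definition can be traced by hand for |x| at 0 with the single seed direction M = [1] (my own worked instance, using finite differences to stand in for the exact directional derivatives):

```python
# Sketch: the lexicographic homogenization sequence for f(x) = |x| at x = 0
# with M = [1], approximated by one-sided finite differences.
def dd(fun, x, d, t=1e-7):
    # one-sided directional derivative estimate fun'(x; d)
    return (fun(x + t * d) - fun(x)) / t

f = abs
f0 = lambda d: dd(f, 0.0, d)   # f0(d) = |d|             (still nonsmooth)
f1 = lambda d: dd(f0, 1.0, d)  # f1(d) = [f0]'(1; d) = d (now linear)

# f1 is linear, so its slope is the L-derivative J_L f(0; [1]) = 1,
# an element of the B-subdifferential {-1, 1} of |x| at 0.
print(round(f1(1.0), 6), round(f1(-1.0), 6))
```

One nonsmooth elemental per seed direction gets "smoothed out" at each stage; once the directions span the space, the final map is linear and its Jacobian is the L-derivative.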
1: Khan and Barton, submitted.
◆ LD-derivatives for inverse functions:
Ø Suppose f is L-smooth and locally invertible near ŷ, with f(ŷ) = ẑ, and f^(−1) is Lipschitz near ẑ
Ø Result: f^(−1) is also L-smooth at ẑ
Ø Result: [f^(−1)]′(ẑ; M) is the unique solution N of f′(ŷ; N) = M
◆ LD-derivatives for implicit functions:
Ø Suppose h is L-smooth with h(ŷ, ẑ) = 0, and there exists an implicit function w such that h(w(z), z) = 0 for each z near ẑ
Ø Result: w is L-smooth at ẑ
Ø Result: w′(ẑ; M) is the unique solution N of h′((ŷ, ẑ); (N, M)) = 0
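The inverse-function rule can be sanity-checked numerically on a smooth scalar example of my own (where the LD-derivative reduces to the ordinary derivative):

```python
# Sketch: checking that N = [f^{-1}]'(z; M) solves f'(y; N) = M,
# on the smooth, strictly increasing example f(y) = y**3 + y.
def dd(fun, x, d, t=1e-6):
    # one-sided directional-derivative estimate
    return (fun(x + t * d) - fun(x)) / t

f = lambda y: y**3 + y

def f_inv(z, lo=-10.0, hi=10.0):
    # invert f by bisection (valid since f is strictly increasing)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

y_hat, M = 1.0, 1.0
z_hat = f(y_hat)                  # = 2.0
N = dd(f_inv, z_hat, M)           # ≈ [f^{-1}]'(z; M) = 1/4
residual = dd(f, y_hat, N) - M    # f'(y; N) - M should vanish
print(abs(residual) < 1e-3)
```

In the nonsmooth L-smooth setting the same defining equation holds, but N must then be found by a lexicographic (column-by-column) solve rather than a single linear solve.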
◆ Consider the set of points at which either composite curve changes slope
Ø Index this set of points with the set K
Ø Each point k has an associated enthalpy value
[Figure: composite curves with points 1, 2, …, k, …, |K| marked]
◆ If both temperatures at adjacent points k and k + 1 are known, the heat-transfer area of the interval between them can be calculated
[Figure: composite curves with the interval between points k and k + 1 highlighted]
◆ Summing the area over all intervals gives the total area: a sum of the per-interval areas over k ∈ K, k ≠ |K|
[Figure: composite curves partitioned into the intervals indexed by K]
◆ Difficulties:
Ø The enthalpies and temperatures need to be sorted
Ø Not all of the temperatures are known
◆ Consider (naïve) bubble sort:
Ø the only calculations involved are taking min/max of two entries
Ø the same sequence and number of calculations is performed regardless of whether the input is well-sorted or not
◆ Naïve bubble sort is therefore a composite PC1 function
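A sketch of such a sort, written so that every data-dependent operation is a two-argument min or max (a fixed sequence of PC1 elementals, independent of the input ordering):

```python
# Naive bubble sort as a composite PC1 function (sketch): each compare-and-swap
# is expressed as a min/max pair, so the operation sequence never branches on
# the data and the whole sort is a composition of PC1 elementals.
def minmax_bubble_sort(values):
    v = list(values)
    n = len(v)
    for i in range(n - 1):
        for j in range(n - 1 - i):
            lo = min(v[j], v[j + 1])   # PC1 elemental
            hi = max(v[j], v[j + 1])   # PC1 elemental
            v[j], v[j + 1] = lo, hi    # branch-free compare-and-swap
    return v

print(minmax_bubble_sort([3.0, 1.0, 2.0]))  # [1.0, 2.0, 3.0]
```

Because the operation sequence is fixed, the LD-derivative calculus rules for min/max can be applied step by step through the sort.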
◆ Finding the unknown temperatures involves solving one equation in one unknown
◆ Easily solved using "if-else" logic
Ø this correctly calculates the values of the unknown temperatures, but not their LD-derivatives ← "if-else" logic is not composite PC1
◆ Solution of the equation instead defines an implicit function mapping the known quantities to each unknown temperature T_k
◆ Its LD-derivatives follow from the generalization of the implicit function theorem
◆ Once all temperatures are found, the driving forces and areas can be calculated
Ø Requires modification of the standard log-mean temperature difference calculation so that it remains defined when the driving forces at the two ends of an interval coincide:
  ΔT_LM,k = (ΔT_{k+1} − ΔT_k) / (ln ΔT_{k+1} − ln ΔT_k),  with ΔT_LM,k := ΔT_k when ΔT_{k+1} = ΔT_k
◆ Finally, calculate the total area by summing the per-interval areas over k ∈ K, k ≠ |K|
◆ The modified log-mean temperature difference is a continuously differentiable elemental function
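A minimal sketch of such an LMTD routine, with the removable singularity at equal driving forces handled (the tolerance-based switch is my own simplification of the modification described above):

```python
# Log-mean temperature difference with the removable singularity at
# dt1 == dt2 handled: the limit as dt2 -> dt1 is dt1 itself, so near-equal
# driving forces fall back to the arithmetic mean.
import math

def lmtd(dt1, dt2, rel_tol=1e-9):
    if abs(dt1 - dt2) <= rel_tol * max(abs(dt1), abs(dt2)):
        return 0.5 * (dt1 + dt2)  # limit value at dt1 == dt2
    return (dt1 - dt2) / (math.log(dt1) - math.log(dt2))

# sanity checks: equals dt at equal driving forces, lies between its arguments
print(lmtd(10.0, 10.0))            # 10.0
print(round(lmtd(20.0, 10.0), 4))  # 14.427 (= 10 / ln 2)
```

The switch removes the 0/0 form so the function behaves smoothly along the diagonal, consistent with its use as a continuously differentiable elemental.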
◆ New extended pinch operator formulation:
  EBP_p^C = Σ_{j∈C} f_j [max{0, (T^p − ΔT_min) − t_j^in} − max{0, (T^p − ΔT_min) − t_j}] + max{0, (T^p − ΔT_min) − t_max}
  EBP_p^H = Σ_{i∈H} F_i [max{0, T^p − T_i} − max{0, T^p − T_i^in}] − max{0, T_min − T^p}
◆ Relevant equation for the new formulation: min_{p∈P} (EBP_p^C − EBP_p^H) = 0
◆ MHEX model: the energy balance Σ_{i∈H} F_i (T_i^in − T_i) over the hot streams matched against the cold streams j ∈ C, the pinch condition min_{p∈P} (EBP_p^C − EBP_p^H) = 0, and the area summation over k ∈ K, k ≠ |K|, with inlet temperatures T_i^in, ∀i ∈ H and t_j^in, ∀j ∈ C
◆ 3 equations (plus constraints on outlet temperatures)
◆ Solve for 3 unknown quantities using the LP-Newton method (temperatures, flow rates, area, minimum approach temperature, etc.)
◆ Two MHEX models plus three intermediate units
Ø 9 equations (4 nonsmooth) → solve for nine variables
◆ Seven temperatures are taken as unknown in the model
Ø 2 additional variables can be taken as unknowns (UA, approach temperature, more temperatures, etc.)
◆ Given the minimum approach temperature ΔT_min = 4 K, calculate UA
◆ Solution:
  y1 = 365.07 K, y2 = 225.44 K, y3 = 193.35 K, y4 = 95.08 K, y5 = 180.85 K, y6 = 357.14 K, y7 = 95.14 K
[Convergence plot: ‖f(x^k)‖_∞ from 10² down to 10⁻¹⁰ versus iteration, up to ~140 iterations]
◆ Given the area, calculate the minimum approach temperature
Ø Case UA = 120 kW/K: ΔT_min = ?
Ø Case UA = 30 kW/K: ΔT_min = ?
◆ Solution:
  y1 = 365.07 K, y2 = 217.66 K, y3 = 196.09 K, y4 = 95.08 K, y5 = 174.75 K, y6 = 362.46 K, y7 = 91.85 K
◆ Given the area and the minimum approach temperature, calculate the remaining unknowns
Ø Case ΔT_min = 4 K, UA = 85 kW/K
Ø Case ΔT_min = 4 K, UA = 35 kW/K
◆ Solution:
  y1 = 365.07 K, y2 = 231.10 K, y3 = 195.06 K, y4 = 95.08 K, y5 = 194.58 K, y6 = 352.12 K, y7 = 97.52 K
◆ [Plot: z_t(x_0) for t ∈ [0, 10] (y_0 = 0, z_0 = 0); the plotted quantity is positive, independent of x_0² + y_0²]
◆ [Plot: trajectory x(t, c) and the generalized derivative Γ[x(2, ·)](t)]
◆ [Plot: bounds on x(t, c) over t ∈ [0, 3]: generalized derivative bounds versus LNA bounds]