Slide 1: Trust Region with a Cubic Model

Trond Steihaug
Department of Informatics, University of Bergen, Norway, and Humboldt-Universität zu Berlin

Workshop on Automatic Differentiation, Nice, April 15, 2005

Slide 2: Outline

The term 'higher order' is used both for the rate of convergence and for the degree of derivatives in optimization. First-order methods are gradient based and have Q-order 1, or Q-superlinear (for quasi-Newton methods), rate of convergence. Second-order methods use the Hessian and have Q-order 2 rate of convergence. For 'difficult' problems, the rate of convergence (Q-order) and the degree of the derivatives used will not match.

  • Regularization ⇒ Trust-Region Subproblem (TRS)
  • Trust-region methods in unconstrained optimization → TRS
  • AD can give higher-order derivatives
  • Higher-order TRS

Slide 3: Linear Least Squares (LLS)

Given an m × n matrix A and b ∈ ℝ^m where m ≥ n, compute x ∈ ℝ^n that solves

$$\min_{x \in \mathbb{R}^n} \tfrac{1}{2}\|Ax - b\|_2^2.$$

Let $A = U\Sigma V^T$ be the singular value decomposition and let $\Sigma^\dagger = \mathrm{diag}(1/\sigma_1, \ldots, 1/\sigma_r, 0, \ldots, 0)$, r = rank(A). Define $A^\dagger = V\Sigma^\dagger U^T$. The solution x is

$$x = A^\dagger b = \sum_{i=1}^{r} \frac{u_i^T b}{\sigma_i}\, v_i,$$

where $U = [u_1 \cdots u_m]$ and $V = [v_1 \cdots v_n]$.
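A minimal NumPy sketch of this pseudoinverse formula (the matrix, right-hand side, and tolerance below are illustrative placeholders):

```python
import numpy as np

def lls_svd(A, b, tol=1e-12):
    """Solve min (1/2)||Ax - b||_2^2 via x = A^+ b = sum (u_i^T b / sigma_i) v_i."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(sigma > tol * sigma[0]))   # numerical rank
    coeffs = (U[:, :r].T @ b) / sigma[:r]     # u_i^T b / sigma_i, i = 1..r
    return Vt[:r].T @ coeffs                  # sum of coeffs_i * v_i

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
print(np.allclose(lls_svd(A, b), np.linalg.lstsq(A, b, rcond=None)[0]))
```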

Slide 4: Singular Values σ_i for a Rank-Deficient Problem

[Figure: singular values σ_i plotted against index i for a rank-deficient problem.]

Slide 5: Singular Values σ_i for a Discrete Ill-Posed Problem

[Figure: singular values σ_i plotted against index i for a discrete ill-posed problem. Problem: ill-conditioned heat, n = 50.]

Slide 6: Discrete Picard Condition

A and b come from the discretization of an ill-posed problem. All σ_i > 0, so formally

$$x = A^\dagger b = \sum_{i=1}^{n} \frac{u_i^T b}{\sigma_i}\, v_i.$$

However, $u_i^T b / \sigma_i \searrow 0$ as i increases (the discrete Picard condition). Now introduce noise in the problem: $b = \tilde{b} + \varepsilon$.
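A small sketch of the effect; a Hilbert matrix stands in for a discretized ill-posed operator, since the deriv2/heat test problems are not reproduced here, and the noise level is an arbitrary illustrative choice:

```python
import numpy as np

n = 10
# Hilbert matrix: a classic severely ill-conditioned stand-in for a
# discretized ill-posed operator
A = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1.0)
b_exact = A @ np.ones(n)                    # data from a smooth exact solution
rng = np.random.default_rng(0)
b_noisy = b_exact + 1e-8 * rng.standard_normal(n)

U, sigma, Vt = np.linalg.svd(A)
for b, label in [(b_exact, "exact"), (b_noisy, "noisy")]:
    c = np.abs(U.T @ b) / sigma             # Picard coefficients |u_i^T b| / sigma_i
    print(label, c)                         # decay for exact data, blow-up with noise
```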

Slide 7: Coefficients u_i^T b / σ_i for Exact Data and Noisy Data

[Figure: coefficients of the right singular vectors in the LS solution, for exact data and for noisy data, on a logarithmic scale. Problem: deriv2, n = 50.]

Slide 8: One Solution to the Noisy Problem: Regularization

The following three problems are equivalent and make the 'noisy' problem smooth:

Given μ ≥ 0, solve $\min_x \tfrac{1}{2}\|Ax - b\|_2^2 + \mu\|x\|_2^2$.

Given λ ≥ 0, solve $(A^T A + \lambda I)x = A^T b$.

Given Δ ≥ 0, solve $\min_{\|x\| \le \Delta} \tfrac{1}{2}\|Ax - b\|_2^2$. (TRS)

The equivalence follows from the Karush-Kuhn-Tucker conditions. (There exist open intervals for the three parameters μ, λ, Δ so that x is the solution to all three problems.)
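A sketch of the λ-form and its stacked least-squares equivalent; λ and the test data are arbitrary placeholders, not tied to a particular test problem:

```python
import numpy as np

def tikhonov(A, b, lam):
    """Regularized normal equations: (A^T A + lam*I) x = A^T b."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((30, 10)), rng.standard_normal(30), 0.1

x = tikhonov(A, b, lam)
# The same x solves the stacked least-squares form of the penalized problem
A_aug = np.vstack([A, np.sqrt(lam) * np.eye(10)])
b_aug = np.concatenate([b, np.zeros(10)])
print(np.allclose(x, np.linalg.lstsq(A_aug, b_aug, rcond=None)[0]))
# And x solves the TRS  min_{||s|| <= Delta} ||As - b||  with Delta = ||x||
print(np.linalg.norm(x))
```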

Where is AD?

Slide 9: Gauss-Newton and Nonlinear Least Squares

Given a nonlinear function F : ℝ^n → ℝ^m. Inexact Gauss-Newton method:

Given x_0
while not converged do
  Compute F′(x_i)
  Find an approximate solution s_i of $\min_{s \in \mathbb{R}^n} \tfrac{1}{2}\|F'(x_i)s + F(x_i)\|_2^2$
  Update x_{i+1} = x_i + s_i
end-while

Here F′(x) is the m × n Jacobian matrix at x. Noise is inherent in the LLS problem unless F and F′ are computed to high accuracy!
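A bare-bones sketch of the (exact-step) Gauss-Newton loop; the residual F, Jacobian J, and data are illustrative toy choices:

```python
import numpy as np

def gauss_newton(F, J, x0, tol=1e-10, max_iter=50):
    """At each iterate solve the linearized LLS  min_s ||F'(x)s + F(x)||_2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        s = np.linalg.lstsq(J(x), -F(x), rcond=None)[0]   # Gauss-Newton step
        x = x + s
        if np.linalg.norm(s) < tol:
            break
    return x

# Toy problem: fit the rate x in y = exp(x*t) to noise-free data
t = np.linspace(0.0, 1.0, 10)
y = np.exp(0.7 * t)
F = lambda x: np.exp(x[0] * t) - y             # residual, R^1 -> R^10
J = lambda x: (t * np.exp(x[0] * t))[:, None]  # 10 x 1 Jacobian
print(gauss_newton(F, J, [0.0]))               # ~ [0.7]
```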

Slide 10: Higher Order Model Function

Gauss-Newton is based on a first-order approximation of F at x, i.e.

$$F(x + s) \approx M_1(s) = F(x) + F'(x)s,$$

and solves for the step s: $\min_{s \in \mathbb{R}^n} \|M_1(s)\|_2^2$.

Finding an approximate solution s_i by constraining ‖s‖ ≤ Δ leads to Levenberg-Marquardt methods. These are trust-region methods that use the linear model $M_1(s) = F'(x_i)s + F(x_i)$ of F(x_i + s) at x_i, with approximate solution of

$$\min_{\|s\| \le \Delta} \|M_1(s)\|_2^2.$$

A more accurate model uses the second-derivative tensor:

$$M_2(s) = F(x_i) + F'(x_i)s + \tfrac{1}{2}(Ts)s, \quad T = F''(x_i).$$

Slide 11: Higher Order Model Function (2)

Let $m(s) \approx f(x + s) = F(x + s)^T F(x + s)$ and solve

$$\min_{\|s\| \le \Delta} m(s),$$

where

$$m_2(s) = f(x) + \nabla f(x)^T s + \tfrac{1}{2} s^T \nabla^2 f(x) s$$
$$m_3(s) = f(x) + \nabla f(x)^T s + \tfrac{1}{2} s^T \nabla^2 f(x) s + \tfrac{1}{6} s^T (Ts) s, \quad T = \nabla^3 f(x).$$
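A sketch of evaluating the cubic model with the third-derivative tensor contracted by einsum; the quartic test function and its hand-coded derivatives are illustrative:

```python
import numpy as np

def m3(f, g, H, T, s):
    """Cubic model  f + g^T s + (1/2) s^T H s + (1/6) s^T (T s) s."""
    Ts = np.einsum('ijk,k->ij', T, s)   # (T s): contract the tensor with s once
    return f + g @ s + 0.5 * s @ (H @ s) + s @ (Ts @ s) / 6.0

# Check on f(x) = x1^4 + x2^4, whose derivatives are easy to write down
x = np.array([1.0, 2.0])
s = np.array([0.05, -0.05])
g = 4.0 * x**3                          # gradient
H = np.diag(12.0 * x**2)                # Hessian
T = np.zeros((2, 2, 2))                 # third-derivative tensor (diagonal here)
T[0, 0, 0], T[1, 1, 1] = 24.0 * x
print(m3(np.sum(x**4), g, H, T, s))     # close to f(x + s) = sum((x+s)**4)
print(np.sum((x + s)**4))
```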

Slide 12: The Basic Trust Region Method

Given x_0 and Δ_0  (0 ≤ γ_2 < γ_1 < 1, 0 ≤ γ_4 ≤ γ_5 < 1 ≤ γ_3)
while not converged do
  Compute the model m_i(s).
  Compute an approximate solution s_i of the TRS: $\min_{\|s\| \le \Delta_i} m_i(s)$.
  Compute f(x_i + s_i), m_i(s_i), and
    $$\rho_i = \frac{f(x_i) - f(x_i + s_i)}{f(x_i) - m_i(s_i)} = \frac{\text{actual}}{\text{predicted}}$$
  Update $x_{i+1} = \begin{cases} x_i + s_i & \text{if } \rho_i \ge \gamma_2 \\ x_i & \text{otherwise} \end{cases}$
  Update Δ_{i+1}:
    $\|s_i\| \le \Delta_{i+1} \le \gamma_3 \|s_i\|$ if $\rho_i \ge \gamma_1$
    $\gamma_4 \|s_i\| \le \Delta_{i+1} \le \gamma_5 \|s_i\|$ if $\rho_i < \gamma_1$
end-while
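A compact sketch of this loop, with the TRS solved only approximately by the step-constrained Cauchy point of the linear model (one simple choice among many); all parameter values and the test function are illustrative:

```python
import numpy as np

def trust_region(f, grad, hess, x0, delta=1.0, gamma1=0.75, gamma2=0.1,
                 gamma3=2.0, gamma5=0.5, tol=1e-8, max_iter=200):
    """Basic trust-region method with the quadratic model m2; the TRS is
    approximated by the Cauchy point (steepest descent clipped to the ball)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        s = -(delta / np.linalg.norm(g)) * g       # to the boundary along -g
        gHg = g @ (H @ g)
        if gHg > 0:                                # pull back if m2 curves up
            s *= min(1.0, np.linalg.norm(g)**3 / (delta * gHg))
        pred = -(g @ s + 0.5 * s @ (H @ s))        # f(x) - m2(s) > 0
        rho = (f(x) - f(x + s)) / pred             # actual / predicted
        if rho >= gamma2:
            x = x + s                              # accept the step
        ns = np.linalg.norm(s)                     # update inside the intervals
        delta = gamma3 * ns if rho >= gamma1 else gamma5 * ns
    return x

f = lambda x: x[0]**4 + 2.0 * x[1]**2
grad = lambda x: np.array([4.0 * x[0]**3, 4.0 * x[1]])
hess = lambda x: np.diag([12.0 * x[0]**2, 4.0])
print(trust_region(f, grad, hess, [2.0, 1.0]))     # ~ [0, 0]
```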

Slide 13: Properties

$m_1(s) = f(x) + \nabla f(x)^T s$ - linear model
$m_2(s) = m_1(s) + \tfrac{1}{2} s^T \nabla^2 f(x) s$ - quadratic model
$m_3(s) = m_2(s) + \tfrac{1}{6} s^T (Ts) s$ - cubic model

Under 'reasonable' conditions on the models, the basic trust-region algorithm is globally convergent, i.e., for a given ε > 0 and any x_0 there exists an index i so that ‖∇f(x_i)‖ ≤ ε. We need to understand the Trust-Region Subproblem (TRS):

$$\min_{\|s\| \le \Delta} m(s).$$

Slide 14: Exact Solution of TRS m_i(s), i = 1, 2, 3

The trust-region subproblem with m_1,

$$\min_{\|s\| \le \Delta} f + g^T s,$$

gives the step-constrained Cauchy point $\tilde{s}$:

$$\tilde{s} = -\frac{\Delta}{\|g\|}\, g.$$

Slide 15: Exact Solution of TRS m_2(s)

$$\min_{\|s\| \le \Delta} f + g^T s + \tfrac{1}{2} s^T H s$$

s is a solution with Lagrange multiplier δ if and only if
(i) (H + δI)s + g = 0,
(ii) H + δI is positive semidefinite,
(iii) δ ≥ 0 and δ(‖s‖ − Δ) = 0.
(Gay (1981) and Sorensen (1982))

The solution is of the form $s(\delta) = -(H + \delta I)^{-1} g$ provided H + δI is positive definite and ‖s(δ)‖ = Δ (i.e., small Δ gives large δ). For H + δI positive semidefinite we have two cases: if g is orthogonal to the null-space of H + δI we have the so-called 'hard case'; if g is not orthogonal to it, we have a smooth solution.
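A sketch of an exact TRS solver for m_2 built directly on conditions (i)-(iii) via an eigendecomposition; the hard case and the safeguarding a production solver would need are deliberately ignored, and the test data are placeholders:

```python
import numpy as np

def trs_m2(g, H, delta, iters=100):
    """min_{||s|| <= delta} g^T s + (1/2) s^T H s  via (H + d*I)s = -g,
    d >= 0, H + d*I psd; the 'hard case' is not handled here."""
    evals, Q = np.linalg.eigh(H)
    gq = Q.T @ g
    norm_s = lambda d: np.linalg.norm(gq / (evals + d))
    if evals[0] > 0 and norm_s(0.0) <= delta:
        return -Q @ (gq / evals)              # interior Newton step, d = 0
    lo = max(0.0, -evals[0]) + 1e-12          # d must make H + d*I psd
    hi = lo + 1.0
    while norm_s(hi) > delta:                 # bracket the boundary solution
        hi *= 2.0
    for _ in range(iters):                    # bisection on ||s(d)|| = delta
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if norm_s(mid) > delta else (lo, mid)
    return -Q @ (gq / (evals + hi))

H = np.array([[2.0, 0.0], [0.0, -1.0]])      # indefinite Hessian
g = np.array([1.0, 1.0])
s = trs_m2(g, H, delta=1.0)
print(s, np.linalg.norm(s))                  # ||s|| = delta on the boundary
```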

Slide 16: H Positive Definite

[Figure: contour plot of the quadratic model with H positive definite; contour levels 1, 4, 8, 16, 32.]

Slide 17: H Semidefinite

[Figure: contour plot of the quadratic model with H semidefinite; contour labels 4, 8, 14.023, 16, 32, 64, 128.]

Slide 18: Exact Solution of TRS for LLS

$$\min_{\|s\| \le \Delta} f + g^T s + \tfrac{1}{2} s^T H s$$

Let $S_1 = N(H + \lambda I)$ (N is the null-space). We have the hard case when g ⊥ S_1. For LLS recall that $H = A^T A$ and $g = -A^T b$ (so λ = 0 for the hard case), and

$$g^T v_j = -\sigma_k\, b^T u_j, \quad 1 \le j \le m_k,$$

where u_j, v_j are associated with the singular value σ_k of multiplicity m_k. (Rojas-Sorensen (2002))
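This relation is easy to check numerically (the matrix and vector below are random placeholders):

```python
import numpy as np

# Check g^T v_k = -sigma_k * u_k^T b for H = A^T A, g = -A^T b:
# g = -A^T b = -V Sigma U^T b, so V^T g = -Sigma U^T b, column by column.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((8, 5)), rng.standard_normal(8)
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(Vt @ (-A.T @ b), -sigma * (U.T @ b)))   # True
```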

Slide 19: The Hard Case is Normal

[Figure: |g^T v_i| versus i on a logarithmic scale, ranging over roughly 10^{-18} to 10^2.]

Note that $g^T v_j = 0$ is the (exact) hard case, and $g^T v_j = -\sigma_j u_j^T b$.

Slide 20: A Major Challenge: The Cubic Model

$$\min_{\|s\| \le \Delta} m_3(s).$$

  • We can characterize (with necessary and sufficient conditions) the (local) solutions of the TRS.
  • We can compute the local minimizers, in a way.
  • What do we know about the (global) solution path? In the general case it bifurcates, stops, and is not continuous.
  • The solution path we want consists of local and global solutions.

Slide 21: TRS with the Cubic Model on the Beale Function

[Figure: solution path for the TRS with the cubic model on the Beale function, x_0 = (0.5, −0.25) and x* = (3, 0.5).]

Slide 22: Global Convergence with the Cubic Model

$$\min_{\|s\| \le \Delta} m_3(s).$$

  • Convergence results require $m_3(s_i) \le \gamma_0 m_1(\tilde{s}_i)$, where $\tilde{s}_i$ is the step-constrained Cauchy point. This does not always hold for a fixed γ_0 > 0.
  • A problem arises in the proof of convergence when the tensor gets large. Assume that the tensor is uniformly bounded.
  • These results use the existence of s_i; there is no guaranteed working algorithm to compute s_i.
  • Can we say anything about the rate of convergence? Only in the case when f is strictly convex at a (local) solution.

Slide 23: Vector Representation of Tensors

[Figure: an n × n × n tensor sliced into vectors three ways: columns, rows, and tubes.]

Slide 24: Data Structures for Super-Symmetric Tensors

$$T_{ijk} = \frac{\partial^3 f}{\partial x_i \partial x_j \partial x_k}(x)$$

Clearly $T_{ijk} = T_{jik} = T_{ikj} = T_{jki} = T_{kij} = T_{kji}$. To store the tensor we need to store (n + 2)(n + 1)n/6 (real) numbers: $T_{ijk}$, 1 ≤ k ≤ j ≤ i ≤ n.

Linear array: $T((i-1)i(i+1)/6 + (j-1)j/2 + k) \equiv T_{ijk}$, 1 ≤ k ≤ j ≤ i ≤ n.

C# and Java offer new possibilities for storing the super-symmetric tensor using the standard notation T[i][j][k]; tube (i, j) is the array T[i][j].
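A small sketch of the packed linear-array layout, as a 0-based translation of the 1-based indexing formula above (helper names are illustrative):

```python
import numpy as np

def packed_index(i, j, k):
    """0-based position of T_ijk for i >= j >= k; translates the slide's
    1-based formula (i-1)i(i+1)/6 + (j-1)j/2 + k."""
    return i * (i + 1) * (i + 2) // 6 + j * (j + 1) // 2 + k

def tensor_get(T_packed, i, j, k):
    """Super-symmetry: any permutation of (i, j, k) reads the same entry."""
    i, j, k = sorted((i, j, k), reverse=True)
    return T_packed[packed_index(i, j, k)]

n = 4
T_packed = np.zeros(n * (n + 1) * (n + 2) // 6)   # (n+2)(n+1)n/6 numbers
T_packed[packed_index(2, 1, 0)] = 5.0
print(tensor_get(T_packed, 0, 2, 1))              # 5.0
```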

Slide 25: Sparse Tensors

Griewank-Toint (partial separability): the Hessian matrix is said to be sparse if $\nabla^2 f(x)_{ij} = 0$ for all $x \in \mathbb{R}^n$ and $(i, j) \in Z$. The sparsity structure of the tensor T is then determined by the sparsity structure of the Hessian matrix: $T_{ijk} = 0$ when (i, j), (i, k) or (j, k) ∈ Z. The symmetric skyline format is 'vector' based and can be extended to sparse super-symmetric tensors using an array of arrays, or a linear array with only n pointers, as the data structure.
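One possible realization, assuming nothing beyond the slide: a dictionary keyed by sorted indices, with the Hessian's zero pattern Z screening which entries may be nonzero (the class and its methods are hypothetical names):

```python
# A sketch of a sparse super-symmetric tensor: each entry is stored once,
# keyed by sorted indices; index pairs in the Hessian's zero pattern Z are
# rejected, since T_ijk = 0 whenever (i,j), (i,k) or (j,k) is in Z.
class SparseSymTensor:
    def __init__(self, Z):
        self.Z = Z              # set of pairs (p, q), p >= q, structurally zero
        self.data = {}

    def _zero(self, i, j, k):
        pairs = [(i, j), (i, k), (j, k)]
        return any((max(p), min(p)) in self.Z for p in pairs)

    def set(self, i, j, k, value):
        if not self._zero(i, j, k):
            self.data[tuple(sorted((i, j, k)))] = value

    def get(self, i, j, k):
        return self.data.get(tuple(sorted((i, j, k))), 0.0)

T = SparseSymTensor(Z={(3, 0)})         # Hessian entry (3,0) is structurally zero
T.set(1, 1, 0, 2.5)
print(T.get(0, 1, 1), T.get(3, 2, 0))   # 2.5  0.0
```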

Slide 26: Tensor Methods are in Use

Around 50 papers in the database (in optimization and computational science); in around 50% of the papers the reference is of the form 'Higher order methods have been considered by....'.

Brett W. Bader, PhD 2003, University of Colorado
Ali Bouaricha, PhD 1992, University of Colorado
Ta-Tung Chow, PhD 1989, University of Colorado
Paul D. Frank, PhD 1984, University of Colorado

The Workshop on Tensor Decompositions and Applications, August 2005, is to discuss 'Large scale problems'.

Slide 27: Concluding Remarks

  • AD has given us the opportunity to use higher derivatives; they are too messy for hand-coding.
  • Very few classes of methods in optimization are capable of utilizing the 3rd derivative.
  • There are few efficient data structures for sparse super-symmetric tensors.
  • Are the researchers right who claim that tensor methods can never compete with Newton's method in terms of speed of convergence when the Hessian matrix is nonsingular at the solution? ⇒ Is there a big enough class of problems where the 3rd derivative will be 'useful'?
  • Ongoing work by Geir Gundersen, University of Bergen.