ERCIM 2013: Fitting techniques for estimating the trace of the inverse of a matrix (PowerPoint presentation)


SLIDE 1

ERCIM 2013: Fitting techniques for estimating the trace of the inverse of a matrix. Andreas Stathopoulos, Lingfei Wu, Jesse Laeuchli (College of William and Mary); Vasilis Kalantzis (University of Minnesota); Stratis Gallopoulos (University of Patras, Greece). Acknowledgments: NSF, DOE SciDAC.

SLIDE 2

The problem. Given a large N×N matrix A and a function f, find the trace of f(A): Tr(f(A)).

Common functions: f(A) = A^{-1}, log(A), exp(A), R_i^T A^{-1} R_j, ...

Applications: UQ, data mining, quantum Monte Carlo, lattice QCD. Our focus: f(A) = A^{-1}, but the techniques are general.

SLIDE 3

Standard underlying method: Monte Carlo (Hutchinson 1989)

If x is a vector of random Z_2 (Rademacher) variables,
  x_i = +1 with probability 1/2,
        −1 with probability 1/2,
then E(x^T A^{-1} x) = Tr(A^{-1}).

Monte Carlo trace:
  for i = 1:n
    x = randZ2(N,1)
    sum = sum + x^T A^{-1} x
  end
  trace = sum/n

SLIDE 4

Standard underlying method: Monte Carlo (Hutchinson 1989), continued

Two problems:
  1. The large number of samples required
  2. How to compute x^T A^{-1} x

SLIDE 5

Standard underlying method: Monte Carlo (Hutchinson 1989), continued

Computing x^T A^{-1} x: solve Ay = x vs. quadrature on x^T A^{-1} x (Golub '69, Bai '95, Meurant '06, '09, Strakos '11).
O(100–1000s) of statistically independent right-hand sides.
Recycling (de Sturler), deflation (Morgan, A.S. '07).
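The pseudocode above can be sketched as follows (a minimal NumPy sketch; a dense direct solve stands in for the iterative solver with recycling/deflation that the slides assume, and the function name is illustrative):

```python
import numpy as np

def hutchinson_trace_inv(A, n_samples, rng=None):
    """Hutchinson's estimator for Tr(A^{-1}) with Rademacher (Z2) probes.

    Each sample solves A y = x instead of forming A^{-1} explicitly."""
    rng = np.random.default_rng(rng)
    N = A.shape[0]
    total = 0.0
    for _ in range(n_samples):
        x = rng.choice([-1.0, 1.0], size=N)  # randZ2(N,1)
        y = np.linalg.solve(A, x)            # in practice: an iterative solver
        total += x @ y                       # accumulate x^T A^{-1} x
    return total / n_samples
```

For a diagonal A the estimator is exact for every sample, since x_i^2 = 1; for general matrices the off-diagonal entries of A^{-1} contribute noise, which motivates the variance discussion below.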

SLIDE 6

Selecting the vectors in x^T A^{-1} x to minimize variance or error

Random:
  • x ∈ Z_2^N: best variance for real matrices (Hutchinson 1989)
  • x = e_i: variance depends only on diag(A^{-1})
  • x = F^T e_i: mixing, F = DFT or Hadamard (Avron et al. 2010)

Deterministic:
  • x = H e_i, i = 1,...,2^k: Hadamard in natural order (Bekas et al. 2007)
  • x^m_i = 1 if i ∈ C_m, 0 otherwise: probing; assumes a multicolored graph (Tang et al. 2011)
  • x = H(p_m, k_i): hierarchical probing for lattices (A.S., J.L. 2013); maintains the benefits of probing but is cheap and incremental

SLIDE 7

Variance of the estimators

Rademacher vectors x_i ∈ Z_2^N:
  Tr = (1/s) Σ_{i=1..s} x_i^T A^{-1} x_i
  Var(Tr) = (2/s) ‖Ã^{-1}‖_F^2 = (2/s) Σ_{i≠j} (A^{-1}_{ij})^2

Diagonal, x = e_{j(i)}:
  Tr = (N/s) Σ_{i=1..s} A^{-1}_{j(i),j(i)}
  Var(Tr) = (N^2/s) Var(diag(A^{-1}))

[Figure: magnitude and variance of the entries of A^{-1}]

Unclear which method is best a priori.
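The two single-sample variance formulas can be evaluated directly on a small dense inverse (an illustrative sketch; the function name and inputs are not from the slides):

```python
import numpy as np

def estimator_variances(Ainv):
    """Single-sample variances of the two trace estimators, given A^{-1}."""
    N = Ainv.shape[0]
    off = Ainv - np.diag(np.diag(Ainv))
    var_rademacher = 2.0 * np.sum(off**2)        # 2 * sum_{i != j} (A^{-1}_{ij})^2
    var_diagonal = N**2 * np.var(np.diag(Ainv))  # N^2 * Var(diag(A^{-1}))
    return var_rademacher, var_diagonal
```

When A^{-1} is diagonal the Rademacher variance vanishes while the diagonal-sampling variance generally does not, and vice versa for a matrix with a flat diagonal and large off-diagonal mass, which is exactly why neither method dominates a priori.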

SLIDE 8

Why focus on the diagonal method? The trace is the integral of a 1-D signal (the diagonal). Can we improve on Monte Carlo? Not without external information about the distribution of the diagonal elements.

Our goals:
  • What if we have an approximation M ≈ diag(A^{-1})?
  • Is Tr(M) ≈ Tr(A^{-1}) sufficient?
  • If not, can we use a fitting p(M) (regression/interpolation/quadrature)?
  • Can the fitting reduce Var(p(M) − diag(A^{-1}))?

SLIDE 9

Approximations to diag(A^{-1})

  • Inexpensive bounds on the diagonal elements (Robinson and Wathen '92): e.g., for A SPD, 1/A_ii often captures the pattern of diag(A^{-1})
  • Let [L,U] = ILU(A) (incomplete LU) and M = diag(U^{-1} L^{-1}): requires only the A^{-1}_{ij} entries in the sparsity pattern of L, U (Erisman and Tinney '75)
  • Eigen/singular vectors: M = diag(X Λ^{-1} Y^T) for the nev smallest eigenvalues. Already available from deflating multiple right-hand sides! The number of eigenvectors can be increased while solving Ax = e_i (eigCG)
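The eigenvector option, for a symmetric A, might be sketched like this (a dense `eigh` stands in for the eigCG eigenpairs of the slides; the function name is illustrative):

```python
import numpy as np

def diag_inv_approx_from_eigs(A, nev):
    """Approximate diag(A^{-1}) from the nev smallest eigenpairs of a
    symmetric A: M = diag(X Lambda^{-1} X^T)."""
    w, X = np.linalg.eigh(A)                 # ascending eigenvalues
    Xs, ws = X[:, :nev], w[:nev]             # nev smallest eigenpairs
    # row-wise diagonal of X Lambda^{-1} X^T without forming the full matrix
    return np.einsum('ij,j,ij->i', Xs, 1.0 / ws, Xs)
```

With nev = N this reproduces diag(A^{-1}) exactly; with a few smallest eigenpairs it captures the dominant pattern, as in the Laplacian example on the next slide.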

SLIDE 10

For some problems M captures the pattern of diag(A^{-1}) well

Laplacian: delsq(numgrid('S',34))

[Figure: index of diagonal vs. diag(A^{-1}), diag(XΛ^{-1}X^T), and diag((LU)^{-1})]

Deflation: 15 smallest eigenpairs; ILU('crout','row',0.01).
The traces are not close, but
  Var(diag(XΛ^{-1}X^T − A^{-1})) = 4e-4
  Var(diag((LU)^{-1} − A^{-1})) = 1e-2
so MC on diag(A^{-1} − M) can be competitive with Hutchinson's method.

SLIDE 11

In some cases the approximation is pointless

Rajat10, a circuit-simulation matrix (size 30202)

[Figure: index of diagonal element vs. diag(A^{-1}) and diag(M); M from the 100 smallest singular triplets]

SLIDE 12

Capture the pattern better by fitting M to D = diag(A^{-1})

MC resolves a shift D = c + M, but not a scale D = bM (the variance may even increase!).

Approach 1. Least-squares fit with bM + c
  1. Solve D_i = e_i^T A^{-1} e_i for i ∈ S, a set of k indices
  2. Find [b,c] = argmin{ ‖D(S) − (bM(S) + c)‖^2 : b,c ∈ ℝ }

Not many points (linear systems) are needed, typically 10-20.
Significant improvement in the estimation of the trace.
Reduces the variance for potentially continuing with MC.
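Approach 1 might look like the following sketch (dense solves and `np.polyfit` stand in for the iterative solver and the two-parameter least-squares problem; the function name is illustrative):

```python
import numpy as np

def fit_linear_model(A, M, S):
    """Approach 1 sketch: fit D ~ b*M + c on a small index set S, where
    D_i = e_i^T A^{-1} e_i is computed by solving A y = e_i."""
    N = A.shape[0]
    D_S = np.empty(len(S))
    for k, i in enumerate(S):
        e = np.zeros(N)
        e[i] = 1.0
        D_S[k] = np.linalg.solve(A, e)[i]  # D_i = (A^{-1})_{ii}
    b, c = np.polyfit(M[S], D_S, 1)        # least-squares line through the samples
    return b * M + c                       # fitted approximation to diag(A^{-1})
```

Summing the returned vector gives the trace estimate Tr(bM + c); the residual D − (bM + c) is what a follow-up MC phase would then sample.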

SLIDE 13

Approach 1 example: D ≈ bM + c

Matrix RDB5000, 50 smallest singular triplets, k = 20 points used to fit. The accuracy of the systems and singular vectors is 1e-6.

[Figure: index of diagonal vs. diag(A^{-1}) and M (left), and vs. the fitted diag(bM + c) (right)]

  Tr(A^{-1}) = −267.880      Var(D) = 6.1e-3
  Tr(bM + c) = −267.544      Var(D − M) = 4.3e-4
  Rel. err.  = 1e-3          Var(D − bM − c) = 4.3e-5

SLIDE 14

Better fitting

The linear model preserves the shape of M and thus relies too much on the quality of M. Interpolating with a higher-degree polynomial could be noisy.

Approach 2, basic: Piecewise Cubic Hermite Interpolation (PCHIP)
  1. Solve D_i = e_i^T A^{-1} e_i for i ∈ S, a set of k indices
  2. Fit p(M(S)) = D(S)

For PCHIP to effectively capture the pattern (global and local) of D, it needs:
  • smoothness of the approximant
  • the elements of M(S) to appear in increasing order
  • to capture the whole range of values of D
  • to capture where most of the action in D is happening

SLIDE 15

Approach 2. Piecewise Cubic Hermite Interpolation (PCHIP)

  1. [M̃, J] = sort(M), to obtain a CDF-like, smooth graph
  2. Choose Q, a set of k indices: {1,2} ∈ Q, and the remaining k−2 are chosen to minimize the trapezoidal-rule integration error of M̃. Do not consider indices that produce non-unique M̃_i values.

[Figure: M and diag(A^{-1}) in the original ordering (left); sorted M with the selected points (right)]

SLIDE 16

Approach 2. Piecewise Cubic Hermite Interpolation (PCHIP)

  1. [M̃, J] = sort(M)
  2. Choose Q, a set of k indices
  3. S = J(Q), the corresponding indices in the original ordering
  4. Solve D_i = e_i^T A^{-1} e_i for i ∈ S

[Figure: sorted M, diag(A^{-1}), and the computed entries A^{-1}_{ii}]

SLIDE 17

Approach 2. Piecewise Cubic Hermite Interpolation (PCHIP)

  1. [M̃, J] = sort(M)
  2. Choose Q, a set of k indices
  3. S = J(Q), the original-order indices
  4. Solve D_i = e_i^T A^{-1} e_i for i ∈ S
  5. PCHIP fit p(M(S)) = D(S); use p(M) ≈ D

[Figure: sorted M, diag(A^{-1}), the computed entries A^{-1}_{ii} (left), and the fit p(M) (right)]

SLIDE 18

Approach 2. Piecewise Cubic Hermite Interpolation (PCHIP)

  1. [M̃, J] = sort(M)
  2. Choose Q, a set of k indices
  3. S = J(Q), the original-order indices
  4. Solve D_i = e_i^T A^{-1} e_i for i ∈ S
  5. PCHIP fit p(M(S)) = D(S); use p(M) ≈ D

If step 4 also computes eigenvectors, the points can be updated incrementally.

[Figure: sorted M, diag(A^{-1}), the computed entries A^{-1}_{ii} (left), and the fit p(M) (right)]
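Steps 1-5 can be put together as a sketch (uniformly spaced sample indices replace the trapezoidal-error selection rule of step 2, and dense solves replace the iterative solver; SciPy's `PchipInterpolator` provides the PCHIP fit):

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def pchip_diag_fit(A, M, k):
    """Approach 2 sketch: sort M, select k sample indices over the sorted
    range, solve for those diagonal entries, and PCHIP-interpolate."""
    N = A.shape[0]
    J = np.argsort(M)                                      # step 1: sort(M)
    Q = np.unique(np.linspace(0, N - 1, k).astype(int))    # step 2 (simplified)
    Q = Q[np.concatenate(([True], np.diff(M[J][Q]) > 0))]  # drop duplicate M values
    S = J[Q]                                               # step 3: original-order indices
    # step 4: D_i = e_i^T A^{-1} e_i via one solve per selected index
    D_S = np.array([np.linalg.solve(A, np.eye(N)[:, i])[i] for i in S])
    p = PchipInterpolator(M[S], D_S)                       # step 5: fit p(M(S)) = D(S)
    return p(M)                                            # p(M) approximates diag(A^{-1})
```

Summing the returned vector gives the trace estimate; with k = N the fit interpolates every diagonal entry exactly.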

SLIDE 19

Approach 2 (PCHIP): improvement over bM + c

[Figure: three zoomed regions of the sorted-M axis comparing diag(A^{-1}), p(M), and bM + c]

SLIDE 20

Fitting examples, nev = k = 100: OLM5000, SiNa, KUU

[Figures: for each matrix, diag(A^{-1}), M, and the fit p(M), plotted against the index of sorted M]

SLIDE 21

Very good eigenvalue approximation

[Figures: relative error of the trace vs. the number of deflated singular vectors for OLM5000, RDB5000, SiNa, and KUU, comparing bM + c and p(M)]

SLIDE 22

Variance: Z_2 on E = A^{-1} − YΣ^{-1}X^T vs. MC on the diagonal D − p(M)

[Figures: variance vs. the number of deflated eigenvectors or singular vectors for OLM5000, RDB5000, SiNa, and KUU, comparing Var(D − p(M)), Var(diag(A^{-1})), Var(diag(E)), the Z_2 (Hutchinson) variances of A^{-1} and E, and Var(A − aM)]

SLIDE 23

When to use it? Estimate dynamically:

1. Relative trace error, by cross-validation:
  (a) Use m subsets S_i ⊂ S
  (b) Fit p(M̃(S_i)) and compute the mean error ε_i over the held-out points S − S_i
  (c) Confidence interval for the error: ±2 sqrt(Var(ε_i))

2. Variance of (D − p(M)) vs. Z_2 on E:
  (a) Compute a_j = A^{-1} e_j, j ∈ S
  (b) From the diagonal entries a_jj, update the estimates of Var(D), Var(D − M), Var(D − p(M))
  (c) From the entries a_ij and μ_i = YΣ^{-1}X^T e_i, update the Hutchinson variance estimates Var(A) = 2‖Ã‖_F^2 and Var(A − YΣ^{-1}X^T)

Large differences between the methods show up after only a few points.
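The cross-validation idea in item 1 might be sketched as follows (assumes distinct M values; the held-out-subset scheme and the function name are illustrative, not the slides' exact rule):

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def cv_fit_error(M_S, D_S, m=5, rng=0):
    """Cross-validation sketch of the fit error: repeatedly fit PCHIP on a
    subset S_i of the sample points, record the mean error on the held-out
    points, and return the mean error with a +-2*sqrt(Var) confidence width."""
    rng = np.random.default_rng(rng)
    order = np.argsort(M_S)
    x, y = M_S[order], D_S[order]
    errs = []
    for _ in range(m):
        # hold out a few interior points; keep the endpoints in the fit
        hold = rng.choice(np.arange(1, len(x) - 1),
                          size=max(1, len(x) // 5), replace=False)
        keep = np.setdiff1d(np.arange(len(x)), hold)
        p = PchipInterpolator(x[keep], y[keep])
        errs.append(np.mean(np.abs(p(x[hold]) - y[hold])))
    errs = np.asarray(errs)
    return errs.mean(), 2.0 * np.sqrt(errs.var())
```

A small mean error with a tight confidence width suggests the fitted p(M) is trustworthy; a wide interval suggests switching to, or continuing with, the stochastic estimator.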

SLIDE 24

Dynamically identifying the smallest variance

The estimated variance converges to the actual variance; the relative differences are apparent almost immediately.

[Figures: estimated vs. actual Var(diag(E_2)), Var(A^{-1}), and Var(E) as a function of the number of fitting points, for the matrices cage8 and bwm200]

SLIDE 25

Dynamically identifying the smallest variance

If a total of s steps is allowed, which method will give the smallest error at step s? E.g., the matb5 QCD matrix:

[Figure: predicted error at the 200th quadrature vs. the number of points used in the fitting, for Rademacher on A, Rademacher on E, and the unit basis on E_2]

After 10 steps, an excellent match between the estimated and observed variances.

SLIDE 26

Conclusions

If M approximates D qualitatively well, our technique combines deterministic regression and stochastic estimation to achieve good accuracy on Σ D_i with as few samples as possible.

  • Most eigenvectors are a by-product of solving the right-hand sides (samples).
  • Fitting achieves good accuracy early (it is less expensive than MC).
  • Fitting may or may not improve the variance.
  • Dynamic monitoring is possible; some improvements are needed.
