ERCIM 2013
Fitting techniques for estimating the trace of the inverse of a matrix
Andreas Stathopoulos, Lingfei Wu, Jesse Laeuchli (College of William and Mary); Vasilis Kalantzis (University of Minnesota); Stratis Gallopoulos (University of Patras)
The problem. Given a large N × N matrix A and a function f, find the trace of f(A): Tr(f(A)).
Common functions: f(A) = A^{-1}, log(A), exp(A), and bilinear forms R_i^T A^{-1} R_j, ...
Applications: UQ, Data Mining, Quantum Monte Carlo, Lattice QCD. Our focus: f(A) = A^{-1}, but the techniques are general.
[ 2 ]
Standard underlying method: Monte Carlo (Hutchinson 1989)
If x is a vector of random Z2 variables,
x_i = +1 with probability 1/2, −1 with probability 1/2,
then E(x^T A^{-1} x) = Tr(A^{-1}).
Monte Carlo Trace:
for i = 1:n
    x = randZ2(N,1)
    sum = sum + x^T A^{-1} x
trace = sum/n
Two problems: the large number of samples, and how to compute x^T A^{-1} x.
Computing x^T A^{-1} x: solve Ay = x vs. quadrature (Golub '69, Bai '95, Meurant '06, '09, Strakos '11).
Samples: O(100-1000s) of statistically independent RHS. Recycling (de Sturler), Deflation (Morgan, A.S. '07).
[ 5 ]
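The Monte Carlo loop above can be sketched in Python/NumPy (a minimal illustration; the tridiagonal test matrix, the sample count, and the dense `np.linalg.solve` are assumptions standing in for the large sparse A and the iterative solver of a real application):

```python
import numpy as np

def hutchinson_trace_inv(A, n_samples, rng):
    """Estimate Tr(A^{-1}) as the average of x^T A^{-1} x over Rademacher (Z2) x."""
    N = A.shape[0]
    total = 0.0
    for _ in range(n_samples):
        x = rng.choice([-1.0, 1.0], size=N)  # entries +1 or -1, each with probability 1/2
        y = np.linalg.solve(A, x)            # stands in for an iterative solve of A y = x
        total += x @ y
    return total / n_samples

# Demo on a small SPD tridiagonal matrix (a stand-in for a large sparse A)
N = 100
A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
rng = np.random.default_rng(0)
est = hutchinson_trace_inv(A, 2000, rng)
exact = np.trace(np.linalg.inv(A))
```

With a few thousand samples the estimate typically lands within a few percent of the exact trace for this matrix, illustrating why reducing the number of samples is the main concern.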
Selecting the vectors in x^T A^{-1} x to minimize variance or error
Random:
- x ∈ Z_2^N: best variance for real matrices (Hutchinson 1989)
- x = e_i: variance depends only on diag(A^{-1})
- x = F^T e_i: mixing with F = DFT, Hadamard (Avron et al. 2010)
Deterministic:
- x = H e_i, i = 1,...,2^k: Hadamard in natural order (Bekas et al. 2007)
- x^m_i = 1 if i ∈ C_m, 0 otherwise: Probing. Assumes a multicolored graph (Tang et al. 2011)
- x = H(p_m, k_i): Hierarchical Probing for lattices (A.S., J.L. 2013). Maintains the benefits of probing but cheap and incremental
[ 6 ]
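As a small illustration of the deterministic probing idea, one can color the graph of A and form one indicator vector per color class. This sketch uses a simple greedy coloring of my own, not the hierarchical probing of the slide; `greedy_coloring` and the path-graph demo are assumptions:

```python
import numpy as np

def greedy_coloring(adj):
    """Color the graph greedily: each node gets the smallest color unused by its neighbors."""
    N = adj.shape[0]
    colors = -np.ones(N, dtype=int)
    for v in range(N):
        used = {colors[u] for u in np.nonzero(adj[v])[0] if colors[u] >= 0}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors

def probing_vectors(colors):
    """One indicator vector per color class C_m: x^m_i = 1 if i in C_m, else 0."""
    return np.stack([(colors == m).astype(float) for m in range(colors.max() + 1)])

# Demo on the graph of a tridiagonal matrix (a path graph, 2-colorable)
N = 8
A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
adj = (A != 0) & ~np.eye(N, dtype=bool)   # adjacency of the sparsity pattern
colors = greedy_coloring(adj)
X = probing_vectors(colors)               # the color classes partition the indices
```

The color classes partition the N indices, so only as many quadratic forms as colors are needed, which is the attraction of probing when the graph has few colors.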
Variance of the estimators
Rademacher vectors x_i ∈ Z_2^N:
    Tr = (1/s) Σ_{i=1..s} x_i^T A^{-1} x_i,
    Var(Tr) = (2/s) ‖Ã^{-1}‖_F^2 = (2/s) Σ_{i≠j} (A^{-1}_{ij})^2
Diagonal sampling x = e_{j(i)}:
    Tr = (N/s) Σ_{i=1..s} A^{-1}_{j(i),j(i)},
    Var(Tr) = (N^2/s) Var(diag(A^{-1}))
[Figure: magnitude and variance of the entries of A^{-1}]
Unclear which method is best a priori
[ 7 ]
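The two per-sample variance formulas can be evaluated directly for a small test matrix (a sketch; the tridiagonal SPD matrix is an assumption):

```python
import numpy as np

N = 100
A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)  # SPD test matrix
Ainv = np.linalg.inv(A)
d = np.diag(Ainv)

# Rademacher (Hutchinson): per-sample variance 2 * sum_{i != j} (A^{-1}_{ij})^2
var_z2 = 2.0 * np.sum((Ainv - np.diag(d))**2)

# Uniform diagonal sampling: per-sample variance N^2 * Var(diag(A^{-1}))
var_diag = N**2 * np.var(d)

# Dividing each by the number of samples s gives Var(Tr) for that estimator
```

For some matrices the off-diagonal mass of A^{-1} dominates and diagonal sampling wins; for others the diagonal varies wildly and Rademacher wins, which is exactly the "unclear a priori" point of the slide.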
Why focus on the diagonal method? The trace is the integral of a 1-D signal. Can we improve on Monte Carlo? Not without external information about the distribution of the diagonal elements. Our goals:
- What if we have an approximation M ≈ diag(A−1)?
- Is Tr(M) ≈ Tr(A−1) sufficient?
- If not, can we use fitting p(M) (regression/interpolation/quadrature)?
- Can the fitting reduce Var(p(M)−diag(A−1))?
[ 8 ]
Approximations to diag(A^{-1})
- Inexpensive bounds on the diagonal elements (Robinson and Wathen '92): e.g., for A SPD, 1/A_ii often captures the pattern of diag(A^{-1})
- Let [L,U] = ILU(A) (incomplete LU) and M = diag(U^{-1} L^{-1}). Requires only the A^{-1}_{i,j} entries in the sparsity pattern of L, U (Erisman and Tinney '75)
- Eigen/singular vectors: M = diag(X Λ^{-1} Y^T) for the nev smallest eigenvalues. Already available from deflating multiple right-hand sides! The number of eigenvectors can be increased while solving Ax = e_i (eigCG)
[ 9 ]
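The eigenvector-based approximation M = diag(X Λ^{-1} X^T) (symmetric case, so Y = X) can be sketched as follows; the tridiagonal test matrix and nev = 10 are assumptions:

```python
import numpy as np

N = 100
A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)  # SPD test matrix
d = np.diag(np.linalg.inv(A))      # true diag(A^{-1}), for comparison only

# M = diag(X Lambda^{-1} X^T) from the nev smallest eigenpairs (symmetric: Y = X)
nev = 10
lam, X = np.linalg.eigh(A)         # eigenvalues in ascending order
M = np.einsum('ij,j,ij->i', X[:, :nev], 1.0 / lam[:nev], X[:, :nev])

# The smallest eigenvalues dominate A^{-1}, so the residual diagonal is much flatter
var_residual = np.var(d - M)
```

Because the smallest eigenvalues carry the largest weight 1/λ in A^{-1}, deflating them leaves a residual diagonal with much smaller variance, which is what makes MC on diag(A^{-1} − M) attractive.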
For some problems M captures the pattern of diag(A^{-1}) well. Laplacian delsq(numgrid('S',34)).
[Figure: diag(A^{-1}), diag(XΛ^{-1}X^T), and diag((LU)^{-1}) vs. the index of the diagonal]
Deflation: 15 smallest eigenpairs; ILU('crout','row',0.01). The traces are not close, but Var(diag(XΛ^{-1}X^T − A^{-1})) = 4e-4 and Var(diag((LU)^{-1} − A^{-1})) = 1e-2, so MC on diag(A^{-1} − M) can be competitive with Hutchinson's method.
[ 10 ]
In some cases the approximation is pointless. Rajat10 circuit simulation matrix (size 30202).
[Figure: diag(A^{-1}) vs. diag(M) over the index of the diagonal element]
M from the 100 smallest singular triplets
[ 11 ]
Capture the pattern better by fitting M to D = diag(A^{-1}). MC resolves a shift D = c + M, but not a scale D = bM (the variance may increase!).
Approach 1. Least squares fit with bM + c:
- 1. Solve D_i = e_i^T A^{-1} e_i for i ∈ S, a set of k indices
- 2. Find [b,c] = argmin{‖D(S) − (bM(S) + c)‖^2 : b,c ∈ ℜ}
Not many points (linear systems) are needed, typically 10-20. Significant improvement in the estimation of the trace. Reduces the variance for potentially continuing with MC.
[ 12 ]
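Approach 1 can be sketched as follows. For the demo, M is constructed as an exact affine distortion of the true diagonal, so the fit should recover the trace almost exactly; the test matrix and that construction of M are assumptions:

```python
import numpy as np

N = 100
A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)  # SPD test matrix
Ainv = np.linalg.inv(A)
D = np.diag(Ainv)                  # true diag(A^{-1}); D_i = e_i^T A^{-1} e_i

# Assume an approximation M with the right pattern but wrong scale and shift
M = 0.5 * D + 0.1

# Step 1: solve D_i = e_i^T A^{-1} e_i for a small set S of k indices
k = 20
S = np.linspace(0, N - 1, k).astype(int)
e = np.eye(N)
D_S = np.array([np.linalg.solve(A, e[:, i])[i] for i in S])

# Step 2: [b, c] = argmin || D(S) - (b M(S) + c) ||^2 (degree-1 least squares)
b, c = np.polyfit(M[S], D_S, 1)

trace_est = np.sum(b * M + c)      # Tr(A^{-1}) ~= sum of the fitted diagonal
trace_true = np.trace(Ainv)
```

Only the k = 20 linear solves are paid for; the remaining N − k diagonal entries come for free from the fitted line, which is the point of the approach.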
Approach 1 example, D ≈ bM + c. Matrix RDB5000, 50 smallest singular triplets, k = 20 points used to fit. The accuracy of the systems and singular vectors is 1e-6.
[Figures: diag(A^{-1}) vs. M, and diag(A^{-1}) vs. the fitted diag(bM + c)]
Tr(A^{-1}) = −267.880, Tr(b*M + c) = −267.544, Rel. Err. = 1e-3.
Var(D) = 6.1e-3, Var(D − M) = 4.3e-4, Var(D − bM − c) = 4.3e-5.
[ 13 ]
Better fitting. The linear model preserves the shape of M and thus relies too much on the quality of M. Interpolating with a higher-degree polynomial could be noisy.
Approach 2, basic. Piecewise Cubic Hermite Spline Interpolation (PCHIP):
- 1. Solve D_i = e_i^T A^{-1} e_i for i ∈ S, a set of k indices
- 2. Fit p(M(S)) = D(S)
For PCHIP to effectively capture the pattern (global and local) of D it needs:
- smoothness of the approximant
- the elements of M(S) to appear in increasing order
- to capture the whole range of values of D
- to capture where most of the action in D is happening
[ 14 ]
Approach 2. Piecewise Cubic Hermite Spline Interpolation (PCHIP)
- 1. [M̃, J] = sort(M), to obtain a CDF-like, smooth graph
- 2. Choose Q, a set of k indices: {1,2} ∈ Q, and the remaining k−2 are chosen to minimize the trapezoidal-rule integration error of M̃. Do not consider indices that produce non-unique M̃_i values.
- 3. S = J(Q), the corresponding indices in the original ordering
- 4. Solve D_i = e_i^T A^{-1} e_i for i ∈ S
- 5. PCHIP fit p(M(S)) = D(S). Use p(M) ≈ D
If step (4) also computes eigenvectors, update the points incrementally.
[Figures: sorted M with the selected points, the computed entries A^{-1}(i,i), and the fit p(M) vs. diag(A^{-1})]
[ 18 ]
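The five steps can be sketched with SciPy's `PchipInterpolator`; the synthetic D and the monotone distortion M = sqrt(D) are assumptions standing in for diag(A^{-1}) and its approximation:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Synthetic stand-ins: D plays diag(A^{-1}); M has the right pattern, wrong nonlinear scale
N = 5000
rng = np.random.default_rng(1)
D = np.sort(rng.uniform(1.0, 2.0, N))
M = np.sqrt(D)                     # monotone distortion of D

# Steps 1-3: sort M (CDF-like graph) and choose k indices including both endpoints
J = np.argsort(M)
k = 20
Q = np.unique(np.linspace(0, N - 1, k).astype(int))
S = J[Q]

# Step 4: "solve" for D_i at the selected indices (read off directly in this demo)
D_S = D[S]

# Step 5: PCHIP fit p(M(S)) = D(S); PCHIP needs strictly increasing, unique abscissae
xs, idx = np.unique(M[S], return_index=True)
p = PchipInterpolator(xs, D_S[idx])
trace_est = float(np.sum(p(M)))    # use p(M) ~= D
trace_true = float(np.sum(D))
```

Because PCHIP is shape-preserving on the sorted, monotone data, the fit recovers the nonlinear relation between M and D that the linear model bM + c cannot.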
Approach 2 (PCHIP): improvement over bM + c
[Figures: three zoomed regions of the sorted-M plot, comparing diag(A^{-1}), p(M), and bM + c]
[ 19 ]
Fitting examples, nev = k = 100: OLM5000, SiNa, KUU
[Figures: for each matrix, diag(A^{-1}), M, and the fit p(M) over the index of sorted M]
[ 20 ]
Very good eigenvalue approximation
[Figures: relative error of the trace vs. the number of deflated singular vectors for OLM5000, RDB5000, SiNa, and KUU, comparing bM + c and p(M) (log scale)]
[ 21 ]
Variance: Z2 on E = A^{-1} − YΣ^{-1}X^T vs. MC on the diagonal of D − p(M)
[Figures: variance vs. the number of deflated eigenvectors or singular vectors for OLM5000, RDB5000, SiNa, and KUU, comparing var(D − p(M)), Z2 on A^{-1}, Z2 on E, and var(A − aM) (log scale)]
[ 22 ]
When to use it? Estimate dynamically:
- 1. Relative trace error. Cross validation: (a) use m subsets S_i ⊂ S; (b) fit p(M̃(S_i)) and compute the mean error ε_i on the S − S_i points; (c) confidence interval for the error: ±2 sqrt(Var(ε_i))
- 2. Variance of (D − p(M)) vs. Z2 on E: (a) compute a_j = A^{-1} e_j, j ∈ S; (b) based on a_jj, update the estimates for var(D), var(D − M), var(D − p(M)); (c) based on a_ij and µ_i = YΣ^{-1}X^T e_i, update the Hutchinson variance estimates var(A) = 2‖A‖_F^2 and var(A − YΣ^{-1}X^T)
Large differences between the various methods would show after a few points.
[ 23 ]
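The cross-validation error estimate of item 1 can be sketched as follows (the synthetic sampled data, the fold scheme, and the noise level are assumptions):

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

rng = np.random.default_rng(2)

# Synthetic sampled points: sorted M(S) with computed, slightly noisy D(S) values
k = 30
M_S = np.sort(rng.uniform(1.0, 2.0, k))
D_S = M_S**2 + 0.01 * rng.normal(size=k)

# Cross validation: m folds; fit on S_i, measure the mean error on the held-out points
m = 5
errs = []
for fold in range(m):
    train = np.ones(k, dtype=bool)
    train[fold::m] = False             # hold out every m-th point
    train[[0, k - 1]] = True           # keep the endpoints so p covers the range
    p = PchipInterpolator(M_S[train], D_S[train])
    held = ~train
    if held.any():
        errs.append(np.mean(np.abs(p(M_S[held]) - D_S[held])))

# Confidence interval half-width for the fitting error: 2 * sqrt(Var(eps_i))
ci_half_width = 2.0 * np.sqrt(np.var(errs))
```

The held-out errors ε_i are cheap by-products of the points already solved, so the confidence interval can be monitored as sampling proceeds, with no extra linear solves.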
Dynamically identifying the smallest variance. The estimated variance converges to the actual variance; relative differences are apparent almost immediately.
[Figures: estimated vs. actual Var(diag(E2)), Var(A^{-1}), and Var(E) vs. the number of fitting points, for matrices cage8 and bwm200 (log scales)]
[ 24 ]
Dynamically identifying the smallest variance. If a total of s steps is allowed, which method will give the smallest error at step s? E.g., the matb5 QCD matrix:
[Figure: predicted error at the 200th quadrature vs. the number of points used in fitting, for Rademacher on A, Rademacher on E, and the unit basis on E2]
After 10 steps, excellent match between the estimated and observed variances.
[ 25 ]
Conclusions. If M approximates D qualitatively well, our technique combines deterministic regression and stochastic estimation to achieve good accuracy on Σ D_i with as few samples as possible.
- Most eigenvectors are a by-product of solving the right-hand sides (samples).
- Fitting achieves good accuracy early, given good eigenvalue approximations (less expensive than MC).
- Fitting may or may not improve the variance.
- Dynamic monitoring is possible. Some improvements are needed.