Numerical analysis and random matrix theory
Tom Trogdon ttrogdon@math.uci.edu UC Irvine
Acknowledgements. This is joint work with: Percy Deift, Govind Menon, Sheehan Olver, Raj Rao.
Trotter (1984), Silverstein (1985), Edelman (1989), Dumitriu and Edelman (2002)
Bornemann (2008), Witte, Bornemann and Forrester (2012), Olver and T (2014)
Mezzadri (2006), Edelman and Rao (2005), Menon and Li (2014), Olver, Rao and T (2015)
Spielman and Teng (2009), Borgwardt (2012), Smale (1983, 1985), Deift and T (2016, 2017), Menon and T (2016), Feldheim, Paquette, and Zeitouni (2014)
Liberty, Woolfe, Martinsson, Rokhlin, and Tygert (2007)
Consider the numerical solution of the system (I − H)x = b, H = H∗, ‖H‖₂ < 1.

The ensemble. Assume H = U diag(λ1, λ2, ..., λN) U∗ where (λj)1≤j≤N are iid with

$$\mathbb{P}(\lambda_j \in B) = \int_{B \cap [-1,1]} p(x)\, dx.$$

The method. A naïve iterative scheme is given by xn = Hxn−1 + b, x0 = 0.

The halting time. To keep things simple, define T = T(H, ε) to be the smallest integer T such that

$$\left\| (I - H)^{-1} - \sum_{k=0}^{T-1} H^k \right\|_2 < \varepsilon.$$

The random variable T is the halting time.

The question. What is the distribution of T? Is it universal as N → ∞?
Define λmin = min λj, λmax = max λj. Then for fixed t ≥ 0, as N → ∞,

$$\mathbb{P}\left(1 - \tfrac{t}{N} \le \lambda_j \le 1\right) = p(1)\tfrac{t}{N} + o(N^{-1}), \qquad \mathbb{P}\left(-1 \le \lambda_j \le -1 + \tfrac{t}{N}\right) = p(-1)\tfrac{t}{N} + o(N^{-1}).$$

Assume p(1) ≠ 0. Then as N → ∞,

$$p(1) N (1 - \lambda_{\max}) \xrightarrow{\text{dist.}} \mathrm{Exp}(1), \qquad N(1 + \lambda_{\min}) \xrightarrow{\text{dist.}} \mathrm{Exp}(p(-1)).$$

Because H is Hermitian,

$$\left\| (I - H)^{-1} - \sum_{k=0}^{n-1} H^k \right\|_2 = \max\left\{ \frac{|\lambda_{\max}|^n}{|1 - \lambda_{\max}|},\ \frac{|\lambda_{\min}|^n}{|1 - \lambda_{\min}|} \right\}.$$

For ε > 0, define t = t(H, ε) by

$$\max\left\{ \frac{|\lambda_{\max}|^t}{|1 - \lambda_{\max}|},\ \frac{|\lambda_{\min}|^t}{|1 - \lambda_{\min}|} \right\} = \varepsilon.$$

Then T(H, ε) = max{0, ⌈t(H, ε)⌉} and

$$t(H,\varepsilon) = \max\left\{ \frac{\log \varepsilon^{-1}|1-\lambda_{\max}|^{-1}}{\log |\lambda_{\max}|^{-1}},\ \frac{\log \varepsilon^{-1}|1-\lambda_{\min}|^{-1}}{\log |\lambda_{\min}|^{-1}} \right\}.$$
Define

$$t_{\max} = \frac{\log \varepsilon^{-1}|1 - \lambda_{\max}|^{-1}}{\log |\lambda_{\max}|^{-1}}, \qquad t_{\min} = \frac{\log \varepsilon^{-1}|1 - \lambda_{\min}|^{-1}}{\log |\lambda_{\min}|^{-1}}.$$

Two estimates are needed as N → ∞:

$$\frac{t_{\max}}{p(1) N \log[N\varepsilon^{-1}]} \xrightarrow{\text{dist.}} \frac{1}{\xi}, \quad \xi \sim \mathrm{Exp}(1), \qquad t_{\min} = O(N).$$

Because tmin is asymptotically smaller than tmax, it does not affect the limit:

$$\frac{T(H,\varepsilon)}{p(1) N \log[N\varepsilon^{-1}]} \xrightarrow{\text{dist.}} \frac{1}{\xi}, \quad \xi \sim \mathrm{Exp}(1).$$
This establishes universality for T(H, ε) in this simple case. It required three main components: the limiting fluctuations of the extreme eigenvalues, an exact formula for T(H, ε) in terms of the spectrum, and asymptotic estimates for tmax and tmin.
[Figure: two choices for the density p(x).]
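The scaling above is easy to check numerically. The sketch below is illustrative (it assumes the uniform density p(x) = 1/2 on [−1, 1], so p(1) = 1/2, a choice not made on the slides): the spectrum is sampled directly and T(H, ε) is evaluated from its closed-form expression.

```python
import numpy as np

def halting_time(lam, eps):
    """T(H, eps) from the closed-form expression: the error after n steps
    is max over the extreme eigenvalues of |lam|^n / |1 - lam|."""
    lmax, lmin = lam.max(), lam.min()
    tmax = np.log(1.0 / (eps * abs(1 - lmax))) / np.log(1.0 / abs(lmax))
    tmin = np.log(1.0 / (eps * abs(1 - lmin))) / np.log(1.0 / abs(lmin))
    return max(0, int(np.ceil(max(tmax, tmin))))

rng = np.random.default_rng(0)
N, eps = 500, 1e-8
# iid eigenvalues with density p(x) = 1/2 on [-1, 1], so p(1) = 1/2
T = np.array([halting_time(rng.uniform(-1, 1, N), eps) for _ in range(400)])
# Rescaled by p(1) N log[N eps^{-1}], the samples should behave like 1/xi
scaled = T / (0.5 * N * np.log(N / eps))
```

A histogram of `scaled` can then be compared against the density of 1/ξ, ξ ∼ Exp(1) (up to the finite-N contribution of tmin).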
Significant work goes into determining the scaling. If the scaling is not known but universality is to be investigated, a normalization must be inferred from the data. Given a collection of samples of the random variable T = T(H, ε), define the empirical fluctuations

$$\tau(H,\varepsilon) = \frac{T(H,\varepsilon) - \langle T \rangle}{\sigma_T},$$

where ⟨T⟩ is the sample median and σT is the median absolute deviation.
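The empirical normalization is a one-liner; a minimal sketch (the sample values are made up for illustration):

```python
import numpy as np

def fluctuations(T):
    """Empirical fluctuations tau = (T - <T>) / sigma_T, where <T> is the
    sample median and sigma_T is the median absolute deviation."""
    T = np.asarray(T, dtype=float)
    med = np.median(T)
    mad = np.median(np.abs(T - med))
    return (T - med) / mad

tau = fluctuations([12, 15, 11, 20, 14, 13, 16])  # median 14, MAD 2
```

The median/MAD pair is used instead of mean/standard deviation because it is robust to the heavy tails that halting times can exhibit.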
The question. What is the distribution of T? Is it universal as N → ∞?

The method. The power method is given by (x0 specified)

$$\mu_k = \frac{x_0^* H^{2k-1} x_0}{x_0^* H^{2k-2} x_0} \to \lambda_N = \lambda_{\max}, \quad k \to \infty.$$

The halting time. The halting time is given by T(H, x0, ε) = min{n : |μn − μn−1| < ε²}.

The ensembles. A sample covariance matrix (SCM) is an N × N real symmetric (β = 1) or complex Hermitian (β = 2) matrix

$$H = X^*X/M, \quad X = (X_{ij})_{1 \le i \le M,\, 1 \le j \le N} \ \text{iid}, \quad M \sim N/d, \quad 0 < d < 1, \quad \mathbb{E}X_{ij} = 0, \quad \mathbb{E}|X_{ij}|^2 = 1.$$

For β = 2, $\mathbb{E}X_{ij}^2 = 0$ is imposed. We require uniform exponential tails.
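This halting time can be simulated directly. A minimal sketch for a complex (β = 2) SCM, with illustrative parameters (N = 50, d = 1/2, ε = 10⁻³ are assumptions, not from the slides): since H is Hermitian, μk is the Rayleigh quotient of v = H^{k−1}x0, so the iterate can be renormalized at every step without changing μk.

```python
import numpy as np

def sample_scm(N, d, rng):
    # Complex (beta = 2) sample covariance matrix H = X* X / M, M ~ N/d
    M = int(np.ceil(N / d))
    X = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    return X.conj().T @ X / M

def power_halting_time(H, x0, eps):
    """T(H, x0, eps) = min{n : |mu_n - mu_{n-1}| < eps^2}, where mu_k is
    the Rayleigh quotient of v = H^{k-1} x0 (renormalized for stability)."""
    v = x0 / np.linalg.norm(x0)
    mu_prev = np.vdot(v, H @ v).real          # mu_1 (v = H^0 x0)
    for n in range(2, 10**6):
        v = H @ v
        v /= np.linalg.norm(v)
        mu = np.vdot(v, H @ v).real           # mu_n
        if abs(mu - mu_prev) < eps**2:
            return n, mu
        mu_prev = mu

rng = np.random.default_rng(1)
N = 50
H = sample_scm(N, 0.5, rng)
x0 = rng.standard_normal(N) + 1j * rng.standard_normal(N)
T, mu = power_halting_time(H, x0, 1e-3)
lam_max = np.linalg.eigvalsh(H).max()
```

At halting, `mu` sits just below `lam_max` (the Rayleigh quotient never exceeds it), and repeating this over many samples of (H, x0) produces the histograms discussed below.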
Using 4 complex SCMs, the fluctuations τ(H, ε) = (T(H, ε) − ⟨T⟩)/σT appear universal.
Let H = UΛU∗ and U∗x0 = [β1, β2, ..., βN]ᵀ, Λ = diag(λ1, λ2, ..., λN), 0 ≤ λ1 ≤ λ2 ≤ ··· ≤ λN. To estimate the halting time for the power method, we must find the large N and large k asymptotics for

$$\mu_k = \frac{\displaystyle\sum_{j=1}^N |\beta_j|^2 \lambda_j^{2k-1}}{\displaystyle\sum_{j=1}^N |\beta_j|^2 \lambda_j^{2k-2}} = \lambda_N \left( \frac{1 + \displaystyle\sum_{j=1}^{N-1} \left|\frac{\beta_j}{\beta_N}\right|^2 \left(\frac{\lambda_j}{\lambda_N}\right)^{2k-1}}{1 + \displaystyle\sum_{j=1}^{N-1} \left|\frac{\beta_j}{\beta_N}\right|^2 \left(\frac{\lambda_j}{\lambda_N}\right)^{2k-2}} \right).$$
For eigenvalues:
- The largest eigenvalue of an SCM converges a.s. to the right edge λ+ of the limiting spectrum, and the smallest to λ−, for iid standard normal random variables.
- The fluctuations of the largest and smallest eigenvalues are known for (real or complex) standard normal distributions (see also Tao and Vu (2012)).
- See Bloemendal, Knowles, Yau and Yin (2016) for the most comprehensive results.

For eigenvectors:
- See Silverstein (1986); Bai, Miao and Pan (2007).

We use these results to prove our theorems. We require exponential tails, which is stronger than the assumptions in Yin (1986); Geman (1980).
Theorem.

For SCMs, $N^{1/2}(|\beta_N|, |\beta_{N-1}|, |\beta_{N-2}|)$ converges jointly in distribution to (|X1|, |X2|, |X3|), where {X1, X2, X3} are iid real (β = 1) or complex (β = 2) standard normal random variables. Additionally,

$$N^{2/3} \lambda_+^{-2/3} d^{1/2} \left(\lambda_+ - \lambda_N,\ \lambda_+ - \lambda_{N-1},\ \lambda_+ - \lambda_{N-2}\right)$$

converges jointly in distribution to random variables (Λ1,β, Λ2,β, Λ3,β), which are the smallest three eigenvalues of the so-called stochastic Airy operator. This follows from Ramírez, Rider and Virág (2011); Pillai and Yin (2014); Bloemendal, Knowles, Yau and Yin (2016).
The first theorem is known as rigidity, and it was first shown for Wigner ensembles by Erdős, Yau and Yin. The second theorem is known as delocalization.
Theorem (Pillai and Yin (2014)).

For any s > 0,

$$\lim_{N\to\infty} \mathbb{P}\left( |\lambda_n - \gamma_n| \le N^{-2/3+s} \left(\max\{n, N - n + 1\}\right)^{-1/3} \text{ for all } n \right) = 1.$$
Theorem (Bloemendal, Knowles, Yau and Yin (2016)).

For any s > 0,

$$\lim_{N\to\infty} \mathbb{P}\left( |\beta_n| \le N^{-1/2+s} \text{ for all } n \right) = 1.$$
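The delocalization bound is easy to probe numerically; a minimal sketch (the parameters N = 400, d = 1/2 are illustrative assumptions, and x0 is taken uniform on the sphere as in the setup above, so β = U∗x0 is itself a random unit vector):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 400, 0.5
M = int(N / d)
# Complex (beta = 2) SCM and its eigenvector matrix U
X = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
H = X.conj().T @ X / M
_, U = np.linalg.eigh(H)

# Random unit vector x0 and its coefficients beta = U* x0
x0 = rng.standard_normal(N) + 1j * rng.standard_normal(N)
x0 /= np.linalg.norm(x0)
beta = U.conj().T @ x0

# Delocalization: every |beta_n| should be O(N^{-1/2+s});
# the worst entry exceeds N^{-1/2} only by a logarithmic factor
ratio = np.abs(beta).max() * np.sqrt(N)
```

For N = 400 the ratio is a small constant (of order √(2 log N)), consistent with the N^{−1/2+s} bound.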
The distribution function F^gap_β(t), supported on t ≥ 0 for β = 1, 2, is defined by

$$F^{\mathrm{gap}}_\beta(t) := \lim_{N\to\infty} \mathbb{P}\left( \frac{1}{2^{-7/6} N^{2/3} \lambda_+^{-2/3} d^{-1/2} (\lambda_N - \lambda_{N-1})} \le t \right).$$
Theorem (Deift and T).

Let H be a real (β = 1) or complex (β = 2) SCM and let x0 be a random unit vector independent of H. Assuming ε ≤ N^{−5/3−σ},

$$\lim_{N\to\infty} \mathbb{P}\left( \frac{T(H, x_0, \varepsilon)}{2^{-7/6} \lambda_+^{1/3} d^{-1/2} N^{2/3} \left(\log \varepsilon^{-1} - \tfrac{2}{3}\log N\right)} \le t \right) = F^{\mathrm{gap}}_\beta(t).$$
In the proof we give the following approximation:

$$\frac{T(H, x_0, \varepsilon)}{N^{2/3} \log N} - \frac{\log \varepsilon^{-1} - \frac{2}{3}\log N}{N^{2/3} \log N \, \log\!\left(\lambda_N \lambda_{N-1}^{-1}\right)} \xrightarrow{\text{prob.}} 0.$$

From this we obtain:

$$\frac{T(H, x_0, \varepsilon)}{N^{2/3}} - \frac{\log \varepsilon^{-1} + \log\!\left(1 - \frac{\lambda_{N-1}}{\lambda_N}\right) + \frac{1}{2}\log 2N + \log\!\left|\frac{\beta_N}{\beta_{N-1}}\right|}{N^{2/3} \log\!\left(\lambda_N \lambda_{N-1}^{-1}\right)} \xrightarrow{\text{prob.}} 0.$$

Convergence is at an (at best) logarithmic rate.
[Figure: rescaled halting time histograms for β = 1 and β = 2.]
$$\lim_{N\to\infty} \mathbb{P}\left( \frac{T(H, x_0, \varepsilon)}{2^{-7/6} \lambda_+^{1/3} d^{-1/2} N^{2/3} \left(\log \varepsilon^{-1} - \tfrac{2}{3}\log N + \gamma\right)} \le t \right) = F^{\mathrm{gap}}_\beta(t).$$
Thanks to Folkmar Bornemann (TUM)
Similar techniques give universality theorems for:
See Deift and T (2016) and Deift and T (2017).
w/ Percy Deift, Govind Menon and Sheehan Olver
QR algorithm with “simple halting” / QR algorithm with “realistic halting”
[Figures: rescaled halting time histograms for β = 1 and β = 2; halting time fluctuations for GOE, GUE, CoshUE, QUE, BE with Dim(H) = 150.]
w/ Percy Deift, Govind Menon and Sheehan Olver
[Figures: halting time fluctuation histograms for matrix sizes 100 and 500.]
Conjugate gradient algorithm / GMRES algorithm
w/ Levent Sagun and Yann LeCun
[Figures: halting time fluctuation histograms. Universal scaling, N = 400: Gaussian, Bernoulli, Uniform. Universality in fully connected networks with SGD: fully connected MNIST; fully connected random; MNIST on convnet; MNIST on norm condition.]
- Gradient descent for spin glasses
- Stochastic gradient descent in machine learning
- Google search times
The ensemble. An LUE matrix is an N × N complex Hermitian matrix H = X∗X/M, X = (Xij)1≤i≤M, 1≤j≤N iid standard complex normals. Take M ∼ N/d, d = 1 − N^{−α}, 1/2 ≤ α < 1.

The method. The conjugate gradient algorithm is an iterative method to solve Hx = b: 0 = x0 → x1 → ··· → xk → ··· → xN = x.

The halting time. The halting time (x0 = 0) is given by T(H, ε) = min{k : ‖Hxk − b‖₂ < ε}.

The problem. For a deterministic matrix A > 0, estimate the moments E[Tʲ(A + σ²H, ε)].
Here we have chosen d = 1 − N^{−α}. The rationale for this is the following:
- If d < 1 is fixed, the matrix is well-conditioned (bounded condition number).
- If M = N (d = 1), the condition number is unbounded (heavy-tailed & unconcentrated).

The case where M = N + α, α > 0, N → ∞ then α → ∞ was studied by Borodin & Forrester (2002). This describes the transition from the ill-conditioned to the well-conditioned case (from heavy-tailed to sub-exponential distributions).
If d = 1 − (4c)^{−1/2} N^{−1/2}, then (see Deift, Menon & T. (2017))

$$\lim_{N\to\infty} \mathbb{P}\left( \frac{\lambda_{\max}/\lambda_{\min} - 4Nc^{-1}}{4 c^{-4/3} N^{2/3}} \le t \right) = F_2(t) \quad \text{(the Tracy–Widom distribution)}.$$

The condition number grows with N, but the (limit) distribution has sub-exponential tails, giving a finite-N matrix that exhibits the Borodin–Forrester transition.
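The CG halting time on this ensemble can be sampled directly. A minimal sketch with a hand-rolled CG iteration (the parameter choices N = 200, α = 1/2, ε = 10⁻⁸ and the unit vector b are illustrative assumptions, not from the slides):

```python
import numpy as np

def cg_halting_time(H, b, eps, max_iter=None):
    """Conjugate gradient for H x = b with x0 = 0;
    T(H, eps) = min{k : ||H x_k - b||_2 < eps}."""
    n = len(b)
    max_iter = max_iter or 20 * n
    x = np.zeros(n, dtype=H.dtype)
    r = b.astype(H.dtype).copy()
    p = r.copy()
    rr = np.vdot(r, r).real
    for k in range(1, max_iter + 1):
        Hp = H @ p
        a = rr / np.vdot(p, Hp).real       # step length (H is positive definite)
        x += a * p
        r -= a * Hp
        if np.linalg.norm(H @ x - b) < eps:
            return k, x
        rr_new = np.vdot(r, r).real
        p = r + (rr_new / rr) * p          # new conjugate search direction
        rr = rr_new
    return max_iter, x

rng = np.random.default_rng(3)
N, alpha = 200, 0.5
d = 1 - N ** (-alpha)
M = int(np.ceil(N / d))
X = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
H = X.conj().T @ X / M                     # LUE matrix, M ~ N/d
b = np.zeros(N, dtype=complex)
b[0] = 1.0
T, x = cg_halting_time(H, b, 1e-8)
```

With α = 1/2 the condition number grows like N^{2α}, so the iteration count is well below the exact-arithmetic bound of N steps, in line with the moment estimates below.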
[Figures: halting time histograms for M = N, M = N + 2√N, and M = 2N.]
In particular,

$$\sup_{\|A\| \le 1,\, A \ge 0} \mathbb{E}[T(A + \sigma^2 H, \varepsilon)] = O\left( N^\alpha \left(1 + \frac{1}{\sigma^2}\right) \log\left[ N^\alpha \left(1 + \frac{1}{\sigma^2}\right) \varepsilon^{-1} \right] \right),$$

and the right-hand side is less than N (for sufficiently large N).
Theorem (Menon and T (2016)).

As N → ∞, for 1/2 ≤ α < 1,

$$\sup_{\|A\| \le 1,\, A \ge 0} \mathbb{E}[T^j(A + \sigma^2 H, \varepsilon)] = O\left( N^{\alpha j} \left(1 + \frac{1}{\sigma^2}\right)^j \log^j\left[ N^\alpha \left(1 + \frac{1}{\sigma^2}\right) \varepsilon^{-1} \right] \right).$$

The proof uses tail estimates on the condition number derived from the Riemann–Hilbert analysis in Deift, Menon, and T (2015).
Let xk be the approximation of x, Hx = b, found by CG. It is known (see Greenbaum, Pták & Strakoš (1996)) that any non-increasing sequence for ‖xk − x‖H/‖x0 − x‖H, k = 0, 1, ..., N − 1, is attainable by choosing H appropriately. In particular, there exists a pathological case where ‖xk − x‖H/‖x0 − x‖H = 1, k = 0, 1, ..., N − 1, and xN = x. The probability of such a pathological case decays faster than N^{−k} for any k > 0.
References
[Bai et al., 2007] Bai, Z. D., Miao, B. Q., and Pan, G. M. (2007). On asymptotics of eigenvectors of large sample covariance matrix. Ann. Probab., 35(4):1532–1572.
[Ben Arous and Péché, 2005] Ben Arous, G. and Péché, S. (2005). Universality of local eigenvalue statistics for some sample covariance matrices. Comm. Pure Appl. Math., 58(10):1316–1357.
[Bloemendal et al., 2016] Bloemendal, A., Knowles, A., Yau, H.-T., and Yin, J. (2016). On the principal components of sample covariance matrices. Probab. Theory Relat. Fields, 164(1-2):459–552.
[Borgwardt, 1987] Borgwardt, K. H. (1987). The Simplex Method: A Probabilistic Analysis. Springer-Verlag, Berlin, Heidelberg.
[Bornemann, 2010] Bornemann, F. (2010). On the numerical evaluation of Fredholm determinants. Math. Comput., 70(4):871–915.
[Deift and Trogdon, 2016] Deift, P. and Trogdon, T. (2016). Universality for the Toda algorithm to compute the largest eigenvalue of a random matrix.
[Deift and Trogdon, 2017] Deift, P. and Trogdon, T. (2017). Universality for eigenvalue algorithms on sample covariance matrices.
[Deift et al., 2016] Deift, P. A., Menon, G., and Trogdon, T. (2016). On the condition number of the critically-scaled Laguerre Unitary Ensemble. Discrete Contin. Dyn. Syst.
[Dumitriu and Edelman, 2002] Dumitriu, I. and Edelman, A. (2002). Matrix models for beta ensembles. J. Math. Phys., 43(11):5830.
[Edelman, 1988] Edelman, A. (1988). Eigenvalues and condition numbers of random matrices. SIAM J. Matrix Anal. Appl., 9(4):543–560.
[Edelman and Rao, 2005] Edelman, A. and Rao, N. R. (2005). Random matrix theory. Acta Numer., 14:233–297.
[Forrester, 1993] Forrester, P. J. (1993). The spectrum edge of random matrix ensembles. Nucl. Phys. B, 402(3):709–728.
[Geman, 1980] Geman, S. (1980). A limit theorem for the norm of random matrices. Ann. Probab., 8(2):252–261.
[Hough et al., 2006] Hough, J. B., Krishnapur, M., Peres, Y., and Virág, B. (2006). Determinantal processes and independence. Probab. Surv., 3:206–229.
[Johansson, 2000] Johansson, K. (2000). Shape fluctuations and random matrices. Comm. Math. Phys., 209(2):437–476.
[Johnstone, 2001] Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Stat., 29(2):295–327.
[Li and Menon, 2013] Li, X. H. and Menon, G. (2013). Numerical solution of Dyson Brownian motion and a sampling scheme for invariant matrix ensembles.
[Menon and Trogdon, 2016] Menon, G. and Trogdon, T. (2016). Smoothed analysis for the conjugate gradient algorithm. Symmetry Integr. Geom. Methods Appl. (SIGMA).
[Mezzadri, 2006] Mezzadri, F. (2006). How to generate random matrices from the classical compact groups. Not. AMS, 54:592–604.
[Olver et al., 2015] Olver, S., Rao, N. R., and Trogdon, T. (2015). Sampling unitary ensembles. Random Matrices Theory Appl., 4(1):1550002.
[Olver and Trogdon, 2013] Olver, S. and Trogdon, T. (2013). Numerical solution of Riemann–Hilbert problems: random matrix theory and orthogonal polynomials.
[Pfrang et al., 2014] Pfrang, C. W., Deift, P., and Menon, G. (2014). How long does it take to compute the eigenvalues of a random symmetric matrix? Random matrix theory, Interact. Part. Syst. Integr. Syst., MSRI Publ., 65:411–442.
[Pillai and Yin, 2014] Pillai, N. S. and Yin, J. (2014). Universality of covariance matrices. Ann. Appl. Probab., 24(3):935–1001.
[Scardicchio et al., 2009] Scardicchio, A., Zachary, C. E., and Torquato, S. (2009). Statistical properties of determinantal point processes in high-dimensional Euclidean spaces. Phys. Rev. E, 79(4):041108.
[Silverstein, 1986] Silverstein, J. (1986). Eigenvalues and eigenvectors of large dimensional sample covariance matrices. Contemp. Math., 50.
[Silverstein, 1985] Silverstein, J. W. (1985). The smallest eigenvalue of a large dimensional Wishart matrix. Ann. Probab., 13(4):1364–1368.
[Smale, 1983] Smale, S. (1983). On the average number of steps of the simplex method of linear programming. Math. Program., 27(3):241–262.
[Smale, 1985] Smale, S. (1985). On the efficiency of algorithms of analysis. Bull. Amer. Math. Soc., 13(2):87–121.
The problem: Generate a sample from the N × N Hermitian random matrix with distribution

$$\frac{1}{Z_N} e^{-N \operatorname{tr} V(H)}\, dH.$$

The density on the eigenvalues is

$$\frac{1}{\hat{Z}_N} \prod_{i<j} |\lambda_j - \lambda_i|^2 \, e^{-N \sum_i V(\lambda_i)}\, d\lambda = \frac{1}{\hat{Z}_N} e^{-2N H_N(\lambda)}\, d\lambda.$$

If V(x) is not quadratic (say, V(x) = x⁴), then the entries of the matrix are not independent.
One approach: Numerically solve the Dyson BM SDEs:

$$d\lambda = -\nabla H_N(\lambda)\, dt + \frac{1}{\sqrt{N}}\, dW.$$

See Menon and Li (2014).

Our approach: Exploit the determinantal structure:

$$\frac{1}{\hat{Z}_N} \prod_{i<j} |\lambda_j - \lambda_i|^2 \, e^{-N \sum_i V(\lambda_i)}\, d\lambda = \frac{1}{N!} \det\left[ K_N(\lambda_i, \lambda_j) \right] d\lambda, \qquad K_N(\lambda_i, \lambda_j) = \sum_n p_n(\lambda_i) p_n(\lambda_j) e^{-\frac{1}{2}NV(\lambda_i) - \frac{1}{2}NV(\lambda_j)}.$$
Assume $\varphi_n(x) = p_n(x)\, e^{-\frac{1}{2}NV(x)}$ can be evaluated for n = 0, 1, ..., N − 1.
Find the “first” eigenvalue: Use inverse transform sampling to sample the density ρ1(x) := (1/N) K_N(x, x) → λ1.

Find and sample the new density:

$$\rho_2(x) \propto \det \begin{bmatrix} K_N(x, x) & K_N(x, \lambda_1) \\ K_N(\lambda_1, x) & K_N(\lambda_1, \lambda_1) \end{bmatrix}, \qquad \rho_2(x) \to \lambda_2.$$

Find and sample the new density:

$$\rho_3(x) \propto \det \begin{bmatrix} K_N(x, x) & K_N(x, \lambda_1) & K_N(x, \lambda_2) \\ K_N(\lambda_1, x) & K_N(\lambda_1, \lambda_1) & K_N(\lambda_1, \lambda_2) \\ K_N(\lambda_2, x) & K_N(\lambda_2, \lambda_1) & K_N(\lambda_2, \lambda_2) \end{bmatrix}, \qquad \rho_3(x) \to \lambda_3.$$
Given φ1(x) := (φn(x))0≤n≤N−1 and the sample (1/N) K_N(x, x) = (1/N) φ1(x)ᵀ φ1(x) → λ1, let U be an N × (N − 1) orthogonal matrix whose columns form an orthonormal basis for the orthogonal complement of {φ1(λ1)}:

$$\varphi_1(\lambda_1)^T U = 0, \qquad U^T U = I, \qquad \rho_2(x) = \frac{1}{N-1}\, \varphi_2(x)^T \varphi_2(x), \qquad \varphi_2(x) = U^T \varphi_1(x).$$

For N large, this is not a stable procedure. By implementing the work of Hough, Krishnapur, Peres and Virág (2006) and Scardicchio, Zachary and Torquato (2009), the conditional densities can be computed using stable methods from linear algebra.
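The first step of the scheme, inverse transform sampling from ρ1(x) = K_N(x, x)/N, can be sketched for the Gaussian case (an illustrative assumption: V(x) = x² without the N-scaling used on the slides, so the φn are the classical orthonormal Hermite functions, built from the standard three-term recurrence):

```python
import numpy as np

def hermite_functions(x, N):
    """phi_n(x) = p_n(x) e^{-x^2/2}, n = 0..N-1, for the orthonormal
    Hermite polynomials, via the stable three-term recurrence."""
    phi = np.zeros((N, len(x)))
    phi[0] = np.pi ** (-0.25) * np.exp(-x ** 2 / 2)
    if N > 1:
        phi[1] = np.sqrt(2.0) * x * phi[0]
    for n in range(1, N - 1):
        phi[n + 1] = np.sqrt(2.0 / (n + 1)) * x * phi[n] \
                     - np.sqrt(n / (n + 1.0)) * phi[n - 1]
    return phi

def sample_rho1(N, n_samples, rng):
    # rho_1(x) = K_N(x, x) / N, with K_N(x, x) = sum_n phi_n(x)^2
    x = np.linspace(-6, 6, 4001)
    rho = (hermite_functions(x, N) ** 2).sum(axis=0) / N
    cdf = np.cumsum(rho)
    cdf /= cdf[-1]                      # normalized CDF on the grid
    u = rng.random(n_samples)
    return np.interp(u, cdf, x)         # inverse transform sampling

rng = np.random.default_rng(4)
samples = sample_rho1(5, 2000, rng)
```

The histogram of `samples` reproduces the N = 5 mean eigenvalue density; the conditional steps for ρ2, ρ3 then replace ρ1 by the determinantal ratios above.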
[Figure: sampling with V(x) = x⁸ and N = 5.]

Note that H = UΛU∗.