SLIDE 1
SOFIA 2013
Vassil Alexandrov (ICREA-BSC)
On Scalability Behaviour of Monte Carlo Sparse Approximate Inverse and Hybrid Algorithms for Matrix Computations
SLIDE 2
Overview
- About BSC
- HPC trends
- Monte Carlo – History and Motivation
- Parallel Algorithm
- Scaling behaviour
- Monte Carlo, SPAI and MSPAI
- Experimental results
- Conclusions
SLIDE 3
BSC OVERVIEW
SLIDE 4 Barcelona Supercomputing Center
The BSC-CNS (over 350 researchers) objectives:
- R&D in Computer Sciences, Life Sciences and Earth Sciences
- Supercomputing support to external research
- Application scopes: Earth Sciences, Astrophysics, Engineering, Physics, Life Sciences
- Compilers and tuning of application kernels
- Programming models and performance tuning tools
- Architectures and hardware technologies
- Extreme Computing: solving problems with uncertainty, exascale computing
SLIDE 5
BSC
- Four joint centres/institutes: IBM, Intel, Microsoft, CUDA
- Severo Ochoa excellence award from the Spanish Government
- PRACE - PATC training centre
- 62 European and national grants
SLIDE 6
MareNostrum III
IBM iDataPlex cluster with 3028 compute nodes
- Peak Performance of 1 Petaflops
- 48,448 Intel SandyBridge-EP E5-2670 cores at 2.6 GHz
- Two 8 core CPUs per node (16 cores/node)
- 94.625 TB of main memory (32 GB/node)
- 1.9 PB of disk storage
- Interconnection networks:
- Infiniband
- Gigabit Ethernet
- Operating system: Linux (SuSE distribution)
- Consisting of 36 racks
- Footprint: 120 m²
Completed system: 48,448 cores, predicted to be in the top 25
SLIDE 7
MinoTauro
NVIDIA GPU cluster with 128 Bull B505 blades
- 2 Intel E5649 6-core processors at 2.53 GHz per node; 5,544 cores in total
- 2 M2090 NVIDIA GPU Cards
- 24 GB of Main memory
- Peak Performance: 185.78 TFlops
- 250 GB SSD (solid-state disk) as local storage
- 2 InfiniBand QDR links (40 Gbit each) to a non-blocking network
- RedHat Linux
- 14 links of 10 Gbit Ethernet to connect to BSC GPFS storage
Green 500 list, November 2012: #36 with 1,266 MFLOPS/Watt, 81.5 kW total power
SLIDE 8
HPC TRENDS
SLIDE 9
SLIDE 10
SLIDE 11
Projected Performance Development
SLIDE 12 Road to Exascale
- Prof. Jack Dongarra, ScalA12, Salt Lake City, USA
SLIDE 13
MONTE CARLO METHODS FOR LINEAR ALGEBRA
SLIDE 14
History – Serial Monte Carlo methods for matrix computation
- Curtis – 1956
- Halton – 1970s
- Michailov – 1980s
- Ermakov – 1980s
- Alexandrov, Fathi – 1996-2000, hybrid methods
- Sabelfeld – 1990s and 2009
SLIDE 15
History – Parallel Monte Carlo methods for matrix computation
- Alexandrov, Megson – 1994, O(nNT)
- Alexandrov, Fathi – 1998-2000
- Alexandrov, Dimov – 1998-2002, resolvent MC
- Alexandrov, Weihrauch, Branford – 2000-2008
- Karaivanova – 2000s, QMC
- Alexandrov, Dimov, Weihrauch, Branford – 2005-2009
- Alexandrov, Strassburg – 2010-2013, hybrid methods
SLIDE 16
Motivation
Many scientific and engineering problems revolve around:
– inverting a real n×n matrix (MI)
– solving a system of linear algebraic equations (SLAE)
Given A and b:
– solve Ax = b for x
– or find A⁻¹ and calculate x = A⁻¹b
SLIDE 17
Motivation cont.
Traditional methods for dense matrices:
– Gaussian elimination
– Gauss-Jordan
– Both take O(n³) steps
Time prohibitive if:
– the problem size is large
– a real-time solution is required
SLIDE 18
Idea of the Monte Carlo method
We wish to estimate a quantity α. Define a random variable ξ whose mathematical expectation is α. Take N independent realisations ξi of ξ:
– Then the average θN of the realisations estimates α
– And θN converges to α as N grows
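In LaTeX form, the standard estimator and its error behaviour (generic Monte Carlo facts, stated here to make the convergence claim precise):

```latex
\theta_N = \frac{1}{N}\sum_{i=1}^{N}\xi_i , \qquad
\mathbb{E}[\theta_N] = \alpha , \qquad
\operatorname{Var}(\theta_N) = \frac{\operatorname{Var}(\xi)}{N}
\;\Longrightarrow\;
|\theta_N - \alpha| = O\!\left(N^{-1/2}\right) \text{ in probability.}
```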
SLIDE 19
Reason for using Monte Carlo
O(NT) steps to find one element of:
– the matrix inverse A⁻¹
– or the solution vector x
where:
– N is the number of Markov chains
– T is the length of the Markov chains
This cost is independent of n, the size of the matrix/problem. The algorithms can be efficiently parallelised.
SLIDE 20 MC Important Properties
- Efficient distribution of the compute data
- Minimal communication / communication-reducing algorithms
- Increased precision is achieved by adding extra computations (without restart)
- Fault tolerance is achieved through adding extra computations
SLIDE 21
Parallel Algorithm
Start with a diagonally dominant matrix B̂. Make the split B̂ = D − B₁:
– D holds the diagonal elements of B̂
– B₁ holds the off-diagonal elements only
Compute A = D⁻¹B₁. If we had started with a matrix B that was not diagonally dominant, an additional splitting B = B̂ − (B̂ − B) would have been made at the beginning, and a recovery step would be needed at the end of the algorithm to obtain B⁻¹ from B̂⁻¹.
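A minimal numpy sketch of this splitting (a toy illustration; the function name is not from the deck):

```python
import numpy as np

def split_matrix(B_hat):
    """Split a diagonally dominant B_hat as B_hat = D - B1 and form A = D^{-1} B1.

    Diagonal dominance keeps the row sums of |A| below 1, which is what
    the Monte Carlo algorithm needs for its Neumann series to converge.
    """
    d = np.diag(B_hat)          # diagonal entries (nonzero by dominance)
    D = np.diag(d)
    B1 = D - B_hat              # off-diagonal part only, so B_hat = D - B1
    A = B1 / d[:, None]         # A = D^{-1} B1 (scale row i by 1/d_i)
    return D, B1, A
```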
SLIDE 22
Parallel Algorithm cont.
Finding B̂⁻¹
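A sketch of the standard argument behind this step, assuming the splitting above and ‖A‖ < 1 (which diagonal dominance guarantees):

```latex
\hat{B} = D - B_1 = D\,(I - D^{-1}B_1) = D\,(I - A)
\quad\Longrightarrow\quad
\hat{B}^{-1} = (I - A)^{-1} D^{-1}
            = \Bigl(\sum_{k=0}^{\infty} A^{k}\Bigr) D^{-1}.
```

The Markov chains estimate the entries of the truncated sum over k = 0, …, T.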
SLIDE 23
Parallel Algorithm cont.
Matrix Inversion using MCMC
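A serial toy sketch of this step in Python (the function name and parameter defaults are illustrative; the deck's parallel algorithm distributes the rows and chains across processes):

```python
import numpy as np

def mc_inverse(A, N=1000, T=30, rng=None):
    """Estimate C = (I - A)^{-1} = sum_k A^k entrywise with Markov chains.

    N chains of length T start from each row index i; transitions use the
    'almost optimal' probabilities p_kl proportional to |A_kl|.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    P = np.abs(A)
    P /= P.sum(axis=1, keepdims=True)   # rows of |A| assumed nonzero here
    C = np.zeros((n, n))
    for i in range(n):                  # chains starting at i estimate row i of C
        for _ in range(N):
            k, w = i, 1.0
            C[i, k] += w                # step-0 term: A^0 = I
            for _ in range(T):
                nxt = rng.choice(n, p=P[k])   # next Markov chain state
                w *= A[k, nxt] / P[k, nxt]    # multiply the chain weight
                k = nxt
                C[i, k] += w
    return C / N

# With the splitting above: B_hat^{-1} is approximately mc_inverse(A) @ inv(D),
# where inv(D) is trivial since D is diagonal.
```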
SLIDE 24
Parallel Algorithm cont.
Matrix setup:
- Use parallel Monte Carlo to find a rough inverse of A
- Use parallel iterative refinement to improve accuracy and retrieve the final inverse
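The deck does not spell out the refinement scheme on this slide; a common serial stand-in (an assumption here, not necessarily the deck's filter) is the Newton-Schulz iteration:

```python
import numpy as np

def refine_inverse(B, M, iters=5):
    """Iteratively improve a rough inverse M of B.

    Newton-Schulz: M <- M (2I - B M); converges quadratically once
    ||I - B M|| < 1, which the Monte Carlo rough inverse is intended
    to provide.
    """
    I = np.eye(B.shape[0])
    for _ in range(iters):
        M = M @ (2.0 * I - B @ M)
    return M
```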
SLIDE 25
Parallel Algorithm cont.
SLIDE 26
Parallel Algorithm cont.
- Almost linear speedup for large enough problem sizes
- Minimal inter-process communication necessary
- Work and communication are balanced
- Smaller problem sizes suffer due to the computation/communication ratio
SLIDE 27
HPC SIMULATION
SLIDE 28
Motivation
- Predict behaviour on different systems
- Find bottlenecks, sweet spots, scaling problems
- Easier than running on several machines
- Reproducible
SLIDE 29 xSim – The Extreme-Scale Simulator
Developed at the Oak Ridge National Laboratory (ORNL), USA
Real-life application testing and development of algorithmic resilience options (Extreme Computing Group at BSC and the Computer Science and Mathematics Division at ORNL)
Highly scalable solution that trades off accuracy for node-oversubscription simulation
Execution of real applications, algorithms or their models atop a simulated HPC environment for:
– Performance evaluation, including identification of resource contention and underutilization issues
– Investigation at extreme scale, beyond the capabilities of existing simulation efforts
SLIDE 30
xSim – The Extreme-Scale Simulator
The simulator is a library. Virtual processors expose a virtual MPI to applications using Parallel Discrete Event Simulation (PDES). Easy to use:
– Replace the MPI header – Compile and link with the simulator library – Run the MPI program
Support for C and Fortran MPI applications
SLIDE 31
Scaling Behaviour
Core Scaling on a 240 core system
SLIDE 32
Scaling Behaviour cont.
Core Scaling, logarithmic
SLIDE 33
Scaling Behaviour cont.
Problem scaling
SLIDE 34
MONTE CARLO AND SPAI
SLIDE 35
Combination of Monte Carlo and SPAI
SPAI – SParse Approximate Inverse Preconditioner
– Computes a sparse approximate inverse M of a given matrix A by minimising ∥AM − I∥ in the Frobenius norm (a toy sketch follows at the end of this slide)
– Explicitly computed, and can be used as a preconditioner for an iterative method
– Uses the BiCGSTAB algorithm to solve systems of linear algebraic equations Ax = b
Sparse Monte Carlo Matrix Inverse Algorithm
– Computes an approximate inverse by using Monte Carlo methods
– Uses an iterative filter refinement to retrieve the inverse
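Because ∥AM − I∥_F² = Σ_k ∥A m_k − e_k∥₂², the columns of M decouple into independent least-squares problems, which is what makes the SPAI construction parallel. A dense toy sketch (real SPAI solves only the small submatrix selected by the sparsity pattern and may adapt the pattern; the function name here is illustrative):

```python
import numpy as np

def spai_dense(A, patterns):
    """Column-wise sparse approximate inverse of A.

    patterns[k] lists the row indices allowed to be nonzero in column k
    of M (e.g. the pattern of A or A^T); each column is an independent
    least-squares problem: min ||A[:, J] m_J - e_k||_2.
    """
    n = A.shape[0]
    M = np.zeros((n, n))
    for k in range(n):
        J = patterns[k]
        e_k = np.zeros(n)
        e_k[k] = 1.0
        m_J, *_ = np.linalg.lstsq(A[:, J], e_k, rcond=None)
        M[J, k] = m_J
    return M

# Example pattern choice: allow the nonzero pattern of A itself
# patterns = [np.nonzero(A[:, k])[0] for k in range(A.shape[0])]
```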
SLIDE 36
Combination of Monte Carlo and SPAI cont.
Idea: use a Monte Carlo-computed approximate matrix inverse as a preconditioner
– Considering:
- the computation time for the MC solution
- the suitability of the rough inverse
- matrix types that are not invertible via Frobenius-norm minimisation
SLIDE 37
Experiments
Preliminary results
– optimisation of parameters necessary
Randomly generated sparse and dense matrices
Standard SPAI settings
– RHS taken from input matrix
Monte Carlo method without filter option
– Very rough inverse that already provides a good alternative
SLIDE 38
Experiments
Computation time in seconds for both preconditioners
[Two charts: computation time (1–8 s) vs. matrix size (50–5000) for the Frobenius (SPAI) and Monte Carlo preconditioners; panels for sparsity 0.5 and sparsity 0.7]
SLIDE 39
Experiments cont.
Residual from BICGSTAB solution in SPAI
[Two charts: BICGSTAB residual (0 to 1.2E-02) vs. matrix size (50–5000) for the Frobenius and Monte Carlo preconditioners; panels for sparsity 0.5 and sparsity 0.7]
SLIDE 41
MC & SPAI residuals
SLIDE 42
Monte Carlo, SPAI & MSPAI
SLIDE 44
MONTE CARLO AND MSPAI
SLIDE 45
Monte Carlo & MSPAI
SLIDE 46
Comparison Monte Carlo - MSPAI
University of Florida Sparse Matrix Collection
Symmetric matrix from the PARSEC set: n = 268,000, nnz = 18.5 million
SLIDE 47
Residuals for Parsec data set
SLIDE 48
Runtime for Si10H16
SLIDE 49
Thank you!