


slide-1
SLIDE 1

Xabclib and OpenATLib Ver.1.0: A Fully Auto-tuned Sparse Iterative Library and Its Auto-tuning Interfaces

Takahiro Katagiri The University of Tokyo


Collaborators: Takao Sakurai, Ken Naono, Mitsuyoshi Igai (HITACHI Ltd.) Satoshi Ohshima, Kengo Nakajima, Shoji Itoh (U.Tokyo) Hisayasu Kuroda (Ehime U./U. Tokyo)

International Workshop on Peta-Scale Computing Programming Environment, Languages and Tools (WPSE 2012) Date: 29th (Wed) February, 2012, 10:45-11:15 Place: Seminar Room (6F), RIKEN Advanced Institute for Computational Science (AICS), Kobe, Hyogo, Japan

slide-2
SLIDE 2

Outline

• Background
• Xabclib & OpenATLib
• Overview of New Functions of V.1.0
  • Automatic Selection of Numerical Algorithms and Preconditioners
• Performance Evaluation
  • The T2K Open Supercomputer (Quad-core AMD Opteron)
  • Comparison with Lis and PETSc (Preliminary Results)

slide-3
SLIDE 3

“Seamless and Highly-Productive Parallel Programming Environment for High-Performance Computing”, High Performance Library, MEXT, Japan (from FY2008 to FY2011)

• Problem
  • Tuning relies heavily on artisan techniques.
  • It is non-productive and non-portable.
  • It is time-consuming and not cost-effective.
  • Convergence sometimes fails because parameters are out of range.

• Goal
  • To provide a highly productive and performance-portable numerical library.

• With respect to sparse matrix non-zero structures, we supply an on-the-fly AT facility for:
  1. Computation Kernel Selection
  2. Numerical Algorithm Selection
  3. Parallel Implementation Selection
  4. General AT API (OpenATLib)

[Figure: Before and After.
Before: repeated cycles of Dedicated Tuning, Execution, and Verification. Time-consuming! Sometimes converged! Sometimes slower! Not converged!
After: Generalization and Automation with the AT facility through a Common AT Interface, with on-the-fly parameter settings and algorithm and implementation selections. High performance with a low-cost, seamless connection to the next-generation supercomputer.]

slide-4
SLIDE 4

Xabclib and OpenATLib

• Xabclib
  • A Numerical Library with OpenATLib
  • Supplied Solvers:
    1. Linear Equation Solvers: GMRES(m), Itoh’s BiCGStab
    2. Eigensolvers: Restart Lanczos, Explicit Restart Arnoldi

• OpenATLib
  • A General API (Application Programming Interface) for Auto-Tuning (AT)
    1. Restart Frequency Adjustment
    2. Automatic Selection of Sparse Matrix-Vector Multiplication (SpMV) Implementations
       • Balanced Loads; Segmented Scan for Scalar Machines (BSS); Symmetric
    3. Numerical Policy Function
       • Execution Speed; Computational Accuracy; Memory Space
    4. Automatic Selection of Numerical Algorithms and Preconditioners

• Both are threaded versions (with OpenMP).

slide-5
SLIDE 5

Xabclib: A SPARSE ITERATIVE SOLVER WITH AUTO-TUNING FACILITY

(OpenMP Parallelization Version)

slide-6
SLIDE 6

OpenATLib Supplied Functions

No. | Function Name | Description
1 | OpenATI_INIT | Set default parameters for OpenATLib and Xabclib.
2 | OpenATI_DAFRT | Judge increment for restart frequency on Krylov subspace.
3 | OpenATI_DSRMV | Select the best implementation for double precision symmetric SpMV with CRS format.
4 | OpenATI_DURMV | Select the best implementation for double precision non-symmetric SpMV with CRS format.
5 | OpenATI_DSRMV_Setup | Setup function for OpenATI_DSRMV.
6 | OpenATI_DURMV_Setup | Setup function for OpenATI_DURMV.
7 | OpenATI_DAFGS | Gram-Schmidt orthogonalization functions for 4 implementations.
8 | OpenATI_DAFSTG | Detecting stagnation for history of residual norms.
9 | OpenATI_LINEARSOLVE | A Meta-interface of Linear Solver with numerical policy interface.
10 | OpenATI_EIGENSOLVE | A Meta-interface of Eigen Solver with numerical policy interface.
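The SpMV functions in the list operate on matrices in CRS (compressed row storage) format. As a reminder of what that kernel computes, here is a minimal Python sketch of a plain, untuned CRS product y = A x. The function name `spmv_crs` and the array names `val`, `col_ind`, and `row_ptr` are illustrative assumptions, not OpenATLib identifiers, and the sketch uses 0-based indices.

```python
def spmv_crs(val, col_ind, row_ptr, x):
    """Compute y = A x for a matrix A stored in CRS format.

    val     : nonzero values, stored row by row
    col_ind : column index of each nonzero (0-based)
    row_ptr : row i owns the nonzeros val[row_ptr[i]:row_ptr[i + 1]]
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += val[k] * x[col_ind[k]]
        y[i] = acc
    return y


# Small illustrative matrix (not from the slides):
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
val = [4.0, 1.0, 3.0, 2.0, 5.0]
col_ind = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]
y = spmv_crs(val, col_ind, row_ptr, x)
```

The row loop is the natural unit for a block row decomposition across threads; the other implementations named in the deck distribute the same arithmetic differently.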

slide-7
SLIDE 7

The Inner Data Structure

OpenAT & Xabclib thread-safe parameter list

Index | IATParam(50) default | IATParam(50) description | Type | RATParam(50) default | RATParam(50) description | Type
1 | | mandatory | M | | mandatory | M
2 | | mandatory | M | | mandatory | M
([3:20] OpenATI's Information)
3 | (*1) | # of threads (SMP's) | I | | (reserved) | R
4 | 1 | Flag of Krylov subspace expand by MM-ratio | I | 100.0 | threshold of MM-ratio | I
5 | 1 | OpenATI_DSRMV auto-tuned On/Off (0: AT-off, 1: AT-on) | I | | (reserved) | R
6 | 12 | Fastest OpenATI_DSRMV impl. method (11: block row decomp., 12: nonzero decomp., 13: parallel vector reduction) | I/O | | (reserved) | R
7 | 1 | OpenATI_DURMV auto-tuned On/Off (0: AT-off, 1: AT-on) | I | | (reserved) | R
8 | 12 | Fastest OpenATI_DURMV impl. method (11: block row decomp., 12: nonzero decomp., 13: BSS, 21: original SS) | I/O | | (reserved) | R
9 | 128 | Columns of Segmented Scan's algorithms | I | | (reserved) | R
10 | 2 | type of Gram-Schmidt procedure (0: CGS, 1: DGKS, 2: MGS, 3: Blocked CGS) | I | | (reserved) | R
11 | - | DGKS refinement done or not (done: 1, not: 0) | O | | (reserved) | R
12-20 | | (reserved) | R | | (reserved) | R
([21:40] Xabclib's Information)
21 | - | # of OMP_NUM_THREADS | O | - | (reserved) | R
22 | -1 (init) | Max. iterations (if the solver recognizes '-1' then set 'N') | I/O | -1 (∞) | Max. elapsed time (limit time) | I
23 | | # of iterations | O | 1.0E-8 | Convergence criterion | I
24 | 1 | <L> preconditioner operations flag (1: not generated yet, 2: already generated) | I | | (reserved) | R
25 | 4 | <L> preconditioner type (1: None, 2: Jacobi, 3: SOR, 4: ILU(0)) | I | (*2) | <L> preconditioner parameter; SOR (type=3): relaxation omega (1 <= omega < 2) | I
26 | | (reserved) | R | | (reserved) | R
27 | 20 | input size of Krylov subspace (in GMRES / Arnoldi) | I | | (reserved) | R
28 | 2 | start size of Krylov subspace at subspace expand AT-on (in GMRES / Arnoldi). See IATPARAM(4) | O | - | <L> 2-norm of RHS | O
29 | - | final size of Krylov subspace (in GMRES / Arnoldi) | O | - | 2-norm of max. residual | O
30 | | (reserved) | R | - | floating operations (×10^9 operations) | O
31 | | (reserved) | R | - | <L> preconditioner time | O
32 | | (reserved) | R | - | total solve time (elapsed) | O
33-49 | - | (reserved) | R | | (reserved) | R
50 | | debug info (1: on, else: off) | I | | (reserved) | R

(*1): OMP_NUM_THREADS
(*2): ILU(0) (type=4): breakdown threshold (default 1.0E-8)
<L>: Linear system, <E>: Eigen system

Similar to: A Direct Sparse Solver by IBM, WSMP: Watson Sparse Matrix Package

http://www-users.cs.umn.edu/~agupta/wsmp.html
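To make the indexing of the parameter list concrete, here is a small Python model of the two 50-entry lists, filled with the defaults that the parameter list states. This is only a sketch: the real library is not a Python API, the helper name `default_params` is hypothetical, and entries that are mandatory, reserved, output-only, or environment-dependent are left as `None`.

```python
def default_params():
    """Return (iat, rat) dictionaries keyed by the 1-based indices 1..50.

    Hypothetical helper: it only mirrors the defaults listed on the slide.
    """
    iat = {i: None for i in range(1, 51)}
    rat = {i: None for i in range(1, 51)}
    iat.update({
        4: 1,     # flag: expand Krylov subspace by MM-ratio
        5: 1,     # OpenATI_DSRMV auto-tuning on
        6: 12,    # fastest OpenATI_DSRMV method: nonzero decomposition
        7: 1,     # OpenATI_DURMV auto-tuning on
        8: 12,    # fastest OpenATI_DURMV method: nonzero decomposition
        9: 128,   # columns of segmented scan's algorithms
        10: 2,    # Gram-Schmidt procedure: MGS
        22: -1,   # max. iterations (-1: solver sets N)
        24: 1,    # preconditioner not generated yet
        25: 4,    # preconditioner type: ILU(0)
        27: 20,   # input size of Krylov subspace
        28: 2,    # start size of Krylov subspace
    })
    rat.update({
        4: 100.0,    # threshold of MM-ratio
        22: -1.0,    # max. elapsed time (-1: no limit)
        23: 1.0e-8,  # convergence criterion
    })
    return iat, rat


iat, rat = default_params()
iat[25] = 2         # switch the preconditioner type to Jacobi
rat[23] = 1.0e-10   # tighten the convergence criterion
```

Keeping every setting in two caller-owned lists, rather than in global state, is what lets each thread carry its own configuration.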

slide-8
SLIDE 8

The Run-time AT Strategy on OpenATLib2011

[Flowchart, as far as it can be recovered:]

Call the Solver
  • 1. Measure all candidates, and select the best candidate.
  • Execute the sparse iterative method by using the best candidate of step 1.
  • 2. Adjust numerical algorithm parameters.
  • Convergence test: Not Converged repeats the inner iteration (hundreds of times); Converged moves on to step 3.
  • 3. Test of computation accuracy based on the numerical policy: Pass leads to End of Computation; Failure repeats the outer iteration (several times).

Steps 1 to 3 are the auto-tuning facilities. The numerical computation policy comes from end users: Execution Time, Memory Amount, Computation Accuracy.
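The control flow of this strategy can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the OpenATLib implementation: damped Jacobi stands in for GMRES or BiCGStab, the two dense `matvec_*` candidates stand in for SpMV implementations, and the tunable parameter `omega` and its adjustment rule are hypothetical.

```python
import time


def select_fastest(candidates, *args, repeats=3):
    """Step 1: time every candidate and return the fastest one."""
    best, best_time = None, float("inf")
    for func in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            func(*args)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = func, elapsed
    return best


def matvec_rows(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def matvec_loops(a, x):
    y = [0.0] * len(a)
    for i, row in enumerate(a):
        for j, aij in enumerate(row):
            y[i] += aij * x[j]
    return y


def residual_norm(matvec, a, b, x):
    return max(abs(bi - yi) for bi, yi in zip(b, matvec(a, x)))


def solve_with_at(a, b, tol=1.0e-8, max_inner=200, max_outer=5):
    matvec = select_fastest([matvec_rows, matvec_loops], a, b)  # step 1
    omega = 0.5  # hypothetical algorithm parameter
    x = [0.0] * len(b)
    for _ in range(max_outer):        # outer iteration: several times
        for _ in range(max_inner):    # inner iteration: many times
            r = [bi - yi for bi, yi in zip(b, matvec(a, x))]
            if max(abs(ri) for ri in r) < tol:  # convergence test
                break
            # Damped Jacobi update using the selected matvec.
            x = [xi + omega * ri / a[i][i]
                 for i, (xi, ri) in enumerate(zip(x, r))]
        if residual_norm(matvec, a, b, x) < tol:  # step 3: accuracy test
            return x
        omega = min(1.0, omega + 0.25)  # step 2: adjust the parameter
    return x


a = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = solve_with_at(a, b)
```

Whichever candidate wins the timing, the computed solution is the same; only the run time differs, which is the point of selecting implementations at run time.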

slide-9
SLIDE 9

PERFORMANCE EVALUATION


slide-10
SLIDE 10

COMPARISON TO OTHER LIBRARIES


slide-11
SLIDE 11

Linear Algebra Libraries

  • Xabclib
    – 2011/12/29 Version 1.0 ①

  • Lis (a Library of Iterative Solvers for linear systems)
    – 2005/09/20 Version 1.0.0
    – 2011/11/24 Version 1.2.62 ②

  • PETSc (Portable, Extensible Toolkit for Scientific Computation)
    – 1995/06/21 Version 2.0β4
    – 2011/09/08 Version 3.2 ③

slide-12
SLIDE 12

Evaluation Condition

Item | Xabclib | Lis | PETSc
Policy | TIME | - | -
Solvers | GMRES, BiCGStab (Itoh’s) | GMRES, BiCGStab | GMRES, BiCGStab
Restarts on GMRES | Auto | 40 (Default) | 30 (Default)
Preconditioning | None, ILU0 | None (Default), ILU0 | None, ILU0 (Default)
Tolerance of residual | 1.0D-8 | 1.0D-8 | 1.0D-8
#CPU | 1, 16 | 1, 16 | 1
Time Limit | 600 Sec. | 600 Sec. (Force Exit) | 300 Sec. (Force Exit)

The GMRES restart frequency is set to the default value of each library. The multi-threaded version of PETSc is not supported at present, so we evaluate PETSc with one thread only.

slide-13
SLIDE 13

GMRES, 1 Thread, No preconditioning

[Figure: bar chart of relative speedups to Xabclib (axis from 0.5 to 2.5, direction marked "Faster") for Xabclib, Lis, and PETSc. Matrices: chem_master1, chipcool0, dc2, epb1, epb2, epb3, ex19, language, memplus, poisson3Da, poisson3Db, sme3Da, torso2, torso3, trans4, viscoplastic2, wang3, wang4, xenon1, xenon2, and GeoMean.]

The average speedup for Lis is 0.89x. The average speedup for PETSc is 0.87x.

Library | #successes | #fails
Xabclib | 20 | 11
Lis | 12 | 19
PETSc | 13 | 18

The convergence success ratio for Xabclib is 65%.
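The success ratio quoted on this slide follows directly from the success and failure counts (20 of 31 matrices for Xabclib). A short Python sketch of that arithmetic, together with the geometric mean that the chart's GeoMean entry refers to, is given here. The function names are illustrative, and the two values passed to `geometric_mean` are made up for the example, not read from the chart.

```python
import math

# (successes, fails) per library, as reported on the slide.
results = {"Xabclib": (20, 11), "Lis": (12, 19), "PETSc": (13, 18)}


def success_ratio(successes, fails):
    """Fraction of test matrices on which the solver converged."""
    return successes / (successes + fails)


def geometric_mean(values):
    """Geometric mean of positive values, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = {name: success_ratio(*counts) for name, counts in results.items()}
percent = {name: round(100 * r) for name, r in ratios.items()}

# Illustrative input only: a 2x slowdown and a 2x speedup cancel out.
example_mean = geometric_mean([0.5, 2.0])
```

A geometric mean is the usual choice for averaging speedup ratios because a 2x speedup and a 2x slowdown then average to 1.0 instead of 1.25.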

slide-14
SLIDE 14

GMRES, 16 Threads, No preconditioning

[Figure: bar chart of relative speedups to Xabclib (axis from 0.5 to 2.5, direction marked "Faster") for Xabclib and Lis. Matrices: chem_master1, chipcool0, dc2, epb1, epb2, epb3, hcircuit, language, memplus, poisson3Da, poisson3Db, torso2, torso3, trans4, trans5, viscoplastic2, wang3, wang4, xenon1, xenon2, and GeoMean.]

The average speedup for Lis is 1.31x.

Library | #successes | #fails
Xabclib | 20 | 11
Lis | 16 | 15

slide-15
SLIDE 15

BiCGStab, 1 Thread, ILU0 Preconditioning

[Figure: bar chart of relative speedups to Xabclib (axis from 0.5 to 2.5, direction marked "Faster") for Xabclib, Lis, and PETSc. Matrices: Baumann, airfoil_2d, chem_master1, chipcool0, dc2, ecl32, epb1, epb2, epb3, hcircuit, language, memplus, nmos3, poisson3Da, poisson3Db, sme3Da, sme3Db, sme3Dc, torso1, torso2, torso3, trans4, trans5, viscoplastic2, wang3, wang4, xenon1, xenon2.]

The average speedup for Lis is 0.65x. The average speedup for PETSc is 1.09x.

Library | #successes | #fails
Xabclib | 28 | 3
Lis | 22 | 9
PETSc | 21 | 10

The convergence success ratio for Xabclib is 90%.

slide-16
SLIDE 16

BiCGStab, 16Threads, ILU0 Preconditioning

[Figure: bar chart of relative speedups to Xabclib (axis from 0.5 to 3.5, direction marked "Faster") for Xabclib and Lis. Matrices: Baumann, airfoil_2d, chem_master1, chipcool0, dc2, ecl32, epb1, epb2, epb3, hcircuit, language, memplus, nmos3, poisson3Da, poisson3Db, sme3Da, sme3Db, sme3Dc, torso1, torso2, torso3, trans4, trans5, viscoplastic2, wang3, wang4, xenon1, xenon2, and GeoMean.]

The average speedup for Lis is 2.75x.

Library | #successes | #fails
Xabclib | 28 | 3
Lis | 19 | 12

The convergence success ratio for Xabclib is 90%.

slide-17
SLIDE 17

Conclusion

• OpenATLib: General APIs for Auto-tuning
  • 1. Algorithm Selection on Solver Level
    • OpenATI_EIGENSOLVE
    • OpenATI_LINEARSOLVE
  • 2. Numerical Computation Policy
  • 3. Automatic Selection of Numerical Algorithms and Preconditioners
    • An Algorithm of Stagnation Detection at Run-time

• Xabclib: Sparse Iterative Solver with OpenATLib
  • 1. New Implementation of Numerical Algorithms
    • Itoh’s Preconditioned BiCGStab
    • Explicit Restart Arnoldi with Real Vector Operation for Complex Vectors