Adrian Tate
Technical Lead of Scientific Libraries
Senior Software Engineer, Cray Inc.
iWAPT, Tokyo, Oct 2009
Cray hardware timeline

Vector era: 1976 Cray-1, 1982 Cray-XMP, 1985 Cray-2, 1988 Cray-YMP, 1991 Cray-C90, 1993 Cray-T90
- From a single vector pipe, no data cache, one to a few processors; later multiple pipes, a small data cache, several processors
- Library landscape: BLAS1, LINPACK, BLAS2, BLAS3, LAPACK

MPP era: 1994 Cray-T3D, 1995 Cray-T3E, 2001 Cray-SV1, 2003 Cray-X1, 2005 Cray-XT3, 2008 Cray-XT5
- Massively parallel, data caches, distributed memory; today massively parallel, vector, scalar, x86/CISC, GPU, FPGA, multi-core
- Library landscape: FFTW, ScaLAPACK, ATLAS, PETSc, Trilinos
Our goals (clearly, not to make the problem worse):
- Improve the performance of PETSc and Trilinos on Cray MPPs by tuning sparse matrix-vector multiply in a general fashion
- Tune the HPL benchmark for the largest machines (massive runtime): an O(N^3) factorization driven by multiple parameters
- Tune dense linear algebra (mainly BLAS3)
- Tune eigensolvers in a general-purpose way
- Apply the above only to Cray hardware, which allows the search space to be manipulated to our advantage
It is clear that hand-tuning alone cannot achieve this.
Can we construct a generalized AT framework to do all the above?
HPL (High Performance Linpack):
- O(N^3) factorization and solve
- Parameter tuning is now paramount: 13 parameters (+7 more in the Cray version), some with very large dimensionality
- The search space is very large indeed (more later)
- HPL has become a massive problem due to excessive runtime
- (Tuning: offline)
Sparse linear algebra (mainly sparse matrix-vector product):
- Irregular memory access (for CSR)
- Memory-bandwidth-bound kernel
- Wildly dependent on matrix characteristics
- Has never had a general-purpose tuned code for this reason
- (Tuning: offline + runtime)
Dense linear algebra:
- Mostly serial O(N^3) BLAS3 optimizations
- Loop transformations
- Multiple algorithmic effects
- (Tuning: offline + runtime)
The search space is made "manageable" because of:
- restriction to one processor type
- knowledge of the target problem sizes and characteristics
The search space is attainable because of effectively infinite resources, with freedom only to make incremental changes (e.g., no new data structures).
Hence, to make an auto-tuner that works in the real world, we need:
- An enormous offline testing infrastructure (we have unlimited resources available for the offline testing!)
- A performance model as the output of offline autotuning (we can assume the same architecture for each distribution!)
- Adaptive libraries that take the performance model as input
The above define our "industrial" autotuning model; CrayATF is the framework built on this model.
CrayATF components:
- Input Module: provides a generic XML input interface; takes matrix characteristics, problem sizes, and search limits as input
- Code Generator: parses template files (multiple algorithm templates); translates directives into code transforms; deduces the number of transformations; produces specific kernel variants
- Search Engine: deduces concurrency in the search; creates new search tables; checks search completion; creates the performance model
- Execution Engine: constructs the batch interface; takes information from the Search Engine; spawns threads; creates input files; executes codes in parallel; spins on completion
The modules (Code Generator, Execution Engine, Compile Engine, Search Engine, Input Engine) are, most importantly, (a) extensible and (b) replaceable. (Implementation languages: C/Fortran, Ruby, modified and custom C.)
Tuning flow:
- Input (XML): initial sets of parameters (random or user-specified); parameter specifications (range, step size, dependency, priority)
- Search Module (Ruby): executes a single iteration of the search algorithm; generates the sets of parameters for the next execution phase; asks "need more tuning?" and, if not, is DONE
- Code Generator / Build Module (Ruby): generates kernels and builds the program
- Batch Module (Ruby): executes all program runs and collects the performance numbers
Batch Module flow:
- Input XML files give the machine specifications (directories, PBS options, max_cores, walltime)
- The Search Module maintains a search table; each row is a unique list of parameter values to be executed
- The Batch Module launches multiple Ruby threads in parallel; for each row in the search table, a thread:
  - creates a unique input file and a PBS script
  - launches the job and waits for it to end
  - parses the output file and appends the execution data to the search table
- Threads meet at a barrier; if more tuning is needed, the cycle repeats; otherwise DONE
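The flow above can be sketched end-to-end as a loop. Everything below is an illustrative toy, not the actual CrayATF code: the "batch" step evaluates a made-up cost function in place of launching PBS jobs.

```ruby
# Toy sketch of the ATF tuning cycle: the search module hands out
# parameter sets, the batch step "runs" them, and results are appended
# to the search table until no candidates remain.
# All names and the cost function are invented for illustration.
class ToySearch
  def initialize(candidates)
    @candidates = candidates   # parameter sets still to be tried
    @table = []                # search table: one row per executed variant
  end

  def next_parameter_sets
    @candidates.shift(2)       # hand out two variants per search phase
  end

  def record(results)
    @table.concat(results)     # append execution data to the search table
  end

  def best_result
    @table.min_by { |r| r[:time] }
  end
end

# Stand-in for the Batch Module: pretend the runtime is (nb - 160)**2.
def run_all(parameter_sets)
  parameter_sets.map { |p| { nb: p[:nb], time: (p[:nb] - 160)**2 } }
end

def tune(search)
  loop do
    sets = search.next_parameter_sets   # search step
    break if sets.empty?                # no candidates left: tuning done
    search.record(run_all(sets))        # "execute" and record results
  end
  search.best_result
end

search = ToySearch.new([{ nb: 64 }, { nb: 128 }, { nb: 160 }, { nb: 256 }])
tune(search)   # => { nb: 160, time: 0 }
```

A real search module would generate the next phase's parameter sets from the recorded results rather than from a fixed list; the loop structure is the same.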
Ruby is the language used for almost all ATF development:
- Scripting ability, e.g. one-line text replacement across a whole file:

    subs.keys.each { |x| filestring.gsub!(x, subs[x]) }
- System-programming ease, e.g. on Cray XT systems, find all the jobs I have in the queue and delete them:

    out = `qstat -u #{`whoami`.strip}`.lines.to_a
    5.upto(out.length - 1) { |line| system("qdel #{out[line].split('.')[0].to_i}") }
- Extremely simple and lightweight threading: a thread pool implemented in 40 lines of code, including routines to initialize the pool, launch threads, destroy threads, and handle exceptions
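A minimal pool in that spirit, using Ruby's built-in Thread and Queue (a sketch under those assumptions, not the actual ATF implementation):

```ruby
# Fixed-size thread pool: workers block on a shared queue and run
# whatever jobs are pushed in; a nil sentinel shuts each worker down.
class ThreadPool
  def initialize(size)
    @queue = Queue.new
    @threads = Array.new(size) do
      Thread.new do
        while (job = @queue.pop)          # blocks until a job (or nil) arrives
          begin
            job.call
          rescue => e                      # per-job exception handling
            warn "job failed: #{e.message}"
          end
        end
      end
    end
  end

  def schedule(&block)
    @queue << block                        # launch: enqueue work for a worker
  end

  def shutdown
    @threads.size.times { @queue << nil }  # destroy: one sentinel per worker
    @threads.each(&:join)
  end
end

pool = ThreadPool.new(4)
results = Queue.new
10.times { |i| pool.schedule { results << i * i } }
pool.shutdown                              # waits for all queued jobs to finish
```

In ATF each scheduled job would create an input file and PBS script and spin on job completion, as described above.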
- Super-soft typing: for non-numerical work we do not want to be concerned with datatype conversion, accuracy, or performance (!)
- Allows functional code to be developed very quickly
- Integrates with XML for extremely powerful configuration/input methods
- High Performance Linpack (HPL) benchmark, used for Top500 rankings
- Traditional tuning approach for HPL:
  1. Choose N to fill local memory (reduces the relative communication cost)
  2. Heavily tune serial dgemm (parallel dgemm dominates)
  3. Find a good-enough parameter combination (trial and error)
- This has been successful in the past, but:
  - #1 is hard to do when the machine grows so large
  - #3 has never been taken very seriously in practice, although it does have good auto-tuning treatment (Hollingsworth et al.)
JaguarPF: 200 cabinets of XT5 (HE); 18,772 nodes of 8 cores each; 37,544 sockets of AMD Barcelona; 300 TB of main memory.
- Traditional method: matrix dimension of N = 6,122,903
- This equates to an HPL runtime of 39 hours
- MTTI of a brand-new system: a few hours
- Probability of completing a 39-hour job: 0.00%
- JaguarPF was given to the Cray ATF team
17 HPL parameters + Cray's additional parameters + programming-model options
An example of the sensitivity of a single parameter (the two runs below differ only in bcast):

NB bcast pmap pfact nbmin ndiv rfact depth swap thresh transL transU EQUIL align P N time %peak
160 1 1 1 3 2 2 100 1 1 2 140 3429286 483.38 51.70%
160 3 1 1 3 2 2 100 1 1 2 140 3429286 313.47 79.71%

The exhaustive search space is measured in tens of years of runtime. Typically, studies reduce the search by reducing scale; however, the early progress of ATF-HPL showed that parameter information from small scale does not translate to full scale.
- We can't consider most search algorithms
- Landing within even 5% of optimal would be a disaster for the Top500 list
- Instead, use Grouped, Attributed, Orthogonal Search (GOAS)
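To see why, note that the exhaustive space grows as the product of every parameter's candidate count. A minimal sketch (the candidate counts below are invented for illustration, not the real HPL parameter ranges):

```ruby
# The exhaustive search space is the product of the candidate counts of
# every tunable parameter. Counts here are invented, NOT real HPL ranges.
def search_space_size(candidate_counts)
  candidate_counts.reduce(1) { |acc, n| acc * n }
end

counts = [8, 4, 2, 3, 3, 2, 3, 2, 2, 2, 2, 2, 4]  # 13 hypothetical parameters
search_space_size(counts)   # => 442368
# At ~39 hours per full-scale run, even this modest space would take
# on the order of two thousand years to enumerate.
```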
1. Define the list of parameters to be studied: p0, p1, …, pm
2. Define groupings between the parameters
3. Define attributes for each group, based on the attributes of each parameter, e.g. for one group:
   - requires full scale
   - requires small memory
   - can tolerate early completion
   - needs to be varied wildly
4. Loop over each group
5. Expand the length of each parameter
6. Perform the search within the group (holding all other parameters steady)
7. Take the best-performing result and carry the best parameter values to the next search
8. Define the next search group (keeping the best from the last search)
GOAS gets very close to optimal, at the expense of a large search space and a huge amount of man-power. Knowledge of the hardware and the algorithm allows a very sensible selection of groups, which reduces the search space. Although it is not elegant, GOAS cannot be beaten in our tests.
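Steps 4–8 can be sketched as a toy GOAS loop. The groups, defaults, and cost function below are invented for illustration and stand in for real full-scale HPL runs:

```ruby
# Toy Grouped, Attributed, Orthogonal Search: exhaustively search each
# parameter group while holding all other parameters at their current
# best values, then carry the winners into the next group's search.
# Groups and the cost function are invented, not HPL's real parameters.
def goas(groups, defaults, cost)
  best = defaults.dup
  groups.each do |group|                       # 4) loop over each group
    names = group.keys
    # 5/6) expand the group's parameters and search within the group
    combos = group.values[0].product(*group.values[1..-1])
    winner = combos.min_by do |values|
      cost.call(best.merge(Hash[names.zip(values)]))
    end
    # 7/8) keep the best values from this group for the next search
    best.merge!(Hash[names.zip(winner)])
  end
  best
end

# Invented cost, minimized at nb = 160 and bcast = 3.
cost = ->(p) { (p[:nb] - 160).abs + (p[:bcast] - 3).abs }
groups = [{ nb: [64, 128, 160, 256] }, { bcast: [1, 2, 3] }]
goas(groups, { nb: 64, bcast: 1 }, cost)   # => { nb: 160, bcast: 3 }
```

Because each group is searched with the others held steady, the number of runs is the sum of the group sizes rather than their product; the grouping and attributes encode the hardware knowledge that makes this orthogonal search safe.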
Top500 list:

Rank  Site                               Vendor     Cores    RMax     RPeak    Nmax     Power
1     DOE/NNSA/LANL                      IBM        129600   1105000  1456700  2329599  2483.47
2     Oak Ridge National Laboratory      Cray Inc.  150152   1059000  1381400  4712799  6950.6
3     NASA/Ames Research Center/NAS      SGI        51200    487005   608829   2300760  2090
4     DOE/NNSA/LLNL                      IBM        212992   478200   596378   2456063  2329.6
5     Argonne National Laboratory        IBM        163840   450300   557056   2580479  1260
6     Texas Advanced Computing Center    Sun        62976    433200   579379            2000
7     NERSC/LBNL                         Cray Inc.  38642    266300   355506   1612399  1150
8     Oak Ridge National Laboratory      Cray Inc.  30976    205000   260200   2466816  1580.71
9     NNSA/Sandia National Laboratories  Cray Inc.  38208    204200   284000   2500000  2506
10    Shanghai Supercomputer Center      Dawning    30720    180600   233472
- Cray Adaptive Sparse Kernels (CASK) – the crown jewel of CrayATF
- The CASK process:
  1. Offline – produce all code variants for the tuning strategy
  2. Offline – define the target matrix classifications
  3. Offline – produce a performance model for each matrix class
  4. Runtime – analyze the matrix and deduce its classification
  5. Runtime – assign the tuned kernel to the user code
- CASK silently sits beneath PETSc on Cray systems (Trilinos support coming soon)
- The CASK ATF flow looks very like the flow shown earlier
- CASK was released with PETSc 3.0 in February 2009
- Supports generic and blocked CSR formats
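The runtime half of that process (steps 4 and 5) amounts to inspecting the matrix, mapping its characteristics to a class, and dispatching to the kernel variant that offline tuning selected for that class. A toy sketch; the thresholds, classes, and kernel names are invented:

```ruby
# Toy version of CASK's runtime steps: classify a CSR matrix by simple
# characteristics and look up the pre-tuned kernel for that class.
# KERNEL_TABLE stands in for the offline performance model.
KERNEL_TABLE = {
  blocked:    :spmv_register_blocked,     # blocked CSR storage
  short_rows: :spmv_unrolled_by_row,      # few nonzeros per row
  long_rows:  :spmv_software_pipelined,   # many nonzeros per row
}

def classify(row_ptr, block_size)
  return :blocked if block_size > 1
  nrows = row_ptr.length - 1
  avg_nnz_per_row = (row_ptr[-1] - row_ptr[0]).to_f / nrows
  avg_nnz_per_row < 8 ? :short_rows : :long_rows   # invented threshold
end

def select_kernel(row_ptr, block_size = 1)
  KERNEL_TABLE[classify(row_ptr, block_size)]      # step 5: assign kernel
end

# CSR row pointers for a 4-row matrix with 6 nonzeros: short rows.
select_kernel([0, 1, 3, 4, 6])   # => :spmv_unrolled_by_row
```

The real classifier would use many more matrix characteristics (the slides mention 60 matrix classes), but the dispatch shape is the same: all expensive analysis happens offline, leaving only a cheap table lookup at runtime.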
[Figure: speed-up of PETSc + CASK versus PETSc for parallel SpMV on 8 cores, across 60 different matrix classifications; speedup axis 1.0–1.4.]

[Figure: performance of PETSc + CASK vs. PETSc, N = 65,536 to 67,108,864, SpMV performance only; GFlops vs. number of cores (128–1024), comparing MatMult-CASK and MatMult-PETSc.]

[Figure: full solver with incomplete Cholesky local preconditioning; GFlops vs. number of cores (128–1024), comparing BlockJacobi-IC(0)-CASK and BlockJacobi-IC(0)-PETSc.]
Lessons from building an infrastructure for "industrial" purposes:
- Search spaces should be manipulated via your knowledge of the hardware
- At least 50% of the effort is pure software engineering
- Languages like Ruby and Python make things realistic
- Don't get too attached to what counts as "auto-tuning":
  - Whatever works for our problems is what we need to do; we do not care about definitions
  - Search algorithms are only interesting if they help us achieve our goals
- Several distinct sub-classes of auto-tuning seem to be emerging