
Automated Empirical Optimizations of Software and the ATLAS project*
Software Engineering Seminar
Pascal Spörri

* R. Clint Whaley, Antoine Petitet and Jack Dongarra. Parallel Computing, 27(1-2):3-35, 2001.
Is Search Really Necessary to Generate High-Performance BLAS? Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali and Paul Stodghill. Proceedings of the IEEE, Vol. 93, No. 2, February 2005.

INTRODUCTION

BLAS (Basic Linear Algebra Subprograms)
• Level 1: Vector operations, y ← αx + z
• Level 2: Matrix-Vector operations, y ← αAx + z
• Level 3: Matrix-Matrix operations, D ← αAB + βC
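To make the three levels concrete, here is a minimal sketch of one operation per level in plain C. The function names are hypothetical and are not the BLAS API. The sketch assumes row-major storage and uses the in-place form of the reference BLAS routines (daxpy, for example, computes y ← αx + y).

```c
#include <assert.h>
#include <stddef.h>

/* Level 1 (vector-vector): y <- alpha*x + y, O(n) work. */
void sketch_axpy(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y <- alpha*A*x + y, A is m x n, row-major. */
void sketch_gemv(size_t m, size_t n, double alpha,
                 const double *A, const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < m; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        y[i] += alpha * acc;
    }
}

/* Level 3 (matrix-matrix): C <- alpha*A*B + beta*C,
 * A is m x k, B is k x n, C is m x n, all row-major. */
void sketch_gemm(size_t m, size_t n, size_t k, double alpha,
                 const double *A, const double *B, double beta, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```

Level 3 performs O(n³) arithmetic on O(n²) data, which is why the cache and register optimizations in the rest of the talk concentrate on matrix-matrix multiplication.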

ATLAS (Automatically Tuned Linear Algebra Software)
• Implements BLAS
• Applies empirical optimization techniques to source code to generate an optimized library
• Fully automatic
• Produces ANSI-C code

ATLAS Matrix-Matrix Multiplication
[Figure: matrix-matrix multiplication (MMM) plots]
"Automated Empirical Optimization of Software and the ATLAS project" by R. Clint Whaley, Antoine Petitet and Jack Dongarra. Parallel Computing, 27(1-2):3-35, 2001.

ARCHITECTURE

ATLAS Architecture
[Diagram: Detect Hardware Parameters (L1 cache, CPU parameters) → ATLAS Search Engine → (parameters) → ATLAS Code Generator → (source code) → Execute and Measure → (MFLOPS) → back to the ATLAS Search Engine; multiple versions are generated and measured]
Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali and Paul Stodghill. Proceedings of the IEEE, Vol. 93, No. 2, February 2005.

ATLAS CODE GENERATOR

ATLAS Optimizations
• Case: Matrix-Matrix multiplication C = A × B, where C is N × M, A is N × K and B is K × M

  for i ∈ [0 : 1 : N − 1]
    for j ∈ [0 : 1 : M − 1]
      for k ∈ [0 : 1 : K − 1]
        C_{i,j} = C_{i,j} + A_{i,k} × B_{k,j}
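The triple loop above can be written directly in C. This is a minimal sketch, assuming row-major storage and the dimensions implied by the loop bounds (C is N × M, A is N × K, B is K × M); the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Unoptimized reference: C (N x M) += A (N x K) * B (K x M), row-major. */
void mmm_naive(size_t N, size_t M, size_t K,
               const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < M; j++)
            for (size_t k = 0; k < K; k++)
                C[i * M + j] += A[i * K + k] * B[k * M + j];
}
```

This version serves as the baseline that the following transformations (loop ordering, blocking, unrolling, scalar replacement) restructure without changing the result.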

Loop Ordering
JIK order (store A in cache):

  for j ∈ [0 : 1 : M − 1]
    for i ∈ [0 : 1 : N − 1]
      for k ∈ [0 : 1 : K − 1]
        C_{i,j} = C_{i,j} + A_{i,k} × B_{k,j}

IJK order (store B in cache):

  for i ∈ [0 : 1 : N − 1]
    for j ∈ [0 : 1 : M − 1]
      for k ∈ [0 : 1 : K − 1]
        C_{i,j} = C_{i,j} + A_{i,k} × B_{k,j}

1st Level Blocking
N_B is chosen such that the working set fits into L1.

  for i ∈ [0 : N_B : N − 1]
    for j ∈ [0 : N_B : M − 1]
      for k ∈ [0 : N_B : K − 1]
        for j′ ∈ [j : 1 : j + N_B − 1]
          for i′ ∈ [i : 1 : i + N_B − 1]
            for k′ ∈ [k : 1 : k + N_B − 1]
              C_{i′,j′} = C_{i′,j′} + A_{i′,k′} × B_{k′,j′}
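A minimal C sketch of this cache-level blocking follows. It assumes row-major storage and a hypothetical function name. Unlike the loop nest on the slide, it clamps the block ends so that dimensions that are not multiples of N_B are still handled.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C (N x M) += A (N x K) * B (K x M), computed in NB x NB blocks so that
 * the blocks being combined can stay resident in the L1 cache. */
void mmm_blocked(size_t N, size_t M, size_t K, size_t NB,
                 const double *A, const double *B, double *C)
{
    assert(NB > 0 && A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < N; i += NB)
        for (size_t j = 0; j < M; j += NB)
            for (size_t k = 0; k < K; k += NB) {
                /* Clamp block ends for partial blocks at the matrix edge. */
                size_t i_end = min_sz(i + NB, N);
                size_t j_end = min_sz(j + NB, M);
                size_t k_end = min_sz(k + NB, K);
                /* Mini-MMM on one block triple. */
                for (size_t jj = j; jj < j_end; jj++)
                    for (size_t ii = i; ii < i_end; ii++)
                        for (size_t kk = k; kk < k_end; kk++)
                            C[ii * M + jj] += A[ii * K + kk] * B[kk * M + jj];
            }
}
```

The arithmetic is identical to the unblocked loop nest; only the order in which the elements are visited changes, which is what improves cache reuse.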

2nd Level Blocking
Register blocking with M_U + N_U + M_U × N_U ≤ N_R (N_R: number of available registers).

  for i ∈ [0 : N_B : N − 1]
    for j ∈ [0 : N_B : M − 1]
      for k ∈ [0 : N_B : K − 1]
        for j′ ∈ [j : N_U : j + N_B − 1]
          for i′ ∈ [i : M_U : i + N_B − 1]
            for k′ ∈ [k : K_U : k + N_B − 1]
              for k′′ ∈ [k′ : 1 : k′ + K_U − 1]          (unroll loop)
                for j′′ ∈ [j′ : 1 : j′ + N_U − 1]
                  for i′′ ∈ [i′ : 1 : i′ + M_U − 1]
                    C_{i′′,j′′} = C_{i′′,j′′} + A_{i′′,k′′} × B_{k′′,j′′}

Graphic from "How To Write Fast Numerical Code: A Small Introduction", Srinivas Chellappa, Franz Franchetti, and Markus Püschel
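The register-level blocking can be sketched as follows for a single N_B × N_B block. This is an illustration under assumptions, not ATLAS's generated code: the names are hypothetical, M_U = N_U = 2 is an arbitrary choice, the K_U unrolling of the k loop is left out for brevity, and the two innermost loops are the ones a code generator would unroll completely.

```c
#include <assert.h>
#include <stddef.h>

enum { MU = 2, NU = 2 };  /* assumed register tile size, chosen for illustration */

/* Register constraint from the slide: MU values of A, NU values of B and
 * an MU x NU tile of C must fit into NR registers. */
int fits_in_registers(int mu, int nu, int nr)
{
    return mu + nu + mu * nu <= nr;
}

/* Mini-MMM on row-major NB x NB blocks; NB must be a multiple of MU and NU. */
void mini_mmm_regblocked(size_t NB, const double *A, const double *B, double *C)
{
    assert(NB % MU == 0 && NB % NU == 0);
    for (size_t j = 0; j < NB; j += NU)
        for (size_t i = 0; i < NB; i += MU)
            for (size_t k = 0; k < NB; k++)
                /* Micro-MMM: MU x NU outer-product update of the C tile. */
                for (size_t jj = 0; jj < NU; jj++)
                    for (size_t ii = 0; ii < MU; ii++)
                        C[(i + ii) * NB + (j + jj)] +=
                            A[(i + ii) * NB + k] * B[k * NB + (j + jj)];
}
```

For example, a 2 × 2 tile needs 2 + 2 + 4 = 8 registers, so `fits_in_registers(2, 2, 8)` holds, while a 4 × 4 tile needs 24 and does not fit into 16.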

Scalar Replacement
• Replace array accesses with scalars

Before (intermediate results stored in memory):

  double t[2];
  for (i=0; i<8; i++) {
    t[0] = x[2*i] + x[2*i+1];
    t[1] = x[2*i] - x[2*i+1];
    y[2*i] = t[0] * D[2*i];
    y[2*i+1] = t[1] * D[2*i];
  }

After (intermediate results stored in registers):

  double t0, t1, x0, x1, D0;
  for (i=0; i<8; i++) {
    x0 = x[2*i]; x1 = x[2*i+1]; D0 = D[2*i];   /* store for reuse */
    t0 = x0 + x1;
    t1 = x0 - x1;
    y[2*i] = t0 * D0;
    y[2*i+1] = t1 * D0;
  }

How To Write Fast Numerical Code: A Small Introduction, Srinivas Chellappa, Franz Franchetti, and Markus Püschel

Scalar Replacement

Load into scalars:
  a11 = A[1][1]
  a12 = A[1][2]
  a13 = A[1][3]
  a14 = A[1][4]
  …
  b11 = B[1][1]
  b12 = B[1][2]
  b13 = B[1][3]
  b14 = B[1][4]
  …

Compute on scalars:
  c11 = a11*b11
  c11 += a12*b21
  c11 += a13*b31
  …
  c12 = a11*b12
  c12 += a12*b22
  c12 += a13*b32
  …

Store results:
  C[1][1] = c11
  C[1][2] = c12
  C[1][3] = c13
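The load / compute / store pattern above can be sketched as a small C micro-kernel. This is an illustrative 2 × 2 tile with hypothetical names and row-major storage with leading dimensions lda, ldb and ldc. Keeping the accumulators in local scalars lets the compiler hold them in registers for the whole k loop.

```c
#include <assert.h>
#include <stddef.h>

/* 2 x 2 tile of C += (2 x K panel of A) * (K x 2 panel of B). */
void micro_mmm_2x2(size_t K,
                   const double *A, size_t lda,
                   const double *B, size_t ldb,
                   double *C, size_t ldc)
{
    assert(A != NULL && B != NULL && C != NULL);

    /* Load the C tile into scalars once. */
    double c00 = C[0],   c01 = C[1];
    double c10 = C[ldc], c11 = C[ldc + 1];

    for (size_t k = 0; k < K; k++) {
        /* Each A and B element is loaded once and reused twice. */
        double a0 = A[k],       a1 = A[lda + k];
        double b0 = B[k * ldb], b1 = B[k * ldb + 1];
        c00 += a0 * b0;  c01 += a0 * b1;
        c10 += a1 * b0;  c11 += a1 * b1;
    }

    /* Write the tile back to memory once. */
    C[0] = c00;    C[1] = c01;
    C[ldc] = c10;  C[ldc + 1] = c11;
}
```

Per k iteration the kernel issues four loads for four multiply-add pairs, compared with eight loads (plus loads and stores of C) when every operand is read from the arrays.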

Data Hazards
Five-stage pipeline (IF ID EX MEM WB); every instruction after the load reads R1:

  LD   R1, 0(R2)       IF  ID  EX  MEM WB
  DSUB R4, R1, R5          IF  ID  EX  MEM WB
  AND  R6, R1, R7              IF  ID  EX  MEM WB
  OR   R8, R1, R9                  IF  ID  EX  MEM WB
  XOR  R10, R1, R11                    IF  ID  EX  MEM WB

Skewing Factor
Jens Teubner · Data Processing on Modern Hardware · Fall 2010

Pipeline Scheduling
Interleave mul and add sequences, with skewing factor L_S:

  mul_1
  mul_2
  …
  mul_{L_S}
  add_1
  mul_{L_S+1}
  add_2
  mul_{L_S+2}
  add_3
  …
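The interleaving can be illustrated with a small schedule generator. This is only a sketch with hypothetical names: it assumes n independent multiply-add pairs in which add_i consumes the result of mul_i, and it emits the first L_S multiplies before the first add so that each add is issued L_S multiplies after the multiply it depends on.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    char kind;   /* 'm' for mul, 'a' for add */
    int  index;  /* 1-based index of the operation */
} op_t;

/* Fills out[0 .. 2n-1] with the skewed schedule and returns the op count. */
size_t skewed_schedule(int n, int ls, op_t *out)
{
    assert(n > 0 && ls > 0 && ls <= n && out != NULL);
    size_t pos = 0;
    int next_mul = 1, next_add = 1;

    /* Prologue: the first ls multiplies fill the pipeline. */
    while (next_mul <= ls)
        out[pos++] = (op_t){'m', next_mul++};

    /* Steady state: alternate add_i with mul_{i+ls}. */
    while (next_mul <= n) {
        out[pos++] = (op_t){'a', next_add++};
        out[pos++] = (op_t){'m', next_mul++};
    }

    /* Epilogue: drain the remaining adds. */
    while (next_add <= n)
        out[pos++] = (op_t){'a', next_add++};

    return pos;
}
```

For n = 4 and L_S = 2 this yields mul_1, mul_2, add_1, mul_3, add_2, mul_4, add_3, add_4.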

Pipeline Scheduling

Load into scalars:
  a11 = A[1][1]
  a12 = A[1][2]
  a13 = A[1][3]
  a14 = A[1][4]
  …
  b11 = B[1][1]
  b12 = B[1][2]
  b13 = B[1][3]
  b14 = B[1][4]
  …

Compute, with independent updates interleaved:
  c11 = a11*b11
  c12 = a11*b12
  …
  c11 += a12*b21
  c12 += a12*b22
  …
  c11 += a13*b31
  c12 += a13*b32
  …

Store results:
  C[1][1] = c11
  C[1][2] = c12
  C[1][3] = c13

EMPIRICAL OPTIMIZATION IN ATLAS

ATLAS Architecture
[Diagram: Detect Hardware Parameters (L1 cache, CPU parameters) → ATLAS Search Engine, which optimizes f(x_1, x_2, x_3, …, x_n) → (parameters) → ATLAS Code Generator → (source code) → Execute and Measure → (MFLOPS) → back to the ATLAS Search Engine; multiple versions are generated and measured]
Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali and Paul Stodghill. Proceedings of the IEEE, Vol. 93, No. 2, February 2005.

Optimization Order
1. Find best block size for outer loop (N_B)
2. Find best block sizes for inner loop (M_U, N_U)
3. Find best skewing factor
4. Find best parameters for scheduling of loads
5. Additional parameters

Search for best Outer Loop Size
Restrict search space: 16 ≤ N_B ≤ min(80, √(L1 Size))
• N_B must be a multiple of 4
• Use fastest version
Try with and without unrolling the inner loop
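The restricted search can be sketched as a loop over the candidate block sizes. Everything here is an assumption made for illustration: the names are hypothetical, the L1 size is taken in matrix elements (doubles), the upper bound is taken as its square root so that an N_B × N_B block fits in L1, and `fake_mflops` is a synthetic stand-in for generating, compiling, running and timing a real mini-MMM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical measurement callback: returns MFLOPS for block size nb. */
typedef double (*measure_fn)(int nb);

/* Synthetic stand-in with a peak at nb = 44, used only for demonstration. */
double fake_mflops(int nb)
{
    return 1000.0 - (double)((nb - 44) * (nb - 44));
}

/* Largest r with r*r <= v (avoids depending on the math library). */
static int isqrt_floor(long v)
{
    int r = 0;
    while ((long)(r + 1) * (r + 1) <= v)
        r++;
    return r;
}

/* Tries every multiple of 4 in [16, min(80, sqrt(L1 size))] and keeps the
 * fastest one. Returns -1 if the range is empty. */
int search_nb(long l1_size_doubles, measure_fn measure)
{
    assert(l1_size_doubles >= 0 && measure != NULL);
    int upper = isqrt_floor(l1_size_doubles);
    if (upper > 80)
        upper = 80;

    int best_nb = -1;
    double best_mflops = -1.0;
    for (int nb = 16; nb <= upper; nb += 4) {
        double mflops = measure(nb);
        if (mflops > best_mflops) {
            best_mflops = mflops;
            best_nb = nb;
        }
    }
    return best_nb;
}
```

With a 32 KiB L1 data cache (4096 doubles) the candidates are 16, 20, …, 64, and `search_nb(4096, fake_mflops)` returns 44 for the synthetic measurement above.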

DISCUSSION

Comparison to PhiPAC

PhiPAC
• Coding methodology to write fast code
• Precursor for ATLAS
• Specialized Code Generator for BLAS Matrix-Matrix Multiplication

ATLAS
• Library generator
• Automatic generation of optimized BLAS
• Support for handcoded routines
• Optimizes parameters for inner and outer loop

ATLAS Matrix-Matrix Multiplication
[Figure: matrix-matrix multiplication (MMM) plots]
"Automated Empirical Optimization of Software and the ATLAS project" by R. Clint Whaley, Antoine Petitet and Jack Dongarra. Parallel Computing, 27(1-2):3-35, 2001.

Comparison to Eigen
[Figure: benchmark plot from http://eigen.tuxfamily.org/index.php?title=Benchmark]
Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz ( x86_64 )

Conclusion

Pro
• Fast method to generate an optimized library for a new platform
• Supports hand-optimized code
• Implements BLAS

Contra
• Needs constant adjustment to support new architectures
• Outdated

Further Information
• ATLAS Project: http://math-atlas.sourceforge.net/
• BLAS: http://netlib.org/blas/
