SLIDE 1

How to get peak FLOPS (CPU) — What I wish I knew when I was twenty about CPU —

Kenjiro Taura

1 / 54

SLIDE 2

Contents

1 Introduction
2 An endeavor to nearly peak FLOPS
3 Latency
4 Instruction Level Parallelism (ILP)
5 Analyzing throughput
6 A simple yet fairly fast single-core matrix multiply

2 / 54

SLIDE 3

Contents

1 Introduction
2 An endeavor to nearly peak FLOPS
3 Latency
4 Instruction Level Parallelism (ILP)
5 Analyzing throughput
6 A simple yet fairly fast single-core matrix multiply

3 / 54

SLIDE 4

What you need to know to get nearly peak FLOPS

so you now know how to use multicores and SIMD instructions; they are two of the key elements for getting nearly peak FLOPS. the last key element: Instruction Level Parallelism (ILP) of superscalar processors

4 / 54

SLIDE 5

Contents

1 Introduction
2 An endeavor to nearly peak FLOPS
3 Latency
4 Instruction Level Parallelism (ILP)
5 Analyzing throughput
6 A simple yet fairly fast single-core matrix multiply

5 / 54

SLIDE 6

An endeavor to nearly peak FLOPS

let’s run the simplest code you can think of

  #if __AVX512F__
  const int vwidth = 64;
  #elif __AVX__
  const int vwidth = 32;
  #else
  #error "you’d better have a better machine"
  #endif

  const int valign = sizeof(float);
  typedef float floatv __attribute__((vector_size(vwidth), aligned(valign)));
  /* SIMD lanes */
  const int L = sizeof(floatv) / sizeof(float);

and the loop itself:

  floatv a, x, c;
  for (i = 0; i < n; i++) {
    x = a * x + c;
  }

the code performs L × n FMAs and almost nothing else
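(with AVX-512, vwidth = 64 bytes gives L = 64 / sizeof(float) = 16 lanes, and one vector FMA counts as 2 × 16 = 32 flops; with n = 10^9 that is the flops = 32000000000 printed in the runs below)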

6 / 54

SLIDE 7

Notes on experiments

the source code for the following experiments is in the 06axpy directory; the computation is trivial, but the measurement part is Linux-specific:

perf_event_open to get CPU clocks (not reference clocks)
clock_gettime to get time at nanosecond resolution

it will compile on macOS too, but the results are inaccurate:

reference clocks substitute for CPU clocks
gettimeofday (microsecond granularity) substitutes for clock_gettime
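a minimal sketch of the portable timing part, assuming only the POSIX clock_gettime API (the Linux-only perf_event_open plumbing for CPU clocks is omitted):

  #include <time.h>

  /* wall-clock time in nanoseconds */
  long long get_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
  }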

7 / 54

SLIDE 8

Notes on experiments

on Linux, you need to allow user processes to access performance events:

  $ sudo sysctl -w kernel.perf_event_paranoid=-1

exact results depend on the CPU microarchitecture and ISA; run

  $ cat /proc/cpuinfo

and google the model name (e.g., “Xeon Gold 6126”)

the following experiments show results on a Skylake-X CPU

Skylake-X is a variant of Skylake supporting AVX-512 (e.g., the login node or the big partition of the IST cluster)

there is also a plain Skylake, which has the same microarchitecture but does not support AVX-512 (marketing conspiracy?); your newest laptop is probably a Kaby Lake, even newer than Skylake but without AVX-512; its limit is: 2 × 256-bit FMAs/cycle = 32 flops/cycle

8 / 54

SLIDE 12

Let’s run it!

compile

  $ gcc -o axpy -march=native -O3 axpy.c

and run!

  $ ./axpy simd
  algo = simd
  m = 8
  n = 1000000000
  flops = 32000000000
  4000530984 CPU clocks, 2967749142 REF clocks, 1144168403 ns
  4.000531 CPU clocks/iter, 2.967749 REF clocks/iter, 1.144168 ns/iter
  7.998938 flops/CPU clock, 10.782583 flops/REF clock, 27.967911 GFLOPS

took ≈ 4 CPU clocks/iteration ≈ 8 flops/clock ≈ 1/8 of the single-core peak (64 flops/clock)

9 / 54

SLIDE 14

How to investigate

put a landmark in the assembly code

  asm volatile ("# axpy_simd: ax+c loop begin");
  for (i = 0; i < n; i++) {
    x = a * x + c;
  }
  asm volatile ("# axpy_simd: ax+c loop end");

compile into assembly

  $ gcc -S -march=native -O3 axpy.c

and see axpy.s in your editor

10 / 54

SLIDE 15

Assembly

  # axpy_simd: ax+c loop begin
  # 0 "" 2
  #NO_APP
          testq   %rdi, %rdi
          jle     .L659
          xorl    %edx, %edx
          .p2align 4,,10
          .p2align 3
  .L660:
          addq    $1, %rdx
          vfmadd132ps     %zmm0, %zmm1, %zmm2
          cmpq    %rdx, %rdi
          jne     .L660
  .L659:
  #APP
  # 63 "axpy.cc" 1
  # axpy_simd: ax+c loop end

why?

11 / 54

SLIDE 16

Suspect looping overhead?

if you suspect the overhead of the other instructions, here is an unrolled version with far fewer overhead instructions

its performance is identical

  #pragma GCC optimize("unroll-loops", 8)
  long axpy_simd(long n, floatv a,
                 floatv* X, floatv c) {
    ...
    for (i = 0; i < n; i++) {
      x = a * x + c;
    }
  }

  .L1662:
          addq    $8, %rdx
          vfmadd132ps     %zmm0,%zmm1,%zmm2
          vfmadd132ps     %zmm0,%zmm1,%zmm2
          cmpq    %rdx,%rdi
          vfmadd132ps     %zmm0,%zmm1,%zmm2
          vfmadd132ps     %zmm0,%zmm1,%zmm2
          vfmadd132ps     %zmm0,%zmm1,%zmm2
          vfmadd132ps     %zmm0,%zmm1,%zmm2
          vfmadd132ps     %zmm0,%zmm1,%zmm2
          vfmadd132ps     %zmm0,%zmm1,%zmm2
          jne     .L1662

12 / 54

SLIDE 17

Contents

1 Introduction
2 An endeavor to nearly peak FLOPS
3 Latency
4 Instruction Level Parallelism (ILP)
5 Analyzing throughput
6 A simple yet fairly fast single-core matrix multiply

13 / 54

SLIDE 19

Latency and throughput

a (Skylake-X) core can execute two vfmaddps instructions every cycle; yet that does not mean the result of the vfmaddps at line 3 below is available in the next cycle for the vfmaddps at the next line

  1  .L1662:
  2          addq    $8, %rdx
  3          vfmadd132ps     %zmm0,%zmm1,%zmm2
  4          vfmadd132ps     %zmm0,%zmm1,%zmm2
  5          cmpq    %rdx,%rdi
  6          vfmadd132ps     %zmm0,%zmm1,%zmm2
  7          ...
  8          vfmadd132ps     %zmm0,%zmm1,%zmm2
  9          jne     .L1662

what you need to know:

“two vfmadd132ps instructions every cycle” refers to the throughput
each instruction additionally has a specific latency (> 1 cycle)

14 / 54

SLIDE 20

Latencies

  instruction           Haswell   Broadwell   Skylake
  fp add                   3          3          4
  fp mul                   5          3          4
  fp fmadd                 5          5          4
  typical integer ops      1          1          1
  ...                     ...        ...        ...

http://www.agner.org/optimize/ is an invaluable resource; put the following two docs under your pillow:

3. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers

4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs

15 / 54

SLIDE 21

Our code in light of latencies

in our code, a vfmadd uses the result of the immediately preceding vfmadd; that was obvious from the source code too

  .L1662:
          addq    $8, %rdx
          vfmadd132ps     %zmm0,%zmm1,%zmm2
          vfmadd132ps     %zmm0,%zmm1,%zmm2
          cmpq    %rdx,%rdi
          vfmadd132ps     %zmm0,%zmm1,%zmm2
          ...
          vfmadd132ps     %zmm0,%zmm1,%zmm2
          jne     .L1662

  for (i = 0; i < n; i++) {
    x = a * x + c;
  }

Conclusion: the loop can’t run faster than 4 cycles/iteration

[figure: a dependency chain of vfmaddps instructions, each reading and writing zmm2]

16 / 54

SLIDE 22

CPU clocks vs. reference clocks

the CPU changes its clock frequency depending on the load (DVFS); the reference clock always runs at the same frequency (it is proportional to absolute time); an instruction takes a specified number of CPU clocks, not reference clocks; the CPU clock is therefore more predictable and more convenient for precise reasoning about the code

[figure: vfmaddps instructions laid out against the CPU clock vs. the reference clock / absolute time]

17 / 54

SLIDE 26

How to overcome latencies?

increase parallelism (there is no other way)! you can’t make a serial chain of computation run faster (change the algorithm if you want to)

[figure: a dependency chain of vfmaddps instructions through zmm2]

you can only increase throughput, by running multiple independent chains; we expect the following to finish in the same number of cycles as the original, despite performing twice as many flops

  for (i = 0; i < n; i++) {
    x0 = a * x0 + c;
    x1 = a * x1 + c;
  }
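(the results table two slides ahead confirms this: with 2 chains the loop still takes ≈ 4 clocks/iter, i.e., ≈ 16 flops/clock)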

18 / 54

SLIDE 27

Increase the number of chains further . . .

we expect to reach peak FLOPS with ≥ 2/(1/4) = 8 chains

  template<int nv>
  long axpy_simd_c( ... ) {
    for (long i = 0; i < n; i++) {
      for (long j = 0; j < nv; j++) {
        X[j] = a * X[j] + c;
      } } }
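(equivalently: FMA latency (4 cycles) × FMA throughput (2/cycle) = 8 independent FMAs must be in flight to keep both FMA units busy)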

19 / 54

SLIDE 28

Results

  chains   clocks/iter   flops/clock
     1        4.0           7.999
     2        4.001        15.998
     3        4.001        23.996
     4        4.001        31.995
     5        4.049        39.517
     6        4.001        47.994
     7        4.113        54.458
     8        4.001        63.991
     9        4.567        63.059
    10        5.001        63.982
    11        5.501        63.991
    12        6.001        63.988
    13        6.502        63.981
    14        7.001        63.989
    15        7.501        63.99

[plot: CPU cycles/iter and flops/CPU cycle vs. the number of variables]

20 / 54

SLIDE 29

Contents

1 Introduction
2 An endeavor to nearly peak FLOPS
3 Latency
4 Instruction Level Parallelism (ILP)
5 Analyzing throughput
6 A simple yet fairly fast single-core matrix multiply

21 / 54

SLIDE 35

Superscalar processors

What you need to know: instructions are decoded in program order, but the execution order is not constrained by it (out-of-order execution)

⇒ as a crude approximation, performance is constrained only by

latency: the length of the dependent chain of instructions
throughput: the number of instructions of a particular type that can execute per cycle (e.g., two fmadds/cycle)

22 / 54

SLIDE 36

A general theory of workload performance on aggressive superscalar machines

dependency constrains how fast a computation can proceed, even with an infinite amount of execution resources; increase the number of independent computations and you increase throughput, until you hit the limit of the execution resources

[plot: CPU cycles/iter and flops/CPU cycle vs. the number of variables]

23 / 54

SLIDE 37

Contents

1 Introduction
2 An endeavor to nearly peak FLOPS
3 Latency
4 Instruction Level Parallelism (ILP)
5 Analyzing throughput
6 A simple yet fairly fast single-core matrix multiply

24 / 54

SLIDE 38

What if the number of chains is a variable?

purpose: illustrate the same concept with a slightly more complex/common case

let’s try the following code, identical to the one with which we successfully achieved nearly peak performance, except that the number of variables (m) is now a variable (not a compile-time constant)

  void axpy_simd_m(..., long m) {
    for (long i = 0; i < n; i++) {
      for (long j = 0; j < m; j++) {
        X[j] = a * X[j] + c;
      } } }

25 / 54

SLIDE 39

When we experiment . . .

  chains   clocks/iter   flops/clock
     1       11.002         2.909
     2       11.037         5.799
     3       11.028         8.705
     4       11.131        11.499
     5       12.021        13.31
     6       14.018        13.696
     7       16.013        13.989
     8       18.006        14.218
     9       20.57         14.001
    10       22.017        14.535
    11       24.008        14.662
    12       26.024        14.756
    13       28.011        14.851
    14       30.022        14.923
    15       32.653        14.7

[plot: CPU cycles/iter and flops/CPU cycle vs. the number of variables]

the pattern is similar, but there are two differences:

the latency of a single update became ≈ 11 cycles
the throughput hits a plateau at ≈ 2 cycles/iter

26 / 54

SLIDE 40

Take a look at assembly

it looks like:

  .L1811:
          vmovaps %zmm0, %zmm2
          addq    $64, %rcx
          vfmadd132ps     -64(%rcx), %zmm1, %zmm2
          vmovups %zmm2, -64(%rcx)
          cmpq    %rcx, %r8
          jne     .L1811

what’s the difference from the code we have seen before (whose latency = 4 cycles)?

  .L1800:
          addq    $1, %rdx
          vfmadd132ps     %zmm0, %zmm1, %zmm2
          cmpq    %rdx, %rdi
          jne     .L1800

27 / 54

SLIDE 41

The reason for the latency (11 cycles)

both differences stem from the fact that the code now involves loads/stores; what you need to know: just like FMA, each instruction has its own latency

28 / 54

SLIDE 42

Latencies of various instructions

  instruction           Haswell   Broadwell   Skylake
  fp add                   3          3          4
  fp mul                   5          3          4
  fp fmadd                 5          5          4
  typical integer ops      1          1          1
  load                     3          3          3
  store (*)                4          4          3
  ...                     ...        ...        ...

(*) for 256-bit stores (not sure about 512-bit stores); 3 + 4 + 3 = 10 ≠ 11; I could not find information that confirms the extra cycle, but a simpler experiment shows the same result (maybe a 512-bit store takes 4 cycles?)

29 / 54

SLIDE 43

The reason for the throughput limit

what you need to know: just like FMA, all instructions have their own throughput limits, due to execution resources (dispatch ports and execution units); “two vfmadds per cycle” is just one example of this

30 / 54

SLIDE 44

Some representative throughput limits

throughput = the number of instructions of that kind that can be executed per cycle

  instruction           Haswell   Broadwell   Skylake
  fp add/mul/fmadd         2          2          2
  load                     2          2          2
  store                    1          1          1
  typical integer ops      4          4          4
  ...                     ...        ...        ...

31 / 54

SLIDE 45

A back-of-the-envelope calculation

  instruction                           type             1/throughput
  vmovaps %zmm0,%zmm2                   register move        0.33
  addq $64,%rcx                         int op               0.25
  vfmadd132ps -64(%rcx),%zmm1,%zmm2     load + FMA         0.5, 0.5
  vmovups %zmm2,-64(%rcx)               store                1.0
  cmpq %rcx,%r8                         compare              0.25
  jne .L1811                            jump                 1-2

I don’t know what 1-2 means; we can conclude that the throughput is ≤ 1 iteration/cycle due to the store; more general cases require understanding dispatch ports

32 / 54

SLIDE 46

Dispatch ports

each instruction (µ-operation) is dispatched to a specific port; Skylake-X ports:

  fmadd → port 0 or 1
  load → port 2 or 3
  store → port 4
  load/store address generation → port 7
  int → port 0, 1, 5, or 6
  etc.

source: https://en.wikichip.org/wiki/intel/microarchitectures/skylak

each port can take only a single operation per cycle; this determines the aggregate throughput of all instructions that go to that port

given the destination ports of its instructions, one can calculate the throughput limit of a given loop
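for example, in the .L1811 loop above, the store must go to port 4, which accepts one operation per cycle, so the loop cannot beat 1 iteration/cycle no matter how the other µ-operations are scheduled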

33 / 54

SLIDE 47

Intel Architecture Code Analyzer

a great tool to analyze the throughput (and, to some extent, latency) limits: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer

34 / 54

SLIDE 48

How to overcome the throughput limit?

the goal is two iterations/cycle (the throughput limit of FMA); the bottleneck is the store instruction (1/cycle); we obviously need to quit loading/storing data for every single fmadd

  for (i = 0; i < n; i++) {
    for (j = 0; j < nv; j++) {
      x[j] = a * x[j] + c; // load; fmadd; store
    }
  }

the minimum “unit” of the computation should look like:

  load x[j] into a register;
  do “a * x + c” several times on the register;
  store the result to x[j];

and run multiple independent units, as sketched below
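a minimal sketch of one such unit in C, assuming the floatv type from earlier and a hypothetical step count steps:

  floatv x = X[j];                 /* load x[j] once */
  for (long s = 0; s < steps; s++)
    x = a * x + c;                 /* FMAs operate on the register */
  X[j] = x;                        /* store the result once */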

35 / 54

SLIDE 49

Several ways to arrange computation

take one variable at a time and run it until the end (suffers from latency)

advance all variables one step at a time (suffers from the store throughput)

strategy 1: take a few variables and run them until the end (a small, constant number of variables stays in registers)

strategy 2: advance all variables, a few steps at a time (each variable is loaded and stored once per few steps)

[figures: i-j iteration-space diagrams of the four arrangements]

36 / 54

SLIDE 50

Implementing strategy 1

say we take a few (say, ten) variables at a time and run them until the end

  for (j = 0; j < nv; j += b) {
    /* run b variables until the end */
    for (i = 0; i < n; i++) {
      for (jj = j; jj < j + b; jj++) {
        xx[jj] = a * xx[jj] + c;
      }
    }
  }

we hope it loads/stores each variable only once through the whole i loop!

this coding depends on the compiler’s smartness we have witnessed: promoting fixed-size arrays into registers

37 / 54

SLIDE 51

Implementing strategy 2

advance all variables a few (say, three) steps at a time

  for (i = 0; i < n; i += 3) {
    /* run all variables 3 steps */
    for (j = 0; j < m; j++) {
      for (ii = 0; ii < 3; ii++) {
        x[j] = a * x[j] + c;
      }
    }
  }

again, we hope the compiler is smart enough to eliminate the intermediate loads/stores

the latency of a single j iteration increases, but we hope the j loop exposes lots of independent computations

38 / 54

SLIDE 52

Results of strategy 1

  chains   clocks/iter   flops/clock
     1        4.0           7.999
     2        4.0          15.998
     3        4.001        23.996
     4        4.001        31.996
     5        4.018        39.818
     6        4.001        47.991
     7        4.114        54.447
     8        4.001        63.99
     9        4.583        62.834
    10        5.001        63.986
    11        5.501        63.991
    12        6.001        63.993
    13        6.501        63.988
    14        7.001        63.991
    15        7.501        63.987

[plot: CPU cycles/iter and flops/CPU cycle vs. the number of variables]

39 / 54

SLIDE 53

Contents

1 Introduction
2 An endeavor to nearly peak FLOPS
3 Latency
4 Instruction Level Parallelism (ILP)
5 Analyzing throughput
6 A simple yet fairly fast single-core matrix multiply

40 / 54

SLIDE 55

Developing a near peak FLOPS matrix multiply

let’s develop a (single-core) matrix multiply that runs at fairly good FLOPS on Skylake-X; it is a great application of the concepts you have just learned

C += A * B

[figure: the M×N matrix C updated with the product of the M×K matrix A and the K×N matrix B]

41 / 54

SLIDE 56

A few simplifying assumptions

we add assumptions that M, N, and K are multiples of certain numbers along the way (we forget how to process “remainder” rows/columns in these slides); we assume matrix sizes are known at compile time and are “convenient” (e.g., they are small); multiplication of larger matrices (of sizes unknown at compile time) can be built on top of this

42 / 54

SLIDE 57

Step 1: Baseline code

[figure: the i-j-k iteration space over C (M×N), A (M×K), and B (K×N)]

  $ ./mmc00
  M = 12, N = 32, K = 192
  A : 12 x 192 (ld=192) 9216 bytes
  B : 192 x 32 (ld=32) 24576 bytes
  C : 12 x 32 (ld=32) 1536 bytes
  total = 35328 bytes
  ...
  3.456 CPU clocks/iter
  2.520 REF clocks/iter
  0.579 flops/CPU clock
  0.794 flops/REF clock
  2.058 GFLOPS

  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < K; k++)
        C(i,j) += A(i,k) * B(k,j);

it runs at ≈ 3.5 clocks per innermost-loop iteration ≈ the latency of the fmadd on C(i,j), minus a fraction
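(sanity check: each innermost iteration is one scalar FMA = 2 flops, and 2 / 3.456 ≈ 0.579 flops/clock, exactly as printed above)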

43 / 54

SLIDE 58

Step 1: analysis

latency limit: the latency of FMA

I don’t know why it’s slightly smaller than 4, but note that the true dependence occurs only through the additions

throughput limit: not important here

⇒ ≈ 2 flops / 4 cycles = 0.5 flops/cycle

44 / 54

SLIDE 60

Step 2: Vectorization

[figure: the iteration space; each j step now covers L columns]

  M = 12, N = 32, K = 192
  A : 12 x 192 (ld=192) 9216 bytes
  B : 192 x 32 (ld=32) 24576 bytes
  C : 12 x 32 (ld=32) 1536 bytes
  total = 35328 bytes
  repeat : 100000 times
  ...
  3.555 CPU clocks/iter
  2.681 REF clocks/iter
  9.002 flops/CPU clock
  11.936 flops/REF clock
  30.960 GFLOPS

  for (i = 0; i < M; i++)
    for (j = 0; j < N; j += L)
      for (k = 0; k < K; k++)
        C(i,j:j+L) += A(i,k) * B(k,j:j+L);

assumption: N is a multiple of the number of SIMD lanes (L)
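a sketch of what the vector notation might expand to, assuming row-major storage with the leading dimensions printed above and the floatv typedef from the axpy example (the actual course code presumably hides this behind the C/A/B macros):

  /* C(i,j:j+L) += A(i,k) * B(k,j:j+L); with GCC vector extensions,
     the scalar element of A is broadcast across the vector */
  *(floatv *)&C[i * ldc + j] += A[i * lda + k] * *(floatv *)&B[k * ldb + j];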

45 / 54

SLIDE 61

Step 2: analysis

almost the same as step 1, except that each iteration now does 16 FMAs (as opposed to a single one) ⇒ ≈ 32 flops / 4 cycles = 8 flops/cycle
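(check: 32 flops / 3.555 clocks ≈ 9.0 flops/clock, matching the measured output)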

46 / 54

SLIDE 63

Step 3: increase parallelism!

[figure: the iteration space; each i step now covers bM rows]

  login000:07mm$ ./mmc02
  M = 8, N = 32, K = 192
  A : 8 x 192 (ld=192) 6144 bytes
  B : 192 x 32 (ld=32) 24576 bytes
  C : 8 x 32 (ld=32) 1024 bytes
  ...
  5.451 CPU clocks/iter
  4.387 REF clocks/iter
  46.966 flops/CPU clock
  58.349 flops/REF clock
  151.341 GFLOPS

update bM vector elements of C concurrently

  for (i = 0; i < M; i += bM)
    for (j = 0; j < N; j += L)
      for (k = 0; k < K; k++)
        for (di = 0; di < bM; di++)
          C(i+di,j:j+L) += A(i+di,k) * B(k,j:j+L);

Skylake requires bM ≥ 8 to reach peak FLOPS

47 / 54

SLIDE 64

Step 3: analysis

  1  for (i = 0; i < M; i += bM)
  2    for (j = 0; j < N; j += L)
  3      for (k = 0; k < K; k++)
  4        for (di = 0; di < bM; di++)
  5          C(i+di,j:j+L) += A(i+di,k) * B(k,j:j+L);

the for loop at line 4 performs:

  bM loads (broadcasts) for A(i+di,k)
  1 load for B(k,j:j+L)
  bM FMAs

remember the load throughput = 2 loads/cycle; to achieve 2 FMAs/cycle, we must have (the number of loads) ≤ (the number of FMAs); we need to remove an extra load instruction
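(since (bM + 1)/bM > 1 for every bM, the loads always outnumber the FMAs; no choice of bM alone fixes this)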

48 / 54

SLIDE 65

Step 4: Reuse an element of A

[figure: the iteration space; each update now covers a bM’ × bN block of C]

  $ ./mmc03
  M = 8, N = 32, K = 192
  A : 8 x 192 (ld=192) 6144 bytes
  B : 192 x 32 (ld=32) 24576 bytes
  C : 8 x 32 (ld=32) 1024 bytes
  ...
  4.926 CPU clocks/iter
  4.938 REF clocks/iter
  51.969 flops/CPU clock
  51.846 flops/REF clock
  134.474 GFLOPS

update a bM’ × bN block rather than bM × 1

  1  for (i = 0; i < M; i += bM’)
  2    for (j = 0; j < N; j += bN * L)
  3      for (k = 0; k < K; k++)
  4        for (di = 0; di < bM’; di++)
  5          for (dj = 0; dj < bN * L; dj += L)
  6            C(i+di,j+dj:j+dj+L) += A(i+di,k) * B(k,j+dj:j+dj+L);

49 / 54

SLIDE 66

Step 4: Analysis

the for loop at line 4 performs:

  bM’ loads (broadcasts) for A(i+di,k)
  bN loads for B(k,j+dj:j+dj+L)
  bM’ × bN FMAs

the minimum requirement for it to achieve peak FLOPS is bM’ × bN ≥ 8

in the experiments, when we set bM’ = 6 and bN = 2, it gets 59 flops/cycle (92% of the peak)

we need to note that this happens only when the matrix is small (M = 12, N = 32, K = 160) and we repeat the multiply many times; the issue with large matrices will be the next topic
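(with bM’ = 6 and bN = 2, each k iteration issues 6 + 2 = 8 loads against 12 FMAs; at 2 loads/cycle and 2 FMAs/cycle, that is 4 cycles of load work versus 6 cycles of FMA work, so the FMA units finally become the binding resource)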

50 / 54

SLIDE 67

Simultaneous Multithreading (SMT)

each physical core has several hardware threads, or virtual cores:

  recent Xeon processors (including Skylake-X): 2
  Knights Landing/Mill: 4
  IBM POWER: 8

a.k.a. Hyperthreading®

virtual cores on a single physical core are concurrently executed by the hardware; they have their own registers (switching between them has almost no overhead) but share most execution resources (dispatch ports, floating-point units, L1/L2 caches, etc.)

[figure: a chip (socket, node, CPU) containing physical cores, each with hardware threads (virtual cores) and private L1/L2 caches, plus a shared L3 cache and memory controller]

51 / 54

SLIDE 68

Performance implications of virtual cores

having as many threads as there are virtual cores on a physical core helps improve the throughput of latency-bound applications, but does not help throughput-bound ones

note: if you have more threads than virtual cores, the operating system gets involved to switch between them, at a much coarser granularity (say, every 1 ms (10^-3 s) rather than every cycle ∼ 1 ns (10^-9 s))

SMT never helps mitigate the latency of arithmetic operations; it does help mitigate much bigger latencies (say, I/O latencies when accessing an HDD)

52 / 54

SLIDE 69

Takeaways (1)

peak FLOPS of recent Intel CPUs = “execute two fmadds every cycle” (no other combination of instructions reaches it)

other processors have different limits, but the basics are the same

no, single-core performance is not about reducing the number of instructions

it’s about how to increase parallelism:

  SIMD
  ILP

53 / 54

SLIDE 70

Takeaways (2)

dependent instructions incur latencies and hinder parallelism

independent instructions are executed in parallel, up to throughput limits

throughput limits are determined by dispatch ports

54 / 54