Static Performance Analysis with LLVM

Clément Courbet, G. Chatelet, B. De Backer, O. Sykora, Google Compiler Research

Confidential + Proprietary


Low-precision matrix multiplication
github.com/google/gemmlowp
Optimize this at (nearly) all costs!

Tools

Benchmarks
● Closer to real-life performance
● Slooooow
● Requires access to hardware

Static Analysis
● Fast
● Reproducible
● Hard to model input-dependent behaviour (branches, cache)

Our Static Performance Analyzer

MC Basic Block → Simulator → Annotated Trace → Port Pressure, Latency/Inverse Throughput, Scheduler

No semantics

MC Basic Block:
    1:  movdqu (%rdi),%xmm1
        lea 0x10(%rdi),%rdi
        movdqu (%rsi),%xmm2
        lea 0x10(%rsi),%rsi
        movdqa %xmm1,%xmm3
        psubusb %xmm2,%xmm1
        psubusb %xmm3,%xmm2
        por %xmm2,%xmm1
        movdqa %xmm1,%xmm2
        punpcklbw %xmm5,%xmm1
        punpckhbw %xmm5,%xmm2
        pmaddwd %xmm1,%xmm1
        pmaddwd %xmm2,%xmm2
        paddd %xmm1,%xmm0
        paddd %xmm2,%xmm0
        sub $0x10,%ecx
        jg 1b

Simulator API

Target-independent Simulator interface:

    const vector<MCInst>& BasicBlock = … ;
    auto Simulator = Target->createSimulator();
    SimulationLog Log = Simulator.Run(BasicBlock);

Source: http://www.realworldtech.com/haswell-cpu/6/
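To make the idea of a target-independent interface concrete, here is a minimal, self-contained C++ sketch. Every name in it (Inst, SimulationLog, Simulator, FixedWidthSimulator, Target, ToyTarget) is a hypothetical stand-in chosen for illustration; none of them is the tool's or LLVM's actual API, and the toy model simply assumes a fixed issue width.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for llvm::MCInst; only the opcode name is kept.
struct Inst {
  std::string Opcode;
};

// Hypothetical log type: here it records only the total cycle count.
struct SimulationLog {
  std::size_t NumCycles = 0;
};

// Target-independent interface: callers only ever see Run().
class Simulator {
public:
  virtual ~Simulator() = default;
  virtual SimulationLog Run(const std::vector<Inst> &BasicBlock) const = 0;
};

// Toy model: issues at most IssueWidth instructions per cycle and
// ignores dependencies, ports and latencies entirely.
class FixedWidthSimulator : public Simulator {
public:
  explicit FixedWidthSimulator(std::size_t IssueWidth)
      : IssueWidth(IssueWidth) {}

  SimulationLog Run(const std::vector<Inst> &BasicBlock) const override {
    SimulationLog Log;
    // Ceiling division: a partially filled cycle still costs one cycle.
    Log.NumCycles = (BasicBlock.size() + IssueWidth - 1) / IssueWidth;
    return Log;
  }

private:
  std::size_t IssueWidth;
};

// Hypothetical target that knows which simulator to build.
class Target {
public:
  virtual ~Target() = default;
  virtual std::unique_ptr<Simulator> createSimulator() const = 0;
};

class ToyTarget : public Target {
public:
  std::unique_ptr<Simulator> createSimulator() const override {
    return std::make_unique<FixedWidthSimulator>(4);  // assumed width
  }
};
```

The caller holds only a Target and a Simulator, so swapping ToyTarget for a more detailed model would not change the calling code.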

Simulator Internals: Components

● Generic → Reuse LLVM's target-independent descriptions, e.g.
    Scheduler: llvm::MCSchedModel
    RegisterRenamer: llvm::MCRegisterInfo
● Target-specific, e.g. Intel Fetcher

Source: http://www.realworldtech.com/haswell-cpu/6/

Analysis: IACA-like frontend

analyzing 'libyuv_sumsquareerrorsse2.s'
ran 20 iterations in 132 cycles

Block Inverse Throughput: [5-6] cycles per iteration

Port Pressure (cycles per iteration):
------------------------------------------------------------------------------------------------------
| Port   | HWDivider | HWPort0 | HWPort1 | HWPort2 | HWPort3 | HWPort4 | HWPort5 | HWPort6 | HWPort7 |
------------------------------------------------------------------------------------------------------
| Cycles |           |  4.35   |  4.40   |  1.00   |  1.00   |         |  4.30   |  1.95   |         |
------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------
| #Uops | HWDivider | HWPort0 | HWPort1 | HWPort2 | HWPort3 | HWPort4 | HWPort5 | HWPort6 | HWPort7 |
-----------------------------------------------------------------------------------------------------
| 1     |           |         |         |  1.00   |         |         |         |         |         | movdqu xmm1, xmmword ptr [rdi]
| 1     |           |         |  0.30   |         |         |         |  0.70   |         |         | lea rdi, [rdi + 0x10]
| 1     |           |         |         |         |  1.00   |         |         |         |         | movdqu xmm2, xmmword ptr [rsi]
| 1     |           |         |  0.55   |         |         |         |  0.45   |         |         | lea rsi, [rsi + 0x10]
| 1     |           |  0.35   |  0.65   |         |         |         |         |         |         | movdqa xmm3, xmm1
| 1     |           |         |  0.70   |         |         |         |  0.30   |         |         | psubusb xmm1, xmm2
| 1     |           |         |  0.40   |         |         |         |  0.60   |         |         | psubusb xmm2, xmm3
| 1     |           |  0.95   |  0.05   |         |         |         |         |         |         | por xmm1, xmm2
| 1     |           |  1.00   |         |         |         |         |         |         |         | movdqa xmm2, xmm1
| 1     |           |         |         |         |         |         |  1.00   |         |         | punpcklbw xmm1, xmm5
| 1     |           |         |         |         |         |         |  1.00   |         |         | punpckhbw xmm2, xmm5
| 1     |           |  1.00   |         |         |         |         |         |         |         | pmaddwd xmm1, xmm1
| 1     |           |  1.00   |         |         |         |         |         |         |         | pmaddwd xmm2, xmm2
| 1     |           |         |  0.75   |         |         |         |  0.25   |         |         | paddd xmm0, xmm1
| 1     |           |         |  1.00   |         |         |         |         |         |         | paddd xmm0, xmm2
| 1     |           |  0.05   |         |         |         |         |         |  0.95   |         | sub ecx, 0x10
| 1     |           |         |         |         |         |         |         |  1.00   |         | jg .Ltmp0
-----------------------------------------------------------------------------------------------------

Automatic Scheduling

Minimize the simulated latency (alternative to PostRAMachineSched)

● Random/Exhaustive Search (CP Solver): ~hours-days
● Genetic Algorithms (Biased Random-Key Genetic Algorithms): <seconds

    mov $0xffffff45789, %rsi
    mov (%rsi), %eax
    mov 8(%rsi), %ebx
    imul %eax, %eax
    add %ebx, %eax
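In a biased random-key genetic algorithm, each candidate is a vector of real-valued keys and a decoder turns that vector into a solution. The C++ sketch below shows one possible decoder for instruction scheduling: among the instructions whose dependencies are already scheduled, the one with the smallest key goes next. This decoder and the name DecodeSchedule are assumptions made for illustration; the slides do not say how the tool decodes keys.

```cpp
#include <cstddef>
#include <vector>

// Decodes a random-key vector into a dependency-respecting order.
// Deps[I] lists the instructions that must precede instruction I.
// Among ready instructions, the one with the smallest key goes first.
std::vector<std::size_t> DecodeSchedule(
    const std::vector<std::vector<std::size_t>> &Deps,
    const std::vector<double> &Keys) {
  const std::size_t N = Deps.size();
  std::vector<bool> Done(N, false);
  std::vector<std::size_t> Order;
  Order.reserve(N);
  while (Order.size() < N) {
    std::size_t Best = N;  // N means "no candidate yet"
    for (std::size_t I = 0; I < N; ++I) {
      if (Done[I]) continue;
      bool Ready = true;
      for (std::size_t D : Deps[I])
        if (!Done[D]) Ready = false;
      if (Ready && (Best == N || Keys[I] < Keys[Best])) Best = I;
    }
    if (Best == N) break;  // cyclic dependencies: return a partial order
    Done[Best] = true;
    Order.push_back(Best);
  }
  return Order;
}
```

For the five-instruction block above, numbered 0 to 4, the dependencies are {}, {0}, {0}, {1}, {2, 3}. The keys {0.9, 0.8, 0.1, 0.5, 0.2} then decode to the order 0, 2, 1, 3, 4. A genetic search would score each decoded order with the simulator and keep the keys that yield the lowest simulated latency.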

Exhaustive Search

● gemmlowp's SSE4_32_Kernel4x4Depth2: 0-2% faster (vs. implementation contributed by Intel).
● libwebp's FTransform(): 0-5% faster.

On benchmarks; no performance regressions
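An exhaustive search has to visit every instruction order that respects the data dependencies. The C++ sketch below enumerates those orders by plain backtracking. It is an illustration only, with hypothetical names (Order, Extend, AllValidOrders), and is not the CP-solver formulation mentioned on the Automatic Scheduling slide.

```cpp
#include <cstddef>
#include <vector>

using Order = std::vector<std::size_t>;

// Recursively extends Prefix with every instruction whose dependencies are
// already scheduled, collecting each complete order in Out.
void Extend(const std::vector<std::vector<std::size_t>> &Deps,
            std::vector<bool> &Done, Order &Prefix, std::vector<Order> &Out) {
  const std::size_t N = Deps.size();
  if (Prefix.size() == N) {
    Out.push_back(Prefix);
    return;
  }
  for (std::size_t I = 0; I < N; ++I) {
    if (Done[I]) continue;
    bool Ready = true;
    for (std::size_t D : Deps[I])
      if (!Done[D]) Ready = false;
    if (!Ready) continue;
    Done[I] = true;
    Prefix.push_back(I);
    Extend(Deps, Done, Prefix, Out);
    Prefix.pop_back();  // backtrack and try the next ready instruction
    Done[I] = false;
  }
}

// Returns every order of instructions 0..N-1 that respects Deps, where
// Deps[I] lists the instructions that must precede instruction I.
std::vector<Order> AllValidOrders(
    const std::vector<std::vector<std::size_t>> &Deps) {
  std::vector<Order> Out;
  std::vector<bool> Done(Deps.size(), false);
  Order Prefix;
  Extend(Deps, Done, Prefix, Out);
  return Out;
}
```

For the five-instruction block from the Automatic Scheduling slide (dependencies {}, {0}, {0}, {1}, {2, 3}), only three orders are valid. The count grows quickly with block size, which fits the hours-to-days estimate quoted for exhaustive search.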

Genetic Algorithms

● 100 milliseconds
● 10% improvement (in theory)

Original: 9.5 Cycles/Iter       Rescheduled: 8.5 Cycles/Iter
movd mm0, dword ptr [r8]        movd mm0, dword ptr [r8]
punpcklbw mm0, mm7              punpcklbw mm0, mm7
movq mm1, mm0                   movq mm1, mm0
movq mm2, mm0                   pmullw mm1, mm1
pmullw mm1, mm1                 movq mm2, mm0
paddw mm1, mm3                  pmullw mm2, mm6
psrlw mm1, 0x8                  paddw mm1, mm3
pmullw mm0, mm1                 psrlw mm1, 0x8
paddw mm0, mm3                  pmullw mm0, mm1
psrlw mm0, 0x8                  pmullw mm1, mm5
pmullw mm0, mm4                 paddw mm0, mm3
pmullw mm1, mm5                 psrlw mm0, 0x8
pmullw mm2, mm6                 pmullw mm0, mm4
paddw mm0, mm1                  paddw mm0, mm1
paddw mm0, mm2                  paddw mm0, mm2
psrlw mm0, 0x6                  psrlw mm0, 0x6
packuswb mm0, mm7               packuswb mm0, mm7
movd dword ptr [r8], mm0        movd dword ptr [r8], mm0
add r8, 0x4                     add r8, 0x4
dec ecx                         dec ecx
jne .Ltmp0                      jne .Ltmp0

Future Work

● Integrating into llvm-mca (in particular frontend simulation)
● Genetic scheduler → MachineFunctionPass

Try It Out!

https://github.com/google/EXEgesis/tree/master/llvm_sim
