2nd CERN Advanced Performance Tuning Workshop - introduction
Andrzej Nowak (CERN openlab)
November 21st, 2013

Mont Blanc (4,808m)
Geneva (pop. 190’000)
Lake Geneva (310m deep)

Worldwide LHC Computing
Intense data pressure creates strong demand for computing
> Raw data: 10’s of PB per second
> More than 25 petabytes stored yearly
> 350’000 IA computing cores
A rigorous selection process enables us to find that one interesting event in 10 trillion (10^13)

Data flow from the LHC detectors
[Figure: data flow diagram. Stage labels: online triggering and filtering in detectors; selection and reconstruction; event reprocessing; event simulation; batch physics analysis. Data labels: raw data (100%); event summary data (10%); analysis objects (1%); processed data.]

uArch level
> Load, load, do something, multiply, add, store
> Efficiency is low: scalar DP, 1.0 CPI = 6% efficiency!
> Significant portion of double precision floating point (10%+)
> Loads/stores up to 60% of instructions
> Low number of instructions between jumps (<10)
> Low number of instructions between calls (several dozen)
> Large regions of memory read only or accessed infrequently
> Conclusions:
  - Unfavorable for the x86 microarchitecture (even worse for others)
  - For the most part, code not fit for accelerators at all in its current shape

Summary:
Pattern      Load, load, do something, multiply, add, store
FP           Scalar double, 10-15%
CPI          >1.0
Load/store   60% of instructions
Inst/jump    <10
Inst/call    <30-60
Memory       Largely read-only
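The 6% figure is consistent with comparing one scalar double-precision operation per cycle against a core that could, at best, issue 4 instructions per cycle on vectors holding 4 doubles. Those peak values are an assumption made here for illustration; the slide does not state them. A minimal sketch of the arithmetic:

```python
def efficiency(cpi, lanes_used, peak_ipc=4, peak_lanes=4):
    """Fraction of peak double-precision throughput achieved.

    peak_ipc and peak_lanes are assumed values (4-wide issue, 4 doubles
    per vector register); they are not given in the slides.
    """
    achieved = lanes_used / cpi      # DP operations per cycle actually done
    peak = peak_ipc * peak_lanes     # best-case DP operations per cycle
    return achieved / peak


# Scalar code (1 lane) running at 1.0 cycles per instruction.
print(f"{efficiency(cpi=1.0, lanes_used=1):.2%}")
```

Under these assumptions the result is 1/16 of peak, which rounds to the 6% quoted on the slide.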

Workload classes

                  CPU time on   CPU usage   Disk IO    Net IO
                  the Grid                             (bw & lat)
Simulation        High          High        Minimal    Minimal
Reconstruction    Medium        High        Minimal    Minimal
Digitization      Low           High        Varying    Low
Generation        Low           Med-High    Low-Med    Low
Client/IT         None          Low         Low        Low
Client/Analysis   Varying       Varying     Varying    Varying

Performance tuning processes in 2010
> Surveyed 6 major offline collaborations (20 MLOC)
  - ROOT, Geant4
  - ALICE, ATLAS, CMS, LHCb
> Software performance is not a priority, but the quality of science is
  - Memory layout and usage patterns
  - Fragmentation, leaks, and allocation lead to pressure and non-locality
  - Microarchitectural issues secondary and not well explored
> Opportunistic optimization prevailed
  - Regression based: maintain constant overall performance rather than improve
  - All parties run nightly regression checks
  - 2 out of 6 had dedicated "performance people"
  - 3 out of 6 depended exclusively on best effort
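A regression-based process of the kind described above amounts to comparing each nightly measurement against a stored baseline and flagging slowdowns beyond a tolerance. The sketch below is illustrative only: the benchmark names, timings, and the 5% tolerance are hypothetical and do not come from any of the surveyed collaborations.

```python
def find_regressions(baseline, nightly, tolerance=0.05):
    """Return {benchmark: relative slowdown} for runs slower than allowed.

    baseline, nightly: dicts mapping a benchmark name to runtime in seconds.
    tolerance: allowed relative slowdown (0.05 means 5%), a hypothetical value.
    """
    regressions = {}
    for name, base_time in baseline.items():
        new_time = nightly.get(name)
        if new_time is None:
            continue  # benchmark did not run tonight; nothing to compare
        change = (new_time - base_time) / base_time
        if change > tolerance:
            regressions[name] = change
    return regressions


# Hypothetical timings in seconds.
baseline = {"reco": 100.0, "sim": 250.0, "digi": 40.0}
nightly = {"reco": 103.0, "sim": 280.0, "digi": 39.0}
print(find_regressions(baseline, nightly))
```

A check like this only guards against getting slower, which matches the "maintain constant overall performance" goal; it says nothing about where an improvement could be found.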

Extracting benchmarks
> Extracting a meaningful benchmark from several million lines of code is hard
  - There are loopy parts, but many of them
  - High fragmentation and large code base
  - Too many code paths: the outer layer/loop might be the same in many cases, but the contents can vary wildly per "physics situation" and "per experiment"
  - Making it self-contained and independent
> Two realistic options
  - Extract "snippets": a single method + friends
  - Copy full frameworks

Fragmentation
In this old CMSSW example, 44% of the time is consumed by hundreds of functions, each of which takes less than 0.5% of the total runtime.
From G. Eulisse
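The fragmentation measure quoted here can be computed from any flat profile: sum the time of all functions that individually fall below a threshold share of the total. A minimal sketch follows; the profile is synthetic and was chosen only so that the result mirrors the 44% figure. It is not the CMSSW data.

```python
def tail_share(profile, threshold=0.005):
    """Count the functions below `threshold` of total time and sum their share.

    profile: dict mapping a function name to its self time (any unit).
    Returns (number_of_small_functions, their_combined_fraction_of_total).
    """
    total = sum(profile.values())
    small = [t for t in profile.values() if t / total < threshold]
    return len(small), sum(small) / total


# Synthetic profile in sampling ticks: 4 hot functions plus 200 small ones.
profile = {"hot_a": 2000, "hot_b": 1500, "hot_c": 1200, "hot_d": 900}
profile.update({f"small_{i}": 22 for i in range(200)})

count, share = tail_share(profile)
print(f"{count} functions below 0.5% each account for {share:.0%} of the time")
```

When the tail share is this large, optimizing any single function can recover at most a fraction of a percent, which is why hotspot-driven tuning gives little return on such code.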

Fragmentation
From G. Eulisse

Compilers
> The best tuning aid we could possibly imagine
  - Very conservative options: -O2, -fPIC
  - Value safety very important
> GCC base (recent GCC) + old system GLIBC
> ICC and LLVM slowly picked up
  - ICC for performance
    • -O3 very rarely used; -fast: never
  - LLVM for analysis and introspection
> PGO produces penalties (code paths hard or impossible to predict)

Tools – functional requirements
> IO
  - Track IO bottlenecks easily
> Memory related statistics
  - Layout on heap, page sharing, usage histograms
  - Allocations and deallocations (usage patterns, allocation patterns, pressure, layout)
  - Categorize by calling stack
  - Tracking down leaks
> Event based sampling
  - Per-function
  - Per-module
  - With stack traces
> Non-technical guidelines
  - Understandable by non-experts
  - OSS, work in RHEL, without ROOT access
  - Stable and reliable on large code
> Call graph building

Performance tools used
> PMU based
  - earlier: perfmon2
  - perf
    • Badly designed, painful to use
    • De facto standard
    • Gooda from Google
  - Intel tools (Amplifier – worked on the alpha, SEP, PTU)
  - Some PAPI adoption
> Instrumentation
  - IgProf, Valgrind + friends (very popular)
  - PIN (slow)
  - Intel Amplifier
  - Intel Inspector (low success rate)
> Own tools
  - Not many tools work with large applications
  - Scripts, analyzers parsing raw data

PMU techniques employed
> Event Counting
  - Black-box studies and regression
  - Good for fragmentation
> EBS IP Sampling
  - Wide range of tuning activities
  - Low precision on our code
  - Bad in a fragmented scenario
> Time based sampling and time based displays of counts
  - Phase monitoring
  - Provides added value for discovery
> Experience: high level brings most value since localized optimization is hard
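Time based displays of counts, as used for phase monitoring, can be approximated by reading counter deltas at fixed intervals and looking for jumps in a derived metric such as CPI. The sketch below assumes the per-interval (cycles, instructions) deltas have already been collected by some counting tool; the sample values and the 0.3 jump threshold are hypothetical.

```python
def cpi_per_interval(samples):
    """Compute cycles per instruction for each time interval.

    samples: list of (cycles, instructions) deltas, one pair per interval;
    instruction counts are assumed to be non-zero.
    """
    return [cycles / instructions for cycles, instructions in samples]


def phase_boundaries(cpis, jump=0.3):
    """Return indices where CPI differs from the previous interval by more than `jump`."""
    return [i for i in range(1, len(cpis)) if abs(cpis[i] - cpis[i - 1]) > jump]


# Hypothetical counter deltas for five consecutive intervals.
samples = [(1000, 1000), (1100, 1000), (2000, 1000), (2100, 1000), (1000, 1000)]
cpis = cpi_per_interval(samples)
print(cpis)
print(phase_boundaries(cpis))
```

A whole-run average would hide the slow phase in the middle of this trace; the per-interval view is what makes it discoverable.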

Our issues with the PMU in a nutshell
“I have 100’000 cache misses more because of this choice of data structure – so what?”
(actual quote from a senior developer)

CERN/High Energy Physics Needs
> See next talk
> Ultimate goal: a simplified performance optimization process
  - It can only be achieved by striking a good balance between relieving the users of some of the burden and educating them about the microarchitecture at the same time
> Access to advanced information and data
  - Much of this is inaccessible today, but the hardware is there
> Easier access to information
  - Visual reports; high-level, composed reports based on advanced data
> Easier access to the right optimization directions
  - Extra data makes it possible to give extra advice
> More intelligent tuning enabled by higher-level conclusions

Workshop structure
> Lectures and interactive discussions with optional hands-on
> Topics
  - Monitoring and tuning facilities (here: x86 and ARM)
  - Methodologies
  - Tools – open source and proprietary
  - Workloads: CERN needs, large workload specifics

Speakers
> ARM
  - Al Grant
  - Michael Williams
> Calxeda
  - Robert Richter (also an AMD expert)
> CERN
  - Vincenzo Innocente
> Google
  - Maria Dimakopoulou
  - Stephane Eranian
  - David Levinthal
> Intel
  - Stanislav Bratanov
  - Michael Chynoweth
  - Ahmad Yasin
> Versailles Exascale Lab
  - Andres S. Charif-Rubial

Thank you
Other questions?
Andrzej.Nowak@cern.ch
