2nd CERN Advanced Performance Tuning Workshop - introduction



slide-1
SLIDE 1

2nd CERN Advanced Performance Tuning Workshop

2nd CERN Advanced Performance Tuning Workshop - introduction

Andrzej Nowak (CERN openlab)

November 21st 2013

slide-2
SLIDE 2

Mont Blanc (4,808m)
Lake Geneva (310m deep)
Geneva (pop. 190’000)

Andrzej Nowak - 2nd CERN Advanced Performance Tuning Workshop

slide-3
SLIDE 3


slide-4
SLIDE 4

Worldwide LHC Computing

Intense data pressure creates strong demand for computing

  • 350’000 IA computing cores
  • >25 petabytes stored yearly
  • Raw data: 10s of PB per second

A rigorous selection process enables us to find that one interesting event in 10 trillion (10^13)


slide-5
SLIDE 5

Data flow from the LHC detectors

[Diagram: processing stages and data products]

Processing stages:

  • Online triggering and filtering in detectors
  • Event simulation
  • Selection and reconstruction
  • Event reprocessing
  • Batch physics analysis

Data products:

  • Raw data (100%)
  • Event summary data (10%)
  • Analysis objects (1%)
  • Processed data

slide-6
SLIDE 6

uArch level

> Load, load, do something, multiply, add, store

> Efficiency is low: scalar DP, 1.0 CPI = 6% efficiency!

> Significant portion of double precision floating point (10%+)

> Loads/stores up to 60% of instructions

> Low number of instructions between jumps (<10)

> Low number of instructions between calls (several dozen)

> Large regions of memory read only or accessed infrequently

> Conclusions:

  • Unfavorable for the x86 microarchitecture (even worse for others)
  • For the most part, code not fit for accelerators at all in its current shape

Summary:

Pattern      Load, load, do something, multiply, add, store
FP           Scalar double, 10-15%
CPI          >1.0
Load/store   60% of instructions
Inst/jump    <10
Inst/call    <30-60
Memory       Largely read-only
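The 6% figure can be reproduced with simple arithmetic, under an assumption that the slide does not state: a core whose peak is 16 double precision floating point operations per cycle (for example two 4-wide vector units, each executing a fused multiply-add). A scalar DP instruction stream retiring at 1.0 CPI delivers at most one operation per cycle. The function and constant names below are illustrative only.

```python
# Sketch: efficiency of scalar double-precision code relative to an assumed peak.
# ASSUMPTION (not from the slide): peak = 2 units x 4 DP lanes x 2 flops (FMA)
# = 16 DP flops per cycle.

def efficiency(cpi, flops_per_instruction, peak_flops_per_cycle):
    """Fraction of peak reached by code retiring one instruction every `cpi` cycles."""
    achieved_flops_per_cycle = flops_per_instruction / cpi
    return achieved_flops_per_cycle / peak_flops_per_cycle

ASSUMED_PEAK = 2 * 4 * 2  # hypothetical machine, see assumption above

scalar_eff = efficiency(cpi=1.0, flops_per_instruction=1,
                        peak_flops_per_cycle=ASSUMED_PEAK)
print(f"scalar DP at 1.0 CPI: {scalar_eff:.2%} of assumed peak")
```

With a different assumed peak (for instance 8 flops per cycle without FMA) the same calculation gives a higher percentage, so the exact figure depends on the reference machine.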

slide-7
SLIDE 7

Workload classes

                  CPU time on the Grid   CPU usage   Disk IO    Net IO (bw & lat)
Simulation        High                   High        Minimal    Minimal
Reconstruction    Medium                 High        Minimal    Minimal
Digitization      Low                    High        Varying    Low
Generation        Low                    Med-High    Low-Med    Low
Client/IT         None                   Low         Low        Low
Client/Analysis   Varying                Varying     Varying    Varying


slide-8
SLIDE 8

Performance tuning processes in 2010

> Surveyed 6 major offline collaborations (20 MLOC)

  • ROOT, Geant4
  • ALICE, ATLAS, CMS, LHCb

> Software performance is not a priority, but the quality of science is

  • Memory layout and usage patterns
  • Fragmentation, leaks, allocation lead to pressure and non-locality
  • Microarchitectural issues secondary and not well explored

> Opportunistic optimization prevailed

  • Regression based - maintain constant overall performance rather than improve
  • All parties run nightly regression checks
  • 2 out of 6 had dedicated "performance people"
  • 3 out of 6 depended exclusively on best effort
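A regression-based check of the kind described above can be sketched as follows. This is a minimal illustration, not any collaboration's actual tooling: the function name, the 2% tolerance and the timings are all made up.

```python
# Sketch of a nightly regression check: the goal is to keep overall
# performance constant, so a build is flagged only when it is slower
# than the baseline by more than a tolerance.
# ASSUMPTION: tolerance and timings are hypothetical values.

def check_regression(baseline_s, nightly_s, tolerance=0.02):
    """Return (ok, relative_change); ok is False if the nightly run
    is slower than the baseline by more than `tolerance`."""
    change = (nightly_s - baseline_s) / baseline_s
    return change <= tolerance, change

ok, change = check_regression(baseline_s=1200.0, nightly_s=1260.0)
print("OK" if ok else "REGRESSION", f"{change:+.1%}")
```

Note that a check like this only detects slowdowns; it gives no incentive to improve on the baseline, which matches the "maintain rather than improve" observation.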


slide-9
SLIDE 9

Extracting benchmarks

> Extracting a meaningful benchmark from several million lines of code is hard

  • There are loopy parts, but many of them
  • High fragmentation and large code base
  • Too many code paths – the outer layer/loop might be the same in many cases but the contents can vary wildly per "physics situation" and "per experiment"
  • Making it self-contained and independent

> Two realistic options

  • Extract "snippets" – a single method + friends
  • Copy full frameworks


slide-10
SLIDE 10

Fragmentation

In this old CMSSW example, 44% of the time is consumed by hundreds of functions, each of which takes less than 0.5% of the total runtime

From G. Eulisse
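The fragmentation measure quoted above (share of runtime spent in functions that each stay below 0.5%) can be computed from any flat profile. The sketch below uses a made-up profile, not the CMSSW data; `tail_share` and the function names in the dictionary are hypothetical.

```python
# Sketch: how much of the runtime sits in the "long tail" of small functions.
# ASSUMPTION: `profile` maps function name -> self time; the data is invented.

def tail_share(profile, threshold=0.005):
    """Return (count, share) of functions whose individual share of the
    total time is below `threshold` (0.005 = 0.5%)."""
    total = sum(profile.values())
    small = [t for t in profile.values() if t / total < threshold]
    return len(small), sum(small) / total

# Hypothetical flat profile: three hot functions plus 200 tiny ones.
profile = {"hot_a": 30.0, "hot_b": 16.0, "hot_c": 10.0}
profile.update({f"tiny_{i}": 0.22 for i in range(200)})

count, share = tail_share(profile)
print(f"{count} functions below 0.5% account for {share:.0%} of the time")
```

When the tail share is this large, optimizing any single function in it yields at most a fraction of a percent.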

slide-11
SLIDE 11

Fragmentation


From G. Eulisse

slide-12
SLIDE 12

Compilers

> The best tuning aid we could possibly imagine

  • Very conservative options: -O2, -fPIC
  • Value safety very important

> GCC base (recent GCC) + old system GLIBC

> ICC and LLVM slowly picked up

  • ICC for performance
  • -O3 very rarely used, -fast: never
  • LLVM for analysis and introspection

> PGO produces penalties (code paths hard or impossible to predict)


slide-13
SLIDE 13

Tools – functional requirements

> Memory related statistics

  • Layout on heap, page sharing, usage histograms
  • Allocations and deallocations (usage patterns, allocation patterns, pressure, layout)
  • Categorize by calling stack
  • Tracking down leaks

> Track IO bottlenecks easily

> Event based sampling

  • Per-function
  • Per-module
  • With stack traces

> Call graph building

> Non-technical guidelines:

  • Understandable by non-experts
  • OSS, work in RHEL, without root access
  • Stable and reliable on large code


slide-14
SLIDE 14

Performance tools used

> PMU based

  • earlier: perfmon2
  • perf (badly designed, painful to use; de facto standard)
  • Gooda from Google
  • Intel tools (Amplifier – worked on the alpha, SEP, PTU)
  • Some PAPI adoption

> Instrumentation

  • IgProf, Valgrind + friends (very popular)
  • PIN (slow)
  • Intel Amplifier
  • Intel Inspector (low success rate)

> Own tools

  • Not many tools work with large applications
  • Scripts, analyzers parsing raw data
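As an illustration of the "scripts parsing raw data" category, the sketch below reads counter values in the comma-separated shape that `perf stat -x,` emits and derives CPI. The sample text is made up, and the assumed field order (value, unit, event name, then further fields) differs between perf versions, so treat the column indices as an assumption to verify against your own output.

```python
import csv
import io

# Made-up sample in the assumed shape: value,unit,event,<further fields>.
SAMPLE = """\
1200000000,,cycles,1000000,100.00
1000000000,,instructions,1000000,100.00
<not counted>,,cache-misses,0,0.00
"""

def parse_counts(text):
    """Map event name -> integer count, skipping uncounted events."""
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3:
            continue  # blank or malformed line
        value, event = row[0], row[2]
        if value.isdigit():  # skips "<not counted>" / "<not supported>"
            counts[event] = int(value)
    return counts

counts = parse_counts(SAMPLE)
cpi = counts["cycles"] / counts["instructions"]
print(f"CPI = {cpi:.2f}")
```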


slide-15
SLIDE 15

PMU techniques employed

> Event Counting

  • Black-box studies and regression
  • Good for fragmentation

> EBS IP Sampling

  • Wide range of tuning activities
  • Low precision on our code
  • Bad in a fragmented scenario

> Time based sampling and time based displays of counts

  • Phase monitoring
  • Provides added value for discovery

> Experience: high level brings most value since localized optimization is hard
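Phase monitoring from time-based displays of counts can be sketched as follows: counters are read once per interval, a ratio such as CPI is derived for each interval, and large jumps between consecutive intervals mark candidate phase boundaries. The interval data, the 25% threshold and the function name are hypothetical.

```python
# Sketch: detect execution phases from per-interval counter readings.
# ASSUMPTION: each interval provides (cycles, instructions); data is invented.

def find_phase_changes(cpi_series, rel_jump=0.25):
    """Return indices where CPI differs from the previous interval
    by more than `rel_jump` (relative to the previous value)."""
    changes = []
    for i in range(1, len(cpi_series)):
        prev, cur = cpi_series[i - 1], cpi_series[i]
        if abs(cur - prev) / prev > rel_jump:
            changes.append(i)
    return changes

intervals = [(100, 100), (105, 100), (200, 100), (210, 100), (110, 100)]
cpi_series = [cycles / instructions for cycles, instructions in intervals]

phase_starts = find_phase_changes(cpi_series)
print("phase changes at intervals:", phase_starts)
```

A whole-run count would average these phases together, which is why the time-based view adds value for discovery.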


slide-16
SLIDE 16

Our issues with the PMU in a nutshell

“I have 100’000 cache misses more because of this choice of data structure – so what?” (actual quote from a senior developer)


slide-17
SLIDE 17

CERN/High Energy Physics Needs

> See next talk

> Ultimate goal: a simplified performance optimization process

  • It can only be achieved by striking a good balance between relieving the users of some of the burden and educating them about the microarchitecture at the same time

> Access to advanced information and data

  • Much of this is inaccessible today but the hardware is there

> Easier access to information

  • Visual reports; high level, composed reports based on advanced data

> Easier access to the right optimization directions

  • Extra data makes it possible to give extra advice

> More intelligent tuning enabled by higher-level conclusions


slide-18
SLIDE 18

Workshop structure

> Lectures and interactive discussions with optional hands-on

> Topics

  • Monitoring and tuning facilities (here: x86 and ARM)
  • Methodologies
  • Tools – open source and proprietary
  • Workloads: CERN needs, large workload specifics


slide-19
SLIDE 19

Speakers

> ARM

  • Al Grant
  • Michael Williams

> Calxeda

  • Robert Richter (also an AMD expert)

> CERN

  • Vincenzo Innocente

> Google

  • Maria Dimakopoulou
  • Stephane Eranian
  • David Levinthal

> Intel

  • Stanislav Bratanov
  • Michael Chynoweth
  • Ahmad Yasin

> Versailles Exascale Lab

  • Andres S. Charif-Rubial


slide-20
SLIDE 20

Thank you

Other questions? Andrzej.Nowak@cern.ch
