[PPT] - IgProf The ignominious profiler. A generic memory and performance PowerPoint Presentation

SLIDE 1

IgProf

The ignominious profiler. A generic memory and performance profiler for linux applications.

Giulio Eulisse

Fermi National Accelerator Laboratory

http://igprof.org

SLIDE 2

Opinions expressed in this presentation are mine, only mine, and solely mine…. buahahahah

SLIDE 3

Why a profiler?

SLIDE 4

Where I work

CMS experiment @ CERN LHC

SLIDE 5

Where I work

Over 4000 researchers...

SLIDE 6

Where I work

Source: http://cms.web.cern.ch/content/cms-collaboration

43 countries, 191 institutes

SLIDE 7

What do we do?

+ =

5c646f63756d656e74636c6173735b613470617065722c313270745d7b61727469636c657d0a0a5c7573657061636b61676 57b657073667d0a5c7573657061636b6167657b67726170686963737d0a0a0965787477696474683d362e35696e0a5c6f64 64736964656d617267696e3d302e30696e0a5c6576656e736964656d617267696e3d302e30696e0a0a0a6577636f6d6d616 e647b5c6d65747d7b0d6d5c2c2f5c215c215c215c21455f7b547d7d0a0a6577636f6d6d616e647b5c65747d7b0d6d20455f 547d0a0a6577636f6d6d616e647b5c7a7072696d657d7b5a5e5c7072696d657d0a0a086567696e7b646f63756d656e747d0 a0a6f696e64656e747b086620546865205374616e64617264204d6f64656c3a205468652046696e616c204c616772616e67 69616e7d0a0b73706163657b302e31696e7d0a0a0a6f696e64656e740a4672616e6369732048616c7a656e2c20616e64204 16c616e20442e204d617274696e2c0a7b5c697420517561726b7320616e64204c6570746f6e733a20416e20496e74726f64 7563746f727920436f7572736520696e204d6f6465726e205061727469636c6520506879736963732c7d0a4a6f686e20576 96c6579205c2620536f6e732c202831393834292e0a0b73706163657b302e31696e7d0a0a0a6f696e64656e740a546f2073 756d6d6172697a6520746865207374616e6461726420285765696e626572672d53616c616d29206d6f64656c2c207765206 7617468657220746f67657468657220616c6c207468650a696e6772656469656e7473206f6620746865204c616772616e67 69616e2e2054686520636f6d706c657465204c616772616e6769616e2069733a0a0a086567696e7b63656e7465727d0a086 567696e7b746162756c61727d7b726c6c7d0a202020245c6d61746863616c7b4c7d203d240a202020202020202020202020 2020202620242d0c7261637b317d7b347d207b086620577d5f7b5c6d750a757d205c63646f74207b086620577d5e7b5c6d7 50a757d0a2020202020202020202020202020202020202d0c7261637b317d7b347d20202020202020425f7b5c6d750a757d 20202020202020202020202020425e7b5c6d750a757d240a2020202020202020202020202020202620245c6c6566745c7b2 0086567696e7b61727261797d7b6c7d0a20202020202020202020202020202020202020202020202020205c6d626f787b57 7d5e7b5c706d7d2c205c6d626f787b5a7d2c205c67616d6d61205c6d626f787b206b696e657469637d205c0a20202020202 020202020202020202020202020202020202020205c6d626f787b656e65726769657320616e647d20202020202020202020 20202020202020202020202020202020202020205c0a20202020202020202020202020202020202020202020202020205c6 d626f787b73656c662d696e746572616374696f6e737d0a2020202020202020202020202020202020202020202020202020 5c656e647b61727261797d0a2020202020202020202020202020202020200d696768742e240a20202020202020202020202 02020205c20262026205c0a202020202020202020202020202020262024086567696e7b61727261797d7b6c7d0a20202020 20202020202020202020202020202b0861727b4c7d5c67616d6d615e7b5c6d757d5c6c6566742820695c7061727469616c5 f7b5c6d757d0a20202020202020202020202020202020202020202020202020202020202020202020202020202020202020 20202d670c7261637b317d7b327d7b0866200961757d205c63646f74207b086620577d5f7b5c6d757d0a202020202020202 0202020202020202020202020202020202020202020202020202020202020202020202020202d67270c7261637b597d7b32 7d425f7b5c6d757d0a20202020202020202020202020202020202020202020202020202020202020202020202020200d696 7687429204c205c0a2020202020202020202020202020202020202b0861727b527d5c67616d6d615e7b5c6d757d5c6c6566 742820695c7061727469616c5f7b5c6d757d0a2020202020202020202020202020202020202020202020202020202020202 020202020202020202020202020202d67270c7261637b597d7b327d425f7b5c6d757d0a2020202020202020202020202020 2020202020202020202020202020202020202020202020200d69676874292052205c0a20202020202020202020202020202 02020205c656e647b61727261797d240a2020202020202020202020202020202620245c6c6566745c7b20086567696e7b61 727261797d7b6c7d0a20202020202020202020202020202020202020202020202020205c6d626f787b6c6570746f6e20616 e6420717561726b7d202020202020202020205c0a20202020202020202020202020202020202020202020202020205c6d62 6f787b6b696e6574696320656e6572676965737d202020202020202020205c0a20202020202020202020202020202020202 020202020202020205c6d626f787b616e642074686569727d20202020202020202020202020202020205c0a202020202020 20202020202020202020202020202020202020205c6d626f787b696e746572616374696f6e7320776974687d20202020202 02020205c0a20202020202020202020202020202020202020202020202020205c6d626f787b577d5e7b5c706d7d2c205c6d 626f787b5a7d2c205c67616d6d61205c0a20202020202020202020202020202020202020202020202020205c656e647b617 27261797d0a2020202020202020202020202020202020200d696768742e240a2020202020202020202020202020205c2026 2026205c0a2020202020202020202020202020202620242b5c6c6566747c5c6c6566742820695c7061727469616c5f7b5c6 d757d0a20202020202020202020202020202020202020202020202020202020202020202d670c7261637b317d7b327d7b08 66200961757d205c63646f74207b086620577d5f7b5c6d757d0a20202020202020202020202020202020202020202020202 020202020202020202d67270c7261637b597d7b327d425f7b5c6d757d0a2020202020202020202020202020202020202020 20202020200d6967687429205c7068690a202020202020202020202020202020202020200d696768747c5e32202d2056285 c70686929240a2020202020202020202020202020202620245c6c6566745c7b20086567696e7b61727261797d7b6c7d0a20 202020202020202020202020202020202020202020202020205c6d626f787b577d5e7b5c706d7d2c205c6d626f787b5a7d2 c205c67616d6d612c205c6d626f787b20616e642048696767737d205c0a2020202020202020202020202020202020202020 2020202020205c6d626f787b6d617373657320616e647d20202020202020202020202020202020202020202020202020202 020202020202020205c0a20202020202020202020202020202020202020202020202020205c6d626f787b636f75706c696e 67737d2020202020202020202020202020202020202020202020202020202020202020202020205c0a20202020202020202 020202020202020202020202020202020205c656e647b61727261797d0a2020202020202020202020202020202020200d69 6768742e240a2020202020202020202020202020205c20262026205c0a2020202020202020202020202020202620242d284 75f310861727b4c7d5c70686920520a20202020202020202020202020202020202020202b475f320861727b4c7d5c706869 5f6320520a20202020202020202020202020202020202020202b5c6d626f787b6865726d697469616e20636f6e6a7567617 4657d292e240a2020202020202020202020202020202620245c6c6566745c7b20086567696e7b61727261797d7b6c7d0a20 202020202020202020202020202020202020202020202020205c6d626f787b6c6570746f6e20616e6420717561726b7d202 05c0a20202020202020202020202020202020202020202020202020205c6d626f787b6d617373657320616e647d20202020 202020205c0a20202020202020202020202020202020202020202020202020205c6d626f787b636f75706c696e6720746f2 048696767737d205c0a20202020202020202020202020202020202020202020202020205c656e647b61727261797d0a2020 202020202020202020202020202020200d696768742e24205c0a5c656e647b746162756c61727d0a5c656e647b63656e746 5727d0a0a0a6f696e64656e740a244c242064656e6f7465732061206c6566742d68616e646564206665726d696f6e20286c 6570746f6e206f7220717561726b2920646f75626c65742c20616e6420522064656e6f74657320610a72696768742d68616 e646564206665726d696f6e2073696e676c65742e0a0a5c656e647b646f63756d656e747d0a

Mostly, we use 1.3 terawatt hours yearly* to smash particles and we take digital pictures

f the result with a sophisticated camera.

Every single collision, a “RAW Event”, is roughly 1 MB. LHC delivers 10^9 events per second (i.e. 1 PB) when running, which become a few hundreds evt/s after a first level of (HW) “trigger”. We stored to disk roughly 10^9 events in 2011-2013 period

Source: http://home.web.cern.ch/about/engineering/powering-cern

SLIDE 8

What do we do?

We process detector data with in custom written software and quickly obtain PhD theses and conference papers with it.

SLIDE 9

Over 5M SLOCs custom code

Large C++ / Python / Fortran codebase, developed over almost two decades by over 1300 researchers ranging from master students to Nobel prize candidates. Software comprises complex pattern recognition algorithms, data analysis tools and simulations. Small core of professional software engineers mostly doing application framework, release integration and QA.

Large and diverse application set

Many different “workflows” depending on the the kind of simulation / analysis being done. Over 3000 shared objects chained via python configuration language. Each workflow has ~300 MB of CODE sections, sparse in ~600 shared libraries. Roughly 600’000 symbols present in the process image. Over 300 level deep call-trees are the norm. 2GB RSS footprint

n average, large memory churn (up to 1M allocation per second). No single offender.

Large working dataset

~12 PB of RAW data acquired in 2011-13, more than double that if we count the processed via the WLGC Grid using 100k cores sparse over 5 continents. We expect a 2-3 order of magnitude increase in data volume in the next 10 - 15 years.

CMSSW...

9

SLIDE 10

...the WLGC Grid...

Tier-0 (CERN): data recording, reconstruction and distribution Tier-1:   permanent storage, re- processing, analysis Tier-2:   Simulation, end- user analysis

WLCG: An International collaboration to distribute and analyse LHC data Integrates computer centres worldwide that provide computing and storage resource into a single infrastructure accessible by all LHC physicists.

10-100 Gb links > 2 million jobs/day 500 PB of storage ~350’000 cores nearly 170 sites,   40 countries

Courtesy of Ian Bird

SLIDE 11

CMSSW

(CMSSW_7_3_0)

Firefox

(34.0.5)

OpenOffice Writer

(4.1.1)

PovRay

(3.7)

Clang

(3.5)

SLOCs

5.5M 6.4M 4.7M 0.6M 0.9M

Initial release

2005 2002 2002 1991 2007

Contributors

>1300 >1200 >140 ~40 ~200

Typical memory footprint (RSS)

~2 GB ~0.3 GB ~0.2 GB ~0.2 GB

Primary languages

C++, Python, Fortran C / C++, Javascript C++ / Java C / C++ C / C++ / ObjC

CMSSW in perspective

11

SLIDE 12

We needed a memory leak detector, because

that’s where the first loop issues of big software stacks always are.

In 2003 a brave duo (Lassi Tuura and yours truly)

decided waiting for Valgrind (a fantastic tool, BTW) to finish was not an option.

MemProfLib was born

A bit of history

12

SLIDE 13

A one day prototype:

Malloc Hooks

Used __malloc_hook & c. to keep track of allocation / deallocations. Allocation not freed by the end of the program were reported as possible leaks. Code injected into programs via standard LD_PRELOAD mechanism, using atexit to trigger the dump.

Flat output

xml (sigh) output, analyzed by some XSLT magic (double sigh).

Instant gratification

It’s quite amazing how much something like this can already catch. Always prove yourself you can do something cool before actually over-designing your tools.

MemProfLib

13

SLIDE 14

... auto foo = new std::vector<SomeClass *>(); // sigh… for (int i = 0; i < 1000000; ++i) foo->push_back(new SomeClass()); // double sigh delete foo; // triple sigh.. ...

...any reference to real facts or persons is purely coincidental...

14

SLIDE 15

ignominy |ˈɪgnəӚmɪni|

noun [ mass noun ] public shame or disgrace: the ignominy of being imprisoned.

ORIGIN mid 16th cent.: from French ignominie or Latin

ignominia, from in- ‘not’ + a variant of nomen ‘name’.

SLIDE 16

ignominy |ˈɪgnəӚmɪni|

noun [ mass noun ] public shame or disgrace: the ignominy of being imprisoned.

ORIGIN mid 16th cent.: from French ignominie or Latin

ignominia, from in- ‘not’ + a variant of nomen ‘name’.

The Ignominious Profiler was born...

SLIDE 17

Performance & memory profiling, with backtraces Should work in managed environment

No kernel support required
No superuser privileges required

Target audience: people in CMS

Low overhead must be able to allow interactive usage
Results must be understandable to non software professionals

Target application: CMS software

Support for dynamic code / libraries
Multiplatform: x86 / x86_64 / ARM32 / ARM64

Key design decisions

17

SLIDE 18

Dynamic instrumentation (IgHook)
Memory (by hooking into malloc) and performance

profiler (via SIGPROF / SIGALRM)

Full backtrace information (via libunwind)
Analyser tool (igprof-analyse)
Simple web frontend (igprof-navigator)

Reworking the internals

18

SLIDE 19

Avoid extensions and platform specific APIs

__malloc_hook & c. are glibc specific.

Flexibility

We wanted to hook into more than just malloc (e.g. read / write statements).

Safety

We have to hook into various places to make sure we can catch things which interfere with the profiler (e.g. fork), or even attempts to disable it (e.g. explicit calls to signal). Moreover we can safely hook into exit and _exit and dump the profile at that point.

Dynamic instrumentation

19

SLIDE 20

We sit on the shoulders of giants

(1) Jeffrey Richter, “Load Your 32-bit DLL into Another Pro- cess’s Address Space Using INJLIB”, Windows System Journal, Vol 9 No 5, May 1994. (2) Shaun Clowes, “injectso: Modifying and Spying on Running Processes Under Linux and Solaris”, The Black Hat Briefings, 2001, Amsterdam, http:// www.blackhat.com/presentations/bh-europe-01/shaun-clowes/bh-europe-01- clowes.ppt (3) “DynInst: An Application Program Interface (API) for Run-time Code Generation”, http://www.dyninst.org/ (4) https://github.com/rentzsch/mach_inject (5) https://github.com/rentzsch/mach_override

Dynamic instrumentation

20

SLIDE 21

41 56 push %r14 41 55 push %r13 41 54 push %r12 55 push %rbp 53 push %rbx … rest of function body … c3 retq

Symbol start

}

Almost every symbol has a preamble to save registers, frame pointer, etc. We need enough bytes to replace part of it with a jump to our own code.

Dynamic instrumentation

21

SLIDE 22

e9 9f 3c 01 00 jmpq 0x41f004 90 nop 55 push %rbp 53 push %rbx 55 push %rbp 53 push %rbx … rest of function body … c3 retq 41 56 push %r14 41 55 push %r13 41 54 push %r12

Jump to wrapper function Patch area (we might have copied %rip relative instructions in the preamble)

Trampoline structure

Jump to original code

Trampoline Original

22

SLIDE 23

e9 9f 3c 01 00 jmpq 0x41f004 90 nop 55 push %rbp 53 push %rbx 55 push %rbp 53 push %rbx … rest of function body … c3 retq 41 56 push %r14 41 55 push %r13 41 54 push %r12

Jump to wrapper function Patch area (we might have copied %rip relative instructions in the preamble) Jump to original code

Trampoline structure

Original Trampoline

23

SLIDE 24

e9 9f 3c 01 00 jmpq 0x41f004 90 nop 55 push %rbp 53 push %rbx 55 push %rbp 53 push %rbx … rest of function body … c3 retq 41 56 push %r14 41 55 push %r13 41 54 push %r12

Jump to wrapper function Patch area (we might have copied %rip relative instructions in the preamble) Jump back to

riginal code

Jump to original code

Trampoline structure

Original Trampoline

24

SLIDE 25

. . .

riginal();

. . . caller jump trampoline next . . . return

riginal

jump wrapper first few instructions

f the original

jump next patch area trampoline . . . replacement(); . . . return wrapper . . . hook.chain(); . . . return replacement chain . . . hook 1 2 3 4 5 6 7 8 9

from Filip Nybäck thesis

25

SLIDE 26

Relocability

%rip relative instructions in the preamble need to be properly relocated.

x86 assembly complexity

Parsing the preamble on x86 is not trivial, due to complexity of instruction set.

Short branches on ARM64

On ARM64 we likely have only 1 instruction (4 bytes) available for the jump. We limit trampoline distance so that we can use a B instruction (±128MB jump ought to be enough for everyone…).

Specificity to the rescue

The problem is simplified by the fact that usually we care about an handful of symbols so the phase space of the problem is usually limited (malloc, calloc, exit, signal, fork, etc.). For this reason the above are not usually an issue for standard memory and performance profiling, however they might make generic instrumentation complicated (if not impossible at all).

Issues with DI

26

SLIDE 27

Hooks into malloc & c

It only profiles heap allocations do not expect your 64MB array on stack to popup.

Three different kind of counters:

MEM_TOTAL: sum of allocations in a call-path
MEM_LIVE: sum of allocations from a given call-path,

still present when profile dumps results

MEM_MAX: largest single allocation in call-path

For each counter we store the number of calls and the allocated bytes. “Peak” mode also available.

Memory profiling

27

SLIDE 28

Counts Calls Peak

MEM_LIVE

10 10 10

MEM_TOTAL

10 10 10

MEM_MAX

1 10 1

MEM_LIVE and MEM_TOTAL are the same if the is no deallocation

Memory profiling

28

SLIDE 29

Counts Calls Peak

MEM_LIVE

10 10 10

MEM_TOTAL

10 10 10

MEM_MAX

1 10 1

MEM_MAX tracks single allocations

Memory profiling

29

SLIDE 30

Counts Calls Peak

MEM_LIVE

55 10 55

MEM_TOTAL

55 10 55

MEM_MAX

10 10 10

Counts are the sum of allocations, calls are how many times we called malloc

Memory profiling

30

SLIDE 31

Counts Calls Peak

MEM_LIVE

10

MEM_TOTAL

55 10 55

MEM_MAX

10 10 10

Once you start deallocating the difference between MEM_LIVE and MEM_TOTAL is obvious

Memory profiling

31

SLIDE 32

Counts Calls Peak

MEM_LIVE

10

MEM_TOTAL

55 10 55

MEM_MAX

10 10 10

MEM_LIVE in peak mode gives the largest block of memory at one point in time

Memory profiling

32

SLIDE 33

Counts Calls Peak

MEM_LIVE

n n n

MEM_TOTAL

𝚻1..n

𝚻1..n 𝚻1..n

MEM_MAX

n n n

<igprof started here> ... for (int n = 1; n <= 10; ++n) { void *x = malloc(1); <igprof dumps report*> free(x); } ... MEM_LIVE becomes a “likely leaks” checker!

Dumps can be triggered by calling a public symbol or by writing to a file specified by the -D

ption. In this particular

case you get n reports.

Memory profiling

33

SLIDE 34

Malloc overhead profiler (-mo)

It’s pretty common especially in “naive” Object Oriented code to underestimate the overhead of small allocations. For example malloc on x86_64 allocates at 16 bytes borders and keeps at least one extra pointer for each allocation. The - mo option tries to account for “unseen” (by the programmer) overhead of small allocations via malloc.

RSS profiling

When you are trying to reduce the RSS of your application (for example because your batch systems kills you if you are over 2GB of RSS), it’s important to remember that memory always gets mapped into real memory in pages. A single 1 byte allocation in a new page will swap in the associated 4Kb page. IgProf dumps contain the heap dump, and can report pages rather than bytes.

Allocations are not what they seem

34

SLIDE 35

Allocations are not what they seem

Some CMSSW memory map

35

SLIDE 36

More goodies

File descriptor profiling

The previous concepts can actually be extended for any workflow which handles generic “resources” with a “cost” attached. For example you can count the number of writes / reads to a file by hooking into read / write.

Tracing exceptions

A very common pattern we had was to use C++ exceptions as a way to communicate between different parts of the program. This is both slow and leads to unmaintainable code.

Empty memory profiler

Work done by Jakob Blomer @ CERN / SFT. Useful to tune your I/O buffers. On allocation, fill memory with some magic pattern (usually zeros or 0xaa depending

n what we are looking for). On free, scan for the same magic pattern counting

untouched 4KB pages. At profile dump we report untouched pages.

36

SLIDE 37

4K page

char MyBuffer*

4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page

0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa .........

On allocation, we fill memory with some magic pattern (usually zeros or 0xaa depending on what we are looking for).

*for simplicity sake let’s assume the buffer is 4KB aligned, works similarly if not.

Empty memory profiling

37

SLIDE 38

4K page

char MyBuffer*

4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page

0x24 0x20 0x0a 0x61 0x62 0x6c 0x61 0x09 0x69 0x6d 0x65 0x73 0x7b 0x08 0x66 0x20 0x48 0x7d 0x20 0x3d 0x20 0x7b 0x08 0x66 0x20 0x4a 0x7d 0x20 0x2b 0x20 0x7b 0x7b 0x5c 0x70 0x61 0x72 0x74 0x69 0x61 0x6c 0x7b 0x08 0x66 0x20 0x44 0x7d 0x7d 0x5c 0x6f 0x76 0x65 0x72 0x7b 0x5c 0x70 0x61 0x72 0x74 0x69 0x61 0x6c 0x20 0x74 0x7d 0x7d 0x20 0x24 0x0a .........

As the buffer is used, the magic pattern is lost.

*for simplicity sake let’s assume the buffer is 4KB aligned, works similarly if not.

Empty memory profiling

38

SLIDE 39

4K page

char MyBuffer*

4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page 4K page

On deallocation

f the buffer we

report all the pages which still has the magic pattern intact

4K page

0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa 0xaa ......... *for simplicity sake let’s assume the buffer is 4KB aligned, works similarly if not.

Empty memory profiling

39

SLIDE 40

How it works

Uses SIGPROF to have time uniform callbacks every ~ 1/100s. Callback stores the backtrace of where the signal happened. Supports both CPU and wall-clock

time. Biggest advantage is the limited interference with the program itself.

It does converge

If you wait long enough, this actually converges to the right distribution of time spent in any given function. Works brilliantly for repetitive payloads. Unsurprisingly results correlate with MEM_TOTAL.

Not so easy

Of course the difficult part here is making sure you are not interrupting something which shall not be interrupted. For example igprof has problems when a signal happens in dl_iterate_phdr. Another issue was making sure that we accounted for the time spent in fork due to the large RSS which had to be duplicated.

Performance profiling

40

SLIDE 41

< 0.5% 0.5 - 1% Time spent in functions with more than 1% of contribution

Vast majority of the time is spent in functions which themselves have little contributions. Say thanks to C++ encapsulation.

41

SLIDE 42

< 0.5% Number of functions with more than 0.5% of contribution

…and things look even worse if we count the symbols, rather than their contribution… Only 27 symbols have more than 0.5% of the time spent in them, out of ~6000 which gets counts...

42

SLIDE 43

void bar(int i) { malloc(i); } void foo() { malloc(1); bar(1); } int main(int, char **) { foo(); bar(2); }

In large programs what is actually interesting is to know not only which functions allocated most of the memory, but why.

Call tree

43

SLIDE 44

void bar(int i) { malloc(i); } void foo() { malloc(1); bar(1); } int main(int, char **) { foo(); bar(2); }

main foo

Counts

Calls Peak

MEM_LIVE

1 1 1

MEM_TOTAL

1 1 1

MEM_MAX

1 1 1

Every time an instrumented function is called we determine the full calltree.

Call tree

44

SLIDE 45

void bar(int i) { malloc(i); } void foo() { malloc(1); bar(1); } int main(int, char **) { foo(); bar(2); }

bar main foo

Counts

Calls Peak

MEM_LIVE

1 1 1

MEM_TOTAL

1 1 1

MEM_MAX

1 1 1

Counts

Calls Peak

MEM_LIVE

1 1 1

MEM_TOTAL

1 1 1

MEM_MAX

1 1 1

Call tree

45

SLIDE 46

void bar(int i) { malloc(i); } void foo() { malloc(1); bar(1); } int main(int, char **) { foo(); bar(2); }

bar main foo bar

Counts

Calls Peak

MEM_LIVE

2 1 2

MEM_TOTAL

2 1 1

MEM_MAX

2 1 2

Counts

Calls Peak

MEM_LIVE

1 1 1

MEM_TOTAL

1 1 1

MEM_MAX

1 1 1

Counts

Calls Peak

MEM_LIVE

1 1 1

MEM_TOTAL

1 1 1

MEM_MAX

1 1 1

Call tree

46

SLIDE 47

Backtrace using libunwind

This was initially done using backtrace(), but then we switched to libunwind*, due to reliability issues.

Optimizing libunwind

In our application we had up to 1M allocations per second which made memory profiling too slow due to many unwindings. Lassi implemented a fast path which does not do the full DWARF unwind but a simpler stack walk with fallback to the full blown method only when it does not work.

Contributing back to libunwind

x86_64 implementation by Lassi Tuura
ARM32 / ARM64 implementation by Filip Nybäck (30x speedup reported!)
Volunteers wanted to implement it on Power Architecture ;-)

*: http://www.nongnu.org/libunwind/

Fast path backtracing

47

SLIDE 48

Why not expanding beyond malloc?

Give user the flexibility to hook into any generic function. Needs to know the mangled symbol name and register signature.

Precise results

CALL_TIME: total time spent in the requested function (via RDTSC)
CALL_COUNT: number of times a function has been called

Support GCC -finstrument-functions

When recompiling is an option, GCC and others support instrumenting every function (including inlines) via the -finstrument-functions flag which introduces __cyg_profile_func_enter and __cyg_profile_func_exit entry / exit points. Overhead too large to actually think about doing it for every compilation unit, probably more interesting to understand exactly who calls what.

Generic Instrumentation

48

SLIDE 49

What is it?

Performance Application Programming Interface (PAPI) provides an interface and methodology for use of the performance counter hardware found in most major microprocessors. Very nice interface to read performance counters (using either of perf_events, PerfCtr, Perfmon). Requires kernel component so it’s optional in igprof. http://icl.cs.utk.edu/papi/.

How it works?

Similar to the Performance profiler, we get counted at regular intervals and check PAPI counters at that point. If the counter overflows we count 1 in the igprof results. While this should converge, deciding the overflow level is currently completely empiric.

Energy measurements

Recently there has been big noise about “Power efficient computing”. Intel provides nice counters which allow to measure power consumption with a low granularity (RAPL). Given PAPI in particular supports RAPL, we have started doing tests using it. The main problem is that energy consumption is really a global quantity, not a local (to the application) one. Not clear if it simply correlates to CPU performance. Different metrics (e.g. how many times CPU changes state)?

PAPI Support

49

SLIDE 50

igprof-analyse

Dumping a 50MB gzipped file full of profile data is useless if you cannot extract information from it. igprof-analyse takes a profile dump and produces (somewhat) human readable reports from it. Tries hard to be accurate when doing symbol name demangling, using nm and gdb. It also aggregates call-paths, allows applying various filters on the call-tree, both by changing contents of single nodes and by merging nodes together. Exports results in a gprof like text file, sqlite db, or JSON object

igprof-navigator

A poor man cgi script / python server which allows you to navigate results via a web interface. Uses the sqlite report as input.

Analyzing results

50

SLIDE 51

gprof like report format

For each node in the callpath, we aggregate edge information for nodes going into the selected node (i.e. callers) and nodes going out (i.e. callees). Considered node symbol is indented to highlight it.

bar main foo bar

Rank % total Self Self / Children Calls / Total Function [1] 100.0 4 0 / 4 3 main 50.0 ........ 2 / 3 1 / 2 bar(int) [2] 50.0 ........ 2 / 2 2 / 2 foo() [3]

Analyzing results

51

SLIDE 52

gprof like report format

Lines below the selected symbol are callees.

bar main foo bar

Rank % total Self Self / Children Calls / Total Function [1] 100.0 4 0 / 4 3 main 50.0 ........ 2 / 3 1 / 2 bar(int) [2] 50.0 ........ 2 / 2 2 / 2 foo() [3]

Analyzing results

52

SLIDE 53

gprof like report format

Lines below the selected symbol are callers.

bar main foo bar

Rank % total Self Self / Children Calls / Total Function 25.0 ........ 1 / 2 1 / 2 foo() [3] 50.0 ........ 2 / 4 1 / 3 main [1] [2] 75.0 3 3 / 0 2 bar(int)

Analyzing results

53

SLIDE 54

gprof like report format

Lines below the selected symbol are callers.

bar main foo bar

Rank % total Self Self / Children Calls / Total Function 50.0 ........ 2 / 4 2 / 3 main [1] [3] 50.0 2 1 / 1 2 foo() 25.0 ........ 1 / 3 1 / 2 bar(int) [2]

Analyzing results

54

SLIDE 55

zip main foo bar

Sometimes it can be useful to rename nodes so that slightly different versions

f the same are

merged together.

Merging nodes in reports

55

SLIDE 56

zip main foo bar zip main bar bar

mr “s/foo/bar/”

Merging nodes in reports

56

SLIDE 57

bar zip main bar zip main bar

Merging nodes in reports

57

SLIDE 58

Surviving template-ed designs

This is actually extremely useful in case of heavily template designs where (for example) you have methods template-ed on their inputs. class Foo { template <typename A> void useA(const A &a) { <do something with A> } }; While normally each useA<A> would appear as a single instance (different symbol!) using something like -mr “s/.*useA.*/Foo::useA/” allows merging all the various small contribution from each instantiation into one single contribution.

Merging nodes in reports

58

SLIDE 59

bar main foo bar

libbar.so a.out libbar.so libfoo.so

l

Might not be so interesting for few libraries, but it is if you have 600 between libraries and plugins

Merging libraries

SLIDE 60

Sometimes we want to distinguish between same symbol which gets invoked from different call path. One simple solution is to rename symbols associated to a node according to their parents.

bar main foo bar

Splitting by ancestor

60

SLIDE 61

bar main foo bar

Splitting by ancestor

fromFoo

main foo bar

ma “foo>bar/fromFoo”

61

SLIDE 62

Instant gratification

igprof–navigator is a simple CGI script which we use to present reports as webpages. Especially in an environment of non software developers, it is important to have a clickable report at which experts can use to drive non-experts through performance optimization.

Clang examples

MEM_LIVE, x86_64, dumped while compiling a monster machine generated file: http:// cern.ch/go/k9Sw

Firefox benchmark (RoboHornet) examples

MEM_TOTAL, x86_64: http://cern.ch/go/nG7C , PERF_TICKS, x86_64: http://cern.ch/go/H9ct

CMSSW examples

…and for the High Energy Physics enthusiasts among you, a few CMS experiment software examples: PERF_TICKS, x86_64: http://cern.ch/go/NWg9, PERF_TICKS, ARM64: http://cern.ch/go/ 7M9D

IgProf web

62

SLIDE 63

IgProf web

63

SLIDE 64

IgProf web

64

SLIDE 65

We use Free and Open Source Software

CMS experiment is a eager user of FOSS, and we evangelise with other experiments and CERN about its benefits. We actively report bugs and provide patches to tools like gcc, glibc and others.

We write FOSS

IgProf itself is GPLv2. Lassi and Filip contributed back to libunwind the fast path tracing which is now in upstream and available for everyone to use.

We train to use FOSS

We see lots of students coming through CERN and they all get a full immersion in using FOSS tools for their work. Kids get in they only know about Visual Studio and get out being vim wizards.

Contributing to FOSS

65

SLIDE 66

Patches & ideas welcome CERN / CMS institutes summer programs

CERN / CMS institutes have various student programs, where IgProf is usually one of possible projects

CERN @ GSoC

CERN has been mentoring organization for the last few years: http://cern.ch/go/ KM7W.

IgProf was part of GSoC2014

Great work by Filip Nybäck from Aalto University

ARM64 port itself.
Fast path backtrace in libunwind also for ARM64
PAPI support and initial energy profiling as a bonus

Working on IgProf

66

SLIDE 67

Support for more architectures

We have a POWER7 system and we might get a POWER8 dev board. x32 support is also missing and requested by a few people. This should probably in a contribution to libunwind as well.

“Big Data” igprof

When running igprof on the validation of your integration builds you end up with a bunch

f data which from which to extract sensible information. Right now this has to be done by
hand. The idea would be pushing igprof reports to some key-value storage and the use

automated tools to spot problems in your code.

Python / Javascript backtrace support

Right now, when profiling mixed C / C++ code invoking / invoked by scripting languages like python or javascript, igprof will happily show the cost of the interpreter / JIT itself, not the actual python code. The idea would be to instrument…

Profile more counters via PAPI

Ideas for GSoC 2015?

67

SLIDE 68

License x86 ARM

Power Architecture

HW counters

Generic Instrumentation

Heap Sampling kernel / root igprof GPL

✔ ✔ ✔* ✔ ✔ ✔

gprof GPL

✔ ✔ ✔ ✔

Google Performance Tools

BSD

✔ ✔ ✔ ✔ ✔

Oprofile GPL

✔ ✔ ✔ ✔ ✔

perfctr GPL

✔ ✔ ✔ ✔ ✔ ✔

perfmon2 BSD

✔ ✔ ✔ ✔ ✔ ✔

Pick your favorite

...and many others...

68

SLIDE 69

Andreas Pfeiffer, Chris Jones,  David Abdurachmanov, Filip Nybäck, Ian Bird, Jakob Blomer, Jukka Nurminen, Lassi Tuura, Liz Sexton-Kennedy, Lothar Bauerdick, Lucas Taylor, Matevž Tadel, Mikko Kurkela, Peter Elmer, René Meusel, Robert Lupton, Shahzad Muzaffar, Stephen Reucroft, Vincenzo Innocente

All Tintin pictures claimed to be “Fair use”, all trademarks are property of their respective owners.

Thanks!

SLIDE 70

@ktf giulio.eulisse@cern.ch @ktf http://igprof.org

Contact info

70