Memoro: Scaling an LLVM-Based Heap Profiler


slide-1
SLIDE 1

Memoro: Scaling an LLVM-Based Heap Profiler

Thierry Treyer, Performance & Capacity Intern

Mark Santaniello, Performance & Capacity Engineer

James Larus, EPFL IC School Dean

1

slide-2
SLIDE 2

vector<BigT> getValues(
    map<Id, BigT>& largeMap,
    vector<Id>& keys) {
  vector<BigT> values;
  values.reserve(largeMap.size());
  for (const auto& key: keys)
    values.emplace_back(largeMap[key]);
  return values;
}

2

slide-3
SLIDE 3

40 GiB of DRAM wasted per server

3

slide-4
SLIDE 4

vector<BigT> getValues(
    map<Id, BigT>& largeMap,
    vector<Id>& keys) {
  vector<BigT> values;
  values.reserve(largeMap.size());
  for (const auto& key: keys)
    values.emplace_back(largeMap[key]);
  return values;
}

4

slide-6
SLIDE 6

vector<BigT> getValues(
    map<Id, BigT>& largeMap,
    vector<Id>& keys) {
  vector<BigT> values;
  values.reserve(keys.size());
  for (const auto& key: keys)
    values.emplace_back(largeMap[key]);
  return values;
}
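
The difference between the two `reserve` calls can be checked with a small experiment. The sketch below is illustrative only: `Id`, `BigT`, the two function names, and the `wastedBytes` helper are stand-ins that do not come from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using Id = int;                       // assumption: stand-in for the slide's Id
struct BigT { char payload[1024]; };  // assumption: stand-in for the slide's BigT

// Version from the earlier slides: capacity follows the size of the map.
std::vector<BigT> getValuesOverReserved(std::map<Id, BigT>& largeMap,
                                        const std::vector<Id>& keys) {
  std::vector<BigT> values;
  values.reserve(largeMap.size());
  for (const auto& key : keys) values.emplace_back(largeMap[key]);
  return values;
}

// Version from this slide: capacity follows the number of requested keys.
std::vector<BigT> getValuesFixed(std::map<Id, BigT>& largeMap,
                                 const std::vector<Id>& keys) {
  std::vector<BigT> values;
  values.reserve(keys.size());
  for (const auto& key : keys) values.emplace_back(largeMap[key]);
  return values;
}

// Bytes that were reserved by the vector but hold no element.
std::size_t wastedBytes(const std::vector<BigT>& v) {
  assert(v.capacity() >= v.size());
  return (v.capacity() - v.size()) * sizeof(BigT);
}
```

With a map of 1,000 entries and 10 requested keys, the first version keeps room for at least 990 unused `BigT` objects for as long as the returned vector lives, while the second reserves only what it fills.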

6

slide-10
SLIDE 10

LLVM-Based Profiler

  • LLVM: Manipulate the IR
  • Sanitizers: Infrastructure
  • Memoro: Collecting and displaying data

7

slide-11
SLIDE 11

Visualizer Open Challenges Run-Time Overhead

Memoro +

8

slide-16
SLIDE 16

Overview

  • Source Code: No modification
  • Compile, INSTRUMENTATION PASS (LLVM): Instrument loads/stores, Instrument intrinsics, Collect types
  • Run, RUN-TIME (COMPILER-RT): Intercept alloc/free, Intercept loads/stores, Intercept syscalls, Collect stats
  • Analyze, VISUALIZER (ELECTRON): Score AP, Guide exploration

9

slide-18
SLIDE 18

Visualizer Open Challenges Run-Time Overhead

Memoro +

10

slide-20
SLIDE 20

1,000x

slowdown due to Memoro's run-time

11

slide-21
SLIDE 21

Run-Time Sampling

int sample_count = 0;

void interceptLoadStore(…) {
  // Sample accesses
  if (sample_count++ % access_sampling_rate != 0)
    return;
  /* Process access... */
}

12

slide-22
SLIDE 22

Run-Time Sampling

THREADLOCAL int sample_count = 0;

void interceptLoadStore(…) {
  // Sample accesses
  if (sample_count++ % access_sampling_rate != 0)
    return;
  /* Process access... */
}
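
A minimal, self-contained sketch of the thread-local sampling shown on this slide. It uses the standard `thread_local` keyword in place of the slide's `THREADLOCAL` macro. The `processed_count` counter, the `shouldSample` helper, and the `const void*` parameter are assumptions added for illustration, since the slide elides the real arguments and the processing step.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the sampling rate is a plain global set by the user.
static std::uint64_t access_sampling_rate = 4;

// One counter per thread, so threads never contend on a shared variable.
static thread_local std::uint64_t sample_count = 0;

// Hypothetical stand-in for the real work: counts processed accesses.
static thread_local std::uint64_t processed_count = 0;

// Returns true for one access out of every `access_sampling_rate`.
bool shouldSample() {
  assert(access_sampling_rate > 0);
  return sample_count++ % access_sampling_rate == 0;
}

void interceptLoadStore(const void* addr) {
  (void)addr;  // the address would be looked up in the allocator here
  if (!shouldSample()) return;
  ++processed_count;  // stand-in for "Process access..."
}
```

With `access_sampling_rate = 4`, 100 intercepted accesses on one thread lead to 25 processed accesses.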

12

slide-23
SLIDE 23

Power to the user!

MEMORO_OPTIONS="…" ./myapp

  • access_sampling_rate
  • ...

// Public API: memoro_interface.h
#include <memoro_interface.h>

void foo(…) {
  MemoroFlags *mflags = memoro::getFlags();
  mflags->access_sampling_rate = 50;
  /* ... */
}
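
How an option string might be read can be sketched as follows. This is not Memoro's parser: the `name=value` syntax and the colon separator are assumptions borrowed from the convention of other sanitizer option variables such as `ASAN_OPTIONS`, and `parseIntOption` and `samplingRateFromEnv` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: find `name=value` in a colon-separated option string
// and return the value as an int, or `fallback` when the name is absent.
int parseIntOption(const std::string& options, const std::string& name,
                   int fallback) {
  assert(!name.empty());
  std::size_t pos = 0;
  while (pos <= options.size()) {
    std::size_t end = options.find(':', pos);
    if (end == std::string::npos) end = options.size();
    const std::string item = options.substr(pos, end - pos);
    const std::size_t eq = item.find('=');
    // Match only when the text before '=' is exactly `name`.
    if (eq != std::string::npos && item.compare(0, eq, name) == 0) {
      return std::atoi(item.c_str() + eq + 1);
    }
    pos = end + 1;
  }
  return fallback;
}

// Reads the sampling rate from the MEMORO_OPTIONS environment variable.
int samplingRateFromEnv(int fallback) {
  const char* env = std::getenv("MEMORO_OPTIONS");
  return env ? parseIntOption(env, "access_sampling_rate", fallback) : fallback;
}
```

Under these assumptions, running `MEMORO_OPTIONS="access_sampling_rate=50" ./myapp` would make `samplingRateFromEnv(1)` return 50.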

13

slide-24
SLIDE 24

99%

🕶

14

slide-25
SLIDE 25

Time spent by address type

[Bar chart, axis 0% to 100%. Categories: Primary Heap, Secondary Heap, Not Heap]

99%

🕶

14

slide-26
SLIDE 26

Time spent by address type

[Bar chart, axis 0% to 100%. Categories: Primary Heap, Secondary Heap, Stack]

99%

🕶

14

slide-28
SLIDE 28

The Allocators

  • Primary: O(1)
  • Secondary (large allocations): O(n)

ld 0x…

Metadata: Addr, Size, First Access Time, Access Range Low, …

🔓

🕶

15

slide-33
SLIDE 33

🕶

Issue with non-heap addresses

Stack … Heap

  1. Allocators only know about heap
  2. Traverse all allocations to discard them
  3. Takes a global lock

[Bar chart, axis 0% to 100%. Categories: Primary Heap, Secondary Heap, Stack]

16

slide-37
SLIDE 37
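Run-Time Filter

  1. Thread start: store stack top
  2. Get current stack bottom
  3. Discard if the address is in this range

Stack … Heap

Stack top: 0xABCD. Current stack bottom: 0xAAAA. Example addresses: 0xAABB (inside the range, discarded) and 0x1234 (outside the range, kept).

17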
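
The three steps of the filter can be sketched as below. This is an illustration under stated assumptions, not Memoro's run-time code: it assumes a stack that grows downward, approximates the stack top and the current stack bottom with the address of a local variable, and uses hypothetical names (`inStackRange`, `recordStackTop`, `shouldDiscard`).

```cpp
#include <cassert>
#include <cstdint>

// Step 3 as a pure check: is `addr` between the current bottom and the top?
// Assumption: the stack grows downward, so bottom <= top.
bool inStackRange(std::uintptr_t addr, std::uintptr_t bottom,
                  std::uintptr_t top) {
  return bottom <= addr && addr <= top;
}

// Per-thread stack top, recorded once when the thread starts.
static thread_local std::uintptr_t stack_top = 0;

// Step 1: call at thread start. The address of a local in this frame is
// used as an approximation of the top of the thread's stack.
void recordStackTop() {
  char marker = 0;
  stack_top = reinterpret_cast<std::uintptr_t>(&marker);
}

// Steps 2 and 3: approximate the current bottom with a local in the
// current frame, then discard addresses that fall inside the range.
bool shouldDiscard(const void* addr) {
  assert(stack_top != 0 && "recordStackTop() must run first on this thread");
  char marker = 0;
  const auto bottom = reinterpret_cast<std::uintptr_t>(&marker);
  return inStackRange(reinterpret_cast<std::uintptr_t>(addr), bottom,
                      stack_top);
}
```

With the values from the slide, `inStackRange(0xAABB, 0xAAAA, 0xABCD)` is true and `inStackRange(0x1234, 0xAAAA, 0xABCD)` is false. The check is two comparisons on thread-local data, so it avoids both the allocation traversal and the global lock.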

slide-38
SLIDE 38

Time spent by address type

[Bar chart, axis 0% to 100%. Legend: Primary Heap, Secondary Heap, Not heap, Stack Filtered. Callouts: 0.58%, 0.58%, 99%, <2%]

18

slide-39
SLIDE 39

1,000x

slowdown due to Memoro's run-time

19

slide-40
SLIDE 40

5x

slowdown due to Memoro's run-time

20

slide-41
SLIDE 41

Visualizer Open Challenges Run-Time Overhead

Memoro +

21

slide-43
SLIDE 43

100,000+ Stack Traces

1B+ Allocations

22

slide-44
SLIDE 44

23

slide-46
SLIDE 46

Truncate

[Histogram: Score (axis 0% to 40%) by Bin Size (100, 300, 1k, 3k, 10k, 30k, 100k); one region is labeled HIDE]

24

slide-47
SLIDE 47

25

slide-48
SLIDE 48 BEFORE AFTER

25

slide-49
SLIDE 49

VS.

foo() bar()

Death by a thousand cuts

main() . . . main() . . .

26

slide-58
SLIDE 58

main() main()

Death by a thousand cuts

bar() . . . foo() . . .

27

slide-59
SLIDE 59

Memoro +

28

slide-60
SLIDE 60

vector<BigT> getValues(
    map<Id, BigT>& largeMap,
    vector<Id>& keys) {
  vector<BigT> values;
  values.reserve(largeMap.size());
  for (const auto& key: keys)
    values.emplace_back(largeMap[key]);
  return values;
}

29

slide-61
SLIDE 61

Demo

30

slide-62
SLIDE 62

Visualizer Open Challenges Run-Time Overhead

Memoro +

31

slide-64
SLIDE 64

Dumping Profile

AtExit()

Your regular service

32

slide-72
SLIDE 72

Dumping Profile

AtExit()

Facebook service

lldb

call AtExit()

  • a. Signal to dump (SIGPROF)
  • b. Ring buffer + Periodic write
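
Option (a) could look roughly like the sketch below. It is an assumption-laden illustration, not Memoro's implementation: the handler only sets a flag, and a hypothetical `dumpIfRequested` function, called from a safe point in the run-time, performs the dump. The `dumps_written` counter stands in for actually writing the profile.

```cpp
#include <cassert>
#include <csignal>

// Set by the signal handler, consumed by the run-time at a safe point.
static volatile std::sig_atomic_t dump_requested = 0;

// Hypothetical stand-in for writing the profile to disk.
static int dumps_written = 0;

// Keep the handler minimal: only set a flag.
extern "C" void onDumpSignal(int) { dump_requested = 1; }

// Installs the handler for SIGPROF (a POSIX signal, as named on the slide).
bool installDumpSignal() {
  return std::signal(SIGPROF, onDumpSignal) != SIG_ERR;
}

// Called periodically by the run-time; returns true when a dump happened.
bool dumpIfRequested() {
  if (!dump_requested) return false;
  dump_requested = 0;
  ++dumps_written;  // the real run-time would write the profile here
  return true;
}

// Usage example: simulate an operator sending the signal to this process.
bool demoDumpOnSignal() {
  const bool installed = installDumpSignal();
  assert(installed);
  (void)installed;
  std::raise(SIGPROF);
  return dumpIfRequested();
}
```

From a shell, the equivalent trigger would be `kill -PROF <pid>` against the running service, which avoids attaching a debugger to call `AtExit()`.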

32

slide-77
SLIDE 77

Compile-Time Stack Analysis

llvm::GetUnderlyingObject()

[Chart: Ratio and Instrumented load/store (axis ticks 22500, 45000, 67500, 90000) for GetUnderlyingObject(depth = X), with X = 1, 2, 4, 8]

ld/st, bar(), foo()

33

slide-79
SLIDE 79

Thank you!

Thierry Treyer, Performance & Capacity Intern

Mark Santaniello, Performance & Capacity Engineer

James Larus, EPFL IC School Dean

34

github.com/epfl-vlsc/memoro