uniprof: Transparent Unikernel Performance Profiling & Debugging (PowerPoint PPT Presentation)



SLIDE 1

uniprof: Transparent Unikernel Performance Profiling & Debugging

Florian Schmidt, Research Scientist, NEC Europe Ltd.


SLIDE 6

Unikernels?

▌Faster, smaller, better!
▌But ever heard this?
  "Unikernels are hard to debug. Kernel debugging is horrible!"
▌Then you might say
  "But that's not really true! Unikernels are a single linked binary. They have a shared address space. You can just use gdb!"
▌And while that is true…
▌… we are admittedly lacking tools
▌Such as effective profilers

clip arts: clipproject.info


SLIDE 11

Enter uniprof

▌Goals:
  Performance profiler
  No changes to profiled code necessary
  Minimal overhead → useful in production environments

▌So, a stack profiler
  Collect stack traces at regular intervals
  Many of them
  Analyze which code paths show up often
  • Either because they take a long time
  • Or because they are hit often
  Point towards potential bottlenecks

Example traces:
  call_main+0x278 main+0x1c schedule+0x3a monotonic_clock+0x1a
  call_main+0x278 main+0x1c netfront_rx+0xa
  call_main+0x278 main+0x1c netfront_rx+0xa netfront_get_responses+0x1c netfrontif_rx_handler+0x20 netfrontif_transmit+0x1a0 netfront_xmit_pbuf+0xa4
  call_main+0x278 main+0x1c blkfront_aio_poll+0x32
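The stack-profiler idea on this slide can be condensed into a few lines. This is an illustrative Python sketch, not uniprof's actual code: the traces are hard-coded stand-ins for real samples, with names taken from the example traces above.

```python
# Illustrative sketch of the stack-profiler idea: take many stack traces and
# count how often each call path appears. The traces are hard-coded stand-ins
# for samples taken at regular intervals.
from collections import Counter

samples = [
    ("call_main+0x278", "main+0x1c", "netfront_rx+0xa"),
    ("call_main+0x278", "main+0x1c", "netfront_rx+0xa"),
    ("call_main+0x278", "main+0x1c", "blkfront_aio_poll+0x32"),
    ("call_main+0x278", "main+0x1c", "schedule+0x3a", "monotonic_clock+0x1a"),
]

# A path is "hot" whether it is slow (long-running) or simply hit often:
# sampling cannot tell the two apart, it only shows where time is spent.
hot_paths = Counter(samples)
for path, count in hot_paths.most_common():
    print(count, " -> ".join(path))
```

Note that the sampling itself only records addresses; which path is hot falls out of counting identical call paths afterwards.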


SLIDE 14

xenctx

▌Turns out, a stack profiler for Xen already exists
  Well, kinda
▌xenctx is bundled with Xen
  Introspection tool
  Option to print call stack
▌So if we run this over and over, we have a stack profiler
  Well, kinda

$ xenctx -f -s <symbol table file> <DOMID>
[...]
Call Trace:
  [<0000000000004868>] three+0x58 <--
  00000000000ffea0: [<00000000000044f2>] two+0x52
  00000000000ffef0: [<00000000000046a6>] one+0x12
  00000000000fff40: [<000000000002ff66>]
  00000000000fff80: [<0000000000012018>] call_main+0x278


SLIDE 17

xenctx

▌Downside: xenctx is slow
  Very slow: 3ms+ per trace
  Doesn't sound like much, but really adds up (e.g., 100 samples/s = 300ms/s)
  Can't really blame it, it was not designed as a fast stack profiler
▌Performance isn't just a nice-to-have
  We interrupt the guest all the time
  Can't walk the stack while the guest is running: race conditions
  High overhead can influence results!
  Low overhead is imperative for use on production unikernels
▌First question: extend xenctx or write something from scratch?
  Spoiler: look at the talk title
  More insight when I come to the evaluation
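The overhead figure quoted on the slide is easy to check with a back-of-the-envelope calculation (values taken from the slide):

```python
# Why 3 ms per trace is too slow: the guest is paused for every sample.
per_trace_s = 3e-3        # xenctx: at least 3 ms per stack trace
samples_per_s = 100       # example sampling rate from the slide
pause_per_s = per_trace_s * samples_per_s   # guest pause per wall-clock second
print(f"{pause_per_s * 1000:.0f} ms paused per second = {pause_per_s:.0%} overhead")
# prints: 300 ms paused per second = 30% overhead
```

At that rate the profiler itself dominates, which is exactly the "high overhead can influence results" problem.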

SLIDE 23

What do we need?

▌Registers (for FP, IP)
  This is pretty easy: getvcpucontext() hypercall
▌Access to stack memory (to read return addresses and next FPs)
  This is the complicated step
  We need to do address resolution
  • Memory introspection requires mapping guest memory into our address space
  • We're looking at (uni)kernel code
  • But there's still a virtual → (guest-)physical resolution
  • Even if the guest is PVH, we can't benefit from it, because we're looking in from the outside
  • So we need to manually walk page tables
▌Symbol table (to resolve function names)
  Thankfully, this is easy again: extract symbols from the ELF with nm


SLIDE 33

Stack

function one()   { […] two(); […] }
function two()   { […] three(); }
function three() { […] }

Registers: IP, FP

Stack layout (innermost frame on top): local variables / frame pointer / return address / other registers / local variables / frame pointer / … down to the outermost frame, whose saved frame pointer is NULL.

Stack trace:
  three +0xca   ← IP
  two   +0xc1   ← FP + 1 word
  one   +0x0d   ← *FP + 1 word
  [done]        ← **FP == NULL
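The frame-pointer walk built up on the preceding slides can be condensed into a short sketch. Python for illustration only; the addresses, the 8-byte word size, and the `memory` dict (standing in for mapped guest stack pages) are all made up.

```python
# Frame-pointer stack walk: each frame stores [saved FP, return address];
# a NULL saved FP marks the outermost frame. All addresses are toy values.

def walk_stack(memory, ip, fp):
    """Return the call trace as a list of code addresses, innermost first."""
    trace = [ip]                      # innermost entry: the current IP
    while fp != 0:                    # **FP == NULL terminates the walk
        trace.append(memory[fp + 8])  # return address: one word above saved FP
        fp = memory[fp]               # follow the saved FP into the caller
    return trace

# Toy stack for one() -> two() -> three(), 8-byte words:
memory = {
    0xfe00: 0xfea0, 0xfe08: 0x44f2 + 0x52,    # three's frame (innermost)
    0xfea0: 0xff40, 0xfea8: 0x46a6 + 0x12,    # two's frame
    0xff40: 0x0,    0xff48: 0x12018 + 0x278,  # one's frame: saved FP is NULL
}
trace = walk_stack(memory, ip=0x4868 + 0xca, fp=0xfe00)
print([hex(a) for a in trace])   # three, two, one, call_main
```

The resulting trace has the same shape as the xenctx output shown earlier: three, two, one, call_main, innermost first.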


SLIDE 45

Walking the page tables (x86-64)

CR3 → L1 → L2 → L3 → L4 → (guest) physical address, each level indexed by bits of the virtual address
(one map per level: map! map! map! map! map!)

▌So many maps:
  5 per entry × stack depth
▌Then again, page table locations don't change…
  Neither do stack locations (exception: lots of thread spawning)
  Effective caching
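The walk, and why caching pays off, can be sketched as follows. This is an illustration with toy table contents, not Xen's real interfaces; the slide counts five maps per stack entry (four table levels plus the data page itself), and the sketch models the four table lookups.

```python
# 4-level x86-64 page-table walk as seen from outside the guest: each level
# costs one foreign-memory mapping ("map!"), so mappings are worth caching.
from functools import lru_cache

PAGE = {}   # toy "guest memory": frame number -> {index: next frame}

@lru_cache(maxsize=None)          # stands in for caching mapped pages
def map_frame(frame):
    map_frame.calls += 1          # count the "map!" operations
    return PAGE[frame]
map_frame.calls = 0

def translate(cr3, vaddr):
    """Resolve a virtual address to a (guest) physical address."""
    frame = cr3
    for level in range(4):                            # L1..L4
        index = (vaddr >> (39 - 9 * level)) & 0x1ff   # 9 index bits per level
        frame = map_frame(frame)[index]
    return (frame << 12) | (vaddr & 0xfff)            # add the 12-bit page offset

# Build a toy table hierarchy for one address and translate it twice.
vaddr = 0x00007f1234565abc
idx = [(vaddr >> (39 - 9 * i)) & 0x1ff for i in range(4)]
PAGE.update({100: {idx[0]: 101}, 101: {idx[1]: 102},
             102: {idx[2]: 103}, 103: {idx[3]: 0x4242}})
print(hex(translate(100, vaddr)))
print(map_frame.calls)      # 4 maps for the first walk
translate(100, vaddr)
print(map_frame.calls)      # still 4: cache hits, no new maps
```

The second translation performs no new maps at all, which is the effect the slide's "effective caching" point relies on: page-table (and stack) locations rarely change, so the cache stays warm.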


SLIDE 48

Create Symbol Table

▌Stack only contains addresses
▌Symbol resolution necessary
▌Trivial
  Virtual addresses mapped 1:1 into the unikernel address space
  nm is your friend
▌Needs an unstripped binary
  You're welcome to strip it afterwards

$ nm -n <ELF> > symtab
$ head symtab
0000000000000000 T _start
0000000000000000 T _text
0000000000000008 a RSP_OFFSET
0000000000000017 t stack_start
00000000000000fc a KERNEL_CS_MASK
0000000000001000 t shared_info
0000000000002000 t hypercall_page
0000000000003000 t error_entry
000000000000304f t error_call_handler
0000000000003069 t hypervisor_callback
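Because `nm -n` already sorts the table by address, resolving a raw address to `symbol+offset` is just a binary search for the nearest preceding symbol. A sketch, using a subset of the symbol values from the output above:

```python
# Resolve raw addresses to "symbol+offset" using the sorted output of `nm -n`.
import bisect

symtab_text = """\
0000000000000000 T _start
0000000000001000 t shared_info
0000000000002000 t hypercall_page
0000000000003000 t error_entry
000000000000304f t error_call_handler
"""

symbols = []                      # sorted list of (address, name)
for line in symtab_text.splitlines():
    addr, _type, name = line.split()
    symbols.append((int(addr, 16), name))

addrs = [a for a, _ in symbols]

def resolve(addr):
    """Map an address to 'name+0xoff' via the nearest preceding symbol."""
    i = bisect.bisect_right(addrs, addr) - 1   # O(log n), not a linear scan
    base, name = symbols[i]
    return f"{name}+{addr - base:#x}"

print(resolve(0x304f + 0x1a))    # -> error_call_handler+0x1a
```

The O(log n) lookup matters later in the talk: one of the fixes tried on xenctx is replacing exactly this linear symbol search with a binary search.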


SLIDE 52

What do we get? Flamegraphs!

▌Y axis: call trace
  Bottom: main function; each layer: one call depth
▌X axis: relative run time
  Call paths are aggregated, no same call path twice in the graph
▌In this example: netfront functions are the "heavy hitters"
  netfront_xmit_pbuf
  netfront_rx
  but also blkfront_aio_poll

Yep, it's a MiniOS* doing network communication
*with lwip for TCP/IP

https://github.com/brendangregg/flamegraph
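The flamegraph toolchain linked above consumes "folded" stacks: one line per unique call path, frames joined by semicolons, followed by a sample count. A sketch of producing that format from collected traces (the trace contents are illustrative, reusing function names from earlier slides):

```python
# Fold sampled stack traces into flamegraph.pl's input format:
# "frame1;frame2;...;frameN count", one line per unique call path.
from collections import Counter

samples = [                     # outermost frame first, as flamegraphs expect
    ("call_main", "main", "netfront_rx"),
    ("call_main", "main", "netfront_rx"),
    ("call_main", "main", "blkfront_aio_poll"),
    ("call_main", "main", "schedule", "monotonic_clock"),
]

folded = Counter(";".join(trace) for trace in samples)
for path, count in sorted(folded.items()):
    print(path, count)
# The printed lines can then be piped through flamegraph.pl to get the SVG.
```

The counting step is where "call paths are aggregated, no same call path twice in graph" happens: identical paths collapse to a single line whose count determines the width of the box.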


SLIDE 57

Try 1: improving xenctx performance

▌xenctx translates and maps memory addresses on every stack walk
  Huge overhead
  Solution: cache mapped memory and virtual → machine translations
▌xenctx resolves symbols via linear search
  Solution: use binary search
  (Or, even better, do resolutions offline after tracing)

At this point, I abandoned xenctx and (re)wrote uniprof from scratch.


SLIDE 59

Try 2: uniprof

▌A 100-fold improvement is nice! But we can do better:
  Xen 4.7 introduced low-level libraries (libxencall, libxenforeignmemory)
  Another significant reduction, by a factor of ~3
▌End result: overhead of ~0.1% @ 101 samples/s
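The end result implies a per-sample cost, which makes for a quick sanity check. (Reading the odd 101 Hz rate as a way to avoid sampling in lockstep with periodic guest activity is my gloss, not stated on the slide.)

```python
# What ~0.1% overhead at 101 samples/s implies per stack trace.
overhead = 0.001           # ~0.1% of guest time spent paused
samples_per_s = 101        # sampling rate from the slide
per_sample_s = overhead / samples_per_s
print(f"~{per_sample_s * 1e6:.0f} us per trace")   # vs. xenctx's 3000+ us
```

Roughly 10 microseconds per trace, about 300 times cheaper than the 3ms+ figure quoted for xenctx earlier.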


SLIDE 61

Performance on ARM

▌uniprof supports ARM (xenctx doesn't)
  Main challenge: different page table design
▌ARM: much slower, overhead higher
  But the CPU is much slower, too (Intel Xeon @3.7GHz vs. Cortex A20 @1GHz)
  So fewer samples/s are needed for the same effective resolution


SLIDE 63

No Frame Pointer? No Problem!

▌Stack walking relies on the frame pointer
  Optimizations can reuse the FP as a general-purpose register (-fomit-frame-pointer)
▌But we can do without FPs
  Use stack unwinding information (DWARF standard)
  • It's already included if you use C++ (for exception handling)
  • It doesn't change performance, only binary size

$ readelf -S <ELF>
There are 13 section headers, starting at offset 0x40d58:
Section Headers:
  [Nr] Name           Type      Address           Offset
       Size           EntSize   Flags  Link  Info  Align
  [...]
  [ 4] .eh_frame      PROGBITS  0000000000035860  00036860
       00000000000066f8 0000000000000000   A      0     0     8
  [ 5] .eh_frame_hdr  PROGBITS  000000000003bf58  0003cf58
       000000000000128c 0000000000000000   A      0     0     4
  [...]


SLIDE 69

Unwinding without Frame Pointers

▌How does it work?
▌Lookup table
  For every program address:
  • The current frame size
  • Locations of registers to restore (GP and IP)
  Important for exception handling
  • Exit functions immediately until a handler is found
▌Index to quickly find the table entry
▌Several library implementations
  uniprof uses libunwind
  Actually, a libunwind patched for Xen guest introspection support
  Might be useful for other tools?
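The lookup-table idea can be sketched with a deliberately simplified table; real DWARF CFI is considerably richer, and the addresses, frame sizes, and layout here are made up for illustration.

```python
# Minimal sketch of table-driven unwinding (the idea behind .eh_frame):
# per code range, record the frame size and where the return address lives
# relative to SP, so no frame pointer is needed to walk the stack.
import bisect

# (start_address, frame_size_in_words) per function, sorted by start address.
UNWIND = [(0x4400, 4),    # two():   4-word frame
          (0x4800, 6)]    # three(): 6-word frame
STARTS = [s for s, _ in UNWIND]

def unwind_once(memory, ip, sp):
    """Given the IP/SP of one frame, return (return address, caller's SP)."""
    i = bisect.bisect_right(STARTS, ip) - 1     # the "index" from the slide
    _start, frame_words = UNWIND[i]
    ret = memory[sp + (frame_words - 1) * 8]    # return address atop the frame
    return ret, sp + frame_words * 8            # pop the whole frame

# Toy memory: three() was called from two()+0x52; three's frame is 6 words.
memory = {0xfe00 + 5 * 8: 0x4400 + 0x52}
ret, caller_sp = unwind_once(memory, ip=0x4800 + 0xca, sp=0xfe00)
print(hex(ret), hex(caller_sp))
```

Repeating `unwind_once` frame by frame reproduces the stack walk; libunwind additionally reconstructs all saved registers at each step, which is more than a profiler needs and is one reason the libunwind-based mode is slower.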

SLIDE 70

Performance: uniprof w/ libunwind

▌Performance lower than with frame pointers
  Reason: libunwind does more than we need (full register reconstruction etc.)
▌A different library or an own implementation looks promising
  But "good enough" for many cases
  And a good area for future work

SLIDE 71

Thank you! Questions?

uniprof: https://github.com/cnplab/uniprof
libunwind-xen: https://github.com/cnplab/libunwind
FlameGraphs: https://github.com/brendangregg/flamegraph