SLIDE 1

Analyzing the Scalability of Managed Language Applications with Speedup Stacks

Jennifer B. Sartor Kristof Du Bois Stijn Eyerman Lieven Eeckhout

SLIDE 2

Understanding Scalability Problems

  • Multicore
  • Managed languages
      • Service threads

➢ Speedup Stack
      • Bar graph that explains causes of sublinear speedup
      • Ideal speedup of multi-threaded execution over single-threaded versus actual speedup

Speedup Stacks, Stijn Eyerman, Kristof Du Bois, Lieven Eeckhout, ISPASS 2012
SLIDE 3

Original Speedup Stacks

[Figure: a speedup stack. The bar's total height is the ideal speedup (# of threads); it is divided into the actual speedup and the speedup delimiters: cache interference, memory interference, synchronization, and imbalance.]

Speedup delimiters can be read in two ways:

  • This factor is responsible for reducing speedup by this amount from the ideal speedup, or
  • If completely removed, it gives an indication of how much speedup could improve

SLIDE 4

Original Speedup Stacks

  • Scalability delimiters
      • Work imbalance
      • Spinning
      • Yielding
      • Last-level cache and memory interference
          • Positive
          • Negative

❌ No managed components
❌ Requires dedicated hardware support

SLIDE 5

Our Contribution

  • Managed service threads
  • On native hardware

[Figure: example managed speedup stack. Y-axis: speedup (ticks 0.5 to 4). Components: Garbage Collector, Initialization, Thread Imbalance, Synchronization, Other Overheads, Measured.]

SLIDE 6

Managed Speedup Stacks

  • Scalability delimiters
      • Garbage collector
      • Managed runtime initialization
      • Synchronization
      • Thread imbalance
      • Other overheads
          • Parallelization overhead
          • Shared hardware resource interference
  • On native hardware
      • Linux kernel modules
      • < 1% overhead on average


10%*

*T. Cao, S. M. Blackburn, T. Gao, and K. S. McKinley, “The yin and yang of power and performance for asymmetric hardware and managed software,” ISCA, 2012

SLIDE 7

Background

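  • Ideal speedup = # of threads (N)

$$\text{Speedup} = \frac{\text{time}_{\text{single-threaded}}}{\text{time}_{\text{multi-threaded}}}, \qquad S = \frac{T_s}{T_p}$$

[Figure: execution timeline (ticks 5, 10, 15, 20) for threads T0 to T3, with T_s = 20 and T_p = 5.]

$$T_s = \sum_{i=1}^{N} \Bigl( T_p - \sum_{j} O_{ij} \Bigr)$$

Here O_ij is overhead component j of thread i.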

SLIDE 8

Background

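  • Ideal speedup = # of threads (N)

$$\text{Speedup} = \frac{\text{time}_{\text{single-threaded}}}{\text{time}_{\text{multi-threaded}}}, \qquad S = \frac{T_s}{T_p}$$

$$T_s = \sum_{i=1}^{N} \Bigl( T_p - \sum_{j} O_{ij} \Bigr)$$

$$\frac{T_s}{T_p} = S = N - \frac{\sum_{i=1}^{N} \sum_{j} O_{ij}}{T_p}$$

[Figure: speedup stack of total height N, divided into the speedup S and the overhead components O_ij.]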
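The relation between S, N, and the overheads can be checked numerically. The Python sketch below is illustrative only: the function name `speedup_from_overheads` and the overhead values are hypothetical. It keeps N = 4 and T_s = 20 from the timeline example, but assumes T_p = 8 instead of the ideal 5, so that 12 time units of overhead are spread over the threads.

```python
def speedup_from_overheads(tp, overheads):
    """Return S = N - (sum_i sum_j O_ij) / Tp.

    tp        -- multi-threaded execution time (Tp)
    overheads -- one list per thread with that thread's overhead
                 times O_ij, in the same unit as tp
    """
    n = len(overheads)  # number of threads N, which is also the ideal speedup
    total_overhead = sum(sum(per_thread) for per_thread in overheads)
    return n - total_overhead / tp


# Hypothetical numbers: 4 threads, Tp = 8, 3 units of overhead per thread.
overheads = [[1.0, 2.0], [3.0], [2.0, 1.0], [3.0]]
tp = 8.0

# Ts = sum_i (Tp - sum_j O_ij)
ts = sum(tp - sum(per_thread) for per_thread in overheads)

speedup = speedup_from_overheads(tp, overheads)
print(ts, speedup, ts / tp)  # both speedup values agree
```

With these inputs the overhead form and the direct ratio T_s / T_p give the same speedup, which is the point of the decomposition: each O_ij lowers the stack from the ideal N by O_ij / T_p.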

SLIDE 9

Managed: Garbage Collection

  • When application is paused
  • In original speedup stacks: part of yielding

[Figure: execution timeline (ticks 5, 10, 15, 20) for threads T0 to T3.]

SLIDE 10

Managed: Garbage Collection

  • When application is paused
  • In original speedup stacks: part of yielding
  • If GC were perfectly scalable, component would be 0

$$S = N - \frac{N \times T_{\mathrm{GC,MT}} - T_{\mathrm{GC,ST}}}{T_p} - \frac{\sum_{i=1}^{N} \sum_{j} O^{1}_{ij}}{T_p}$$
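As a rough illustration of the GC term, the sketch below evaluates (N × T_GC,MT − T_GC,ST) / T_p, reading the numerator as N times the multi-threaded GC time minus the single-threaded GC time, which is the form that vanishes when GC is perfectly scalable. The function name `gc_component` and all timings are hypothetical.

```python
def gc_component(n, t_gc_mt, t_gc_st, tp):
    """GC term of the stack: (N * T_GC,MT - T_GC,ST) / Tp."""
    return (n * t_gc_mt - t_gc_st) / tp


# Hypothetical timings in seconds.
n, tp = 4, 10.0
t_gc_st = 2.0  # GC time in the single-threaded run

# Perfectly scalable GC: the multi-threaded GC time shrinks by a factor N.
perfect = gc_component(n, t_gc_st / n, t_gc_st, tp)

# GC that does not scale at all: the pause is as long as in the
# single-threaded run, while all N application threads wait.
not_scaling = gc_component(n, t_gc_st, t_gc_st, tp)

print(perfect, not_scaling)
```

The first case yields a zero component; in the second, the GC component takes 0.6 off the ideal speedup of 4 under these assumed timings.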

SLIDE 11

Managed: Runtime Initialization

  • Java virtual machine initialization, compilation, shutdown
  • Application threads not yet spawned, or paused

[Figure: execution timeline (ticks 5, 10, 15, 20) for threads T0 to T3.]

SLIDE 12

Managed: Runtime Initialization

  • Java virtual machine initialization, compilation, shutdown
  • Application threads not yet spawned, or paused
  • If initialization were perfectly scalable, component would be 0

$$S = N - \frac{N \times T_{\mathrm{GC,MT}} - T_{\mathrm{GC,ST}}}{T_p} - \frac{N \times T_{\mathrm{init,MT}} - T_{\mathrm{init,ST}}}{T_p} - \frac{\sum_{i=1}^{N} \sum_{j} O^{2}_{ij}}{T_p}$$

SLIDE 13

Other Speedup Delimiters

  • Synchronization
      • When threads wait on each other
      • Measure wait time inside futex syscall
  • Thread Imbalance
      • When a thread executes longer than other threads
      • Measure wait time inside exit syscall
  • Other Overhead
      • Parallelization overhead
      • Hardware interference
      • Estimated

SLIDE 14

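Managed Speedup Stack

$$S = N - \frac{N \times T_{\mathrm{GC,MT}} - T_{\mathrm{GC,ST}}}{T_p} - \frac{N \times T_{\mathrm{init,MT}} - T_{\mathrm{init,ST}}}{T_p} - \frac{\sum_{i=1}^{N} \mathrm{Sync}_i}{T_p} - \frac{\sum_{i=1}^{N} \mathrm{Exit}_i}{T_p} - \frac{\sum_{i=1}^{N} \sum_{j} O^{4}_{ij}}{T_p}$$

Terms, left to right:

  • S: measured speedup
  • N: # threads
  • Garbage collector
  • Initialization
  • Synchronization
  • Thread imbalance
  • Other overheads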
SLIDE 15

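Managed Speedup Stack

$$S = N - \frac{N \times T_{\mathrm{GC,MT}} - T_{\mathrm{GC,ST}}}{T_p} - \frac{N \times T_{\mathrm{init,MT}} - T_{\mathrm{init,ST}}}{T_p} - \frac{\sum_{i=1}^{N} \mathrm{Sync}_i}{T_p} - \frac{\sum_{i=1}^{N} \mathrm{Exit}_i}{T_p} - \frac{\sum_{i=1}^{N} \sum_{j} O^{4}_{ij}}{T_p}$$

[Figure: example managed speedup stack with each equation term mapped to a bar segment. Y-axis: speedup (ticks 0.5 to 4). Segments: Garbage Collector (GC), Initialization, Thread Imbalance, Synchronization, Other Overheads, Measured.]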
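A minimal Python sketch of how the terms of the managed speedup stack could be assembled from measured times. It is not the authors' tooling: the function name `managed_speedup_stack`, the dictionary keys, and every input value are hypothetical, and it assumes that "other overheads" is taken as whatever remains of the ideal speedup N once the measured speedup and the four measured components are subtracted.

```python
def managed_speedup_stack(n, tp, ts, t_gc_mt, t_gc_st,
                          t_init_mt, t_init_st, sync_waits, exit_waits):
    """Return the components of the managed speedup stack as a dict.

    sync_waits and exit_waits hold one wait time per application thread
    (Sync_i and Exit_i). All times share one unit.
    """
    stack = {
        "garbage_collector": (n * t_gc_mt - t_gc_st) / tp,
        "initialization": (n * t_init_mt - t_init_st) / tp,
        "synchronization": sum(sync_waits) / tp,
        "thread_imbalance": sum(exit_waits) / tp,
        "measured": ts / tp,
    }
    # Assumption: "other overheads" is the residual up to the ideal speedup N.
    stack["other_overheads"] = n - sum(stack.values())
    return stack


# Hypothetical measurements for a 4-thread run (seconds).
stack = managed_speedup_stack(
    n=4, tp=10.0, ts=25.0,
    t_gc_mt=1.0, t_gc_st=2.0,
    t_init_mt=0.5, t_init_st=0.5,
    sync_waits=[1.0, 0.5, 0.5, 1.0],
    exit_waits=[0.0, 2.0, 1.0, 1.0],
)
for name, height in stack.items():
    print(f"{name:18s} {height:.2f}")
```

By construction the segments add up to N, so the dictionary can be plotted directly as one stacked bar whose top sits at the ideal speedup.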

SLIDE 16

Experimental Methodology

  • Java applications from DaCapo 2009 suite
  • Jikes Research Virtual Machine 3.1.2
  • Garbage collector
      • 2 threads
      • 13th iteration for stable behavior
      • Heap size based on minimum with stop-the-world (STW) collector
      • STW generational Immix and concurrent collectors
  • Intel Xeon E5, 8 cores per socket, 20 MB LLC
  • Linux kernel 3.2.37

SLIDE 17

Speedup Stacks with STW GC

[Figure: speedup stacks for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads. Y-axis: speedup (ticks 1 to 8). Components: Measured, Other Overheads, Synchronization, Thread Imbalance, Initialization, Garbage Collector.]

SLIDE 18

Performance Counters for STW GC

[Figure: performance counters (Instructions, L1-loads, L1-loads-misses, LLC-loads, LLC-load-misses) for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads. Y-axis: value relative to one thread (ticks 0.5 to 4.5).]
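The "relative to one thread" axis implies a simple normalization: each counter value is divided by the same counter's value in the single-threaded run. A small Python sketch of that step follows; the function name `relative_to_one_thread` and the raw counts are hypothetical and are not measurements from the talk.

```python
def relative_to_one_thread(counts_by_threads):
    """Divide every counter value by the same counter's single-thread value.

    counts_by_threads maps a thread count to a dict of event -> raw count
    and must contain an entry for 1 thread.
    """
    baseline = counts_by_threads[1]
    return {
        threads: {event: value / baseline[event]
                  for event, value in counts.items()}
        for threads, counts in counts_by_threads.items()
    }


# Hypothetical raw counts for one benchmark.
raw = {
    1: {"instructions": 1000, "LLC-load-misses": 10},
    2: {"instructions": 1100, "LLC-load-misses": 15},
    8: {"instructions": 1500, "LLC-load-misses": 40},
}
normalized = relative_to_one_thread(raw)
print(normalized[8])
```

A normalized value above 1 means the multi-threaded run executed more of that event than the single-threaded run, for example extra instructions from parallelization overhead or extra misses from shared-cache interference.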

SLIDE 19

Concurrent GC, Same Heap Size

[Figure: speedup stacks for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads. Y-axis: speedup (ticks 1 to 8). Components: Measured, Other Overheads, Synchronization, Thread Imbalance, Initialization, Garbage Collector.]

SLIDE 20

Concurrent GC, Large Heap

[Figure: speedup stacks for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads. Y-axis: speedup (ticks 1 to 8). Components: Measured, Other Overheads, Synchronization, Thread Imbalance, Initialization, Garbage Collector.]

SLIDE 21

Performance Counters, Concurrent GC, Large Heap

[Figure: performance counters (Instructions, L1-loads, L1-loads-misses, LLC-loads, LLC-load-misses) for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads. Y-axis: value relative to one thread (ticks 0.2 to 2).]

SLIDE 22

Comparison Across Collectors

[Figure: speedup stacks at 8 threads for lusearch, pmd, sunflow, and xalan, each under three configurations: stw, conc, and conc large. Y-axis: speedup (ticks 1 to 8). Components: Measured, Other Overheads, Synchronization, Thread Imbalance, Initialization, Garbage Collector.]
SLIDE 23

Related Work

  • Commercial
      • Intel VTune Amplifier XE
      • Sun Studio Performance Analyzer
      • Rogue Wave/Acumem ThreadSpotter
      • PGPROF
  • IBM WAIT
  • Criticality stacks & Bottle graphs

➢ None quantify gross scalability bottlenecks, most don’t analyze service threads

SLIDE 24

Conclusions: Managed Speedup Stacks

  • Visualize scalability bottlenecks
  • Show relative contributions of components
      • Garbage collector
      • Managed runtime initialization
  • On native hardware at low overhead
  • Show where to focus optimization: application or service threads

[Figure: example managed speedup stack. Y-axis: speedup (ticks 0.5 to 4). Components: Garbage Collector, Initialization, Thread Imbalance, Synchronization, Other Overheads, Measured.]
