Analyzing the Scalability of Managed Language Applications with Speedup Stacks
Jennifer B. Sartor, Kristof Du Bois, Stijn Eyerman, Lieven Eeckhout
Understanding Scalability Problems
• Multicore
• Managed languages
  - Service threads
→ Speedup stack
  - Bar graph that explains causes of sublinear speedup
  - Ideal speedup of multi-threaded execution over single-threaded versus actual speedup
- p. 2
Speedup Stacks — Stijn Eyerman, Kristof Du Bois, Lieven Eeckhout —ISPASS-2012
Original Speedup Stacks
[Figure: speedup stack bar; components: imbalance, synchronization, memory interference, cache interference, (actual) speedup; the full bar height is the ideal speedup (# of threads)]
Speedup delimiters:
• This factor is responsible for reducing speedup by this amount from the ideal speedup
or
• If completely removed, gives an indication of how much speedup could improve
- p. 3
Original Speedup Stacks
• Scalability delimiters
  - Work imbalance
  - Spinning
  - Yielding
  - Last-level cache and memory interference
    · Positive
    · Negative
❌ No managed components
❌ Requires dedicated hardware support
- p. 4
Our Contribution
• Managed service threads
• On native hardware
[Figure: example managed speedup stack; y-axis: Speedup (0.5 to 4); components: Garbage Collector, Initialization, Thread Imbalance, Synchronization, Other Overheads, Measured]
- p. 5
Managed Speedup Stacks
• Scalability delimiters
  - Garbage collector
  - Managed runtime initialization
  - Synchronization
  - Thread imbalance
  - Other overheads
    · Parallelization overhead
    · Shared hardware resource interference
• On native hardware
  - Linux kernel modules
  - < 1% overhead on average
- p. 6
10%*
*T. Cao, S. M. Blackburn, T. Gao, and K. S. McKinley, “The yin and yang of power and performance for asymmetric hardware and managed software,” ISCA, 2012
Background
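• Ideal speedup = # of threads (N)
$$\text{Speedup} = \frac{\text{time}_{\text{single-threaded}}}{\text{time}_{\text{multi-threaded}}}, \qquad S = \frac{T_s}{T_p}$$
[Figure: execution timeline of threads T0 to T3; time axis 5 to 20; T_s = 20, T_p = 5]
$$T_s = \sum_{i=1}^{N} \Big( T_p - \sum_{j} O_{i,j} \Big)$$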
- p. 7
Background
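• Ideal speedup = # of threads (N)
$$\text{Speedup} = \frac{\text{time}_{\text{single-threaded}}}{\text{time}_{\text{multi-threaded}}}$$
$$T_s = \sum_{i=1}^{N} \Big( T_p - \sum_{j} O_{i,j} \Big)$$
$$S = \frac{T_s}{T_p} = N - \frac{\sum_{i=1}^{N} \sum_{j} O_{i,j}}{T_p}$$
[Figure: bar of height N split into the speedup S and the overhead components O_{i,j}]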
- p. 8
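The background definitions can be checked numerically. The following is a minimal sketch, assuming each thread's overheads are given as a plain list of time values; the function names and the overhead numbers are hypothetical, only T_s = 20 and T_p = 5 come from the slide.

```python
def single_threaded_time(t_p, overheads):
    """T_s = sum over threads i of (T_p - sum over j of O[i][j])."""
    return sum(t_p - sum(per_thread) for per_thread in overheads)


def speedup(t_p, overheads):
    """S = T_s / T_p, which equals N - (sum of all overheads) / T_p."""
    return single_threaded_time(t_p, overheads) / t_p


# Slide example: 4 threads, T_p = 5, no overhead, so T_s = 20 and S = N = 4.
ideal = [[], [], [], []]
print(single_threaded_time(5, ideal), speedup(5, ideal))

# Hypothetical per-thread overheads (same time unit as T_p).
lossy = [[1.0], [0.5, 0.5], [], [3.0]]
n = len(lossy)
total_overhead = sum(sum(per_thread) for per_thread in lossy)

# Both forms of the equation give the same speedup.
print(speedup(5, lossy), n - total_overhead / 5)
```

Every unit of overhead on any thread lowers the speedup by 1/T_p, which is why the overheads can be stacked on top of the measured speedup up to N.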
Managed: Garbage Collection
• When application paused
• In original speedup stacks: part of yielding
[Figure: execution timeline of threads T0 to T3; time axis 5 to 20]
- p. 9
Managed: Garbage Collection
• When application paused
• In original speedup stacks: part of yielding
• If GC were perfectly scalable, component would be 0
$$S = N - \frac{N \times T_{GC,MT} - T_{GC,ST}}{T_p} - \frac{\sum_{i=1}^{N} \sum_{j} O'_{i,j}}{T_p}$$
- p. 10
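A minimal sketch of the garbage collection term, assuming T_GC,MT and T_GC,ST are the total GC times of the multi-threaded and single-threaded runs. The function name and all timing values are hypothetical.

```python
def gc_component(n_threads, t_gc_mt, t_gc_st, t_p):
    """GC term of the speedup stack: (N * T_GC,MT - T_GC,ST) / T_p."""
    return (n_threads * t_gc_mt - t_gc_st) / t_p


# Hypothetical timings: 8 s of GC single-threaded, 40 s multi-threaded run.
N, T_GC_ST, T_P = 4, 8.0, 40.0

# Perfectly scalable GC: GC time shrinks by a factor N, so the term is 0.
perfect = gc_component(N, T_GC_ST / N, T_GC_ST, T_P)

# GC that does not speed up at all: all N threads pause for the full 8 s.
serial = gc_component(N, T_GC_ST, T_GC_ST, T_P)

print(perfect, serial)
```

The term charges the pause to all N application threads and credits back the GC work that the single-threaded run also had to do.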
Managed: Runtime Initialization
• Java virtual machine initialization, compilation, shutdown
• Application threads not yet spawned, or paused
[Figure: execution timeline of threads T0 to T3; time axis 5 to 20]
- p. 11
Managed: Runtime Initialization
• Java virtual machine initialization, compilation, shutdown
• Application threads not yet spawned, or paused
• If initialization were perfectly scalable, component would be 0
$$S = N - \frac{N \times T_{GC,MT} - T_{GC,ST}}{T_p} - \frac{N \times T_{init,MT} - T_{init,ST}}{T_p} - \frac{\sum_{i=1}^{N} \sum_{j} O''_{i,j}}{T_p}$$
- p. 12
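To see how the initialization term grows with the thread count, here is a small sketch under two labeled assumptions: initialization is fully serial (T_init,MT equals T_init,ST) and the rest of the application scales perfectly. All names and numbers are hypothetical.

```python
def init_component(n_threads, t_init_mt, t_init_st, t_p):
    """Initialization term: (N * T_init,MT - T_init,ST) / T_p."""
    return (n_threads * t_init_mt - t_init_st) / t_p


# Assumed workload: 1 s of serial initialization, 15 s of parallel work.
T_INIT, T_APP_ST = 1.0, 15.0
T_S = T_INIT + T_APP_ST

rows = []
for n in (2, 4, 8):
    t_p = T_INIT + T_APP_ST / n        # multi-threaded execution time
    comp = init_component(n, T_INIT, T_INIT, t_p)
    rows.append((n, n - comp, T_S / t_p))
    print(n, round(comp, 3), round(n - comp, 3), round(T_S / t_p, 3))
```

With initialization as the only overhead, N minus the initialization term reproduces the measured speedup T_s / T_p, and the term grows as threads are added.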
Other Speedup Delimiters
• Synchronization
  - When threads wait on each other
  - Measure wait time inside futex syscall
• Thread Imbalance
  - When thread executes longer than other threads
  - Measure wait time inside exit syscall
• Other Overhead
  - Parallelization overhead
  - Hardware interference
  - Estimated
- p. 13
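The per-thread accounting for synchronization and thread imbalance can be sketched as below. This is not the kernel module from the talk: the record layout (thread id, syscall name, enter and exit timestamps in ns) and the function name are assumptions made for illustration only.

```python
from collections import defaultdict


def wait_times(records):
    """Sum per-thread wait time (ns) spent inside futex and exit syscalls."""
    sync = defaultdict(int)       # synchronization: time inside futex
    exit_wait = defaultdict(int)  # thread imbalance: time inside exit
    for tid, syscall, enter_ns, exit_ns in records:
        duration = exit_ns - enter_ns
        if syscall == "futex":
            sync[tid] += duration
        elif syscall == "exit":
            exit_wait[tid] += duration
    return dict(sync), dict(exit_wait)


# Hypothetical trace records: (thread_id, syscall, enter_ns, exit_ns).
records = [
    (0, "futex", 100, 400),
    (1, "futex", 150, 250),
    (0, "futex", 900, 1000),
    (1, "exit", 2000, 2600),
    (0, "read", 10, 20),  # other syscalls are ignored
]
sync, exit_wait = wait_times(records)
print(sync, exit_wait)
```

The per-thread sums correspond to the Sync_i and Exit_i values that are divided by T_p in the speedup stack.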
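Managed Speedup Stack
$$S = N - \frac{N \times T_{GC,MT} - T_{GC,ST}}{T_p} - \frac{N \times T_{init,MT} - T_{init,ST}}{T_p} - \frac{\sum_{i=1}^{N} Sync_i}{T_p} - \frac{\sum_{i=1}^{N} Exit_i}{T_p} - \frac{\sum_{i=1}^{N} \sum_{j} O''''_{i,j}}{T_p}$$
S is the measured speedup. Terms on the right, in order: # threads (N), garbage collector, initialization, synchronization, thread imbalance, other overheads.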
- p. 14
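Putting the terms together, a managed speedup stack can be computed as in the sketch below. It assumes the other-overheads component is taken as the remainder up to N, which is one way to read "Estimated"; the function name and all input values are hypothetical.

```python
def managed_speedup_stack(n, t_p, t_s, gc_mt, gc_st, init_mt, init_st,
                          sync, exit_wait):
    """Return the speedup-stack components as a dict, in speedup units."""
    stack = {
        "gc": (n * gc_mt - gc_st) / t_p,
        "init": (n * init_mt - init_st) / t_p,
        "sync": sum(sync) / t_p,            # sum of Sync_i over threads
        "imbalance": sum(exit_wait) / t_p,  # sum of Exit_i over threads
        "measured": t_s / t_p,
    }
    # Assumption: other overheads are whatever is left between the sum of
    # the components above and the ideal speedup N.
    stack["other"] = n - sum(stack.values())
    return stack


# Hypothetical measurements for a 4-thread run (seconds).
stack = managed_speedup_stack(
    n=4, t_p=10.0, t_s=28.0,
    gc_mt=1.0, gc_st=2.0,
    init_mt=0.5, init_st=0.5,
    sync=[0.5, 0.5, 1.0, 0.0],
    exit_wait=[0.0, 1.0, 0.5, 0.5],
)
for name, value in stack.items():
    print(f"{name:10s} {value:.2f}")
```

By construction the components add up to the number of threads, so each one can be drawn as a segment of a bar of height N.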
Managed Speedup Stack
$$S = N - \frac{N \times T_{GC,MT} - T_{GC,ST}}{T_p} - \frac{N \times T_{init,MT} - T_{init,ST}}{T_p} - \frac{\sum_{i=1}^{N} Sync_i}{T_p} - \frac{\sum_{i=1}^{N} Exit_i}{T_p} - \frac{\sum_{i=1}^{N} \sum_{j} O''''_{i,j}}{T_p}$$
[Figure: managed speedup stack; y-axis: Speedup (0.5 to 4); each term of the equation maps to one component: GC, Initialize, Imbalance, Sync., Other, Measured]
- p. 15
Experimental Methodology
• Java applications from DaCapo 2009 suite
• Jikes Research Virtual Machine 3.1.2
• Garbage collector
  - 2 threads
  - 13th iteration for stable behavior
  - Heap size based on minimum with stop-the-world (STW) collector
  - STW generational Immix and concurrent collectors
• Intel Xeon E5, 8 cores per socket, 20 MB LLC
• Linux kernel 3.2.37
- p. 16
Speedup Stacks with STW GC
[Figure: speedup stacks for lusearch, pmd, sunflow and xalan at 2, 4 and 8 threads; y-axis: Speedup (1 to 8); components: Measured, Other Overheads, Synchronization, Thread Imbalance, Initialization, Garbage Collector]
- p. 17
Performance Counters for STW GC
[Figure: performance counters (Instructions, L1-loads, L1-loads-misses, LLC-loads, LLC-load-misses) for lusearch, pmd, sunflow and xalan at 2, 4 and 8 threads; y-axis: relative to one thread (0.5 to 4.5)]
- p. 18
Concurrent GC, Same Heap Size
[Figure: speedup stacks for lusearch, pmd, sunflow and xalan at 2, 4 and 8 threads; y-axis: Speedup (1 to 8); components: Measured, Other Overheads, Synchronization, Thread Imbalance, Initialization, Garbage Collector]
- p. 19
Concurrent GC, Large Heap
[Figure: speedup stacks for lusearch, pmd, sunflow and xalan at 2, 4 and 8 threads; y-axis: Speedup (1 to 8); components: Measured, Other Overheads, Synchronization, Thread Imbalance, Initialization, Garbage Collector]
- p. 20
Perf Cntrs, Concurrent GC, Large Heap
[Figure: performance counters (Instructions, L1-loads, L1-loads-misses, LLC-loads, LLC-load-misses) for lusearch, pmd, sunflow and xalan at 2, 4 and 8 threads; y-axis: relative to one thread (0.2 to 2)]
- p. 21
Comparison Across Collectors
- p. 22
[Figure: speedup stacks at 8 threads for lusearch, pmd, sunflow and xalan, each with the stw, conc and conc large configurations; y-axis: Speedup (1 to 8); components: Measured, Other Overheads, Synchronization, Thread Imbalance, Initialization, Garbage Collector]
Related Work
• Commercial
  - Intel VTune Amplifier XE
  - Sun Studio Performance Analyzer
  - Rogue Wave/Acumem ThreadSpotter
  - PGPROF
• IBM WAIT
• Criticality stacks & Bottle graphs
→ None quantify gross scalability bottlenecks; most don't analyze service threads
- p. 23
Conclusions: Managed Speedup Stacks
• Visualize scalability bottlenecks
• Show relative contributions of components
  - Garbage collector
  - Managed runtime initialization
• On native hardware at low overhead
• Show where to focus optimization: application or service threads
[Figure: example managed speedup stack; y-axis: Speedup (0.5 to 4); components: Garbage Collector, Initialization, Thread Imbalance, Synchronization, Other Overheads, Measured]
- p. 24