Systems Research Group Department of Computer Science The University of Hong Kong
Adaptive Sampling-Based Profiling Techniques for Optimizing the Distributed JVM Runtime
IPDPS’10, Atlanta, Georgia, USA
King Tin Lam, Yang Luo, Cho-Li Wang
Speaker: King Tin Lam
Date: Apr 20, 2010
System       Developer                 Implementation Level        Granularity      Consistency Model
IVY          Yale                      Library + OS                Page (1KB)       SC
Munin        Rice                      Library + OS                Variable         ERC
TreadMarks   Rice                      Library                     Page (4KB)       LRC
CVM          Maryland                  Library                     Page             LRC, SC
Midway       CMU                       Library + Compiler          Variable         EC, PC, RC
NCP2         UFRJ, Brazil              Library + Hardware support  Page (4KB)       EC, RC
Quarks       Utah                      Library                     Region, Page     RC, SC
softFLASH    Stanford                  OS                          Page (16KB)      RC, DIRC
Cashmere-2L  Rochester                 Library                     Page (8KB)       HLRC
Brazos       Rice                      Library                     Page             ScC
Shasta       DEC WRL                   Compiler                    Variable         SC
Mermaid      Toronto                   Library + OS                Page (1KB, 8KB)  SC
Mirage       UCLA                      OS                          512 bytes        SC
JIAJIA       CAS, China                Library                     Page (4KB)       ScC
Simple-COMA  SICS (Sweden) and SUN     OS                          Page             SC
Blizzard-S   Wisconsin                 Library                     Cache line       SC
Shrimp       Princeton                 OS + Hardware support       Page             AURC, SC
Linda        Yale                      Language                    Variable         SC
Orca         Vrije Univ., Netherlands  Language                    Variable         EC-like
(e.g. cJVM, JavaSplit, JESSICA, …)
[Figure: threads T1 and T2 on node 1 and node 2, with a remote access between the nodes]
[Figure: distributed JVM architecture. One Master JVM and two Worker JVMs, each on a Host Manager over OS and hardware, joined by a communication network. Components shown per JVM: Local Heap, Java Method Area, Class Loader, Execution Engine, Thread Scheduler, Load Monitor Daemon, and threads 1 to 3 (PC, registers, stack frames). Labelled flows: source code -> Java compiler -> class files; remote class loading; thread migration (portable Java frames).]
A cluster-wide JVM with
[Figure: the same architecture, now also showing a Heap (Global Object Space) together with the nodes' local heaps.]
[Figure: profiling and migration architecture. Worker JVMs 1 to 3, each with a Host Manager, Thread Scheduler, thread space, Local Heap, Stack Profiler, Access Profiler, Correlation Collector and Migration Engine, over OS and hardware and an interconnection network. Master JVM side: Global Load Balancer, Correlation Analyzer, Correlation Map (Simplified View). Labelled flows: mig in/out (portable Java frames); migration requests.]
Now equipped with
thread sharing -> thread-thread relation
each thread -> thread migration cost
thread migration -> global locality improvement
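The first relation above (thread sharing -> thread-thread relation) can be illustrated with a minimal sketch. It assumes each thread's access log is simply a map from object id to object size, and that the correlation between two threads is the total size of the objects both have touched. The function name, the log format and the byte weighting are illustrative assumptions, not the system's actual data structures.

```python
from itertools import combinations

def build_correlation_map(access_logs):
    """Build a symmetric thread-thread correlation map.

    access_logs: dict thread_id -> {object_id: size_in_bytes}
                 (hypothetical log format, for illustration only).
    Returns dict-of-dicts where entry [t][u] is the number of bytes
    in objects accessed by both thread t and thread u.
    """
    threads = sorted(access_logs)
    tcm = {t: {u: 0 for u in threads} for t in threads}
    for t, u in combinations(threads, 2):
        shared = access_logs[t].keys() & access_logs[u].keys()
        weight = sum(access_logs[t][obj] for obj in shared)
        tcm[t][u] = tcm[u][t] = weight
    return tcm

# Toy logs: T1 and T2 share the large object C; T3 shares small objects only.
logs = {
    "T1": {"A": 64, "B": 96, "C": 512},
    "T2": {"C": 512, "D": 64},
    "T3": {"A": 64, "D": 64},
}
tcm = build_correlation_map(logs)
```

In this toy input the T1-T2 entry dominates, so a placement policy driven by such a map would prefer to keep T1 and T2 on the same node.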
[Figure: threads placed across node 1, node 2, node 3, …; e.g. Water-Spatial, 32 threads placed]
Simulation: Barnes-Hut, 32 threads, 4K bodies (<100 bytes each), dist=7.0
Page size: 4KB vs. page size: 128 bytes (32 times more tracking)
[Equations (1) and (2): thread migration cost T_mig, written as sums over the stack frames i = 1..n of per-frame terms (frame, capture, restore); annotated with network latency & bandwidth]
[Figure: object fetching by thread T1 between acquire(L) and release(L); objects A, B, C]
(1) Without migration: acquire(L), fetch(A), fetch(B), read(A), read(B), fetch(C), read(C) x4, release(L). Fetching roundtrips = 3
(2) With migration: acquire(L), fetch(A), fetch(B), read(A), read(B), fetch(C), read(C) x3, T1 migrated, fetch(C), read(C), release(L). Fetching roundtrips = 4
(3) With migration prefetching WT1: acquire(L), fetch(A), fetch(B), read(A), read(B), fetch(C), read(C) x3, T1 migrated with C, read(C), release(L). Fetching roundtrips = 3
Accesses: A (1 time), B (1 time), C (4 times)
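The roundtrip counts in the three scenarios can be reproduced with a small simulation. This is a minimal sketch under stated assumptions: the event encoding and the function name are hypothetical, a migration empties the thread's node-local cache, and "prefetching" is modelled by carrying every object read so far along with the thread (a simplification of prefetching WT1).

```python
def count_fetch_roundtrips(trace, prefetch_on_migrate=False):
    """Count remote fetch roundtrips for a single thread.

    trace: list of ("read", obj) or ("migrate",) events (hypothetical encoding).
    A read of an object not cached on the current node costs one roundtrip.
    On migration the cache is lost, unless prefetch_on_migrate is set, in
    which case the objects read so far travel with the thread.
    """
    cache = set()     # objects available on the thread's current node
    touched = set()   # objects the thread has read so far
    roundtrips = 0
    for event in trace:
        if event[0] == "migrate":
            cache = set(touched) if prefetch_on_migrate else set()
        else:
            obj = event[1]
            if obj not in cache:
                roundtrips += 1
                cache.add(obj)
            touched.add(obj)
    return roundtrips

# Same access pattern as the slides: A once, B once, C four times.
reads = [("read", "A"), ("read", "B")] + [("read", "C")] * 3
no_migration = reads + [("read", "C")]
with_migration = reads + [("migrate",), ("read", "C")]

r1 = count_fetch_roundtrips(no_migration)
r2 = count_fetch_roundtrips(with_migration)
r3 = count_fetch_roundtrips(with_migration, prefetch_on_migrate=True)
```

Here r1, r2 and r3 come out as 3, 4 and 3, matching scenarios (1), (2) and (3).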
[Equations: two distance measures, subscripted EUC (Euclidean distance) and ABS (Absolute distance), each written with double sums over i = 1..N and j = 1..N of terms indexed ij]
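As an illustration of how two N x N maps could be compared with these two kinds of distance, here is a minimal sketch. The normalization (dividing by the distance of the full map from an all-zero map) and all names are assumptions made for this example; they are not taken from the slide's formulas.

```python
import math

def abs_distance(a, b):
    """Sum of absolute element-wise differences between two N x N maps."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def euc_distance(a, b):
    """Euclidean distance between two N x N maps viewed as flat vectors."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))

def accuracy(full, sampled, distance):
    """1 - distance(full, sampled) / distance(full, zero map).

    The choice of normalizer is an assumption for illustration.
    """
    zero = [[0] * len(row) for row in full]
    return 1.0 - distance(full, sampled) / distance(full, zero)

# Toy maps: the sampled map misses the weakest thread pair entirely.
full = [[0, 8, 2], [8, 0, 4], [2, 4, 0]]
sampled = [[0, 8, 0], [8, 0, 4], [0, 4, 0]]

acc_abs = accuracy(full, sampled, abs_distance)
acc_euc = accuracy(full, sampled, euc_distance)
```

With these toy maps the Euclidean-based score is lower than the absolute-based one, because squaring changes how the single missed entry weighs against the remaining entries.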
[Figure: three charts, (a) SOR, (b) Barnes-Hut, (c) Water-Spatial. Vertical axis 50% to 100%; horizontal axis 512X, 256X, 128X, 64X, 32X, 16X, 8X, 4X, 2X, 1X. Series: Absolute/ABS, Relative/ABS, Absolute/EUC, Relative/EUC]
[Figure: stack sampling example. Thread stacks with typed slots (int, float, double) captured at Time: t0, t1 on Processor: p0, p1]
[Figure: sticky set. Labels: Stack; Invariant references; Sticky set; Size estimated via
Key: Sampled objects; Unsampled objects; Objects referenced invariantly by stack]
[Figure: stack state 1 to stack state 5, compared sample by sample; frames labelled A to H
Key: raw frame; extracted frame; unvisited frame; stack invariant; non-invariant; comparison]
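The idea of finding stack invariants by comparing successive stack states can be sketched as follows. This is a minimal illustration, assuming each sampled stack is a list of frame ids ordered bottom to top and that a frame counts as invariant only if it and everything below it are unchanged in every sample. The encoding, the function name and the example states are hypothetical, not the slide's actual sequence.

```python
def invariant_frames(samples):
    """Return the bottom-most frames that are identical in every stack sample.

    samples: list of stacks, each a list of frame ids ordered bottom -> top
             (hypothetical encoding of sampled stack states).
    """
    if not samples:
        return []
    invariant = list(samples[0])
    for stack in samples[1:]:
        keep = 0
        limit = min(len(invariant), len(stack))
        # Compare from the bottom; stop at the first frame that differs.
        while keep < limit and invariant[keep] == stack[keep]:
            keep += 1
        invariant = invariant[:keep]  # frames above the mismatch are non-invariant
    return invariant

# Five illustrative stack states (made up for this example).
states = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["A", "B"],
    ["A", "E", "F"],
    ["A", "G", "H"],
]
early = invariant_frames(states[:3])
overall = invariant_frames(states)
```

Because the comparison stops at the first mismatch, frames above it need not be visited at all, which is one way to keep the per-sample cost low.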
Benchmark      Problem size             Sharing
               Data set       Rounds    Granularity  Object size
SOR            2K × 2K        10        Coarse       each row at least several KB
Barnes-Hut     4K bodies      5         Fine         each body less than 100 bytes
Water-Spatial  512 molecules  5         Medium       each molecule about 512 bytes
[Figures: CPU overhead of logging accesses into OALs; overhead of sending OALs]
Benchmark      Data set  Baseline   + Stack sampling overhead                                       + Sticky-set footprinting overhead                              + Sticky-set
               size      exe time   Immediate extraction            Lazy extraction                 Nonstop                         Timer-based (100ms)             resolution
                                    4ms             16ms            4ms             16ms            4X              Full            4X              Full            overhead
SOR            1K×1K     6201       6216 (0.24%)    6207 (0.10%)    6211 (0.17%)    6206 (0.08%)    6714 (8.28%)    6707 (8.17%)    6519 (5.13%)    6480 (4.50%)    6639 (1.85%)
Barnes-Hut     4K        93857      94947 (1.16%)   94657 (0.85%)   94697 (0.89%)   95209 (1.44%)   98968 (5.45%)   102190 (8.88%)  93649 (-0.22%)  102334 (9.03%)  97585 (4.20%)
Water-Spatial  512       59105      59232 (0.21%)   59161 (0.09%)   59209 (0.17%)   59124 (0.03%)   59834 (1.23%)   61985 (4.87%)   59501 (0.67%)   60313 (2.04%)   60002 (0.84%)
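The percentages in the stack sampling columns can be checked with a one-line formula. This sketch assumes the figure in parentheses is the increase over the baseline execution time; that reading reproduces cells such as 6216 (0.24%) and 94947 (1.16%), but it is an assumption and is not claimed to hold for every column.

```python
def overhead_percent(baseline, measured):
    """Relative overhead of `measured` over `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0

# Values taken from the table: SOR and Barnes-Hut, immediate extraction at 4ms.
sor_4ms = round(overhead_percent(6201, 6216), 2)
barnes_4ms = round(overhead_percent(93857, 94947), 2)
# A negative value means the instrumented run was faster than the baseline run.
barnes_timer_4x = round(overhead_percent(93857, 93649), 2)
```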
[Figure: locality over Epoch 1, Epoch 2, Epoch 3]
With thread migration enabled, the system preserves most of the locality across epochs (see right fig). Execution time is shortened by over 60% compared to no migration.