 
DVFS PERFORMANCE PREDICTION FOR MANAGED MULTITHREADED APPLICATIONS
Shoaib Akram, Jennifer B. Sartor, Lieven Eeckhout
Ghent University, Belgium
Shoaib.Akram@elis.UGent.be

DVFS Performance Prediction
[Figure: performance versus frequency; many applications fall in the memory-bound region]
• Sample at all DVFS states (undesirable)
• Estimate performance (preferred)

Managed Multithreaded Applications

Background
[Figure: execution time at base frequency and at target frequency, split between CPU and DRAM]
• t_base is the sum of
  – Scaling (S)
  – Non-Scaling (NS)
• r = Base/Target
• S → S * r
• NS → No change
• t_target = (S * r) + NS
• Not simple
  – OOO + MLP

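The scaling model on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name `predict_time` and the example numbers are assumptions made here.

```python
def predict_time(scaling, non_scaling, base_freq, target_freq):
    """Predict execution time at target_freq from a base-frequency run.

    scaling:     time at base frequency that scales with frequency (S)
    non_scaling: time that does not change with frequency (NS)
    """
    r = base_freq / target_freq       # r = Base/Target
    return scaling * r + non_scaling  # t_target = (S * r) + NS


# Hypothetical example: 6 time units scale, 4 do not, frequency doubles.
t_base = 6.0 + 4.0
t_target = predict_time(6.0, 4.0, base_freq=1.0, target_freq=2.0)
print(t_base, t_target)  # 10.0 7.0
```

The hard part, as the slide notes, is obtaining S and NS in the first place: with out-of-order execution and overlapping memory accesses the split cannot be read off directly.
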
State of the Art
• CRIT estimates non-scaling by
  – Measuring critical path through loads
  – Ignoring store operations
R. Miftakhutdinov, E. Ebrahimi, and Y. N. Patt. Predicting performance impact of DVFS for realistic memory systems. MICRO, 2012.

Multithreaded CRIT (M+CRIT)
[Figure: timelines of threads T0 and T1 at base frequency (t_base, time axis 0 to 1) and at a 2X target frequency (t_target, time axis 0 to 0.5 to 1), with the critical thread marked]
• Use CRIT to identify each thread’s non-scaling
• High error for multithreaded Java!

Sources of Inaccuracy in M+CRIT
[Figure: timelines of threads app0, app1, gc0, and gc1 across the phases Application, Collection, Application, with a busy wait and a store burst marked; dependences are labeled DEP and the store burst is labeled BURST]
• Scaling or non-scaling?

Our Contribution
• DEP+BURST: A New DVFS Performance Predictor

Example: Inter-thread Dependences
[Figure: pseudocode for two threads, with numbered markers 1 to 4]
T0:
    while (cond0) { … }
    wait
    Acquire(lock)
    crit_sec() …
    Release(lock)
    ...
T1:
    while (cond1) { … }
    Acquire(lock)
    crit_sec() …
    Release(lock)
    wake
    ...
• Intercept synchronization activity
• Reconstruct execution at target frequency

Identifying Synchronization Epochs
[Figure: timelines of T0 and T1 at base frequency, divided into three synchronization epochs along the time axis]
• Epoch #1: T0 and T1 both run their loops (10 units each)
• Epoch #2: T0 is blocked in wait() while T1 runs crit_sec() (10 units)
• Epoch #3: after wake(), T0 runs crit_sec() and T1 keeps running (10 units each)
• Total time = 30 units

Reconstruction at Target Frequency
[Figure: per-epoch times of T0 and T1 at base frequency and at a 2X target frequency; CRIT predicts each thread's time within an epoch]
• Base frequency: every epoch takes 10 units, total time = 30 units
• Target frequency (2X)
  – Epoch #1: T0 = 7, T1 = 5
  – Epoch #2: T1 = 5 (T0 is idle for 3)
  – Epoch #3: T0 = 5, T1 = 5
• Longest running thread in an epoch: 7 + 5 + 5 = 17 units
  + Zero book-keeping
  - Not accurate
• Critical thread across epochs: 15 units
  + Accurate
  - Book-keeping

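The two aggregation options can be compared with a short Python sketch that reuses the numbers from the slide. It is one reading of the figure under stated assumptions, not the authors' algorithm: the data layout, the function names, and the `wakes` table are hypothetical, and the real book-keeping in DEP may differ.

```python
# Per-epoch predicted times at the target frequency, read off the slide.
# Each entry maps a thread to its predicted running time in that epoch.
# T0 is blocked in wait() during epoch 2, so it has no entry there.
epochs = [
    {"T0": 7, "T1": 5},  # epoch 1: both threads run their loops
    {"T1": 5},           # epoch 2: T1 runs crit_sec(), T0 waits
    {"T0": 5, "T1": 5},  # epoch 3: both threads run
]


def longest_per_epoch(epochs):
    """Sum the longest-running thread of every epoch (no book-keeping)."""
    return sum(max(e.values()) for e in epochs)


def critical_across_epochs(epochs, wakes):
    """Carry each thread's finish time from one epoch to the next.

    wakes maps an epoch index to (waker, sleeper): at the end of that
    epoch the waker releases the sleeper, which cannot resume earlier
    than the waker's finish time.
    """
    clock = {}
    for i, epoch in enumerate(epochs):
        for thread, t in epoch.items():
            clock[thread] = clock.get(thread, 0) + t
        if i in wakes:
            waker, sleeper = wakes[i]
            clock[sleeper] = max(clock.get(sleeper, 0), clock[waker])
    return max(clock.values())


print(longest_per_epoch(epochs))                           # 17
print(critical_across_epochs(epochs, {1: ("T1", "T0")}))   # 15
```

The per-epoch maximum charges T0's 7 units in epoch 1 even though T0 then sits idle until T1 wakes it at time 10; carrying per-thread finish times across epochs absorbs that slack, which is where the 2-unit difference comes from.
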
DEP: Summary
• Decompose: Sync Activity → Sync Epochs
• Reconstruct: Sync Epochs + Perf Counters → Epochs @ Tgt.
• Aggregate: Epochs @ Tgt. → Predicted Total Time

Store Bursts
• Reasons
  – Zero initialization
  – Copying collectors
• Modeling Steps
  – Track how long the store queue is full
  – Add to the non-scaling component

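The modeling steps can be illustrated with a small Python sketch. It assumes that the time the store queue is full is simply added to a load-based non-scaling estimate such as CRIT's, and that the rest of the measured time scales with frequency. The function and parameter names are hypothetical and do not refer to real performance-counter names.

```python
def predict_time_with_burst(total_base_time, load_non_scaling,
                            store_queue_full_time, base_freq, target_freq):
    """Predict time at target_freq, counting store bursts as non-scaling.

    total_base_time:       measured time at the base frequency
    load_non_scaling:      non-scaling time estimated from loads
    store_queue_full_time: time during which the store queue was full
    """
    # Store-queue-full time is added to the non-scaling component.
    non_scaling = load_non_scaling + store_queue_full_time
    scaling = total_base_time - non_scaling
    return scaling * (base_freq / target_freq) + non_scaling


# Hypothetical numbers: 10 units measured, 2 non-scaling from loads,
# 2 more with the store queue full, frequency doubled.
print(predict_time_with_burst(10.0, 2.0, 2.0, 1.0, 2.0))  # 7.0
```

Leaving out the store-queue term in this example would predict 6.0 units instead of 7.0, an underestimate of the time at the higher frequency.
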
Methodology
• Jikes RVM 3.1.2
• Production collector (Immix)
• # GC threads = 2
• 2x min. heap
[Image label: Version 6.0]
• 4 cores, 1.0 GHz → 4.0 GHz
• 3-level cache hierarchy
• LLC fixed to 1.5 GHz
• DVFS settings for 22 nm Haswell
• Seven multithreaded benchmarks
• Four application threads

Accuracy
[Figure: bar chart of % average absolute error for M+CRIT, M+CRIT+BURST, and DEP+BURST at target frequencies of 2.0 GHz, 3.0 GHz, and 4.0 GHz; labeled values: 27%, 13%, 6%]
• Baseline Frequency = 1.0 GHz

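Energy Manager
[Figure: timeline starting at 4 GHz; after each quantum the manager selects a new frequency (New Freq1, New Freq2)]
• Input: tolerable_performance_degradation
• Quantum: 5 ms
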
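One way such a manager could use a performance predictor is sketched below in Python. The slide only shows the input (tolerable_performance_degradation) and the 5 ms quantum, so the selection policy here is an assumption: pick the lowest frequency whose predicted time stays within the tolerated slowdown relative to the maximum frequency. The function name, the predictor callback, and the example numbers are hypothetical.

```python
def choose_frequency(freqs_ghz, predict_time, max_freq_ghz,
                     tolerable_performance_degradation):
    """Return the lowest frequency within the tolerated slowdown.

    predict_time(f) gives the predicted time of the next quantum at
    frequency f; the slowdown is measured against max_freq_ghz.
    """
    limit = predict_time(max_freq_ghz) * (1 + tolerable_performance_degradation)
    for f in sorted(freqs_ghz):
        if predict_time(f) <= limit:
            return f
    return max_freq_ghz


freqs = [1.0, 2.0, 3.0, 4.0]

# Hypothetical memory-bound quantum measured at 4 GHz:
# 1 ms scales with frequency, 4 ms does not.
mem_bound = lambda f: 1.0 * (4.0 / f) + 4.0
print(choose_frequency(freqs, mem_bound, 4.0, 0.10))  # 3.0

# Hypothetical compute-bound quantum: 4 ms scales, 1 ms does not.
cpu_bound = lambda f: 4.0 * (4.0 / f) + 1.0
print(choose_frequency(freqs, cpu_bound, 4.0, 0.10))  # 4.0
```

In this toy setting the memory-bound quantum tolerates a lower frequency than the compute-bound one because most of its time does not scale.
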
Energy Savings
[Figure: bar chart of performance degradation and energy reduction (%) for memory-intensive and compute-intensive benchmarks]

Conclusions
• DEP+BURST: First predictor that accounts for
  – Application and service threads
  – Synchronization → inter-thread dependencies
  – Store bursts
• High accuracy
  – Less than 10% estimation error for seven Java benchmarks
• Negligible hardware cost
  – One extra performance counter
  – Minor book-keeping across epochs
• Demonstrated energy savings
  – 20% avg. for a 10% slowdown (memory-intensive Java applications)

Thank You!
Shoaib.Akram@elis.UGent.be