DVFS PERFORMANCE PREDICTION FOR MANAGED MULTITHREADED APPLICATIONS

DVFS PERFORMANCE PREDICTION FOR MANAGED MULTITHREADED APPLICATIONS
Shoaib Akram, Jennifer B. Sartor, Lieven Eeckhout
Ghent University, Belgium
Shoaib.Akram@elis.UGent.be

DVFS Performance Prediction
[Figure: performance vs. frequency; many applications fall in the memory-bound region]
• Sample at all DVFS states :(
• Estimate performance :)

Managed Multithreaded Applications

Background
Base Frequency
• t_base is the sum of
  – Scaling (S)
  – Non-Scaling (NS)
Target Frequency
• r = Base/Target
• S → S * r
• NS → No change
• t_target = (S*r) + NS
• Not simple
• OOO+MLP
[Figure: t_base timeline split into CPU and DRAM portions; time →]
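The two-component model on this slide can be made concrete with a few lines of arithmetic. The sketch below assumes times in arbitrary units and frequencies in GHz; the function name `predict_target_time` and the sample split (6 units scaling, 4 units non-scaling) are illustrative and do not come from the slides.

```python
def predict_target_time(scaling, non_scaling, base_freq, target_freq):
    """Predict execution time at target_freq from a base-frequency split.

    scaling     -- time that changes with core frequency (S)
    non_scaling -- time unaffected by core frequency, e.g. DRAM (NS)
    """
    r = base_freq / target_freq        # r = Base/Target
    return scaling * r + non_scaling   # t_target = (S*r) + NS


# Hypothetical split measured at 1.0 GHz: S = 6 units, NS = 4 units.
t_base = 6.0 + 4.0
t_target = predict_target_time(6.0, 4.0, base_freq=1.0, target_freq=2.0)
print(t_base, t_target)  # 10.0 7.0
```

Doubling the frequency halves only the scaling part, so the predicted speedup is well below 2X. Separating S from NS is the hard part on an out-of-order core with memory-level parallelism, which is what the "Not simple" bullet points at.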

State of the Art
• CRIT estimates non-scaling by
  – Measuring critical path through loads
  – Ignoring store operations
R. Miftakhutdinov, E. Ebrahimi, and Y. N. Patt. Predicting performance impact of DVFS for realistic memory systems. MICRO, 2012.

Multithreaded CRIT (M+CRIT)
[Figure: threads T0 and T1 at Base Frequency (t_base, time axis 0 to 1) and at Target Frequency 2X (t_target, time axis 0, 0.5, 1); the critical thread is marked]
• Use CRIT to identify each thread's non-scaling
• High error for multithreaded Java!

Sources of Inaccuracy in M+CRIT
[Figure: timelines of threads app0, app1, gc0, gc1 across the phases Application, Collection, Application, with busy wait and store burst regions marked]
Scaling or non-scaling?

Sources of Inaccuracy in M+CRIT
[Figure: timelines of threads app0, app1, gc0, gc1 across the phases Application, Collection, Application, with busy wait and store burst regions marked; regions are annotated DEP or BURST]
Scaling or non-scaling?

Our Contribution
DEP+BURST: A New DVFS Performance Predictor
[Figure: timelines of threads app0, app1, gc0, gc1 across the phases Application, Collection, Application, with busy wait and store burst regions annotated DEP or BURST]
Scaling or non-scaling?

Our Contribution
DEP+BURST: A New DVFS Performance Predictor

Example: Inter-thread Dependences
T0                          T1
while (cond0) { … }   (1)   while (cond1) { … }   (2)
Acquire(lock)               Acquire(lock)
wait ---                    crit_sec() …
crit_sec() …                Release(lock)         (3)
Release(lock)         (4)   wake
...                         ...
• Intercept synchronization activity
• Reconstruct execution at target frequency

Identifying Synchronization Epochs
[Figure: T0 and T1 timelines, panels Base Frequency and Target Frequency; Epoch #1: loop, loop; Epoch #2: wait(), crit_sec(), with one thread waiting; Epoch #3: wake(), crit_sec(); time ↓]

Identifying Synchronization Epochs
[Figure: T0 and T1 timelines, panels Base Frequency and Target Frequency, divided into Epoch #1, Epoch #2, Epoch #3; time ↓]

Identifying Synchronization Epochs
[Figure: T0 and T1 timelines, panels Base Frequency and Target Frequency; Epoch #1: 10, 10; Epoch #2: 10; Epoch #3: 10, 10; time ↓]
Total time = 30 units

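Reconstruction at Target Frequency
[Figure: Base Frequency panel: Epoch #1: 10, 10; Epoch #2: 10; Epoch #3: 10, 10. Target Frequency 2X panel, per-thread times from CRIT: Epoch #1: 7, 5; Epoch #2: 5; Epoch #3: 5, 5; values listed under columns T0, T1; time ↓]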

Reconstruction at Target Frequency
[Figure: Base Frequency panel: Epoch #1: 10, 10; Epoch #2: 10; Epoch #3: 10, 10. Target Frequency 2X panel: Epoch #1: 7, 5; Epoch #2: 3, 5; Epoch #3: 5, 5; values listed under columns T0, T1; time ↓]
Longest running in an epoch = 17 units
+ Zero book-keeping
- Not accurate

Reconstruction at Target Frequency
[Figure: Base Frequency panel: Epoch #1: 10, 10; Epoch #2: 10; Epoch #3: 10, 10; total = 30 units. Target Frequency 2X panel: Epoch #1: 7, 5; Epoch #2: 3, 5; Epoch #3: 5, 5; values listed under columns T0, T1; time ↓]
Critical thread across epochs = 15 units
+ Accurate
- Book-keeping
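The two aggregation strategies on these slides can be compared on the slide's own numbers. The sketch below is one possible reading of the figure, not the authors' implementation: it assumes T0 waits during epoch 2 while T1 runs its critical section and that T1 wakes T0 at the end of that epoch. The per-thread clock model, the `wakeups` list, and both function names are assumptions made for illustration.

```python
def longest_per_epoch(epochs):
    """Sum, over epochs, of the longest-running thread's time in that epoch."""
    return sum(max(active.values()) for active in epochs)


def critical_across_epochs(epochs, wakeups):
    """Track a per-thread clock so that slack carries over between epochs.

    epochs  -- list of {thread: predicted active time}, one dict per epoch
    wakeups -- list of [(waker, wakee), ...] applied at the end of each epoch
    """
    clock = {}
    for active, wakes in zip(epochs, wakeups):
        for thread, duration in active.items():
            clock[thread] = clock.get(thread, 0) + duration
        for waker, wakee in wakes:
            # A woken thread cannot resume before its waker reaches the wake point.
            clock[wakee] = max(clock.get(wakee, 0), clock[waker])
    return max(clock.values())


# Active times per epoch, read off the slide (T0 is assumed idle in epoch 2).
base = [{"T0": 10, "T1": 10}, {"T1": 10}, {"T0": 10, "T1": 10}]
target = [{"T0": 7, "T1": 5}, {"T1": 5}, {"T0": 5, "T1": 5}]
wakeups = [[], [("T1", "T0")], []]  # assumed: T1 wakes T0 after epoch 2

print(longest_per_epoch(base))                  # 30
print(longest_per_epoch(target))                # 17
print(critical_across_epochs(target, wakeups))  # 15
```

Under this reading, the per-epoch maximum charges T0's 7 units in epoch 1 even though T1 has already moved on after 5 units, which is where the extra 2 units come from. Carrying per-thread clocks across epochs removes that overestimate at the cost of the book-keeping the slide mentions.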

DEP: Summary
• Sync Activity → Decompose → Sync Epochs
• Sync Epochs + Perf Counters → Reconstruct → Epochs @ Tgt.
• Epochs @ Tgt. → Aggregate → Predicted Total Time

Store Bursts
• Reasons
  – Zero initialization
  – Copying collectors
• Modeling Steps
  – Track how long the store queue is full
  – Add to the non-scaling component
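The modeling steps above amount to enlarging the non-scaling component before applying the scaling model. The sketch below assumes the store-queue-full time is available as a single counter value in the same units as the base execution time, and that it does not overlap with the load-based non-scaling estimate; the function names and sample numbers are hypothetical.

```python
def non_scaling_with_bursts(load_non_scaling, store_queue_full_time, t_base):
    """Add store-queue-full time to the load-based non-scaling estimate."""
    # Clamp so the non-scaling part never exceeds the measured base time.
    return min(load_non_scaling + store_queue_full_time, t_base)


def predict_target_time(t_base, non_scaling, r):
    """t_target = (S * r) + NS, with S = t_base - NS and r = Base/Target."""
    scaling = t_base - non_scaling
    return scaling * r + non_scaling


# Hypothetical interval: 10 units at base frequency, 2 units of load-based
# non-scaling time, 3 units with the store queue full, target at 2X.
t_base, r = 10.0, 0.5
without_bursts = predict_target_time(t_base, 2.0, r)
with_bursts = predict_target_time(
    t_base, non_scaling_with_bursts(2.0, 3.0, t_base), r
)
print(without_bursts, with_bursts)  # 6.0 7.5
```

Ignoring the store burst makes the interval look more frequency-sensitive than it is, so the prediction at the higher frequency is too optimistic.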

Methodology
• Jikes RVM 3.1.2
• Production collector (Immix)
• # GC threads = 2
• 2x min. heap
• Seven multithreaded benchmarks
• Four application threads
[Logo: Version 6.0]
• 4 cores, 1.0 GHz → 4.0 GHz
• 3-level cache hierarchy
• LLC fixed to 1.5 GHz
• DVFS settings for 22 nm Haswell

Accuracy
[Figure: average absolute error (%) of M+CRIT, M+CRIT+BURST, and DEP+BURST at target frequencies 2.0 GHz, 3.0 GHz, and 4.0 GHz; labeled values: 27%, 13%, 6%]
Baseline Frequency = 1.0 GHz

Energy Manager
• Input: tolerable_performance_degradation
• Quantum: 5 ms
[Figure: timeline starting at 4 GHz and moving to New Freq1, then New Freq2]
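A per-quantum frequency decision driven by a performance predictor could look like the sketch below. The slide gives only the input (`tolerable_performance_degradation`), the 4 GHz starting point, and the 5 ms quantum; the selection policy (lowest frequency whose predicted slowdown stays within the bound), the function name `choose_frequency`, and the sample scaling/non-scaling split are assumptions.

```python
def choose_frequency(freqs, predict_time, max_freq, tolerable_degradation):
    """Return the lowest frequency whose predicted slowdown is tolerable.

    predict_time -- callable mapping a frequency (GHz) to predicted time
    """
    t_ref = predict_time(max_freq)
    for f in sorted(freqs):
        slowdown = predict_time(f) / t_ref - 1.0
        if slowdown <= tolerable_degradation:
            return f
    return max_freq


# Hypothetical 5 ms quantum measured at 4 GHz: 1 ms scaling, 4 ms non-scaling.
def predict_time(freq_ghz, scaling=1.0, non_scaling=4.0, base_freq=4.0):
    return scaling * (base_freq / freq_ghz) + non_scaling


new_freq = choose_frequency([1.0, 2.0, 3.0, 4.0], predict_time,
                            max_freq=4.0, tolerable_degradation=0.10)
print(new_freq)  # 3.0
```

With this memory-bound split, 3 GHz costs under 7% in predicted time, while 2 GHz would cost 20% and is rejected. The sketch treats the lowest acceptable frequency as the most energy-efficient one, which is a simplification.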

Energy Savings
[Figure: performance degradation and energy reduction (%, axis 0 to 25) for memory-intensive and compute-intensive benchmarks]

Conclusions
• DEP+BURST: First predictor that accounts for
  – Application and service threads
  – Synchronization → inter-thread dependencies
  – Store bursts
• High accuracy
  – Less than 10% estimation error for seven Java benchmarks
• Negligible hardware cost
  – One extra performance counter
  – Minor book-keeping across epochs
• Demonstrated energy savings
  – 20% average for a 10% slowdown (memory-intensive Java applications)

DVFS PERFORMANCE PREDICTION FOR MANAGED MULTITHREADED APPLICATIONS
Thank You!
Shoaib.Akram@elis.UGent.be
