DVFS PERFORMANCE PREDICTION FOR MANAGED MULTITHREADED APPLICATIONS



SLIDE 1

DVFS PERFORMANCE PREDICTION FOR MANAGED MULTITHREADED APPLICATIONS

Shoaib Akram, Jennifer B. Sartor, Lieven Eeckhout Ghent University, Belgium Shoaib.Akram@elis.UGent.be

SLIDE 2

DVFS Performance Prediction

  • Sample at all DVFS states ☹
  • Estimate performance ☺

[Figure: performance plotted against frequency; annotations: "memory bound", "many applications here"]

SLIDE 3

Managed Multithreaded Applications

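SLIDE 4

Background

[Figure: execution timelines at the base frequency and the target frequency, split into CPU and DRAM portions; the base timeline is labeled t_base; horizontal axis: time →]

  • t_base is the sum of
    – Scaling (S)
    – Non-Scaling (NS)
  • r = Base/Target
  • S → S * r
  • NS → No change
  • t_target = (S * r) + NS
  • Not simple: out-of-order execution (OOO) + memory-level parallelism (MLP)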
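The scaling model on this slide can be sketched in a few lines. This is a minimal illustration, assuming S and NS have already been measured at the base frequency; the function name `predict_time` and the example numbers are hypothetical, not taken from the talk.

```python
def predict_time(scaling: float, non_scaling: float,
                 base_ghz: float, target_ghz: float) -> float:
    """Apply t_target = (S * r) + NS with r = Base/Target."""
    r = base_ghz / target_ghz
    return scaling * r + non_scaling


# Hypothetical run: 6 time units scale with frequency, 4 do not.
t_base = 6.0 + 4.0
t_target = predict_time(6.0, 4.0, base_ghz=1.0, target_ghz=2.0)
print(t_base, t_target)  # 10.0 7.0
```

Doubling the frequency halves only the scaling part, so the predicted run takes 7 units rather than 5.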

SLIDE 5

State of the Art

  • CRIT estimates non-scaling by
    – Measuring critical path through loads
    – Ignoring store operations

  • R. Miftakhutdinov, E. Ebrahimi, and Y. N. Patt. Predicting performance impact of DVFS for realistic memory systems. MICRO, 2012.

SLIDE 6

Multithreaded CRIT (M+CRIT)

  • Use CRIT to identify each thread's non-scaling
  • High error for multithreaded Java!

[Figure: timelines of threads T0 and T1 at the base frequency (t_base) and at the target frequency (t_target); labels: 2X; 1, 0.5, 1; "critical"; horizontal axis: time →]

SLIDE 7

Sources of Inaccuracy in M+CRIT

[Figure: timeline of threads app0, app1, gc0, and gc1 across Application, Collection, and Application phases; annotations: busy wait, store burst]

Scaling or non-scaling?

SLIDE 8

Sources of Inaccuracy in M+CRIT

[Figure: timeline of threads app0, app1, gc0, and gc1 across Application, Collection, and Application phases; annotations: busy wait, store burst; overlay labels: BURST (once), DEP (five times)]

Scaling or non-scaling?

SLIDE 9

Our Contribution

DEP+BURST: A New DVFS Performance Predictor

[Figure: timeline of threads app0, app1, gc0, and gc1 across Application, Collection, and Application phases; annotations: busy wait, store burst; overlay labels: BURST (once), DEP (five times)]

Scaling or non-scaling?

SLIDE 10

Our Contribution

DEP+BURST: A New DVFS Performance Predictor

SLIDE 11

Example: Inter-thread Dependences

T0:
    while (cond0) { … }
    Acquire(lock)
    crit_sec()
    …
    Release(lock)
    ...

T1:
    while (cond1) { … }
    Acquire(lock)
    crit_sec()
    …
    Release(lock)
    ...

[Figure annotations: numbered markers 1 to 4; a "wait --- wake" link]

  • Intercept synchronization activity
  • Reconstruct execution at target frequency

SLIDE 12

Identifying Synchronization Epochs

[Figure: timelines of T0 and T1 at the base frequency and the target frequency; segment labels: loop, crit_sec(), wait; wait() and wake() events mark Epoch #1, Epoch #2, and Epoch #3; horizontal axis: time]

SLIDE 13

Identifying Synchronization Epochs

[Figure: timelines of T0 and T1 at the base frequency and the target frequency, divided into Epoch #1, Epoch #2, and Epoch #3; horizontal axis: time]

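SLIDE 14

Identifying Synchronization Epochs

[Figure: timelines of T0 and T1 at the base frequency and the target frequency, divided into Epoch #1, Epoch #2, and Epoch #3; base-frequency segment labels: 10, 10, 10, 10, 10; total = 30 units; horizontal axis: time]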

SLIDE 15

Reconstruction at Target Frequency

[Figure: timelines of T0 and T1 over Epochs #1 to #3; base-frequency segment labels: 10, 10, 10, 10, 10; target frequency is 2X; target-frequency segment labels from CRIT: 5, 7, 5, 5, 5; horizontal axis: time]

SLIDE 16

Reconstruction at Target Frequency

[Figure: timelines of T0 and T1 over Epochs #1 to #3; base-frequency segment labels: 10, 10, 10, 10, 10; target frequency is 2X; target-frequency segment labels: 5, 7, 5, 3, 5, 5; predicted total = 17 units; horizontal axis: time]

Longest running in an epoch
  + Zero book-keeping
  – Not accurate
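The "longest running in an epoch" aggregation can be sketched as follows. This is only an illustration under stated assumptions: the per-thread breakdown inside each epoch is hypothetical, chosen so that the epoch maxima are 5, 7, and 5 and the total matches the 17 units shown on the slide. The function name `predict_longest_per_epoch` is likewise an assumption.

```python
from typing import Dict, List


def predict_longest_per_epoch(epochs: List[Dict[str, float]]) -> float:
    """Sum, over all epochs, the predicted time of the longest-running thread.

    Each epoch maps a thread name to that thread's predicted active time at
    the target frequency. No state is carried from one epoch to the next.
    """
    return sum(max(per_thread.values()) for per_thread in epochs)


# Hypothetical per-thread predictions at the target frequency.
epochs_at_target = [
    {"T0": 5, "T1": 5},  # Epoch #1
    {"T0": 3, "T1": 7},  # Epoch #2
    {"T0": 5, "T1": 5},  # Epoch #3
]
print(predict_longest_per_epoch(epochs_at_target))  # 17
```

Because each epoch is treated in isolation, this variant needs no book-keeping, which is also why it can overestimate: slack a thread gained in one epoch is never credited in the next.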

SLIDE 17

Reconstruction at Target Frequency

[Figure: timelines of T0 and T1 over Epochs #1 to #3; base-frequency segment labels: 10, 10, 10, 10, 10; base total = 30 units; target frequency is 2X; target-frequency segment labels: 5, 7, 5, 5, 5, 3; predicted total = 15 units; horizontal axis: time]

Critical thread across epochs
  + Accurate
  – Book-keeping

SLIDE 18

DEP: Summary

  • Decompose: Sync Activity → Sync Epochs
  • Reconstruct: Sync Epochs + Perf Counters → Epochs @ Tgt.
  • Aggregate: Epochs @ Tgt. → Predicted Total Time

SLIDE 19

Our Contribution

DEP+BURST: A New DVFS Performance Predictor

SLIDE 20

Our Contribution

DEP+BURST: A New DVFS Performance Predictor

SLIDE 21

Store Bursts

  • Reasons
    – Zero initialization
    – Copying collectors
  • Modeling Steps
    – Track how long the store queue is full
    – Add to the non-scaling component
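A minimal sketch of the two modeling steps, assuming a counter that reports how long the store queue was full and a CRIT-style estimate of non-scaling time from loads. All names and numbers are hypothetical, and clamping the non-scaling time to t_base is an added safeguard rather than something the slide states.

```python
def predict_time_with_bursts(t_base: float, crit_non_scaling: float,
                             store_queue_full: float,
                             base_ghz: float, target_ghz: float) -> float:
    """Add store-queue-full time to the non-scaling component, then rescale."""
    # Step 2 of the slide: store-burst time joins the non-scaling component.
    # Assumption: clamp so the non-scaling part never exceeds the total.
    non_scaling = min(crit_non_scaling + store_queue_full, t_base)
    scaling = t_base - non_scaling
    return scaling * (base_ghz / target_ghz) + non_scaling


# Hypothetical interval: 10 units at 1 GHz, 2 units of load-bound time,
# 2 units with the store queue full; predict for 2 GHz.
print(predict_time_with_bursts(10.0, 2.0, 0.0, 1.0, 2.0))  # 6.0, bursts ignored
print(predict_time_with_bursts(10.0, 2.0, 2.0, 1.0, 2.0))  # 7.0, bursts counted
```

Ignoring the burst treats those 2 units as scaling time and therefore predicts a shorter run at the higher frequency.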

SLIDE 22

Methodology

  • Jikes RVM 3.1.2
  • Production collector (Immix)
  • # GC threads = 2
  • 2x min. heap
  • Seven multithreaded benchmarks
  • Four application threads
  • 4 cores, 1.0 GHz → 4.0 GHz
  • 3-level cache hierarchy
  • LLC fixed to 1.5 GHz
  • DVFS settings for 22 nm Haswell

Version 6.0

SLIDE 23

Accuracy

[Figure: bar chart of % average absolute error (axis ticks 10, 20, 30) at target frequencies 2.0 GHz, 3.0 GHz, and 4.0 GHz for M+CRIT, M+CRIT+BURST, and DEP+BURST; baseline frequency = 1.0 GHz; callouts: 27%, 13%, 6%]
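The slide does not define the error metric, so the following is a sketch under one common reading: the absolute relative error of the predicted versus measured execution time, averaged over benchmarks. The function name and sample values are hypothetical.

```python
from typing import Sequence


def average_absolute_error(predicted: Sequence[float],
                           measured: Sequence[float]) -> float:
    """Mean of |predicted - measured| / measured, in percent."""
    errors = [100.0 * abs(p - m) / m for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)


# Hypothetical execution times for two benchmarks at one target frequency.
print(average_absolute_error([9.0, 22.0], [10.0, 20.0]))  # 10.0
```

Taking the absolute value keeps under-predictions and over-predictions from cancelling out in the average.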

SLIDE 24

Energy Manager

[Figure: timeline divided into quanta of 5 ms; frequency labels: 4 GHz, New Freq1, New Freq2; input: tolerable_performance_degradation]
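One way such a manager could use the predictor is sketched below. It assumes a policy the slide does not spell out: at each quantum, pick the lowest candidate frequency whose predicted time stays within `tolerable_performance_degradation` of the predicted time at the highest frequency. The function name, the candidate list, and the S/NS inputs are hypothetical.

```python
from typing import Sequence


def pick_frequency(scaling: float, non_scaling: float, base_ghz: float,
                   candidates_ghz: Sequence[float],
                   tolerable_performance_degradation: float) -> float:
    """Return the lowest frequency whose predicted slowdown is tolerable."""
    def predict(target_ghz: float) -> float:
        # Same model as the Background slide: t_target = (S * r) + NS.
        return scaling * (base_ghz / target_ghz) + non_scaling

    f_max = max(candidates_ghz)
    limit = predict(f_max) * (1.0 + tolerable_performance_degradation)
    for f in sorted(candidates_ghz):
        if predict(f) <= limit:
            return f
    return f_max  # Fallback, reached only with a negative tolerance.


freqs = [1.0, 2.0, 3.0, 4.0]
# Hypothetical 5 ms quantum measured at 4 GHz, mostly memory-bound.
print(pick_frequency(1.0, 4.0, 4.0, freqs, 0.10))  # 3.0
# Hypothetical 5 ms quantum measured at 4 GHz, mostly compute-bound.
print(pick_frequency(4.5, 0.5, 4.0, freqs, 0.10))  # 4.0
```

The memory-bound quantum can drop to 3 GHz within a 10% slowdown budget, while the compute-bound one has to stay at 4 GHz.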

SLIDE 25

Energy Savings

[Figure: chart of % performance degradation and energy reduction (axis ticks 5, 10, 15, 20, 25) for memory-intensive and compute-intensive benchmarks]

SLIDE 26

Conclusions

  • DEP+BURST: First predictor that accounts for
    – Application and service threads
    – Synchronization → inter-thread dependencies
    – Store bursts
  • High accuracy
    – Less than 10% estimation error for seven Java benchmarks
  • Negligible hardware cost
    – One extra performance counter
    – Minor book-keeping across epochs
  • Demonstrated energy savings
    – 20% avg. for a 10% slowdown (mem-intensive Java apps.)

SLIDE 27

Thank You!

Shoaib.Akram@elis.UGent.be