Performance Interference on Multicore – PowerPoint PPT Presentation


An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors. Jiacheng Zhao, Institute of Computing Technology, CAS. In conjunction with Prof. Jingling Xue, UNSW, Australia.


SLIDE 1

An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors

Jiacheng Zhao, Institute of Computing Technology, CAS

In conjunction with Prof. Jingling Xue, UNSW, Australia. Sep 11, 2013

SLIDE 2

Problem – Resource Utilization in Datacenters

How?

[ASPLOS'09 by David Meisner+]

SLIDE 3

Problem – Resource Utilization in Datacenters

Figure: Co-located applications and their co-runners on a multicore processor, with per-core L1 caches, a shared cache, and the memory controller.

  • Co-located applications
  • Contention for shared cache, shared IMC, etc.
  • Negative and unpredictable interference
  • Two types of applications
  • Batch – No QoS guarantees
  • Latency Sensitive - Attain high QoS
  • Co-location is disabled
  • Low server utilization
  • Lacking the knowledge of interference
SLIDE 4

Problem – Resource Utilization in Datacenters

  • Co-located applications
  • Contention for shared cache, shared IMC, etc.
  • Negative and unpredictable interference
  • Two types of applications
  • Batch – No QoS guarantees
  • Latency Sensitive - Attain high QoS
  • Co-location is disabled
  • Low server utilization
  • Lacking the knowledge of interference

SLIDE 5

Problem – Resource Utilization in Datacenters

  • Co-located applications
  • Contention for shared cache, shared IMC, etc.
  • Negative and unpredictable interference
  • Two types of applications
  • Batch – No QoS guarantees
  • Latency Sensitive - Attain high QoS
  • Co-location is disabled
  • Low server utilization
  • Lacking the knowledge of interference

Figure: Task placement in datacenters [Micro'11 by Jason Mars+]

SLIDE 6

Our Goals: Predicting the interference

  • Quantitatively predict the cross-core performance interference
  • Applicable to arbitrary co-locations
  • Identify any “safe” co-locations
  • Deployable in datacenters

SLIDE 7

Our Intuition – Mining a model from large training data

Using machine learning approaches over a large training set

SLIDE 8

Motivating example

    PD_mcf =
      0.485·P_bw + 0.183·P_cache − 0.138,  if P_bw < 3.2
      0.706·P_bw + 1.725·P_cache − 0.220,  if 3.2 ≤ P_bw ≤ 9.6
      0.907·P_bw + 3.087·P_cache − 0.561,  if P_bw > 9.6
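To make the piecewise form concrete, here is a minimal sketch of that predictor as code (Python; the function and variable names are illustrative, and the coefficients are the ones shown above):

    def predict_degradation_mcf(p_cache, p_bw):
        """Predicted performance degradation of mcf from the aggregated
        shared-cache pressure (p_cache) and bandwidth pressure (p_bw)
        of its co-runners, using the piecewise model shown above."""
        if p_bw < 3.2:                       # sub-domain 1
            return 0.485 * p_bw + 0.183 * p_cache - 0.138
        elif p_bw <= 9.6:                    # sub-domain 2
            return 0.706 * p_bw + 1.725 * p_cache - 0.220
        else:                                # sub-domain 3
            return 0.907 * p_bw + 3.087 * p_cache - 0.561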

SLIDE 9

Outline

  • Introduction
  • Our Key Observations
  • Our Approach – Two-Phase Approach
  • Experimental Results
  • Conclusion

SLIDE 10

Our Key Observations

  • Observation 1: The function depends only on the aggregate pressure on shared resources, not on the individual pressure of any single co-runner.
  • For an application A: PD_A = f(P_cache, P_bw), where (P_cache, P_bw) = g(A1, A2, …, Am)
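Read as code, the observation says the co-runners enter only through an aggregate pressure vector. A minimal sketch (Python, illustrative only; the summing aggregation is an assumption, since the slides only posit some function g):

    from typing import Callable, Sequence, Tuple

    Pressure = Tuple[float, float]          # (p_cache, p_bw) of one co-runner

    def aggregate(co_runners: Sequence[Pressure]) -> Pressure:
        # g(A1, ..., Am): assumed here to be a simple sum of individual pressures
        return (sum(p for p, _ in co_runners), sum(b for _, b in co_runners))

    def predicted_degradation(f: Callable[[float, float], float],
                              co_runners: Sequence[Pressure]) -> float:
        p_cache, p_bw = aggregate(co_runners)
        return f(p_cache, p_bw)             # PD_A = f(P_cache, P_bw)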

SLIDE 11

Our Key Observations

  • Observation 2:
  • The function f is piecewise.

SLIDE 12

Our Key Observations

  • Naively, we could create A's prediction model with a brute-force approach
  • BUT we cannot apply the brute-force approach to every application!
  • Thousands of applications in one datacenter
  • Frequent software updates
  • Different generations of processors
  • Even the training steps for a single application are expensive
  • Observation 3:
  • The function form is platform-dependent but application-independent
  • Only the coefficients are application-dependent

SLIDE 13

Outline

  • Introduction
  • Our Key Observations
  • Our Approach - Two-Phase Approach
  • Experimental Results
  • Conclusion

SLIDE 14

Our Approach - Two-Phase Approach

Phase 1: Get the abstract model

  • Find the function form best suited to all applications on a given platform
  • Input: many training applications run against a co-running trainer
  • Heavy – many training workloads
  • Run once for one platform

    PD_A =
      a11·P_bw + a12·P_cache + a13,  sub-domain 1
      a21·P_bw + a22·P_cache + a23,  sub-domain 2
      a31·P_bw + a32·P_cache + a33,  sub-domain 3

Phase 2: Instantiate the abstract model

  • Determine the application-specific coefficients (a11, etc.)
  • Input: one application run against the co-running trainer
  • Lightweight – only a small number of training runs
  • Run once for one application

    PD_mcf =
      0.49·P_bw + 0.18·P_cache − 0.13,  if P_bw < 3.2
      0.71·P_bw + 1.73·P_cache − 0.22,  if 3.2 ≤ P_bw ≤ 9.6
      0.91·P_bw + 3.09·P_cache − 0.56,  if P_bw > 9.6
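Below is a minimal sketch (Python with NumPy; not the authors' code) of how the two phases could be realized with least-squares regression: phase 1 searches a set of candidate sub-domain boundaries on P_bw for the form that fits all training applications best, and phase 2 keeps that form fixed and fits only one application's coefficients. The helper names, the boundary search, and the use of ordinary least squares are assumptions for illustration.

    import numpy as np

    def fit_coeffs(samples, boundaries):
        """Fit (a1, a2, a3) per sub-domain; samples are (p_bw, p_cache, degradation)."""
        edges = [-np.inf] + list(boundaries) + [np.inf]
        coeffs = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            pts = [s for s in samples if lo <= s[0] < hi]
            if not pts:                      # no training point in this sub-domain
                coeffs.append(np.zeros(3))
                continue
            X = np.array([[b, c, 1.0] for b, c, _ in pts])
            y = np.array([d for _, _, d in pts])
            coeffs.append(np.linalg.lstsq(X, y, rcond=None)[0])
        return coeffs

    def predict(coeffs, boundaries, p_bw, p_cache):
        for (a1, a2, a3), hi in zip(coeffs, list(boundaries) + [np.inf]):
            if p_bw < hi:
                return a1 * p_bw + a2 * p_cache + a3

    def phase1_abstract_model(training_apps, candidate_boundaries):
        """Phase 1 (once per platform): pick the sub-domain boundaries that give
        the lowest total regression error over all training applications."""
        def total_error(boundaries):
            err = 0.0
            for samples in training_apps.values():
                cs = fit_coeffs(samples, boundaries)
                err += sum((predict(cs, boundaries, b, c) - d) ** 2
                           for b, c, d in samples)
            return err
        return min(candidate_boundaries, key=total_error)

    def phase2_instantiate(app_samples, boundaries):
        """Phase 2 (once per application): fit only this application's coefficients
        for the fixed abstract form, using a handful of co-running training runs."""
        return fit_coeffs(app_samples, boundaries)

For example, phase1_abstract_model(training_apps, [(3.2, 9.6), (2.0, 8.0)]) would keep whichever boundary pair fits the training set best, and phase2_instantiate then reuses that pair for every new application.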

SLIDE 15

Our Approach - Two-Phase Approach

SLIDE 16

Our Approach - Two-Phase Approach

  • Q1: What are selected as the application features?
  • Q2: How is the abstract model created?
  • Q3: What is the cost of the training?

SLIDE 17

Our Approach – Some Key Points

  • Q1: What are selected as the application features?
  • Runtime profiles
  • Shared cache consumption
  • Bandwidth consumption (see the profiling sketch below)
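As an aside on how such features might be gathered in practice, here is a sketch assuming a Linux host with perf; the chosen events, the 64-byte-line bandwidth estimate, and all names are illustrative assumptions, not necessarily how the authors profiled:

    import subprocess

    def profile_pressures(cmd, seconds=30):
        """Rough proxies for shared-cache and bandwidth consumption of `cmd`."""
        result = subprocess.run(
            ["perf", "stat", "-x", ",", "-e", "cache-references,cache-misses",
             "--", "timeout", str(seconds)] + cmd,
            capture_output=True, text=True)
        counts = {}
        for line in result.stderr.splitlines():      # perf stat reports on stderr
            fields = line.split(",")
            if len(fields) > 2 and fields[0].strip().isdigit():
                counts[fields[2]] = int(fields[0])
        p_cache = counts.get("cache-references", 0) / seconds   # cache accesses per second
        p_bw = counts.get("cache-misses", 0) * 64 / seconds     # bytes/s from memory (rough)
        return p_cache, p_bw

    # e.g. profile_pressures(["./my_app"]) -> (p_cache, p_bw) feature pair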

SLIDE 18

Our Approach – Some Key Points

  • Q2: How to create the abstract model?
  • Regression analysis
  • Configurable
  • Each configuration binding to a function form
  • Searching for the best function form for all applications in the training set (see the search sketch below)
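A sketch of that search (Python; the option lists mirror the configuration choices shown in the backup slides, while fit_error is a hypothetical helper that fits the form named by a configuration to every training application and returns its total regression error):

    import itertools

    # Each configuration binds to one candidate function form; keep the form
    # with the lowest total regression error over the training set.
    PRE_PROCESSING = ["none", "exp(2)", "log(2)", "pow(2)"]
    MODES = ["add", "mul"]
    PARTITIONINGS = ["(Pbw, equal(4))", "(Pcache, equal(4))", "(Pcache+Pbw, equal(4,4))"]
    FUNCTIONS = ["linear", "polynomial(2)"]

    def search_best_form(training_apps, fit_error):
        configs = itertools.product(PRE_PROCESSING, MODES, PARTITIONINGS, FUNCTIONS)
        return min(configs, key=lambda cfg: fit_error(cfg, training_apps))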

SLIDE 19

Our Approach – Some Key Points

  • Q3: What is the cost of the training during instantiation?
  • Cover all sub-domains of the piecewise function, say S
  • A constant number of points for each sub-domain, say C
  • The constant depends on the form of the abstract model
  • C × S training runs in total
  • Usually C and S are small; in our experience C = 4 and S = 3, i.e. 12 training runs

SLIDE 20

Outline

  • Introduction
  • Our Key Observations
  • Our Approach - Two-Phase Approach
  • Experimental Results
  • Conclusion

SLIDE 21

Experimental Results

  • Accuracy of our two-phase regression approach
  • Prediction precision
  • Error analysis
  • Deployment in a datacenter
  • Utilization gained
  • QoS enforced and violated

SLIDE 22

Experimental Results

  • Benchmarks:
  • SPEC2006
  • Nine real-world datacenter applications
  • Nlp-mt, openssl, openclas, MR-iindex, etc.
  • Platforms:
  • Intel quad-core Xeon E5506 (main)
  • Datacenter:
  • 300 quad-core Xeon E5506 machines

SLIDE 23

Some Predictor Functions

SLIDE 24

Prediction precision for SPEC Benchmarks

  • Prediction Error: Average 0.2%, ranging from 0.0% to 8.6%.

SLIDE 25

Prediction precision for datacenter applications

  • 15 workloads for each datacenter application

  • Prediction Error: Average 0.3%, ranging from 0.0% to 5%.

SLIDE 26

Error Distribution

Figure: Distribution of prediction errors (errors between 0.00% and 4.00%).

SLIDE 27

Prediction Efficiency

  • Precision
  • Two-Phase: 0.0%–11.7%, Average: 0.40%
  • Brute-Force: 0.0%–10.1%, Average: 0.23%
  • Efficiency
  • Co-running training runs: ~200 → 12

Figure: Performance degradation of 20 workloads – Real vs. Two-Phase vs. Brute-Force.

SLIDE 28

Benefits of piecewise predictor functions

SLIDE 29

Benefits of piecewise predictor functions

SLIDE 30

Deployment in a datacenter

  • 300 quad-core Xeons
  • 1200 tasks when fully occupied
  • Applications
  • Latency-sensitive: Nlp-mt (machine translation)
  • 600 dedicated cores, 2 per chip
  • Batch jobs
  • 600 tasks: kmeans, MR
  • Our Purpose
  • QoS policy
  • Issue batch jobs to idle cores (see the co-location sketch below)
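How the prediction could back such a co-location decision, as a sketch (Python; the 10% QoS threshold and all names are assumptions for illustration, not the policy used in the deployment):

    QOS_DEGRADATION_LIMIT = 10.0   # assumed limit: at most 10% predicted slowdown

    def safe_to_colocate(predict_degradation, current_corunners, candidate):
        """predict_degradation: the latency-sensitive task's instantiated model,
        mapping aggregated (p_cache, p_bw) pressure to % degradation.
        current_corunners / candidate: (p_cache, p_bw) pressure profiles."""
        p_cache = sum(p for p, _ in current_corunners) + candidate[0]
        p_bw = sum(b for _, b in current_corunners) + candidate[1]
        return predict_degradation(p_cache, p_bw) <= QOS_DEGRADATION_LIMIT

    # e.g. issue a batch task to an idle core only if
    # safe_to_colocate(nlp_mt_model, corunners_on_that_chip, batch_task_pressure)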
SLIDE 31
  • Cross-platform applicability
  • Six-core Intel Xeon

Figure: Performance degradation per workload (Real vs. Predicted) on the six-core Intel Xeon.

  • Prediction Error: Average 0.1%, ranging from 0.0% to 10.2%.

SLIDE 32
  • Cross-platform applicability
  • Quad-core AMD

  • Prediction Error: Average 0.3%, ranging from 0.0% to 5.1%.

Figure: Performance degradation per workload (Real vs. Predicted) on the quad-core AMD platform.

SLIDE 33

Outline

  • Introduction
  • Our Key Observations
  • Our Approach - Two-Phase Approach
  • Experimental Results
  • Conclusion

SLIDE 34

Conclusion

  • An empirical model, based on our key observations
  • Using aggregated resource consumption to create the predictor function, thus working for arbitrary co-locations
  • A piecewise function form is reasonable and effective
  • Breaking the model creation into two phases, for efficiency

SLIDE 35

SLIDE 36

Backup slides

  • How to make the training set representative?
  • Partition the space into grid cells
  • Sample from each cell (see the sampling sketch below)
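A sketch of that sampling (Python; the grid granularity, the per-cell count, and the pressure units are assumptions for illustration):

    import random
    from collections import defaultdict

    def grid_sample(workloads, cache_step=1.0, bw_step=2.0, per_cell=1, seed=0):
        """Bucket candidate workloads into a grid over (p_cache, p_bw) pressure
        and keep a few per non-empty cell, so the training set covers the space.
        workloads: list of (name, p_cache, p_bw)."""
        cells = defaultdict(list)
        for name, p_cache, p_bw in workloads:
            cells[(int(p_cache // cache_step), int(p_bw // bw_step))].append(name)
        rng = random.Random(seed)
        return [name
                for members in cells.values()
                for name in rng.sample(members, min(per_cell, len(members)))]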

SLIDE 37

Backup slides

  • How to do domain partitioning?
  • Specified in a configuration file
  • Syntax: (shared resource_i, condition_i), e.g. (Pbw, equal(4))
  • Empirical knowledge is used to perform this task

#Aggregation
#Pre-Processing: none, exp(2), log(2), pow(2)
#mode: add, mul
#Domain Partitioning: {((Pbw), equal(4)), ((Pcache), equal(4)), ((Pcache, Pbw), equal(4, 4))}
#Function: linear, polynomial(2)
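For illustration, one plausible reading (an assumption, not spelled out in the slides) of a condition like equal(4) is an equal-width split of the observed pressure range into four sub-domains:

    # Assumed semantics of (Pbw, equal(4)): split the observed Pbw range into
    # four equal-width sub-domains, each later fitted with its own coefficients.
    def equal_partition(lo, hi, n=4):
        width = (hi - lo) / n
        return [(lo + i * width, lo + (i + 1) * width) for i in range(n)]

    # equal_partition(0.0, 12.8) -> [(0.0, 3.2), (3.2, 6.4), (6.4, 9.6), (9.6, 12.8)]
    # (up to floating-point rounding)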

SLIDE 38

Backup slides

  • Two sources of error:
  • Estimation of shared-resource consumption
  • L2 LinesIn
  • Phase behavior of applications