

SLIDE 1

Evaluating Benchmark Subsetting Approaches

Joshua J. Yi (1), Resit Sendag (2), Lieven Eeckhout (3), Ajay Joshi (4), David J. Lilja (5), Lizy K. John (4)

(1) Freescale Semiconductor, (2) University of Rhode Island, (3) Ghent University, Belgium, (4) University of Texas at Austin, (5) University of Minnesota

IISWC — Oct 26, 2006

SLIDE 2
  • J. Yi et al.

Freescale, Rhode Island, Ghent, Texas, Minnesota

Introduction

  • Architects often select specific benchmarks to:

– Reduce simulation time
– Focus on specific characteristics (e.g., memory behavior)
– Build a benchmark suite

  • Key challenge for selecting or subsetting benchmarks is:

– To select a representative subset

SLIDE 3

Benchmark Subsetting Approaches

  • Popular/emerging subsetting approaches include:

– By principal component analysis (PCA)
– By performance bottlenecks (Plackett and Burman)
– By percentage of floating-point instructions (integer vs. floating-point)
– Compute-bound or memory-bound
– By programming language
– Randomly

  • But, which approach:

– Produces the most accurate subset for a given subset size?

  • Absolute accuracy vs. relative accuracy

– Produces the most accurate subset with the least profiling cost?
– Most efficiently covers the space of benchmark characteristics?

SLIDE 4

Benchmark Subsetting Approach #1

  • By principal component analysis (PCA):

– Profile benchmarks to collect program characteristics

  • Instruction mix, amount of ILP, I/D footprints, data stream strides, etc.

– Remove correlation between characteristics using Principal Component Analysis

  • Principal components are linear combinations of the original characteristics

  • For more information on PCA, see [Eeckhout et al., PACT 2002]

– Cluster the benchmarks into N clusters based on their principal components
– Select one representative benchmark from each cluster to form the subset

SLIDE 5

Removing Correlation using PCA

– Remove correlation between program characteristics
– Principal components (PCs) are linear combinations of the original characteristics
– Var(PC1) ≥ Var(PC2) ≥ …, so later PCs are less important for explaining the variation
– Reduce the number of variables: throw away PCs with negligible variance

[Figure: two correlated variables, Variable 1 and Variable 2, with the PC1 and PC2 axes overlaid]

PC1 = a11·x1 + a12·x2 + a13·x3 + …
PC2 = a21·x1 + a22·x2 + a23·x3 + …
PC3 = a31·x1 + a32·x2 + a33·x3 + …
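These properties can be checked numerically. The sketch below uses two synthetic, correlated variables (illustrative data, not from the paper) and computes the PCs via an eigendecomposition of the covariance matrix; the coefficients a_ij above correspond to the eigenvector entries:

```python
import numpy as np

# Two correlated synthetic variables standing in for program characteristics
rng = np.random.default_rng(1)
x1 = rng.normal(size=1000)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=1000)
X = np.column_stack([x1, x2])
X -= X.mean(axis=0)                      # center each variable

# The PC coefficients a_ij are the eigenvectors of the covariance matrix
cov = np.cov(X, rowvar=False)
var, A = np.linalg.eigh(cov)             # eigh returns ascending eigenvalues
A = A[:, ::-1]                           # reorder so Var(PC1) >= Var(PC2)

pcs = X @ A                              # PC_i = a_i1*x1 + a_i2*x2
print(np.var(pcs[:, 0]), np.var(pcs[:, 1]))
```

The resulting PC scores are uncorrelated and ordered by decreasing variance, which is why the low-variance PCs can be discarded.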

SLIDE 6

Clustering using k-means, Part 1

K-means clustering algorithm (cluster analysis):
Step 1: Randomly select K cluster centroids
Step 2: Assign each benchmark to the nearest cluster centroid

SLIDE 7

Clustering using k-means, Part 2

Step 3: Recalculate the centroids, and repeat Steps 2 and 3 until the algorithm converges
Step 4: Choose the representative programs that are closest to the centroids of their clusters
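Steps 1–4 can be sketched as follows; this is a minimal illustrative k-means, not the actual tool the authors used, and the 2-D points and cluster count are made up for the example:

```python
import numpy as np

def kmeans_subset(points, k, iters=100, seed=0):
    """Cluster benchmarks (rows of `points`) and return one representative
    per cluster: the benchmark closest to its cluster centroid."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select K benchmarks as the initial centroids
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each benchmark to the nearest centroid
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate centroids; stop once they no longer move
        new = np.array([points[labels == c].mean(axis=0)
                        if (labels == c).any() else centroids[c]
                        for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Step 4: pick the benchmark closest to each centroid
    reps = [int(np.linalg.norm(points - centroids[c], axis=1).argmin())
            for c in range(k)]
    return labels, reps

# Two well-separated groups of "benchmarks" in a 2-D characteristic space
pts = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.0, 10.2]])
labels, reps = kmeans_subset(pts, 2)
```

With well-separated groups, each representative is simply the member nearest its cluster's centroid.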

SLIDE 8

Benchmark Subsetting Approach #2

  • By performance bottlenecks (Plackett and Burman – P&B)

– Use a P&B design to quantify the magnitude (in CPI) of every performance bottleneck in the processor and memory subsystem

  • Rank microarchitecture parameters based on their impact on overall performance

  • For more information on the P&B design, see [Yi et al., HPCA 2003]

– Cluster the benchmarks into N clusters based on:

  • Rank of magnitudes
  • Magnitudes
  • Percentage of CPI variation due to single bottlenecks
  • Percentage of CPI variation due to single bottlenecks and all interactions

– Bottlenecks can be determined

  • Per benchmark
  • Across all benchmarks

– Select one benchmark from each cluster to form the subset
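As a sketch of the P&B idea (not the paper's actual 88-configuration design): for run counts that are a power of two, a two-level screening design can be taken from a Hadamard matrix, and each parameter's effect on CPI is the difference between its mean CPI at the high and low settings. The 7 parameters and CPI values below are hypothetical:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

# 8-run, 7-factor two-level screening design (columns are +1/-1 settings)
design = hadamard(8)[:, 1:]

# Hypothetical CPIs from 8 simulator runs, one per design row
cpi = np.array([1.9, 1.2, 1.5, 1.1, 2.4, 1.3, 1.6, 1.2])

# Effect of factor j = mean CPI at the +1 setting minus mean CPI at -1
effects = design.T @ cpi / (len(cpi) / 2)
ranking = np.argsort(-np.abs(effects))   # largest bottleneck first
print(ranking)
```

Benchmarks can then be clustered on their per-benchmark effect vectors, by rank or by magnitude, as the slide describes.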

SLIDE 9

Benchmark Subsetting Approaches #3 – #6

  • By percentage of floating-point instructions (integer vs. floating-point)

– SPECint vs. SPECfp

  • Compute-bound vs. memory-bound

– Compute-bound: less than 6% L1 D$ miss rate for a 32 KB cache
  • By programming language

– C vs. FORTRAN

  • Randomly

– Randomly choose benchmarks from each group
– Form 30 different subsets for each group and report the average results
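The random approach (30 subsets per size, averaged) can be sketched as follows, assuming a hypothetical list of per-benchmark CPIs and arithmetic-mean aggregation:

```python
import random
import statistics

def random_subset_error(cpis, subset_size, trials=30, seed=0):
    """Average absolute CPI error over `trials` random subsets, versus
    the mean CPI of the full suite."""
    rng = random.Random(seed)
    full_mean = statistics.mean(cpis)
    errors = []
    for _ in range(trials):
        subset = rng.sample(cpis, subset_size)
        errors.append(abs(statistics.mean(subset) - full_mean) / full_mean * 100)
    return statistics.mean(errors)

cpis = [0.6, 1.1, 1.4, 2.3, 0.9, 3.0]    # made-up per-benchmark CPIs
print(random_subset_error(cpis, 2))
```

A subset containing the whole suite gives zero error by construction, while tiny subsets can miss badly, which is the variability the slides report.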

SLIDE 10

Benchmark Subsetting Approach #7

  • High-frequency

– The de facto approach used by computer architects
– Form subsets in descending order of frequency of use ("f-use") [Citron, ISCA 2003 panel]
  • Choose most frequently used benchmark when subset size is 1
  • Choose the two most frequently used benchmarks when the subset size is 2

  • etc.
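The high-frequency approach is a straightforward sort by usage count. The counts below are hypothetical, standing in for how often each benchmark appears in published papers:

```python
from collections import Counter

# Hypothetical counts of how often each benchmark is used in papers
use_counts = Counter({"gzip": 38, "gcc": 45, "mcf": 41, "art": 22, "equake": 19})

def frequency_subset(counts, size):
    """Take the `size` most frequently used benchmarks."""
    return [name for name, _ in counts.most_common(size)]

print(frequency_subset(use_counts, 3))
```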
SLIDE 11

Methodology and Experimental Setup

  • PCA profiling: ATOM
  • Simulator:

– SMARTS simulation framework (based on SimpleScalar)

  • U=1000 instructions, W=2000 instructions
  • 99.7% confidence level, ±3% error

– P&B profiling: Added user-configurable latencies and throughputs

  • Benchmark information

– All SPEC CPU 2000 benchmarks and input sets

  • Except vpr-place and perlbmk-perfect, which crash SMARTS

– "Benchmark" is used synonymously with "benchmark-input pair"

  • Processor configurations:

– 4 4-way issue configurations and 4 8-way issue configurations
– For each issue width, the configurations span a range of design points

SLIDE 12

Quantifying Representativeness

  • Absolute accuracy

– Important when extrapolating the subset's results to predict the performance of the entire suite
– Error in estimated CPI or EDP when using the subset vs. the full suite

  • Relative accuracy

– Important when comparing alternative designs during early design space exploration studies
– Error in estimated speedup when using the subset vs. the full suite

  • Coverage of the workload space

– Important when selecting a subset of programs while designing a benchmark suite
– Measured as the total minimum Euclidean distance: the distance from each benchmark's characteristics to the nearest benchmark in the subset, summed over all benchmarks

SLIDE 13

Absolute CPI Accuracy, Part 1

[Chart: Percentage CPI Error vs. Number of Benchmarks in Each Subset (1–45); series: PCA (7PCs), PB (Interaction across, 05D), Random, Frequency (All input sets)]

SLIDE 14

[Chart: Percentage CPI Error vs. Number of Benchmarks in Each Subset (1–45); series: Integer, Floating-Point, Core, Memory, C, FORTRAN]

Absolute CPI Accuracy, Part 2

SLIDE 15

Absolute EDP Accuracy, Part 1

[Chart: Percentage EDP Error vs. Number of Benchmarks in Each Subset (1–45); series: PCA (5PCs), PB (Interaction across, 05D), Random, Frequency (All input sets)]

SLIDE 16

Absolute EDP Accuracy, Part 2

[Chart: Percentage EDP Error vs. Number of Benchmarks in Each Subset (1–45); series: Integer, Floating-Point, Core, Memory, C, FORTRAN]

SLIDE 17

Key Conclusions for Absolute Accuracy

  • Most accurate approaches:

– PCA with 7 principal components
– P&B using the top 5 bottlenecks
– To get < 5% CPI error, at least 17 benchmark-input pairs are needed (1/3 of the entire suite)

  • The int vs. float, compute vs. memory, language, and random approaches have poor and inconsistent CPI/EDP accuracy

– Results based on these approaches may be misleading

  • High-frequency approach

– Overly optimistic DL1 and L2 cache hit rates
– Some subsets may be pessimistic about branch prediction accuracy

  • Statistical approaches are the most reliable way to subset benchmarks
SLIDE 18

Computing Relative Accuracy

  • Compute the average speedup across the entire benchmark suite for the following enhancements:

– 4X larger ROB and LSQ
– Next-line prefetching with prefetch buffers
– 4X larger DL1 and L2 caches, 8-way associativity, same hit latency

  • Compute the average speedup across the benchmarks in each subset

  • Compute the speedup error when using a subset vs. when using the entire suite

– Relative error = (Speedup with subset − Speedup with full suite) / (Speedup with full suite) × 100
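Under the assumption that "average speedup" is the arithmetic mean of per-benchmark speedups (base CPI over enhanced CPI), the slide's formula can be sketched as:

```python
def mean_speedup(base_cpis, enhanced_cpis):
    """Arithmetic-mean speedup over benchmarks (base CPI / enhanced CPI)."""
    return sum(b / e for b, e in zip(base_cpis, enhanced_cpis)) / len(base_cpis)

def relative_speedup_error(subset_idx, base_cpis, enhanced_cpis):
    """(Speedup with subset - Speedup with full suite) / (full suite) * 100."""
    full = mean_speedup(base_cpis, enhanced_cpis)
    sub = mean_speedup([base_cpis[i] for i in subset_idx],
                       [enhanced_cpis[i] for i in subset_idx])
    return (sub - full) / full * 100.0
```

A subset whose benchmarks respond to the enhancement like the full suite does gives an error near zero.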

SLIDE 19

Relative CPI Accuracy (ROB), Part 1

[Chart: Percentage Speedup Error vs. Number of Benchmarks in Each Subset (1–45); series: PCA (5PCs), PB (No interaction per, 03D), Random, Frequency (All input sets)]

SLIDE 20

Relative CPI Accuracy (ROB), Part 2

[Chart: Percentage Speedup Error vs. Number of Benchmarks in Each Subset (1–45); series: Integer, Floating-Point, Core, Memory, C, FORTRAN]

SLIDE 21

Key Conclusions for Relative Accuracy

  • Conclusions are similar to those for absolute accuracy

– PCA and P&B are more accurate
– The other 5 approaches are not accurate

  • Accuracy generally improves with larger subset sizes
  • Similar results across all processor configurations
  • Key difference: Relative error is lower than absolute error

– Relative error is typically < 20% for most approaches and subset sizes
– Absolute error is typically > 20% for most approaches and subset sizes
– There is less variation in CPI across configurations (i.e., for relative accuracy) than across benchmarks (i.e., for absolute accuracy)

  • Matched-pairs comparison
SLIDE 22

Computing Coverage

  • Vectorize the performance and power metrics for each benchmark

– Performance metrics: IPC; branch prediction accuracy; L1 D-cache, L1 I-cache, and L2 cache hit rates; D-TLB and I-TLB hit rates
– Power metrics: power for the rename logic, branch predictor, reorder buffer, load-store queue, register file, L1 D-cache, L1 I-cache, L2 cache, functional units, result bus, and clock network

  • Normalize each metric and scale to 100

– Normalize performance metrics to maximum possible value

  • Maximum IPC = Issue width

– Normalize power metrics to their percentage of the total power consumption

  • Compute the Euclidean distance from each benchmark NOT in the subset to each benchmark IN the subset

  • For each benchmark NOT in the subset, assign the minimum Euclidean distance as its distance

  • Sum the Euclidean distances over all benchmarks NOT in the subset and assign that sum as the total minimum Euclidean distance for that subset size
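The procedure above amounts to a few lines of code; the benchmark vectors here are made-up 2-D points rather than the paper's normalized performance/power metrics:

```python
import numpy as np

def total_min_euclidean(vectors, subset_idx):
    """Sum, over every benchmark NOT in the subset, of its Euclidean
    distance to the nearest benchmark IN the subset."""
    subset = vectors[sorted(subset_idx)]
    total = 0.0
    for i, v in enumerate(vectors):
        if i in subset_idx:
            continue                       # subset members contribute nothing
        total += float(np.linalg.norm(subset - v, axis=1).min())
    return total

# Benchmark 1 coincides with subset member 0; benchmark 2 is distance 5 away
vecs = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]])
print(total_min_euclidean(vecs, {0}))
```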

SLIDE 23

Coverage Intuition

  • The total Euclidean distance represents how well the benchmarks in the subset are spread throughout the entire space of benchmarks

  • A smaller total Euclidean distance means that benchmarks that are not in the subset are very close to a benchmark in the subset

– Benchmarks not in the subset are accurately represented by a benchmark in the subset
– Or, from the viewpoint of coverage, the benchmarks in the subset effectively cover the benchmark suite

SLIDE 24

Performance Coverage, Part 1

[Chart: Euclidean distance of performance metrics vs. Number of Benchmarks in Each Subset (1–45); series: PCA (5PCs), PB (Interaction across, 05D), Random, Frequency (All input sets)]

SLIDE 25

Performance Coverage, Part 2

[Chart: Euclidean distance of performance metrics vs. Number of Benchmarks in Each Subset (1–45); series: Integer, Floating-Point, Core, Memory, C, FORTRAN]

SLIDE 26

Key Conclusions for Coverage

  • Conclusions are similar to those for absolute accuracy

– PCA and P&B have good coverage
– The other 5 approaches do not have good coverage
– The conclusions are the same for absolute/relative accuracy and for coverage

  • Coverage generally improves with larger subset sizes
  • Similar results across all processor configurations
  • Smaller Euclidean distances for power metrics:

– Smaller maximum values for the power metrics
– Less variability in the power results

SLIDE 27

Accuracy vs. Profiling Cost

  • Subsetting approaches have different accuracies, but what is their profiling cost?

– PCA

  • Specialized functional simulator or instrumentation
  • Single run or many runs
  • Requires a couple of months to gather profiling data

– P&B

  • Requires performance simulator
  • 88 very different processor configurations (i.e., some very slow)
  • Requires several months to gather profiling data

– No profiling cost for the other 5 approaches

  • Based on accuracy and profiling cost, we recommend using PCA to subset benchmark suites

SLIDE 28

Conclusions

  • Computer architects frequently use subsetting…

… but the accuracy of subsetting approaches was previously unknown

– Absolute accuracy
– Relative accuracy
– Coverage

  • PCA and P&B design

– Have the best absolute and relative CPI/EDP accuracy
– Error is less than 5% for 20+ benchmark-input pairs
– Most efficiently cover the space of performance and power characteristics

  • Other 5 approaches have poor accuracy and coverage
  • PCA has the highest accuracy at the lowest profiling cost
SLIDE 29

Thank you

Evaluating Benchmark Subsetting Approaches

Joshua J. Yi, Resit Sendag, Lieven Eeckhout, Ajay Joshi, David J. Lilja, Lizy K. John

SLIDE 30

Acknowledgements

  • This research has been supported by the following:

– National Science Foundation (CCF-0541162, 0429806)
– University of Minnesota Digital Technology Design Center
– University of Minnesota Supercomputing Institute
– European HiPEAC Network of Excellence
– Fund for Scientific Research – Flanders (Belgium) (F.W.O. Vlaanderen)
– European SCALA project No. 27648
– IBM Centers for Advanced Studies
– Intel