Slide 1

Benchmark Design for Robust Profile-Directed Optimization

SPEC Workshop 2007
Paul Berube and José Nelson Amaral
University of Alberta

NSERC · Alberta Ingenuity · iCore
January 21, 2007
Slide 2

In this talk

  • SPEC: SPEC CPU
  • PDF: offline, profile-guided optimization
  • Test: evaluate
  • Data/Inputs: program input data

Slide 3

PDF in Research

  • SPEC benchmarks and inputs used, but rules seldom followed exactly
    – PDF will continue regardless of admissibility in reported results
  • Some degree of profiling is taken as a given in many recent compiler and architecture works

Slide 4

An Opportunity to Improve

  • No PDF for base in CPU2006
    – An opportunity to step back and consider
  • Current evaluation methodology for PDF is not rigorous
    – Dictated by inputs/rules provided in SPEC CPU
    – Usually followed when reporting PDF research

Slide 5

Current Methodology (peak_static)

[Diagram: flag tuning configures an optimizing compiler, which builds the peak_static binary; the binary is tested on input.ref. Static optimization only.]

Slide 6

Current Methodology (peak_pdf)

[Diagram: an instrumenting compiler builds the program, which is trained on input.train to collect a profile; the profile drives PDF optimization in the optimizing compiler (with flag tuning), producing the peak_pdf binary, which is tested on input.ref.]

Slide 7

Current Methodology

[Same train/test diagram as the previous slide.]

if (peak_pdf > peak_static)
    peak := peak_pdf;

Slide 8

Current Methodology

[Same train/test diagram as the previous slide.]

if (peak_pdf > peak_static)
    peak := peak_pdf;
else
    peak := peak_static;

Slide 9

Current Methodology

[Same train/test diagram and peak-selection rule as the previous slide.]

(peak_pdf > peak_static)   (peak_pdf > other_pdf)

Does 1 training and 1 test input predict PDF performance? Is this comparison sound?

Slide 10

Current Methodology

[Same train/test diagram and peak-selection rule as the previous slide.]

(peak_pdf > peak_static)   (peak_pdf > other_pdf)

Does 1 training and 1 test input predict PDF performance? Is this comparison sound? Variance between inputs can be larger than reported improvements!

Slide 11

bzip2 – Train on xml

[Bar chart: speedup vs. static (%) on each test input (combined, compressed, docs, gap, graphic, jpeg, xml, log, mp3, mpeg, program, random, reuters, pdf, source); values range from about -6% to more than 14%.]
Slide 12

PDF is like Machine Learning

  • Complex parameter space
  • Limited observed data (training)
  • Adjust parameters to match observed data
    – maximize expected performance

Slide 13

Evaluation of Learning Systems

  • Must take sensitivity to training and evaluation inputs into account
    – PDF specializes code according to training data
    – Changing inputs can greatly alter performance
  • Performance results must have statistical significance measures
    – Differentiate between gains/losses and noise

Slide 14

Overfitting

  • Specializing for the training data too closely
  • Exploiting particular properties of the training data that do not generalize
  • Causes:
    – insufficient quantity of training data
    – insufficient variation among training data
    – deficient learning system

Slide 15

Overfitting

  • Currently:
    ✗ Engineer the compiler to not overfit the single training data (underfitting)
    ✗ No clear rules for input selection
    ✗ Some benchmark authors replicate data between train and ref
  • Overfitting can be rewarded!
Slide 16

Criteria for Evaluation

  • Predict expected future performance
  • Measure performance variance
  • Do not reward overfitting
  • Same evaluation criteria as ML
    – Cross-validation addresses these criteria

Slide 17

Cross-Validation

  • Split a collection of inputs into two or more non-overlapping sets
  • Train on one set, test on the other set(s)
  • Repeat, using a different set for training (sketched below)

[Diagram: the input collection partitioned into a training set and test set(s).]
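A minimal sketch of this loop (assuming hypothetical helpers train_with_profile() and measure_speedup() standing in for "profile and rebuild" and "run and compare against the non-PDF build"; the talk does not prescribe an implementation):

```python
# Sketch of the cross-validation loop described above.
# train_with_profile() and measure_speedup() are hypothetical stand-ins,
# not part of any real benchmark harness.

def cross_validate(inputs, num_folds=3):
    """Train on one fold at a time; test on all inputs outside that fold."""
    folds = [inputs[i::num_folds] for i in range(num_folds)]  # non-overlapping sets
    results = {}
    for i, train_set in enumerate(folds):
        binary = train_with_profile(train_set)  # profile on this fold, rebuild with PDF
        for inp in (x for j, fold in enumerate(folds) if j != i for x in fold):
            results[(i, inp)] = measure_speedup(binary, inp)  # % speedup on a test input
    return results
```

Each input sits in exactly one training fold and is tested in the other num_folds - 1 evaluations, so no run ever tests on its own training data.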

Slide 18

Leave-one-out Cross-Validation

  • If little data, reduce the test set to 1 input (variant sketched below)
    – Leave N out: only N inputs in the test set

[Diagram: train on all inputs but one; test on the single held-out input.]
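Leave-one-out is the same loop with the roles inverted: train on everything except one input, then test on that input (same hypothetical helpers as the sketch above):

```python
def leave_one_out(inputs):
    """Train on all inputs but one; test on the single held-out input."""
    results = {}
    for held_out in inputs:
        binary = train_with_profile([x for x in inputs if x != held_out])
        results[held_out] = measure_speedup(binary, held_out)  # hypothetical, as above
    return results
```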

Slide 19

Cross-Validation

  • The same data is NEVER in both the training and the testing set
    – Overfitting will not enhance performance
  • Multiple evaluations allow statistical measures to be calculated on the results
    – Standard deviation, confidence intervals...
  • A set of training inputs allows the system to exploit commonalities between inputs

Slide 20

Proposed Methodology

  • PDFPeak score, distinct from peak
    – Report with standard deviation
  • Provide a PDF workload
    – Inputs used for both training and evaluation, so "medium" sized (~2 min running time)
    – 9 inputs needed for meaningful statistical measures

Slide 21

Proposed Methodology

  • Split inputs into 3 sets (at design time)
  • For each input in each evaluation, calculate speedup compared to (non-PDF) peak
  • Calculate (over all evaluations), as sketched below:
    – mean speedup
    – standard deviation of speedups
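With 9 inputs in 3 sets, each evaluation trains on 3 inputs and tests on the other 6, so the three evaluations produce 18 speedup measurements. A sketch of the aggregation, reusing the hypothetical cross_validate() from the earlier slide (which splits round-robin for illustration; the proposal fixes the split at design time):

```python
import statistics

# The 9-input workload from the example slides that follow.
workload = ["jpeg", "mpeg", "xml", "html", "text", "doc", "pdf", "source", "program"]

results = cross_validate(workload, num_folds=3)  # 18 speedups (%), 6 per evaluation
speedups = list(results.values())

pdfpeak_mean = statistics.mean(speedups)   # mean speedup vs. (non-PDF) peak
pdfpeak_sdev = statistics.stdev(speedups)  # sample standard deviation
print(f"PDFPeak: {pdfpeak_mean:.2f}% +/- {pdfpeak_sdev:.2f}%")
```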

Slide 22

Example

PDF Workload (9 inputs): jpeg, mpeg, xml, html, text, doc, pdf, source, program

Slide 23

Example – Split workload

PDF Workload (9 inputs): jpeg, mpeg, xml, html, text, doc, pdf, source, program

Split into three sets:
  A: jpeg, xml, pdf
  B: mpeg, html, source
  C: text, doc, program

Slide 24

Example – Train and Run

[Diagram: the instrumenting compiler builds the program, which is trained on set A.]
Slide 25

Example – Train and Run

[Diagram: training on set A yields Profile(A), which feeds PDF optimization in the optimizing compiler.]

Slide 26

Example – Train and Run

Train on A, test on B+C:

  mpeg    1%
  html    5%
  text    4%
  doc    -3%
  source  4%
  program 2%

[Diagram: Profile(A) drives PDF optimization in the optimizing compiler; the resulting binary is tested on B+C.]

Slide 27

Example – Train and Run

Train on B, test on A+C:

  jpeg    4%
  xml    -1%
  text    5%
  doc     1%
  pdf     4%
  program 1%

[Diagram: Profile(B) drives PDF optimization; the resulting binary is tested on A+C.]

Slide 28

Example – Train and Run

Train on C, test on A+B:

  jpeg    5%
  xml     2%
  mpeg   -1%
  html    3%
  pdf     3%
  source  3%

[Diagram: Profile(C) drives PDF optimization; the resulting binary is tested on A+B.]

Slide 29

Example – Evaluate

All 18 results, two per input (one from each evaluation in which it was tested):

  doc      1%   -3%
  html     3%    5%
  jpeg     5%    4%
  mpeg    -1%    1%
  pdf      3%    4%
  program  1%    2%
  source   3%    4%
  text     5%    4%
  xml     -1%    2%

Average: 2.33

Slide 30

Example – Evaluate

[Same table of 18 results as the previous slide.]

Average: 2.33
Std. Dev: 2.30

Slide 31

Example – Evaluate

[Same table of 18 results as the previous slide.]

Average: 2.33
Std. Dev: 2.30

PDF improves performance:
  • 2.33 ± 2.30%, 17 times out of 25
  • 2.33 ± 4.60%, 19 times out of 20
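These numbers follow directly from the 18 measurements: the two intervals are mean ± 1 and ± 2 standard deviations, which under a rough normality assumption cover about 68% (17 out of 25) and 95% (19 out of 20) of cases. A quick check:

```python
import statistics

# The 18 speedups (%) from the table above.
speedups = [1, -3, 3, 5, 5, 4, -1, 1, 3, 4, 1, 2, 3, 4, 5, 4, -1, 2]

mean = statistics.mean(speedups)   # 2.33
sdev = statistics.stdev(speedups)  # 2.30 (sample standard deviation)

# Assuming roughly normal speedups: +/-1 sdev covers ~68% (17/25),
# +/-2 sdev covers ~95% (19/20).
print(f"{mean:.2f} +/- {sdev:.2f}%")    # 2.33 +/- 2.30%
print(f"{mean:.2f} +/- {2*sdev:.2f}%")  # 2.33 +/- 4.60%
```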

Slide 32

Example – Evaluate

PDF improves performance:
  • 2.33 ± 2.30%, 17 times out of 25
  • 2.33 ± 4.60%, 19 times out of 20

(peak_pdf > peak_static)?  (new_pdf > other_pdf)?

Depends on mean and variance of both!
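One way to make either comparison concrete (a suggestion, not something the talk prescribes) is a two-sample test on the two sets of cross-validated speedups, for example Welch's t statistic:

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples of speedups (%)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

# Hypothetical speedup samples for two PDF systems on the same workload.
pdf_new   = [1, -3, 3, 5, 5, 4, -1, 1, 3, 4, 1, 2, 3, 4, 5, 4, -1, 2]
pdf_other = [2, -1, 2, 3, 4, 3, 0, 1, 2, 3, 1, 1, 2, 3, 4, 3, 0, 1]
print(f"t = {welch_t(pdf_new, pdf_other):.2f}")  # compare against a t critical value
```

A difference in means only matters once it is large relative to the variance of both samples, which is exactly the slide's point.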

Slide 33

Pieces of Effective Evaluation

  • Workload of inputs
  • Education about input selection
    – Rules and guidelines for authors
  • Adoption of a new methodology for PDF evaluation

Slide 34

Practical Concerns

  • Benchmark user
    – Many additional runs, but on smaller inputs
    – Two additional program compilations
  • Benchmark author
    – Most INT benchmarks use multiple data, and/or additional data is easily available
    – PDF input set could be used for REF

Slide 35

Conclusion

  • PDF is here: important for compilers and architecture, in research and in practice
  • The current methodology for PDF evaluation is not reliable
  • Proposed a methodology for meaningful evaluation

Slide 36

Thanks

Questions?