Monte Carlo Processor Modeling of Contemporary Computer Architectures



SLIDE 1

LACSS 2008

Monte Carlo Processor Modeling of Contemporary Computer Architectures

Jeanine Cook
Students: Waleed Alkohlani, Ram Srinivasan
New Mexico State University

SLIDE 2

Problem

Need tools to do performance analysis of contemporary architectures (design, prediction, procurement)

  • Cycle-accurate simulation: great for accuracy, hard on time! Lack of freely available simulators that simulate contemporary architectures
  • Analytic models: hard to use; not very accurate or robust

SLIDE 3

Solution

Statistical model

  • Based on processor and application characteristics
  • Generates fast, accurate predictions
  • Can do more than just predict execution time
  • Robust

SLIDE 4

Monte Carlo Processor Modeling

Processor pipeline abstracted into a statistical model using

  • dynamic application profiles
  • processor microarchitecture characteristics

Based on CPI = CPI_I + CPI_S

  • CPI_I ==> intrinsic CPI based on issue width
  • CPI_S ==> CPI due to stalls
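The CPI decomposition above lends itself to a small Monte Carlo loop. The sketch below is illustrative only: the stall probability and penalty are made-up numbers, not measurements from the slides.

```python
import random

def estimate_cpi(n_insns, cpi_i, p_stall, stall_cycles, seed=0):
    """Monte Carlo estimate of CPI = CPI_I + CPI_S.

    cpi_i        -- intrinsic CPI fixed by the issue width
    p_stall      -- probability an instruction triggers a stall
    stall_cycles -- penalty paid when a stall occurs
    (p_stall and stall_cycles are illustrative inputs, not slide data)
    """
    rng = random.Random(seed)
    cycles = 0.0
    for _ in range(n_insns):
        cycles += cpi_i                    # intrinsic issue cost (CPI_I)
        if rng.random() < p_stall:         # stall event: dependence, mispredict, ...
            cycles += stall_cycles         # accumulates into CPI_S
    return cycles / n_insns

# Made-up numbers: a 2-issue core (CPI_I = 0.5), 10% stall rate, 6-cycle stalls;
# the estimate converges to 0.5 + 0.10 * 6 = 1.1.
cpi = estimate_cpi(100_000, cpi_i=0.5, p_stall=0.10, stall_cycles=6)
```

In the real model the stall events come from measured histograms rather than a single Bernoulli draw, but the accounting is the same: every cycle not explained by CPI_I is charged to CPI_S.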

SLIDE 5

Current Capabilities

  • Single and multi-core
  • In-order instruction execution
  • Flexible cache model
  • Captures instruction sequence relationships
  • Niagara 1 and 2, Cell, Itanium

SLIDE 6

Future Capabilities

  • Improved flexible cache model
  • Implement out-of-order model methodology
  • Develop method for modeling multi-threaded processors
  • Implement power models for consumption prediction
  • Integrate into communication model ==> MP model
  • Modeling framework

SLIDE 7

Cell Model

  • PPE (PowerPC) - 2-issue, in-order, 2-way SMT
  • SPEs - 2-issue, in-order, SIMD
  • EIB - 96 bytes/cycle

SLIDE 8

Synergistic Processing Elements (SPEs)

  • SPU - statically scheduled, 128 x 128-bit regs, 256KB local store (LS)
  • MFC - handles communication: DMA requests (from PPE and SPUs), mailboxes, signals

SLIDE 9

SPU EUs

FPD not fully pipelined; when an insn is issued, it stalls global insn issue for 6 cycles

EU (and latency in cycles) by partition:

  • Even: FP6(6), FP7(7), FPD(13), FX2(2), FX3(4), FXB(4), NOP(0)
  • Odd: LSU(6), BR(4), SHUF(4), SPR(4), LNOP(0)

SLIDE 10

12 Steps (1)

1. Issue mechanism: SPU stall-at-use
2. CPI_I: should be 1/2, but due to even/odd restrictions, we measured it from the dynamic insn stream
3. Stall reasons: unresolved dependences, mis-speculated branches
4. EU characteristics: on prior slide; not shared
5. Cache characteristics: no cache hierarchy
6. Memory characteristics: only SPUs modeled; no direct access to memory; LS access latency 6 cycles
7. Branch predictor characteristics: 18-cycle fixed penalty per branch mis-predict

SLIDE 11

12 Steps (2)

8. Variable latency: branch unit latency depends on the quality of branch hints
9. Application characteristics: CPI_I, dynamic insn mix (generate transition probs), dependence distance histograms, hint-to-branch histogram, prob of taken and hinted branches
10. Collect application profile: designed an instrumentation tool for Cell
11. Model
12. Validation

SLIDE 12

Cell SPU Model

SLIDE 13

Token Generation

  • Instruction mix translated to probabilities
  • Tokens encoded as integers for each insn class
  • Markov token generator
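The token generator described above can be sketched as a first-order Markov chain. The instruction classes and transition probabilities below are hypothetical placeholders for values that would be measured from a dynamic profile.

```python
import random

# Hypothetical instruction classes and transition probabilities
# (illustrative values, not measured from any real profile).
CLASSES = ["ALU", "LD", "ST", "BR"]          # token = index into this list
TRANS = {
    "ALU": [0.60, 0.20, 0.10, 0.10],
    "LD":  [0.70, 0.10, 0.10, 0.10],
    "ST":  [0.65, 0.15, 0.10, 0.10],
    "BR":  [0.50, 0.25, 0.15, 0.10],
}

def token_stream(n, start="ALU", seed=0):
    """First-order Markov generator: the next token is drawn from the
    transition row of the current token's instruction class."""
    rng = random.Random(seed)
    cur = start
    out = []
    for _ in range(n):
        cur = rng.choices(CLASSES, weights=TRANS[cur])[0]
        out.append(CLASSES.index(cur))       # encode token as integer class id
    return out

tokens = token_stream(1000)
```

A synthetic stream generated this way preserves the pairwise instruction-sequence relationships of the profiled application, which is what the model needs to reproduce issue-slot conflicts.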

SLIDE 14

Dependence Generator and Stall Unit

Based on application dependence histograms (e.g., FP-use, LD-use)
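One plausible reading of the stall unit: draw a producer-to-consumer distance from the dependence histogram, and stall for whatever part of the producer's latency has not yet elapsed. The histogram and latency below are invented for illustration.

```python
import random

# Hypothetical dependence-distance histogram:
# P(consumer issues k insns after the producer). Illustrative values only.
DIST_HIST = {1: 0.40, 2: 0.25, 3: 0.15, 4: 0.10, 5: 0.10}

def sample_stall(producer_latency, rng):
    """Draw a dependence distance; if the consumer issues before the
    producer's result is ready, stall for the remaining cycles."""
    dist = rng.choices(list(DIST_HIST), weights=list(DIST_HIST.values()))[0]
    return max(0, producer_latency - dist)   # cycles charged to CPI_S

# E.g., a 6-cycle producer (LSU/FP6 latency from the EU table) against the
# histogram above: the stall distribution averages 6 - E[dist] = 3.75 cycles.
rng = random.Random(0)
stalls = [sample_stall(producer_latency=6, rng=rng) for _ in range(10_000)]
avg_stall = sum(stalls) / len(stalls)
```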

SLIDE 15

Branch Hints

  • SPUs statically predict branches not taken (0-cycle penalty)
  • 18-cycle penalty for taken (mis-predicted) branches
  • Hinting mechanism to reduce the penalty:
    - Hints take effect after the 4th pipe stage
    - Take 9 more cycles to fetch the target
    - If the branch appears within 4 cycles of the hint, the hint does nothing
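The hint rules above can be folded into a small penalty function. How the rules compose is our simplification: partially effective hints (and the hinted not-taken case on a later slide) are deliberately left out of this sketch.

```python
def branch_stall(taken, hinted, hint_to_branch):
    """Stall cycles for one branch under a simplified reading of the SPU rules.

    taken          -- did the branch go against the static not-taken prediction?
    hinted         -- was a hint instruction issued for this branch?
    hint_to_branch -- insn distance from hint to branch (ignored if not hinted)

    Simplifications (ours, not from the slides): hints are either fully
    effective or useless; the partial-overlap window while the 9-cycle
    target fetch completes is ignored.
    """
    if not taken:
        return 0                  # static not-taken prediction was correct
    if not hinted or hint_to_branch < 4:
        return 18                 # unhinted, or hint too close to help: full penalty
    return 0                      # effective hint hides the taken-branch penalty

# Usage: an unhinted taken branch pays the full mis-predict penalty.
unhinted = branch_stall(taken=True, hinted=False, hint_to_branch=0)
```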

SLIDE 16

Hint-to-Branch: Taken

SLIDE 17

Problem with Branch Hints

  • Hinted, not-taken branches can stall up to 27 cycles!
  • Hinting a not-taken branch - the hint is probably wrong

SLIDE 18

Hint-to-Branch: Not Taken

SLIDE 19

Hint Unit

Service time of the BR unit is based on the hint-to-branch distance histogram. Statistically determined from the application:

  • probability that a branch is taken/not taken
  • probability that a branch is hinted
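A hedged sketch of the hint unit's statistical service-time draw: sample whether the branch is taken and hinted, then a hint-to-branch distance, and charge the mis-predict penalty when the hint cannot help. All probabilities and the histogram are illustrative, not application measurements.

```python
import random

# Illustrative application statistics (would come from the profile).
P_TAKEN = 0.35
P_HINTED = 0.60
HINT_DIST_HIST = {2: 0.2, 6: 0.5, 12: 0.3}   # hint-to-branch distances

BASE_BR_LATENCY = 4      # BR unit latency from the EU table
TAKEN_PENALTY = 18

def br_service_time(rng):
    """Sample one branch's service time: base BR latency plus a penalty
    drawn from the taken/hinted probabilities and the distance histogram."""
    t = BASE_BR_LATENCY
    taken = rng.random() < P_TAKEN
    hinted = rng.random() < P_HINTED
    if taken:
        if not hinted:
            t += TAKEN_PENALTY               # unhinted taken branch
        else:
            dist = rng.choices(list(HINT_DIST_HIST),
                               weights=list(HINT_DIST_HIST.values()))[0]
            if dist < 4:                     # hint too close to the branch
                t += TAKEN_PENALTY
    return t

rng = random.Random(1)
mean_t = sum(br_service_time(rng) for _ in range(20_000)) / 20_000
```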

SLIDE 20

Model Parameters

  • CPI_I
  • Instruction transition probabilities
  • Dependence distance histograms
  • Hint-to-branch distance histograms
  • Probability of taken branch
  • Probability of hinted branch
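The parameter list above could be carried around as a single container. The field names and example values below are ours, not taken from the model's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class SpuModelParams:
    """One field per model parameter listed above.
    Field names and example values are illustrative."""
    cpi_i: float              # intrinsic CPI
    transition_probs: dict    # instruction-class transition matrix
    dep_dist_hist: dict       # dependence distance histograms
    hint_dist_hist: dict      # hint-to-branch distance histograms
    p_taken: float            # probability of a taken branch
    p_hinted: float           # probability of a hinted branch

# Example instantiation with made-up values.
params = SpuModelParams(
    cpi_i=0.55,
    transition_probs={"ALU": {"ALU": 0.6, "LD": 0.4}},
    dep_dist_hist={1: 0.5, 2: 0.5},
    hint_dist_hist={6: 1.0},
    p_taken=0.3,
    p_hinted=0.5,
)
```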

SLIDE 21

Results

SLIDE 22

Error Attribution

SLIDE 23

Decomposing CPI

SLIDE 24

Fully Pipelined, Faster FPD

SLIDE 25

Hardware Branch Predictor

SLIDE 26

Conclusions and Future Work

Method produces accurate, fast predictive models

  • Cell, Niagara 1 and 2, Itanium, Opteron

Future work:

  • Complete Niagara 2, Opteron
  • Cell improvements: model communication (DMA), model the PPE
  • Extend methodology for multithreading, power models, better flexible cache model
  • Integrate into communication model for MP system model

SLIDE 27

Thanks

Any Questions???

SLIDE 28

Extra Slides

SLIDE 29

12-Step Method (1)

1. Determine if stall-at-issue or stall-at-use
2. Determine CPI_I
3. Determine factors that influence CPI_S (e.g., data dependences, branch mis-predicts, partially pipelined EUs)
4. Identify type, number, latency, and behavior under contention of EUs
5. Determine cache access times
6. Determine main memory access latency
7. Determine branch mis-prediction penalty

SLIDE 30

12-Step Method (2)

8. Determine microarchitectural structures with variable service time (e.g., EUs, memory, branch predictors)
9. Collect application profile:

  • CPI_I
  • Dynamic instruction mix
  • Dependence distance histograms
  • Cache hit/miss statistics
  • Branch predictor accuracy statistics
  • Special histograms relevant to Step 8, such as prefetch-load distance (Itanium) or hint-to-branch distance (Cell)

SLIDE 31

12-Step Method (3)

10. Identify tools to collect application profile (typically performance counters and binary instrumenters)
11. Build model
12. Validate model