Accelerating Multiprocessor Simulation with a Memory Timestamp Record



SLIDE 1

Accelerating Multiprocessor Simulation with a Memory Timestamp Record

Kenneth Barr, Heidi Pan, Michael Zhang, Krste Asanović
March 21, 2005

Massachusetts Institute of Technology

SLIDE 2

Barr, Pan, Zhang, and Asanović. ISPASS. March 21, 2005.

Intelligent sampling gives best speed-accuracy tradeoff for uniprocessors (Yi, HPCA '05)

  • Single sample
  • Fastforward + single sample
  • Fastforward + warmup + sample
  • Selective sampling (SimPoints)
  • Statistical sampling
  • Statistical sampling w/ Fast Functional Warming (SMARTS, FFW)
  • Memory Timestamp Record

[Figure: one timeline per technique; legible labels: detailed, ignored, measure, ISA only, ISA+µarch, ISA+MTR, Update, Reconstruct caches.]

SLIDE 3

Snapshots amortize fast-forwarding, but require slow warming or bind to a particular µarch

ISA+µarch snapshots:

Fast (less warmup), but tied to µarch

ISA only snapshots:

Slow due to warmup, but allows any µarch

MTR snapshots:

Fast, NOT tied to µarch, supports multiprocessors…

SLIDE 4

Multiprocessor simulation is especially slow

  • More cores → More state/complexity → Long, complex simulations
  • Full system, threaded apps → More variability → More simulation

[Figure: CPU1, CPU2, … CPUn, each with a cache ($), plus Memory and Directory; a second sketch with axes labeled time and CPUs.]

SLIDE 5

For full-system simulations of commercial workloads, subtle variation matters! (Alameldeen and Wood, 2003)

  • All produce same result, each has different runtime
    – DRAM refresh
    – Hard disk arrangement delays DMA
    – Incoming packet interrupts application
    – Locking order reversed
    – Processes migrate
  • Is our new gizmo a success? Maybe OS just ordered threads differently!

[Figure: three executions drawn against a CPU axis, labeled Time = 2.5, Time = 1.8, and Time = 2.1; segment labels 1, 2, 3, 4.]

SLIDE 6

What is the Memory Timestamp Record (MTR)?

  • MTR is an abstract picture of a multiprocessor’s coherence state

[Table: one MTR row per block address (rows run to N-1); columns: Block Address, Last Readtime (one column each for CPU0 … CPUn-1), Last Writer, Last Writetime.]

SLIDE 7

What is the Memory Timestamp Record (MTR)?

  • MTR is an abstract picture of a multiprocessor’s coherence state
    – Allow quick update during fast forwarding
    – Fill in concrete caches and directory prior to sampling

[Figure: the MTR table (Block Address; Last Readtime for CPU0 … CPUn-1; Last Writer; Last Writetime) shown with two copies of the target system: CPU1, CPU2, … CPUn, each with a cache ($), plus Memory and Directory.]

SLIDE 8

MTR example: update

  • MTR contains one entry per memory block; locality keeps it sparse.
  • New access times overwrite old (self-compressing)

[Figure: a Memory Trace table (Time 1 to 4; columns CPU0, CPU1) beside the MTR table (Block Address rows a, e, b, c, d; Last Readtime for CPU0 … CPUn-1; Last Writer; Last Writetime). Both start empty.]

SLIDE 9

MTR example: update (continued)

[Figure: the memory trace gains "Read a".]

SLIDE 10

MTR example: update (continued)

[Figure: the memory trace gains "Read e"; the MTR row for e gains a Last Readtime of 1.]

SLIDE 11

MTR example: update (continued)

[Figure: the memory trace gains "Read b"; the MTR row for b gains a Last Readtime of 2.]

SLIDE 12

MTR example: update (continued)

[Figure: the memory trace gains "Read c"; the MTR row for c gains a Last Readtime of 3.]

SLIDE 13

MTR example: update (continued)

[Figure: the memory trace gains "Write b"; the MTR row for b gains Last Writer CPU1 and Last Writetime 4, while keeping its Last Readtime of 2.]
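
The update rule in this example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class and method names (`MTREntry`, `MTR`, `read`, `write`) are hypothetical, block addresses are modeled as strings, and the reads are assumed to come from CPU0 (as the later coalesce slides suggest). Only the accesses whose timestamps are legible in the example are replayed.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MTREntry:
    """One MTR row: last read time per CPU, plus the last writer and its time."""
    last_read: Dict[int, int] = field(default_factory=dict)  # cpu id -> time
    last_writer: Optional[int] = None
    last_write_time: Optional[int] = None


class MTR:
    """Sparse table keyed by block address; untouched blocks get no entry."""

    def __init__(self) -> None:
        self.entries: Dict[str, MTREntry] = {}

    def read(self, cpu: int, block: str, time: int) -> None:
        # A newer read time simply overwrites the older one for this CPU.
        self.entries.setdefault(block, MTREntry()).last_read[cpu] = time

    def write(self, cpu: int, block: str, time: int) -> None:
        entry = self.entries.setdefault(block, MTREntry())
        entry.last_writer = cpu
        entry.last_write_time = time


mtr = MTR()
mtr.read(0, "e", 1)   # CPU0 reads e at time 1 (CPU assignment assumed)
mtr.read(0, "b", 2)   # CPU0 reads b at time 2
mtr.read(0, "c", 3)   # CPU0 reads c at time 3
mtr.write(1, "b", 4)  # CPU1 writes b at time 4
```

Because each access overwrites a single field, the table grows with the number of distinct blocks touched rather than with the length of the trace, which is what "self-compressing" refers to.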

SLIDE 14

1. Coalesce: determining correct cache tags

[Figure: the MTR table (Block Address; Last Readtime for CPU0 … CPUn-1; Last Writer; Last Writetime) mapped onto a cache drawn with Sets 0 to 3 and Ways 0 and 1.]

SLIDE 17

MTR example: coalesce

  • Choose organization
    – One set, two ways
  • Coalesce
    – Determine which blocks map to same set
    – Only the most recent timestamps, as many as there are ways, are present. Check validity later.

[Figure: the MTR from the update example beside two empty caches, each drawn with Set 0, Set 1 and Way 0, Way 1.]

SLIDE 18

MTR example: coalesce (continued)

  • What are the contents of CPU0 cache?

[Figure: the MTR beside an empty CPU0 cache (Set 0, Set 1; Way 0, Way 1).]

SLIDE 19

MTR example: coalesce (continued)

[Figure: block a is placed in the CPU0 cache.]

SLIDE 20

MTR example: coalesce (continued)

[Figure: block b (timestamp 2) joins a in the CPU0 cache.]

SLIDE 21

MTR example: coalesce (continued)

[Figure: block c (timestamp 3) is added; the CPU0 cache now shows a, b (2), and c (3).]

SLIDE 22

MTR example: coalesce (continued)

[Figure: block e (timestamp 1) replaces a; the CPU0 cache now shows e (1), b (2), and c (3).]

SLIDE 23

MTR example: coalesce (continued)

  • What are the contents of CPU0 cache? CPU1?

[Figure: the CPU0 cache shows e (1), b (2), and c (3); the CPU1 cache shows b (timestamp 4).]
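
The coalesce step can be sketched as follows, under stated assumptions: MTR rows are plain tuples, block addresses are integers, the set index is `block % num_sets`, and replacement is strict LRU. The function name `coalesce` and the addresses and timestamps in the sample table are hypothetical, chosen only to mirror the shape of the example (three blocks competing for one two-way set).

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# block address -> (last read time per CPU, last writer, last write time)
MTRTable = Dict[int, Tuple[Dict[int, int], Optional[int], Optional[int]]]


def coalesce(mtr: MTRTable, cpu: int, num_sets: int,
             ways: int) -> Dict[int, List[Tuple[int, int]]]:
    """Per set, return up to `ways` (block, time) pairs for `cpu`, newest first."""
    by_set: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for block, (reads, writer, write_time) in mtr.items():
        times = []
        if cpu in reads:
            times.append(reads[cpu])
        if writer == cpu and write_time is not None:
            times.append(write_time)
        if times:  # this CPU touched the block at least once
            by_set[block % num_sets].append((block, max(times)))
    # Under LRU, only the `ways` most recent accesses can still be resident.
    return {
        s: sorted(cands, key=lambda bt: bt[1], reverse=True)[:ways]
        for s, cands in by_set.items()
    }


# Hypothetical table: blocks 0, 2, 4 share set 0; block 1 maps to set 1.
sample_mtr: MTRTable = {
    0: ({0: 0}, None, None),
    2: ({0: 1}, None, None),
    4: ({0: 2}, 1, 4),
    1: ({0: 3}, None, None),
}
cpu0_tags = coalesce(sample_mtr, cpu=0, num_sets=2, ways=2)
cpu1_tags = coalesce(sample_mtr, cpu=1, num_sets=2, ways=2)
```

The same table can be coalesced again with a different `num_sets` or `ways`, which is the property that keeps the record independent of any one cache organization. The tags returned here are not yet checked for validity; that is the job of the fixup step.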

SLIDE 24

2. Fixup: determine correct status bits

[Figure: Cache 0, Cache 1, … Cache n-1, each drawn with Sets 0 to 3 and Ways 0 and 1.]

SLIDE 25

MTR example: fixup

  • Reads prior to a write are invalid, valid writes are dirty, etc.

Which cache has the most recent copy of ‘b’?

[Figure: the MTR beside both caches. CPU0 cache: e (1), b (2), c (3), with its copy of b marked invalid. CPU1 cache: b (4), marked valid, dirty.]
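
A minimal sketch of the fixup rule, assuming an MSI-style invalidate protocol and MTR rows stored as plain tuples. The function name `fixup` and the status strings are hypothetical; the rule that a later read by another CPU leaves the writer's copy clean is an assumption, not something the slide states.

```python
from typing import Dict, Optional, Tuple

# (last read time per CPU, last writer, last write time) for one block
Entry = Tuple[Dict[int, int], Optional[int], Optional[int]]


def fixup(entry: Entry, cpu: int) -> str:
    """Classify `cpu`'s copy of a block whose tag survived coalescing."""
    reads, writer, write_time = entry
    if writer is not None and write_time is not None:
        if writer == cpu:
            # Assumed MSI-style rule: a later read by another CPU forces a
            # writeback, leaving the writer's copy clean.
            later_reader = any(c != cpu and t > write_time
                               for c, t in reads.items())
            return "valid, clean" if later_reader else "valid, dirty"
        if reads.get(cpu, -1) < write_time:
            return "invalid"  # this CPU's read predates another CPU's write
    return "valid, clean" if cpu in reads else "invalid"


# Block b from the example: read by CPU0 at time 2, written by CPU1 at time 4.
b_entry: Entry = ({0: 2}, 1, 4)
cpu0_status = fixup(b_entry, cpu=0)
cpu1_status = fixup(b_entry, cpu=1)
```

For block b this yields an invalid copy in CPU0 and a valid, dirty copy in CPU1, which matches the labels on the slide.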

SLIDE 26

MTR example: directory reconstruction

[Figure: the MTR from the update example beside the reconstructed directory.]

Block Address   State   Sharers
a               S       CPU0
e               S       CPU0
b               M       CPU1
c               S       CPU0
d               I
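
Directory reconstruction can be sketched the same way. This assumes an MSI protocol, derives each directory entry from the MTR row alone, and uses a hypothetical helper name `directory_entry`. The read time used for block a is a placeholder, since its value is not legible in the example.

```python
from typing import Dict, Optional, Set, Tuple

# (last read time per CPU, last writer, last write time) for one block
Entry = Tuple[Dict[int, int], Optional[int], Optional[int]]


def directory_entry(entry: Entry) -> Tuple[str, Set[int]]:
    """Return (state, sharers) for one block under an assumed MSI protocol."""
    reads, writer, write_time = entry
    if writer is None or write_time is None:
        sharers = set(reads)
        return ("S", sharers) if sharers else ("I", set())
    later_readers = {c for c, t in reads.items()
                     if c != writer and t > write_time}
    if later_readers:
        # The write was followed by reads elsewhere: the block is shared.
        return "S", later_readers | {writer}
    return "M", {writer}


example_mtr: Dict[str, Entry] = {
    "a": ({0: 0}, None, None),  # placeholder read time for a
    "e": ({0: 1}, None, None),
    "b": ({0: 2}, 1, 4),
    "c": ({0: 3}, None, None),
    "d": ({}, None, None),      # never touched
}
directory = {blk: directory_entry(e) for blk, e in example_mtr.items()}
```

Under these assumptions the result has b in state M owned by CPU1, d in state I, and the remaining blocks in state S with CPU0 as the only sharer.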

SLIDE 27

The MTR supports many popular organizations and protocols

  • Snoopy or directory-based
  • Multilevel caches
    – Inclusive
    – Exclusive
  • Time-based replacement policy
    – Strict LRU
    – Cache decay
  • Invalidate, Update, Update-Invalidate
  • MSI, MESI, MOESI

SLIDE 28

Evicts cannot be recorded in the MTR, but many can be inferred

  • CASE A: CPU0 writes b (time n), then CPU0 reads b (time n+k). Result: b = dirty
  • CASE B: CPU0 writes b (time n), CPU0 writes b’ evicting b, then CPU0 reads b (time n+k). Result: b = clean

[Figure: MTR row for address b: Writer CPU0, Writetime n, CPU0 Readtime n+k; a CPU1 Readtime column is also drawn.]

SLIDE 29

Evaluation / Results

SLIDE 30

Detailed, full-system, execution-driven, x86, SMP simulation

  • SMP Bochs provides: devices (allowing an OS), x86 Decoding and Execution, Magic Memory

[Figure: block diagram with Bochs, CPU 0, CPU 1, … CPU N-1, Main Memory, Magic Memory, Memory Timestamp Record, and Detailed Memory System; signals labeled Detailed Mode Enable and Stall.]

SLIDE 32

Reconstructing with the MTR

  • Memory Timestamp Record: allows switching between functional fast-fwd and detailed simulation

[Figure: block diagram with Bochs, CPU 0, CPU 1, … CPU N-1, Main Memory, Magic Memory, Memory Timestamp Record, and Detailed Memory System; signals labeled Detailed Mode Enable and Stall.]

SLIDE 33

Our detailed memory model can stall a processor’s execution based on timing models

  • Detailed Memory System provides: cache coherence, network, DRAM timing, stall signal

[Figure: block diagram with Bochs, CPU 0, CPU 1, … CPU N-1, Main Memory, Magic Memory, Memory Timestamp Record, and Detailed Memory System; signals labeled Detailed Mode Enable and Stall.]

SLIDE 34

Benchmarks!

  • NASA Advanced Supercomputing Parallel Benchmarks:
    – scientific (comp. fluid dynamics)
    – OpenMP (loop iterations in parallel)
    – Fortran
  • 2 OS benchmarks
    – dbench: (Samba) several clients making file-centric system calls
    – Apache: several clients hammer web server (via loopback interface)
  • Cilk: checkers: AI search plies in parallel
    – uses spawn/sync primitives (dynamic thread creation/scheduling)

SLIDE 35

We compare MTR to full detailed and a “naïve” implementation of SMP fast forwarding.

  • Baseline 1: full detailed simulation (overnight)
  • Baseline 2: “naïve” functional fast forwarding (FFW)
    – Functional simulation of ISA
    – Cache / directory state kept accurate
      • Tag checks, replacement policy enforced
      • Directory consulted and updated
      • On miss/coherence miss, invalidate outstanding copies
      • Omits network messages, queues, latencies (present in detailed mode)
  • Hypothesis
    – Both FFW and MTR should be accurate and fast
    – MTR should be faster than FFW
    – To be useful, FFW and MTR must answer questions in the same way as a detailed model, but faster

SLIDE 36

Full-system experiments must respect system variation, or risk incorrect prediction!

  • Methodology
    – Every 10k cycles choose victim processor
    – Victim will run 25% slower to emulate variation
    – Note: variation has MUCH larger effect during fast mode
  • Bar shows the median of eight runs, with ticks for min and max. Each run is a valid result!
  • Ideal: range for fast runs should be within range of (all possible) detailed.
  • Can’t draw conclusions about discrepancy until we understand distribution

[Figure: dbench, Cache 1 miss rate (%), with bar groups labeled 1:10, 1:100, and 1:1000.]
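
The variation methodology and the way each bar is summarized can be sketched as below. Several details are assumptions rather than facts from the slides: the victim is drawn uniformly at random, "25% slower" is read as a relative speed of 0.75, and all function names are hypothetical.

```python
import random
import statistics
from typing import List, Sequence, Tuple

INTERVAL_CYCLES = 10_000   # from the slide: pick a victim every 10k cycles
VICTIM_SPEED = 0.75        # assumed reading of "25% slower"


def victim_schedule(num_cpus: int, total_cycles: int, seed: int) -> List[int]:
    """Pick one victim CPU per interval (uniform random choice is assumed)."""
    rng = random.Random(seed)
    intervals = -(-total_cycles // INTERVAL_CYCLES)  # ceiling division
    return [rng.randrange(num_cpus) for _ in range(intervals)]


def relative_speed(cpu: int, victim: int) -> float:
    """Speed of `cpu` during an interval in which `victim` is slowed."""
    return VICTIM_SPEED if cpu == victim else 1.0


def summarize(samples: Sequence[float]) -> Tuple[float, float, float]:
    """Return (min, median, max), the three values each bar reports."""
    return min(samples), statistics.median(samples), max(samples)


# Eight runs differ only in their seed, so each run sees a different schedule.
schedules = [victim_schedule(num_cpus=4, total_cycles=100_000, seed=s)
             for s in range(8)]
```

Feeding the per-run statistic of interest (for example, a miss rate) from eight such runs into `summarize` gives the median bar and the min and max ticks described above.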

SLIDE 37

Replicating “detailed”-mode stats less crucial than accurate answers to design questions

  • Assume observed = actual
  • With respect to reply message types, the MSI vs. MESI change is dramatic.
    – All fast-fwd bars move with the detailed bar.
    – Movement beyond range of detailed runs
  • Discover evicts to more closely match detailed run
    – Or, tune victim/slowdown

[Figure: bar charts of reply message types; legible labels: "writeback", "rep", and "(no ambig. resolution)" (twice).]

SLIDE 38

MTR averages up to 1.45X faster than FFW

  • MTR spends less time in fast forward
    – MTR does less work in common case
  • Time saved in fast forward time less than MTR transition cost
    – MTR has costlier transition, but
    – Reconstruction scales with touched lines not total accesses

[Figure: stacked bars for FFW (mg) and MTR (mg); y-axis: Runtime (normalized to FFW 1:10), ticks 0.2 to 1; legend: Detailed Simulation, Detailed Warming, Fast to detailed, Fast Forward.]

SLIDE 39

Conclusion: Memory Timestamp Record provides fast, accurate, and flexible SMP simulation

  • MTR
    – 1.45X faster than functional warming
    – 7.7X faster than our detailed simulator
    – Eliminates need to regenerate snapshots
    – Answers architectural questions similar to detailed simulation
  • Future work
    – Simultaneous multiple-configuration simulation
    – MTR compression for disk snapshots
    – Parallelized update and reconstruction

http://cag.csail.mit.edu/scale/bochs.html