slide-2
SLIDE 2

MICRO-48 Tutorial December 5, 2015

Fast and Accurate Microarchitectural Simulation with ZSim

Daniel Sanchez, Nathan Beckmann, Anurag Mukkara, Po-An Tsai MIT CSAIL

slide-3
SLIDE 3

Welcome!

slide-4
SLIDE 4

Agenda

4

8:30 – 9:10   Intro and Overview
9:10 – 9:25   Simulator Organization
9:25 – 10:00  Core Models
10:00 – 10:20 Break / Q&A
10:20 – 11:00 Memory System
11:00 – 11:20 Configuration and Stats
11:20 – 11:40 Validation
11:40 – 12:00 Q&A

slide-5
SLIDE 5

Introduction and Overview

5

slide-10
SLIDE 10

Motivation

6

- Current detailed simulators are slow (~200 KIPS)
- Simulation performance wall:
  - More complex targets (multicore, memory hierarchy, …)
  - Hard to parallelize
- Problem: time to simulate 1000 cores @ 2 GHz for 1 s at
  - 200 KIPS: 4 months
  - 200 MIPS: 3 hours
- Alternatives?
  - FPGAs: fast, good progress, but still hard to use
  - Simplified/abstract models: fast but inaccurate
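(As a rough sanity check on those numbers, assuming about one instruction per cycle per core:
    1000 cores × 2×10^9 cycles/s × 1 s ≈ 2×10^12 instructions
    at 200 KIPS = 2×10^5 instr/s: 2×10^12 / 2×10^5 = 10^7 s ≈ 116 days ≈ 4 months
    at 200 MIPS = 2×10^8 instr/s: 2×10^12 / 2×10^8 = 10^4 s ≈ 2.8 hours.)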

slide-12
SLIDE 12

ZSim Techniques

7

- Three techniques to make 1000-core simulation practical:
  1. Detailed DBT-accelerated core models to speed up sequential simulation
  2. Bound-weave to scale parallel simulation
  3. Lightweight user-level virtualization to bridge the user-level/full-system gap
- ZSim achieves high performance and accuracy:
  - Simulates 1024-core systems at 10s-1000s of MIPS
  - 100-1000x faster than current simulators
  - Validated against a real Westmere system, avg error ~10%

slide-15
SLIDE 15

This Presentation is Also a Demo!

8

- ZSim is simulating these slides
  - OOO Westmere cores running at 2 GHz
  - 3-level cache hierarchy
- Will illustrate other features as I present them

[On-screen demo annotations: total cycles and instructions simulated (in billions); current simulation speed and basic stats (updated every 500 ms); activity legend: busy (> 0.9 cores active), 0.1 < cores active < 0.9, idle (< 0.1 cores active). ZSim performance is relevant when busy; running on a 2-core laptop CPU @ 1.7 GHz, ~12x slower than a 16-core server @ 2.6 GHz.]

slide-20
SLIDE 20

Main Design Decisions

9

- General execution-driven simulator: functional model + timing model
  - Functional model options: emulation (e.g., gem5, MARSSx86) or instrumentation (e.g., Graphite, Sniper)
    - Choice: dynamic binary translation (Pin); with base ISA = host ISA (x86), the functional model comes "for free"
  - Timing model options: cycle-driven or event-driven
    - Choice: DBT-accelerated, instruction-driven core + event-driven uncore

slide-21
SLIDE 21

Outline

10

- Introduction
- Detailed DBT-accelerated core models
- Bound-weave parallelization
- Lightweight user-level virtualization

slide-24
SLIDE 24

Accelerating Core Models

11

- Shift most of the work to the DBT instrumentation phase
- Instruction-driven models: simulate all stages at once for each instruction/µop
  - Accurate even with OOO if the instruction window prioritizes older instructions
  - Faster, but more complex than cycle-driven

[Figure: an x86 basic block (mov/add/mov/ja) is turned into an instrumented basic block with Load, Store, and BasicBlock(BBLDescriptor) calls; the basic block descriptor captures ins-to-µop decoding, µop dependencies, functional units and latencies, and front-end delays.]
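To make the instrumentation idea concrete, here is a minimal Pin-tool sketch of the pattern (illustrative only, not zsim's actual code; BblDescriptor, DecodeBbl, RecordLoad, RecordStore, and SimulateBasicBlock are hypothetical stand-ins):

    #include "pin.H"

    struct BblDescriptor { /* pre-decoded uops, latencies, ports, ... */ };

    // Decode the basic block once, at instrumentation time (the expensive part).
    static BblDescriptor* DecodeBbl(BBL bbl) { return new BblDescriptor(); }

    static VOID RecordLoad(ADDRINT addr)  { /* feed the load address to the core/cache model */ }
    static VOID RecordStore(ADDRINT addr) { /* feed the store address to the core/cache model */ }
    static VOID SimulateBasicBlock(BblDescriptor* d) { /* advance the core model using pre-decoded info */ }

    static VOID InstrumentTrace(TRACE trace, VOID*) {
        for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
            for (INS ins = BBL_InsHead(bbl); INS_Valid(ins); ins = INS_Next(ins)) {
                if (INS_IsMemoryRead(ins))
                    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordLoad, IARG_MEMORYREAD_EA, IARG_END);
                if (INS_IsMemoryWrite(ins))
                    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordStore, IARG_MEMORYWRITE_EA, IARG_END);
            }
            BblDescriptor* desc = DecodeBbl(bbl);  // done once per basic block, not per execution
            BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)SimulateBasicBlock, IARG_PTR, desc, IARG_END);
        }
    }

    int main(int argc, char* argv[]) {
        PIN_Init(argc, argv);
        TRACE_AddInstrumentFunction(InstrumentTrace, 0);
        PIN_StartProgram();  // never returns
        return 0;
    }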

slide-25
SLIDE 25

Detailed OOO Model

12

- OOO core modeled and validated against Westmere
- Pipeline stages: Fetch, Decode, Issue, OOO Exec, Commit
- Main features:
  - Wrong-path fetches
  - Branch prediction
  - Front-end delays (predecoder, decoder)
  - Detailed instruction-to-µop decoding
  - Rename/capture stalls
  - Issue window (IW) with limited size and width
  - Functional unit delays and contention
  - Detailed LSU (forwarding, fences, …)
  - Reorder buffer with limited size and width

slide-29
SLIDE 29

Detailed OOO Model

13

- OOO core modeled and validated against Westmere (Fetch, Decode, Issue, OOO Exec, Commit)
- Fundamentally hard to model: wrong-path execution
  - In Westmere, wrong-path instructions don't affect recovery latency or pollute caches, so skipping them is OK
- Not modeled (yet): rarely used instructions, BTB, LSD, TLBs

slide-30
SLIDE 30

Single-Thread Accuracy

14

- 8.5% average IPC error, max 26%, 21/29 within 10%
- 29 SPEC CPU2006 apps for 50 billion instructions
- Real: Xeon L5640 (Westmere), 3x DDR3-1333, no HT
- Simulated: OOO cores @ 2.27 GHz, detailed uncore

slide-34
SLIDE 34

Single-Thread Performance

15

- Host: E5-2670 @ 2.6 GHz (single-thread simulation)
- 29 SPEC CPU2006 apps for 50 billion instructions
- 40 MIPS hmean / 12 MIPS hmean: ~10-100x faster
- ~3x between the least and most detailed models!

slide-35
SLIDE 35

Outline

16

- Introduction
- Detailed DBT-accelerated core models
- Bound-weave parallelization
- Lightweight user-level virtualization

slide-40
SLIDE 40

Parallelization Techniques

17

- Parallel Discrete Event Simulation (PDES):
  - Divide components across host threads
  - Execute events from each component, maintaining the illusion of full order
  - Accurate, but not scalable (skew between host threads must stay below inter-component latencies, e.g., < 10 cycles)
- Lax synchronization: allow skews above inter-component latencies, tolerate ordering violations
  - Scalable, but inaccurate

[Figure: components (Core 0, Core 1, L3 Bank 0, L3 Bank 1, Mem 0) divided between Host Thread 0 and Host Thread 1, with per-component event timestamps (5, 10, 15, …).]

slide-44
SLIDE 44

Characterizing Interference

18

- Path-altering interference: if we simulate two accesses out of order, their paths through the memory hierarchy change
- Path-preserving interference: if we simulate two accesses out of order, their timing changes but their paths do not

[Figure: two cores sharing an LLC and memory. Path-altering example: two GETS accesses to line A from Core 0 and Core 1; reordering them swaps which one hits and which one misses in the LLC. Path-preserving example: a GETS A miss and a GETS B hit at a blocking LLC; reordering them changes their timing but not their hit/miss paths.]

slide-47
SLIDE 47

Characterizing Interference

19

- Accesses with path-altering interference under barrier synchronization every 1K/10K/100K cycles (64 cores): 1 in 10K accesses
- Path-altering interference is extremely rare in small intervals
- Strategy:
  - Simulate path-preserving interference faithfully
  - Ignore (but optionally profile) path-altering interference

slide-50
SLIDE 50

Bound-Weave Parallelization

20

- Divide the simulation into small intervals (e.g., 1000 cycles)
- Two parallel phases per interval: bound and weave
  - Bound phase: find paths
  - Weave phase: find timings
- Bound-weave is equivalent to PDES for path-preserving interference
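A toy sketch of the per-interval control flow (hypothetical code, not zsim's: the core and uncore models are reduced to placeholders, and the domain assignment is simplified):

    #include <cstdint>
    #include <thread>
    #include <vector>

    struct AccessEvent { uint32_t core; uint64_t minCycle; };  // lower-bounded start cycle

    int main() {
        const uint64_t intervalCycles = 1000;   // e.g., 1000-cycle intervals
        const uint32_t numCores = 4, numDomains = 2;
        std::vector<uint64_t> coreCycle(numCores, 0);

        for (uint64_t interval = 0; interval < 3; interval++) {
            uint64_t limit = (interval + 1) * intervalCycles;
            std::vector<std::vector<AccessEvent>> traces(numCores);

            // Bound phase: simulate each core in parallel up to the interval limit,
            // assuming minimum (uncontended) latencies, and record its accesses.
            std::vector<std::thread> bound;
            for (uint32_t c = 0; c < numCores; c++)
                bound.emplace_back([&, c] {
                    while (coreCycle[c] < limit) {
                        coreCycle[c] += 10;                      // stand-in for the instruction-driven core model
                        traces[c].push_back({c, coreCycle[c]});  // record the access with its lower-bound cycle
                    }
                });
            for (auto& t : bound) t.join();

            // Weave phase: event-driven replay of the recorded traces, one host thread
            // per domain; threads only need to sync when an event crosses domains.
            std::vector<std::thread> weave;
            for (uint32_t d = 0; d < numDomains; d++)
                weave.emplace_back([&, d] {
                    for (auto& trace : traces)
                        for (auto& ev : trace)
                            if (ev.core % numDomains == d) { /* replay ev, adding contention delays */ }
                });
            for (auto& t : weave) t.join();

            // Feedback: adjust each core's clock with the weave-phase delays before the next interval.
        }
        return 0;
    }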

slide-57
SLIDE 57

Bound-Weave Example

21

- 2-core host simulating a 4-core target system
- 1000-cycle intervals
- Divide components among 2 domains

[Figure: target system with 4 cores (each with L1I, L1D, and L2), 4 L3 banks, and 2 memory controllers, split into Domain 0 and Domain 1. Host timeline across Host Thread 0 and Host Thread 1: Bound Phase (parallel simulation until cycle 1000, gathering access traces, with cores 0-3 spread across the host threads), then Weave Phase (parallel event-driven simulation of the gathered traces, per domain, until actual cycle 1000), then Feedback (adjust core cycles), then the next Bound Phase (until cycle 2000).]

slide-58
SLIDE 58

Example: Bound Phase

22

- Host thread 0 simulates core 0 and records a trace of events, e.g.: Core0 @ 30, L3b1 @ 50 HIT, Core0 @ 60, L3b0 @ 80 MISS, Core0 @ 90, Mem1 @ 110 READ, L3b0 @ 230 RESP, Core0 @ 250, L3b3 @ 270 HIT, Core0 @ 290
- Edges between events fix the minimum latency between them (edge weights of 20-120 cycles in this example)
- Minimum L3 and main memory latencies are assumed (no interference)

slide-59
SLIDE 59

Example: Weave Phase

23

- Host threads simulate components from domains 0 and 1
- Host threads only sync when needed
  - e.g., thread 1 simulates other events (not shown) until cycle 110, then syncs
- Lower bounds guarantee no order violations

[Figure: the recorded core 0 event trace from the bound phase, with its events split between Host Thread 0 and Host Thread 1.]

slide-66
SLIDE 66

Example: Weave Phase

24

- Delays propagate as events are simulated:
  - e.g., a row miss at the memory adds +50 cycles, and the events downstream of the Mem1 READ are pushed past their lower bounds (to cycles 170, 280, 290, 300, 320, and 340 in this example)

[Figure: the same core 0 event trace, annotated with the delayed cycle numbers.]
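A minimal sketch of this kind of lower-bounded delay propagation along trace edges (illustrative only, with made-up edge latencies; zsim's weave-phase event classes are more involved):

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <utility>
    #include <vector>

    struct Event {
        const char* name;
        uint64_t cycle;                                    // lower bound from the bound phase
        std::vector<std::pair<int, uint64_t>> children;    // (child index, minimum edge latency)
    };

    int main() {
        // Core0 @ 90 -> Mem1 READ -> L3b0 RESP -> Core0, with minimum edge latencies 20, 120, 20.
        std::vector<Event> ev = {
            {"Core0",      90, {{1, 20}}},
            {"Mem1 READ", 110, {{2, 120}}},
            {"L3b0 RESP", 230, {{3, 20}}},
            {"Core0",     250, {}},
        };

        ev[1].cycle += 50;  // the weave phase discovers a row miss: +50 cycles at the memory

        // Propagate in trace (topological) order: a child can never start earlier than
        // its lower bound, nor earlier than its parent plus the minimum edge latency.
        for (size_t i = 0; i < ev.size(); i++)
            for (auto [child, lat] : ev[i].children)
                ev[child].cycle = std::max(ev[child].cycle, ev[i].cycle + lat);

        for (const auto& e : ev)
            std::printf("%s -> cycle %llu\n", e.name, (unsigned long long) e.cycle);
        return 0;
    }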

slide-68
SLIDE 68

Bound-Weave Scalability

25

- Bound phase scales almost linearly
  - Uses a novel shared-memory synchronization protocol (later)
- Weave phase scales much better than PDES
  - Threads only need to sync when an event crosses domains
  - A lot of work is shifted to the bound phase
- Need bound and weave models for each component, but the division is often very natural
  - e.g., caches: hit/miss in the bound phase; MSHRs, pipelined accesses, port contention in the weave phase

slide-71
SLIDE 71

Bound-Weave Take-Aways

26

- Minimal synchronization:
  - Bound phase: unordered accesses (like lax synchronization)
  - Weave phase: only sync on actual dependencies
- No ordering violations in the weave phase
- Works with standard event-driven models
  - e.g., 110 lines of code to integrate with DRAMSim2

slide-72
SLIDE 72

Multithreaded Accuracy

27

- 23 apps: PARSEC, SPLASH-2, SPEC OMP2001, STREAM
- 11.2% avg performance error (not IPC), 10/23 within 10%
- Similar differences as in the single-core results

slide-76
SLIDE 76

1024-Core Performance

28

- Host: 2-socket Sandy Bridge @ 2.6 GHz (16 cores, 32 threads)
- Results for the 14/23 parallel apps that scale
- 200 MIPS hmean / 41 MIPS hmean: ~100-1000x faster
- ~5x between the least and most detailed models!

slide-78
SLIDE 78

Bound-Weave Scalability

29

10.1-13.6x speedup @ 16 cores

slide-79
SLIDE 79

Outline

30

- Introduction
- Detailed DBT-accelerated core models
- Bound-weave parallelization
- Lightweight user-level virtualization

slide-81
SLIDE 81

Lightweight User-Level Virtualization

31

- No 1K-core OSs and no parallel full-system DBT: ZSim has to be user-level for now
- Problem: user-level simulators are limited to simple workloads
- Lightweight user-level virtualization: bridge the gap with full-system simulation
  - Simulate accurately if the time spent in the OS is minimal

slide-82
SLIDE 82

Lightweight User-Level Virtualization

32

- Multiprocess workloads
- Scheduler (threads > cores)
- Time virtualization
- System virtualization
- Simulator-OS deadlock avoidance
- Signals
- ISA extensions
- Fast-forwarding

slide-84
SLIDE 84

ZSim Limitations

33

- Not implemented yet:
  - Multithreaded cores
  - Detailed NoC models
  - Virtual memory (TLBs)
- Fundamentally hard:
  - Systems or workloads with frequent path-altering interference (e.g., fine-grained message-passing across the whole chip)
  - Kernel-intensive applications

slide-85
SLIDE 85

Summary

34

- Three techniques to make 1K-core simulation practical:
  - DBT-accelerated models: 10-100x faster core models
  - Bound-weave parallelization: ~10-15x speedup from parallelization with minimal accuracy loss
  - Lightweight user-level virtualization: simulate complex workloads without full-system support
- ZSim achieves high performance and accuracy:
  - Simulates 1024-core systems at 10s-1000s of MIPS
  - Validated against a real Westmere system, avg error ~10%

slide-86
SLIDE 86

Simulator Organization

35

slide-87
SLIDE 87

Main Components

36

- Harness
- Driver
- System Initialization
- Config
- Core timing models
- Memory system timing models
- Global Memory
- User-level virtualization
- Stats

slide-95
SLIDE 95

ZSim Harness

37

- Most of zsim is implemented as a pintool (libzsim.so)
- A separate harness process (zsim) controls the simulation:
  - Initializes global memory
  - Launches pin processes
  - Checks for deadlock

[Diagram: running ./build/opt/zsim test.cfg starts the zsim harness, which sets up Global Memory and launches one pin process per process entry in the config:
  process0 = { command = "ls"; };        ->  pin -t libzsim.so -- ls
  process1 = { command = "echo foo"; };  ->  pin -t libzsim.so -- echo foo
  …]
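For reference, a minimal test.cfg along these lines might look as follows (a sketch only: the process blocks mirror the slide, while the sys/sim sections are placeholders; consult the sample configs shipped with zsim for the real field names):

    sys = {
        // cores, caches, and memory controllers are configured here
    };

    sim = {
        // simulation options (phase length, stats output, ...) go here
    };

    process0 = { command = "ls"; };
    process1 = { command = "echo foo"; };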

slide-100
SLIDE 100

Global Memory

38

- Pin processes communicate through a shared memory segment, managed as a single global heap
- All simulator objects must be allocated in the global heap
- The global heap and the libzsim.so code are mapped at the same memory locations across all processes, so normal pointers and virtual functions can be used

[Diagram: process 0 and process 1 address spaces, each containing program code, a local heap, the shared global heap, and libzsim.so.]

slide-104
SLIDE 104

Global Memory Allocation Idioms

39

- Globally-allocated objects: inherit from GlobAlloc

  class SimObject : GlobAlloc { …

- STL classes that allocate heap memory: use the g_stl variants

  g_vector<uint64_t> cacheLines;

- C-style memory allocation (discouraged): gm_malloc, gm_calloc, gm_free, …
- Declare globally-scoped variables under struct zinfo
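Putting these idioms together, a small illustrative sketch (MyCache is a hypothetical class, and the header names in the includes are assumptions; GlobAlloc, g_vector, gm_malloc, and gm_free come from zsim's own headers):

    #include <cstdint>
    #include "galloc.h"           // GlobAlloc, gm_malloc, gm_free (assumed header name)
    #include "g_std/g_vector.h"   // g_vector (assumed header name)

    class MyCache : public GlobAlloc {     // operator new/delete now allocate from the global heap
      public:
        explicit MyCache(uint32_t numLines) : cacheLines(numLines, 0) {}
        uint64_t line(uint32_t i) const { return cacheLines[i]; }
      private:
        g_vector<uint64_t> cacheLines;     // g_stl variant: its storage is globally allocated too
    };

    void example() {
        MyCache* c = new MyCache(256);               // lives in the global heap, visible to all pin processes
        uint64_t* raw = (uint64_t*) gm_malloc(64);   // C-style global allocation (discouraged)
        gm_free(raw);
        delete c;
    }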

slide-113
SLIDE 113

Initialization Sequence

40

1. Harness
2. Config
3. Global Memory
4. Driver
5. User-level virtualization
6. System Initialization
7. Stats
8. Memory system timing models
9. Core timing models

slide-114
SLIDE 114

Thanks For Your Attention! Questions?

slide-115
SLIDE 115

Backup Slides

slide-116
SLIDE 116

Single-Thread Accuracy: Traces

116

slide-117
SLIDE 117

Single-Thread Accuracy: Traces

117

slide-118
SLIDE 118

Motivation

118

- Timeline:
  - 2008: Decide to study 1K-core systems for my Ph.D. thesis
  - 2009: Try every sim out there, none fast enough; got M5+GEMS to 512 threads [ASPLOS 2010], barely usable
  - 2010: Start developing ZSim [ZCache, MICRO 2010]
  - 2011: Make ZSim flexible and scalable, develop detailed models; other groups start using it
  - 2012: Let's publish a paper and release it…
- ZSim design approach:
  - Make judicious tradeoffs to achieve detailed 1K-core sims efficiently
  - Verify that those tradeoffs result in minor inaccuracies
- Disclaimer: not a silver bullet, and the tradeoffs may not be accurate for your target system; you should validate the tradeoffs!

slide-119
SLIDE 119

Instruction-Driven Timing Models

119

- Cycle/event-driven models: simulate all stages cycle by cycle
- Instruction-driven models: simulate all stages at once for each instruction/uop
  - Each stage has separate clocks
  - Ordered queues (FetchQ, UopQ, LoadQ, StoreQ, ROB) model feedback loops between stages
  - The issue window tracks the cycles each FU is used to determine the dispatch cycle
- Even with OOO, accurate if:
  1. The IW prioritizes older uops (OK)
  2. Uop exec times are not affected by newer uops (OK except for mem uops; ignore for now)
- Instruction code drives the model directly, so DBT can accelerate it better, but it is harder to develop

[Figure: pipeline stages Fetch, Decode, Issue, OOO Exec, Commit.]

slide-120
SLIDE 120

DBT-based Acceleration

120

- With instruction-driven models, most overheads can be pushed into the instrumentation phase

[Figure: the original code (one x86 basic block of mov/lea/add/mov/cmp/jne) and the instrumented code, which adds Load(addr = -0x38(%rbp)), Store(addr = -0x2068(%rbp)), and BasicBlock(DecodedBBL) calls. The basic block descriptor, built once at instrumentation time, accounts for predecoder/decoder delays, instruction-to-uop fission, instruction fusion, and per-uop dependencies, latencies, and ports.]

Basic block descriptor:

    Type    Src1  Src2  Dst1  Dst2   Lat  PortMsk
    Load    rbp         rcx               001000
    Exec    rbp         rdx          3    110001
    Exec    rax   rdx   rdx   rflgs  1    110001
    StAddr  rbp         S0           1    000100
    StData  rdx         S0                000010
    Exec    rax   rip   rip   rflgs  1    000001

slide-121
SLIDE 121

Parallelization Techniques

121

- Parallel Discrete Event Simulation (PDES):
  - Divide components across threads; execute events from each component, maintaining the illusion of full order
  - Pessimistic PDES: keep the skew between threads below the inter-component latency (simple, but excessive sync)
  - Optimistic PDES: speculate and roll back on ordering violations (less sync, but heavyweight)
  - Accurate, but scales poorly
- Lax synchronization: allow skews above inter-component latencies, tolerate ordering violations
  - Scalable, but inaccurate

[Figure: components (Core 0, Core 1, L3 Bank 0, L3 Bank 1, Mem 0) divided between Thread 0 and Thread 1, with per-component event timestamps.]

slide-122
SLIDE 122

Bound-Weave Parallelization

122

- Divide the simulation into small intervals (e.g., 1000 cycles)
- Two parallel phases per interval: bound and weave
- Bound phase (find paths):
  - Simulate each core independently using instruction-driven models
  - Record the paths of all accesses through the memory hierarchy
  - Uncore models assume no interference and use the minimum response time for all accesses, which puts a lower bound on all events
  - e.g., for a main memory access: uncontended caches, buses, row hit
- Weave phase (find timings):
  - Perform parallel event-driven simulation of the recorded events
  - Leverage prior knowledge of the events to scale
- Bound-weave is equivalent to PDES for path-preserving interference

slide-123
SLIDE 123

Bound-Weave Example

123

- Weave phase: events spread across two threads
- Crossing events (marked in the figure) let threads synchronize only when needed
  - e.g., thread 1 reaches cycle 110 while "L3b0 @ 80" is not done yet, so it checks thread 0's progress and requeues itself for later
- Other synchronization-avoiding mechanisms are described in the paper

[Figure: the core 0 event trace (including a Mem0 WBACK and an L3b0 FREE MSHR event), with domain 0's events split between Thread 0 and Thread 1.]

slide-124
SLIDE 124

Bound-Weave Example

124

- Delays propagate across crossings:
  - e.g., the row miss adds +50 cycles, pushing the downstream events to cycles 280, 290, 300, 320, and 350
- Events are lower-bounded, so there are no ordering violations
- Works with standard event-driven models!
  - e.g., 110 lines of code to integrate with DRAMSim2

[Figure: the same core 0 event trace split across Thread 0 and Thread 1 (domain 0), annotated with the delayed cycle numbers.]