


slide-1
SLIDE 1

Synthetic Full System

Traffic Models Capturing Cache Coherence Behaviour

Mario Badr | Natalie Enright Jerger

slide-2
SLIDE 2

2

slide-3
SLIDE 3

Amateur Photography

3

http://sevennine.net/archives/2009/10/01/torontos-skyline-at-night/

slide-4
SLIDE 4

Photography 101

4

http://sevennine.net/archives/2009/10/01/torontos-skyline-at-night/

slide-5
SLIDE 5

Hundreds of Pictures

5

slide-6
SLIDE 6

SynFull

6

http://sevennine.net/archives/2009/10/01/torontos-skyline-at-night/

slide-7
SLIDE 7

Pictures for Facebook

7

slide-8
SLIDE 8

What Does SynFull Do?

  • Model real application traffic to the NoC
  • Generate realistic traffic synthetically for the NoC
  • Iterate over several NoC designs quickly

8

Tool Available for Download

slide-9
SLIDE 9

SynFull’s Goals

  • Generic – Current and future applications

– 16 different benchmarks

  • Accurate – Comparable performance metrics

– 10.5% error

  • Fast – Faster than full system and traces

– 52x speed up

9

slide-10
SLIDE 10

NoC Simulation Methodologies

10

  • Full System
  • Traces
  • Traffic Patterns
slide-11
SLIDE 11

Full System Simulation

[Diagram: the application runs on a full system simulator (processor, cache, disk, other components) coupled to a NoC simulator. Packets sent enter the NoC, and packets arrived feed back into the full system simulator.]

Feedback! Accurate But Slow

slide-12
SLIDE 12

Trace Simulation

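[Diagram: a trace of packets sent is recorded while the application runs on a system (processor, cache, disk, other) with NoC A. A trace simulator then feeds that trace to a NoC simulator modelling NoC B.]

Faster But Less Accurate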

slide-13
SLIDE 13

Traffic Patterns

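[Diagram: a synthetic traffic driver takes the place of the application and injects packets into the NoC simulator according to a traffic pattern.]

Traffic Pattern

  • Uniform Random
  • Bit Complement
  • Bit Reverse
  • Bit Rotation
  • Shuffle
  • Transpose
  • Tornado
  • Neighbour

Very Fast But Inaccurate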
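The slide only names the patterns. As a rough sketch, a few of them can be written as functions from a source node to a destination node, using the common textbook bit-permutation definitions (an assumption; the deck does not define them). The function names and the 16-node, 4-bit address size are illustrative.

```python
import random

def bit_complement(src: int, bits: int) -> int:
    """Destination is the source address with every bit inverted."""
    return ~src & ((1 << bits) - 1)

def bit_reverse(src: int, bits: int) -> int:
    """Destination is the source address with its bit order reversed."""
    out = 0
    for i in range(bits):
        if src & (1 << i):
            out |= 1 << (bits - 1 - i)
    return out

def shuffle(src: int, bits: int) -> int:
    """Destination is the source address rotated left by one bit."""
    mask = (1 << bits) - 1
    return ((src << 1) | (src >> (bits - 1))) & mask

def transpose(src: int, bits: int) -> int:
    """Destination swaps the upper and lower halves of the address (bits must be even)."""
    half = bits // 2
    mask = (1 << bits) - 1
    return ((src << half) | (src >> half)) & mask

def uniform_random(src: int, bits: int, rng: random.Random) -> int:
    """Destination is drawn uniformly from all nodes, independent of the source."""
    return rng.randrange(1 << bits)

if __name__ == "__main__":
    BITS = 4  # 16 nodes, chosen only for illustration
    for node in range(1 << BITS):
        print(node, bit_complement(node, BITS), shuffle(node, BITS))
```

None of these functions depends on time or on earlier messages, which is one way to read the "Inaccurate" label on the slide.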

slide-14
SLIDE 14

The Opportunity

[Plot axes: Speed versus Accuracy]

slide-15
SLIDE 15

The Opportunity

[Plot axes: Speed versus Accuracy, with SynFull placed on the plot]

slide-16
SLIDE 16

Achieving the Goals

  • Synthetic Cache Coherence

– Dependent Messages
– Enable Research

  • Time-Varying Behaviour

– Short and Long Bursts of Traffic

  • Convergence

– Simulation length?

16

Accuracy Speed

slide-17
SLIDE 17

Motivating Cache Coherence

Shuffle Fast Fourier Transform

17

Cache Coherence Affects Traffic Behaviour

slide-18
SLIDE 18

Capturing Coherence Traffic

  • Example

– MOESI Protocol
– Can be adapted

18

1 2 3 4 5 6 7 8 9

slide-19
SLIDE 19

Capturing Coherence Traffic

  • Initiate Transaction
  • Store Miss

– Source

19

1 2 3 4 5 6 7 8 9

Store Miss 7

slide-20
SLIDE 20

Capturing Coherence Traffic

  • Store Miss

– Source
– Destination

20

1 2 3 4 5 6 8 9

Directory 3

7

slide-21
SLIDE 21

Capturing Coherence Traffic

  • Store Miss
  • Forwarded Request

– Destination

21

1 2 3 4 5 6 8 9

Owner 1

7

slide-22
SLIDE 22

Capturing Coherence Traffic

  • Store Miss
  • Forwarded Request
  • Invalidations

– Quantity
– Destinations

22

1 2 3 4 5 6 8 9

Invalidate 2, 6

7

slide-23
SLIDE 23

Capturing Coherence Traffic

  • Store Miss
  • Forwarded Request
  • Invalidations
  • Acknowledgements

23

1 2 3 4 5 6 8 9

ACKs 2, 6

7

slide-24
SLIDE 24

Capturing Coherence Traffic

  • Store Miss
  • Forwarded Request
  • Invalidations
  • Acknowledgements
  • Data Response

24

1 2 3 4 5 6 8 9

Data to 7

7

slide-25
SLIDE 25

Capturing Coherence Traffic

  • Store Miss
  • Forwarded Request
  • Invalidations
  • Acknowledgements
  • Data Response
  • Unblock

25

1 2 3 4 5 6 8 9

Transaction Complete

7
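The store-miss walk-through on slides 19 to 25 can be sketched as a list of dependent messages. The node numbers come from the slides (requester 7, directory 3, owner 1, sharers 2 and 6). Who sends the invalidations and who receives the ACKs is not stated on the slides; the sketch assumes the directory invalidates, sharers acknowledge to the requester, the owner supplies the data, and the requester unblocks the directory. `Message` and `store_miss` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Message:
    kind: str  # e.g. "request", "forward", "invalidate", "ack", "data", "unblock"
    src: int   # sending node
    dst: int   # receiving node

def store_miss(requester: int, directory: int, owner: int,
               sharers: List[int]) -> List[Message]:
    """Return the messages of one store-miss transaction, in dependency order.

    Assumed flow (not spelled out on the slides): the directory forwards the
    request and sends invalidations, sharers ACK to the requester, the owner
    sends the data, and the requester finally unblocks the directory.
    """
    msgs = [Message("request", requester, directory),
            Message("forward", directory, owner)]
    msgs += [Message("invalidate", directory, s) for s in sharers]
    msgs += [Message("ack", s, requester) for s in sharers]
    msgs.append(Message("data", owner, requester))
    msgs.append(Message("unblock", requester, directory))
    return msgs

if __name__ == "__main__":
    for m in store_miss(requester=7, directory=3, owner=1, sharers=[2, 6]):
        print(f"{m.kind:10s} {m.src} -> {m.dst}")
```

A synthetic model in this style needs only the initiating event, its source, and the probabilities for destination, forwarding, and number of invalidations; the remaining messages follow from the protocol.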

slide-26
SLIDE 26

Time-Varying Behaviour

[Diagram: execution timeline divided by barriers.]

Initiating Transactions and Sharing Patterns Can Change

slide-27
SLIDE 27

Time-Varying Behaviour

27

Applications go through phases

[Plot: packets injected per time bin (500,000 cycles per bin) for the Fluidanimate benchmark; bins are labelled High (H) or Low (L).]

slide-28
SLIDE 28

Modelling Time-Varying Behaviour

  • Create and group phases

– Clustering

  • Transition from one phase to another

– Markov Chains

28

slide-29
SLIDE 29

Dividing Into Intervals

29

Intervals are a fixed size

slide-30
SLIDE 30

Dividing Into Intervals

30

Visually we see: High, Low + High, and Low Intervals
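A minimal sketch of the two steps named so far: divide a trace into fixed-size intervals, then group similar intervals into phases. The clustering here is a tiny one-dimensional k-means on packet counts, used as a stand-in; the slides say only "Clustering", so the algorithm, the function names, and the toy numbers are all assumptions.

```python
from typing import List, Tuple

def bin_counts(inject_cycles: List[int], interval: int, total_cycles: int) -> List[int]:
    """Count packet injections per fixed-size interval.

    Every cycle in inject_cycles must be smaller than total_cycles.
    """
    n_bins = -(-total_cycles // interval)  # ceiling division
    counts = [0] * n_bins
    for cycle in inject_cycles:
        counts[cycle // interval] += 1
    return counts

def kmeans_1d(values: List[float], k: int = 2, iters: int = 20) -> Tuple[List[int], List[float]]:
    """Group values into k clusters (k >= 2); returns (labels, centres)."""
    lo, hi = min(values), max(values)
    # Spread the initial centres evenly between the smallest and largest value.
    centres = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def assign() -> List[int]:
        return [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]

    for _ in range(iters):
        labels = assign()
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centres[j] = sum(members) / len(members)
    return assign(), centres

if __name__ == "__main__":
    # Toy trace: injection cycles for a 30-cycle run, binned into 10-cycle intervals.
    counts = bin_counts([0, 1, 2, 3, 10, 25, 26, 27, 28, 29], interval=10, total_cycles=30)
    labels, centres = kmeans_1d(counts, k=2)
    print(counts, labels, centres)
```

Intervals that share a label are treated as the same phase; with k = 2 the labels correspond loosely to the Low and High intervals on the slide.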

slide-31
SLIDE 31

Phase Transitions: Markov Chains

31

[Diagram: Markov chain over the phases, with edges labelled 17%, 83%, 100%, 45%, and 55%. Each label is P[Next State | Current State].]
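As a sketch of how a Markov chain can drive phase transitions: each phase maps to a probability distribution over the next phase, and the generator samples from it once per interval. The percentages below are the ones printed on the slide, but the figure does not survive in this transcript, so the assignment of each percentage to a particular edge (and the phase names) is a guess made for illustration.

```python
import random
from typing import Dict, List

# P[next phase | current phase]. Edge assignment is hypothetical.
PHASE_TRANSITIONS: Dict[str, Dict[str, float]] = {
    "high":  {"high": 0.83, "mixed": 0.17},
    "mixed": {"low": 1.00},
    "low":   {"low": 0.55, "high": 0.45},
}

def next_phase(current: str, transitions: Dict[str, Dict[str, float]],
               rng: random.Random) -> str:
    """Sample the next phase given only the current one (the Markov property)."""
    phases = list(transitions[current])
    weights = [transitions[current][p] for p in phases]
    return rng.choices(phases, weights=weights, k=1)[0]

def generate_phases(start: str, transitions: Dict[str, Dict[str, float]],
                    n: int, seed: int = 0) -> List[str]:
    """Generate a sequence of n phases, one per interval."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(n - 1):
        seq.append(next_phase(seq[-1], transitions, rng))
    return seq

if __name__ == "__main__":
    print(generate_phases("high", PHASE_TRANSITIONS, 12))
```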

slide-32
SLIDE 32

Traffic Comparison

[Plot: actual versus synthetic traffic.]

Coarse Granularity → Average Behaviour

slide-33
SLIDE 33
Capturing Short Bursts

  • Macro Level

– 100,000s of Cycles
– Long phases
– Outer-Loops

  • Micro Level

– 100s of Cycles
– Short Bursts
– Inner-Loops

  • Hierarchical Model

slide-34
SLIDE 34

Modelling Parameters

  • Model accuracy affected by:

– Interval Size
– Interval Similarity
– Number of Clusters

34

See Paper for Parameter Sweep & Recommendations

slide-35
SLIDE 35

Creating The Models

[Diagram: the application runs on a full system (processor, cache, disk, other) with an ideal network, producing an ideal trace.]

slide-36
SLIDE 36

Creating The Models

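[Diagram: the application runs on a full system (processor, cache, disk, other) with an ideal network, producing an ideal trace. SynFull takes the ideal trace and the modelling parameters and produces a model.]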

slide-37
SLIDE 37

Creating The Models

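[Diagram: the application runs on a full system (processor, cache, disk, other) with an ideal network, producing an ideal trace. SynFull takes the ideal trace and the modelling parameters and produces a model. The model drives a traffic generator attached to one or more NoCs.]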

slide-38
SLIDE 38

Evaluation Methodology

Network           meshDOR      meshADAP         fbfly
Topology          Mesh         Mesh             Flattened Butterfly
Channel Width     8 bytes      4 bytes          4 bytes
Virtual Channels  2 per port   2 per port       4 per port
Routing           XY           Adaptive YX-XY   UGAL

  • 16 Out-of-Order Cores
  • MOESI Protocol
  • 16 Benchmarks (Splash-2, PARSEC)
  • Comparison against Traces with Dependencies

38

slide-39
SLIDE 39

Packet Latency Error

[Bar chart: Geomean Error Percentage of packet latency (y-axis, 0% to 100%) for meshDOR, meshADAP, and fbfly; series: Trace Dependency and SynFull. Lower is better.]

No Throttling For Initiating Transactions

slide-40
SLIDE 40

Distribution Error

[Bar chart: Geomean of Hellinger Distances (y-axis, 0.00 to 0.20) for meshDOR, meshADAP, and fbfly; series: Trace Dependency and SynFull. Lower is better.]

Captures Congestion
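For reference, the Hellinger distance between two discrete distributions P and Q is H(P, Q) = (1/√2) · sqrt(Σ (√p_i − √q_i)²). It is 0 for identical distributions and 1 for distributions with no overlap. The sketch below applies it to two packet-latency histograms; the histogram values and function names are made up for illustration and are not taken from the talk.

```python
import math
from typing import List, Sequence

def normalise(hist: Sequence[float]) -> List[float]:
    """Turn raw histogram counts into a probability distribution."""
    total = sum(hist)
    return [h / total for h in hist]

def hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Hellinger distance between two distributions over the same bins."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)

if __name__ == "__main__":
    # Hypothetical latency histograms (same bin edges) from two simulations.
    reference = normalise([120, 300, 80, 10])
    synthetic = normalise([110, 310, 70, 20])
    print(hellinger(reference, synthetic))
```

Comparing whole distributions rather than a single average is what lets the metric show whether the tail of the latency distribution, where congestion appears, is reproduced.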

slide-41
SLIDE 41

What About Speed?

[Diagram: two-state Markov chain (states 1 and 2) shown with its Markov probability matrix.]

slide-42
SLIDE 42

What About Speed?

Markov Probability Matrix, after a while… converges

[Diagram: two-state Markov chain (states 1 and 2) with converged probabilities 56% and 44%.]
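A sketch of the convergence the slide refers to: repeatedly multiplying a state distribution by the transition matrix settles on a steady-state distribution. The matrix on the slide is not recoverable from this transcript, so the two-state matrix below is hypothetical, with entries picked so that the result lands on the 56% / 44% shown.

```python
from typing import List

def stationary(matrix: List[List[float]], tol: float = 1e-12,
               max_iters: int = 10_000) -> List[float]:
    """Power iteration: apply the row-stochastic matrix until the distribution stops changing."""
    n = len(matrix)
    dist = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(max_iters):
        nxt = [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(dist, nxt)) < tol:
            return nxt
        dist = nxt
    raise RuntimeError("distribution did not converge")

# Hypothetical P[next | current]; row i is the distribution over next states from state i.
P = [[0.78, 0.22],
     [0.28, 0.72]]

if __name__ == "__main__":
    print([round(x, 4) for x in stationary(P)])
```

Once the steady state is known ahead of time, a simulation can stop as soon as the observed mix of states is close enough to it, instead of running for a fixed length.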

slide-43
SLIDE 43

Speed Up

[Bar chart: Speed Up (y-axis, 10 to 60) for Trace Dependency, SynFull, and SynFull (SS); bar labels 24, 27, 52.]

52x Speed Up With 11.7% Error

slide-44
SLIDE 44

Conclusion

  • Implemented Synthetic Traffic Models that are

– Accurate: 10.5% error
– Fast: Over 50x average speed up
– Generic: SynFull works for many applications

44

slide-45
SLIDE 45

QUESTIONS & ANSWERS

Thank you for listening!

Try SynFull Out: http://www.eecg.toronto.edu/~enright/items/synfull_download.html

slide-46
SLIDE 46

Back Up: Design Space Exploration

[Line plot: Average Packet Latency (y-axis, 10 to 60) against Buffer Size in Number of Flits (2, 4, 8, 16); series: Full System and SynFull.]

Same Conclusion, Less Time

slide-47
SLIDE 47

Back Up: meshDOR Packet Latency

[Bar chart: Avg. Packet Latency (axis, 20 to 140); series: Full System, Trace Dependency, SynFull.]

slide-48
SLIDE 48

Back Up: fbfly Packet Latency

[Bar chart: Avg. Packet Latency (axis, 20 to 140); series: Full System, Trace Dependency, SynFull.]

slide-49
SLIDE 49

Back Up: meshADAP Packet Latency

[Bar chart: Avg. Packet Latency (axis, 20 to 140); series: Full System, Trace Dependency, SynFull.]

slide-50
SLIDE 50

Back Up: Average Throughput

[Bar chart: Geomean Error Percentage (y-axis, 0% to 20%) for meshDOR, meshADAP, and fbfly; bar labels 12%, 16%, 12%.]

slide-51
SLIDE 51

Back Up: Speed Up Per Application

[Bar chart: Average Speed Up per application (axis, 20 to 160); series: Trace Dependency, SynFull, SynFull (SS).]

Averaged Over 3 Runs (Different NoCs)

slide-52
SLIDE 52

Back Up: Steady State

Before Simulation

Steady State     42%     21%     37%
Acceptable +/-   1.68%   0.84%   1.48%
MSE              0.000191

During Simulation

Steady State     42%     21%     37%
Current +/-      ?%      ?%      ?%

Current +/- < MSE → Exit

Current +/- Depends on State Transitions (RNG) and MSE
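The numbers on this slide are consistent with a 4% tolerance around each steady-state probability (for example 1.68% is 4% of 42%), and the MSE of those three deviations works out to about 0.000191. The 4% figure is inferred from that arithmetic rather than stated on the slide. A sketch of the threshold and the exit test, with hypothetical function names and visit counts:

```python
from typing import List, Sequence

def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

STEADY_STATE = [0.42, 0.21, 0.37]  # from the slide
TOLERANCE = 0.04                   # inferred: 1.68 / 42 = 0.84 / 21 = 1.48 / 37 = 0.04

# Before simulation: derive the acceptable MSE from the tolerated deviations.
acceptable = [p * (1 + TOLERANCE) for p in STEADY_STATE]
THRESHOLD = mse(acceptable, STEADY_STATE)

def reached_steady_state(visits: List[int], steady: Sequence[float],
                         threshold: float) -> bool:
    """During simulation: compare the observed fraction of time in each state with the steady state."""
    total = sum(visits)
    current = [v / total for v in visits]
    return mse(current, steady) < threshold

if __name__ == "__main__":
    print(round(THRESHOLD, 6))
    # Hypothetical visit counts per state after some number of intervals.
    print(reached_steady_state([40, 25, 35], STEADY_STATE, THRESHOLD))
```

How soon the exit condition triggers depends on the random state transitions actually taken, which matches the slide's note about the RNG.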

52

slide-53
SLIDE 53

Back Up: Shuffle NoC

53

1 2 8 4 9 12 3 6 14 7 13 11 15 5 10

slide-54
SLIDE 54

Back Up: Shuffle NoC

54

1 2 8 4 9 12 3 6 14 7 13 11 15 5 10

Ring Topology; Max. 2 Hops Needed

slide-55
SLIDE 55

Triangle Score

[Triangle chart with corners: Cache Coherence, Time Varying, Fast]

slide-56
SLIDE 56

NoC Simulation Methodologies

[Triangle chart with corners: Cache Coherence, Time Varying, Fast; scored for Full System, Trace, and Traffic Pattern]

slide-57
SLIDE 57

SynFull

[Triangle chart with corners: Cache Coherence, Time Varying, Fast]