Reservation-based NoC timing models for large-scale architectural simulation - PowerPoint PPT Presentation



SLIDE 1

Reservation-based NoC timing models for large-scale architectural simulation

Javier Navaridas, Behram Khan, Salman Khan, Paolo Faraboschi, Mikel Luján

SLIDE 2

Introduction

Existing electronic miniaturization technologies allow several processing cores to be integrated into a single chip
General-purpose processors provide up to 16 cores
Many-core processors such as Tilera provide up to 64 cores
Designing 1000-core processors is a current hot topic

Rigel [Kelm et al.], ATAC [Kurian et al.], TERAFLUX [Portero et al.]

  • Kelm et al., "Rigel: an architecture and scalable programming interface for a 1000-core accelerator"
  • Kurian et al., "ATAC: a 1000-core cache-coherent processor with on-chip optical network"
  • Portero et al., "TERAFLUX: Exploiting tera-device computing challenges"

SLIDE 3

Evaluating large-scale systems

Traditionally, the micro-architecture community has disregarded on-chip communications when evaluating processor designs
With the advent of such large-scale processors, NoC behaviour needs to be taken into consideration
Evaluating such large-scale systems requires a considerable amount of compute power
NoC simulation has to be included in a lightweight manner, usually in the form of a timing model

SLIDE 4

Modelling the NoC for Evaluation

Full-system simulation

Full computational model of the NoC
Very high accuracy
Expensive in terms of compute power

Network-agnostic timing models

Network functionality is not considered
Very low accuracy
NoC modelling barely affects simulation speed

SLIDE 5

Modelling the NoC for Evaluation

Statistical timing models [Papamichael et al]

Estimate packet latency from an external analysis of the traffic

Traffic analysis may be done concurrently or off-line

Improves accuracy without exacerbating compute requirements when compared with network-agnostic models
Several limitations:

Latency distributions are case-specific
Latency figures are difficult to estimate for variable traffic patterns
Require tracking network load

Papamichael et al., "FIST: A fast, lightweight, FPGA-friendly packet latency estimator for NoC modeling in full-system simulations"

SLIDE 6

Modelling the NoC for Evaluation

Reservation-based timing models

NoC is modelled in a simple way:

A collection of resources that need to be reserved before use
If a resource is reserved, it cannot be used until it is freed

Good accuracy
Allows fast simulation
Avoids the limitations of the statistical models:

Latency depends on the actual state of the network
Does not require tracking network load
External traffic analysis not needed

SLIDE 7

Our Implementation

Base data-structure

Each resource is modelled as a sorted linked list representing the periods in which it is reserved
A 'Reserve' function operates over the data structure:

Searches for a free period that can accommodate a given reservation, reserves the resource and returns the ending timestamp
Eliminates outdated reservations and merges existing reservations to keep the data structure manageable
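The 'Reserve' operation can be illustrated with a minimal Python sketch (hypothetical names; the slides describe a sorted linked list, approximated here with a sorted Python list of intervals):

```python
import bisect

class Resource:
    """A reservable NoC resource: a sorted list of (start, end) busy intervals."""

    def __init__(self):
        self.busy = []  # non-overlapping (start, end) tuples, kept sorted

    def reserve(self, earliest, duration, now=0):
        """Reserve `duration` time units in the first free gap at or after
        `earliest`; return the timestamp at which the reservation ends.

        Also eliminates reservations that ended before `now` and merges
        touching intervals, keeping the structure small."""
        # Drop outdated reservations.
        self.busy = [(s, e) for (s, e) in self.busy if e > now]
        # Find the first gap that can accommodate the reservation.
        start = earliest
        for s, e in self.busy:
            if start + duration <= s:   # fits before this busy interval
                break
            start = max(start, e)       # otherwise try after it
        end = start + duration
        bisect.insort(self.busy, (start, end))
        # Merge touching or overlapping neighbours.
        merged = []
        for s, e in self.busy:
            if merged and s <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        self.busy = merged
        return end
```

For example, with the resource busy during [0, 8), a two-unit request arriving at time 0 is scheduled at time 8, so `reserve` returns 10.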

SLIDE 8

Operation of the Data Structure


SLIDE 11

System under Consideration

Mesh topology
XY routing
Cut-through switching
1 virtual channel

[Figure: 4×4 mesh of tiles, each a core attached to a NoC router]

SLIDE 12

Reservation Models

NoC modelled at the hop level

Each communication link is modelled as a resource
Each packet reserves all the required links
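By way of illustration, the hop-level model paired with XY routing could be sketched as follows (illustrative Python only: each link is simplified to a single 'free from' timestamp rather than a full reservation list, and a one-cycle per-hop header latency is assumed):

```python
def xy_route(src, dst):
    """Directed links visited by dimension-ordered XY routing on a mesh."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:                      # route along X first
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:                      # then along Y
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def send_packet(link_free, src, dst, inject_time, packet_time):
    """Reserve every link on the packet's path; return the delivery time.

    Cut-through: each link is held for the whole packet transmission time,
    and the header advances to the next link after HEADER_LATENCY cycles."""
    HEADER_LATENCY = 1                  # assumed per-hop header delay
    t = inject_time
    end = inject_time + packet_time     # covers the src == dst case
    for link in xy_route(src, dst):
        start = max(t, link_free.get(link, 0))  # wait if link is reserved
        end = start + packet_time
        link_free[link] = end           # link busy until the tail passes
        t = start + HEADER_LATENCY      # header moves on to the next hop
    return end
```

For instance, an uncontended two-hop packet of 4 flits injected at time 0 is delivered at cycle 5, while a second packet contending for the first link has to wait for the first packet's tail.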

SLIDE 13

Reservation Models

NoC modelled at the direction level

Each row and column of the topology is modelled as a resource per direction (positive/negative)
Each packet reserves the required row and column resources

SLIDE 14

Reservation Models

Topology-agnostic model

Network is modelled as a collection of 'communication channels'
Each packet reserves one of these channels at random
A distributed implementation is also considered
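A sketch of the topology-agnostic model (illustrative only; the channel count and base latency are assumed parameters, not values from the slides):

```python
import random

def send_topology_agnostic(channel_free, inject_time, packet_time,
                           base_latency, rng=random):
    """Pick one 'communication channel' at random and reserve it.

    `channel_free[i]` is the time channel i becomes free; latency is a
    fixed base plus any queueing delay on the chosen channel."""
    i = rng.randrange(len(channel_free))
    start = max(inject_time, channel_free[i])   # wait if channel is busy
    channel_free[i] = start + packet_time       # reserve the channel
    return start + packet_time + base_latency   # delivery time
```

With a single channel, two back-to-back 4-flit packets (base latency 2) are delivered at times 6 and 10: the second queues behind the first.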

SLIDE 15

Other Models

Network agnostic models

Fixed model

All network accesses require the same amount of time

No contention model

Latency depends only on distance and packet size

Statistical timing models

Load-dependent estimation

Tracks the load and models latency in a simple way

  • With low loads latency is barely affected
  • With high loads latency is very high

Estimation from off-line simulation

Estimate latency from packet distance and average latency
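The load-dependent estimation described above might be sketched as follows (the 80% threshold and the penalty slope are illustrative assumptions, not the values used in the evaluation):

```python
def estimate_latency(distance, packet_time, load, capacity):
    """Statistical estimate: near-ideal latency below a load threshold,
    steeply increasing latency above it.

    `load` is the number of packets currently tracked in flight and
    `capacity` the load at which the network saturates."""
    base = distance + packet_time              # contention-free latency
    utilisation = load / capacity
    if utilisation < 0.8:                      # low load: barely affected
        return base
    return base * (1 + 10 * (utilisation - 0.8))  # high load: very high
```

This keeps the model cheap (one counter and one formula per packet) while still reacting to congestion, at the cost of the case-specific tuning noted above.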

SLIDE 16

Evaluation

Models implemented as stand-alone tools
Trace-driven evaluation:

PARSEC: Directory-based cache coherency – 32 cores
STAMP: Transactional memory – 32 cores
NAS: Message passing – 64 cores
Cache coherency-like synthetic traffic – 1024 cores

Figures of merit:

Accuracy: simulated time to execute the benchmarks; similarity score metric
Speed: execution time of the models

SLIDE 17

PARSEC – 32 cores

Structured communication patterns
Small messages
Some degree of contention
No long-lasting congestion

[Chart: Normalized Running Speed (5–20×) for blackscholes, bodytrack, ferret, fluidanimate and swaptions; models: simulation, fixed, no contention, load estimation, exponential, direction con, path con, pipes, pipes dist]

[Chart: Similarity Score (500–2500) for the same benchmarks and models]


SLIDE 20

STAMP – 32 cores

Unstructured communication patterns
Possibility of communication hot spots
Small messages
Some degree of contention
No long-lasting congestion

[Chart: Normalized Running Speed (10–30×) for genome, intruder, kmeans and vacation; models: simulation, fixed, no contention, load estimation, exponential, direction con, path con, pipes, pipes dist]

[Chart: Similarity Score (2000–8000) for the same benchmarks and models]


SLIDE 23

Synthetic – 1024 cores

Unstructured communication patterns (random)
Small messages
Some degree of contention
No long-lasting congestion

[Chart: Normalized Running Speed (100–400×) for rnd1–rnd5; models: simulation, fixed, no contention, load estimation, exponential, direction con, path con, pipes, pipes dist]

[Chart: Similarity Score (30000–120000) for the same traces and models]


SLIDE 26

NAS – 64 cores

Structured communication patterns
Long messages
States of high congestion

[Chart: Normalized Running Speed (100–400×) for bt, cg, is, lu, mg and sp; models: simulation, fixed, no contention, load estimation, exponential, direction con, path con, pipes, pipes dist]

[Chart: Simulated Time relative to full simulation (up to 8× slower / 8× faster) for the same benchmarks and models]


SLIDE 29

Conclusions

Novel reservation-based timing models for the NoC

Provide reasonable accuracy at a fraction of the cost of a dedicated NoC simulator

Topology-aware models

Considering every link in the topology as a resource provides good accuracy but slows large-scale simulation
Modelling a whole direction as a single resource is too restrictive
An intermediate approach could be a good solution

Topology-agnostic models

Seem to be reasonable models

Can be used to discriminate communication-intensive implementations

SLIDE 30

Future Work

Implement these models in COTSon

Re-evaluate them in this context

Develop new models for different network configurations based on the reservation data structure

Topologies: rings, tori, butterfly, flattened butterfly
Packet movement: wormhole switching, adaptive routing

SLIDE 32

Other traces' results

Simulated Time

[Chart: Simulated Time relative to full simulation (up to 2.0× slower / 2.0× faster) for all PARSEC, STAMP and synthetic traces]