SNACKNOC: PROCESSING IN THE COMMUNICATION LAYER (PowerPoint PPT Presentation)
slide-1
SLIDE 1

SNACKNOC: PROCESSING IN THE COMMUNICATION LAYER

VLSI and Architecture Lab

Karthik Sangaiah, Michael Lui, Ragh Kuttappa, Baris Taskin, and Mark Hempstead

Feb 25th 2020

slide-2
SLIDE 2

Opportunistic Resources for Graduate Students

"Free leftovers" → Steak dinner

Opportunistically collecting snacks toward a meal.

slide-4
SLIDE 4

Opportunistic Resources in the CMP

Communication interconnect: Intel Skylake 8180 HCC [1]. NoC router: "free leftovers"; opportunistically collecting "snacks" to make a "meal".

What performance gain can we add by opportunistically "snacking" on CMP resources?

[1] Intel Skylake SP HCC, Wikichip.

slide-10
SLIDE 10

Quantifying Design Slack in the NoC

• NoC designed to minimize latency during heavy traffic
• NoC implementation can account for 60% to 75% of the miss latency [2]
• Study of NoC resource utilization on recent NoC designs:
  • 3 best-paper-nominated NoCs have similar performance: DAPPER [3], AxNoC [4], BiNoCHS [5]
  • Reducing resources substantially reduced performance
  • Further details of the study are in our paper
• Opportunities in Network-on-Chip slack: crossbar, network links, and internal buffers

[Figure: NoC router diagram highlighting the crossbar, network links, and internal buffers]

[2] Sanchez et al., ACM TACO, 2010. [3] Raparti et al., IEEE/ACM NOCS, 2018. [4] Ahmed et al., IEEE/ACM NOCS, 2018. [5] Mirhosseini et al., IEEE/ACM NOCS, 2017.

slide-16
SLIDE 16

Quantifying Design Slack in the NoC

Simulated a 16-core CMP with 4 benchmarks representing "low", "medium", "medium-high", and "high" traffic.

• Crossbar utilization:
  • Peak utilization (Graph500): 42%
  • Highest median utilization (Graph500): 13.3%
  • Median utilization, Router 5: 8.6%
• Link utilization:
  • Peak link utilization (Graph500): 18%
  • Highest median link utilization (LULESH): 3.3%
• Buffer utilization:
  • Raytrace: 4% of cycles have localized contention, with 10% utilization during contention
  • For 3M of the 2.4T flits forwarded, buffer utilization reaches 30-55% of total capacity

[Figure: Router 5 crossbar usage (%) over time (10^8 cycles); crossbar and link utilization plots]

The SnackNoC platform improves efficiency and performance of the CMP by offloading data-parallel workloads and "snacking" on network resources.
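The utilization figures above come from per-cycle activity counts binned into sampling windows. As a hedged illustration (our own sketch, not the paper's actual instrumentation), the function below turns a per-cycle busy trace into per-window crossbar utilization percentages:

```python
def crossbar_utilization(trace, window=10_000):
    """Compute per-window crossbar utilization percentages.

    trace  -- iterable of 0/1 flags, one per cycle; 1 means the crossbar
              forwarded a flit that cycle.
    window -- number of cycles per sampling window.
    """
    utilizations, busy = [], 0
    for cycle, used in enumerate(trace, start=1):
        busy += used
        if cycle % window == 0:
            utilizations.append(100.0 * busy / window)
            busy = 0
    return utilizations

# A window where the crossbar forwards flits in 42% of cycles reports 42.0
sample = crossbar_utilization([1] * 420 + [0] * 580, window=1000)
```

Median and peak statistics over the returned windows then give numbers comparable to those reported above.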

slide-17
SLIDE 17

Overview

• "Slack" of the Communication Fabric
• The SnackNoC Platform
• Experimental Results
• Conclusion and Future Considerations

slide-23
SLIDE 23

SnackNoC Platform Overview

• Goals:
  • Opportunistically "snack" on existing network resources for additional performance
  • Limited additional overhead to the uncore
  • Minimal or zero interference with CMP traffic
• Opportunistic NoC-based compute platform:
  • Limited dataflow engine
• Applications: data-parallel workloads used in scientific computing, graph analytics, and machine learning

[Images: Google Cloud TPU [7], Celerity RISC-V SoC [6], Intel Skylake 8180 HCC [1] interconnect, steak dinner]

[1] Intel Skylake SP HCC, Wikichip. [6] S. Davidson et al., IEEE Micro, 2018. [7] Jouppi et al., IEEE/ACM ISCA, 2017.

slide-29
SLIDE 29

SnackNoC System Overview

Added components to a traditional NoC:

• Central Packet Manager (CPM)
  • Assembles and issues instruction packets
  • Manages execution state of kernels
  • Located at a memory controller
• Router Compute Units (RCUs)
  • Lightweight accumulator-based processing element (PE) with instruction buffering and an ALU
  • Located in the router pipeline

Added features to a traditional NoC:

• CPU traffic priority arbitration
• Available NoC buffers as transient data storage

[Figure: 4x4 mesh of NoC routers with a PE per router, four memory controllers, the Central Packet Manager, and instruction/result data flits]

slide-30
SLIDE 30

NoC Router Modification and RCU Additions

• Router Compute Units (RCUs)
  • 32-bit accumulator-based processing element
  • Instruction re-ordering and buffering
• Modifications to input buffer queues, allocators, and crossbar

[Figure: NoC router diagram, marking components added to the baseline and components modified]


slide-34
SLIDE 34

CPU Traffic Priority Arbitration

• The primary function of the NoC is to transfer CPU core and memory traffic
• "Fair" allocators typically select traffic in round-robin order
• Allocators are modified to prioritize CPU traffic over SnackNoC instruction and data traffic
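A minimal sketch of this arbitration policy (our own illustration; the port count and function names are assumptions, not the paper's RTL): CPU flits win the output port round-robin, and a SnackNoC flit is granted only when no CPU flit is competing for it.

```python
NUM_PORTS = 5  # assumed 2D-mesh router: N, S, E, W, local

def allocate_output(requests, rr_ptr):
    """Grant one input port the output port for this cycle.

    requests -- list of (input_port, traffic_class) pairs requesting the
                output; traffic_class is "cpu" or "snack".
    rr_ptr   -- round-robin pointer keeping CPU arbitration fair.
    Returns (granted_port_or_None, new_rr_ptr).
    """
    cpu = [p for p, cls in requests if cls == "cpu"]
    if cpu:
        # CPU/memory traffic always has priority over SnackNoC traffic;
        # pick the CPU requester closest after the round-robin pointer
        grant = min(cpu, key=lambda p: (p - rr_ptr) % NUM_PORTS)
        return grant, (grant + 1) % NUM_PORTS
    snack = [p for p, cls in requests if cls == "snack"]
    if snack:
        # SnackNoC flits only "snack" on an otherwise idle output port
        return snack[0], rr_ptr
    return None, rr_ptr
```

With priority arbitration OFF, the same round-robin rule would simply run over both traffic classes together.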

slide-40
SLIDE 40

Transient Data Storage

• Input buffers typically have low contention
• Available buffers and bandwidth can be used as transient storage
• Useful for keeping intermediate results and read-only values on chip

1. An RCU executes an instruction; the intermediate result is sent to transient storage
2. An RCU waiting on the intermediate value receives it from transient storage
3. The result is returned to memory

[Figure: 4x4 mesh of NoC routers with PEs, illustrating the three steps]
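The three steps above can be sketched as a tiny functional model (an illustration under our own assumptions, not the paper's microarchitecture): idle input-buffer slots act as a keyed store that parks intermediate values for a waiting RCU, falling back to memory when no slot is free.

```python
class TransientStore:
    """Toy model of idle NoC input buffers used as transient storage."""

    def __init__(self, free_slots):
        self.free_slots = free_slots  # idle buffer slots currently available
        self.slots = {}               # tag -> parked intermediate value

    def park(self, tag, value):
        """Step 1: an RCU sends an intermediate result to transient storage.

        Returns False when no slot is free, so the caller falls back to
        writing the value to memory instead.
        """
        if len(self.slots) >= self.free_slots:
            return False
        self.slots[tag] = value
        return True

    def claim(self, tag):
        """Step 2: the waiting RCU receives the value; the slot is freed
        and becomes available again for regular CPU traffic."""
        return self.slots.pop(tag, None)

store = TransientStore(free_slots=4)
store.park("partial-sum", 19)       # step 1: park the intermediate result
value = store.claim("partial-sum")  # step 2: waiting RCU picks it up
# step 3 would return the final result to memory via the CPM
```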

slide-44
SLIDE 44

Running a SnackNoC Kernel

1. CPU core calls the C-code API for matrix multiply
2. CPM sets up the kernel
3. RCUs execute the kernel
4. Result is returned to main memory via the CPM

[Figure: CPU core, Central Packet Manager, main memory, and the 4x4 mesh of NoC routers with PEs]
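The four steps can be mirrored in a small functional model (illustrative only: the class and method names are our own, and the real CPM issues instruction packets over the NoC rather than making Python calls):

```python
class RCU:
    """Toy router compute unit: multiply-accumulates one output row."""

    def execute(self, row, matrix_b):
        # accumulator-style dot products of one row against each column
        return [sum(a * b for a, b in zip(row, col))
                for col in zip(*matrix_b)]

class CentralPacketManager:
    """Toy CPM: sets up a kernel, farms rows out to RCUs, gathers results."""

    def __init__(self, rcus):
        self.rcus = rcus

    def run_matmul(self, a, b):
        # Steps 1-2: CPU submits the kernel; the CPM assembles instruction
        # packets (here, one packet per output row).
        packets = list(enumerate(a))
        # Step 3: RCUs execute the packets across the mesh.
        result = [self.rcus[i % len(self.rcus)].execute(row, b)
                  for i, row in packets]
        # Step 4: result returned to main memory via the CPM.
        return result

cpm = CentralPacketManager([RCU() for _ in range(16)])
c = cpm.run_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # -> [[19, 22], [43, 50]]
```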

slide-49
SLIDE 49

Example of a Reduction Kernel

1. SnackNoC instructions and data are sent to the RCUs
2. Data-dependent instructions are sent to reduce intermediate results
3. Intermediate results are sent to the data-dependent instructions
4. Results are reduced on the way to the corner RCU and returned to the CPM

We repurposed the NoC router crossbar, network links, and internal buffers to compute this kernel.

[Figure: NoC routers with RCUs, the Central Packet Manager, SnackNoC flits, and data-dependent instructions]
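The reduction pattern can be sketched as follows (our own simplification: routing, flit formats, and timing are abstracted away; a 4x4 mesh is assumed, with rows reducing toward the west column and that column reducing toward the corner RCU):

```python
def mesh_reduce(values):
    """Reduce a 4x4 grid of per-RCU partial values to a single sum at the
    corner RCU, mimicking the in-network reduction described above.

    values[y][x] is the partial result held by the RCU at mesh position
    (x, y); the corner RCU sits at (0, 0).
    """
    # Each row reduces westward into its x == 0 RCU...
    row_partials = [sum(row) for row in values]
    # ...then the west column reduces toward the corner RCU, which
    # returns the final result to the CPM.
    return sum(row_partials)

grid = [[x + 4 * y for x in range(4)] for y in range(4)]  # values 0..15
total = mesh_reduce(grid)  # sum of 0..15 == 120
```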

slide-50
SLIDE 50

Overview

• "Slack" of the Communication Fabric
• The SnackNoC Platform
• Experimental Results
• Conclusion and Future Considerations

slide-51
SLIDE 51

Methodology

Experiments:

1. Assess the performance of SnackNoC: how many additional cores' worth of performance can SnackNoC provide opportunistically?
2. Quantify the performance interference of operating SnackNoC on the CPU cores

• Implemented four SnackNoC kernels (SGEMM, Reduction, MAC, SPMV)
• Executed 16 multi-threaded benchmarks from PARSEC3, Splash2X, and FastForward2 to assess performance interference

slide-57
SLIDE 57

Methodology – Quantifying SnackNoC Performance

• SnackNoC is modeled in the gem5 simulation framework
• To quantify performance, four SnackNoC kernels are executed on:
  1. A simulated CMP with the SnackNoC platform (kernels compiled to SnackNoC instructions)
  2. A native Dell server with an Intel Xeon E5-2660 (C++ multi-threaded with OpenMP)

Simulated CMP Parameters:
  Core Count: 16 in-order cores
  Core Frequency: 2 GHz
  L1 I&D Cache: 32 KB, 4-way
  L2 Cache: 256 KB, 4-way
  NoC Topology: 2D 4x4 mesh, 4 memory controllers
  NoC Flit Size: 32 B
  Virtual Channels: 4
  Buffers: 4

SnackNoC Parameters:
  RCU Count: 16 RCUs
  RCU Frequency: 1 GHz
  Flit Priority Arbitration: ON/OFF

Native CPU Parameters:
  Processor: Intel Xeon E5-2660 v3
  Core Frequency: 2.6 GHz
  L1 I&D Cache: 32 KB, 8-way
  L2 Cache: 256 KB, 8-way
  L3 Cache: 20 MB, 20-way

slide-61
SLIDE 61

Quantifying SnackNoC Performance Gain

• SnackNoC kernels are executed on an increasing number of cores to determine SnackNoC's comparable performance
• CMP performance increases roughly linearly with core count, with the exception of SPMV
• SnackNoC's performance gain is equivalent to between 2 and 6 x86 OOO cores

slide-64
SLIDE 64

SnackNoC Area and Power Overhead

• SnackNoC components' RTL implemented and synthesized with Synopsys Design Compiler:
  • 45 nm NCSU technology node
  • Operating frequency: 1 GHz
• Single RCU per NoC router: under 10% additional power and area per router
• Single CPM per NoC: 12.85% additional power and 33.04% additional area per NoC; the largest contributor is the instruction buffer

Router Compute Unit (RCU)               Additional Power (%)   Additional Area (%)
  32-bit Parallel Adder                        1.14                  1.15
  32-bit Parallel Subtractor                   1.14                  1.15
  32-bit Multiply and Accumulate (MAC)         2.05                  1.73
  Ordered Instruction Buffer                   2.05                  2.30
  Dependency Buffer                            2.51                  1.15
  Accumulator Buffer                           0.68                  0.12
  Sub Block List                               0.23                  1.73
  Total                                        9.81                  9.33

Central Packet Manager                  Additional Power (%)   Additional Area (%)
  Assembly Logic and Buffers                   0.08                  2.43
  Kernel State                                 0.16                  0.10
  Instruction Buffer                          10.71                 25.75
  Offload Data Memory Buffer                   0.95                  2.28
  Output Result FIFO                           0.95                  2.28
  Total                                       12.85                 33.04

slide-68
SLIDE 68

SnackNoC's Small Contribution to the Total Uncore

• The full uncore of the 16-core CMP is modeled in 45 nm with CACTI 7.0 and Orion 3.0
• A 16-RCU SnackNoC contributes only 1.6% of uncore power and 1.1% of uncore area
• Satisfies the goal of limited overhead

[Figure: uncore power and area breakdown]

slide-71
SLIDE 71

Methodology – Quantifying SnackNoC Interference

• To quantify performance interference, the performance of the CMP is compared with and without SnackNoC traffic
• Simulated the 16-core CMP with benchmarks from PARSEC3, Splash2X, and FastForward2
• SnackNoC kernels are executed simultaneously

(Simulated CMP and SnackNoC parameters as in the performance methodology.)

slide-76
SLIDE 76

Minimal impact of "Snacking" on CMP performance

Performance impact varies with NoC utilization:

• Peak performance impact on the CMP cores: 1.1%
• Average impact: ~0.30% for SGEMM, MAC, and SPMV; 0.11% for Reduction
• SnackNoC kernel completion time is impacted by at most 3.9% with fair arbitration

slide-81
SLIDE 81

Minimal impact of "Snacking" on CMP performance

SnackNoC traffic added to LULESH: minimal impact on CMP performance.

[Figure: LULESH NoC traffic with and without SnackNoC traffic]

slide-87
SLIDE 87

Further Reducing Impact with Priority Arbitration

Adding priority flit arbitration for CMP traffic:

• Average performance impact drops from 0.25% to 0.17%
• Flit interference is reduced by up to 92%
• Peak performance impact with priority arbitration is 0.83%

Satisfies the goal of limited performance impact.

slide-88
SLIDE 88

Overview

• "Slack" of the Communication Fabric
• The SnackNoC Platform
• Experimental Results
• Conclusion and Future Considerations

slide-90
SLIDE 90

Conclusion and Future Considerations

• Opportunistically "snacking" on NoC resources can add performance to our CMPs
• Added 2 to 6 cores' worth of performance with only a 1.3% increase in uncore area
• Further tradeoffs we're investigating:
  1. Growing application coverage
  2. Scaling compute density
  3. Supporting future topologies

slide-91
SLIDE 91

Questions?

Main contributions:

• Quantified the design slack in the communication fabric
• Opportunistically added 2 to 6 cores' worth of performance to the CMP by repurposing NoC resources with low overhead

Karthik Sangaiah, Michael Lui, Ragh Kuttappa, Baris Taskin [Drexel University], and Mark Hempstead [Tufts University], "SnackNoC: Processing in the Communication Layer", Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2020. http://vlsi.ece.drexel.edu/ & https://sites.tufts.edu/tcal/